SQX systematically crashes the PC after a few minutes when running a workflow

SQX systematically crashes the PC after a few minutes when running the attached workflow. The event I found in the Windows log is attached.



I tried enabling debugging, but with debugging on the crash no longer occurs. I do not know whether the issue is related to the huge number of strategies created (more than 1.5 million/hour in no-debug mode, which drops to 800K/hour in debug mode). When the crashes occur, all physical parameters of the PC are OK (memory, CPU temperature, ...).



Let me know how I can further support debugging the issue.


M

Attachments
Event Log.png
(39.65 KiB)
2 - Standard Forex 88-0-29.cfx
(34.96 KiB)
log_2022_10_15.log
(1.46 MiB)
Build Test 1.cfx
(103.22 KiB)
  • Votes +2
  • Project StrategyQuant X
  • Type Bug
  • Status New
  • Priority Normal

History

MS
#1

mscapin95

15.10.2022 13:19

Task created

b
#2

bentra

15.10.2022 15:57
Which temps did you check? It seems like a hardware issue. You could experiment with using fewer cores. 
MS
#3

mscapin95

15.10.2022 17:51

Attachment Underpower 180W Stress Test.png added

Attachment Underpower 180W Stress Test.png deleted

Attachment Immagine 2022-10-15 162446.png added

Immagine 2022-10-15 162446.png
(174.04 KiB)
I doubt it is a HW issue.

The graph of HW resources during the run is attached: the indicators are pretty normal.

I have already underpowered the CPU to keep the temperature stable, and performed a 24h HW stress test using XTU at 100% CPU with all cores in use, without any problem.


Moreover, I checked the XTU logs from just before the PC crashed: the HW utilization was very low:


  • 1 Core used
  • CPU Util: 3%
  • Mem: 22GB
  • Core Power 38W
  • No thermal throttling
  • CPU Temp 51C


In any case, I will try lowering the number of cores in SQX to see what happens.

b
#4

bentra

15.10.2022 18:13
Underpowering the CPU can lead to instability. You probably know that and have probably done some stability testing, but I've found that SQX can be its own stability tester. You could still try bumping up the voltage a little to see if it helps stability, as it did in my case.


Did you test the memory?


MS
#5

mscapin95

15.10.2022 19:39
Yes, I tested the memory, too.


Actually, SQX crashed the PC several times BEFORE I underpowered it.


In fact, I thought that the high Core temps I had when I ran SQX could be the root cause. This is why I underpowered the CPU.


In any case, the underpowered config is very stable, both under stress test and with other apps.

b
#6

bentra

15.10.2022 23:45
Voted for this task.
MS
#7

mscapin95

16.10.2022 21:33
Voted for this task.
MS
#8

mscapin95

17.10.2022 18:50
I limited the number of CPU cores in the config and restarted the app (*), but it crashed again.


The problem doesn't occur if the number of strategies/hour is "normal". For example, I've been running the workflow with M30 instead of H4 for the last 4 hours at 400K strategies/hour and everything is going fine.



(*) By the way, I am not sure the config is really applied. This is what I find in the logs:


16:37:18.127 [Thread-91632] INFO  c.s.p.Task.impl.Retest.RetestTask - Batch size computed to: 5, SingleOptims: false, str to test: 7, totalCores: 24, Backtest mode: 1
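
One way to check this across a whole log file is to pull the `totalCores` field out of every line that reports it. The following is only a minimal sketch: the function names are hypothetical, and it assumes (this thread does not confirm it) that `totalCores` reflects the core limit SQX is actually using rather than the number of cores detected on the machine.

```python
import re

# Sample line copied verbatim from the log excerpt above.
LINE = (
    "16:37:18.127 [Thread-91632] INFO  c.s.p.Task.impl.Retest.RetestTask - "
    "Batch size computed to: 5, SingleOptims: false, str to test: 7, "
    "totalCores: 24, Backtest mode: 1"
)


def total_cores(line):
    """Return the totalCores value reported in a log line, or None if absent."""
    match = re.search(r"totalCores:\s*(\d+)", line)
    return int(match.group(1)) if match else None


def core_counts(lines):
    """Collect the distinct totalCores values seen across many log lines."""
    return {n for n in map(total_cores, lines) if n is not None}


print(total_cores(LINE))  # 24
```

Passing an open log file to `core_counts` returns every distinct value reported. If the only value is 24 even after lowering the limit, either the setting was not applied or the field means something else.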

b
#9

bentra

18.10.2022 00:00
"Actually, SQX crashed the PC several times BEFORE I underpowered it."


I see, but had you already done an OC? I'd try a default non-OC setup.

It's just so unlikely that SQX can crash Windows. I read all the bug reports, thousands over the last few years, and have not seen this before except for maybe 2 hardware issues. It did happen to me once, though, when I had an unstable OC. I had to reduce the clock speed AND increase the voltage to get it stable: even though it passed some other stability tests fine, SQX still crashed. The fact that you need to hit a certain number of strats per hour to crash lends credence to the unstable OC hypothesis. So yeah, I'd go for lower clock speeds AND an increase in voltage, obviously minding temps.

MS
#10

mscapin95

18.10.2022 09:17
The problem occurred with the standard non-OC config.


In fact, at first I had exactly the same thoughts as you: "the HW utilization must be too high and the temps not well controlled". This is why I started looking for a safer config.


Actually, the custom config is not overclocked. I just set the Core Package setting to get a more stable temperature. This causes the CPU to run at a slightly lower clock speed than the standard config (4.6 GHz instead of 4.8 GHz).

This config has been checked with a 24h stress test and a 2h memory test without any issue. Temps were kept well under control (80C vs 100C with the unlimited config).


Very tricky. 


Anyhow, if the issue does not happen under "normal" strategy generation, it is not that serious. My only concern is that there could be some "hidden" issue in SQX when using very high-performance HW and/or a high number of parallel threads, which could become more serious in the future as powerful HW becomes more widely available at low prices.

MS
#11

mscapin95

18.10.2022 13:17

Attachment Graph 1.png added

Attachment Graph 2.png added

Graph 2.png
(709.84 KiB)
Graph 1.png
(341.80 KiB)
Interesting update. 


I ran a lower-throughput cycle (M30, approx. 400K/hour) for approx. 12h without any issue.


I then reset the number of cores to the original value (1 for GUI, 23 for computing) but changed the memory configuration slightly, like this:


  • 26GB fixed
  • Let Java decide

I then restarted the high-throughput cycle (H4, approx. 1.7M/hour) to re-create the issue. I noticed that CPU utilization and parameters are lower compared to the low-throughput scenario (left part of Graph 1), while memory utilization is significantly higher (right part of Graph 1).


The cycle has been running normally for the last 3 hours without any crash, and memory seems well under control (Graph 2). This has never happened before with the standard memory config.


I am therefore wondering whether the issue could be related to memory allocation under heavy-throughput conditions.


I will leave the cycle running for some hours and let you know what happens.

MS
#12

mscapin95

18.10.2022 15:10

Attachment log_2022_10_18.log added

Attachment Graph 3.png added

Attachment Graph 4.png added

Graph 3.png
(290.91 KiB)
log_2022_10_18.log
(1.19 MiB)
Graph 4.png
(351.04 KiB)
The PC finally crashed after running the high-performance workflow for 4 hours at an average of 1.71 million strategies/hour.


My personal record of over 1.8M strategies generated/hour and 140K accepted/hour was reached (Graph 3).


All PC resources seemed OK throughout the test and also just before the crash, including memory, which reached the maximum of 32GB during one of the cycles without any issue (Graph 4).


The following error, which I haven't seen before, appeared in the log (attached) several times, including a few seconds before the crash:


ERROR c.s.p.S.impl.Project.ProjectServlet - UI error: DatabankCtrl for Build & Test 1/Backup is overwhelmed, old updates were not processed yet and new updates are coming
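
To see how often this error occurs and which databanks it affects, the log can be scanned for the message. This is only a rough sketch: the function name is hypothetical, and the pattern is derived solely from the single error line quoted above, so other variants of the message would be missed.

```python
import re
from collections import Counter

# Pattern built from the one error line quoted above; the captured group is
# the databank path, e.g. "Build & Test 1/Backup".
PATTERN = re.compile(r"UI error: DatabankCtrl for (.+?) is overwhelmed")


def overwhelmed_databanks(lines):
    """Count 'overwhelmed' UI errors per databank across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two lines quoted in this thread: the error itself and an unrelated INFO line.
SAMPLE = [
    "ERROR c.s.p.S.impl.Project.ProjectServlet - UI error: DatabankCtrl for "
    "Build & Test 1/Backup is overwhelmed, old updates were not processed yet "
    "and new updates are coming",
    "16:37:18.127 [Thread-91632] INFO  c.s.p.Task.impl.Retest.RetestTask - "
    "Batch size computed to: 5, SingleOptims: false, str to test: 7, "
    "totalCores: 24, Backtest mode: 1",
]

print(overwhelmed_databanks(SAMPLE))
```

Passing an open file object for log_2022_10_18.log instead of `SAMPLE` gives the per-databank totals for the whole run.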


The crash corrupts the workflow, and all strategies, including those already saved, are lost.


I hope this helps with debugging.



