Many errors in new version of application

Author	Message
ElleSolomina Send message Joined: 22 Mar 14 Posts: 43 Credit: 527,549,678 RAC: 2,044 Level Scientific publications	Message 53276 - Posted: 3 Dec 2019 \| 20:53:01 UTC
	One of my hosts cannot process any job without errors, and I do not understand why. https://www.gpugrid.net/results.php?hostid=170784. For now I have stopped fetching new WUs.

Jim1348 Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level Scientific publications	Message 53277 - Posted: 3 Dec 2019 \| 21:06:41 UTC - in response to Message 53276.
	The short (10 second) errors are not your fault, but a problem with the work units. However, the long runs that fail after 54 hours probably mean that your GTX 650 is overclocked or too hot. I would just use the GTX 1060 anyway. In my opinion, your GTX 650 will use more electricity than its output is worth.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 53278 - Posted: 3 Dec 2019 \| 22:47:11 UTC - in response to Message 53277.
	The errors occur at the end of the run. I'm inclined to think that the problem is with the software, permissions, or something like that.

ElleSolomina Send message Joined: 22 Mar 14 Posts: 43 Credit: 527,549,678 RAC: 2,044 Level Scientific publications	Message 53341 - Posted: 14 Dec 2019 \| 7:20:54 UTC - in response to Message 53278.
	Toni +1, I think so too, because older versions of the WUs worked well.

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0 Level Scientific publications	Message 53342 - Posted: 14 Dec 2019 \| 9:35:03 UTC - in response to Message 53341.
	Reinstall BOINC.

archeye Send message Joined: 10 May 13 Posts: 10 Credit: 6,490,450 RAC: 0 Level Scientific publications	Message 53426 - Posted: 1 Jan 2020 \| 15:41:21 UTC Last modified: 1 Jan 2020 \| 15:41:39 UTC
	I had a similar issue today: the task failed right at the end of the run with:
	Name: initial_1730-ELISA_GSN4V1-41-100-RND8024_0
	Exit status: 195 (0xc3) EXIT_CHILD_FAILED
	Run time: 103,023.27
	CPU time: 22,582.33
	http://www.gpugrid.net/result.php?resultid=21587058
	I was hoping for some advice on how to check whether my PC and GPU(s) are working correctly. If it were failing at the start of a run I would perhaps be less interested, but I am still concerned for my hardware, and all this computing effort is effectively wasted, which is a shame. I have now suspended all tasks so I can shut down the computer and clean out the filters. I will see what happens with the next 2 GPUGRID tasks.
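One possible way to check the cards from a script is NVIDIA's `nvidia-smi` utility, which is installed with the driver and can report temperature and clocks per GPU. The sketch below is only an illustration: the function names and the 80 °C warning limit are assumptions made for this example, not GPUGRID recommendations, and it assumes every queried field returns a number (some cards report `[N/A]` for individual fields).

```python
import csv
import io
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`
# (the full list is printed by `nvidia-smi --help-query-gpu`).
QUERY_FIELDS = ["index", "name", "temperature.gpu", "clocks.sm", "clocks.mem"]


def read_gpu_status():
    """Run nvidia-smi once and return its raw CSV output (one line per GPU)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_status(csv_text):
    """Turn the CSV text into a list of dicts keyed by QUERY_FIELDS."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [dict(zip(QUERY_FIELDS, row)) for row in rows if row]


def hot_gpus(status, limit_c=80):
    """Return the GPUs at or above limit_c degrees Celsius (hypothetical limit)."""
    return [gpu for gpu in status if int(gpu["temperature.gpu"]) >= limit_c]
```

Calling `hot_gpus(parse_gpu_status(read_gpu_status()))` every minute or so while a task runs would show whether one card gets noticeably hotter than the other before a failure.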

archeye Send message Joined: 10 May 13 Posts: 10 Credit: 6,490,450 RAC: 0 Level Scientific publications	Message 53427 - Posted: 1 Jan 2020 \| 17:44:11 UTC - in response to Message 53426. Last modified: 1 Jan 2020 \| 18:24:47 UTC
	So I cleaned and washed out my PC fan filter; the fan runs slower now, so that's good. I just had 2 more tasks that were running together fail with the same error as previously posted. This time I had just started up my MMO game, Elder Scrolls Online, when the screen went black and the fan speed dropped right down, so I killed the game, checked my GPU tasks and saw that both had failed. I have been playing this game and running GPU tasks together for the past week with no problem. I will not run any GPU tasks for a while and will see if I get any strange PC behaviour before attempting to run any more.
	Edit: Just as an afterthought, where a GPU task has already achieved one or more checkpoints and then there is a failure of some kind, rather than just exiting, why can't it just reload from the last checkpoint? That makes more sense to me, and it would also minimise the lost computing effort of the volunteer crunchers :)
	Edit2: The last checkpoint must be preserved, just as it is when you switch off your computer while tasks are between checkpoints: that saved point is used as the starting point the next time you run BOINC Manager.

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 851 Level Scientific publications	Message 53432 - Posted: 2 Jan 2020 \| 8:00:46 UTC - in response to Message 53427.
	"Just as an afterthought, where a GPU task has already achieved one or more checkpoints and then there is a failure of some kind. Rather than just exiting whey cant it just reload from the last checkpoint?"
	What would make it run smoothly for the 2nd (3rd...1000th) attempt without user intervention?
	"It sort of makes more sense to me and this would also minimise the lost computing effort of the volunteer crunchers :)"
	The tasks on such hosts would never finish this way; they would keep trying to run until their deadline, which would slow down the processing of the given chain of workunits too much. Failed workunits serve as a form of self-protection for the project, and they also serve as a warning sign for the user.
	Does your host have two GTX 980s? Are those cards in SLI mode? (That could be the problem.) The upper card tends to run hotter in this setup, so you should check the temperatures of your cards with MSI Afterburner (it works with other manufacturers' cards too, or you can use a similar tool provided by the manufacturer of your card). You can also use this tool to lower the temperatures of your cards by:
	1. lowering its power target (=lowering GPU clock frequency and GPU voltage)
	2. lowering its memory clock frequency
	3. increasing its fan speed
	These problems usually arise from high clock speeds combined with high temperatures.
	The previous version of the GPUGrid app didn't tolerate frequent suspension well; perhaps this is the case for the new app too.
	If you have two different Nvidia cards in the same system, you should make sure that suspended GPUGrid tasks restart on the same card on which they were processed previously. (The new app can't restart suspended tasks on a different card.) The easiest solution is to exclude the lesser card from GPUGrid by creating / editing cc_config.xml (see the exclude_GPU section here).
	The new app utilizes the GPU much more than the previous version, so you may have to re-calibrate (lower) your overclock settings.
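As a rough illustration of the cc_config.xml suggestion, the sketch below generates a file with a single `<exclude_gpu>` entry, which is the BOINC client option for keeping one project off one device. The helper name `build_cc_config`, the project URL string and the device number 1 are assumptions for this example: the URL has to match the one the client lists for the project, and the device number is the one the client prints in its startup log (for example "NVIDIA GPU 1").

```python
from xml.sax.saxutils import escape

# Layout of a cc_config.xml with one <exclude_gpu> entry.
TEMPLATE = """<cc_config>
  <options>
    <exclude_gpu>
      <url>{url}</url>
      <device_num>{device_num}</device_num>
    </exclude_gpu>
  </options>
</cc_config>
"""


def build_cc_config(project_url, device_num):
    """Return cc_config.xml text that keeps one project off one GPU."""
    return TEMPLATE.format(url=escape(project_url), device_num=int(device_num))


# Assumed values: GPUGRID's URL as attached in the client, and GPU 1 excluded.
print(build_cc_config("https://www.gpugrid.net/", 1))
```

The resulting text would be saved as cc_config.xml in the BOINC data directory; the client should pick it up after a restart or after re-reading its config files from BOINC Manager.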

archeye Send message Joined: 10 May 13 Posts: 10 Credit: 6,490,450 RAC: 0 Level Scientific publications	Message 53433 - Posted: 2 Jan 2020 \| 17:13:28 UTC - in response to Message 53432.
	Thanks for the detailed reply; it helps to know we are well supported. As for your avatar, it seems you are looking to the universe for answers, but I would imagine the answers more likely originate inside yourself. Anyway,
	1. Hardware
	Operating System: Windows 10 Pro 64-bit
	CPU: Intel Core i7 @ 4.00GHz, Haswell 22nm Technology
	RAM: 32.0GB
	Motherboard: MSI Z97-G45 GAMING (MS-7821) (SOCKET 0) %1 Chipset
	Graphics:
		ROG PG278Q (2560x1440@144Hz)
		SAMSUNG (1680x1050@59Hz)
		4095MB NVIDIA GeForce GTX 980 (NVIDIA) 37 °C
		4095MB NVIDIA GeForce GTX 980 (NVIDIA) 26 °C
		ForceWare version: 441.20
		SLI Enabled
	Storage:
		447GB Crucial_CT480M500SSD1 (SATA (SSD))
		931GB TOSHIBA DT01ACA100 (SATA ) 27 °C
		8GB Microsoft Virtual Disk (File-backed Virtual (SSD))
	Optical Drives: HL-DT-ST DVDRAM GH24NSC0
	Audio: NVIDIA High Definition Audio
	2. Info from Boinc Manager, PC
	01/01/2020 18:54:38 cc_config.xml not found - using defaults
	01/01/2020 18:54:38 Starting BOINC client version 7.14.2 for windows_x86_64
	01/01/2020 18:54:38 log flags: file_xfer, sched_ops, task
	01/01/2020 18:54:38 Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
	01/01/2020 18:54:38 Data directory: C:\ProgramData\BOINC
	01/01/2020 18:54:38 Running under account Chris
	01/01/2020 18:54:40 CUDA: NVIDIA GPU 0: GeForce GTX 980 (driver version 441.20, CUDA version 10.2, compute capability 5.2, 4096MB, 3292MB available, 4979 GFLOPS peak)
	01/01/2020 18:54:40 CUDA: NVIDIA GPU 1: GeForce GTX 980 (driver version 441.20, CUDA version 10.2, compute capability 5.2, 4096MB, 3292MB available, 4979 GFLOPS peak)
	01/01/2020 18:54:40 OpenCL: NVIDIA GPU 0: GeForce GTX 980 (driver version 441.20, device version OpenCL 1.2 CUDA, 4096MB, 3292MB available, 4979 GFLOPS peak)
	01/01/2020 18:54:40 OpenCL: NVIDIA GPU 1: GeForce GTX 980 (driver version 441.20, device version OpenCL 1.2 CUDA, 4096MB, 3292MB available, 4979 GFLOPS peak)
	01/01/2020 18:54:40 Host name: PC
	01/01/2020 18:54:40 Processor: 8 GenuineIntel Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz [Family 6 Model 60 Stepping 3]
	01/01/2020 18:54:40 Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrand syscall nx lm avx avx2 tm2 pbe fsgsbase bmi1 smep bmi2
	01/01/2020 18:54:40 OS: Microsoft Windows 10: Professional x64 Edition, (10.00.18362.00)
	01/01/2020 18:54:40 Memory: 31.95 GB physical, 36.70 GB virtual
	01/01/2020 18:54:40 Disk: 366.27 GB total, 121.16 GB free
	01/01/2020 18:54:40 Local time is UTC +1 hours
	01/01/2020 18:54:40 No WSL found.
	01/01/2020 18:54:40 VirtualBox version: 5.2.8
	3. "What would make it run smoothly for the 2nd (3rd...1000th) attempt without user intervention?"
	Well, I agree if it were left unchecked, but a one-time failure may be a glitch associated with other computing activity. Exiting only after 3 consecutive failures seems sensible to me. However, you are free to run your project as best suits your needs, and as a volunteer helper I guess I support your decisions too.
	4. Afterburner
	I do use this and have tweaked the fan speed curve so it cuts in at a higher rate at lower temps. There is no overclocking. I have not looked into
	1. lowering its power target (=lowering GPU clock frequency and GPU voltage)
	2. lowering its memory clock frequency
	I just left those alone, as I really don't have enough knowledge to know the impact of any changes there.
	Suspending gpugrid tasks: I use the EfMer Boinc tasks app and the EfMer TThrottle. I set my CPU temp limit to 70 deg C and the GPU(s) to 65 deg C. Sometimes I just need to turn off the computer, but if I think a bit ahead I will use the setting in the EfMer Boinc tasks app to "suspend at checkpoint".
	I do use SLI mode; maybe an alternative is to just allow the GPU task to run on the second GPU, which is mostly providing PhysX.
	Apparently I can configure the settings in the EfMer Boinc tasks app config file to disable my GPU1 for BOINC task use. That's enough for now I think :)
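The "3 consecutive failures, then exit" idea from the post above can be sketched as a small retry loop. This is not how the GPUGRID app or the BOINC client behaves; it is a hypothetical illustration in which `run_from` stands in for "run the task from this checkpoint until the next checkpoint or a failure", and both names and the default limit of 3 are assumptions. Bounding the retries addresses the concern that an unlimited restart loop would hold a workunit until its deadline.

```python
def run_with_checkpoint_retries(run_from, max_consecutive_failures=3):
    """Resume from the last checkpoint until done or too many failures in a row.

    run_from(checkpoint) is a hypothetical callable that returns
    (new_checkpoint, finished) or raises RuntimeError when the run fails.
    """
    checkpoint = None
    failures = 0
    while failures < max_consecutive_failures:
        try:
            checkpoint, finished = run_from(checkpoint)
        except RuntimeError:
            failures += 1  # retry from the checkpoint saved before the failure
            continue
        failures = 0       # progress was made, so the failure streak is reset
        if finished:
            return True
    return False           # give up so the task can be reported as failed
```

A single glitch costs only the work done since the last checkpoint, while a host that fails every attempt still gives up after three tries and returns the task as an error.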

Keith Myers Send message Joined: 13 Dec 17 Posts: 1288 Credit: 5,135,281,959 RAC: 9,906,778 Level Scientific publications	Message 53434 - Posted: 2 Jan 2020 \| 18:30:54 UTC
	Since you have two cards of the same type, you don't need to worry about a paused/suspended task restarting on the other card and erroring out. I would still be concerned about the cards being in an SLI configuration. Anecdotal evidence from all the projects I crunch for says that computing on cards with an SLI connection is problematic. The Nvidia driver does too much under the covers to gang the hardware on both cards together, and that prevents proper calculations on both cards. If you need to keep the SLI configuration for gaming, I would restrict computation to only one card by excluding a device.

archeye Send message Joined: 10 May 13 Posts: 10 Credit: 6,490,450 RAC: 0 Level Scientific publications	Message 53435 - Posted: 4 Jan 2020 \| 10:53:18 UTC - in response to Message 53434. Last modified: 4 Jan 2020 \| 10:53:43 UTC
	Thanks for the advice, Keith. I will surely try that and allow only 1 GPU card for processing. After that I will also experiment with disabling SLI while running both cards.

Keith Myers Send message Joined: 13 Dec 17 Posts: 1288 Credit: 5,135,281,959 RAC: 9,906,778 Level Scientific publications	Message 53436 - Posted: 4 Jan 2020 \| 18:41:46 UTC - in response to Message 53435.
	It has been a long while since I ran Windows, but I seem to remember a setting in the Nvidia control panel that toggles SLI on and off. It may require a host restart, but that might be a solution. The physical SLI connector is not the problem; it can be present and has no effect if the software enabling SLI is not turned on. You could enable SLI for gaming and then, when finished, reboot the host with SLI disabled for crunching with both cards.

JochenZ Send message Joined: 15 Aug 09 Posts: 2 Credit: 365,499,742 RAC: 465,310 Level Scientific publications	Message 53442 - Posted: 9 Jan 2020 \| 22:32:31 UTC - in response to Message 53426.
	Hello, nearly 100% of the tasks also failed with exit code 195 on my ASUS GTX 1070TI, which I had manually overclocked: some tasks after minutes and some after hours of calculating. After reducing the speed to the normal overclocking mode, all 7 tasks I received finished correctly. So I think overclocking too high is the reason for the failures in the new version of the ACEMD tasks. The short and long run tasks never had failures before. In other projects like Einstein or Milkyway I also got no failures with the higher manual overclock, only approx. 8% longer crunching time with normal overclocking. But the worst thing is that I get no new tasks from GPUGRID. Even when I trigger a request manually, no tasks are available, so I can't do any further tests. Rampf, mampf my computer wants to crunch ;-) (Denglish rhyme)
