
Message boards : Number crunching : App restarts after being suspended and restarted.

Author Message
jm7
Send message
Joined: 2 Aug 22
Posts: 1
Credit: 6,825,500
RAC: 111,198
Message 59666 - Posted: 29 Dec 2022 | 1:58:27 UTC

I suspended a Python Apps for GPU hosts 4.04 (cuda1131) task for a few minutes to allow some other tasks to finish, and the time counter started at 0 again. It was at 3 days and a couple of hours. You really need to checkpoint more often than once every 3 days.

It appears to be stuck at 4% for over a day now.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,151,631,959
RAC: 10,150,886
Message 59667 - Posted: 29 Dec 2022 | 4:38:56 UTC - in response to Message 59666.

I suspended a Python Apps for GPU hosts 4.04 (cuda1131) task for a few minutes to allow some other tasks to finish, and the time counter started at 0 again. It was at 3 days and a couple of hours. You really need to checkpoint more often than once every 3 days.

It appears to be stuck at 4% for over a day now.

The tasks do in fact checkpoint. Depending on the speed of the system, it takes a few minutes to replay computations back to the last checkpoint.

Upon restart the task will display a low percentage and then jump forward to the last checkpoint percentage. You can check when the last checkpoint was written by viewing the task properties in the Manager sidebar.
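To illustrate the behavior described above (not the actual GPUGRID code): a task that periodically writes its progress to a checkpoint file, such as the `progress_last_chk` file mentioned in the log below, can resume from that file after a restart, which is why the displayed progress starts low and then jumps forward. The file name and logic here are a hypothetical sketch.

```python
# Minimal checkpoint/restore sketch. On restart, progress resumes
# from the last saved value instead of starting over from zero.
import os
import tempfile

def save_checkpoint(path, steps_done):
    # Write to a temp file, then rename atomically, so being killed
    # mid-write cannot leave a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(steps_done))
    os.replace(tmp, path)

def load_checkpoint(path):
    # If a checkpoint exists, resume from it; otherwise start at 0.
    if os.path.exists(path):
        with open(path) as f:
            return int(f.read())
    return 0

chk = os.path.join(tempfile.gettempdir(), "progress_last_chk")
save_checkpoint(chk, 12345)
print(load_checkpoint(chk))  # resumes at 12345, not 0
```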

On Windows hosts I have heard that stopping a task midstream and restarting it can often hang the task. If that happens, you will see this verbiage repeating over and over

Starting!!
Define rollouts storage
Define scheme
Created CWorker with worker_index 0
Created GWorker with worker_index 0
Created UWorker with worker_index 0
Created training scheme.
Define learner
Created Learner.
Look for a progress_last_chk file - if exists, adjust target_env_steps
Define train loop
11:06:52 (6450): wrapper (7.7.26016): starting
11:06:54 (6450): wrapper (7.7.26016): starting
11:06:54 (6450): wrapper: running bin/python (run.py)

for every restart in the stderr.txt file in the slot that the running task occupies. If so, the task is likely hung, and you can either restart the host and BOINC to see if you can persuade it back into running, or abort it, get another task, and try not to interrupt that one.
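One quick way to check for the restart loop described above is to count how many times the startup banner appears in the slot's stderr.txt. This is a hypothetical helper, not part of BOINC or GPUGRID; the banner string is taken from the log excerpt above, and the file path on your machine will depend on your BOINC data directory and slot number.

```python
# Count restart banners in a task's stderr.txt. Multiple "Starting!!"
# lines mean the task has been restarted repeatedly and may be hung.
def count_restarts(log_text, banner="Starting!!"):
    return sum(1 for line in log_text.splitlines() if banner in line)

# Example with a synthetic log; in practice you would read the file, e.g.
#   log_text = open("/var/lib/boinc/slots/0/stderr.txt").read()
sample = (
    "Starting!!\n"
    "Define rollouts storage\n"
    "Starting!!\n"
    "Define rollouts storage\n"
)
print(count_restarts(sample))  # 2 -> the task restarted at least twice
```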

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Message 59669 - Posted: 29 Dec 2022 | 12:27:02 UTC - in response to Message 59667.

It also writes logs to wrapper_run.out

jjch
Send message
Joined: 10 Nov 13
Posts: 98
Credit: 15,305,025,388
RAC: 1,498,572
Message 59674 - Posted: 3 Jan 2023 | 4:59:28 UTC - in response to Message 59669.

jm7 - It's not the checkpoints that are the problem. Your tasks are failing with multiple different errors. Your system needs a GPU driver update and a few tune-up adjustments.

Regarding the last task 33220676

http://www.gpugrid.net/result.php?resultid=33220676

This task has failed with: RuntimeError: Unable to find a valid cuDNN algorithm to run convolution. That error is somewhat inconclusive, but it seems to be related to a CUDA error. Your GPU driver is version 512.78 and the latest is version 527.56. Download the driver here: https://www.nvidia.com/download/driverResults.aspx/197460/en-us/

I would recommend fully uninstalling the driver using DDU before you reinstall it. DDU can be found here: https://www.guru3d.com/files-details/display-driver-uninstaller-download.html It's a bit convoluted to find the download, but keep drilling down for it.

Regarding the previous task 33221790

http://www.gpugrid.net/result.php?resultid=33221790

This one failed with: RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 3612672 bytes.

It's telling you that the task ran out of memory: the failed allocation itself was only about 3.5 MB, so system memory was essentially exhausted by the time it hit that point. There are two things that can cause this. One is the physical memory installed in the system. Your laptop has 16 GB, which is on the low side; if you have the opportunity, I would recommend installing more memory.
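The allocation size in the error message is easy to sanity-check: 3,612,672 bytes works out to only a few megabytes, which tells you memory was already exhausted rather than that the task needed gigabytes more at that moment.

```python
# Convert the failed allocation size from the PyTorch error into MB.
failed_alloc_bytes = 3_612_672
mb = failed_alloc_bytes / 2**20  # 1 MB = 1,048,576 bytes
print(f"{mb:.2f} MB")  # ~3.45 MB, a tiny allocation
```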

Other problems can be caused by the default BOINC settings. Check the Disk and memory tab under Options > Computing preferences. Make sure you are not restricting the disk space: you can set a limit if you have to, but leaving it unrestricted will allow BOINC to use all the available space. This also requires that the disk holding your BOINC data directory has enough free space. Make sure the memory percentages are high enough for the GPUGRID Python tasks; you can bump them up a bit if needed. Also, set the page/swap file limit to 100%.
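If you want to confirm the disk holding the BOINC data directory isn't the bottleneck, the free space is easy to check from Python's standard library. The paths in the comment are common defaults, not guaranteed locations on your system.

```python
# Report free space on the drive holding a given directory.
import shutil

def free_gb(path):
    # shutil.disk_usage returns (total, used, free) in bytes.
    return shutil.disk_usage(path).free / 2**30

# "." stands in for your BOINC data directory, e.g.
# C:\ProgramData\BOINC on Windows or /var/lib/boinc on Linux.
print(f"{free_gb('.'):.1f} GB free")
```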

As for your first task 33222758 that failed

http://www.gpugrid.net/result.php?resultid=33222758

It ran out of paging/swap space first: ImportError: DLL load failed while importing algos: The paging file is too small for this operation to complete. Then you restarted it and it ran out of memory: RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:81] data. DefaultCPUAllocator: not enough memory: you tried to allocate 6422528 bytes.

You will need to increase your paging file; it should be set to at least 50 GB. You can monitor it afterward and see if that is enough. Not quite sure if you can find it the same way in Windows 11, but on Win 10 it's under Settings > System > About > Advanced system settings > Advanced tab > Performance - Settings... button > Advanced tab > Virtual memory - Change... button. Remove the check for Automatically.... Select the Custom size radio button. The Initial size doesn't matter too much; you can set it to whatever is currently allocated. The main thing is to set Maximum size to at least 51200. It can be more, but you shouldn't need more than 60 GB. Hit the Set button and back out with OK ... etc.
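For reference, the 51200 figure is just 50 GB expressed in MB, since the Windows virtual-memory dialog takes sizes in MB:

```python
# Windows page-file sizes are entered in MB; 1 GB = 1024 MB.
target_gb = 50
target_mb = target_gb * 1024
print(target_mb)  # 51200
```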

A couple more pointers - Try not to stop and restart the GPUGRID tasks too much. They should replay from the last checkpoint, and that shouldn't take more than a few minutes. Normally they complete in a bit more than 24 hours total; if a task is staying at 4% or running for more than a couple of days, it is stalled and should probably be aborted. You can check the stderr file in the slot the task is running in. Also, don't run a lot of other programs on your PC at the same time: GPUGRID needs a considerable amount of CPU, memory, and swap space, and if it is competing with other applications it may run short. GLHF

JJCH
