Disk usage limit exceeded

Zalster Joined: 26 Feb 14 Posts: 211 Credit: 4,496,324,562 RAC: 0	Message 48021 - Posted: 22 Oct 2017 | 0:43:41 UTC
	So I'm starting to see these errors on task e4s3_e1s13p0f400-ADRIA_FOLDUBQ80_crystal_ss_contacts_50_ubiquitin_1-0-1-RND1448_0:
	<core_client_version>7.6.22</core_client_version>
	<![CDATA[
	<message>
	Disk usage limit exceeded
	</message>
	# Access violation : progress made, try to restart
	called boinc_finish
	Anyone else?
	ID: 48021 | Rating: 0

Speedy Joined: 19 Aug 07 Posts: 42 Credit: 28,391,082 RAC: 0	Message 48026 - Posted: 23 Oct 2017 | 1:27:26 UTC
	Name: e2s4_e1s43p0f362-ADRIA_FOLDUBQ80_crystal_ss_contacts_50_ubiquitin_0-0-1-RND8662_3
	Application version: Short runs (2-3 hours on fastest card) v9.18 (cuda80)
	<core_client_version>7.6.33</core_client_version>
	<![CDATA[
	<message>
	Disk usage limit exceeded
	</message>
	<stderr_txt>
	Exit status: 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED
	Entry from the BOINC Manager log:
	23-Oct-17 1:43:58 PM | GPUGRID | Aborting task e2s4_e1s43p0f362-ADRIA_FOLDUBQ80_crystal_ss_contacts_50_ubiquitin_0-0-1-RND8662_3: exceeded disk limit: 286.70MB > 286.10MB
	Run time: 52,759.76
	CPU time: 9,484.81
	ID: 48026 | Rating: 0
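
The abort line in the log above carries both the measured usage and the limit, so the overshoot can be read off mechanically. A minimal Python sketch, assuming the message wording is exactly as quoted in this post (the regular expression and the `parse_disk_abort` helper are illustrative names, not part of BOINC):

```python
import re

# Log line as quoted in the post above (BOINC Manager event log).
LOG_LINE = (
    "23-Oct-17 1:43:58 PM | GPUGRID | Aborting task "
    "e2s4_e1s43p0f362-ADRIA_FOLDUBQ80_crystal_ss_contacts_50_ubiquitin_0-0-1-RND8662_3: "
    "exceeded disk limit: 286.70MB > 286.10MB"
)

# Pattern inferred from this single line; other client versions
# may word the message differently.
ABORT_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)


def parse_disk_abort(line):
    """Return task name, usage, limit and overshoot (in MB), or None if no match."""
    match = ABORT_RE.search(line)
    if match is None:
        return None
    used = float(match.group("used"))
    limit = float(match.group("limit"))
    return {
        "task": match.group("task"),
        "used_mb": used,
        "limit_mb": limit,
        "over_mb": round(used - limit, 2),
    }


result = parse_disk_abort(LOG_LINE)
print(result)
```

For this task the overshoot works out to about 0.6 MB, so the abort came just as the limit was crossed, after many hours of run time.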

Richard Haselgrove Joined: 11 Jul 09 Posts: 1578 Credit: 6,031,311,851 RAC: 12,654,482	Message 48027 - Posted: 23 Oct 2017 | 7:40:12 UTC
	These seem to be from the same sequence of workunits as we've been discussing in the 'Bad batch of tasks?' thread. Speedy's workunit 12788007 has the same error on three different computers.
	ID: 48027 | Rating: 0

Erich56 Joined: 1 Jan 15 Posts: 1094 Credit: 7,162,807,676 RAC: 24,413,424	Message 48028 - Posted: 23 Oct 2017 | 7:40:39 UTC
	Most of the tasks from this batch are faulty. See here: http://gpugrid.net/forum_thread.php?id=4632
	ID: 48028 | Rating: 0

Zalster Joined: 26 Feb 14 Posts: 211 Credit: 4,496,324,562 RAC: 0	Message 48031 - Posted: 23 Oct 2017 | 14:24:34 UTC - in response to Message 48028.
	Thanks to both you and Richard. I see now that they are part of that bad batch. OK, nothing we can do then...
	ID: 48031 | Rating: 0

Retvari Zoltan Joined: 20 Jan 09 Posts: 2343 Credit: 16,201,255,749 RAC: 124	Message 48034 - Posted: 23 Oct 2017 | 21:38:55 UTC - in response to Message 48031. Last modified: 23 Oct 2017 | 21:48:39 UTC
	> Thanks to both you and Richard. I see now that they are part of that bad batch. OK, nothing we can do then...
	They are part of that bad batch, but they fail for different reasons. The other thread ('Bad batch of tasks?') is about the tasks which fail right after the start with a "the simulation became unstable" error. Perhaps the algorithm that checks the simulation's stability is set to be overly sensitive for this part of the batch.
	This thread is about the tasks which run for hours until they exceed the disk usage limit set for them on the server, and then error out. This is much more annoying than the 'original' failure, as it wastes electricity and time, and it can be easily fixed by raising the disk usage limit of a task (if that is really necessary, that is, if the high disk usage is not the result of another error).
	ID: 48034 | Rating: 0

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0	Message 48045 - Posted: 25 Oct 2017 | 10:09:55 UTC - in response to Message 48034.
	Dears, I think Adria has cancelled those tasks. Perhaps they had more atoms than usual, which kind of fooled the pre-submission checks and caused a higher failure rate. However, some WUs succeeded; when this happens, it's harder for us to tell what's going on. Thanks for your patience. T
	ID: 48045 | Rating: 0

Zalster Joined: 26 Feb 14 Posts: 211 Credit: 4,496,324,562 RAC: 0	Message 48047 - Posted: 25 Oct 2017 | 12:44:03 UTC - in response to Message 48045.
	Thank you, Toni, for the response and update. It happens. Glad they can sort this out and hopefully find out what happened, so that we can process these correctly. Science is both setbacks and successes. We learn from both.
	ID: 48047 | Rating: 0

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0	Message 48048 - Posted: 25 Oct 2017 | 13:42:06 UTC - in response to Message 48047. Last modified: 26 Oct 2017 | 14:06:51 UTC
	Thanks... please consider that this is a consequence of the fact that we are interested in a variety of systems and conditions, which means that we cannot make all workunits exactly the same.
	ID: 48048 | Rating: 0

Richard Haselgrove Joined: 11 Jul 09 Posts: 1578 Credit: 6,031,311,851 RAC: 12,654,482	Message 48074 - Posted: 31 Oct 2017 | 12:54:00 UTC
	I've just received a brand-new ADRIA short run task - e46s1_e44s2p0f280-ADRIA_FOLDPG80_crystal_ss_contacts_50_proteinG_2-0-1-RND2909_0. Hoping to catch any disk usage errors before they happen, I had a look at the file sizes.
	The largest single upload file (_9) is allowed to reach 328,000,000 bytes. But the workunit as a whole is only allowed to use 300,000,000 bytes of disk space (<rsc_disk_bound>). That seems a touch inconsistent...
	I'll try to keep an eye on the file size as it runs, and adjust the disk bound if I need to. So far, 1,308 KB in 15 minutes. Don't forget, we may have to allow for up to 150 MB of program files (like cufft64_80.dll) in the disk usage limit.
	ID: 48074 | Rating: 0
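
The figures in the post above can be checked with a little arithmetic. The sketch below assumes the BOINC client reports "MB" in units of 1,048,576 bytes, which would explain why a <rsc_disk_bound> of 300,000,000 bytes shows up as 286.10MB in the abort message quoted earlier in the thread. It also assumes, as the post suggests may be the case, that up to 150 MB of program files count against the same bound. All variable names are illustrative.

```python
# Values quoted in the post above.
RSC_DISK_BOUND_BYTES = 300_000_000      # <rsc_disk_bound> for the whole workunit
MAX_UPLOAD_FILE_BYTES = 328_000_000     # size cap on the largest upload file (_9)

# Assumption: "up to 150 MB" of program files, taken here as 150 * 2**20 bytes.
PROGRAM_FILES_BYTES = 150 * 1024 * 1024

BYTES_PER_MIB = 1024 * 1024


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (2**20 bytes)."""
    return n_bytes / BYTES_PER_MIB


bound_mib = round(to_mib(RSC_DISK_BOUND_BYTES), 2)

# A single file is allowed to grow larger than the whole workunit's disk bound.
upload_fits = MAX_UPLOAD_FILE_BYTES <= RSC_DISK_BOUND_BYTES

# Space left for output files if program files do count against the bound.
headroom_bytes = RSC_DISK_BOUND_BYTES - PROGRAM_FILES_BYTES

print(f"disk bound: {bound_mib} MiB")
print(f"largest upload file fits within bound: {upload_fits}")
print(f"headroom after program files: {to_mib(headroom_bytes):.1f} MiB")
```

Under these assumptions the bound is about 286.1 MiB, the upload cap exceeds it by 28,000,000 bytes, and the room actually left for output files could be as little as roughly 136 MiB.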

Richard Haselgrove Joined: 11 Jul 09 Posts: 1578 Credit: 6,031,311,851 RAC: 12,654,482	Message 48075 - Posted: 31 Oct 2017 | 14:56:33 UTC - in response to Message 48074.
	Reached 12.3 MB and 37.5% progress after 2 hours 15 minutes - I think this one is going to make it.
	Looking at long-run tasks on a different machine, they've been given a <rsc_disk_bound> of 4 billion bytes - 4,000,000,000, more than ten times as much. It may have been the difference between the short and long run queue setups, rather than the tasks themselves, which caught Adria out last time.
	ID: 48075 | Rating: 0
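
The "going to make it" judgement can be reproduced with a simple projection. This is a rough sketch that assumes both output size and run time grow linearly with reported progress, which the thread does not confirm; the function and variable names are illustrative.

```python
# Observations quoted in the post above.
SIZE_SO_FAR_MB = 12.3
PROGRESS_FRACTION = 0.375          # 37.5% progress
ELAPSED_MINUTES = 2 * 60 + 15      # 2 hours 15 minutes

# Disk bound for the short-run task, from the earlier post in this thread.
RSC_DISK_BOUND_BYTES = 300_000_000


def project_linear(value_so_far, progress_fraction):
    """Extrapolate a quantity to 100% progress, assuming linear growth."""
    if not 0 < progress_fraction <= 1:
        raise ValueError("progress_fraction must be in (0, 1]")
    return value_so_far / progress_fraction


projected_size_mb = round(project_linear(SIZE_SO_FAR_MB, PROGRESS_FRACTION), 1)
projected_hours = project_linear(ELAPSED_MINUTES, PROGRESS_FRACTION) / 60

# Compare against the bound, converted to MiB (2**20 bytes).
bound_mib = RSC_DISK_BOUND_BYTES / (1024 * 1024)
within_bound = projected_size_mb < bound_mib

print(f"projected final size: {projected_size_mb} MB")
print(f"projected run time: {projected_hours} hours")
print(f"within disk bound: {within_bound}")
```

The projection gives a final file of roughly 33 MB after about 6 hours, far below the 300,000,000-byte bound even if program files take up a large share of it.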

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0	Message 48076 - Posted: 31 Oct 2017 | 15:05:20 UTC - in response to Message 48075.
	Yes, it's inconsistent, but the real problem is that the _9 file should not become that large. I thought the workunits were cancelled... are they still around? Thanks, T
	ID: 48076 | Rating: 0

Toni Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0	Message 48077 - Posted: 31 Oct 2017 | 15:22:26 UTC - in response to Message 48076.
	Also, am I mistaken, or should they have been long WUs?
	ID: 48077 | Rating: 0

Richard Haselgrove Joined: 11 Jul 09 Posts: 1578 Credit: 6,031,311,851 RAC: 12,654,482	Message 48078 - Posted: 31 Oct 2017 | 17:02:47 UTC - in response to Message 48077.
	> Also, am I mistaken, or should they have been long WUs?
	Speedy's workunit, from the previous bad batch, ran for between 32,429 seconds (GTX 1080) and 115,002 seconds (GTX 960). Yes, I think those should have been 'long queue' values.
	My current task from the new batch today is on course for a 6-hour run (GTX 970) and a 33 MB final file size - I think we're going to make it :-)
	ID: 48078 | Rating: 0

Speedy Joined: 19 Aug 07 Posts: 42 Credit: 28,391,082 RAC: 0	Message 48080 - Posted: 31 Oct 2017 | 21:31:20 UTC - in response to Message 48078. Last modified: 31 Oct 2017 | 21:32:23 UTC
	>> Also, am I mistaken, or should they have been long WUs?
	> Speedy's workunit, from the previous bad batch, ran for between 32,429 seconds (GTX 1080) and 115,002 seconds (GTX 960). Yes, I think those should have been 'long queue' values.
	> My current task from the new batch today is on course for a 6-hour run (GTX 970) and a 33 MB final file size - I think we're going to make it :-)
	The task that ran for 150,002 seconds was on a 970. I also agree these should have been in the long queue.
	ID: 48080 | Rating: 0

Richard Haselgrove Joined: 11 Jul 09 Posts: 1578 Credit: 6,031,311,851 RAC: 12,654,482	Message 48081 - Posted: 31 Oct 2017 | 22:16:35 UTC - in response to Message 48080.
	> The task that ran for 150,002 seconds was on a 970.
	I did look at the stderr_txt, and the first three starts all say 960. It's only now that I look more carefully that I see the final part of the run was done on a 970.
	ID: 48081 | Rating: 0
