Message boards : Number crunching : All ADRIA_2OV5_CLOSED, ADRIA_2OV5_CONF_CLOSED & ADRIA_2OV5_OPEN failing
Author | Message |
---|---|
All ADRIA_2OV5_CLOSED, ADRIA_2OV5_CONF_CLOSED & ADRIA_2OV5_OPEN are failing. | |
ID: 44321 | |
I had 10 tasks failed on me today: | |
ID: 44329 | |
Multiple WUs are failing on several of my systems. I will set no new work until the problem is resolved. | |
ID: 44334 | |
Some of my GPUs are not getting work because of these flawed WUs. Ridiculous. | |
ID: 44335 | |
One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS! | |
ID: 44336 | |
Same here. | |
ID: 44337 | |
One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS! When your quota says 1 WU/day you will actually get 2 a day. You are compounding the problem by "Aborting WUs", as these also reduce your daily quota. Adria is going for it again with ADRIA_2OV5_AMBER_CLOSED and ADRIA_2OV5_AMBER_OPEN. Make sure you don't abort these as they may actually work. 3rd time lucky. | |
ID: 44355 | |
One of my machines with 2 GPUs has gotten 12 of these bad ADRIA WUs today and now won't get work. What's worse, it says it's now allowed only 1 WU/day. WOULD SOMEBODY PLEASE FIX THIS! GPUGRID 8/30/2016 12:43:17 AM Requesting new tasks for NVIDIA GPU GPUGRID 8/30/2016 12:43:22 AM Scheduler request completed: got 0 new tasks GPUGRID 8/30/2016 12:43:22 AM No tasks sent GPUGRID 8/30/2016 12:43:22 AM No tasks are available for Long runs (8-12 hours on fastest card) GPUGRID 8/30/2016 12:43:22 AM This computer has finished a daily quota of 1 tasks Well, at least SETI got some GPU time and their WUs actually run. Just got my first ADRIA_2OV5_AMBER. After the last 2 batches I don't have a lot of confidence, but we'll see. It's crazy that they just let these bad batches run without cancelling them when they know they're flawed. 12 bad WUs in a day is ridiculous. | |
ID: 44357 | |
Just got my first ADRIA_2OV5_AMBER. After 10 minutes it's still running. That's a good sign. It looks to be about the same length as the old ADRIA WUs that would complete (pre OPEN & CLOSED), and a bit longer than the recent GERARD WUs on my machines. | |
ID: 44358 | |
About 1 hour ago, one of my GTX980Ti hosts downloaded: | |
ID: 44359 | |
It looks like the jobs have been re-worked by Amber and are being sent out again. Here is some info from the Project Status page. | |
ID: 44360 | |
I have e1s45_3-ADRIA_2OV5_AMBER_OPEN2-0-1-RND2778_1, which seems to be running well (with low CPU usage) on a Windows GTX 970. On Linux with a GTX 1080 I get "#SWAN: FATAL: cannot find image for module [.nonbonded.cu.] for device version 610", so no Linux GTX 1080 support yet. | |
ID: 44361 | |
I have one, e1s16_1-ADRIA_2OV5_AMBER_OPEN1-0-1-RND6090_1. It has been running for a little over 7 hours and says it has a day left to go, so it looks good so far. Its CPU usage averages 0.7% and GPU load is about 70% on a Quadro M4000. | |
ID: 44366 | |
About 1 hour ago, one of my GTX980Ti hosts downloaded: The tasks are named based on their function, the function of the test or desired result, so OPEN and CLOSED probably refer to the state of a protein or interaction substance or desired effect and not the task type on our end. It is an indicator name for the result so the servers know how to receive and store it and how the scientists then have to classify and analyze it. ____________ 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org | |
ID: 44370 | |
Just got my first ADRIA_2OV5_AMBER. This AMBER_OPEN finished in just under 27 hours on a 750Ti. Unfortunately, I just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot. The WU on the other GPU crashed on restart and the ADRIA_2OV5_AMBER_CLOSED restarted from zero. I aborted it. Maybe it'll run on a different GPU... | |
ID: 44384 | |
Unfortunately, I just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot. I have been debating with myself whether a bad WU can cause a machine to crash. It seems to have happened at times, but it is hard to pin down. I often find that it is really a hardware problem, but you never know. | |
ID: 44385 | |
Unfortunately, I just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot. I've had it happen several times across all my systems except my laptop. I think the fact that there is a battery in it makes it more stable, so it does not shut down. On a regular PC, even with a backup, a software error (which a failed WU might be considered) could surely send an OS into a soft crash that ends as a full reboot. (What I mean by a soft crash is one that affects other parts of the OS, like explorer.exe or some dependent system function.) One program affects the next; some of those programs can simply be restarted by the OS, like explorer.exe, but others need a reboot and some can't wait, like the Winlogon process. I suspect the same about Linux systems, though I am speaking in Windows terms. So if it fails and only restarts explorer.exe, for example, the task has failed but affects no other tasks or the system. If it affects a driver or critical system process that can be reloaded, it could fail all the tasks or just the other one on the same GPU. And if it is critical to the OS, it could fail all tasks and reboot. Even when a program does all its work 'inside a bubble' that does not affect the rest of the system, the program itself is not a bubble: it still draws on other system resources and processes. You can't isolate anything on a system that runs other things, including an OS. About the laptop too, I can't say for certain that it never crashed because of a WU. It's just that I never noticed it happen that way. It is as vulnerable to a software error as any other PC without a battery. I just don't expect too much from it, so I don't check BOINC on it as much as I do on the stronger PCs. I can't say I never woke up to find it on a login screen after some shutdown and figured it was a Windows update or something, but I can't say for certain it was never a failed task either. | |
ID: 44387 | |
Unfortunately, I just had an ADRIA_2OV5_AMBER_CLOSED crash and actually cause the machine to reboot. The battery (as in your laptop) will only protect it from power glitches but not from the "software errors" you mention. Simply as a point of information, I've built hundreds (maybe thousands) of PCs since the days of the original 8088/8086 (not to mention V20 & V30), and Apple II clones and CP/M machines before that. At the peak I sometimes built 2 or 3 per week. Sheesh, I'm showing my age. Point is that I test them. I know how to spot hardware errors. Almost all my crunchers run only a minimal set of programs to support BOINC. The one area that could be a possibility is that new higher-usage WUs may be stressing a GPU further than previous WUs. If I suspect that, I downclock the GPU. None of my GPUs are OCed except for factory OC, and many are downclocked. I did have a "validate" error (as you ask about in the other thread) a while ago on one 750Ti and downclocked it a bit. Haven't had one since. My errors are mostly caused by power glitches that are too common around here. The power generally goes out for a second or two, enough to reboot the computers. Hopefully the next app will be more fault tolerant. The apps from most other projects do not cause their WUs to error, in my experience. What I did notice in tracking down your box with validate errors (for Richard in the other thread) is that your GTX 980Ti and 980 GPUs are throwing quite a few errors on WUs that should run on them (the GTX 980Ti cards seem to be failing the long GIANNIs, sometimes after running for a long time). You may wish to try downclocking. A voltage increase may also work, but that causes more heat and power draw for any extra speed that you get and may also adversely affect GPU life. Best of luck getting it sorted out! | |
ID: 44403 | |
I would say the most fragile things in the OS are the nVidia GPU drivers. | |
ID: 44496 | |
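
The event-log excerpt quoted earlier in the thread shows the message a host prints once it has hit its daily quota. For anyone wanting to check a saved log for that condition, here is a minimal Python sketch. The line layout (project, date, time, AM/PM, message) is an assumption taken only from the lines quoted in this thread, and `parse_line` and `quota_hits` are hypothetical helper names, not part of BOINC.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt quoted in this thread:
# "<project> <M/D/YYYY> <H:MM:SS> <AM|PM> <message>"
LINE_RE = re.compile(
    r"^(?P<project>\S+) "
    r"(?P<stamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"(?P<message>.*)$"
)
# Wording copied from the quoted log line.
QUOTA_RE = re.compile(r"finished a daily quota of (\d+) tasks")


def parse_line(line):
    """Split one log line into (project, timestamp, message), or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m.group("stamp"), "%m/%d/%Y %I:%M:%S %p")
    return m.group("project"), when, m.group("message")


def quota_hits(lines):
    """Return (project, timestamp, quota) for every daily-quota message."""
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not match the assumed layout
        project, when, message = parsed
        q = QUOTA_RE.search(message)
        if q:
            hits.append((project, when, int(q.group(1))))
    return hits


# Lines copied from the post above.
LOG = [
    "GPUGRID 8/30/2016 12:43:17 AM Requesting new tasks for NVIDIA GPU",
    "GPUGRID 8/30/2016 12:43:22 AM Scheduler request completed: got 0 new tasks",
    "GPUGRID 8/30/2016 12:43:22 AM No tasks sent",
    "GPUGRID 8/30/2016 12:43:22 AM This computer has finished a daily quota of 1 tasks",
]

for project, when, quota in quota_hits(LOG):
    print(project, when.isoformat(), "daily quota:", quota)
```

Logs saved in another locale or date format would need a different timestamp pattern.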
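
Several posts above sort tasks into good and bad batches by name. As a rough illustration, the Python sketch below pulls the batch label out of a task name so results can be tallied per batch. The field layout (dash-separated, batch in the second field, optional trailing run number) is only inferred from the two task names posted in this thread and is not a documented GPUGRID format; `batch_family` is a hypothetical helper.

```python
import re
from collections import Counter


def batch_family(task_name):
    """Return the batch label of a task name, minus any trailing digits.

    Assumes the second dash-separated field carries the batch, as in the
    two names quoted in this thread.
    """
    fields = task_name.split("-")
    if len(fields) < 2:
        raise ValueError("unexpected task name: " + task_name)
    return re.sub(r"\d+$", "", fields[1])


# Task names copied from the posts above.
TASKS = [
    "e1s45_3-ADRIA_2OV5_AMBER_OPEN2-0-1-RND2778_1",
    "e1s16_1-ADRIA_2OV5_AMBER_OPEN1-0-1-RND6090_1",
]

# Count how many tasks were seen per batch family.
counts = Counter(batch_family(name) for name in TASKS)
for family, n in counts.items():
    print(family, n)
```

Pairing each name with its outcome (completed or errored) would extend this into a per-batch failure count.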