Much longer run times?

Message boards : Number crunching : Much longer run times?

zombie67 [MM]
Volunteer tester
Joined: 25 Apr 09
Posts: 58
Credit: 1,785,257
RAC: 275
Message 1747 - Posted: 16 Nov 2009, 14:33:11 UTC

I have a task that has now been running for 11 hours, and is showing only 27% done. BOINC is estimating another 34 hours to complete. I have several more with similar times, across several machines. And these are fast machines. Is this expected/normal?
@zombie_67
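As a rough sanity check on those figures, here is a sketch of a naive linear extrapolation from elapsed time and fraction done. It assumes progress accrues at a constant rate. The BOINC client may use other information when it estimates remaining time, so its figure need not match.

```python
def remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Naive linear estimate of remaining run time.

    Assumes progress accrues at a constant rate, so
    total = elapsed / fraction_done and remaining = total - elapsed.
    """
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total - elapsed_hours


if __name__ == "__main__":
    # Figures from the post above: 11 hours elapsed, 27% done.
    print(f"{remaining_hours(11.0, 0.27):.1f} hours remaining")
```

For 11 hours at 27%, this gives roughly 29.7 hours remaining (about 40.7 hours in total). That is somewhat below the 34 hours BOINC reported.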

Tim Turner
Joined: 1 May 09
Posts: 570
Credit: 184,322
RAC: 0
Message 1748 - Posted: 16 Nov 2009, 14:53:39 UTC - in response to Message 1747.  

App type? AutoDock or mdrun?
Tim Turner
Public Relations Admin
Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
If you need help via voice or Convo, PM me and I will give you details on where I will be: TeamSpeak, Yahoo Messenger, or Skype.
Di
Joined: 13 Nov 09
Posts: 13
Credit: 36,826
RAC: 0
Message 1749 - Posted: 16 Nov 2009, 15:22:33 UTC
Last modified: 16 Nov 2009, 15:24:31 UTC

md_500000_steps_autodock_ga_run_10_29699_ChemDiv_8007-4456_1258149744061143000_1258344446383900000_2 using mdrun version 720
md_500000_steps_autodock_ga_run_10_31933_ChemDiv_5408-3422_1258163057147239000_1258344518888438000_0 using mdrun version 720

50+ hours on 3.2 GHz

I think that, for a project at the alpha/beta stage, tasks this long are madness. It's a lot of hours, and you must guarantee me 1000% that there will be NO errors and that I will get my credits for this huge amount of work.
From: Moscow, Russia | DrugDiscovery,Milkyway/Collatz,FreeHal,Majestic12 at E7300 3.2Ghz, Radeon 3850 HIS turbo GHz | Team: Russia
Van Fanel
Joined: 16 Sep 09
Posts: 17
Credit: 103,550
RAC: 0
Message 1750 - Posted: 16 Nov 2009, 15:36:12 UTC

Same here...

Appl.: mdrun 2.08
WU name: md_500000_steps_autodock_ga_run_10_25186_ChemDiv_0422-0057_1258127924579470000_1258375703137446000
Machine: Ubuntu 64bit at 2.66GHz.

I'm estimating something like 48 hours for completion.
Tim Turner
Joined: 1 May 09
Posts: 570
Credit: 184,322
RAC: 0
Message 1752 - Posted: 16 Nov 2009, 16:01:32 UTC - in response to Message 1750.  
Last modified: 16 Nov 2009, 16:05:05 UTC

Crap! I have about 40 of these... and yes, this was by design; it's just a general population test. You can abort these if you wish, but please keep in mind that this is helping the project figure out how many picoseconds to generate and simulate.

We may change it and do 50,000 instead of 500,000.
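The sketch below estimates the effect of cutting the step count. It assumes run time is roughly proportional to the number of MD steps, ignoring fixed setup and I/O overhead. The 50-hour baseline for 500,000 steps comes from the 50+ hours reported on a 3.2 GHz machine earlier in this thread.

```python
def scaled_runtime(baseline_hours: float, baseline_steps: int, new_steps: int) -> float:
    """Estimate run time for a different step count.

    Assumes the cost per MD step is constant, so run time is
    proportional to the number of steps (fixed overhead ignored).
    """
    if baseline_steps <= 0:
        raise ValueError("baseline_steps must be positive")
    return baseline_hours * new_steps / baseline_steps


if __name__ == "__main__":
    # ~50 hours for 500,000 steps, as reported earlier in the thread.
    for steps in (500_000, 50_000, 5_000):
        print(f"{steps:>7} steps: ~{scaled_runtime(50.0, 500_000, steps):.1f} hours")
```

Under that assumption, 50,000 steps would take about 5 hours on the same machine, and 5,000 steps about half an hour.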
Di
Joined: 13 Nov 09
Posts: 13
Credit: 36,826
RAC: 0
Message 1755 - Posted: 16 Nov 2009, 16:13:30 UTC
Last modified: 16 Nov 2009, 16:13:57 UTC

> You can abort these if you wish, but please keep in mind that this is helping the project figure out how many picoseconds to generate and simulate.
>
> We may change it and do 50,000 instead of 500,000.


You can make it 1,000,000 if you need to. But give it good checkpointing and a low error rate. And give slow computers a chance to crunch smaller WUs.
Van Fanel
Joined: 16 Sep 09
Posts: 17
Credit: 103,550
RAC: 0
Message 1756 - Posted: 16 Nov 2009, 17:49:11 UTC

Just one question, is there any checkpointing in these monster WUs?

From what I'm seeing in my 4-hours-old WUs there isn't any... :(
Tim Turner
Joined: 1 May 09
Posts: 570
Credit: 184,322
RAC: 0
Message 1757 - Posted: 16 Nov 2009, 17:55:37 UTC - in response to Message 1756.  

> Just one question, is there any checkpointing in these monster WUs?
>
> From what I'm seeing in my 4-hours-old WUs there isn't any... :(


There is no checkpointing in this version of the app; for some reason it doesn't seem to checkpoint. I know that 4 versions ago, checkpointing was on.

I'll let Jack know.
Di
Joined: 13 Nov 09
Posts: 13
Credit: 36,826
RAC: 0
Message 1758 - Posted: 16 Nov 2009, 18:03:03 UTC
Last modified: 16 Nov 2009, 18:06:31 UTC

> ...is there any checkpointing in these monster WUs?

Good question :) Who wants to try to figure it out? Not me, because I'm at 30% now, and I guess that there are no checkpoints at all, or they don't work right, or... who knows...

ohh... zpm answered already
Tim Turner
Joined: 1 May 09
Posts: 570
Credit: 184,322
RAC: 0
Message 1759 - Posted: 16 Nov 2009, 18:13:44 UTC - in response to Message 1758.  
Last modified: 16 Nov 2009, 18:14:59 UTC

> ...is there any checkpointing in these monster WUs?
>
> Good question :) Who wants to try to figure it out? Not me, because I'm at 30% now, and I guess that there are no checkpoints at all, or they don't work right, or... who knows...
>
> ohh... zpm answered already



The -cpt option was accidentally left out of the command-line build... but these should run to completion as long as you don't stop them, crash your computer, or let automatic Windows updates restart it. :)))))) We're still alpha/beta and we still have the training wheels on! Well, some of us do!
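For context on the missing option: in GROMACS, mdrun's -cpt option sets the checkpoint interval in minutes. The -cpo option names the checkpoint file to write, and -cpi names a checkpoint file to continue from. Below is a minimal sketch of how a wrapper might assemble such a command line. The Python wrapper, the file names topol.tpr and state.cpt, and the 15-minute interval are illustrative assumptions, not the project's actual build or workunit generator.

```python
import os
import shlex


def mdrun_command(tpr: str, checkpoint: str = "state.cpt",
                  interval_min: float = 15.0) -> list[str]:
    """Build an mdrun command line with periodic checkpointing.

    -cpt: checkpoint interval in minutes.
    -cpo: checkpoint file to write.
    -cpi: checkpoint file to resume from; only added if it already
          exists, so a fresh task starts from the .tpr input.
    """
    cmd = ["mdrun", "-s", tpr, "-cpt", str(interval_min), "-cpo", checkpoint]
    if os.path.exists(checkpoint):
        cmd += ["-cpi", checkpoint]
    return cmd


if __name__ == "__main__":
    # "topol.tpr" and "state.cpt" are hypothetical file names.
    print(shlex.join(mdrun_command("topol.tpr")))
```

The wrapper adds -cpi only when the checkpoint file exists. A fresh task then starts from the .tpr input, while a restarted task continues from its last checkpoint instead of step 0.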
Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 1767 - Posted: 17 Nov 2009, 11:59:50 UTC - in response to Message 1759.  
Last modified: 17 Nov 2009, 12:02:09 UTC

If you are curious about the checkpointing options, see the mdrun manual page below. I have added the checkpointing option to the workunit generator.
http://manual.gromacs.org/current/online/mdrun.html

If 2 days of work is too much, then cancel it. It's not intended as the long-term length. We were discussing the possibility of doing 500k steps, but it looks like that will take too long.
Van Fanel
Joined: 16 Sep 09
Posts: 17
Credit: 103,550
RAC: 0
Message 1768 - Posted: 17 Nov 2009, 12:15:54 UTC

It's not a matter of having 2 days of work; it's a matter of the WUs not erroring out during the process.

I ran 4 of these WUs, and they have all returned an error after over 13 hours of CPU time. These are the WUs:

346872
346864
346863
346861

The error is the same for all of them: "Maximum disk usage exceeded". The problem is not on my side, as I have over 25 GB dedicated to BOINC, so it must be something on the project side.

Meanwhile, the 5,000-step WUs have arrived, and they are ALL aborting with an error. Here's an example:

347160
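Here is a plausible explanation, not confirmed in this thread, for why a large global allowance does not help. The BOINC client checks each task's slot directory against a per-workunit limit set by the project (the workunit's rsc_disk_bound). It aborts the task with "Maximum disk usage exceeded" once that limit is crossed, regardless of how much space the user has given BOINC overall. A minimal sketch of that kind of check follows, with hypothetical file names and bounds:

```python
import os
import tempfile


def slot_usage_bytes(slot_dir: str) -> int:
    """Sum the sizes of all files under a task's slot directory."""
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def exceeds_disk_bound(slot_dir: str, rsc_disk_bound: float) -> bool:
    """True if the task's files are over its per-workunit disk bound.

    The bound is checked per task; free space elsewhere on the
    disk (or a large global BOINC allowance) does not raise it.
    """
    return slot_usage_bytes(slot_dir) > rsc_disk_bound


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as slot:
        # Simulate a task that has written 2,048 bytes of output.
        with open(os.path.join(slot, "traj.trr"), "wb") as f:
            f.write(b"\0" * 2048)
        print(exceeds_disk_bound(slot, rsc_disk_bound=1024))  # True
        print(exceeds_disk_bound(slot, rsc_disk_bound=4096))  # False
```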
zombie67 [MM]
Volunteer tester
Joined: 25 Apr 09
Posts: 58
Credit: 1,785,257
RAC: 275
Message 1772 - Posted: 17 Nov 2009, 17:44:34 UTC

Wow! I have 17 of these that failed for the same reason, after running 13 to 26 hours each. That is about 282 hours of wasted time. So far none have completed successfully. I have another 11 still in progress.
robertmiles
Joined: 13 Oct 09
Posts: 105
Credit: 21,462
RAC: 0
Message 1776 - Posted: 18 Nov 2009, 3:49:21 UTC - in response to Message 1767.  
Last modified: 18 Nov 2009, 3:53:55 UTC

> If you are curious about the checkpointing options, see the mdrun manual page below. I have added the checkpointing option to the workunit generator.
> http://manual.gromacs.org/current/online/mdrun.html
>
> If 2 days of work is too much, then cancel it. It's not intended as the long-term length. We were discussing the possibility of doing 500k steps, but it looks like that will take too long.


Looks like I'll have to abort one of the two mdrun 7.20 workunits of this kind that I have, since its estimated time to completion is 52 hours and it's on a machine that I don't allow to run much over 16 hours at a time. I'll let the other one go on, since it's on a faster machine that can often run for over 6 days at a time without problems.

64-bit Vista SP2 and 64-bit Vista SP1
dividedbymyself
Joined: 22 Apr 09
Posts: 13
Credit: 23,266
RAC: 0
Message 1779 - Posted: 18 Nov 2009, 17:40:46 UTC

I canceled a few, but let one run, and it errored out after about 24 hours with the error "Maximum disk usage exceeded". I do have over 22 GB left for BOINC, though.

Bart
Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 1782 - Posted: 18 Nov 2009, 20:14:59 UTC - in response to Message 1779.  
Last modified: 18 Nov 2009, 20:17:35 UTC

Increase nstxout, nstvout, and nstxtcout. That will reduce the size of the .trr and .xtc, which are likely the largest files that you will be writing. If you want to decrease the .xtc file even more, specify xtc_grps. If the .log and .edr files get out of hand, adjust nstlog and nstenergy. The -compact flag only affects the .log file, and is set by default.

http://wiki.gromacs.org/index.php/Using_Trajectory_Information#Reducing_Trajectory_Storage_Volume
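This sketch gives a rough estimate of the uncompressed .trr payload, to show why the output intervals matter on a 500,000-step run. It makes several assumptions:

- single precision (4 bytes per float);
- 3 floats per atom for each vector written (x, v, or f);
- a frame at step 0 and then every nst* steps;
- frame headers and the compressed .xtc file are ignored.

The 30,000-atom system size is hypothetical.

```python
BYTES_PER_FLOAT = 4  # assumes a single-precision build


def trr_frames(nsteps: int, nst: int) -> int:
    """Frames written for an output interval of `nst` steps (0 = never)."""
    if nst <= 0:
        return 0
    return nsteps // nst + 1  # step 0 plus every nst-th step


def trr_bytes(nsteps: int, nst: int, natoms: int, vectors: int = 1) -> int:
    """Rough uncompressed .trr payload, ignoring frame headers.

    `vectors` counts how many of x, v, f are written at this interval.
    """
    per_frame = natoms * 3 * BYTES_PER_FLOAT * vectors
    return trr_frames(nsteps, nst) * per_frame


if __name__ == "__main__":
    natoms = 30_000  # hypothetical system size
    for nst in (100, 1_000, 5_000):
        mb = trr_bytes(500_000, nst, natoms, vectors=2) / 1e6  # x and v
        print(f"nstxout = nstvout = {nst:>5}: ~{mb:,.0f} MB")
```

Under these assumptions, writing x and v every 100 steps comes to about 3.6 GB, compared with about 73 MB when writing every 5,000 steps.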
zombie67 [MM]
Volunteer tester
Joined: 25 Apr 09
Posts: 58
Credit: 1,785,257
RAC: 275
Message 1783 - Posted: 18 Nov 2009, 21:18:18 UTC - in response to Message 1782.  

> Increase nstxout, nstvout, and nstxtcout. That will reduce the size of the .trr and .xtc, which are likely the largest files that you will be writing. If you want to decrease the .xtc file even more, specify xtc_grps. If the .log and .edr files get out of hand, adjust nstlog and nstenergy. The -compact flag only affects the .log file, and is set by default.
>
> http://wiki.gromacs.org/index.php/Using_Trajectory_Information#Reducing_Trajectory_Storage_Volume


Who is this message directed to?
zombie67 [MM]
Volunteer tester
Joined: 25 Apr 09
Posts: 58
Credit: 1,785,257
RAC: 275
Message 1785 - Posted: 18 Nov 2009, 21:30:23 UTC

Can this run please be cancelled? I keep getting re-issues of these tasks, and it's just wasting a LOT of crunching time.
dividedbymyself
Joined: 22 Apr 09
Posts: 13
Credit: 23,266
RAC: 0
Message 1786 - Posted: 18 Nov 2009, 22:26:12 UTC - in response to Message 1782.  

> Increase nstxout, nstvout, and nstxtcout. That will reduce the size of the .trr and .xtc, which are likely the largest files that you will be writing. If you want to decrease the .xtc file even more, specify xtc_grps. If the .log and .edr files get out of hand, adjust nstlog and nstenergy. The -compact flag only affects the .log file, and is set by default.
>
> http://wiki.gromacs.org/index.php/Using_Trajectory_Information#Reducing_Trajectory_Storage_Volume

I suppose this was meant for me. Still, it's somewhat vague, so it doesn't help me much. First, the linked page has been moved. Is this the new link? And in which file should I make the changes, and by how much (percentages or numbers)? I can't find any file that contains such settings. Maybe they're supposed to be in a slot directory, but as I currently have no DD WUs in cache, there's no DD slot either. Maybe it's a better idea for you to update the settings yourself before sending out new work like this? You know better than I do what to do.

BTW, most of the WUs I aborted showed no progress. They didn't move at all and stayed at 0.000% for hours.

[AF>Dell>LesDelliens]La frite
Joined: 1 May 09
Posts: 10
Credit: 23,452
RAC: 0
Message 1787 - Posted: 18 Nov 2009, 22:37:16 UTC

Sorry, but these very long workunits WITHOUT checkpoints are too risky and waste a lot of time.

See these 2 of mine that finished with an error after 172,000 seconds of work; that's 48 hours!!
http://boinc.drugdiscoveryathome.com/workunit.php?wuid=346914
http://boinc.drugdiscoveryathome.com/workunit.php?wuid=346903
Will we get credits for the lost work? I doubt it...

I know it's still an alpha project but please reconsider giving such workunits out.

Thanks


©2017 All rights reserved | Design by Digital BioPharm Ltd