Checkpoint on reboot

Message boards : Number crunching : Checkpoint on reboot


Profile Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 2205 - Posted: 2 Feb 2010, 15:05:25 UTC

I am starting this thread because we have a problem checkpointing the apps across a reboot. I have tested this myself, and for some reason the apps do not resume from their checkpoint when the machine is turned off and on again. I looked at Ibercivis, which is running some similar apps, and they have no such problem. Maybe I should send them an email. Please post any suggestions.
Profile robertmiles

Joined: 13 Oct 09
Posts: 105
Credit: 21,462
RAC: 0
Message 2207 - Posted: 3 Feb 2010, 21:16:33 UTC - in response to Message 2205.  
Last modified: 3 Feb 2010, 21:54:14 UTC

As I suggested in another thread, make sure that the checkpoint file includes the current CPU register values for the workunit being checkpointed.

Also, the checkpoint file should start and end with something used to check whether all of it was written properly - the checkpoint number should do.
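
A minimal sketch of that start-and-end check, in Python for brevity. The file layout (an 8-byte checkpoint number, then the payload, then the same number again) and the function names are assumptions made for illustration, not anything the project's apps actually use.

```python
import os
import struct

# Assumed layout: 8-byte little-endian checkpoint number at both ends.
MARKER = struct.Struct("<Q")

def write_checkpoint(path, number, payload):
    """Write marker + payload + marker and force the data to disk."""
    with open(path, "wb") as f:
        f.write(MARKER.pack(number))
        f.write(payload)
        f.write(MARKER.pack(number))
        f.flush()
        os.fsync(f.fileno())

def read_checkpoint(path):
    """Return (number, payload), or None if the file is missing or incomplete."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if len(data) < 2 * MARKER.size:
        return None  # too short to hold both markers
    (head,) = MARKER.unpack_from(data, 0)
    (tail,) = MARKER.unpack_from(data, len(data) - MARKER.size)
    if head != tail:
        return None  # the write was cut off before the trailing marker
    return head, data[MARKER.size:-MARKER.size]
```

If the machine loses power partway through the write, the trailing number will normally be missing or wrong, so the restore code can reject the file and fall back to an older checkpoint or a clean start.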

If you're trying the brute-force method of checkpointing I suggested, recording the whole address space of the workunit, it's questionable whether the read-only parts need to be included, especially if you have not found a way to record and restore which parts are read-only. If you have sufficient access to the source code, I'd suggest adding code that performs a full memory dump and register dump to a file reserved for this purpose. Call this code once when the checkpoint is written and again just after the attempted restore, then compare the two sets of dumps to get some idea of what isn't restored correctly.
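
The dump comparison could be sketched roughly like this, again in Python. It assumes the two dumps are available as raw byte strings; the function name `diff_ranges` is hypothetical.

```python
def diff_ranges(before, after):
    """Return (start, end) byte ranges, end exclusive, where two dumps differ."""
    ranges = []
    start = None
    common = min(len(before), len(after))
    for i in range(common):
        if before[i] != after[i]:
            if start is None:
                start = i  # a differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, common))
    if len(before) != len(after):
        # Bytes present in only one dump count as a differing tail.
        ranges.append((common, max(len(before), len(after))))
    return ranges
```

For example, `diff_ranges(b"abcdef", b"abXXef")` reports the single range `(2, 4)`. Mapping the reported offsets back to variables or memory regions is the part that needs knowledge of the app.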

If you decide to use the brute-force method but without the read-only parts, look for a way to start the program so that it almost immediately asks the shell for something, which gives the shell a chance to restore the parts that are not read-only.

Something that may help with the Windows version at least under Vista and probably also under Windows 7: Start the Windows Task Manager program. Under the Performance tab, click on the Resource Monitor button. If you've never done this before, Windows is then likely to download and install another program. Once you get the Resource Monitor program started, click on the CPU section and find the line for the app program. If it reports more than one thread, I'd expect you to need to do something extra to suspend all the threads in order to have all memory changes in the app program suspended while you are writing the checkpoint.

If you understand the source code enough, though, you should be able to find a point between major actions, and insert checkpoint and restore code to have the checkpoint file include only the necessary parts of the memory, and make the checkpoint files much smaller.
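
A rough sketch of checkpointing between major actions, in Python. The state here (a step counter and a running total) and the JSON file are stand-ins chosen for illustration; a real app would save whatever small set of values it needs to resume.

```python
import json
import os

def save_state(path, state):
    """Write to a temporary file, then swap it in so a crash never leaves half a file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)

def load_state(path):
    """Return the saved state, or a fresh one if there is no usable checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, ValueError):
        return {"step": 0, "total": 0}

def run(path, n_steps, stop_after=None):
    """Resume from the checkpoint and save again after each major action."""
    state = load_state(path)
    while state["step"] < n_steps:
        state["total"] += state["step"] ** 2  # one "major action"
        state["step"] += 1
        save_state(path, state)  # safe point: nothing is half-done here
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulates the machine being shut down
    return state
```

Calling `run(path, 10, stop_after=3)` and then `run(path, 10)` gives the same final total as a single uninterrupted run, and the checkpoint file holds only two numbers instead of the whole address space.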

Something to think of when doing the restore - does it need to put everything back into the same memory locations, or can it restore it to some other section of memory instead? I'd expect the answer to depend on how the compiler handles the ability to have the program execute at more than one section of the memory.

If the Ibercivis programs are similar enough and use the same compiler, I'd expect whatever method they used to work well for you also, so you might as well ask them. They are using much shorter workunits for their subprojects I've enabled, though, so it's less noticeable if their workunits have to restart from the beginning.

I'd expect the difference between checkpointing working while BOINC is still running and not working after the machine is turned off and restarted to come from something that is not properly recorded in the checkpoint and restored afterwards, but is normally kept as part of the address space of the BOINC program. To help pin down just what it is, try suspending the workunit, then shutting down boinc.exe using boincmgr.exe, but not the whole computer. Then restart boinc.exe, also using boincmgr.exe and without a reboot, and see whether the workunit resumes properly.
Profile Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 2226 - Posted: 8 Feb 2010, 2:04:20 UTC - in response to Message 2207.  

Basically, I'm trying not to reinvent the wheel. The application has its own checkpointing options. You can read about those checkpoint options here.
http://www.psc.edu/general/software/packages/gromacs/online/mdrun.html

I know for a fact that we can continue mdrun from where we left off via the command line. Running it with the wrapper is not working the way we intend. I spent some time fixing a problem with the fraction-complete reporting. That is fixed, but it's still not restarting where we left off when I restart BOINC.
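
For reference, a sketch of the command-line restart logic that whatever launches mdrun would need to reproduce, written in Python. It assumes the GROMACS 4.x mdrun options `-s`, `-cpo` and `-cpi` and the default file names `topol.tpr` and `state.cpt`; check the page linked above for the version in use. The helper name `mdrun_args` is hypothetical.

```python
import os

def mdrun_args(tpr="topol.tpr", checkpoint="state.cpt"):
    """Build an mdrun argument list that continues from a checkpoint if one exists."""
    # Assumed GROMACS 4.x options: -s run input, -cpo checkpoint output.
    args = ["mdrun", "-s", tpr, "-cpo", checkpoint]
    if os.path.exists(checkpoint):
        # Assumed GROMACS 4.x option: -cpi names the checkpoint to continue from.
        args += ["-cpi", checkpoint]
    return args
```

Relative paths such as `state.cpt` are resolved against the working directory of the process that starts mdrun, so one thing worth checking is whether the checkpoint file is still present in that directory, and whether the continue option is still passed, after BOINC restarts.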
Profile Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 2227 - Posted: 8 Feb 2010, 2:08:26 UTC - in response to Message 2226.  

Checkpointing for AutoDock is another issue. It is less important because users will lose much less credit.
