long running tasks

Message boards : Number crunching : long running tasks

Paul D. Buck
Joined: 1 May 09
Posts: 10
Credit: 15,275
RAC: 0
Message 268 - Posted: 13 May 2009, 10:44:18 UTC

I have been getting some long-running tasks on my systems that seem to be prone to the "exiting, no finished file" problems of yore, which I have also seen on Hydrogen@Home ...

I sent Jord an e-mail with my data and included screen dumps and a code analysis ... for what it is worth, the first problem seems to be that the unzip process can get entangled and if you have contention this "hangs" BOINC ... then tasks don't get heartbeats and they fold up and die.

As near as I can tell we have a collision of TWO issues. The first is that the zip/unzip processes can collide and when they do nothing gets done until they untangle.

Secondly, there seems to ME to be a longstanding bug in the code that is called when tasks die, to the point that there is a possibility of garbling BOINC so badly that even after you clear the tasks you can be subject to follow-on crashes and other issues.

Jack, I don't think I have your e-mail, but mine is about as public as you can be (on the dev list if nothing else), and if Jord is abed and does not send you a copy of that junk ... buzz me and if I am here I will forward my nonsense to you ...

Not sure what else to do, so I am running the tasks and letting the ones that pass, pass, and the ones that fail fail on a system that is so slow I don't really care what it is doing ...

Pls advise if you need more from me ...

Ageless
Joined: 11 Apr 09
Posts: 172
Credit: 7,631
RAC: 0
Message 269 - Posted: 13 May 2009, 11:09:10 UTC - in response to Message 268.  

Jack, I don't think I have your e-mail, but mine is about as public as you can be (on the dev list if nothing else), and if Jord is abed and does not send you a copy of that junk ... buzz me and if I am here I will forward my nonsense to you ...

Copy sent.
Jord

'Cause you seem like an orchard of mines, Just take one step at a time.

Paul D. Buck
Joined: 1 May 09
Posts: 10
Credit: 15,275
RAC: 0
Message 270 - Posted: 13 May 2009, 11:14:23 UTC - in response to Message 269.  

Jack, I don't think I have your e-mail, but mine is about as public as you can be (on the dev list if nothing else), and if Jord is abed and does not send you a copy of that junk ... buzz me and if I am here I will forward my nonsense to you ...

Copy sent.

I thought you could not see ... :)

I don't see my mention of it, but I have ALSO seen this situation on Hydrogen ... so if you are doing the same things here that were done there ... well, Bob's your uncle ... :)

I am not saying that what the projects are doing is wrong, just that BOINC does not handle this class of issues well, with short tasks and zip contention and Paul attaching to the project ... and it seems to be Monday somehow ...

Ageless
Joined: 11 Apr 09
Posts: 172
Credit: 7,631
RAC: 0
Message 271 - Posted: 13 May 2009, 11:22:40 UTC - in response to Message 270.  

The present run of work is the shorties. The next batch of work should be a lot longer, but we'll try some in house testing before sending that out to everyone else. We wouldn't want to do another 400K run of shorties that break people's BOINC.

Now as to why they run for a long time on your system, other than BOINC breaking down and losing the heartbeat of the application (it's a two way system these days, I understand), they shouldn't be running for the half hour in your email. Unless you have a very slow computer... but I don't think there were P2 systems with 8 CPUs. ;-)

I thought you could not see ... :)

Took my medicine, it cleared up the sinuses a little. Less leakage, better vision. Not that that means I wanted to read an email the length of a chapter in a good book, though. It isn't even visible in the BOINC Dev email list archive as it went way over maximum length. ;-)
Jord

'Cause you seem like an orchard of mines, Just take one step at a time.

Paul D. Buck
Joined: 1 May 09
Posts: 10
Credit: 15,275
RAC: 0
Message 272 - Posted: 13 May 2009, 11:47:25 UTC - in response to Message 271.  

The present run of work is the shorties. The next batch of work should be a lot longer, but we'll try some in house testing before sending that out to everyone else. We wouldn't want to do another 400K run of shorties that break people's BOINC.

Now as to why they run for a long time on your system, other than BOINC breaking down and losing the heartbeat of the application (it's a two way system these days, I understand), they shouldn't be running for the half hour in your email. Unless you have a very slow computer... but I don't think there were P2 systems with 8 CPUs. ;-)

I thought you could not see ... :)

Took my medicine, it cleared up the sinuses a little. Less leakage, better vision. Not that that means I wanted to read an email the length of a chapter in a good book, though. It isn't even visible in the BOINC Dev email list archive as it went way over maximum length. ;-)

I've got one dual-core AMD that is about 5 years old running Linux. The long-running tasks seem to take about 10 minutes or so ... the system is named L1 for some strange reason ...

They are all done now, I think ... but you can see the results of the deaths in my account.

Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 274 - Posted: 13 May 2009, 11:56:29 UTC - in response to Message 268.  

I've experienced this issue sometimes with a PowerPC. I have not been able to repeat it reliably, so I am not certain of the root cause of this problem. I will examine your workunits and try to reveal the pattern. It's very likely I need to rebuild the wrappers. I forget exactly how to build them on Macs. I went through the instructions, but I'm not following the details on how to make the sample applications.

Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 275 - Posted: 13 May 2009, 11:57:54 UTC - in response to Message 274.  
Last modified: 13 May 2009, 11:58:36 UTC

Oh yes, you can reach me at jshultz at hydrogenathome.org; I'm on the boinc_projects list too.

Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 282 - Posted: 13 May 2009, 14:54:01 UTC - in response to Message 272.  

Funny, the Mac is the least of our problems here. Maybe I need to rebuild the Linux wrapper.

Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 283 - Posted: 13 May 2009, 14:56:49 UTC - in response to Message 282.  

I see, you have 64-bit Linux. Both you and Augustine have 64-bit Linux, and it looks like we're running into various problems on these platforms ... and I know I'm being vague. I should look through the error logs and get the range of these errors.

Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 285 - Posted: 13 May 2009, 15:35:28 UTC - in response to Message 283.  

I just went through the list of 64-bit platforms. For those kernels where there were problems, other users with the same kernel had success.

[SETI.USA]Tank_Master
Joined: 25 Apr 09
Posts: 21
Credit: 28,885
RAC: 0
Message 286 - Posted: 13 May 2009, 16:20:59 UTC

Do those erroring out have the 32-bit libraries installed?

Ageless
Joined: 11 Apr 09
Posts: 172
Credit: 7,631
RAC: 0
Message 288 - Posted: 13 May 2009, 16:40:01 UTC - in response to Message 286.  

My thought as well, but it seems there's a 64-bit application available...
Jord

'Cause you seem like an orchard of mines, Just take one step at a time.

Paul
Joined: 1 May 09
Posts: 1
Credit: 12,043
RAC: 0
Message 309 - Posted: 14 May 2009, 2:46:43 UTC
Last modified: 14 May 2009, 2:50:56 UTC

Just aborted 2 short WUs, 547727 and 547610. They seemed to be hanging at 79%.

Paul D. Buck
Joined: 1 May 09
Posts: 10
Credit: 15,275
RAC: 0
Message 318 - Posted: 14 May 2009, 5:53:09 UTC

If my suspicions are correct, Jack, the issue is not necessarily individual tasks. What I mean is that the tasks themselves may all be fine and will run to completion without error on other machines, and even on the same machine if re-issued. It is that under some conditions (and I think I have seen it under both Linux and OS-X) two or more tasks try to do the same thing, in this case spin up zip/unzip, and that spin-up causes BOINC to focus on the spin-up to the exclusion of other activities.

The sequence of images gives a hint of what I saw several times but was only able to capture a couple of times: a number of your tasks would be going through their stages, some contention would build, and other tasks would error out.

I thought I had found something in the end game processing but it appears that I was wrong about my understanding of the code. Not surprising, I am not very good with C...

Jack Shultz
Joined: 10 Apr 09
Posts: 503
Credit: 120,150
RAC: 0
Message 332 - Posted: 14 May 2009, 12:01:46 UTC - in response to Message 318.  

It would be great if we could figure out the pattern.

Paul D. Buck
Joined: 1 May 09
Posts: 10
Credit: 15,275
RAC: 0
Message 347 - Posted: 14 May 2009, 15:14:42 UTC - in response to Message 332.  

It would be great if we could figure out the pattern.

The only thing that I can point to is that at times, with the standard unzip, the system does not seem to like a lot of threads spun up at the same time ...

When I noticed a problem on OS-X, I suspended all tasks and then unsuspended them one at a time as they started to progress. Until I did that, basically they were running and accumulating time but not seeming to do anything. Almost no CPU used as well ...

I am discussing my nightmare of last night on alpha (or dev, or both) and I am not sure if the "spare" copies of BOINC came from the shells you launched or not ...

The problem is that I do not know for sure when or where those came up; they could have been from DD, or from the installer as I was trying to get BOINC running again after I stopped it for the other errors...

Paul D. Buck
Joined: 1 May 09
Posts: 10
Credit: 15,275
RAC: 0
Message 366 - Posted: 14 May 2009, 19:47:15 UTC

Huh! Was going to take a look at the linux side... reached daily goal of 100 ... Daggers!

[ESL Brigade] Ssofk
Joined: 1 May 09
Posts: 19
Credit: 13,577
RAC: 0
Message 367 - Posted: 14 May 2009, 19:56:26 UTC

OK, I had to abort 30 tasks, but I hope the rest will be done in 24 hours. I got 100 tasks at 40 min each on my dual-core, and that cannot be finished in 24 hours. The deadline was too short and I don't know how I got these tasks; maybe it comes from returning the short tasks before. So don't worry about it ...
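A quick back-of-the-envelope check of those numbers, as a small Python sketch. It assumes every task takes exactly 40 minutes and that both cores crunch this project's tasks around the clock, which is an idealisation; `hours_needed` is just an illustrative helper name.

```python
def hours_needed(tasks: int, minutes_per_task: float, cores: int) -> float:
    """Wall-clock hours to finish `tasks` when `cores` tasks run in parallel."""
    return tasks * minutes_per_task / cores / 60


# All 100 tasks on a dual-core: more than the 24-hour deadline allows.
all_tasks = hours_needed(100, 40, 2)

# After aborting 30, the remaining 70 fit just inside 24 hours.
after_abort = hours_needed(70, 40, 2)

print(f"100 tasks: {all_tasks:.1f} h, 70 tasks: {after_abort:.1f} h")
```

Under those assumptions the full batch needs about 33 hours, while the 70 remaining tasks need a little over 23, so aborting 30 is roughly the minimum that makes the deadline reachable.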

Paul D. Buck
Joined: 1 May 09
Posts: 10
Credit: 15,275
RAC: 0
Message 386 - Posted: 15 May 2009, 5:51:28 UTC

Not conclusively proven on Linux (yet) ... but I found tasks on multiple projects where there was the "no heartbeat" message. The thing is that these tasks were running at the time I was running DD tasks.

I have forwarded this information to the mailing list for the developers and Jack (here) so they can possibly correct the issue.

The issue is simple: in the case of DD, when a task completes, BOINC does clean-up, and while doing so it can become "preoccupied" with that clean-up. While it is preoccupied, the other running science applications lose connection with the BOINC client and suicide.
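The mechanism described here can be sketched as a tiny simulation. This is not BOINC code: `simulate_heartbeats` is a hypothetical helper, and the one-heartbeat-per-second rate and 30-second give-up timeout are placeholder values assumed for illustration only.

```python
from typing import Optional


def simulate_heartbeats(
    stall_start: int,
    stall_length: int,
    timeout: int = 30,   # assumed: seconds an app waits before giving up
    total: int = 120,    # length of the simulated run, in seconds
) -> Optional[int]:
    """Return the second at which the app gives up, or None if it survives.

    The simulated client sends one heartbeat per second, except while it is
    "preoccupied" (stalled) with clean-up for another task.
    """
    last_heartbeat = 0
    for now in range(1, total + 1):
        client_busy = stall_start <= now < stall_start + stall_length
        if not client_busy:
            last_heartbeat = now  # heartbeat delivered this second
        if now - last_heartbeat > timeout:
            return now  # app sees no heartbeat for too long and exits
    return None


# A short clean-up pause is harmless; a long one kills the waiting app.
print(simulate_heartbeats(stall_start=10, stall_length=5))
print(simulate_heartbeats(stall_start=10, stall_length=45))
```

The point of the sketch is only that the application itself can be perfectly healthy: whether it survives depends entirely on how long the client stays silent relative to the timeout.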

At the moment, I have only proved this conclusively (to my personal satisfaction if no one else's) on OS-X. I have at least one instance where I saw one restart on a Linux system (a very slow system, so I would not expect many there). I have not yet proved it on Windows (which uses different methods), and with no work that may have to wait.

The only reason that this can be a significant issue is that there is a top limit on restarts and if you are running too many DD tasks you might see other tasks fold up ...

Have a nice day ...