Posts by Paul D. Buck

1) Message boards : Number crunching : Friendly fire death? (Message 424)
Posted 17 May 2009 by Profile Paul D. Buck
Post:
More on the dev list ...

Groans...

What, you don't like I found a problem?

Proved it not only beyond a reasonable doubt, but even beyond the doubting Thomas types of the UCB cabal that maintains that there are no flaws in the BOINC Client code?
2) Message boards : Number crunching : Friendly fire death? (Message 400)
Posted 15 May 2009 by Profile Paul D. Buck
Post:
After I posted I got another one. The initial indication was the restarts, though it does look more like a bad task now. The problem is/was that I was seeing a lot of other tasks being heartbeat hit ...

More on the dev list ...
3) Message boards : Number crunching : Friendly fire death? (Message 387)
Posted 15 May 2009 by Profile Paul D. Buck
Post:
This task looks like it was hit with the heartbeat death problem I have been reporting on the alpha forums. You have to love the irony ...

It looks like it continually started and restarted ... not for sure, there may be some other issue ... but ... it is possible ... :)
4) Message boards : Number crunching : long running tasks (Message 386)
Posted 15 May 2009 by Profile Paul D. Buck
Post:
Not conclusively proven on Linux (yet) ... but I found tasks on multiple projects where there was the "no heartbeat" message. The thing is that these tasks were running at the time I was running DD tasks.

I have forwarded this information to the mailing list for the developers and Jack (here) so they can possibly correct the issue.

The issue is simple. In the case of DD, when a task completes, BOINC does clean-up, and while doing so BOINC can become "preoccupied" with that clean-up. While it is preoccupied, the other running Science Applications lose connection with the BOINC Client and suicide.
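
A toy model of the failure described above may help. This is only a sketch: the tick counts, the `timeout` value, and the function name `simulate_heartbeat` are assumptions for illustration, not BOINC's real constants or code. The client sends a heartbeat every tick unless it is busy with clean-up, and the application gives up once it has gone more than `timeout` ticks without one.

```python
def simulate_heartbeat(stall_start, stall_length, timeout, total_ticks):
    """Return the tick at which the application gives up, or None if it survives.

    Hypothetical model: all values are abstract ticks, not BOINC's real timings.
    """
    last_heartbeat = 0  # tick of the most recent heartbeat the app received
    for tick in range(1, total_ticks + 1):
        # The client is "preoccupied" with clean-up during the stall window.
        client_busy = stall_start <= tick < stall_start + stall_length
        if not client_busy:
            last_heartbeat = tick  # client had time to send a heartbeat
        if tick - last_heartbeat > timeout:
            return tick  # app reports "no heartbeat" and exits
    return None


# A short clean-up stall is harmless; a long one kills the application.
short_stall = simulate_heartbeat(stall_start=10, stall_length=5, timeout=30, total_ticks=100)
long_stall = simulate_heartbeat(stall_start=10, stall_length=40, timeout=30, total_ticks=100)
print(short_stall, long_stall)
```

In this model the application itself is healthy in both runs; only the length of the client's stall decides whether it survives.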

At the moment, I have only proved this conclusively (to my personal satisfaction if no one else's) on OS-X. I have at least one instance where I saw one restart on a Linux system (a very slow system, so I would not expect many there). I have not yet proved it on Windows (which uses different methods), and with no work that may have to wait.

The only reason that this can be a significant issue is that there is a top limit on restarts and if you are running too many DD tasks you might see other tasks fold up ...
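
The effect of a restart limit can be sketched as well. The limit value and every name below are hypothetical; the post only says that a top limit exists, not what it is or how the client counts toward it.

```python
def run_with_restart_limit(attempt_outcomes, max_restarts):
    """Walk through a task's run attempts and report how it ends.

    attempt_outcomes: one entry per attempt, either "finished" or "no heartbeat".
    max_restarts: hypothetical cap on restarts before the task is given up on.
    """
    restarts = 0
    for outcome in attempt_outcomes:
        if outcome == "finished":
            return ("finished", restarts)
        restarts += 1  # the attempt died and the task has to be restarted
        if restarts > max_restarts:
            return ("abandoned", restarts)
    return ("still running", restarts)


# One heartbeat death is survivable; repeated ones use up the allowance.
one_hit = run_with_restart_limit(["no heartbeat", "finished"], max_restarts=3)
many_hits = run_with_restart_limit(["no heartbeat"] * 5, max_restarts=3)
print(one_hit, many_hits)
```

A task that would have finished on its own can still end up abandoned if it is hit often enough, which is why running many DD tasks at once matters.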

Have a nice day ...
5) Message boards : Number crunching : long running tasks (Message 366)
Posted 14 May 2009 by Profile Paul D. Buck
Post:
Huh! Was going to take a look at the Linux side ... reached daily goal of 100 ... Daggers!
6) Message boards : Number crunching : long running tasks (Message 347)
Posted 14 May 2009 by Profile Paul D. Buck
Post:
It would be great if we could figure out the pattern.

The only thing that I can point to is that at times, with the standard unzip, the system does not seem to like a lot of threads spun up at the same time ...

When I noticed a problem on OS-X, I suspended all tasks and then unsuspended them one at a time as they started to progress. Until I did that, basically they were running and accumulating time but not seeming to do anything. Almost no CPU used as well ...

I am discussing my nightmare of last night on alpha (or dev, or both) and I am not sure if the "spare" copies of BOINC came from the shells you launched or not ...

The problem is that I do not know for sure when or where those came up. They could have been from DD, or they could have been from the installer as I was trying to get BOINC back running after I stopped it for the other errors ...
7) Message boards : Number crunching : long running tasks (Message 318)
Posted 14 May 2009 by Profile Paul D. Buck
Post:
If my suspicions are correct, Jack, the issue is not necessarily individual tasks. What I mean is that the tasks themselves may all be fine and will run to completion without error on other machines, and even on the same machine if re-issued. It is that under some conditions (and I think I have seen it under both Linux and OS-X) two or more tasks try to do the same thing, in this case spin up zip/unzip, and that spin-up causes BOINC to focus on the spin-up to the exclusion of other activities.

In the sequence of images you can get a hint of what I saw several times but was only able to catch a couple of times: a number of your tasks would be going through their stages, some contention would build, and other tasks would error out.

I thought I had found something in the end game processing but it appears that I was wrong about my understanding of the code. Not surprising, I am not very good with C...
8) Message boards : Number crunching : long running tasks (Message 272)
Posted 13 May 2009 by Profile Paul D. Buck
Post:
The present run of work is the shorties. The next batch of work should be a lot longer, but we'll try some in house testing before sending that out to everyone else. We wouldn't want to do another 400K run of shorties that break people's BOINC.

Now as to why they run for a long time on your system, other than BOINC breaking down and losing the heartbeat of the application (it's a two way system these days, I understand), they shouldn't be running for the half hour in your email. Unless you have a very slow computer... but I don't think there were P2 systems with 8 CPUs. ;-)

I thought you could not see ... :)

Took my medicine, it cleared up the sinuses a little. Less leakage, better vision. Not that that means I wanted to read an email the length of a chapter in a good book, though. It isn't even visible in the BOINC Dev email list archive as it went way over maximum length. ;-)

I got one dual core AMD that is about 5 years old running Linux. The long running tasks seem to take about 10 minutes or so ... the system is named L1 for some strange reason ...

they are all done now I think... but you can see the results of the deaths in my account
9) Message boards : Number crunching : long running tasks (Message 270)
Posted 13 May 2009 by Profile Paul D. Buck
Post:
Jack, I don't think I have your e-mail, but mine is about as public as you can be (on the dev list if nothing else), and if Jord is abed and does not send you a copy of that junk ... buzz me and if I am here I will forward my nonsense to you ...

Copy sent.

I thought you could not see ... :)

I don't see my mention, but I have seen this situation ALSO on Hydrogen ... so if you are doing the same things here that were done there ... well, Bob's your uncle ... :)

I am not saying that what the projects are doing is wrong, just that BOINC does not handle this class of issues well, with short tasks and zip contention and Paul attaching to the project ... and it seems to be Monday somehow ...
10) Message boards : Number crunching : long running tasks (Message 268)
Posted 13 May 2009 by Profile Paul D. Buck
Post:
I have been getting some long running tasks on my systems that seem to be prone to the exiting no finished file problems of yore, including Hydrogen@Home ...

I sent Jord an e-mail with my data and included screen dumps and a code analysis ... for what it is worth, the first problem seems to be that the unzip process can get entangled and if you have contention this "hangs" BOINC ... then tasks don't get heartbeats and they fold up and die.

As near as I can tell we have a collision of TWO issues. The first is that the zip/unzip processes can collide and when they do nothing gets done until they untangle.

Secondly, there seems to ME to be a longstanding bug in the code that is called in the cases where tasks die, to the point that there is the possibility that you garble up BOINC to the extent that even after you clear the tasks you can be subject to follow-on crashes and other issues.

Jack, I don't think I have your e-mail, but mine is about as public as you can be (on the dev list if nothing else), and if Jord is abed and does not send you a copy of that junk ... buzz me and if I am here I will forward my nonsense to you ...

Not sure what else to do, so I am running the tasks and letting the ones that pass, pass, and the ones that fail fail on a system that is so slow I don't really care what it is doing ...

Pls advise if you need more from me ...


