Folding@Home problem

August 20, 2005 12:06:07 PM

At least 10 WUs have crashed within the last 16 hours. Here's part of the log file.

<pre>[06:47:00] - Connecting to assignment server
[06:47:03] - Successful: assigned to (171.65.103.156).
[06:47:03] + News From Folding@Home: Welcome to Folding@Home
[06:47:03] Loaded queue successfully.
[06:59:18] + Closed connections
[06:59:18]
[06:59:18] + Processing work unit
[06:59:18] Core required: FahCore_78.exe
[06:59:18] Core found.
[06:59:18] Working on Unit 08 [August 20 06:59:18]
[06:59:18] + Working ...
[06:59:18]
[06:59:18] *------------------------------*
[06:59:18] Folding@Home Gromacs Core
[06:59:18] Version 1.80 (March 16, 2005)
[06:59:18]
[06:59:18] Preparing to commence simulation
[06:59:18] - Looking at optimizations...
[06:59:18] - Created dyn
[06:59:18] - Files status OK
[06:59:26] - Expanded 2966937 -> 16166417 (decompressed 544.8 percent)
[06:59:27] - Starting from initial work packet
[06:59:27]
[06:59:27] Project: 1140 (Run 98, Clone 14, Gen 17)
[06:59:27]
[06:59:27] Assembly optimizations on if available.
[06:59:27] Entering M.D.
[06:59:36] Protein: p1140_RIBO_FSpeptide_EXT_nospring
[06:59:36]
[06:59:36] Writing local files
[06:59:44] Extra SSE boost OK.
[06:59:46] Writing local files
[06:59:46] Completed 0 out of 250000 steps (0)
[07:55:10] Writing local files
[07:55:10] Completed 2500 out of 250000 steps (1)
[08:45:02] Writing local files
[08:45:02] Completed 5000 out of 250000 steps (2)
[09:16:09] Quit 101 - Fatal error: NaN detected: (ener[12])
[09:16:09]
[09:16:09] Simulation instability has been encountered. The run has entered a
[09:16:09] state from which no further progress can be made.
[09:16:09] This may be the correct result of the simulation, however if you
[09:16:09] often see other project units terminating early like this
[09:16:09] too, you may wish to check the stability of your computer (issues
[09:16:09] such as high temperature, overclocking, etc.).
[09:16:09] Going to send back what have done.
[09:16:09] logfile size: 8012
[09:16:09] - Writing 8575 bytes of core data to disk...
[09:16:09] ... Done.
[09:16:09]
[09:16:09] Folding@home Core Shutdown: EARLY_UNIT_END
[09:16:13] CoreStatus = 72 (114)
[09:16:13] Sending work to server


[09:16:13] + Attempting to send results
[09:16:19] + Results successfully sent
[09:16:19] Thank you for your contribution to Folding@Home.


[09:16:23] + Attempting to send results
[09:16:36] + Results successfully sent
[09:16:36] Thank you for your contribution to Folding@Home.
[09:16:36] - Preparing to get new work unit...
[09:16:36] + Attempting to get work packet
[09:16:36] - Connecting to assignment server
[09:16:44] - Successful: assigned to (171.65.103.158).
[09:16:44] + News From Folding@Home: Welcome to Folding@Home
[09:16:44] Loaded queue successfully.
[09:18:41] + Closed connections
[09:18:46]
[09:18:46] + Processing work unit
[09:18:46] Core required: FahCore_82.exe
[09:18:46] Core found.
[09:18:46] Working on Unit 09 [August 20 09:18:46]
[09:18:46] + Working ...
[09:18:47]
[09:18:47] *------------------------------*
[09:18:47] Folding@Home PMD Core
[09:18:47] Version 1.01 (Oct 15, 2004)
[09:18:47]
[09:18:47] Preparing to commence simulation
[09:18:47] - Looking at optimizations...
[09:18:47] - Created dyn
[09:18:47] - Files status OK
[09:18:47] - Expanded 82038 -> 558743 (decompressed 681.0 percent)
[09:18:47]
[09:18:47] Project: 1805 (Run 12, Clone 756, Gen 8)
[09:18:47]
[09:18:47] Assembly optimizations on if available.
[09:18:47] Entering M.D.
[09:18:53] Protein: p1805_Collagen_POG10_refolding_gamma
[09:18:53]
[09:18:53] Completed 0 out of 500000 steps (0)
[09:33:51] Writing checkpoint files
[09:42:51] Writing local files
[09:42:51] Completed 5000 out of 500000 steps (1)
[09:48:52] Writing checkpoint files
[10:01:47] Writing local files
[10:01:47] Completed 10000 out of 500000 steps (2)
[10:03:53] Writing checkpoint files
[10:18:55] Writing checkpoint files
[10:19:16] Writing local files
[10:19:16] Completed 15000 out of 500000 steps (3)
[10:33:55] Writing checkpoint files
[10:36:18] Writing local files
[10:36:18] Completed 20000 out of 500000 steps (4)
[10:40:36] NaN/Inf detected e[0]
[10:40:36] Going to send back what have done.
[10:40:36] logfile size: 5120
[10:40:36] - Writing 5640 bytes of core data to disk...
[10:40:36] ... Done.
[10:40:36]
[10:40:36] Folding@home Core Shutdown: EARLY_UNIT_END
[10:40:39] CoreStatus = 72 (114)
[10:40:39] Sending work to server


[10:40:39] + Attempting to send results
[10:40:47] + Results successfully sent
[10:40:47] Thank you for your contribution to Folding@Home.
[10:40:51] - Preparing to get new work unit...
[10:40:51] + Attempting to get work packet
[10:40:51] - Connecting to assignment server
[10:41:07] + Could not connect to Assignment Server
[10:41:20] - Successful: assigned to (171.65.103.158).
[10:41:20] + News From Folding@Home: Welcome to Folding@Home
[10:41:20] Loaded queue successfully.
[10:42:12] + Closed connections
[10:42:17]
[10:42:17] + Processing work unit
[10:42:17] Core required: FahCore_82.exe
[10:42:17] Core found.
[10:42:17] Working on Unit 00 [August 20 10:42:17]
[10:42:17] + Working ...
[10:42:17]
[10:42:17] *------------------------------*
[10:42:17] Folding@Home PMD Core
[10:42:17] Version 1.01 (Oct 15, 2004)
[10:42:17]
[10:42:17] Preparing to commence simulation
[10:42:17] - Looking at optimizations...
[10:42:17] - Created dyn
[10:42:17] - Files status OK
[10:42:18] - Expanded 81057 -> 557800 (decompressed 688.1 percent)
[10:42:18]
[10:42:18] Project: 1807 (Run 14, Clone 41, Gen 0)
[10:42:18]
[10:42:18] Assembly optimizations on if available.
[10:42:18] Entering M.D.
[10:42:26] Protein: p1807_Collagen_POG10new_restrained_unfolding
[10:42:26]
[10:42:26] Completed 0 out of 500000 steps (0)
[10:44:44] Writing local files
[10:44:44] Completed 5000 out of 500000 steps (1)
[10:46:59] Writing local files
[10:46:59] Completed 10000 out of 500000 steps (2)
[10:49:10] Writing local files
[10:49:10] Completed 15000 out of 500000 steps (3)
[10:51:32] Writing local files
[10:51:32] Completed 20000 out of 500000 steps (4)
[10:53:45] Writing local files
[10:53:45] Completed 25000 out of 500000 steps (5)
[10:56:05] Writing local files
[10:56:05] Completed 30000 out of 500000 steps (6)
[10:57:18] Writing checkpoint files
[10:57:24] NaN/Inf detected e[0]
[10:57:24] Going to send back what have done.
[10:57:24] logfile size: 13312
[10:57:24] - Writing 13832 bytes of core data to disk...
[10:57:24] ... Done.
[10:57:24]
[10:57:24] Folding@home Core Shutdown: EARLY_UNIT_END
[10:57:27] CoreStatus = 72 (114)
[10:57:27] Sending work to server


[10:57:27] + Attempting to send results
[10:57:39] + Results successfully sent
[10:57:39] Thank you for your contribution to Folding@Home.
[10:57:43] - Preparing to get new work unit...
[10:57:43] + Attempting to get work packet
[10:57:43] - Connecting to assignment server
[10:57:49] - Successful: assigned to (171.65.103.156).
[10:57:49] + News From Folding@Home: Welcome to Folding@Home
[10:57:49] Loaded queue successfully.
[11:25:56] + Closed connections
[11:26:01]
[11:26:01] + Processing work unit
[11:26:01] Core required: FahCore_78.exe
[11:26:01] Core found.
[11:26:01] Working on Unit 01 [August 20 11:26:01]
[11:26:01] + Working ...
[11:26:02]
[11:26:02] *------------------------------*
[11:26:02] Folding@Home Gromacs Core
[11:26:02] Version 1.80 (March 16, 2005)
[11:26:02]
[11:26:02] Preparing to commence simulation
[11:26:02] - Looking at optimizations...
[11:26:02] - Created dyn
[11:26:02] - Files status OK
[11:26:10] - Expanded 3035945 -> 16546233 (decompressed 545.0 percent)
[11:26:10] - Starting from initial work packet
[11:26:10]
[11:26:10] Project: 1144 (Run 115, Clone 12, Gen 5)
[11:26:10]
[11:26:11] Assembly optimizations on if available.
[11:26:11] Entering M.D.
[11:26:19] Protein: p1144_RIBO_nopeptide
[11:26:19]
[11:26:20] Writing local files
[11:26:27] Extra SSE boost OK.
[11:26:28] Writing local files
[11:26:29] Completed 0 out of 250000 steps (0)
</pre><p>
My CPU has always been undervolted (1.55v, down from the default 1.65v), and my PC2700 CL2.5 DDR has been running at 2.0-2-2-5 @ PC2100. CPU temp is around 50-52C.
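
A long log like the one above is easier to judge once the early ends are tallied per work unit. The following is a minimal Python sketch, not part of the F@H client: it assumes the log lines look exactly like the excerpt above, and the names `summarize_units`, `early_ends`, and `SAMPLE` are made up for illustration.

```python
import re

# Patterns based only on the line shapes visible in the log excerpt above.
PROJECT_RE = re.compile(r"Project: (\d+) \(Run (\d+), Clone (\d+), Gen (\d+)\)")
STEPS_RE = re.compile(r"Completed (\d+) out of (\d+) steps")
SHUTDOWN_RE = re.compile(r"Folding@home Core Shutdown: (\w+)")

# A few lines copied from the log above, used as a built-in demo input.
SAMPLE = [
    "[06:59:27] Project: 1140 (Run 98, Clone 14, Gen 17)",
    "[08:45:02] Completed 5000 out of 250000 steps (2)",
    "[09:16:09] Folding@home Core Shutdown: EARLY_UNIT_END",
    "[11:26:10] Project: 1144 (Run 115, Clone 12, Gen 5)",
    "[11:26:29] Completed 0 out of 250000 steps (0)",
]


def summarize_units(lines):
    """Return one dict per work unit: ids, last step seen, shutdown status."""
    units = []
    current = None
    for line in lines:
        m = PROJECT_RE.search(line)
        if m:
            # A "Project:" line marks the start of a new work unit.
            current = {
                "project": int(m.group(1)),
                "run": int(m.group(2)),
                "clone": int(m.group(3)),
                "gen": int(m.group(4)),
                "last_step": 0,
                "total_steps": None,
                "shutdown": None,  # stays None while the unit is still running
            }
            units.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first "Project:" line
        m = STEPS_RE.search(line)
        if m:
            current["last_step"] = int(m.group(1))
            current["total_steps"] = int(m.group(2))
            continue
        m = SHUTDOWN_RE.search(line)
        if m:
            current["shutdown"] = m.group(1)
    return units


def early_ends(units):
    """Keep only the units whose core shut down with EARLY_UNIT_END."""
    return [u for u in units if u["shutdown"] == "EARLY_UNIT_END"]


if __name__ == "__main__":
    for unit in early_ends(summarize_units(SAMPLE)):
        print(unit["project"], unit["run"], unit["clone"], unit["gen"],
              unit["last_step"], "of", unit["total_steps"])
```

To run it against a real log, read the file's lines and pass them to `summarize_units`; the file name and location depend on the client install.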

------------
<font color=orange><b><A HREF="http://www.mozilla.org/products/firefox" target="_new">Rediscover the web</A></b></font color=orange>

August 20, 2005 5:03:39 PM

The last time that I added up the uncompleted WUs that I have lost with no credit it was around 5000 points. I hardly ever fold anymore because of this.

ASUS P5WD2 Premium
Intel 3.73 EE @ 5.6Ghz
XMS2 DDR2 @ 1180Mhz

<A HREF="http://valid.x86-secret.com/records.php?PHPSESSID=792e8..." target="_new">#2 CPUZ</A>
SuperPI 25secs
August 21, 2005 6:07:40 AM

Time to run memtest86, and check your temps and voltages with mbm5. Let us know the results.
August 21, 2005 2:41:41 PM

I haven't tried memtest86 yet, but increasing the voltage to 1.6v (still lower than default 1.65v) didn't solve the problem. Now I'm back to 1.55v and trying 2.0-3-3-6 with RAM. If this doesn't work, then I'll run memtest86.

My voltages are fine, or at least not abnormal compared to the 100% stable days. Although F@H keeps generating errors, everything else still works fine.

I'm really worried about this problem. It's threatening to shatter all of my landmark dreams. So far I've managed 84 points from 20 broken WUs (4 among these 20 were 600 pointers :frown: )

August 21, 2005 2:45:32 PM

This kind of overclocked system is highly unlikely to be stable enough for F@H :wink:

August 21, 2005 9:52:47 PM

F@H WUs crashed again, even after using RAM at 2.0-3-3-6.

I went back to 2.0-2-2-5 and ran memtest86. It passed 3 times without errors before I stopped it.

Edit: For memtest86, I used CPU at default 1.65v

August 22, 2005 7:11:12 AM

3 of my systems have 10% or better OCs, the other three are stock. I'll get one bad core a month. Why won't it play nice for you?
August 22, 2005 7:16:17 AM

When was the last time you shut down? If I don't shut down once a week, some of my systems go a little funny.
Your mobo is getting a little long in the tooth; maybe it's time to give it a good clean and a look over. Check the northbridge HSF well.
How old is your XP install?
August 22, 2005 1:17:49 PM

After setting the voltage to the default 1.65v, the problem seems to be gone. As I said before, it passed memtest three times, and now it has finished a WU without crashing.

FYI, my WinXP installation is 8 months old and the case is quite dust-free (I cleaned it a couple of weeks ago). The northbridge HSF is working fine. I can usually never run the PC for more than 48 hours, not because of stability issues but because of an unstable internet connection and load shedding.

Has my mobo become not good enough to handle an undervolted CPU anymore? The problem with 1.65v is that my CPU reaches up to 56-57C while folding (though it comes down to 43C when idle).

My PSU probably doesn't like the CPU at 1.65v under full load: +12v swings between 11.80v and 11.74v. But when the CPU is idle (at 1.65v) or under full load (at 1.55v), +12v usually reads 11.90v to 11.94v.
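
To put the +12v swing in perspective, the readings can be expressed as a percentage off the nominal 12.0v. This is a rough Python sketch: the 5% allowance is an assumption taken from the commonly cited ATX tolerance for the +12v rail, the function names are made up for illustration, and motherboard sensor readings are themselves only approximate.

```python
NOMINAL_12V = 12.0
TOLERANCE = 0.05  # assumption: +/-5% allowance on the +12v rail


def rail_deviation(measured, nominal=NOMINAL_12V):
    """Fractional deviation of a measured rail voltage from its nominal value."""
    return (measured - nominal) / nominal


def within_tolerance(measured, nominal=NOMINAL_12V, tolerance=TOLERANCE):
    """True if the reading is inside the assumed tolerance band."""
    return abs(rail_deviation(measured, nominal)) <= tolerance


if __name__ == "__main__":
    # The four +12v readings quoted in the post above.
    readings = {
        "1.65v, full load (high)": 11.80,
        "1.65v, full load (low)": 11.74,
        "idle or 1.55v load (low)": 11.90,
        "idle or 1.55v load (high)": 11.94,
    }
    for label, volts in readings.items():
        print(f"{label}: {volts:.2f}v "
              f"({rail_deviation(volts):+.2%}, ok={within_tolerance(volts)})")
```

Under that assumption all four readings are inside the allowance (the lowest, 11.74v, is about 2.2% under nominal), so the sag on its own would not prove a PSU fault.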

August 22, 2005 2:34:27 PM

I've noticed that every so often F@H crashes on my system too. (Though more like one crash every two months with the PC running 24/7.) But my system is a completely stock P4C 2.6 with a sturdy i865 mobo, good CAS2 DDR400 at reasonable (ie. not as tight as can be) timings and slightly volted up for ensured stability, no PAT (I could enable it if I wanted, I just don't want), good voltages from a reliable PSU, good airflow, and good cooling. I even load WHQL-certified drivers over non-certified whenever I have the choice. It's got an intake filter. I clean dust out with compressed air fairly regularly. Heck, it's even got a UPS. (More for voltage regulation than power outages.) And it's on RAID1 in case of a hard drive failure.

So the only thing that I <i>didn't</i> do to make it the most stable system on earth was give it registered/ECC RAM and maybe give it ten foot thick concrete, water, and lead shielding. **ROFL**

And yet F@H crashes every so often, even though no other apps give me any trouble. Memtests run fine. Prime95 runs fine. Sandra runs fine. Etc. So frankly, I think at least some of the problems people see with F@H crashing come from the work units themselves. I mean, heaven forbid suggesting that the code actually contains a bug. :o 

I mean sure, obviously a PC tweaked enough may stress out under the extreme load of constantly running F@H 24/7. So if there's any fault at all in your system, running F@H seriously is probably going to bring it out fairly quickly. But when even a majorly stable system can crash from F@H (but nothing else) it begins to suggest that F@H itself may be the culprit, at least from time to time.

I love F@H, but if you look at, for example, the oddities of the Windows GUI version of F@H closely enough, you'll begin to wonder about the sanity (if not skill) of some of the developers. Besides, no one is perfect. We all make mistakes.

:evil:  یί∫υєг ρђœŋίχ :evil: 
<font color=red><i>The Devil himself is good, when he is pleased.</i></font color=red>
@ 195K of 200K!
August 22, 2005 5:01:11 PM

F@H GUI or core wasn't crashing, the WUs were crashing!

My system was undervolted and running with aggressive memory timings, so I can't complain much about the problem. My PSU and cooling are at best just good enough. It's been back in a stable state since I stopped undervolting. But until now it had been F@H-stable (except for the odd WU crash, like in your case) with the same settings and the same power and cooling equipment. Why would it suddenly start to dislike undervolting?

August 22, 2005 7:10:52 PM

Maybe that's the wear and tear that comes with running F@H constantly. I'm not much of a hardware guy myself; all of my systems run at stock speeds, and of course I see a work unit crash from time to time. (I think most of the time it's the WU that crashes, but sometimes it's the app processing the work units. I have more trust in the console version: I have both versions running on various systems, and the console version not only keeps the logs smaller but also gives me the impression of being more stable, judged over a longer period [i.e. months].)
The wear and tear on hardware running constantly at full load is easiest to see with laptops, in my opinion: the fan gets louder (probably dust, but tricky to clean!). So why might there not also be other symptoms, such as a mobo needing more voltage? Cheers
August 22, 2005 7:15:20 PM

Quote:
F@H GUI or core wasn't crashing, the WUs were crashing!

I know. I'm saying that if the same developer(s) that made the F@H GUI program also make WU exes then it's no wonder that some WUs crash! **ROFL**

Quote:
Why would it suddenly start to dislike undervolting?

That is weird, but probably just has to do with parts aging or a change in ambient temps?

August 22, 2005 8:36:52 PM

You mean, we download "fahcore_78.exe" or "fahcore_65.exe" every time we get a WU?

Anyway, this has been a very hot summer. System temp has been hitting 45 almost for the whole summer, maybe that had some effects on my mobo. It has gone through 2½ summers so far.

August 22, 2005 10:39:03 PM

You only download the core files when the client program does not find any. It may even be wise to delete the core files from time to time, as that forces the client to download what is probably a newer version of the core.
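
Before deleting anything, it can help to see which core files the client currently has and how old they are. Here is a minimal Python sketch, assuming the cores sit in the client's folder under names like the `FahCore_78.exe` shown in the log; the folder path is a placeholder, and `list_core_files` is a hypothetical helper, not an F@H tool.

```python
from datetime import datetime
from pathlib import Path


def list_core_files(client_dir):
    """Return (name, size_bytes, modified) for each FahCore_*.exe in client_dir."""
    results = []
    for path in sorted(Path(client_dir).glob("FahCore_*.exe")):
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime)
        results.append((path.name, stat.st_size, modified))
    return results


if __name__ == "__main__":
    # "." is a placeholder: point this at the folder the client runs from.
    for name, size, modified in list_core_files("."):
        print(f"{name}  {size} bytes  last modified {modified:%Y-%m-%d %H:%M}")
```

The sketch only lists files. Removing one by hand, with the client stopped, is what should make the client fetch that core again the next time it needs it, as described above.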
August 22, 2005 11:38:09 PM

They do a lot of beta testing before release, but they also make changes on the fly. It's probably those changes that cause the odd crash.
XP is the best OS M$ has ever put out, but having said that, it still needs rebooting and/or reloading from time to time.
August 22, 2005 11:41:39 PM

The filter circuit for v-core on the NF7 boards is their weakest link. It's what makes them undervolt so often. If you can get a little more air on the coils, it may help.
August 23, 2005 12:16:35 AM

Yeah, the NF7 has always undervolted CPUs by 0.2-0.3v, and it hasn't recently started undervolting by more than that. But I was intentionally undervolting, and everything used to be fine with my undervolting plus the board's automatic undervolting.

August 23, 2005 1:37:38 AM

I'm at work now, so I can't check my manual.
Can't you raise the northbridge voltage on your board? That may be worth a try.
August 23, 2005 11:33:21 AM

How can northbridge overvolting help me to undervolt my CPU? I'm running FSB at only 133 MHz and my RAM is doing fine at 2.0-2-2-5

August 23, 2005 2:01:24 PM

Quote:
System temp has been hitting 45 almost for the whole summer, maybe that had some effects on my mobo.

Or on your CPU. Maybe your CPU is just getting slightly too warm when maxed out to remain stable when undervolted. When the summer temps simmer back down try undervolting again. With luck it could just be a temporary condition. :) 

And thinking of undervolting, I've really got to see what I can manage to do for my P4C. I've done everything else I could to make it more quiet. I don't know why I haven't undervolted it yet. :o 

:evil:  یί∫υєг ρђœŋίχ :evil: 
<font color=red><i>The Devil made me do it, but I <b>liked</b> it.</i></font color=red>
@ 196K of 200K!<P ID="edit"><FONT SIZE=-1><EM>Edited by slvr_phoenix on 08/23/05 09:03 AM.</EM></FONT></P>
August 23, 2005 8:15:14 PM

Oh no, it's still not stable enough to finish 600 pointers!

<pre>[18:16:20] *------------------------------*
[18:16:20] Folding@Home Gromacs Core
[18:16:20] Version 1.80 (March 16, 2005)
[18:16:20]
[18:16:20] Preparing to commence simulation
[18:16:20] - Looking at optimizations...
[18:16:23] - Files status OK
[18:16:33] - Expanded 2968802 -> 16166417 (decompressed 544.5 percent)
[18:16:34]
[18:16:34] Project: 1140 (Run 86, Clone 98, Gen 14)
[18:16:34]
[18:16:35] Assembly optimizations on if available.
[18:16:35] Entering M.D.
[18:16:59] (Starting from checkpoint)
[18:16:59] Protein: p1140_RIBO_FSpeptide_EXT_nospring
[18:16:59]
[18:16:59] Writing local files
[18:16:59] Completed 75000 out of 250000 steps (30)
[18:17:07] Extra SSE boost OK.
[19:08:51] Writing local files
[19:08:52] Completed 77500 out of 250000 steps (31)
[19:35:38] Quit 101 - Fatal error: NaN detected: (ener[12])
[19:35:38]
[19:35:38] Simulation instability has been encountered. The run has entered a
[19:35:38] state from which no further progress can be made.
[19:35:38] This may be the correct result of the simulation, however if you
[19:35:38] often see other project units terminating early like this
[19:35:38] too, you may wish to check the stability of your computer (issues
[19:35:38] such as high temperature, overclocking, etc.).
[19:35:38] Going to send back what have done.
[19:35:38] logfile size: 31970
[19:35:38] - Writing 32533 bytes of core data to disk...
[19:35:38] ... Done.
[19:35:38]
[19:35:38] Folding@home Core Shutdown: EARLY_UNIT_END
[19:35:41] CoreStatus = 72 (114)
[19:35:41] Sending work to server</pre><p>
I've got another 600 pointer. I'll wait and see what happens to this one.

August 24, 2005 1:00:37 AM

Yep, I had issues with a certain kind of molecule. The overclock was a little too tight. Fortunately, I didn't get many of those.

Hi everyone. Glad you've managed to keep contributions flowing in. I'll be able to pitch in a little now that the summer heat is dying down. On our way to the top 100!!! Yeah Team!!!

The loving are the daring!
August 24, 2005 2:32:07 AM

WOOTZ! Good to see you again Flinx!

__________________________________________________
<font color=red>You're a boil on the arse of progress - don't make me squeeze you!</font color=red>