Inexplicable problem on one-year-old rig

karvala

Distinguished
Sep 19, 2009
11
0
18,510
Hi all,

I'm not sure if this is the right forum for this; I can't see a better one but a mod can move it if it's more appropriate somewhere else. I posted this problem on experts exchange as well about a week ago, and it seems to have baffled a couple of people there as well, so I'm hoping you guys can provide more insight. Apologies for the extreme length of this, but the problem is so weird, and I've done a comprehensive set of investigations so far, and I know you guys will want as much info as possible, so here it goes. I'll first of all post my original core system specs so you know what we're talking about, then I'll give you the long history bit of the problem during the last week, but if you want to skip that initially, a summary of the situation is given at the end.

Thermaltake Soprano Black case
OCZ GameXStream 600W power supply
MSI Neo2 P35 Crossfire motherboard
Q6600 Quad Core (previously overclocked to 3.2GHz, but currently being tested at stock 2.4GHz)
4*2GB Corsair XMS2 DDR2 800MHz RAM (stock timings now)
2*Sapphire Radeon HD4850 512MB PCI-E graphics cards.
Creative X-Fi Elite Pro soundcard
Texas Instruments 1394 network card (not actually used)
Samsung 22" 226BW monitor

--------------------------------------

LONG HISTORY

I run 32-bit XP as my main OS, and regularly make backups using Acronis True Image 8 (and for more recent ones, True Image 11, but it's TI8 restores that we're talking about). A few days ago, a spectacular virus attack got through the various defences of my homebuilt system (hardware firewall, software firewall, startup control, AV heuristics; it all fell apart), and left the XP installation unusable. After failing to get it completely under control again, I deleted the partition (after backing it up to an external drive in case I need anything from it in future), and decided to restore a disk image made 3 weeks earlier. The image was restored to the same disk that it originally came from, it was a full disk image restore (including MBR and Track 0) and that appeared to be successful. The restored XP installation was now bootable, and contained all the files that had been there when the image was made. So far so good.

Now the first problem: to my surprise, when trying to play a 3D game, it crashed. Tried another, it also crashed. Some others seemed fine: on investigation, the pattern was basically that if 3D was stressed, major graphics corruption occurred which, if unstopped, would eventually lock up the system. Booting into Vista at the time, 3D graphics appeared to be fine; no crashes or lockups. A clean XP install similarly performed fine, but when I attempted to restore an earlier XP image, to a different hard drive, the same problem occurred again.

If that wasn't weird enough, things got a good deal stranger when I attempted to restore the original image once again (this is now the third image restoration). This time, it got to the Welcome screen fine, but I couldn't even login, with complaints that it couldn't find the profile, and shortly after that, complaints about the filesystem (which I imagine were both from the same underlying cause). I restored the other image, again to a different drive, and had the same problem. So now, for two different images on two different drives, both of them showed one lot of strange behaviour on the first restore (graphics failure under stress) and a different lot of strange behaviour on the second restore (filesystem problems).

At that point I was prepared to put it down to just image problems, although I verified the images in True Image both at the time I made them, and at the time I restored them, and they were fine. I can also browse and open the files within the images without any problems, so I don't believe they're corrupt, but in any case I tried restoring them once more, through my laptop now using an external enclosure, and sure enough, that seemed fine, I could login again! However, the system shortly became unstable, with lots of "missing" files (which weren't actually missing at all), causing many apps to fail. More filesystem problems. Restoring a different image to a different drive gave similar results again; always after restoration via the laptop I can login, but they quickly show these missing file problems.

Putting it down finally to bad images, I turned back to my clean XP installation, which was having the usual Radeon driver fun. Finally found some drivers (9.6) which seemed to accept both cards being present, and what did I find? 3D graphics corruption under stress. Meanwhile Vista on the same machine continued to be fine, including 3D graphics.

Scrapped that, reformatted the drive and installed another clean version of XP. This seemed fine, and 3D graphics were fine. Meanwhile, my Vista installation decided it had had enough, and would no longer boot, citing missing or corrupt boot files. Replacing them one by one worked for each individual file (but there were way too many to make that a practical solution), suggesting that they really were corrupt or missing, but I have no idea what had suddenly caused it. The clean XP installation remained there intact, but increasingly (to more than half the time) locked up during the boot process (towards the end of the Windows logo screen), something which was not helped by either VGA mode or Safe mode.

I now figured it must be a coincidental hardware problem, since many different installations and restorations were all suffering ill effects at the same time, after a year of trouble-free usage right from the first build (and that inherited the XP installation from a previous build without any trouble). I'd already tried pulling out one of the graphics cards, and swapping over the existing one, so that was ruled out, and it's not clear that they could really cause system-wide filesystem corruption such as I'd seen anyway. The most obvious candidate was the motherboard, so I bought a different motherboard (Asrock P45XE), restored an XP image via the laptop again, and installed it, and off we went. First login was great; all seemed to be well. Uninstalled the old chipset drivers and utilities, installed the new ones, and seemed good; could even run 3D apps no problem. No filesystem problems mentioned, no apps screaming that they couldn't find files. Thought I'd solved it. Then rebooted, and now the ATi drivers decided they didn't like me, and reliably gave me a 7E stop code and BSOD on boot. Pulled one of the cards out again, to get round that problem for the time being, and was able to login again, but now filesystem problems once again appeared. Restored another image, same filesystem problems appeared.

Additional things I have done in regard to hardware include trying different hard drives and different data cables, swapping out memory modules, running memtest and Windows memory diagnostics (nothing found), running Prime95 to look for CPU problems on all four cores (nothing found), and checking temperatures (all fine now, including drives and motherboard). The only slightly odd thing I noticed was that Everest, reading voltages from the Winbond chip, gave a very strange reading for the 12v rail(s), fluctuating between 0v and 3v, and never going anywhere near 12; this is a known problem with Everest and some sensors, however, and the BIOS reported the 12v voltage as being in the 12.0-12.4 range consistently, so I don't think this is a genuine problem.
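As a rough sanity check on those two sets of readings, here is a minimal Python sketch. It assumes the commonly cited +/-5% ATX tolerance on the 12v rail (11.4v to 12.6v); the function name `within_tolerance` is made up for illustration, and the sample values are simply the figures quoted above, not new measurements.

```python
def within_tolerance(nominal: float, measured: float, tolerance: float = 0.05) -> bool:
    """Return True if `measured` lies within +/- `tolerance` (a fraction) of `nominal`."""
    return abs(measured - nominal) <= nominal * tolerance


if __name__ == "__main__":
    # Figures quoted in the post: BIOS range 12.0-12.4v, Everest up to 3v.
    readings = {"BIOS low": 12.0, "BIOS high": 12.4, "Everest high": 3.0}
    for label, volts in readings.items():
        verdict = "in range" if within_tolerance(12.0, volts) else "out of range"
        print(f"{label}: {volts:.1f} V is {verdict}")
```

A rail genuinely sitting at 0v to 3v would not run the machine at all, which fits the view that the Everest figure is a sensor misreading rather than a real fault.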

One important test, which I wish I had the facilities to carry out in a more comprehensive fashion, may reveal something important. Having restored an XP image to an IDE drive, I was able to test it in one machine at work (swapping the drives out). It's the only machine I can test it on, since all other work machines are Dell and don't have PS2 ports, and the USB keyboards/mice don't work without the chipset drivers installed, but I can't login to install them (and don't have the driver disk for them anyway). It's a rather old and slow machine and I don't have drivers for it either, so parts of the chipset, unfortunately including the onboard VGA driver, cannot be loaded for testing, but it is notable that when testing in this machine, XP boots fine every time, and none of the "[various_names].dll missing" problems for applications associated with the home machine have occurred, in half a dozen or more boots. Similarly, there have been none of the problems with slowdowns, or with files that appear to be missing (don't show up in Windows Explorer and can't be found by apps which need them) until half an hour or more of slow partial disk activity has gone by and they're slowly discovered, as happens in the restored XP image on the home machine. I dearly wish I had access to a more recent or complete machine with all the drivers in which I could test this definitively, but unfortunately I don't and there is simply no way I can go and buy one (I've already spent more than I should on various bits of kit to test this problem!).

--------------------------------------

BRIEF SUMMARY


SOFTWARE

(1) Virus attack triggered the whole thing and required restoration of True Image backup of main XP installation.

(2) Backup restored okay, but graphics failed under stress. Different backup restored, similar problem. Different backup files, using different versions of backup software tested, same result.

(3) Further backups restored through the desktop were unable even to login due to crippling problems. Chkdsk reports filesystem problems, and interestingly, with the same problem file IDs in two different instances.

(4) Further backups restored via external drives connected to a laptop always seem to start okay, apart from the 3D graphics problem. Together with (3) this suggests that the desktop system is also failing under the stress of restoring a large (350GB) image.

(5) First clean XP installation appears okay, but after a while shows the same graphics corruption problem that the restored images showed.

(6) Clean XP installation appears okay, graphics okay to start with, but rapidly develops boot lockup problem, and eventually declared unbootable even though the files are still in place. This may be another example of a filesystem problem appearing for no clear reason.

(7) On the other hand, restoring the clean XP image initially appears to function perfectly well.

(8) Vista installation initially seemed okay, including graphics, but eventually also failed, again with boot file problems.

(9) Possible filesystem problems also developed on non-bootable data drive present while restored XP images were running.

Summary: different OSs on different drives all develop one or both of two types of problems: graphics corruption under stress, and filesystem/resource problems.


HARDWARE

(1) Memtest and Windows memory diagnostic find no problems with memory. Memory modules also swapped for a completely different set, but appears to make no difference.

(2) CPU was overheating and causing system shutdown (due to thermal protection enabled in new BIOS) after motherboard switch, but new CPU cooler seems to have taken care of that. Prime95 torture test shows no problems with any CPU core.

(3) Motherboard changed for a new one from a different brand and with a different chipset. Seems to make no difference to the problem again.

(4) Several different hard drives, all verified as working before, some brand new, all with good S.M.A.R.T status, tested, and same problems occurring on each. Similarly, different cables used. Some drives are SATA, some are IDE, tested on different ports on different motherboards.

(5) Two graphics cards tested, individually and paired together, both verified as working before and during the problem. Same problems occur with either/both.


So basically, I've changed/tested all the hardware, and it all appears to be fine and yet these somewhat random and progressive errors remain in place, and affecting different OS installations, some of them new, with both graphics corruption under stress and/or filesystem problems. It's not a hardware problem, it's not an image problem, it's not an OS or software problem, and yet it is a problem that seems to affect only my machine. Does it not like the decor in my apartment or something?!!

Any ideas at all? Anyone? I'm completely pulling my hair out over this; in ten years of building my own systems I've never come across anything like it. It seems simply inexplicable, and obviously I can't move forward even with a clean install until I've solved it, because clean installs also appear to fall apart after a short while. Meanwhile, thanks for reading, and apologies for the enormous length of this.
 

karvala

Thanks for the reply. Yeah, the PSU is my number one candidate as well, not least because it's basically the only part that I haven't changed yet! I mentioned it a few days ago to some people, but they dismissed the idea because the voltages on all rails remain firmly within acceptable limits, and it's a decent unit. I think it definitely should be swapped out, though, so I'm glad you mentioned it too. Out of interest, though, what is the mechanism you imagine for a PSU problem causing these symptoms, and in particular, how is it that the Vista installation survived nearly a week, with okay graphics under stress even, after the XP one collapsed?

Can't be the motherboard (unless I'm extremely unlucky) because I've already changed it (thinking along the same lines as you), and had exactly the same symptoms with the new motherboard.
 
The only thing wrong with the PSU could be a flaky voltage once in a while, but that's enough to cause problems. Vista's caching reduces disk access compared with XP, and with 8 GB (I presume that you are using Vista 64-bit if you installed that much memory), caching has been optimized.
 

karvala

Well, it's not good news, and honestly I'm out of ideas now, so I hope you guys have some insight. I replaced the PSU with a newer and more powerful one (OCZ ModXStream Pro 700W), connected it up to a restored image done via the laptop (so not with filesystem problems, at least initially), and with the new motherboard drivers etc. installed. And what happened? Yep, you guessed it: collapsed under 3D graphics stress, with graphics corruption kicking in after about a minute or so. I'm just totally confused now; I don't know what more I can do. I've literally replaced everything (except the case; surely it can't be the case??!!!). What can possibly be causing these symptoms? Is it even hardware, or software?
 

karvala



Thanks, that explanation makes sense, although as you have just seen in my above post, it turns out not to be a bad PSU, or at least a replacement makes no difference. There are a few strange features to note about this:

(1) Changing the PSU and motherboard makes no difference.

(2) Which image is restored, and which disk it's restored onto, makes no difference.

(3) Filesystem problems seem to come and go; tonight, for example, I've tested two restored images on two different drives, and both of them are fully bootable and usable; it's just the graphics problem that's present in both. Last night, the same two on the same two drives (before being restored again today, I stress; these are not literally the same restorations) were both essentially unbootable.

(4) At the same time (i.e. testing straight after) that graphics problems under stress show up in one OS, they don't show up in another. For example, the Vista (which is/was 64-bit, yes) system showed no graphics corruption under stress in some familiar programs at the time the XP one did; this is what drew me away from a graphics hardware explanation initially (along with the fact that both graphics cards showed the same effect).

Thinking about it, I wonder if the graphics under stress is actually a graphics problem, or could it also be a disk problem? Presumably what is displayed on screen has to come off disk first in some form, and if the disks are struggling for whatever reason, then perhaps that would be most likely to show up in a 3D stress situation? Or is that a very unlikely explanation?
 

karvala



Yes, either the first, second or both video cards will show the same results. The tests I've just done with the new PSU are with a single video card. The card is certainly capable of the tests I gave it as well, but it's also interesting to note that they are doing slightly better than they did when previously restored; then they would corrupt and lock up immediately, here they're going for a minute or so before falling apart.

Not sure what you mean by adequate grounding. It's all installed in a case, with the power cord going to a plug including the usual earth if that's what you mean; certainly the grounding situation hasn't visibly changed as far as I'm aware since it was all working fine. Not sure if that answers your question?
 
You answered the grounding question. The whole issue doesn't make much sense, but other than testing each part in another system, it will be difficult to find the culprit. Basically you have replaced every part except for the CPU and case. I agree that it shouldn't be the case and it probably isn't a CPU issue (though it can't be ruled out). I presume that you already used memtest to verify the memory, etc. Are the monitor and PC connected to the same outlet?
 

karvala

Yes, exactly, I've replaced all the parts except the CPU and the case, and it's very hard to see how it could be either, and yet here we are. I've used memtest and the Windows memory diagnostic to test the memory (both were fine), and also swapped it for an entirely different set, so I don't think it can be that. I've also used Prime95 to torture test all four cores of the CPU, and no errors were found.

The PC and the monitor were connected to the same outlet, although I've moved the monitor today so they no longer are. I'm about to test the PC in another outlet as well, just in case there is a dodgy connection somewhere, although it seems unlikely.

What I can't understand is how the same issue does not show up on each OS installed simultaneously, and yet every OS is eventually affected by graphics or filesystem problems. If it were software, you'd expect not every OS to go down, and if it were hardware, you'd expect each OS to be affected at the same time. I simply can't think of any even unlikely scenario in which the behaviour I'm witnessing is actually possible....
 

karvala

Final update just in case anyone is still interested: it turned out to be an obscure memory address problem when faced with the combination of a restored image, the Catalyst drivers and the /3GB boot switch in 32-bit XP. Basically, it was dynamically assigning memory addresses which didn't actually exist, and this caused all manner of problems, including filesystem problems and applications unable to initialise (although if you closed one and tried it again later, it worked, because memory addresses lower down had been released in the meantime and were assigned instead). The graphics hardware accelerator library failed to load, and the application notification also failed because it wasn't working itself for the same reason. How MS can allow this behaviour without even a basic check is quite beyond me, but there we are. Removing the /3GB switch (which was working fine at the time of imaging, but not after restoring that image because hardware I/O addresses changed) resulted in problem solved. :)
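For anyone wondering why the /3GB switch matters at all, here is a rough sketch of the arithmetic in Python. It relies only on the standard 32-bit Windows split: 4 GiB of virtual address space, divided 2 GiB user / 2 GiB kernel by default and 3 GiB user / 1 GiB kernel with /3GB. The function names and the 512 MiB "mapped device memory" figure are hypothetical, chosen only for illustration; they are not measurements from this machine and not the exact mechanism described above.

```python
from typing import Tuple

MIB = 1024 ** 2
GIB = 1024 ** 3


def address_split(three_gb_switch: bool) -> Tuple[int, int]:
    """Return (user_bytes, kernel_bytes) of the 4 GiB virtual space on 32-bit Windows."""
    total = 4 * GIB  # 2**32 addressable bytes
    user = 3 * GIB if three_gb_switch else 2 * GIB
    return user, total - user


def kernel_headroom(three_gb_switch: bool, mapped_device_bytes: int) -> int:
    """Kernel address space left after mapping a hypothetical amount of device memory."""
    _, kernel = address_split(three_gb_switch)
    return kernel - mapped_device_bytes


if __name__ == "__main__":
    # Hypothetical: 512 MiB of graphics memory mapped into kernel space.
    mapped = 512 * MIB
    for switch in (False, True):
        user, kernel = address_split(switch)
        left = kernel_headroom(switch, mapped)
        print(f"/3GB={switch}: user {user // MIB} MiB, "
              f"kernel {kernel // MIB} MiB, headroom {left // MIB} MiB")
```

The point is only that /3GB halves the kernel's share, so anything that has to be mapped there (drivers, device memory) has far less room, and a change in hardware addresses after a restore could plausibly push it over the edge.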
 
