Increasing Blue Screens leading to Windows 7 not booting w/ RAID 0

csialbany

Distinguished
Dec 1, 2010
2
0
18,510
Hello,

I am having issues with a machine configured with two WD6000HLHX drives in RAID 0 using the Intel controller on an Asus Rampage III Extreme motherboard. To summarize: the user experienced increasingly frequent blue screens, and now the machine will not boot to the hard drive (although it does bring up Windows 7 startup repair). I have tested the drives separately, as well as the RAM, and all components appear to pass diagnostic tests.

One user on Experts Exchange is strongly suggesting that the problem is the WD6000HLHXs not being adequate for a RAID 0 array. I was hoping to get other opinions on the issue. The Experts Exchange thread is copied below (http://www.experts-exchange.com/OS/Microsoft_Operating_Systems/Windows/Windows_7/Q_26646322.html).

Thanks.

---

Me:
Hello,

I am working on a computer which no longer boots into Windows 7 x64.

I delivered the machine to a client in what appeared to be perfect working order. After a week or two of use, the machine began blue screening. I was not able to make it back to the client's for another week; during that time the blue screens became more and more frequent and erratic (no consistent error code). Eventually the machine failed to boot and loaded the startup repair screen. The startup repair fails with the error: "Auto Fail Over ... no os installed".

I have the system drive set up as two 600 GB drives in RAID 0. I disabled the RAID and tested each drive individually; both passed full read tests.

The board is a Rampage III Extreme. I am using the Intel RAID controller, and the array status shows up as Normal.

I then tested the RAM - it passed memtest repeatedly.

I am a bit stumped. I re-enabled the RAID and booted to the startup repair screen. When the startup repair screen appears it identifies the Windows 7 install and you can proceed. It always fails with the NO OS INSTALLED message.

In the command prompt I can navigate to the drive and view all the files, so the filesystem appears okay. If I check with bcdedit, I see the bootloader points to the correct drive location. If I try to manually rebuild the bootloader using bootrec, I receive the same error: No OS Installed.
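
For reference, the rebuild attempt from the recovery environment's command prompt was roughly this sequence (the /fixmbr and /fixboot steps are the usual companions, included here for completeness):

    rem Check what the recovery environment sees in the BCD store
    bcdedit /enum all
    rem Standard boot-sector / BCD rebuild sequence
    bootrec /fixmbr
    bootrec /fixboot
    bootrec /scanos
    bootrec /rebuildbcd

Every variation ends with the same No OS Installed failure.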

Any help is greatly appreciated.

UserA:
Testing HDDs requires appropriate methodology, software (sometimes hardware), and knowledge of interoperability issues.

Based on what you report, my educated guess is that you are using consumer class disks which are not suitable for the Matrix controller. Specifically, the disks don't have the proper timing / settings to deal with error recovery (google TLER). Intel does not qualify consumer class disks for use with many of their controllers, because they will cause such problems.

Also, you probably don't even have the right kind of diagnostics that can expose such issues. You certainly can't run them when the disk is attached to that "controller", which is not so much a controller as a $3.00 chip. You will get a faster and more stable system with native Windows software RAID. It will also do load balancing on reads; the controller you have probably doesn't.

(Also, I think you mean you are using RAID1, not RAID0, right?)

Here is a paper I wrote which discusses some of this in more detail.
http://www.experts-exchange.com/Storage/Misc/A_2757-Disk-drive-reliability-overview.html

UserB:
I've had issues like this in the past with RAM not being compatible with the motherboard. Try removing one RAM stick at a time, or look up the motherboard to see what RAM is kosher for it.

UserC:
A single .wdl (WatchDog Log) file is created in the \Windows\LogFiles\Watchdog folder for each crash. Just open the most recently dated file in your favorite text editor (or Notepad) to view details of the crash and some related information. I mean, if possible, via Safe Mode.

Me:
User A: We are using two Western Digital VelociRaptor WD6000HLHX 600 GB drives. These drives appear to support TLER and RAID (at least RAID 1) according to http://community.wdc.com/t5/Desktop/VelociRaptor-WD6000HLHX-RAID-and-TLER/td-p/28261 . I am 100 percent sure the drives are hooked up in RAID 0. RAID 0 was requested by the client to improve the speed of a PostgreSQL database they serve from the machine.
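
(For what it's worth, smartmontools can report a drive's error recovery timers directly, assuming the drive exposes the SCT Error Recovery Control feature and is attached as a plain SATA device rather than behind the RAID BIOS. I have not run this yet, but it would be something like:

    # Query the current SCT ERC read/write timers (smartmontools)
    smartctl -l scterc /dev/sda

On a drive with the feature enabled this prints the timers in tenths of a second; if it is unsupported, smartctl reports that instead.)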

Regarding hard drive testing, I disabled RAID in the BIOS so both drives were hooked up to the SATA controller running in IDE mode. I am using QuickTech Pro diagnostic software, which IDs the drives correctly. I am fairly certain that the hard drive tests were run correctly.
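
As a cross-check with free tools, the drives' own SMART extended self-test could also be run with smartmontools while they are in plain IDE mode (the device name below is just an example):

    # Kick off the drive's built-in extended self-test, then read the results
    smartctl -t long /dev/sda
    smartctl -l selftest /dev/sda
    smartctl -A /dev/sda

I am not sure it would add much beyond what QuickTech Pro already reports, though.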

User B: I will investigate the RAM compatibility but the machine did run without a problem for at least a month and a half. I have removed the RAM and installed it in various configurations without being able to resolve the startup issue.

User C: I will check the .wdl as soon as possible, thank you. I cannot start the machine in safe mode, but I can use the Windows 7 recovery command line to copy the file to another machine.
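
The copy itself should be something like this from the recovery command prompt (drive letters are assumptions; in WinRE the installed system often maps to D: rather than C:, and E: here stands in for a USB stick):

    rem Verify the drive letters with dir before copying
    copy D:\Windows\LogFiles\Watchdog\*.wdl E:\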

Me:
User A: To clarify, I tested the hard drives to find out whether the drives themselves are damaged; I was not testing them to find out whether there was an issue with the RAID. I just re-read your post and realized you were referring to RAID issues when you said "you probably don't even have the right kind of diagnostics that can expose such issues."

User A:
The 'raptor does NOT have the appropriate firmware for use behind a RAID controller. You need the RE3/RE4s. You'll note on their website that the RE3/RE4s both specifically mention RAID compatibility/TLER, while the VelociRaptors do not. All you have is a blog posting from a moderator that says it "should" be OK. The spec sheets on the drives do NOT mention suitability. (Unfortunately I don't have a 'raptor in my lab, so I can't hook it up to our software to report the read/write TLER settings.)

The VelociRaptors are enterprise class for reliability and data integrity, but the error recovery logic is not tuned for the 2-3 second max window before the drive gives up.

Besides, you are really not even using a RAID controller in your design. This is a fakeraid. The device driver does the work.

Read this ...
http://www.wdc.com/en/library/sata/2579-001098.pdf

Me:
Thanks User A

If I understand what you are saying, the problem lies in the fact that the Western Digital drives cannot handle the RAID configuration due to the TLER spec (or lack thereof). If they were up to spec, then the Intel RAID "controller" would suffice?

User A:
Well, it would 'work', but understand that this is fakeraid and the device driver does all the work. It is a low-end, cheap three-dollar chip that you are trusting your data to: no processor, no ECC, no battery. Frankly, IMHO, it is unsuitable.

Me:
Hi again,

I just spoke with Western Digital, and they claim the WD6000HLHX is RAID compatible but "did not have the information" on whether or not it implements the TLER spec.

I think I trust your opinion over a Western Digital customer service representative's. I'm just wondering: is it normal to see intermittent failures that increase over time with an inadequate drive in RAID? Is it also normal for the RAID status to show up as "Normal" in the Intel configuration screen on boot? And is it normal that I can still access the filesystem even if there are RAID issues?

Thanks for your answers so far. Unfortunately, the only solution this leads me to is reinstalling the OS without RAID and waiting to see whether the problems recur.

User A:
Here is the deal with "RAID compatible". A USB memory stick is RAID compatible. A floppy drive and an EMC subsystem are also RAID compatible.

The deal is how the RAID "controller", whether it is software RAID or hardware RAID, is architected to deal with error recovery situations. I'm going to take some shortcuts here, generalize, and not write a volume on all error recovery scenarios, just because we all have better stuff to do and I have a conference call in a few minutes.

The Intel Matrix controller, as well as LSI MPT, HP Smart Array, and pretty much all hardware controllers, allows about 7-10 seconds for a drive to respond to queued I/O requests. When an HDD goes into deep recovery mode, it sits there and re-reads the offending sector(s) over and over until it times out based on settings in firmware. TLER has separate read and write timers. TLER is also a marketing term, BTW; it started as a vendor-specific feature that has since been adopted as a standard (SCT Error Recovery Control).

Basically, the firmware sets how long a disk is going to be allowed to work exclusively on recovering an unreadable block (or range, depending on the ATA command). "Enterprise RAID" disks set it to 2-3 seconds. The premise is that the disk is behind a RAID controller or software RAID and there is redundancy, so just give up and let the controller reconstruct the data from the mirror copy or XOR parity elsewhere; it will overwrite and remap the block automatically.

A disk without TLER (which again is not the correct term, and not even the correct way to say it, but it is just easier to go with the flow here) will go for 30+ seconds. It is presumably in a desktop machine, the block belongs to some guy's wedding photos or important data, and there is never a backup, so it needs to pull out all the stops and try to recover before giving up.

So to the controller, the disk locks up during that time.

This is why the 2-3 second timeout found in high-end disks and standard with FC, SAS, SCSI just makes those systems so snappy, and they rarely lock up.
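
To make the numbers concrete: on drives that honor the SCT ERC command, those timers are readable and settable in tenths of a second with smartmontools (whether a given VelociRaptor honors it is exactly the open question here):

    # Set the read/write error-recovery timers to 7.0 seconds (70 tenths)
    smartctl -l scterc,70,70 /dev/sdX
    # Disable them (desktop-style deep recovery, potentially 30+ seconds)
    smartctl -l scterc,0,0 /dev/sdX

The setting is typically volatile, so it has to be reapplied after a power cycle unless the firmware persists it.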

Now the problem is that the Matrix controller allows 7-10 seconds (it varies by model) for a disk to time out before it takes "corrective actions". Corrective actions could be a lot of things, and it depends on variables I can't get into without an NDA and looking things up on the specific chip, and I wouldn't even do that without $$$ for the effort :)

The Matrix controller does NOT have the intelligence to automatically re-route read requests to the other disk (if you had RAID 1), and since you are doing RAID 0, where both disks must respond to each I/O, it just waits.

Windows native software RAID 1, as well as the Linux md driver, Solaris ZFS, and other similar software RAIDs, will remap automatically and won't drop a disk without knowing it is really dead.

So that is the problem.
Just turn OFF the Intel Matrix firmware and reinstall. If you want to keep those disks, you are simply going to have to go to RAID 10, or get two larger drives and go with RAID 1. With software RAID you have the benefit of reading from both disks at the same time, so reads will have improved throughput and IOPS. You do not get this benefit with the Matrix controller anyway, so you should actually see a nice performance boost in software RAID 1 vs. Matrix RAID 0 on reads. On writes, real-world, it will probably be a wash.
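
For what it's worth, the Windows 7 software mirror is a simple diskpart operation once both disks are dynamic (this needs Professional or Ultimate, and the disk/volume numbers below are assumptions; confirm them with list disk / list volume first):

    rem From an elevated command prompt; everything after "diskpart" is typed at the DISKPART> prompt
    diskpart
    list disk
    select disk 0
    convert dynamic
    select disk 1
    convert dynamic
    list volume
    select volume 1
    add disk=1

Resync then runs in the background, and mirroring the boot volume adds a secondary-plex entry to the boot menu.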
 

NetworkStorageTips

Distinguished
Jul 19, 2010
91
0
18,660
Well, this was just awarded best answer on another thread:

http://www.tomshardware.com/forum//forum2.php?config=tomshardwareus.inc&cat=32&post=263832&page=1&p=1

Outside of that, though, frequent blue screens in my experience are caused by a hard drive going bad or by bad RAM.

What RAM are you using? (I know, you tested it, but for how long?)

Are there driver updates for the RAID controller?

Again, referencing my reply above, RAID 0 has almost no purpose in life (my opinion); RAID belongs on servers with a top-quality controller and enterprise-class drives.

If you want speed on a workstation, use an SSD for the Windows drive; huge files go on another drive or quite possibly a NAS (see my website for suggestions).

Best,
Roger.