Hard drives dropping out of MD array built on Intel S1200BTL

Erk1209

Distinguished
Dec 6, 2011
Good evening,

I have been wrestling with an issue on my file server and would be very grateful for some advice. First, my system:

Intel Xeon E3-1240V2
Intel Server motherboard S1200BTL
8GB Crucial ECC RAM
120GB SSD for Ubuntu 12.04 OS
10x 2TB Seagate Barracuda drives in an MD software RAID 6 array
2x LSI 9211-8i controllers flashed to IT mode
400W Seasonic Gold PSU

I am having issues with drives dropping out of my array. This board has an x16 slot, three x8 slots, and a regular PCI slot. I initially ran 8 hard drives with a Supermicro x4 controller in the topmost PCIe x8 slot for about a year without any issues whatsoever.

I wanted to add two more drives, so I purchased a Supermicro x8 card, which I installed in the topmost x8 slot, moving the x4 Supermicro card to the bottom x8 slot. When attempting to rebuild the array, I had issues with drives dropping out and making horrifying mechanical sounds, and I was never able to complete a rebuild. Eventually enough drives dropped out that the array failed altogether. Even though drives were reported bad, they never tested bad with any program. I removed the x8 Supermicro card, replaced it with my original x4 Supermicro card in the topmost PCIe x8 slot, and connected the two extra drives directly to the motherboard. This time the array assembled perfectly and ran smoothly.

Assuming my Supermicro x8 controller was bad, I swapped it for an LSI 9211-8i controller and placed it in the bottommost PCIe x8 slot. This also worked perfectly: the MD array assembled and ran without issues.

Encouraged, I purchased a second 9211 and installed it in the third PCIe x8 slot on the board, moving my first 9211 to the topmost PCIe x8 slot. This did not work at all. Drives dropped out (almost always two at a time) or failed to be detected on the controller in the topmost x8 slot, and I was never able to complete a rebuild with this setup.

I was able to complete a rebuild by moving the 9211 from the topmost x8 slot into the PCIe x16 slot, leaving my second 9211 in the third PCIe x8 slot. Not only did this detect all drives and rebuild without a failure, it rebuilt at about twice the speed of my previous setups: over 100,000K/sec. I was thrilled, thinking I had figured out the issue. The array rebuilt overnight and ran stable all day.

I came home from work tonight intending to close up my tower and get back to normal. As I was standing the case up from lying on its side (the machine was on), I heard a mechanical sound from a drive (not a click) and, poof, two drives connected to the 9211 in the third PCIe x8 slot had dropped out. I was able to immediately re-add the missing drives and the array began rebuilding like normal at high speed (127,000K/sec). This continued uninterrupted for some time. I went back to close up the case and in the process brushed a power cable with the force of a butterfly landing on a flower. Immediately, I heard the mechanical sound again and a drive connected to the 9211 in the third PCIe x8 slot dropped out, showing "removed" in mdadm --detail /dev/md0, as has become the norm. The array immediately began rebuilding with one of the "spares" I re-added after tonight's initial problem when I stood the tower up, and that has been running just fine for about two hours now.
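For reference, the re-add dance looks roughly like this (device names are placeholders; the real member names come from /proc/mdstat and mdadm --detail):

# See which members the array currently considers removed or faulty
cat /proc/mdstat
mdadm --detail /dev/md0

# Put a dropped (but otherwise healthy) member back in;
# /dev/sdX is a placeholder for the actual device
mdadm /dev/md0 --re-add /dev/sdX

# If --re-add is refused, --add re-introduces it as a spare
# and triggers a full rebuild onto it
mdadm /dev/md0 --add /dev/sdX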

However, I'm really uncomfortable with how volatile things are right now. I don't know if it's a coincidence that I touched a cable or stood the tower up and drives dropped out of the array, or if it really is that sensitive. I'm wondering if there's an issue with this motherboard just not being able to support the two x8 cards.

The current setup has eight drives connected to a 9211 in the PCIe x16 slot and two connected to the 9211 in the third PCIe x8 slot.

Thank you for reading, and please let me know if you have any ideas or if you need anything explained better. I could really use a hand as I'm at my wit's end with this...

-Eric
 

InvalidError

Titan
Moderator
If touching or shaking connectors "with the force of a butterfly" causes drives to drop out, that sounds like an intermittent power connection somewhere: a bad crimp in a wiring harness, a bad solder joint, a broken or loose connector, an intermittent short, etc. You could start by unplugging all your drives, inspecting all the connectors to see if any look better or worse than the others, then plugging them back in and doing a wiggle test on each to see if you still have a weak connection somewhere.
 

Erk1209

Distinguished
Dec 6, 2011
Invalid, thank you for your reply. I am using Molex-to-4x-SATA power adapters to power the ten drives. I suppose a short could explain why a minor jostle in two cases caused drives to drop out. I just looked and noticed that the backs of two of the SATA power connectors have fallen off, exposing the metal. There is no correlation between the drives that are dropping out and the power cables with the missing backs. Would you recommend that I replace my adapters all the same?
 

InvalidError

Titan
Moderator
If by "missing backs" you mean that the SATA power connectors are missing part of the shroud around their wafer connector, that would reduce the contact force applied on the power connection to maintain a good electrical connection.

Even if the broken connectors are not causing problems on the drives they are currently connected to, you might want to consider how they got damaged in the first place: the drive-side connectors they were attached to when they broke may have been damaged as well. Is it always the same drives that keep dropping out?
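One quick way to tell is the kernel log: when a drive loses its link or power, the kernel logs a reset/offline event against a specific device, so something along these lines (the grep patterns and log path are just the usual Ubuntu defaults) will show whether the same device keeps reappearing:

# Recent kernel messages about resets / dropped disks
dmesg | grep -iE 'reset|offline|fail'

# Same thing from the persisted log, with timestamps
grep -iE 'reset|offline|I/O error' /var/log/syslog

# Map a device name back to a physical drive via its serial number
# (/dev/sdX is a placeholder)
sudo smartctl -i /dev/sdX | grep -i serial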
 

Erk1209

Distinguished
Dec 6, 2011
Thanks for your help, Invalid. I heard back from Intel today, who informed me that my board cannot support the two x8 cards. It turns out the x16 slot is actually electrically x8, and the physical x8 slots are electrically x4. So I either need to change boards or change my approach to expanding my storage; I'm considering an Intel RES2SV240 SAS expander for that purpose. I will most likely still replace the damaged Molex adapters, though, per your advice.
 

InvalidError

Titan
Moderator
Your board being x8/x4/x4 or some variant thereof should not prevent your controllers from working or cause drives to drop off. It should only affect the sustained bandwidth available through that slot/card, and an x4 PCIe 2.0 slot is good for 20Gbps raw (roughly 2GB/s usable). You should have no problem pulling over 1GB/s through any of those slots.

I checked your board's datasheet and it seems the top-most PCIe slot (slot #3) is the IO-Hub's x4 slot and your CPU-hosted x8 slot is #6 while #4 and #5 are your CPU-hosted x4 slots. So you want to put your most IO-intensive card in slot #6 and your other IO-intensive cards in #4 and #5. You want to avoid using #3 for high-speed IO since it has to go through the DMI bus which adds latency and may become a bottleneck at high speeds.
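If you want to confirm what link each card actually negotiated, lspci will report it per device; a quick check looks something like this (the bus address is a placeholder, take it from the first command):

# Find the HBAs' PCI bus addresses
lspci | grep -i lsi

# Compare the slot's capability with the negotiated link for one of them
sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkSta:'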

BTW, you may want to re-check your data cables too if you did not do so already: if any of those cables are loose or broken, the "mechanical noises" could simply be drives re-initializing themselves after their data connection has been broken and restored.
 

Erk1209

Distinguished
Dec 6, 2011


Invalid, thanks for your continued assistance. I see your point about the PCIe interfaces not normally causing the problems I've experienced. To control for this as tests go forward, I'll keep an x4 card in the x8 slot just to eliminate that possibility. I am going to pick up a new set of Molex-to-SATA power adapters this weekend to test with. I may as well pick up an extra SFF-8087 cable or two as well, in case I broke a cable while wiring the system. Recently only the two drives connected to the x4 card in the bottom x8 slot have had issues, so that's what I'll focus on.

Another gentleman in a somewhat related thread suggested that my power supply might not be sufficient to power ten drives and the rest of the machine. Do you have any opinion on this? I'm thinking of upgrading to a 550W or better if replacing the cables doesn't give me success.

And finally you mentioned not plugging any cards into the #3 slot. Is that the one immediately beneath the x16 slot?

Thanks again!
 

InvalidError

Titan
Moderator

Intel's manual is somewhat confusing about that. Intel numbers slots from the bottom, so when the text says #3, that would be the PCIe slot nearest the regular PCI slot. In the board drawings, however, slot #6 has a separate label from slots #4 and #5, which seems to imply that slot #6 (the one nearest the x16 slot, which would be #7) is the ugly duckling. One of the two must be a mistake.

For the PSU: unless you have extremely power-hungry drives, each drive should not use much more than 20W during spin-up and ~12W active, while the rest of your system should not be using more than 100W unless you have a GPU and run CPU/GPU-intensive stuff on your server. Since your server should not use more than ~180W under normal circumstances, a 400W PSU should be more than adequate, even more so considering you said you have a Gold-rated Seasonic, so you should be well within the PSU's comfort zone. The only reason you might want to consider a beefier PSU would be to reduce the number of power adapters/splitters you need: beefier PSUs tend to have more cables with heavier-gauge wires, and sometimes more connectors, so you do not need to use as many splitters or hook up as many loads per wire from the PSU.
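As a rough back-of-the-envelope tally using the figures above (ballpark numbers, not measurements):

10 drives spinning up at once: 10 x ~20W = ~200W (worst case, with no staggered spin-up)
CPU, board, RAM and SSD under load: ~100W
Worst-case peak: ~300W, still comfortably inside a 400W unit.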
 

Erk1209

Distinguished
Dec 6, 2011

Thanks for clearing that up for me. I had a lot of confidence in my Seasonic, so I was a little surprised when that other gentleman suggested it might be the issue. My plan going forward is to leave the x4 card in there and replace my Molex adapters, while adding a second Molex cable to the PSU so I'm not running so much off the one cable. Hopefully this will clear up the extreme sensitivity to movement I'm currently experiencing. If that works as expected, I'll reinstall the x8 card and see if things are stable. I'll report back this weekend with my results.

Thanks again

 

InvalidError

Titan
Moderator
A computer should not be that sensitive to relatively minor vibrations, so my best guess is you have something loose somewhere. It could be a bad solder joint or a bad crimp in the PSU's wiring/cables, or in your splitters. The best way to find out is the scientific "wiggle test": wiggle only one thing at a time and see what causes the most repeatable failures. Do not stop at the first seemingly repeatable failure, since it is quite possible you are not in the right place yet and there is an even more sensitive spot nearby that would be a more likely candidate.
 

kiwifruktish

Reputable
Dec 8, 2014
Your problem sounds like smartd.

I had the same problem with my LSI 1068E.

The solution is to remove any directives in smartd.conf that can spin up a disk.

Myself, I used:
/dev/sda -a -o on -S on -T permissive -W 0,0,0 -I 194 -I 231 -n never,q -s S/../.././12 -m kx@x.com -M exec /usr/share/smartmontools/smartd-runner

Problem was:
-n never
"smartd will poll (check) the device regardless of its power mode. This may cause a disk which is spun-down to be spun-up when smartd checks it. This is the default behavior if the '-n' Directive is not given."

-o on (also a problem, if I'm not wrong).

Now I use the following settings:
/dev/sda -S on -a -s (S/../.././11|L/../../(2|7)/13) -m kx@x.com -M exec /usr/share/smartmontools/smartd-runner

I also have an S1200BTL, so don't take Intel's word on which PCIe cards are supported or not ;) since a lot of controllers that aren't on the list work fine.
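For reference, the cleanest way to get that behaviour is the -n standby directive from the smartd man page, which skips checks while a drive is spun down. A minimal sketch (the device name, schedule and service name are just examples for Ubuntu) would look like:

# /etc/smartd.conf - monitor the drive but never wake it from standby,
# and still run a short self-test daily at 11:00 when it is awake
/dev/sda -a -S on -n standby,q -s S/../.././11

# reload smartd so the new directives take effect
sudo service smartmontools restart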