
Two Disk Failures: a Coincidence?

Anonymous
June 29, 2005 3:48:57 AM

Archived from groups: comp.sys.ibm.pc.hardware.storage

Two 2U servers in a rack in a machine room. One does the real work,
the other is a "warm" standby. They are about 30 months
old, with SCSI disks.

Just after midnight, one server crashes: nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.

So everything gets moved to the other server, in preparation
for tracking down the problem and deciding what to do.

((The disk subsequently tells SMART that it is fine, and has never had
a bad sector, or been out of normal temperature range))

7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

So what is going on?

Was this just a coincidence? Two independent failures that happened to be
on the same day?

Or did something cause both crashes? If so, what could it have been?

We have no reason to suspect any kind of mechanical disturbance. Both
machines have been in the same rack, and on the same
UPS, since they were installed some 30 months ago, with no sign of
problems.

Neither machine had ever crashed before, and the last time they were
(both) re-booted was in November, to add an extra disk drive to each
machine.

The machine room had new air conditioning put in a couple of months
ago. And the ventilation throughout the building was upgraded
earlier this year.

What ARE the odds of two SCSI disk systems (disks +
controllers) both failing on the same day after 30 months? And is it
a more (or less) likely explanation than the building work a few rooms
away or the air conditioning installed a good number of weeks ago?
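
For a rough feel for the "what are the odds" question, here is a minimal Python sketch. It assumes the two disk systems fail independently and with a constant hazard rate. The 3% annualized failure rate and the function names are illustrative assumptions, not figures from this thread or from any drive vendor.

```python
def daily_failure_probability(annual_failure_rate: float) -> float:
    """Convert an annualized failure rate into a per-day probability,
    assuming a constant hazard rate (no infant mortality or wear-out)."""
    return 1.0 - (1.0 - annual_failure_rate) ** (1.0 / 365.0)


def same_day_double_failure(annual_failure_rate: float, days_in_service: int) -> float:
    """Probability that two independent units both have their first
    failure on the same calendar day within the service period."""
    p_day = daily_failure_probability(annual_failure_rate)
    alive = 1.0   # chance one unit has survived up to the current day
    total = 0.0
    for _ in range(days_in_service):
        # Both units survive to this day and then both fail on it.
        total += (alive * p_day) ** 2
        alive *= 1.0 - p_day
    return total


if __name__ == "__main__":
    ASSUMED_AFR = 0.03            # hypothetical 3% per year
    DAYS = 30 * 365 // 12         # roughly 30 months in service
    print(same_day_double_failure(ASSUMED_AFR, DAYS))
```

Under those assumptions the result is on the order of a few in a million. Real failure curves are not flat, so this is an order-of-magnitude guide only.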

It has been decided that the servers were pretty well due for
replacement anyway, so new ones will be ordered. But given this
rather surprising double wobbler, is there anything about the
environment that should be (double) checked?

Robert.

--
|_) _ |_ _ ._ |- | So what? It's easier for me, so I'll do it!
| \(_)|_)(-'| |_ |
deadspam.com is a spamtrap. | > > What's wrong with top posting?
Use bcs.org.uk instead. | > It makes it hard to see comments in context.
Anonymous
June 29, 2005 3:48:58 AM

Perhaps the parts (HDs and/or controllers, etc.) were manufactured in the
same batch and had a defect that kicks in like clockwork.

--Dan

"Robert Inder" <robert@deadspam.com> wrote in message
news:f51ekamw10m.fsf@3lg.org...
> Was this just a coincidence? Two independent failures that happened to be
> on the same day?
Anonymous
June 29, 2005 3:48:58 AM

Robert Inder wrote:
>
> 7PM (19 hours later), the SECOND server crashes. Again, nothing in
> the logs, but a console message saying "IO Error" with a sector number.

We had 2 PCs in the office built the same day with the same parts. Both
hard drives failed on the same day. They were both on 24/7. Upon calling
WD, it turned out that both hard drives were made on the same day.

--
http://www.bootdisk.com/
Anonymous
June 29, 2005 5:29:10 AM

"Robert Inder" <robert@deadspam.com> wrote in message news:f51ekamw10m.fsf@3lg.org
> Two 2U servers in a rack in a machine room. One does the real work, the
> other is a "warm" standby. They are about 30 months old, with SCSI disks.
>
> Just after midnight, one server crashes:

> nothing is logged, but the
> Linux console messages say "IO error" with a (large) sector number.

A crash on a single IO error is suspicious in itself.

>
> So everything gets moved to the other server, in preparation
> for tracking down the problem and deciding what to do.
>
> ((The disk subsequently tells SMART that it is fine, and has never had
> a bad sector, or been out of normal temperature range))
>
> 7PM (19 hours later), the SECOND server crashes. Again, nothing in
> the logs, but a console message saying "IO Error" with a sector number.
>
> So what is going on?
>
> Was this just a coincidence? Two independent failures that happened
> to be on the same day?
>
> Or did something cause both crashes? If so, what could it have been?
>
> We have no reason to suspect any kind of mechanical disturbance.
> Both machines have been in the same rack, and on the same UPS,
> since they were installed some 30 months ago, with no sign of problems.
>
> Neither machine had ever crashed before, and the last time they were
> (both) re-booted was in November, to add an extra disk drive to each
> machine.
>
> The machine room had new air conditioning put in a couple of months
> ago. And the ventilation throughout the building was upgraded
> earlier this year.
>
> What ARE the odds of two SCSI disk systems (disks + controllers)
> both failing on the same day after 30 months? And is it a more
> (or less) likely explanation than the building work a few rooms
> away or the air conditioning installed a good number of weeks ago?
>
> It has been decided that the servers were pretty well due for
> replacement anyway, so new ones will be ordered. But given this
> rather surprising double wobbler, is there anything about the
> environment that should be (double) checked?

For the machine to crash, that IO error has to be rather severe.
You have the block numbers, so check them and try to find out whether
they fall inside a file that the system cannot run without.
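
One way to start on that check is to work out which partition the reported sector falls in, and which filesystem block it corresponds to. The Python sketch below is illustrative only. It assumes the console's sector number is counted from the start of the whole disk, and the partition layout, sector size and block size are made-up examples rather than values from the servers in this thread.

```python
from typing import NamedTuple, Optional, Sequence


class Partition(NamedTuple):
    name: str
    start_sector: int   # first sector of the partition on the disk
    sector_count: int   # partition length in sectors


def locate_sector(partitions: Sequence[Partition], sector: int) -> Optional[Partition]:
    """Return the partition containing the given absolute sector, if any."""
    for part in partitions:
        if part.start_sector <= sector < part.start_sector + part.sector_count:
            return part
    return None


def filesystem_block(part: Partition, sector: int,
                     sector_size: int = 512, block_size: int = 4096) -> int:
    """Translate an absolute disk sector into a block number
    relative to the start of the partition's filesystem."""
    byte_offset = (sector - part.start_sector) * sector_size
    return byte_offset // block_size


# Hypothetical layout: root filesystem first, swap at the end of the disk.
EXAMPLE_LAYOUT = [
    Partition("root", 63, 16_000_000),
    Partition("swap", 16_000_063, 2_000_000),
]

if __name__ == "__main__":
    bad_sector = 17_000_000   # made-up sector number for illustration
    hit = locate_sector(EXAMPLE_LAYOUT, bad_sector)
    print(hit.name if hit else "outside all partitions")
```

If the sector lands in swap, that would fit a crash with nothing written to the logs. If it lands in an ext2/ext3 filesystem, the block number can be fed to debugfs (its icheck and ncheck commands) to look up the owning inode and path.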

>
> Robert.
Anonymous
June 29, 2005 11:45:18 AM

In article <f51ekamw10m.fsf@3lg.org>, Robert Inder <robert@deadspam.com>
writes

>Just after midnight, one server crashes: nothing is logged, but the
>Linux console messages say "IO error" with a (large) sector number.

Is the swap partition at the end of the disk?
Anonymous
June 29, 2005 2:39:10 PM

dg <dan_gus@hotmail.com> wrote in message
news:iRkwe.34828$J12.2937@newssvr14.news.prodigy.com...

> Perhaps the parts (HDs and or controllers, etc) were manufactured in the same
> batch and had a defect that kicks in like clockwork.

Very unlikely indeed.

> Robert Inder <robert@deadspam.com> wrote

>> Was this just a coincidence? Two independent failures that happened to be on
>> the same day?
Anonymous
June 29, 2005 2:39:11 PM

Rod Speed wrote:
>
> > Perhaps the parts (HDs and or controllers, etc) were manufactured in the same
> > batch and had a defect that kicks in like clockwork.
>
> Very unlikely indeed.

It was the day a disgruntled employee took some valve grinding compound
to work.
Anonymous
June 29, 2005 2:39:12 PM

"Plato" <|@|.|> wrote in message
news:42c1efe5$0$2478$bb4e3ad8@newscene.com...
> Rod Speed wrote:
> >
> > > Perhaps the parts (HDs and or controllers, etc) were manufactured in the same
> > > batch and had a defect that kicks in like clockwork.
> >
> > Very unlikely indeed.
>
> It was the day a disgruntled employee took some valve grinding compound
> to work.

That wouldn't affect a HD.
Anonymous
June 29, 2005 4:39:40 PM

Plato <|@|.|> wrote in message
news:42c1efe5$0$2478$bb4e3ad8@newscene.com...
> Rod Speed wrote:

>>> Perhaps the parts (HDs and or controllers, etc) were manufactured
>>> in the same batch and had a defect that kicks in like clockwork.

>> Very unlikely indeed.

> It was the day a disgruntled employee took
> some valve grinding compound to work.

Again, very unlikely indeed. That would produce
something visible in the SMART stats.
Anonymous
June 29, 2005 4:41:36 PM

"Plato" <|@|.|> wrote in message news:42c1efe4$1$2478$bb4e3ad8@newscene.com...
> Robert Inder wrote:
>>
>> 7PM (19 hours later), the SECOND server crashes. Again, nothing in
>> the logs, but a console message saying "IO Error" with a sector number.
>
> We had 2 pcs in the office built the same day with the same parts. Both
> hard drives failed in the same day. They were both on 24/7. Upon calling
> WD, it turned out that both hard drives were made on the same day.

But that would have produced SMART data; his didn't.
June 30, 2005 10:41:49 AM

Sounds like a brown-out. I had several devices go out in one computer for
literally no reported reason. The last thing I did was test the voltage, and
I found that the socket was not delivering enough. A UPS can do the same
when its battery goes bad.

"Mike Tomlinson" <mike@NOSPAM.jasper.org.uk> wrote in message
news:UBFmEKN+NkwCFwQl@jasper.org.uk...
> In article <f51ekamw10m.fsf@3lg.org>, Robert Inder <robert@deadspam.com>
> writes
>
>>Just after midnight, one server crashes: nothing is logged, but the
>>Linux console messages say "IO error" with a (large) sector number.
>
> Is the swap partition at the end of the disk?