Two Disk Failures: a Coincidence?

Archived from groups: comp.sys.ibm.pc.hardware.storage (More info?)

Two 2U servers in a rack in a machine room. One does the real work,
the other is a "warm" standby. They are about 30 months
old, with SCSI disks.

Just after midnight, one server crashes: nothing is logged, but the
Linux console messages say "IO error" with a (large) sector number.

So everything gets moved to the other server, in preparation
for tracking down the problem and deciding what to do.

((The disk subsequently tells SMART that it is fine, and has never had
a bad sector, or been out of normal temperature range))

At 7PM (19 hours later), the SECOND server crashes. Again, nothing in
the logs, but a console message saying "IO Error" with a sector number.

So what is going on?

Was this just a coincidence? Two independent failures that happened to be
on the same day?

Or did something cause both crashes? If so, what could it have been?

We have no reason to suspect any kind of mechanical disturbance. Both
machines have been in the same rack, and on the same
UPS, since they were installed some 30 months ago, with no sign of
problems.

Neither machine had ever crashed before, and the last time they were
(both) re-booted was in November, to add an extra disk drive to each
machine.

The machine room had new air conditioning put in a couple of months
ago. And the ventilation throughout the building was
upgraded earlier this year.

What ARE the odds of two SCSI disk systems (disks +
controllers) both failing on the same day after 30 months? And is it
a more (or less) likely explanation than the building work a few rooms
away or the air conditioning installed a good number of weeks ago?
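
As a rough way to put a number on the "pure coincidence" explanation, here is a minimal sketch in Python. It assumes the two disk systems fail independently and at a constant rate; the 3% annualized failure rate is an illustrative assumption, not a measured figure for these servers, and the function names are made up for this example.

```python
def daily_failure_probability(annual_failure_rate: float) -> float:
    """Chance that one disk system fails on a given day, assuming a
    constant failure rate spread evenly over a 365-day year."""
    return 1.0 - (1.0 - annual_failure_rate) ** (1.0 / 365.0)


def prob_both_fail_same_day(days: int, annual_failure_rate: float) -> float:
    """Chance that, somewhere in `days` days, there is at least one day
    on which BOTH systems fail, assuming they fail independently."""
    p = daily_failure_probability(annual_failure_rate)
    return 1.0 - (1.0 - p * p) ** days


if __name__ == "__main__":
    afr = 0.03   # ASSUMED annualized failure rate per disk system
    days = 913   # roughly 30 months of service

    p = daily_failure_probability(afr)
    print(f"P(one system fails on a given day)       = {p:.2e}")
    print(f"P(second fails within a day of the first) = {p:.2e}")
    print(f"P(both fail on some same day in {days} days) = "
          f"{prob_both_fail_same_day(days, afr):.2e}")
```

Under these assumptions, once one system has failed, the chance of the other failing independently within the next day is on the order of one in ten thousand. The independence assumption is the weak point: anything shared (power, cooling, manufacturing batch, workload moved from one machine to the other) makes a same-day double failure far more likely than this figure suggests.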

It has been decided that the servers were pretty well due for
replacement anyway, so new ones will be ordered. But given this
rather surprising double wobbler, is there anything about the
environment that should be (double) checked?

Robert.

--
|_) _ |_ _ ._ |- | So what? It's easier for me, so I'll do it!
| \(_)|_)(-'| |_ |
deadspam.com is a spamtrap. | > > What's wrong with top posting?
Use bcs.org.uk instead. | > It makes it hard to see comments in context.
  1.

    Perhaps the parts (HDs and/or controllers, etc) were manufactured in the
    same batch and had a defect that kicks in like clockwork.

    --Dan

    "Robert Inder" <robert@deadspam.com> wrote in message
    news:f51ekamw10m.fsf@3lg.org...
    > Was this just a coincidence? Two independent failures that happened to be
    > on the same day?
  2.

    Robert Inder wrote:
    >
    > 7PM (19 hours later), the SECOND server crashes. Again, nothing in
    > the logs, but a console message saying "IO Error" with a sector number.

    We had 2 PCs in the office built the same day with the same parts. Both
    hard drives failed on the same day. They were both on 24/7. Upon calling
    WD, it turned out that both hard drives were made on the same day.


    --
    http://www.bootdisk.com/
  3.

    "Robert Inder" <robert@deadspam.com> wrote in message news:f51ekamw10m.fsf@3lg.org
    > Two 2U servers in a rack in a machine room. One does the real work, the
    > other is a "warm" standby. They are about 30 months old, with SCSI disks.
    >
    > Just after midnight, one server crashes:

    > nothing is logged, but the
    > Linux console messages say "IO error" with a (large) sector number.

    A crash on a single IO error is suspicious in itself.

    >
    > So everything gets moved to the other server, in preparation
    > for tracking down the problem and deciding what to do.
    >
    > ((The disk subsequently tells SMART that it is fine, and has never had
    > a bad sector, or been out of normal temperature range))
    >
    > 7PM (19 hours later), the SECOND server crashes. Again, nothing in
    > the logs, but a console message saying "IO Error" with a sector number.
    >
    > So what is going on?
    >
    > Was this just a coincidence? Two independent failures that happened
    > to be on the same day?
    >
    > Or did something cause both crashes? If so, what could it have been?
    >
    > We have no reason to suspect any kind of mechanical disturbance.
    > Both machines have been in the same rack, and on the same UPS,
    > since they were installed some 30 months ago, with no sign of problems.
    >
    > Neither machine had ever crashed before, and the last time they were
    > (both) re-booted was in November, to add an extra disk drive to each
    > machine.
    >
    > The machine room had new air conditioning put in a couple of months
    > ago. And the ventilation throughout the building was
    > upgraded earlier this year.
    >
    > What ARE the odds of two SCSI disk systems (disks + controllers)
    > both failing on the same day after 30 months? And is it a more
    > (or less) likely explanation than the building work a few rooms
    > away or the air conditioning installed a good number of weeks ago?
    >
    > It has been decided that the servers were pretty well due for
    > replacement anyway, so new ones will be ordered. But given this
    > rather surprising double wobbler, is there anything about the
    > environment that should be (double) checked?

    For the machine to crash, that IO error has to be rather severe.
    You have the sector numbers; check them and try to find out whether
    they fall in a file that the system critically depends on.

    >
    > Robert.
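
The first step of that check, working out which partition a reported sector falls in, can be sketched in Python as below. The partition table is entirely hypothetical (names, offsets, and sizes are made up for illustration); the real one would come from the machine's own partition listing. The sketch also assumes the sector number in the console message is an absolute offset on the whole disk rather than relative to a partition.

```python
from typing import NamedTuple, Optional, Sequence


class Partition(NamedTuple):
    name: str          # device name, e.g. "sda1"
    start_sector: int  # first sector of the partition (absolute)
    sector_count: int  # number of sectors in the partition
    use: str           # mount point or role, e.g. "/" or "swap"


def find_partition(sector: int,
                   partitions: Sequence[Partition]) -> Optional[Partition]:
    """Return the partition whose sector range contains `sector`,
    or None if the sector lies outside every listed partition."""
    for part in partitions:
        end = part.start_sector + part.sector_count  # exclusive upper bound
        if part.start_sector <= sector < end:
            return part
    return None


# HYPOTHETICAL layout, for illustration only.
EXAMPLE_LAYOUT = [
    Partition("sda1", 63, 208_782, "/boot"),
    Partition("sda2", 208_845, 69_000_000, "/"),
    Partition("sda3", 69_208_845, 2_000_000, "swap"),
]

if __name__ == "__main__":
    reported_sector = 69_300_000  # stand-in for the number on the console
    hit = find_partition(reported_sector, EXAMPLE_LAYOUT)
    if hit is None:
        print("Sector is outside all listed partitions")
    else:
        offset = reported_sector - hit.start_sector
        print(f"Sector {reported_sector} is in {hit.name} ({hit.use}), "
              f"{offset} sectors into the partition")
```

Once the partition is known, the offset within it is the starting point for filesystem-level tools that map a block to a file; which tool applies depends on the filesystem in use.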
  4.

    In article <f51ekamw10m.fsf@3lg.org>, Robert Inder <robert@deadspam.com>
    writes

    >Just after midnight, one server crashes: nothing is logged, but the
    >Linux console messages say "IO error" with a (large) sector number.

    Is the swap partition at the end of the disk?
  5.

    dg <dan_gus@hotmail.com> wrote in message
    news:iRkwe.34828$J12.2937@newssvr14.news.prodigy.com...

    > Perhaps the parts (HDs and or controllers, etc) were manufactured in the same
    > batch and had a defect that kicks in like clockwork.

    Very unlikely indeed.

    > Robert Inder <robert@deadspam.com> wrote

    >> Was this just a coincidence? Two independent failures that happened to be on
    >> the same day?
  6.

    Rod Speed wrote:
    >
    > > Perhaps the parts (HDs and or controllers, etc) were manufactured in the same
    > > batch and had a defect that kicks in like clockwork.
    >
    > Very unlikely indeed.

    It was the day a disgruntled employee took some valve grinding compound
    to work.
  7.

    "Plato" <|@|.|> wrote in message
    news:42c1efe5$0$2478$bb4e3ad8@newscene.com...
    > Rod Speed wrote:
    > >
    > > > Perhaps the parts (HDs and or controllers, etc) were manufactured in the same
    > > > batch and had a defect that kicks in like clockwork.
    > >
    > > Very unlikely indeed.
    >
    > It was the day a disgruntled employee took some valve grinding compound
    > to work.

    That wouldn't affect a HD.
  8.

    Plato <|@|.|> wrote in message
    news:42c1efe5$0$2478$bb4e3ad8@newscene.com...
    > Rod Speed wrote:

    >>> Perhaps the parts (HDs and or controllers, etc) were manufactured
    >>> in the same batch and had a defect that kicks in like clockwork.

    >> Very unlikely indeed.

    > It was the day a disgruntled employee took
    > some valve grinding compound to work.

    Again, very unlikely indeed. That would produce
    something visible in the SMART stats.
  9.

    "Plato" <|@|.|> wrote in message news:42c1efe4$1$2478$bb4e3ad8@newscene.com...
    > Robert Inder wrote:
    >>
    >> 7PM (19 hours later), the SECOND server crashes. Again, nothing in
    >> the logs, but a console message saying "IO Error" with a sector number.
    >
    > We had 2 pcs in the office built the same day with the same parts. Both
    > hard drives failed in the same day. They were both on 24/7. Upon calling
    > WD, it turned out that both hard drives were made on the same day.

    But that would have produced SMART data; his didn't.
  10.

    Sounds like a brown-out. I had several devices go out in one computer for
    literally no reported reason. The last thing I tried was testing the
    voltage, and I found that the socket wasn't supplying enough of it. Also,
    UPSes can do the same when their battery goes bad.

    "Mike Tomlinson" <mike@NOSPAM.jasper.org.uk> wrote in message
    news:UBFmEKN+NkwCFwQl@jasper.org.uk...
    > In article <f51ekamw10m.fsf@3lg.org>, Robert Inder <robert@deadspam.com>
    > writes
    >
    >>Just after midnight, one server crashes: nothing is logged, but the
    >>Linux console messages say "IO error" with a (large) sector number.
    >
    > Is the swap partition at the end of the disk?
    >
    >
    >