Hard disk smart mode cannot predict failures

Discussion in 'NZ Computing' started by Craig Sutton, Mar 2, 2007.

  1. Craig Sutton

    Craig Sutton Guest

    Also interesting RE: hard disk temps: cooler is NOT better!

    Hard Disk SMART data is ineffective at predicting failure
    As manufacturers continue to release larger hard disks year after year,
    regardless of what hard drive one gets, the last thing anyone wants to see
    is a failing hard disk, particularly if it contains a lot of content that is
    not backed up. Google has recently released quite an interesting paper
    going into detail with statistics about its infrastructure's hard disk
    failures. Their main findings were that the drives' self-monitoring data
    (S.M.A.R.T.) does not reliably predict failure and that drive
    temperature and usage levels are not proportional to failure.

    What is interesting about Google's hard disk usage is that unlike most
    businesses, which typically use 10,000RPM and 15,000RPM SCSI hard disks in
    their servers, Google uses cheaper consumer-grade serial ATA and parallel
    ATA 5,400RPM and 7,200RPM hard disks. They consider a hard drive failed
    when it is replaced during a repair. All drives had their S.M.A.R.T.
    information gathered, excluding spurious readings.

    Going by their statistics, hard drives tend to fail most in their early
    life, with about 3% failing in the first three months, and then at a
    fairly steady rate after two years, with five years being the typical
    end of life. When it came to analysing the S.M.A.R.T. data, they found
    that only four main values were closely related to the failed hard disks:
    counts for scan errors, sector reallocations, offline reallocations and
    sectors on probation. One interesting discovery was that no hard disk had
    a single spindle failure, or at least no spin retry count appeared in the
    S.M.A.R.T. data.

    Unfortunately, even with these four S.M.A.R.T. values, 56% of the drives
    that failed did not have a single count in any of these four values, which
    means that over half of the hard drives failed without even a single
    warning from the S.M.A.R.T. data. Finally, when it came to temperature,
    despite most expectations, they found that the cooler the drive was, the
    more prone it was to failing. Only at very high temperatures did the rate
    of failure start increasing again.
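    Those four predictive counters could be checked mechanically; here's a
    minimal sketch in Python. The mapping from the paper's terms to the
    conventional SMART attribute labels below is an assumption on my part,
    not something the paper specifies.

```python
# Minimal sketch: flag a drive if any of the four counter families the
# Google paper found correlated with failure is non-zero. The mapping
# from the paper's terms (scan errors, reallocations, offline
# reallocations, sectors on probation) to the attribute labels below
# is approximate and assumed.
PREDICTIVE_ATTRS = (
    "Raw_Read_Error_Rate",     # ~ scan errors (assumed mapping)
    "Reallocated_Sector_Ct",   # sector reallocations
    "Offline_Uncorrectable",   # ~ offline reallocations (assumed mapping)
    "Current_Pending_Sector",  # sectors on probation
)

def at_risk(raw_counts):
    """True if any predictive counter has a non-zero raw value."""
    return any(raw_counts.get(attr, 0) > 0 for attr in PREDICTIVE_ATTRS)
```

    Bear in mind the 56% figure above: a False result is no guarantee,
    since over half of the failed drives never tripped any of these counters.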

    Even though Google used consumer-grade hard disks, it is worth noting that
    unlike the average home PC, chances are that most of the hard disks in
    Google's servers are only ever switched on once, at the time of
    installation, run continuously until the point of failure, and run in a
    temperature-controlled environment. In home PCs, the hard disks are
    regularly spun up and down, and the temperatures vary from room
    temperature up to operating temperature each time the computer is used.
    As a result, it would be interesting to see the statistics from a large
    survey of hard disks used in the home.

    Further information can be found in this Ars Technica article and in this
    Google paper.
     
    Craig Sutton, Mar 2, 2007
    #1

  2. -=rjh=-

    -=rjh=- Guest

    I was surprised to see the way this article was reported at the time.

    The real news isn't that SMART cannot reliably predict failure (it never
    could, around 50% of failures are not mechanical) but that temperature
    and 'use' weren't relevant to failure; also the rather short lifetime of
    modern disks. The chances of any of my drives (which have 5-year
    warranties now) surviving 5 years are very low.

    SMART may not *reliably* predict failures; but that doesn't make it
    entirely useless. If there is a 50% chance that SMART can tell me there
    is going to be a problem with one of my disks, I want to know about it.

    It may alert you to bad sectors well before scandisk or similar tools
    will be aware of them. Disk shift and G-sense might be useful if you
    want to see if a drive has been physically abused.

    I'd hardly call it ineffective; it certainly saved my data last year. I
    was able to retrieve my data before failure, and get an RMA by quoting
    the SMART data.

    I don't think SMART was ever intended to reliably predict failure - but
    it tracks a surprising amount of interesting data anyway.

    It might be interesting to see how physical abuse affects disk life in a
    wider and more representative population (unlike Google's production
    line) - like where disks get handled individually by untrained staff,
    slammed down on desks, used with unstable PSUs etc.


    Craig Sutton wrote:
    > Also interesting RE: hard disk temps: cooler is NOT better!
    >
    > Hard Disk SMART data is ineffective at predicting failure
    > As manufacturers continue to release larger hard disks year after year,
    > regardless of what hard drive one gets, the last thing anyone wants to
    > see is a failing hard disk, particularly if it contains a lot of content
    > that is not backed up. Google has recently released quite an
    > interesting paper going into detail with statistics about its
    > infrastructure's hard disk failures. Their main findings were that the
    > drive's self-monitoring data (S.M.A.R.T.) does not reliably predict
    > failure and that the drive temperature and usage levels are not
    > proportional to failure.
    >
    > What is interesting about Google's hard disk usage is that unlike most
    > businesses that typically use 10,000RPM and 15,000RPM SCSI hard disks
    > in their servers, Google uses cheaper consumer-grade serial ATA and
    > parallel ATA 5,400RPM and 7,200RPM hard disks. They consider a hard
    > drive failed when it is replaced during a repair. All drives have had
    > their S.M.A.R.T. information gathered excluding spurious readings.
    >
    > Going by their statistics, hard drives tend to fail the most in their
    > early stage with about 3% failing in the first three months and then at
    > a fairly steady rate after 2 years, with 5 years being the typical
    > end-of-life. When it came to analysing the S.M.A.R.T. data, they found
    > that only four main values were closely related to the failed hard disks
    > which include counts for scan error, sector reallocations, offline
    > reallocations and sectors on probation. One interesting discovery was
    > that no hard disk has had a single spindle failure or at least a spin
    > retry count in the S.M.A.R.T. data.
    >
    > Unfortunately, even with these four S.M.A.R.T. values, 56% of the drives
    > that failed did not have a single count in any of these four values,
    > which means that over half of the hard drives have failed without even
    > a single warning from the S.M.A.R.T. data. Finally, when it came to
    > temperature, despite most expectations, they found that the cooler the
    > drive was, the more prone it was to failing. Only when it came to very
    > high temperatures did the rate of failure start increasing also.
    >
    > Even though Google used consumer grade hard disks, it is worth noting
    > that unlike the average home PC, chances are that most of the hard disks
    > in Google's servers are only ever switched on once at the time of
    > installation, run continuously until the point of failure and run in a
    > temperature controlled environment. In home PCs, the hard disks are
    > regularly spun up & down and the temperatures vary from room temperature
    > up to their operating temperature each time the computer is used also.
    > As a result, it would be interesting to see what the statistics would be
    > like from a large survey of hard disks used in the home.
    >
    > Further information can be found in this Ars Technica article and in
    > this Google paper.
    >
    >
    >
     
    -=rjh=-, Mar 2, 2007
    #2

  3. Enkidu

    Enkidu Guest

    -=rjh=- wrote:
    > I was surprised to see the way this article was reported at the time.
    >
    > The real news isn't that SMART cannot reliably predict failure (it never
    > could, around 50% of failures are not mechanical) but that temperature
    > and 'use' weren't relevant to failure; also the rather short lifetime of
    > modern disks. The chances of any of my drives (which have 5-year
    > warranties now) surviving 5 years are very low.
    >
    > SMART may not *reliably* predict failures; but that doesn't make it
    > entirely useless. If there is a 50% chance that SMART can tell me there
    > is going to be a problem with one of my disks, I want to know about it.
    >
    > It may alert you to bad sectors well before scandisk or similar tools
    > will be aware of them. Disk shift and G-sense might be useful if you
    > want to see if a drive has been physically abused.
    >
    > I'd hardly call it ineffective; it certainly saved my data last year. I
    > was able to retrieve my data before failure, and get an RMA by quoting
    > the SMART data.
    >

    SMART is useless. I've had disks fail when it has not predicted failure
    and I've had disks run for ages when it reports problems. All you need
    to do is keep a spare for your RAID arrays and wait until a disk
    actually fails and bung a new one in. The chances of two failures in a
    RAID array are minute.

    Cheers,

    Cliff

    --

    Have you ever noticed that if something is advertised as 'amusing' or
    'hilarious', it usually isn't?
     
    Enkidu, Mar 2, 2007
    #3
  4. thingy

    thingy Guest

    Enkidu wrote:

    8><----

    >>

    > SMART is useless.


    I tend to agree. I have seen Windows throw up errors in the event logs
    yet SMART thinks the drive is fine... then the drive fails... it is
    certainly an eye-opener that you cannot trust SMART.

    > I've had disks fail when it has not predicted failure
    > and I've had disks run for ages when it reports problems. All you need
    > to do is keep a spare for your RAID arrays and wait until a disk
    > actually fails and bung a new one in. The chances of two failures in a
    > RAID array are minute.
    >
    > Cheers,
    >
    > Cliff
    >


    Depending on the number of disks and their age. Also having an array of
    all the same vendor and model increases the risk of 2 failing before a
    rebuild. Always good to have a hot spare.
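    That double-failure risk can be put in rough numbers; a back-of-envelope
    sketch with illustrative figures only. It assumes failures are
    independent, which same-vendor, same-batch arrays notoriously violate
    (which is the point being made above).

```python
def p_second_failure(n_disks, afr, rebuild_hours):
    """Rough probability that at least one of the remaining n-1 disks
    fails during the rebuild window, given an annualised failure rate."""
    p_hour = afr / (365 * 24)                 # crude per-hour failure probability
    exposure = (n_disks - 1) * rebuild_hours  # disk-hours at risk
    return 1 - (1 - p_hour) ** exposure

# e.g. a 14-disk shelf with a 3% annual failure rate and a 24-hour rebuild
risk = p_second_failure(14, 0.03, 24)  # roughly a tenth of a percent per rebuild
```

    Small per-event, but across 400 disks and years of rebuilds it adds up,
    which is why a hot spare (shortening the exposure window) matters.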

    Personally, having to look after about 400 disks in our SAN, I replace
    them when there are any more than one or two soft errors... I get nervous
    about having to spend 2 days restoring TBs from tape....

    regards

    Thing
     
    thingy, Mar 3, 2007
    #4
  5. ~misfit~

    ~misfit~ Guest

    Craig Sutton wrote:
    > Also interesting RE: hard disk temps: cooler is NOT better!


    <snip>

    > Finally, when
    > it came to temperature, despite most expectations, they found that
    > the cooler the drive was, the more prone it was to failing. Only
    > when it came to very high temperatures did the rate of failure start
    > increasing also.


    <snip>

    > Further information can be found in this Ars Technica article and in
    > this Google paper.


    Thanks for this Craig. I've always believed that constant temperature is
    more important than low temperature (within reason). Ergo, for PCs that are
    turned on-and-off a lot, keeping the operating temp low *is* keeping it
    (fairly) constant.

    Do you have URLs for these articles? I'd like to see if Google actually
    quote temp figures.

    Cheers,
    --
    Shaun.
     
    ~misfit~, Mar 3, 2007
    #5
  6. -=rjh=-

    -=rjh=- Guest

    ~misfit~ wrote:
    > Craig Sutton wrote:
    >> Also interesting RE: hard disk temps: cooler is NOT better!

    >
    > <snip>
    >
    >> Finally, when
    >> it came to temperature, despite most expectations, they found that
    >> the cooler the drive was, the more prone it was to failing. Only
    >> when it came to very high temperatures did the rate of failure start
    >> increasing also.

    >
    > <snip>
    >
    >> Further information can be found in this Ars Technica article and in
    >> this Google paper.

    >
    > Thanks for this Craig. I've always believed that constant temperature is
    > more important than low temperature (within reason). Ergo, for PCs that are
    > turned on-and-off a lot, keeping the operating temp low *is* keeping it
    > (fairly) constant.
    >
    > Do you have URLs for these articles? I'd like to see if Google actually
    > quote temp figures.
    >


    The Usenix article is at

    http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html

    Google article (pdf)

    http://labs.google.com/papers/disk_failures.pdf
     
    -=rjh=-, Mar 3, 2007
    #6
  7. ~misfit~

    ~misfit~ Guest

    -=rjh=- wrote:
    > ~misfit~ wrote:
    > > Craig Sutton wrote:
    > > > Also interesting RE: hard disk temps: cooler is NOT better!

    > >
    > > <snip>
    > >
    > > > Finally, when
    > > > it came to temperature, despite most expectations, they found that
    > > > the cooler the drive was, the more prone it was to failing. Only
    > > > when it came to very high temperatures did the rate of failure
    > > > start increasing also.

    > >
    > > <snip>
    > >
    > > > Further information can be found in this Ars Technica article and
    > > > in this Google paper.

    > >
    > > Thanks for this Craig. I've always believed that constant
    > > temperature is more important than low temperature (within reason).
    > > Ergo, for PCs that are turned on-and-off a lot, keeping the
    > > operating temp low *is* keeping it (fairly) constant.
    > >
    > > Do you have URLs for these articles? I'd like to see if Google
    > > actually quote temp figures.
    > >

    >
    > The usenix article is at
    >
    > http://www.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html
    >
    > Google article (pdf)
    >
    > http://labs.google.com/papers/disk_failures.pdf


    Thanks heaps.
    --
    Shaun.
     
    ~misfit~, Mar 3, 2007
    #7
  8. ~misfit~

    ~misfit~ Guest

    -=rjh=- wrote:
    > ~misfit~ wrote:
    > Google article (pdf)
    >
    > http://labs.google.com/papers/disk_failures.pdf


    Interesting, but a shame that they only correlate *average* temps to
    failure rate. Considering all the temperature data that they say they
    gathered, it would be nice if they shared more of it.

    Cheers,
    --
    Shaun.
     
    ~misfit~, Mar 3, 2007
    #8
  9. Dave Doe

    Dave Doe Guest

    In article <>,
    says...
    > -=rjh=- wrote:
    > > I was surprised to see the way this article was reported at the time.
    > >
    > > The real news isn't that SMART cannot reliably predict failure (it never
    > > could, around 50% of failures are not mechanical) but that temperature
    > > and 'use' weren't relevant to failure; also the rather short lifetime of
    > > modern disks. The chances of any of my drives (which have 5-year
    > > warranties now) surviving 5 years are very low.
    > >
    > > SMART may not *reliably* predict failures; but that doesn't make it
    > > entirely useless. If there is a 50% chance that SMART can tell me there
    > > is going to be a problem with one of my disks, I want to know about it.
    > >
    > > It may alert you to bad sectors well before scandisk or similar tools
    > > will be aware of them. Disk shift and G-sense might be useful if you
    > > want to see if a drive has been physically abused.
    > >
    > > I'd hardly call it ineffective; it certainly saved my data last year. I
    > > was able to retrieve my data before failure, and get an RMA by quoting
    > > the SMART data.
    > >

    > SMART is useless. I've had disks fail when it has not predicted failure
    > and I've had disks run for ages when it reports problems. All you need
    > to do is keep a spare for your RAID arrays and wait until a disk
    > actually fails and bung a new one in. The chances of two failures in a
    > RAID array are minute.


    For Servers (or high end RAID wk.stations) - sure. But for yer average
    workstation with a single drive, it's better than nothing.

    Last year I recovered 2 workstation HDD SMART failures successfully
    (cloning the drive to a new one). (Which is another good thing about a
    SMART failure: you get a new drive, no questions asked.) In total I
    *think* there were about 8 HDD failures - so it's better than nothing.

    I've also fixed one HDD in a RAID array (SCSI set) - but did not have
    SMART on - and I would agree that for a Server one should probably not
    use SMART (I've heard it impacts on performance a small amount). Being
    a hot-swappable array, I just swapped in a new disk. Unfortunately it
    did not automatically rebuild, but a format of the new drive and a manual
    array rebuild brought it back. The server was never shut down and there
    was no downtime at all (the array ran fine with one disk down for 3 days,
    and a night while the array rebuilt).


    --
    Duncan
     
    Dave Doe, Mar 4, 2007
    #9
  10. MarkH

    MarkH Guest

    -=rjh=- <> wrote in news:45e88b09$:

    > I'd hardly call it ineffective; it certainly saved my data last year.
    > I was able to retrieve my data before failure, and get an RMA by
    > quoting the SMART data.


    It sounds like you were certainly dicing with death there. You waited
    until SMART warned you before saving your data? It sounds like SMART is
    worse than nothing when people become complacent and wait for SMART to
    advise of impending failure - the article you quoted said that 56% of the
    failures had no corresponding SMART alerts.

    I would think that it is safer to assume failure is imminent at all times
    and regularly back up your data, just in case.


    --
    Mark Heyes (New Zealand)
    See my pics at www.gigatech.co.nz (last updated 23-Nov-06)
    "The person on the other side was a young woman. Very obviously a
    young woman. There was no possible way she could have been mistaken
    for a young man in any language, especially Braille."
    Maskerade
     
    MarkH, Mar 4, 2007
    #10
  11. -=rjh=-

    -=rjh=- Guest

    MarkH wrote:
    > -=rjh=- <> wrote in news:45e88b09$:
    >
    >> I'd hardly call it ineffective; it certainly saved my data last year.
    >> I was able to retrieve my data before failure, and get an RMA by
    >> quoting the SMART data.

    >
    > It sounds like you were certainly dicing with death there. You waited
    > until SMART warned you before saving your data?


    Hell no, I have automated backups - I use SMART to tell me when I need
    to get my drive RMA'd :)

    However the warning did enable me to take an image of the system while
    it was still working, which is a better situation than dealing with
    backups only.
     
    -=rjh=-, Mar 4, 2007
    #11
  12. -=rjh=-

    -=rjh=- Guest

    Dave Doe wrote:
    > In article <>,
    > says...
    >> -=rjh=- wrote:
    >>> I was surprised to see the way this article was reported at the time.
    >>>
    >>> The real news isn't that SMART cannot reliably predict failure (it never
    >>> could, around 50% of failures are not mechanical) but that temperature
    >>> and 'use' weren't relevant to failure; also the rather short lifetime of
    >>> modern disks. The chances of any of my drives (which have 5-year
    >>> warranties now) surviving 5 years are very low.
    >>>
    >>> SMART may not *reliably* predict failures; but that doesn't make it
    >>> entirely useless. If there is a 50% chance that SMART can tell me there
    >>> is going to be a problem with one of my disks, I want to know about it.
    >>>
    >>> It may alert you to bad sectors well before scandisk or similar tools
    >>> will be aware of them. Disk shift and G-sense might be useful if you
    >>> want to see if a drive has been physically abused.
    >>>
    >>> I'd hardly call it ineffective; it certainly saved my data last year. I
    >>> was able to retrieve my data before failure, and get an RMA by quoting
    >>> the SMART data.
    >>>

    >> SMART is useless. I've had disks fail when it has not predicted failure
    >> and I've had disks run for ages when it reports problems. All you need
    >> to do is keep a spare for your RAID arrays and wait until a disk
    >> actually fails and bung a new one in. The chances of two failures in a
    >> RAID array are minute.

    >
    > For Servers (or high end RAID wk.stations) - sure. But for yer average
    > workstation with a single drive, it's better than nothing.
    >
    > Last year I recovered 2 workstation HDD SMART failures successfully
    > (cloning the drive to a new one). (Which is another good thing about a
    > SMART failure, you get a new drive no questions). In total I *think*
    > there were about 8 HDD failures - so it's better than nothing.
    >
    > I've also fixed one HDD in a RAID array (SCSI set) - but did not have
    > SMART on - and I would agree that for a Server one should probably not
    > use SMART (I've heard it impacts on performance a small amount).


    Why would that be? AFAIK the drive is monitoring and recording the SMART
    data anyway, all that these SMART tools are doing is accessing,
    interpreting and presenting the data.
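    Exactly - the drive keeps the attribute table itself, so a userland tool
    only has to read it out and present it. A minimal sketch of parsing a
    `smartctl -A`-style attribute table; the ten-column row layout here is
    assumed from smartctl's conventional output format.

```python
def parse_smart_table(text):
    """Map attribute names to raw values from a smartctl -A style table."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and end with the raw value,
        # e.g. "5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0"
        if len(fields) >= 10 and fields[0].isdigit():
            name, raw = fields[1], fields[9]
            if raw.isdigit():
                counts[name] = int(raw)
    return counts
```

    Since this is pure passive reading of data the firmware records anyway,
    it's hard to see where a performance hit would come from - unlike
    scheduled SMART self-tests, which do occupy the drive.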
     
    -=rjh=-, Mar 4, 2007
    #12
