RE: Early determinition of bad sectors and correcting them ...

From: Adam Radford (aradford@3WARE.com)
Date: Fri Nov 22 2002 - 19:33:29 EST

Next message: Rusty Lynch: "Re: [BUG] sysfs on 2.5.48 unable to remove files while in use"
Previous message: Stephen Rothwell: "Re: [PATCH] Beginnings of conpat 32 code cleanups"
Maybe in reply to: Manish Lachwani: "Early determinition of bad sectors and correcting them ..."
Next in thread: Andre Hedrick: "RE: Early determinition of bad sectors and correcting them ..."
Reply: Andre Hedrick: "RE: Early determinition of bad sectors and correcting them ..."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Manish,

In the case of 3ware, If you run the 3ware card in raid mode not jbod,
sectors causing
ECC errors will be automatically remapped on the fly. There is also a
'Media Scan'
feature which will cause bad sectors to be remapped as a background task on
the
controller while normal IO is running.

-Adam

-----Original Message-----
From: Manish Lachwani [mailto:manish@Zambeel.com]
Sent: Friday, November 22, 2002 4:16 PM
To: Manish Lachwani; 'Rod.VanMeter@nokia.com';
'linux-kernel@vger.kernel.org'; 'alan@lxorguk.ukuu.org.uk'
Subject: RE: Early determinition of bad sectors and correcting them ...

Btw, I am reffering to IDE drives and IDE controllers (including controllers
like 3ware) ...

Thanks
-Manish

-----Original Message-----
From: Manish Lachwani
Sent: Friday, November 22, 2002 4:15 PM
To: 'Rod.VanMeter@nokia.com'; Manish Lachwani;
linux-kernel@vger.kernel.org; alan@lxorguk.ukuu.org.uk
Subject: RE: Early determinition of bad sectors and correcting them ...

Yes, you are right abt taking different factors into account especially
queuing before making a decision abt the threshold. However, I was actually
referring to disks only.

I have actually written a scrubber that traverses the disk and if it
encounters a problem with the medium, it tries to get the sector remapped.
However, the problem with the scrubber is that it will have to traverse all
the disks in a subsystem. The amount of time it takes to traverse each disk
is long depending on the size of the disk. We also have to make sure that
the scrubber process does not take too much CPU when running.

Hence if the scrubber traverses accross the disk, it is possible that a
problem occurs on a sector that the scrubber passed. Then we will have to
wait for the scrubber to restart the read.

Most of the disks support an operating temperature of 60C. However, if we
are operating at higher temperatures that result in long reads (due to shaky
medium at higher temperatures), this facility is good.

-----Original Message-----
From: Rod.VanMeter@nokia.com [mailto:Rod.VanMeter@nokia.com]
Sent: Friday, November 22, 2002 3:54 PM
To: manish@zambeel.com; linux-kernel@vger.kernel.org;
alan@lxorguk.ukuu.org.uk
Subject: RE: Early determinition of bad sectors and correcting them ...

I won't comment on the code, but I'll note that there
are legitimate reasons why a read can take a long time:
drive spin-up (if it has been spun down), queued commands
(if the device supports them), slow devices (e.g., some
removable media), very large commands, the occasional
thermal recalibration all come to mind immediately.

It would be great if we had this functionality, and even
better if we had it for all devices.

On at least some SCSI devices, it's possible to set a
parameter in the device that sets a threshold for how
severe an error to report (ECC errors are not a one-
dimensional thing). Unfortunately, it's pretty device
dependent. I would be nice, when you care about your
data, to set the threshold very sensitive. Then when
an error occurs, you get notified, and you can retry
or rewrite the block.

A slightly different approach is an idle scrubber, that
reads all of your blocks when the system is idle, looking
for errors and rewriting as necessary.

Note that in either case, it doesn't come for free and
doesn't really guarantee your data; a power loss or
other problem in the middle of the bad-block-rewrite
can cause problems, and writes fail more often than
reads.

--Rod

> -----Original Message-----
> From: ext Manish Lachwani [mailto:manish@Zambeel.com]
> Sent: Friday, November 22, 2002 11:33 AM
> To: linux-kernel@vger.kernel.org; 'Alan Cox'
> Cc: Manish Lachwani
> Subject: Early determinition of bad sectors and correcting them ...
>
>
> I had thought abt this earlier and tried to implemented it.
>
> Everytime there is an ECC error (0x40), there is a pending
> set of sectors
> that the drive needs to remap. The drive can map the sectors
> as part of its
> house keeping function or the drive can remap it when an
> explicit write is
> made to that sector. Once an ECC error occurs, the remapping
> process is
> manual or we have to wait till an write operation takes place to that
> sector.
>
> If a READ gives an ECC error, the amount of time it takes to
> read is usually
> higher as compared to READ operations accross sectors that
> are good. Even
> for a sector or a region of sectors that are degrading over
> time, the READ
> time is a good indication that the sector is deteriorating. A
> write to that
> sector will fix the problem.
>
> Based on the above, I modified the ide driver to implement this simple
> change. I created a sysctl entry called
> ide_disk_delay_threshold which is
> initially set to 250 ms. In ide-dma.c, I measure the amount
> of time it takes
> to complete a READ request:
>
> drive->service_time = jiffies - drive->service_start;
> if (rq->cmd == READ && (ide_disk_delay_threshold > 0) &&
> ( (drive->service_time*10) > ide_disk_delay_threshold) ) {
> printk("%s: re-write, ", drive->name);
> printk("READ took %d ms \n", drive->service_time*10);
> /*
> * Set the command to write
> */
> rq->cmd = WRITE;
> return ide_stopped;
> }
>
> I have tested the above and I have found that everytime I get
> accross an ECC
> error (0x40), the driver immediately writes to that location
> remapping that
> sector. This way, I get away with the bad sectors. The
> threshold 250 ms can
> be changed depending on the application or requirement. But,
> it seems to be
> a good indicator for early prediction of bad sectors ...
>
> Thanks
> -Manish
>
> -
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Rusty Lynch: "Re: [BUG] sysfs on 2.5.48 unable to remove files while in use"
Previous message: Stephen Rothwell: "Re: [PATCH] Beginnings of conpat 32 code cleanups"
Maybe in reply to: Manish Lachwani: "Early determinition of bad sectors and correcting them ..."
Next in thread: Andre Hedrick: "RE: Early determinition of bad sectors and correcting them ..."
Reply: Andre Hedrick: "RE: Early determinition of bad sectors and correcting them ..."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sat Nov 23 2002 - 22:00:42 EST