Re: Linux kernel - Libata bad block error handling to user mode program

From: Mike Hayward
Date: Fri Mar 05 2010 - 11:43:53 EST

> The data written through linux cannot be read back by any other means.
> Does that prove any data corruption? I wrote a signature on to a bad
> drive. (With all the before mentioned permutation and combinations).
> The program returned 0 (zero) errors and said the data was
> successfully written to all the sectors of the drive and it had taken
> 5 hrs (The sample size of the drive is 20 GB). And I tried to verify
> it using another program on linux. It produced read errors across a
> couple of million sectors after almost 13 hours of grinding the
> hdd.

It is normal, although low probability, for what we call a 'stable'
storage device to lose data for numerous reasons. The drive detects
this by returning an I/O error when a sector's checksum doesn't match.
An I/O error is not data corruption; it is what we would call data
loss or unavailability.

> I can understand the slow remapping process during the write
> operations. But what if the drive has used up all the available
> sectors for mapping and is slowly dying. The SMART data displays
> thousands of seek, read, crc errors and still linux does not notify
> the program which has asked it to write some data. ????

SMART data is not really all that standardized, and it is quite normal
to see the drive correcting errors with rereads, reseeks, ECC, etc., so
determining drive health really is manufacturer and model specific.

If the drive remaps, either from its own retry or from the operating
system retrying, it should of course return a successful write even if
it takes a minute or two. Once it is out of blocks to remap with, it
must return an I/O error or time out.
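
One thing worth ruling out in the test program itself: a buffered
write() can return success before the data has reached the drive, so a
late failure may only be reported by fsync(). A minimal sketch of a
write that waits for the device, assuming Python on Linux (the helper
name write_durably is made up for illustration, it is not from your
program):

```python
import errno
import os


def write_durably(path, offset, data):
    """Write data at offset and force it to the device.

    Returns True on success and False if the kernel reports an I/O
    error (EIO) from either the write or the fsync. Any other error
    is re-raised unchanged.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        try:
            written = 0
            # pwrite may write fewer bytes than asked, so loop.
            while written < len(data):
                written += os.pwrite(fd, data[written:], offset + written)
            # Errors from delayed writeback surface here, not in pwrite.
            os.fsync(fd)
            return True
        except OSError as e:
            if e.errno == errno.EIO:
                return False
            raise
    finally:
        os.close(fd)
```

If a program only checks the return value of write() and never calls
fsync(), it can report zero errors even though the kernel logged
failures later.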

All that being said, if a drive returns success after writing, and you
later read back different data than you "successfully wrote", as
opposed to getting an error, this is data corruption. My number 1 rule
of storage is "thou shalt not silently corrupt data". With a
sufficiently strong checksum, silent corruption should be incredibly
unlikely. If you are detecting it this frequently, clearly something
is not working as intended. This means the storage system is not
sufficiently "stable" to rely upon its own checksums and return codes
for correctness.
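
To keep the two cases apart when verifying, the read-back pass can
report them separately. A rough sketch, assuming a 512-byte sector and
a signature derived from the sector number (both are assumptions for
illustration, I don't know what pattern your program writes):

```python
import errno
import hashlib
import os

SECTOR = 512  # assumed sector size


def signature(lba):
    """Deterministic 512-byte pattern derived from the sector number."""
    seed = hashlib.sha256(str(lba).encode()).digest()  # 32 bytes
    return seed * (SECTOR // len(seed))


def check_sector(fd, lba):
    """Classify one sector as 'ok', 'io_error' or 'corrupt'.

    'io_error' is data loss or unavailability: the read failed with EIO.
    'corrupt' is silent corruption: the read succeeded but returned
    something other than the expected signature (a short read also
    lands here).
    """
    try:
        got = os.pread(fd, SECTOR, lba * SECTOR)
    except OSError as e:
        if e.errno == errno.EIO:
            return "io_error"
        raise
    return "ok" if got == signature(lba) else "corrupt"
```

Millions of 'io_error' results point at a dying drive doing what it is
supposed to do on reads. Any 'corrupt' results are the ones that break
rule number 1.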

This is why some apps may resort to replication or to adding
additional checksums or ECC at a higher layer, but this should
generally be unnecessary. I would use such techniques primarily to
prove corruption defects in kernels, drivers, or hardware, or if, as
Alan mentioned, I were storing an extremely large amount of data. For
performance reasons, my software (which does store huge amounts of
data) relies primarily upon replication (to work around both
unavailability and corruption) as opposed to parity techniques and
this is effectively what you are doing to prove data corruption here.
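
For illustration only, here is roughly what replication plus a
higher-layer checksum looks like. This is a toy sketch, not my actual
software; the names seal, unseal and read_replicated are invented, and
SHA-256 is just one reasonable choice of checksum:

```python
import hashlib


def seal(payload):
    """Prefix payload with its 32-byte SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload


def unseal(record):
    """Return the payload, or None if the stored digest does not match."""
    digest, payload = record[:32], record[32:]
    return payload if hashlib.sha256(payload).digest() == digest else None


def read_replicated(replicas):
    """Return the first payload whose checksum verifies.

    `replicas` is a list of zero-argument callables. Each one reads a
    single sealed copy and may raise OSError (unavailability) or return
    damaged bytes (corruption).
    """
    for read in replicas:
        try:
            payload = unseal(read())
        except OSError:
            continue  # this copy is unavailable, try the next one
        if payload is not None:
            return payload  # this copy verified
    raise OSError("no replica is both readable and intact")
```

The same loop handles both failure modes: an I/O error skips the copy,
and a checksum mismatch skips it as well, so a read only fails when
every copy is bad.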

Hopefully you haven't found high probability data corruption :-) Can
you reproduce the problem with different manufacturers or models of
drives? If so, the problem is most likely not in the drive. I'd say
that's job number one and it's easy to try. Short of doing a white
box inspection of the kernel, you could narrow the problem down by
swapping out kernels (try a much older or newer Linux kernel, and try
another OS) and various pieces of hardware.

If everything points to the Linux kernel, then you'll have to start
instrumenting the kernel to track down where, exactly, it returns
success after having logged ATA errors. If the write didn't
eventually succeed after retries, but the kernel returned success to
your app, you'll have your kernel bug and be famous :-)

Or you could start there if you are confident it isn't the hardware or
your program. Thankfully you are using Linux and have an open kernel
data path to work with.

If you prove the drive is lying, which manufacturer makes it? You
could call up the manufacturer with your reproducible problem. They
would probably like to know if their controller is corrupting data.

- Mike