Re: Blockbusting news, this is important (Re: Why are bad disk sectors numbered strangely, and what happens to them?)

From: Norman Diamond
Date: Fri Oct 17 2003 - 06:14:59 EST


Replying first to Hans Reiser; below to Russell King and Pavel Machek.

> Instead of recording the bad blocks, just write to them.

If writes are guaranteed to force reallocations then this is potentially
part of a solution.

I still remain suspicious because the first failed read was milliseconds or
minutes after the preceding write. I think the odds are very high that the
sector was already bad at the time of the write but reallocation did not
occur. It is possible but I think very unlikely that the sector was
reallocated to a different physical sector which went bad milliseconds after
being written after reallocation, and equally unlikely that the sector
wasn't reallocated because it really hadn't been bad but went bad
milliseconds later. In other words, I think it is overwhelmingly likely
that the write failed but was not detected as such and did not result in
reallocation.

Now, maybe there is a technique to force it anyway. When a partition is
newly created and is being formatted with the intention of writing data a
few minutes later, do writes that "should" have a better chance of being
detected. The way to start this is to simply write every block, but this is
obviously insufficient because my block did get written shortly after the
partition was formatted and that write didn't cause the block to be
reallocated. So in addition to simply writing every block, also read every
block. For each read that fails, proceed to do another write which "should"
force reallocation.

Mr. Reiser, when I created a partition of your design, that technique was
not offered. Why? And will it soon start being offered?

Also, I remain highly suspicious that for each read that fails, when the
formatting program proceeds to do another write which "should" force
reallocation, the drive might not do it. The formatter will have to proceed
to yet another read. And if the block is still bad, then figure that the
drive is refusing to reallocate the bad block. And then yes, the formatter
will still have to make a list of known bad blocks and do something to
prevent ordinary file system operations from ever seeing those blocks.


Russell King replied to me:

> > When a drive tries to read a block, if it detects errors, it retries up
> > to 255 times. If a retry succeeds then the block gets reallocated. IF
> > 255 RETRIES FAIL THEN THE BLOCK DOES NOT GET REALLOCATED.
>
> This is perfectly reasonable. If the drive can't recover your old data
> to reallocate it to a new block, then leaving the error present until you
> write new data to that bad block is the correct thing to do.

Only if the subsequent write is guaranteed to result in reallocation. I
remain suspicious that the drive does not guarantee such. Suppose the
contents of the next write happen to get stored close enough to correct that
the block doesn't get reallocated and the data survive for another 100
milliseconds before getting corrupt again?

> Think about what would happen if it did get reallocated. What data would
> the drive return when requested to read the bad block?

Why does it matter? The drive already reported a read failure. Maybe Linux
programs aren't all smart enough to inform the user when a read operation
results in an I/O error, but drivers could be smarter. I think there's
probably a bit of room in an inode to add a flag saying that the file has
been detected to be partially unreadable. Sorry for the digression.
Anyway, it is 100% true that the data in that block are gone. The block
should be reallocated and the new physical block can either be zeroed or
randomized or anything, and that's what subsequent reads will get until the
block gets written again.

> If the error persists during a write to the bad block, then yes, I'd
> expect it to be reallocated at that point - but only because the drive has
> the correct data for that block available.

We agree in our moral expectations and our technical analysis that correct
data will be available at that time. But if your word "expect" means you
have confidence that the drive will perform correctly, I do not share your
confidence (I think it is possible but highly unlikely that the drive did
its job correctly during the previous write).

> Your description of the way Toshibas drive works seems perfectly sane.
> In fact, I'd consider a drive to be broken if it behaved in any other way
> - capable of almost silent data loss.

I think it would not be silent. If the system log had one repetition
instead of fifty repetitions, it would not be silent. I don't know which
application was silent and am irritated. (dd wasn't silent when I tried
copying the entire partition to /dev/null).


Pavel Machek wrote:

> Well, this behaviour makes sense.
>
> "If we can't read this, leave it in place, perhaps we can read it in
> future (when temperature drops below 80Celsius or something)". "If we
> can't write this, bad, but we can reallocate without loosing
> anything".

Well, consider the two extremes we've seen in this thread now. Mr. Bradford
felt that the entire drive should be discarded on account of having one bad
block. Mr. Machek feels that we should preserve the possibility of reusing
the bad block because in the future it might appear not to be bad. I take
the middle road. The drive should not be discarded until errors become more
frequent or numerous, but known bad blocks should be acted on so that those
physical blocks should not have a chance of being used again.

Suppose the block became readable when the temperature drops (this one
didn't but I believe some can). What happens when the block becomes
readable, and then a program writes new data to that block, and the block
temporarily appears good? At that time it will get written and will not get
reallocated, right? And a few milliseconds later, what? I do not want that
block reused. I want it reallocated.

And when a drive doesn't guarantee reallocation, I want the driver to remove
the sector from the file system.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/