Neil Brown wrote:
>
> For RAID5 a 'stripe' is a set of blocks, one from each underlying
> device, which are all at the same offset within their device.
> For each stripe, one of the blocks is a "parity" block - though it is
> a different block for each stripe (parity is rotated).
>
> Content of the parity block is computed from the xor of the content of
> all the other (data) blocks.
>
> To update a data block, you must also update the parity block to keep
> it consistent. For example, you can read old parity block, read old
> data block, compute
> newparity = oldparity xor olddata xor newdata
> and then write out newparity and newdata.
>
> It is not possible (on current hardware:-) to write both newparity and
> newdata to the different devices atomically. If the system fails
> (e.g. power failure) between writing one and writing the other, then
> you have an inconsistent stripe.
OK, so not only is newdata at risk, but so are n-2 of its unrelated
neighbors on the same stripe: if a disk dies while the stripe is
inconsistent, those blocks get reconstructed from bad parity. I see the
problem. I'm also... beginning... to see... a solution. Maybe.
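Just to check that I have the failure window straight, here is the
read-modify-write update in miniature. This is my own sketch with made-up
names, not anything out of the md driver:

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* newparity = oldparity xor olddata xor newdata */
static void update_parity(uint8_t *parity, const uint8_t *olddata,
			  const uint8_t *newdata)
{
	size_t i;

	for (i = 0; i < BLOCK_SIZE; i++)
		parity[i] ^= olddata[i] ^ newdata[i];
}

/*
 * The crash window: newdata and newparity live on two different devices,
 * and there is no way to write both atomically.  Lose power in between and
 * the stripe's xor no longer holds.
 */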
[stuff I can't answer intelligently yet snipped]
> > Given a clear statement of the problem, I think I can show how to update
> > the stripes atomically. At the very least, I'll know what interface
> > Tux2 needs from RAID in order to guarantee an atomic update.
>
> From my understanding, there are two ways to approach this problem.
>
> 1/ store updates to a separate device, either NV ram or a separate
> disc drive. Providing you write address/oldvalue/newvalue to the
> separate device before updating the main array, you could be safe
> against single device failures combined with system failures.
A journalling filesystem. Fine. I'm sure Stephen has put plenty of
thought into this one. Advantage: it's obvious how it helps the RAID
problem. Disadvantage: you have the normal finicky journalling boundary
conditions to worry about. Miscellaneous fact: you will be writing
everything twice (roughly).
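For the record, my mental model of what goes to the separate device in
approach 1 is the structure below. The layout and field names are my
guesses, not anything Neil specified:

#include <stdint.h>

#define BLOCK_SIZE 4096

/*
 * Hypothetical record, written and flushed to the NV ram or separate disc
 * before the main array is touched.  Recovery replays complete records and
 * ignores torn ones, which is what the checksum is for.
 */
struct raid_update_record {
	uint64_t stripe;		/* which stripe is being updated   */
	uint32_t data_disk;		/* device holding the data block   */
	uint32_t parity_disk;		/* device holding the parity block */
	uint8_t old_data[BLOCK_SIZE];	/* address/oldvalue/newvalue as    */
	uint8_t new_data[BLOCK_SIZE];	/* Neil describes, plus the parity */
	uint8_t old_parity[BLOCK_SIZE];	/* pair so either write can be     */
	uint8_t new_parity[BLOCK_SIZE];	/* redone or undone after a crash  */
	uint32_t checksum;		/* detects a torn record           */
};

That also makes the "writing everything twice" point concrete: every block
of new data crosses the bus once for the record and once for the array.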
> 2/ Arrange your filesystem so that you write new data to an otherwise
> unused stripe a whole stripe at a time, and store some sort of
> checksum in the stripe so that corruption can be detected. This
> implies a log structured filesystem (though possibly you could come
> close enough with a journalling or similar filesystem, I'm not
> sure).
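The shape I have in mind for a whole-stripe write in option 2 is roughly
the following. The geometry and names are mine, purely to check my
understanding:

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE	4096
#define NDISKS		5	/* say, 4 data + 1 parity */

/*
 * One whole-stripe write: all the new data blocks, the parity block, and a
 * checksum over the lot so a torn write can be detected at read time.
 */
struct stripe_write {
	uint8_t block[NDISKS - 1][BLOCK_SIZE];
	uint8_t parity[BLOCK_SIZE];
	uint32_t checksum;		/* over data + parity */
};

static void fill_parity(struct stripe_write *sw)
{
	int d, i;

	memset(sw->parity, 0, BLOCK_SIZE);
	for (d = 0; d < NDISKS - 1; d++)
		for (i = 0; i < BLOCK_SIZE; i++)
			sw->parity[i] ^= sw->block[d][i];
}

Because the stripe held no live data beforehand, a torn write can only
damage blocks nobody cares about, and the checksum says whether the stripe
as a whole ever became valid.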
I think it's true that Tux2's approach can do many of the things an LFS
can do. One piece is missing, though: you can't tell by looking at a
block which inode it belongs to, and I think we need to know this. The
obvious fix is to extend the group metafile with a section that reverse
maps each block using a two-word inode:index pair. (0.2% extra space
with 4K blocks.)
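Concretely, the new section would just be an array of entries like this,
one per block in the group, indexed by the block's offset within the group
(names invented; the two 32-bit words are where the 0.2% comes from):

#include <stdint.h>

struct block_backref {
	uint32_t inode;		/* owning inode, or 0 if the block is free  */
	uint32_t index;		/* logical index of the block in that inode */
};

That is all the cleaner needs: find a live block in a partial stripe, look
up its owner, branch that inode:index somewhere better.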
A nice fact about Tux2 is that the changes in a filesystem from phase to
phase can be completely arbitrary. (LFS shares this property - it falls
out from doing copy-on-write towards the root of a tree.) So you can
operate a write-twice algorithm like this: first clean out a number of
partially-populated stripes by branching the in-use blocks off to empty
stripes. The reverse lookup is used to know which inode blocks to
branch. You don't have to worry about writing full stripes because Tux2
will automatically revert to a consistent state on interruption. When
you have enough clear space you cause a phase transition, and now you
have a consistent filesystem with lots of clear stripes into which you
can branch update blocks.
Even numbered phases: Clear out freespace by branching orphan blocks
Odd numbered phases: Branch updates into the new freespace
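To make the even phase a little less hand-wavy, here is a toy model of the
cleaning pass. The stripe geometry and every name in it are invented for
illustration; parity and the actual data movement are ignored, it just
tracks which blocks are live:

#include <stdio.h>

#define STRIPES		8
#define BLOCKS_PER	4		/* data blocks per stripe */

static int live[STRIPES][BLOCKS_PER];	/* 1 = block holds live data */

static int live_count(int s)
{
	int b, n = 0;

	for (b = 0; b < BLOCKS_PER; b++)
		n += live[s][b];
	return n;
}

/*
 * Even phase: branch live blocks out of partially-populated stripes into
 * stripes that start out empty, filling each destination a whole stripe at
 * a time.  After the phase transition the vacated stripes are genuinely
 * free, and the odd phase branches updates into them the same way.
 */
static void even_phase(void)
{
	int s, b, slot = 0, dst = 0;

	while (dst < STRIPES && live_count(dst) != 0)
		dst++;			/* first destination: an empty stripe */

	for (s = 0; s < STRIPES; s++) {
		if (s == dst || live_count(s) == 0 ||
		    live_count(s) == BLOCKS_PER)
			continue;	/* only partial stripes get cleaned */
		for (b = 0; b < BLOCKS_PER; b++) {
			if (!live[s][b] || dst >= STRIPES)
				continue;
			live[dst][slot++] = 1;	/* branch (copy) the block  */
			live[s][b] = 0;		/* old location is now free */
			if (slot == BLOCKS_PER) {
				slot = 0;	/* destination is full: find */
				do		/* the next empty stripe     */
					dst++;
				while (dst < STRIPES && live_count(dst) != 0);
			}
		}
	}
	/*
	 * commit_phase() would go here: the filesystem flips atomically to
	 * the new phase, so an interruption anywhere above simply reverts.
	 */
}

int main(void)
{
	int s;

	/* two half-full stripes, the rest empty */
	live[0][0] = live[0][2] = 1;
	live[1][1] = live[1][3] = 1;

	even_phase();

	for (s = 0; s < STRIPES; s++)
		printf("stripe %d: %d live blocks\n", s, live_count(s));
	return 0;
}

Fed two half-full stripes, it ends up with one full stripe and seven empty
ones, which is exactly the defrag-ish behaviour I mean below.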
Notice that what I'm doing up there very closely resembles an
incremental defrag, and can be tuned to really be a defrag. This might
be useful.
What is accomplished is that we never kill innocent blocks in the nasty
way you described earlier.
I'm not claiming this is at all efficient - I smell a better algorithm
somewhere in there. On the other hand, it's about the same efficiency
as journalling, doesn't have so many tricky boundary conditions, and you
get the defrag, something that is a lot harder to do with an
update-in-place scheme. Correct me if I'm wrong, but don't people
running RAID care more about safety than speed?
This is just the first cut. I think I have some sort of understanding
of the problem now, however imperfect. I'll let it sit and percolate
for a while, and now I *must* stop doing this, get some sleep, then try
to prepare some slides for next week in Atlanta :-)
--
Daniel