Re: Status of ReiserFS + Journalling

From: Neil Brown (neilb@cse.unsw.edu.au)
Date: Wed Oct 04 2000 - 21:54:59 EST


On Wednesday October 4, news-innominate.list.linux.kernel@innominate.de wrote:
> Andi Kleen wrote:
> > On Wed, Oct 04, 2000 at 01:42:46AM -0600, Andreas Dilger wrote:
> > > You should ask the reiserfs mailing list for outstanding problems. As
> > > far as LVM is concerned, I don't think there is a problem, but watch out
> > > for software RAID 5 and journalling filesystems (reiser or ext3, at least
> > > under 2.2) - it can have problems if there is a disk crash.
> >
> > It is not inherent to journaling file systems, linux software raid 5
> > can always corrupt your data when you have a system crash with a disk
> > crash (no way to write stripe sets atomically, and half-written stripe sets
> > usually give random data for any crashed block in it when xored against parity)
>
> 'Atomic' - a word that makes my ears perk up. Tux2 is all about atomic
> updating. Could you please give a simple statement of the problem for a
> person who doesn't know much more about RAID than that it stands for
> Redundant Array of Inexpensive Disks (Drives?)

[that's I == "independent" these days. You are allowed to use expensive devices]

Stepping in front of Andi (you aren't a bus are you?)

For RAID5 a 'stripe' is a set of blocks, one from each underlying
device, which are all at the same offset within their device.
For each stripe, one of the blocks is a "parity" block - though it is
a different block for each stripe (parity is rotated).

Content of the parity block is computed from the xor of the content of
all the other (data) blocks.

To update a data block, you must also update the parity block to keep
it consistent. For example, you can read the old parity block, read
the old data block, compute
   newparity = oldparity xor olddata xor newdata
and then write out newparity and newdata.
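
A minimal sketch of that read-modify-write update, in Python with
made-up two-byte blocks. The helper names (xor_blocks, full_parity,
rmw_parity) are hypothetical and are not taken from the raid5 driver;
the point is only that the shortcut gives the same parity as
recomputing it from every data block:

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise xor of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def full_parity(data_blocks):
    """Parity computed the slow way: xor of every data block in the stripe."""
    return xor_blocks(*data_blocks)

def rmw_parity(old_parity, old_data, new_data):
    """newparity = oldparity xor olddata xor newdata"""
    return xor_blocks(old_parity, old_data, new_data)

# One stripe with three data blocks (toy values, assumption).
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
parity = full_parity(data)

# Update the first data block using only the old data and old parity.
new_block = b"\xff\x00"
new_parity = rmw_parity(parity, data[0], new_block)
data[0] = new_block

# Both routes must agree.
assert new_parity == full_parity(data)
```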

It is not possible (on current hardware:-) to write both newparity and
newdata to the different devices atomically. If the system fails
(e.g. power failure) between writing one and writing the other, then
you have an inconsistent stripe.

A raid5 array has a 'superblock' which can contain a number of
things, one of which is a flag to indicate that all stripes are
consistent.
When a raid5 array is made writable, this flag is cleared. When it is
reverted to readonly (e.g. at shutdown) this flag is set after all
other writes have completed.

At initialisation time, the raid5 system checks this flag. If it is
clear, then all stripes must be checked for consistency. We go
through each stripe, computing parity from the data blocks, comparing
it with the stored parity, and writing it out if there is a difference.
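
Roughly, that check looks like the sketch below. The in-memory layout
(a dict with a "clean" flag and a list of stripes, parity kept apart
from the data blocks rather than rotated) is an assumption made to keep
the example short; it is not the on-disk format:

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise xor of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def resync(array):
    """Check every stripe unless the superblock flag says all is consistent.

    Returns the number of stripes whose parity had to be rewritten.
    """
    if array["clean"]:
        return 0  # flag set: nothing to check
    rewritten = 0
    for stripe in array["stripes"]:
        expected = xor_blocks(*stripe["data"])
        if stripe["parity"] != expected:
            stripe["parity"] = expected  # write corrected parity back
            rewritten += 1
    return rewritten

# Hypothetical array that crashed while the second stripe was being updated.
array = {
    "clean": False,
    "stripes": [
        {"data": [b"\x01", b"\x02"], "parity": b"\x03"},  # consistent
        {"data": [b"\x04", b"\x08"], "parity": b"\x00"},  # stale parity
    ],
}
fixed = resync(array)
```

Note that this only works when every device is readable, which is
exactly the assumption that breaks in the case described next.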

But, suppose at initialisation time:
 - The flag is clear, so some stripes might be inconsistent.
 - One drive has failed.

In this state we cannot check consistency. Is that a problem? We
might have lost data, but you do that when systems crash....

Suppose, for stripe X the parity device is device 1 and we were
updating the block on device 0 at the time of system failure.
What had happened was that the new parity block was written out, but
the new data block wasn't.
Suppose further that when the system comes back, device 2 has failed.
We now cannot recover the data that was on stripe X, device 2. If we
tried, we would xor all the blocks from working devices together and I
hope that you can see that this would be the wrong answer. This poor,
innocent, block, which hasn't been modified for years, has just been
corrupted. Not good for PR.
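
To make the wrong answer concrete, here is the same scenario as a
small Python sketch. It assumes a three-device array (device 0 and
device 2 hold data, device 1 holds parity for stripe X) and invented
four-byte block contents:

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise xor of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Stripe X before the update.
d0_old = b"old!"          # device 0, about to be rewritten
d2 = b"safe"              # device 2, untouched for years
parity_old = xor_blocks(d0_old, d2)   # device 1

# The update in flight: new data for device 0 and the matching parity.
d0_new = b"new!"
parity_new = xor_blocks(parity_old, d0_old, d0_new)

# System failure: the parity write landed, the data write did not.
on_disk = {0: d0_old, 1: parity_new, 2: d2}

# On restart, device 2 is gone, so its block is rebuilt from the others.
del on_disk[2]
rebuilt_d2 = xor_blocks(*on_disk.values())

# rebuilt_d2 equals d2 xor d0_old xor d0_new, which is not d2.
corrupted = rebuilt_d2 != d2
```

The rebuilt block is off by exactly (olddata xor newdata) of the block
that was being written, even though device 2 was never part of the
update.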

>
> Given a clear statement of the problem, I think I can show how to update
> the stripes atomically. At the very least, I'll know what interface
> Tux2 needs from RAID in order to guarantee an atomic update.

From my understanding, there are two ways to approach this problem.

 1/ store updates to a separate device, either NV ram or a separate
  disc drive. Providing you write address/oldvalue/newvalue to the
  separate device before updating the main array, you could be safe
  against single device failures combined with system failures.

 2/ Arrange your filesystem so that you write new data to an otherwise
   unused stripe a whole stripe at a time, and store some sort of
   checksum in the stripe so that corruption can be detected. This
   implies a log structured filesystem (though possibly you could come
   close enough with a journalling or similar filesystem, I'm not
   sure).

   Basically, at restart time you need to be able to find where you
   last wrote data, and check if it is valid or not.

   If Tux2 can do that, then you probably only need to get stripe-size
   and some sort of 'help-me-out-here' call from the raid software.
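
Both approaches can be sketched in a few lines of Python. Everything
named here is hypothetical: the 8-byte block size, the in-memory list
standing in for the NV ram log, CRC32 as the checksum, and the choice
to keep the checksum in the last four bytes of the last data block.
For approach 1/ the sketch also assumes that the parity write gets its
own log record alongside the data write, which is one way to read
"address/oldvalue/newvalue":

```python
import zlib
from functools import reduce

BLOCK = 8  # toy block size in bytes (assumption)

def xor_blocks(*blocks):
    """Byte-wise xor of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(blocks, failed):
    """Recompute the block of a failed device from all surviving blocks."""
    return xor_blocks(*(b for i, b in enumerate(blocks) if i != failed))

# --- Approach 1/: log every block write before touching the array ---

def log_update(log, blocks, parity_dev, dev, new_data):
    """Append (address, oldvalue, newvalue) for the data and parity block."""
    new_parity = xor_blocks(blocks[parity_dev], blocks[dev], new_data)
    log.append((dev, blocks[dev], new_data))
    log.append((parity_dev, blocks[parity_dev], new_parity))
    return new_parity

def replay(log, blocks, failed):
    """After a crash, redo every logged write that targets a live device."""
    for dev, _old, new in log:
        if dev != failed:
            blocks[dev] = new
    log.clear()

# --- Approach 2/: whole-stripe writes that carry a checksum ---

def build_stripe(payload, ndata):
    """Pack payload plus a trailing CRC32 into ndata blocks, then add parity."""
    body_len = ndata * BLOCK - 4
    if len(payload) > body_len:
        raise ValueError("payload does not fit in one stripe")
    body = payload.ljust(body_len, b"\0")
    raw = body + zlib.crc32(body).to_bytes(4, "big")
    data = [raw[i:i + BLOCK] for i in range(0, len(raw), BLOCK)]
    return data + [xor_blocks(*data)]

def stripe_is_valid(blocks, failed=None):
    """Rebuild a failed device's block if needed, then verify the checksum."""
    blocks = list(blocks)
    if failed is not None:
        blocks[failed] = rebuild(blocks, failed)
    raw = b"".join(blocks[:-1])  # last block is parity
    return zlib.crc32(raw[:-4]) == int.from_bytes(raw[-4:], "big")

# Demo 1: parity written, data not, then device 2 dies.
blocks = [b"old-data", b"", b"innocent"]      # devices 0, 1 (parity), 2
blocks[1] = xor_blocks(blocks[0], blocks[2])
innocent = blocks[2]
log = []
blocks[1] = log_update(log, blocks, parity_dev=1, dev=0, new_data=b"new-data")
replay(log, blocks, failed=2)                 # restart: redo the logged writes
recovered = rebuild(blocks, failed=2)

# Demo 2: a full-stripe write where the block for device 1 never landed.
old = build_stripe(b"version-1", ndata=3)
new = build_stripe(b"version-2", ndata=3)
torn = [new[0], old[1], new[2], new[3]]
```

In demo 1 the replay brings device 0 up to date before device 2 is
rebuilt, so the innocent block comes back intact. In demo 2 the torn
stripe fails its checksum, so the filesystem can discard it and fall
back to whatever it wrote before; if the device that missed the write
happens to be the one that failed, the rebuild produces the complete
new stripe and the checksum passes.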

>
> > In this case "safe" just means that you don't need a fsck to be sure that
> > the metadata is consistent -- data is never guaranteed to be consistent
> > unless you have applications that use fsync/O_SYNC properly (=basically do
> > their own journaling)
>
> I truly believe that's a temporary situation.

I actually think that it is worse than that. Without a
'start-transaction / end-transaction' interface to the filesystem, you
cannot be sure that data is consistent. fsync is only
'end-transaction' which isn't enough, and O_SYNC will only work if
every transaction fits in a disk block.
(That is, unless the application does its own journalling. Then
fsync can be enough).
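
A minimal sketch of what "its own journalling" could look like, using
only fsync as the 'end-transaction' marker. The JSON journal format and
the function names are invented for illustration, and the sketch
ignores details a real application would need (for example syncing the
directory after creating or removing the journal file):

```python
import json
import os
import tempfile

def apply_updates(data_path, updates):
    """Write each (offset, hex payload) update in place, then fsync."""
    with open(data_path, "r+b") as f:
        for offset, payload_hex in updates:
            f.seek(offset)
            f.write(bytes.fromhex(payload_hex))
        f.flush()
        os.fsync(f.fileno())

def write_journal(journal_path, updates):
    """Record the whole transaction; the fsync is the commit point."""
    with open(journal_path, "w") as j:
        json.dump(updates, j)
        j.flush()
        os.fsync(j.fileno())

def transaction(data_path, journal_path, updates):
    """Journal first, then update in place, then retire the journal."""
    write_journal(journal_path, updates)
    apply_updates(data_path, updates)
    os.remove(journal_path)

def recover(data_path, journal_path):
    """At startup, redo a committed transaction. Returns True if replayed."""
    if not os.path.exists(journal_path):
        return False
    try:
        with open(journal_path) as j:
            updates = json.load(j)
    except ValueError:
        updates = None  # torn journal: the transaction never committed
    if updates is not None:
        apply_updates(data_path, updates)
    os.remove(journal_path)
    return updates is not None

# Simulate a crash after the journal is durable but before the data
# file was touched, then run recovery.
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.bin")
    journal_path = os.path.join(tmp, "data.journal")
    with open(data_path, "wb") as f:
        f.write(b"A" * 8 + b"B" * 8)
    write_journal(journal_path, [(0, (b"X" * 8).hex()), (8, (b"Y" * 8).hex())])
    replayed = recover(data_path, journal_path)
    with open(data_path, "rb") as f:
        final = f.read()
    journal_left = os.path.exists(journal_path)
```

A truncated journal never parses as JSON because the outer list closes
last, so a half-written transaction is dropped as a whole rather than
applied in part.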

NeilBrown