Re: reiser4 plugins

From: Steve Lord
Date: Mon Jun 27 2005 - 15:21:51 EST

Next message: Tim Strobell (Contractor): "no output on serial console between probe and init"
Previous message: Rajesh Shah: "Re: ACPI-based PCI resources: PCMCIA bugfix, but resources missing in trees"
In reply to: Hans Reiser: "Re: reiser4 plugins"
Next in thread: Theodore Ts'o: "Re: reiser4 plugins"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hans Reiser wrote:

Steve, there is a remark about XFS below which you are going to be more
expert on.

Theodore Ts'o wrote:

XFS has similar issues where it assumes that hardware has powerfail
interrupts, and that the OS can use said powerfail interrupt to stop
DMA's in its tracks on a power failure, so that you don't have
garbage written to key filesystem data structures when the memory
starts suffering from the dropping voltage on the power bus faster
than the DMA engine or the disk drives. So XFS is a great filesystem
--- but you'd better be running it on a UPS, or on a system which has
power fail interrupts and an OS that knows what to do. Ext3, because
it does physical block journalling, does not suffer from this problem.
(Yes, Reiserfs uses logical journalling as well, so it suffers from
the same problem.)

I presume Ted is referring to problems guaranteeing the integrity of
the journal at recovery time. I am coming into this without all the
available context, so I may be barking up the wrong tree.... In
particular, I am not sure how journaling whole blocks protects
you from this.

The XFS journal protects itself against partial writes, to a certain
degree. The header of a journal write (inside a 512-byte sector)
contains an array of words which are swapped out from the start of
each following 512-byte sector of the journal write. The following
sectors then each have the log sequence number (LSN) of the write inserted
in place of that data.
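
To make the stamping scheme above concrete, here is a minimal sketch, not the actual XFS on-disk format. The 4-byte word size, the header layout (stamp, sector count, then the saved words) and the names pack_log_write/unpack_log_write are all assumptions made for illustration.

```python
import struct

SECTOR = 512  # sector size, as in the message
WORD = 4      # assumed word size; the message only says "words"
HDR_FIXED = 8 # assumed fixed header part: 4-byte stamp + 4-byte sector count


def pack_log_write(stamp: int, payload: bytes) -> bytes:
    """Return header sector + payload sectors, each payload sector stamped."""
    if len(payload) % SECTOR:
        raise ValueError("payload must be a whole number of sectors")
    nsect = len(payload) // SECTOR
    if HDR_FIXED + nsect * WORD > SECTOR:
        raise ValueError("too many sectors for one header")
    body = bytearray(payload)
    saved = []
    for i in range(nsect):
        off = i * SECTOR
        # Save the first word of the sector in the header...
        saved.append(bytes(body[off:off + WORD]))
        # ...and put the stamp in its place.
        body[off:off + WORD] = struct.pack("<I", stamp)
    header = struct.pack("<II", stamp, nsect) + b"".join(saved)
    return header.ljust(SECTOR, b"\0") + bytes(body)


def unpack_log_write(raw: bytes):
    """Check every sector's stamp, then restore the saved words."""
    stamp, nsect = struct.unpack_from("<II", raw, 0)
    body = bytearray(raw[SECTOR:SECTOR + nsect * SECTOR])
    for i in range(nsect):
        off = i * SECTOR
        if struct.unpack_from("<I", body, off)[0] != stamp:
            raise ValueError(f"sector {i} has the wrong stamp")
        start = HDR_FIXED + i * WORD
        body[off:off + WORD] = raw[start:start + WORD]
    return stamp, bytes(body)
```

A sector that never reached the disk still carries whatever stamp was there before, so unpack_log_write refuses the write instead of handing back a mix of old and new data.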

During recovery, we find the most recent LSN via a binary chop
search; this gives us an associated tail LSN. A scan backwards
from the head LSN is then done, covering the total possible
amount of in-flight data (maximum log buffers x maximum log buffer
size). If any of the sectors has the wrong LSN in the first word,
then it and all following data are discarded from replay. Of course,
we will also not replay any journal entry for which we do not find
the transaction commit record.
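
The head search and the backwards scan can be sketched roughly as follows. This is a simplified model, not the XFS recovery code: the log is a flat list holding the first-word stamp of each sector, sectors written in the current pass carry the same stamp as sector 0, wrap-around and commit records are ignored, and find_head/trim_head are hypothetical names.

```python
def find_head(stamps):
    """Binary chop for the first sector whose stamp differs from sector 0."""
    first = stamps[0]
    lo, hi = 0, len(stamps)  # stamps[lo] == first; hi is past the boundary
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if stamps[mid] == first:
            lo = mid
        else:
            hi = mid
    return hi


def trim_head(stamps, head, window):
    """Scan the last `window` sectors before head for a stale stamp.

    `window` stands for the maximum amount of in-flight data
    (max log buffers x max log buffer size, in sectors). The first
    sector with the wrong stamp, and everything after it, is dropped.
    """
    start = max(0, head - window)
    for i in range(start, head):
        if stamps[i] != stamps[0]:
            return i
    return head


# Sector 2 never reached the disk although later sectors did.
log = [5, 5, 4, 5, 5, 4, 4, 4]
head = find_head(log)            # the chop lands past the hole
safe = trim_head(log, head, 4)   # the scan pulls the head back to the hole
```

The chop alone can step over a hole left by writes that completed out of order, which is why the scan over the in-flight window is needed afterwards.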

Now, this protects against some failure cases, but it assumes that
sector writes are atomic: they either happen or they do not
happen. If sector writes are not atomic and one end can be
good while the other is bad, then a partial sector is possibly
going to get replayed. There have been discussions about stamping
both the head and tail of each sector, or using a checksum
instead.
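
As an illustration of the two ideas mentioned above, here is a small sketch. It is not what XFS does: the function names are made up, CRC32 from Python's zlib is just one possible checksum, and saving the overwritten words in a header is left out for brevity.

```python
import struct
import zlib

SECTOR = 512


def stamp_both_ends(sector: bytes, stamp: int) -> bytes:
    """Overwrite the first and last word of a sector with the stamp."""
    s = struct.pack("<I", stamp)
    return s + sector[4:-4] + s


def ends_match(sector: bytes, stamp: int) -> bool:
    """True only if both ends of the sector carry the expected stamp."""
    s = struct.pack("<I", stamp)
    return sector[:4] == s and sector[-4:] == s


def seal(payload: bytes) -> bytes:
    """Append a CRC32 of the payload."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def verify(record: bytes) -> bool:
    """Recompute the CRC32 and compare it with the stored one."""
    payload, stored = record[:-4], struct.unpack("<I", record[-4:])[0]
    return zlib.crc32(payload) == stored


# A torn sector: the first half is new data, the second half is old data.
old = stamp_both_ends(bytes(SECTOR), 4)
new = stamp_both_ends(b"\xff" * SECTOR, 5)
torn = new[:256] + old[256:]
```

A check of the first word alone would accept the torn sector, since its head carries the new stamp; the tail stamp catches it. A checksum covers the middle of the sector as well, at the cost of reading all the data during recovery.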

XFS on Linux has had power cycle crash testing, but there is no
way you can cover all possible hardware configurations. I
seem to recall some hardware never recovered from this testing;
by that I mean the PC did not survive the continual power cycling
and went up in smoke.

Steve

