Re: metadata operation reordering regards to crash

From: Andreas Dilger
Date: Sat Sep 15 2018 - 14:05:17 EST


On Sep 15, 2018, at 12:58 AM, çæå <milestonejxd@xxxxxxxxx> wrote:
>
> On Sat, Sep 15, 2018 at 6:23 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>
>> On Fri, Sep 14, 2018 at 05:06:44PM +0800, çæå wrote:
>>> Hi, all,
>>>
>>> A probably somewhat complex question:
>>> Do today's practical filesystems, e.g., extX, btrfs, preserve metadata
>>> operation order through a crash/power failure?
>>
>> Yes.
>>
>> Behaviour is filesystem dependent, but we have tests in fstests that
>> specifically exercise order preservation across filesystem failures.
>>
>>> What I know is that modern filesystems ensure metadata consistency
>>> after a crash/power failure. Journaling filesystems like extX do that by
>>> write-ahead logging of metadata operations into transactions. Other
>>> filesystems do it in other ways; btrfs, for example, uses COW.
>>>
>>> What I'm not yet clear on is whether these filesystems preserve
>>> metadata operation order after a crash.
>>>
>>> For example,
>>> op 1. rename(A, B)
>>> op 2. rename(C, D)
>>>
>>> As mentioned above, metadata consistency is ensured after a crash.
>>> Thus, B is either the original B (or does not exist) or has been replaced by A.
>>> The same goes for D.
>>>
>>> Is it possible that, after a crash, D has been replaced by C but B is still
>>> the original file (or does not exist)?
>>
>> Not for XFS, ext4, btrfs or f2fs. Other filesystems might be
>> different.
>
> Thanks, Dave,
>
> I found this archive:
> https://www.mail-archive.com/linux-btrfs@xxxxxxxxxxxxxxx/msg31937.html
>
> It seems the btrfs people think reordering could happen.
>
> It is a relatively old reply. Has the implementation changed? Or is there
> some new standard that requires that reordering not happen?

There is nothing in POSIX that requires any particular ordering. However,
the sequence "A, B, C, sync C" on ext3/ext4 has "always" resulted in A, B
also being sync'd to disk (including parent directory creation, etc).

For a while, ext4 with delayed allocation resulted in write A, rename A->B
causing "B" to potentially not have any data (commit v2.6.29-5120-g8750c6d).
Although such applications depend on non-POSIX behaviour, the operation
ordering behaviour has been around long enough that applications have grown
to depend on it, and consider the filesystem to have a bug when it doesn't
behave that way.

If you want to write a robust application, you should fsync() the files you
care about (possibly with AIO so you get a notification on completion rather
than waiting).
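
As a rough illustration of the synchronous variant of that advice, here is a
minimal sketch of the common "write a temporary file, fsync() it, rename() it
over the target, then fsync() the parent directory" pattern. The helper name
replace_file_durably(), the ".tmp" suffix, and the fixed-size path buffers are
assumptions made up for this example, not something taken from any particular
application or filesystem.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: replace dir/name with new contents without relying
 * on the filesystem's implicit ordering. Returns 0 on success, -1 on error.
 */
int replace_file_durably(const char *dir, const char *name,
                         const char *buf, size_t len)
{
    char tmp[4096], dst[4096];
    int fd, dfd, n, rc;

    n = snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dir, name);
    if (n < 0 || (size_t)n >= sizeof(tmp)) {
        errno = ENAMETOOLONG;
        return -1;
    }
    n = snprintf(dst, sizeof(dst), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(dst)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Flush the file data before the rename makes it visible as dst. */
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }

    if (rename(tmp, dst) < 0) {
        unlink(tmp);
        return -1;
    }

    /* fsync the parent directory so the rename itself is on stable storage. */
    dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

A caller would use it as replace_file_durably("/some/dir", "config", data, len)
and treat any -1 return as "the new contents may not be durable".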

Cheers, Andreas




