Re: [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode

From: Jan Kara
Date: Wed Dec 04 2024 - 06:26:34 EST


On Wed 20-11-24 10:56:53, Zhang Yi wrote:
> On 2024/11/19 7:15, Darrick J. Wong wrote:
> > On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>
> >> There is no need to write back all data before punching a hole in
> >> data=ordered|writeback mode since it will be dropped soon after removing
> >> space, so just remove the filemap_write_and_wait_range() in these modes.
> >> However, in data=journal mode, we need to write dirty pages out before
> >> discarding page cache in case of crash before committing the freeing
> >> data transaction, which could expose old, stale data.
> >
> > Can't the same thing happen with non-journaled data writes?
> >
> > Say you write 1GB of "A"s to a file and fsync. Then you write "B"s to
> > the same 1GB of file and immediately start punching it. If the system
> > reboots before the mapping updates all get written to disk, won't you
> > risk seeing some of those "A" because we no longer flush the "B"s?
> >
> > Also, since the program didn't explicitly fsync the Bs, why bother
> > flushing the dirty data at all? Are data=journal writes supposed to be
> > synchronous flushing writes nowadays?
>
> Thanks you for your replay.
>
> This case is not exactly the problem that can occur in data=journal
> mode, the problem is even if we fsync "B"s before punching the hole, we
> may still encounter old data ("A"s or even order) if the system reboots
> before the hole-punching process is completed.
>
> The details of this problem is the ext4_punch_hole()->
> truncate_pagecache_range()-> ..->journal_unmap_buffer() will drop the
> checkpoint transaction, which may contain B's journaled data. Consequently,
> the journal tail could move advance beyond this point. If we do not flush
> the data before dropping the cache and a crash occurs before the punching
> transaction is committed, B's transaction will never recover, resulting
> in the loss of B's data. Therefore, this cannot happen in non-journaled
> data writes.

Yes, you're correct. The logic in journal_unmap_buffer() (used when freeing
journaled data blocks) assumes that if there's no running / committing
transaction, then orphan replay is going to fixup possible partial
operations and thus it simply discards the block that's being freed. That
works for truncate but not for hole punch or range zeroing.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR