Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry

From: Jan Kara
Date: Wed Dec 04 2024 - 07:42:41 EST


On Tue 22-10-24 19:10:43, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> In the iomap_write_iter(), the iomap buffered write frame does not hold
> any locks between querying the inode extent mapping info and performing
> page cache writes. As a result, the extent mapping can be changed due to
> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
> write-back process faces a similar problem: concurrent changes can
> invalidate the extent mapping before the I/O is submitted.
>
> Therefore, both of these processes must recheck the mapping info after
> acquiring the folio lock. To address this, similar to XFS, we propose
> introducing an extent sequence number to serve as a validity cookie for
> the extent. We will increment this number whenever the extent status
> tree changes, thereby preparing for the buffered write iomap conversion.
> Besides, it also changes the trace code style to make checkpatch.pl
> happy.
>
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>

Overall using some sequence counter makes sense.

> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index c786691dabd3..bea4f87db502 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
> return es->es_lblk + es->es_len - 1;
> }
>
> +static inline void ext4_es_inc_seq(struct inode *inode)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> +
> + WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
> +}

This looks potentially dangerous because we can loose i_es_seq updates this
way. Like

CPU1 CPU2
x = READ_ONCE(ei->i_es_seq)
x = READ_ONCE(ei->i_es_seq)
WRITE_ONCE(ei->i_es_seq, x + 1)
...
potentially many times
WRITE_ONCE(ei->i_es_seq, x + 1)
-> the counter goes back leading to possibly false equality checks

I think you'll need to use atomic_t and appropriate functions here.

> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
> BUG_ON(end < lblk);
> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>
> + ext4_es_inc_seq(inode);

I'm somewhat wondering: Are extent status tree modifications the right
place to advance the sequence counter? The counter needs to advance
whenever the mapping information changes. This means that we'd be
needlessly advancing the counter (and thus possibly forcing retries) when
we are just adding new information from ordinary extent tree into cache.
Also someone can be doing extent tree manipulations without touching extent
status tree (if the information was already pruned from there). So I think
needs some very good documentation what are the expectations from the
sequence counter and explanations why they are satisfied so that we don't
break this in the future.

Honza


--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR