Re: [PATCH] ext4: only update i_reserved_data_blocks on successful block allocation

From: Jan Kara
Date: Tue Mar 28 2023 - 06:00:59 EST


On Mon 27-03-23 21:09:42, Baokun Li wrote:
> On 2023/3/27 20:47, Jan Kara wrote:
> > On Sat 25-03-23 14:34:43, Baokun Li wrote:
> > > In our fault injection test, we create an ext4 file, migrate it to
> > > non-extent based file, then punch a hole and finally trigger a WARN_ON
> > > in the ext4_da_update_reserve_space():
> > >
> > > EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
> > > ino 14, used 11 with only 10 reserved data blocks
> > >
> > > When writing back a non-extent based file, if we enable delalloc, the
> > > number of reserved blocks will be subtracted from the number of blocks
> > > mapped by ext4_ind_map_blocks(), and the extent status tree will be
> > > updated. We update the extent status tree by first removing the old
> > > extent_status and then inserting the new extent_status. If the block range
> > > we remove happens to be in an extent, then we need to allocate another
> > > extent_status with ext4_es_alloc_extent().
> > >
> > > use old to remove to add new
> > > |----------|------------|------------|
> > > old extent_status
> > >
> > > The problem is that the allocation of a new extent_status failed due to a
> > > fault injection, and __es_shrink() did not get free memory, resulting in
> > > a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
> > > we map to the same extent again, and the number of reserved blocks is again
> > > subtracted from the number of blocks in that extent. Since the blocks in
> > > the same extent are subtracted twice, we end up triggering WARN_ON at
> > > ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
> > Hum, but this second call to ext4_map_blocks() should find already allocated
> > blocks in the indirect block and thus should not be subtracting
> > ei->i_reserved_data_blocks for the second time. What am I missing?
> >
> > Honza
> >
> ext4_map_blocks
>   1. Lookup extent status tree firstly
>        goto found;
>   2. get the block without requesting a new file system block.
> found:
>   3. ceate and map the block
>
> When we call ext4_map_blocks() for the second time, we directly find the
> corresponding blocks in the extent status tree, and then go directly to step
> 3,
> because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
> but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
> ei->i_reserved_data_blocks
> for the second time.

Ah, I see. Thanks for explanation. But then the problem is deeper than just
a mismatch in number of reserved delalloc block. The problem really is that
if extent status tree update fails, we have inconsistency between what is
stored in the extent status tree and what is stored on disk. And that can
cause even data corruption issues in some cases.

So I think we rather need to work on handling of errors in extent status
tree operations. In the extent status tree, we have extents which we can
just drop without issues and extents we must not drop - this depends on the
extent's status - currently ext4_es_is_delayed() extents must stay, others
may be dropped but I'd wrap the decision in a helper function.

I'm currently inclined towards the following:

1) Removal must never fail. If we need to split extent, we use GFP_NOFAIL
if we cannot just drop the second part of the split extent in case of
allocation failure.

2) Similarly if inserting extent that cannot be dropped, we use GFP_NOFAIL.

3) We do not try to "undo" failed operations like we currently do - with
the above rules we never loose information that cannot be restored.

And this should also fix the problem you've hit because in case of
allocation failure we may just end up with removed extent from the extent
status tree and thus we refetch info from the disk and find out blocks are
already allocated.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR