On Mon 03-04-23 22:02:56, Baokun Li wrote:
On 2023/3/30 0:22, Jan Kara wrote:Well, do_writepages() will not retry if wbc->sync_mode == WB_SYNC_NONE. So
On Wed 29-03-23 15:23:19, Baokun Li wrote:Yes! There is an inconsistency here, but do_writepages finds that the
On 2023/3/28 18:00, Jan Kara wrote:Yes, this is what I was expecting. So now extent status tree is
On Mon 27-03-23 21:09:42, Baokun Li wrote:The scenario we encountered was this:
On 2023/3/27 20:47, Jan Kara wrote:Ah, I see. Thanks for explanation. But then the problem is deeper than just
On Sat 25-03-23 14:34:43, Baokun Li wrote:ext4_map_blocks
In our fault injection test, we create an ext4 file, migrate it toHum, but this second call to ext4_map_blocks() should find already allocated
non-extent based file, then punch a hole and finally trigger a WARN_ON
in the ext4_da_update_reserve_space():
EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
ino 14, used 11 with only 10 reserved data blocks
When writing back a non-extent based file, if we enable delalloc, the
number of reserved blocks will be subtracted from the number of blocks
mapped by ext4_ind_map_blocks(), and the extent status tree will be
updated. We update the extent status tree by first removing the old
extent_status and then inserting the new extent_status. If the block range
we remove happens to be in an extent, then we need to allocate another
extent_status with ext4_es_alloc_extent().
use old to remove to add new
|----------|------------|------------|
old extent_status
The problem is that the allocation of a new extent_status failed due to a
fault injection, and __es_shrink() did not get free memory, resulting in
a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
we map to the same extent again, and the number of reserved blocks is again
subtracted from the number of blocks in that extent. Since the blocks in
the same extent are subtracted twice, we end up triggering WARN_ON at
ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
blocks in the indirect block and thus should not be subtracting
ei->i_reserved_data_blocks for the second time. What am I missing?
Honza
1. Lookup extent status tree firstly
goto found;
2. get the block without requesting a new file system block.
found:
3. ceate and map the block
When we call ext4_map_blocks() for the second time, we directly find the
corresponding blocks in the extent status tree, and then go directly to step
3,
because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
ei->i_reserved_data_blocks
for the second time.
a mismatch in number of reserved delalloc block. The problem really is that
if extent status tree update fails, we have inconsistency between what is
stored in the extent status tree and what is stored on disk. And that can
cause even data corruption issues in some cases.
```
write:
ext4_es_insert_delayed_block
[0/16) 576460752303423487 (U,D)
writepages:
alloc lblk 11 pblk 35328
[0/16) 576460752303423487 (U,D)
-- remove block 11 from extent
[0/11) 576460752303423487 (U,D,R) + (Newly allocated)[12/4)
549196775151 (U,D,R)
--Failure to allocate memory for a new extent will undo as:
[0/16) 576460752303423487 (U,D,R)
inconsistent with the on-disk allocation info because the block 11 is
already allocated on disk but recorded as unallocated in the extent status
tree.
writeback returns -ENOMEM and keeps retrying until it succeeds, at which
point the above inconsistency does not exist.
the inconsistency can stay for a long time.
No, you are right, this case is safe. Still I suspect inconsistencies withIf the similar problem happened say when we punch a hole into a middle of aext4_punch_hole
written extent and so block on disk got freed but extent status tree would
still record it as allocated, user would be able to access freed block thus
potentially exposing sensitive data.
// remove extents in extents status tree
ext4_es_remove_extent
// remove extents tree on disk
ext4_ext_remove_space
In this scenario, we always try to delete the extents in the in-memory
extents status tree first, and then delete the extents tree on disk. So
even if we fail in deleting extents in memory, there is no inconsistency,
am I missing something?
extent status tree can cause more problems and possibly stale data
exposure.
I see, thanks for explanation! Indeed it may be good to cleanup this codeYes, our code is indeed designed to only update the number of reserved-- if success insert block 11 to extent status treeSee above.
[0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 (W)
+ [12/4) 549196775151 (U,D,R)
U: UNWRITTEN
D: DELAYED
W: WRITTEN
R: REFERENCED
```
When we fail to allocate a new extent, we don't map buffer and we don't do
io_submit, so why is the extent tree in memory inconsistent with the one
stored on disk? Am I missing something?
I would appreciate it if you could explain under what cases and what kind of
data corruption issues can be caused.
I guess there is some misunderstanding here. Both withAnd this should also fix the problem you've hit because in case ofReloading extent tree from disk I don't quite understand here, how do we
allocation failure we may just end up with removed extent from the extent
status tree and thus we refetch info from the disk and find out blocks are
already allocated.
handle reserved blocks? could you explain it in more detail?
Logically, I think it is still necessary to update i_reserved_data_blocks
only after a successful allocation. This is also done in
ext4_ext_map_blocks().
ext4_ext_map_blocks() and ext4_ind_map_blocks() we end up updating
i_reserved_data_blocks only after the blocks are successfully allocated and
inserted in the respective data structure but *before* updating extent
status tree. If extent status tree operation fails, we currently get
inconsistency between extent status tree and on-disk info in both cases
AFAICS. Am I missing something?
blocks after the block allocation is complete. We have different
treatment for extent based file and non-extent based file in commit
5f634d064c70 ("ext4: Fix quota accounting error with fallocate").
For extent based file, we update the number of reserved blocks before the
"got_allocated_blocks" tag after the blocks are successfully allocated in
ext4_ext_map_blocks().
For the non-extent based file we update the number of reserved blocks
after ext4_ind_map_blocks() is executed, which leads to the problem that
when we call ext4_ind_map_blocks() to create a block, it does not always
create a block. For example, if the extents status tree we encountered
earlier does not match the extents tree on disk, this is of course a
problem in itself, but in terms of code logic, updating the number of
reserved blocks as ext4_ext_map_blocks() does can prevent us from trying
to create a block and not creating it, resulting in an incorrect number
of reserved blocks.
so that indirect block and extent based inodes are handled in the same way.
Honza