Re: [PATCH] mm: migrate: buffer_migrate_folio_norefs() fallback migrate not uptodate pages

From: Zhihao Cheng
Date: Thu Aug 25 2022 - 07:32:26 EST


在 2022/8/25 18:57, Jan Kara 写道:
On Thu 25-08-22 16:01:46, Zhihao Cheng wrote:
From: Zhang Yi <yi.zhang@xxxxxxxxxx>

Recently we notice that ext4 filesystem occasionally fail to read
metadata from disk and report error message, but the disk and block
layer looks fine. After analyse, we lockon commit 88dbcbb3a484
("blkdev: avoid migration stalls for blkdev pages"). It provide a
migration method for the bdev, we could move page that has buffers
without extra users now, but it will lock the buffers on the page, which
breaks a lot of current filesystem's fragile metadata read operations,
like ll_rw_block() for common usage and ext4_read_bh_lock() for ext4,
these helpers just trylock the buffer and skip submit IO if it lock
failed, many callers just wait_on_buffer() and conclude IO error if the
buffer is not uptodate after buffer unlocked.

This issue could be easily reproduced by add some delay just after
buffer_migrate_lock_buffers() in __buffer_migrate_folio() and do
fsstress on ext4 filesystem.

EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #73193:
comm fsstress: reading directory lblock 0
EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #75334:
comm fsstress: reading directory lblock 0

Something like ll_rw_block() should be used carefully and seems could
only be safely used for the readahead case. So the best way is to fix
the read operations in filesystem in the long run, but now let us avoid
this issue first. This patch avoid this issue by fallback to migrate
pages that are not uotodate like fallback_migrate_folio(), those pages
that has buffers may probably do read operation soon.

Fixes: 88dbcbb3a484 ("blkdev: avoid migration stalls for blkdev pages")
Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
Signed-off-by: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>

Thanks for the analysis and the fix! As you noted above this is actually a
bug in the filesystems that they assume that locked buffer means it is
under IO. Usually that is the case but there are other places that lock
the buffer without doing IO. Page migration is one of them, jbd2 machinery
another one, there may be others.

So I think this really ought to be fixed in filesystems instead of papering
over the bug in the migration code. I agree this is more work but we will
reduce the technical debt, not increase it :). Honestly, ll_rw_block()
should just die. It is actively dangerous to use. Instead we should have
one call for readahead of bhs and the rest should be converted to
submit_bh() or similar calls. There are like 25 callers remaining so it
won't be even that hard.

And then we have the same buggy code as in ll_rw_block() copied to
ext4_bread_batch() (ext4_read_bh_lock() in particular) so that needs to be
fixed as well...

Honza

You are right, Jan. I totally agree with you. It seems that I shouldn't have been lazy.