Re: [PATCH] btrfs: wait for in-flight readahead BIOs on open_ctree() error

From: Qu Wenruo

Date: Mon Mar 30 2026 - 17:48:50 EST

On 2026/3/31 04:30, Teng Liu wrote:

On 2026-03-30 08:51, Qu Wenruo wrote:

On 2026/3/30 08:36, Qu Wenruo wrote:

Even if you wait for all bios, it can still cause problems.

As the bio counter only covers the btrfs bio layer, btrfs_bio::end_io
is still called after btrfs_bio_counter_dec().

And if the whole fs_info has been freed by then, end_bbio_meta_read()
can still run into problems, as btrfs_validate_extent_buffer() will
access eb (bbio->private) and fs_info (eb->fs_info), which triggers a
use-after-free.

So using that bio counter is not going to solve all problems; it only
reduces the race window, thus masking the problem.
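
To make the ordering issue concrete, here is a rough userspace model,
not real btrfs code. Every mock_* name is made up for illustration, and
"freed" is just a flag so that the sketch itself never touches freed
memory. It only shows that an error path waiting on the counter alone
can run between the counter drop and the end_io callback:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real btrfs structures. */
struct mock_fs_info {
	int bio_counter;	/* models the btrfs bio layer counter */
	bool freed;		/* set once the error path "frees" fs_info */
};

struct mock_bbio {
	struct mock_fs_info *fs_info;
	bool touched_freed_fs_info;	/* records a would-be use-after-free */
};

/* Models the end_io callback, which still looks at fs_info. */
static void mock_end_io(struct mock_bbio *bbio)
{
	if (bbio->fs_info->freed)
		bbio->touched_freed_fs_info = true;
}

/* Models an error path that only waits for the counter to reach zero. */
static bool mock_error_path_try_free(struct mock_fs_info *fs_info)
{
	if (fs_info->bio_counter != 0)
		return false;
	fs_info->freed = true;
	return true;
}

/*
 * Models bio completion in the order described above: the counter is
 * dropped first and the end_io callback runs afterwards. The error path
 * is allowed to run in between, which is the race window.
 */
static void mock_complete_bio(struct mock_bbio *bbio)
{
	bbio->fs_info->bio_counter--;
	mock_error_path_try_free(bbio->fs_info);
	mock_end_io(bbio);
}
```

With one bio in flight, calling mock_complete_bio() leaves
touched_freed_fs_info set, which is the case the counter cannot cover.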

The following ideas come to mind, but none seems as simple as your
current one:

1) Introduce a dedicated counter for metadata readahead/reads
   This seems to be the simplest one of all.
   But its only user would be the error handling, so it may not be
   worth it.

2) Disable metadata readahead during open_ctree()
   This will delay the mount, especially for a large extent tree
   without the bgt (block group tree) feature.

3) Use buffer_tree xarray to iterate through all ebs
   Since this is only for error handling of open_ctree(), we're fine to
   do the full xarray iteration, and wait for any eb that has
   EXTENT_BUFFER_READING flag.

   The problem is, we do not have a dedicated tag for reading ebs, the
   way PAGECACHE_TAG_(TOWRITE|DIRTY) lets us easily catch all
   dirty/writeback ebs.
   So the only option is to go through each eb and check its flags.

   I think this is the one with minimal impact, but may cause much
   longer runtime during this error handling path.

My personal preference is option 3).
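
A minimal userspace sketch of what option 3) boils down to, assuming a
flat array in place of the buffer_tree xarray. The mock_* names and the
flag values are invented for this example; the real code would need the
proper xarray iteration, locking and refcounting, none of which is
modelled here:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits; the real EXTENT_BUFFER_* values differ. */
#define MOCK_EB_READING	(1UL << 0)
#define MOCK_EB_DIRTY	(1UL << 1)

/* Hypothetical stand-in for an extent buffer. */
struct mock_eb {
	unsigned long flags;
	int reads_left;	/* wait steps until the mock read finishes */
};

/* Stand-in for one wait step; clears READING once the read is done. */
static void mock_wait_step(struct mock_eb *eb)
{
	if (eb->reads_left > 0)
		eb->reads_left--;
	if (eb->reads_left == 0)
		eb->flags &= ~MOCK_EB_READING;
}

/*
 * Walk every eb, since there is no tag to skip the idle ones, and wait
 * for each eb that still has the READING flag set.
 * Returns how many ebs had to be waited on.
 */
static size_t mock_wait_all_reading(struct mock_eb *ebs, size_t nr)
{
	size_t waited = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!(ebs[i].flags & MOCK_EB_READING))
			continue;
		while (ebs[i].flags & MOCK_EB_READING)
			mock_wait_step(&ebs[i]);
		waited++;
	}
	return waited;
}
```

The cost is linear in the number of ebs even when only a few are being
read, which is the longer runtime mentioned above.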

Or the 4th one, which is only an idea and I haven't yet verified:

4) Handle error from invalidate_inode_pages2()
Currently we just call invalidate_inode_pages2() on btree inode and
expect it to return 0.

But if there is still an eb read pending, that function will return
-EBUSY, as try_release_extent_buffer() will find an eb whose refs is
not 0, and refuse to release that eb, which belongs to a folio.

That should be a good indicator of any pending metadata reads.

So if invalidate_inode_pages2() returns -EBUSY, we should wait and
retry until it returns 0.
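
Roughly, the retry idea would look like the userspace model below. This
is unverified, same as the idea itself: mock_invalidate() only stands in
for invalidate_inode_pages2() on the btree inode, and the other mock_*
names are invented. A real version would need a proper wait primitive
rather than a busy loop:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the btree inode state. */
struct mock_btree_inode {
	int pending_reads;	/* metadata reads still in flight */
};

/* Stand-in for the invalidation: -EBUSY while any read is pending. */
static int mock_invalidate(struct mock_btree_inode *inode)
{
	return inode->pending_reads > 0 ? -EBUSY : 0;
}

/* Stand-in for waiting a bit; one pending read completes per call. */
static void mock_wait_a_bit(struct mock_btree_inode *inode)
{
	if (inode->pending_reads > 0)
		inode->pending_reads--;
}

/*
 * Retry the invalidation for as long as it reports -EBUSY.
 * Returns the final result and stores the retry count in *retries.
 */
static int mock_invalidate_until_clean(struct mock_btree_inode *inode,
				       int *retries)
{
	int ret;

	*retries = 0;
	while ((ret = mock_invalidate(inode)) == -EBUSY) {
		mock_wait_a_bit(inode);
		(*retries)++;
	}
	return ret;
}
```

Any error other than -EBUSY ends the loop and is returned to the caller
as-is.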

Thanks! Yes, it makes sense; simply waiting on the bio counter doesn't
fix the problem here.

Among the options, I prefer option 3. Although it may be slower, it
only happens in the mount failure path, so the extra cost seems
acceptable.

The problem is not limited to mount failure; it affects close_ctree() too.

It shares the same root problem: we have no way to track or wait for any pending metadata read.

I am quite new to the btrfs codebase, so I don't know whether
`invalidate_inode_pages2()` would be a reliable solution. Maybe I
should start with option 3?

Sure. Although iterating through the xarray may not be that simple either, as you may still need to look into all kinds of extra locks, the rcu lock etc. And if you apply that to the close_ctree() callsite, it may be a much bigger problem, as we have a lot more ebs there compared to mount time.

You can even mix options 3 and 4, e.g. switch to xarray iteration only after invalidate_inode_pages2() has failed with -EBUSY.

This should greatly reduce the number of ebs that are still inside the xarray, making the iteration much faster.