Re: [RFC PATCH] btrfs: defer freeing of subpage private state to free_folio

From: Qu Wenruo

Date: Fri Jan 30 2026 - 02:29:36 EST

On 2026/1/30 17:04, Boris Burkov wrote:

On Fri, Jan 30, 2026 at 01:46:59PM +1030, Qu Wenruo wrote:

On 2026/1/30 09:38, JP Kobryn wrote:
[...]

The patch also might have the advantage of being easy to backport to the
LTS trees. On that note, it's worth mentioning that we encountered a kernel
panic as a result of this sequence on a 6.16-based arm64 host (configured
with 64k pages so btrfs is in subpage mode). On our 6.16 kernel, the race
window is shown below between points A and B:

[mm] page cache reclaim path [fs] relocation in subpage mode
shrink_folio_list()
folio_trylock() /* lock acquired */
filemap_release_folio()
mapping->a_ops->release_folio()
btrfs_release_folio()
__btrfs_release_folio()
clear_folio_extent_mapped()
btrfs_detach_folio_state()
bfs = folio_detach_private(folio)
btrfs_free_folio_state(folio)
kfree(bfs) /* point A */

prealloc_file_extent_cluster()
filemap_lock_folio()

Would you mind explaining which function calls filemap_lock_folio()?

I guess it's filemap_invalidate_inode() -> filemap_fdatawrite_range() ->
filemap_writeback() -> btrfs_writepages() -> extent_write_cache_pages().

I think you may have missed it in the diagram, and some of the function
names may have shifted a bit between kernels, but it is relocation.

On current btrfs/for-next, I think it would be:

relocate_file_extent_cluster()
relocate_one_folio()
filemap_lock_folio()

Thanks. Indeed, the filemap_lock_folio() call inside prealloc_file_extent_cluster() exists only in the v6.16 code base, which manually invalidates partial folios.

That code is gone now, replaced by a much cleaner solution.

folio_try_get() /* inc refcount */
folio_lock() /* wait for lock */

Another question: since the folio has already been released in the mm
path, it should not have the dirty flag set.

That means that inside extent_write_cache_pages(), folio_test_dirty() should
return false, and we should just unlock the folio without touching it
any further.

Would you mind explaining why we still continue the writeback of a non-dirty folio?

I think this question is answered by the above as well: we aren't in
writeback, we are in relocation.

I see the problem now. Thankfully, commit 4e346baee95f ("btrfs: reloc: unconditionally invalidate the page cache for each cluster") fixes the behavior.

And yes, the old code can indeed hit the problem.

Still, commit 4e346baee95f ("btrfs: reloc: unconditionally invalidate the page cache for each cluster") shouldn't be that hard to backport.

Thanks,
Qu

Thanks,
Boris

__remove_mapping()
if (!folio_ref_freeze(folio, refcount)) /* point B */
goto cannot_free /* folio remains in cache */

folio_unlock(folio) /* lock released */

/* lock acquired */
btrfs_subpage_clear_uptodate()

Would you mind providing more context on where the
btrfs_subpage_clear_uptodate() call comes from?

bfs = folio->priv /* use-after-free */

This exact race during relocation should not occur in the latest upstream
code, but it's an example of a backport opportunity for this patch.
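
The interleaving in the diagram above can be replayed deterministically in userspace. The following is only a minimal single-threaded model under stated assumptions: every name in it (model_folio, model_state, replay() and so on) is hypothetical and none of it is kernel code. It tracks whether the private state is still alive instead of dereferencing freed memory, so the "early free" case shows up as a failed check rather than an actual use-after-free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_state {
	bool alive;	/* false once the model has "kfree'd" the state */
};

struct model_folio {
	struct model_state *private;	/* stands in for folio->private */
	int refcount;			/* stands in for the folio refcount */
	bool in_cache;			/* still present in the page cache? */
};

struct replay_result {
	bool access_ok;		/* relocation saw live state (or never looked) */
	bool state_freed;	/* state had been freed by the end of the replay */
};

/* release_folio step: free immediately (point A) or leave the state attached. */
static void model_release(struct model_folio *f, bool defer_free)
{
	if (defer_free || !f->private)
		return;
	f->private->alive = false;
	f->private = NULL;
}

/* __remove_mapping step: succeeds only if nobody else holds a reference. */
static bool model_remove_mapping(struct model_folio *f, int expected)
{
	if (f->refcount != expected)
		return false;	/* point B: folio remains in cache */
	f->in_cache = false;
	return true;
}

/* free_folio step: runs only after the folio has left the page cache. */
static void model_free_folio(struct model_folio *f)
{
	if (!f->private)
		return;
	f->private->alive = false;
	f->private = NULL;
}

/* Replays the reclaim path, optionally with relocation racing in between. */
struct replay_result replay(bool defer_free, bool racing_relocation)
{
	struct model_state st = { .alive = true };
	struct model_folio f = { .private = &st, .refcount = 1, .in_cache = true };
	struct replay_result res = { .access_ok = true, .state_freed = false };

	model_release(&f, defer_free);
	if (racing_relocation)
		f.refcount++;	/* folio_try_get() between points A and B */
	if (model_remove_mapping(&f, 1))
		model_free_folio(&f);
	if (racing_relocation) {
		/* relocation now holds the lock and looks at the subpage state */
		res.access_ok = f.private != NULL && f.private->alive;
		f.refcount--;
	}
	res.state_freed = !st.alive;
	return res;
}
```

In this model, replay(false, true) reports a failed access because the state was freed at point A while the folio stayed in the cache, whereas replay(true, true) keeps the state alive for the relocation side. replay(true, false) shows that the deferred variant still frees the state once the folio really leaves the cache.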

Would you also mind explaining what is missing in the 6.16 kernel that
causes the above use-after-free?

Signed-off-by: JP Kobryn <inwardvessel@xxxxxxxxx>
---
fs/btrfs/extent_io.c | 6 ++++--
fs/btrfs/inode.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3df399dc8856..d83d3f9ae3af 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -928,8 +928,10 @@ void clear_folio_extent_mapped(struct folio *folio)
return;
fs_info = folio_to_fs_info(folio);
- if (btrfs_is_subpage(fs_info, folio))
- return btrfs_detach_folio_state(fs_info, folio, BTRFS_SUBPAGE_DATA);
+ if (btrfs_is_subpage(fs_info, folio)) {
+ /* freeing of private subpage data is deferred to btrfs_free_folio */
+ return;
+ }

Another question: why are only two filesystems (nfs for directory inodes,
and orangefs) using the free_folio() callback?

Iomap does the same as btrfs and only calls ifs_free() in
release_folio() and invalidate_folio().

Thus it looks like the free_folio() callback is not the recommended way to
free the folio->private pointer.

Cc'ing the fsdevel list to ask whether the free_folio() callback should
gain new users.

folio_detach_private(folio);

This means that for regular folios, we still remove the private flag from
the folio here.

It may be fine in most cases, as we will not touch folio->private anyway,
but it still looks like inconsistent behavior, especially since the
free_folio() callback has handling for both cases.
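
To make the asymmetry concrete, here is a small userspace sketch of the release path as the patch shapes it. It is only a model under stated assumptions: model_folio, MODEL_EXTENT_FOLIO_PRIVATE and the helper names are hypothetical stand-ins, a NULL private pointer stands in for "private flag cleared", and only the release_folio followed by free_folio sequence is covered (invalidation is not modeled).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the EXTENT_FOLIO_PRIVATE marker. */
#define MODEL_EXTENT_FOLIO_PRIVATE ((void *)1)

struct model_folio {
	void *private;	/* NULL models "private flag not set" */
};

enum model_free_action {
	MODEL_NOTHING_ATTACHED,
	MODEL_MARKER_DETACHED,
	MODEL_STATE_FREED,
};

/* Models clear_folio_extent_mapped() as changed by the patch. */
static void model_clear_extent_mapped(struct model_folio *f, bool subpage)
{
	if (!f->private)
		return;
	if (subpage)
		return;		/* freeing deferred to the free_folio model */
	f->private = NULL;	/* regular folio: marker detached right away */
}

/* Models the proposed btrfs_free_folio(). */
static enum model_free_action model_free_folio(struct model_folio *f)
{
	void *p = f->private;

	if (!p)
		return MODEL_NOTHING_ATTACHED;
	f->private = NULL;
	if (p == MODEL_EXTENT_FOLIO_PRIVATE)
		return MODEL_MARKER_DETACHED;
	free(p);
	return MODEL_STATE_FREED;
}

/* Release followed by free for one folio; reports what free_folio had to do. */
enum model_free_action model_release_then_free(bool subpage)
{
	struct model_folio f;

	f.private = subpage ? malloc(16) : MODEL_EXTENT_FOLIO_PRIVATE;
	if (!f.private)
		return MODEL_NOTHING_ATTACHED;	/* allocation failed in the model */
	model_clear_extent_mapped(&f, subpage);
	return model_free_folio(&f);
}
```

In this model, model_release_then_free(false) finds nothing attached by the time the free step runs, while model_release_then_free(true) frees the state there. So along this path the two folio kinds drop their private data at different stages, and the marker branch in the free step is not exercised.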

Thanks,
Qu

}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b8abfe7439a3..7a832ee3b591 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7565,6 +7565,23 @@ static bool btrfs_release_folio(struct folio *folio, gfp_t gfp_flags)
return __btrfs_release_folio(folio, gfp_flags);
}
+/* frees subpage private data if present */
+static void btrfs_free_folio(struct folio *folio)
+{
+ struct btrfs_folio_state *bfs;
+
+ if (!folio_test_private(folio))
+ return;
+
+ bfs = folio_detach_private(folio);
+ if (bfs == (void *)EXTENT_FOLIO_PRIVATE) {
+ /* extent map flag is detached in btrfs_release_folio */
+ return;
+ }
+
+ btrfs_free_folio_state(bfs);
+}
+
#ifdef CONFIG_MIGRATION
static int btrfs_migrate_folio(struct address_space *mapping,
struct folio *dst, struct folio *src,
@@ -10651,6 +10668,7 @@ static const struct address_space_operations btrfs_aops = {
.invalidate_folio = btrfs_invalidate_folio,
.launder_folio = btrfs_launder_folio,
.release_folio = btrfs_release_folio,
+ .free_folio = btrfs_free_folio,
.migrate_folio = btrfs_migrate_folio,
.dirty_folio = filemap_dirty_folio,
.error_remove_folio = generic_error_remove_folio,