Re: [PATCH] btrfs: don't let shrinker touch extent_maps that are being logged
From: Filipe Manana
Date: Mon Jun 29 2026 - 11:19:12 EST
On Mon, Jun 29, 2026 at 3:56 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Mon, 2026-06-29 at 15:19 +0100, Filipe Manana wrote:
> > On Mon, Jun 29, 2026 at 2:41 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > >
> > > The extent map shrinker can free an extent map that is still owned by an
> > > in-flight fsync and still linked on the inode's modified_extents list,
> > > corrupting that list and eventually causing an RCU stall.
> > >
> > > btrfs_scan_inode() currently skips EXTENT_FLAG_PINNED maps, then calls
> > > btrfs_remove_extent_mapping() followed by btrfs_free_extent_map():
> > >
> > > if (em->flags & EXTENT_FLAG_PINNED)
> > > goto next;
> > > ...
> > > btrfs_remove_extent_mapping(inode, em);
> > > btrfs_free_extent_map(em);
> > >
> > > But btrfs_remove_extent_mapping() deliberately does NOT unlink a map that
> > > has EXTENT_FLAG_LOGGING set:
> > >
> > > if (!(em->flags & EXTENT_FLAG_LOGGING))
> > > list_del_init(&em->list);
> > > remove_em(inode, em);
> > >
> > > This sets up a UAF situation where a later fsync() can trip over the
> > > now-freed extent_map still on the modified_extents() list.
> >
> > I don't see how that can happen, because fsync always takes the
> > inode's i_mmap_lock (in write/exclusive mode), and btrfs_scan_inode()
> > takes the same lock too (in shared/read mode).
> > EXTENT_FLAG_LOGGING is only set and cleared during fsync, so how can
> > the shrinker race with fsync?
> >
>
> ...and yet, the vmcore shows that the shrinker hit an extent_map that
> had LOGGING set while holding the i_mmap_lock.
Yes, but we need to understand why...
>
> I had hoped to add an assertion here that LOGGING couldn't be set on
> the em's after taking the i_mmap_lock, but that's not viable:
>
> The problem is that the i_mmap_lock coverage isn't complete. I had
> Claude assist me with some analysis so take it with a hefty grain of
> salt but the problem is that the em tree can contain em's from other
> inodes where the i_mmap_lock is not held:
And that is the problem: an em tree, which belongs to an inode, should
only have extent maps for that inode...
An extent map can only belong to an inode.
So that's what needs investigation. The bug is elsewhere.
Thanks.
>
> -----------------------8<---------------------
>
> ...that inode is in progress, therefore no em in its tree should have LOGGING set. Your invariant is the right one.
>
> But it's violated, because the i_mmap_lock interlock is incomplete. Two facts:
>
> 1. tree-log.c never takes i_mmap_lock — grep returns 0 occurrences. Only btrfs_sync_file takes it (file.c:1608/1610), and only
> on the one inode being fsync'd.
> 2. A single fsync logs many other inodes, and each can set LOGGING on its own ems via btrfs_log_changed_extents().
> btrfs_log_inode() is called recursively on conflicting inodes, parent directories, and referenced inodes:
> tree-log.c:5982 btrfs_log_inode(di_inode, log_mode) # dir dentries
> tree-log.c:6339 btrfs_log_inode(inode, LOG_INODE_ALL) # conflicting inodes
> tree-log.c:6900 btrfs_log_inode(di_inode, log_mode) # new dir dentries
> tree-log.c:7389 btrfs_log_inode(dir_inode, LOG_INODE_ALL) # log_all_parents
> tree-log.c:7488 btrfs_log_inode(inode, ...) # referenced inodes
>
> 2. For all of these, btrfs_sync_file holds i_mmap_lock on the original fsync'd inode — not on these side inodes. So a side
> inode X gets LOGGING set on its ems while X's own i_mmap_lock is free.
>
> So the shrinker can down_read_trylock(X->i_mmap_lock) successfully (nobody holds it) and walk X's tree while X's ems are
> LOGGING-flagged → it removes+frees one → corruption. The interlock only protects the directly-fsync'd inode, not the inodes
> logged as its dependencies.
>
>
> > Also, there should be no () after modified_extents (it's not a function name).
> >
> >
>
> Sorry, early morning typo. Fixed in my local repo.
>
> >
> > >
> > > Fix it by having the shrinker skip maps that are being logged, the same
> > > way it skips pinned maps. Such a map is owned by the in-flight fsync and
> > > will become reclaimable again once logging clears the flag.
> > >
> > > Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
> > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > ---
> > > We've started hitting a number of these problems in our fleet. It
> > > seems to mostly happen on ARM64 architecture, but there have been some
> > > WARN_ONs that popped on x86_64 too.
> > > ---
> > > fs/btrfs/extent_map.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > > index fce9c5cc0122..128f7800e101 100644
> > > --- a/fs/btrfs/extent_map.c
> > > +++ b/fs/btrfs/extent_map.c
> > > @@ -1166,7 +1166,13 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> > > em = rb_entry(node, struct extent_map, rb_node);
> > > ctx->scanned++;
> > >
> > > - if (em->flags & EXTENT_FLAG_PINNED)
> > > + /*
> > > + * Skip extent maps that are pinned or are being logged. The
> > > + * i_mmap_lock should prevent this from seeing LOGGING on extent_maps
> > > + * directly associated with inode, but em may be associated with
> > > + * other, dependent inodes and their locks are not held.
> > > + */
> > > + if (em->flags & (EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING))
> > > goto next;
> > >
> > > /*
> > >
> > > ---
> > > base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
> > > change-id: 20260629-btrfs-skip-logging-3e31701d9647
> > >
> > > Best regards,
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> > >
> > >
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>