Re: [PATCH] btrfs: don't let shrinker touch extent_maps that are being logged
From: Filipe Manana
Date: Mon Jun 29 2026 - 12:12:01 EST
On Mon, Jun 29, 2026 at 4:06 PM Filipe Manana <fdmanana@xxxxxxxxxx> wrote:
>
> On Mon, Jun 29, 2026 at 3:56 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Mon, 2026-06-29 at 15:19 +0100, Filipe Manana wrote:
> > > On Mon, Jun 29, 2026 at 2:41 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > The extent map shrinker can free an extent map that is still owned by an
> > > > in-flight fsync and still linked on the inode's modified_extents list,
> > > > corrupting that list and eventually causing an RCU stall.
> > > >
> > > > btrfs_scan_inode() currently skips EXTENT_FLAG_PINNED maps, then calls
> > > > btrfs_remove_extent_mapping() followed by btrfs_free_extent_map():
> > > >
> > > > if (em->flags & EXTENT_FLAG_PINNED)
> > > > goto next;
> > > > ...
> > > > btrfs_remove_extent_mapping(inode, em);
> > > > btrfs_free_extent_map(em);
> > > >
> > > > But btrfs_remove_extent_mapping() deliberately does NOT unlink a map that
> > > > has EXTENT_FLAG_LOGGING set:
> > > >
> > > > if (!(em->flags & EXTENT_FLAG_LOGGING))
> > > > list_del_init(&em->list);
> > > > remove_em(inode, em);
> > > >
> > > > This sets up a UAF situation where a later fsync() can trip over the
> > > > now-freed extent_map still on the modified_extents() list.
> > >
> > > I don't see how that can happen, because fsync always takes the
> > > inode's i_mmap_lock (in write/exclusive mode), and btrfs_scan_inode()
> > > takes the same lock too (in shared/read mode).
> > > EXTENT_FLAG_LOGGING is only set and cleared during fsync, so how can
> > > the shrinker race with fsync?
> > >
> >
> > ...and yet, the vmcore shows that the shrinker hit an extent_map that
> > had LOGGING set while holding the i_mmap_lock.
>
> Yes, but we need to understand why...
>
> >
> > I had hoped to add an assertion here that LOGGING couldn't be set on
> > the em's after taking the i_mmap_lock, but that's not viable:
> >
> > The problem is that the i_mmap_lock coverage isn't complete. I had
> > Claude assist me with some analysis so take it with a hefty grain of
> > salt but the problem is that the em tree can contain em's from other
> > inodes where the i_mmap_lock is not held:
>
> And that is the problem: an em tree, which belongs to an inode, should
> only have extent maps for that inode...
> An extent map can only belong to an inode.
>
> So that's what needs investigation. The bug is elsewhere.
>
> Thanks.
>
> >
> > -----------------------8<---------------------
> >
> > ...that inode is in progress, therefore no em in its tree should have LOGGING set. Your invariant is the right one.
> >
> > But it's violated, because the i_mmap_lock interlock is incomplete. Two facts:
> >
> > 1. tree-log.c never takes i_mmap_lock — grep returns 0 occurrences. Only btrfs_sync_file takes it (file.c:1608/1610), and only
> > on the one inode being fsync'd.
> > 2. A single fsync logs many other inodes, and each can set LOGGING on its own ems via btrfs_log_changed_extents().
A single fsync can log other inodes yes, and without taking their
i_mmap_lock, but the inodes are logged in LOG_INODE_EXISTS mode, so
extents are not logged.
The only exception is for symlink inodes, but those are logged in full
mode, only have one extent (inlined), and always have
BTRFS_INODE_NEEDS_FULL_SYNC set, so we never end up in
btrfs_log_changed_extents() for symlinks.
> > btrfs_log_inode() is called recursively on conflicting inodes, parent directories, and referenced inodes:
> > tree-log.c:5982 btrfs_log_inode(di_inode, log_mode) # dir dentries
> > tree-log.c:6339 btrfs_log_inode(inode, LOG_INODE_ALL) # conflicting inodes
> > tree-log.c:6900 btrfs_log_inode(di_inode, log_mode) # new dir dentries
> > tree-log.c:7389 btrfs_log_inode(dir_inode, LOG_INODE_ALL) # log_all_parents
> > tree-log.c:7488 btrfs_log_inode(inode, ...) # referenced inodes
> >
> > 2. For all of these, btrfs_sync_file holds i_mmap_lock on the original fsync'd inode — not on these side inodes. So a side
> > inode X gets LOGGING set on its ems while X's own i_mmap_lock is free.
Nop, see above.
> >
> > So the shrinker can down_read_trylock(X->i_mmap_lock) successfully (nobody holds it) and walk X's tree while X's ems are
> > LOGGING-flagged → it removes+frees one → corruption. The interlock only protects the directly-fsync'd inode, not the inodes
> > logged as its dependencies.
Nop. Unless you can show an actual code path where while fsyncing one
inode we log another inode in LOG_INODE_ALL mode
Again, the exception is symlinks, but for those, the
BTRFS_INODE_NEEDS_FULL_SYNC flag is always set, so we never end up in
btrfs_log_changed_extents().
> >
> >
> > > Also, there should be no () after modified_extents (it's not a function name).
> > >
> > >
> >
> > Sorry, early morning typo. Fixed in my local repo.
> >
> > >
> > > >
> > > > Fix it by having the shrinker skip maps that are being logged, the same
> > > > way it skips pinned maps. Such a map is owned by the in-flight fsync and
> > > > will become reclaimable again once logging clears the flag.
> > > >
> > > > Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
> > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > ---
> > > > We've started hitting a number of these problems in our fleet. It
> > > > seems to mostly happen on ARM64 architecture, but there have been some
> > > > WARN_ONs that popped on x86_64 too.
> > > > ---
> > > > fs/btrfs/extent_map.c | 8 +++++++-
> > > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > > > index fce9c5cc0122..128f7800e101 100644
> > > > --- a/fs/btrfs/extent_map.c
> > > > +++ b/fs/btrfs/extent_map.c
> > > > @@ -1166,7 +1166,13 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> > > > em = rb_entry(node, struct extent_map, rb_node);
> > > > ctx->scanned++;
> > > >
> > > > - if (em->flags & EXTENT_FLAG_PINNED)
> > > > + /*
> > > > + * Skip extent maps that are pinned or are being logged. The
> > > > + * i_mmap_lock should prevent this from seeing LOGGING on extent_maps
> > > > + * directly associated with inode, but em may be associated with
> > > > + * other, dependent inodes and their locks are not held.
> > > > + */
> > > > + if (em->flags & (EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING))
> > > > goto next;
> > > >
> > > > /*
> > > >
> > > > ---
> > > > base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
> > > > change-id: 20260629-btrfs-skip-logging-3e31701d9647
> > > >
> > > > Best regards,
> > > > --
> > > > Jeff Layton <jlayton@xxxxxxxxxx>
> > > >
> > > >
> >
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>