Re: [PATCH] btrfs: don't let shrinker touch extent_maps that are being logged
From: Jeff Layton
Date: Tue Jun 30 2026 - 08:32:44 EST
On Mon, 2026-06-29 at 16:47 +0100, Filipe Manana wrote:
> On Mon, Jun 29, 2026 at 4:06 PM Filipe Manana <fdmanana@xxxxxxxxxx> wrote:
> >
> > On Mon, Jun 29, 2026 at 3:56 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > >
> > > On Mon, 2026-06-29 at 15:19 +0100, Filipe Manana wrote:
> > > > On Mon, Jun 29, 2026 at 2:41 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > >
> > > > > The extent map shrinker can free an extent map that is still owned by an
> > > > > in-flight fsync and still linked on the inode's modified_extents list,
> > > > > corrupting that list and eventually causing an RCU stall.
> > > > >
> > > > > btrfs_scan_inode() currently skips EXTENT_FLAG_PINNED maps, then calls
> > > > > btrfs_remove_extent_mapping() followed by btrfs_free_extent_map():
> > > > >
> > > > > if (em->flags & EXTENT_FLAG_PINNED)
> > > > > goto next;
> > > > > ...
> > > > > btrfs_remove_extent_mapping(inode, em);
> > > > > btrfs_free_extent_map(em);
> > > > >
> > > > > But btrfs_remove_extent_mapping() deliberately does NOT unlink a map that
> > > > > has EXTENT_FLAG_LOGGING set:
> > > > >
> > > > > if (!(em->flags & EXTENT_FLAG_LOGGING))
> > > > > list_del_init(&em->list);
> > > > > remove_em(inode, em);
> > > > >
> > > > > This sets up a UAF situation where a later fsync() can trip over the
> > > > > now-freed extent_map still on the modified_extents() list.
> > > >
> > > > I don't see how that can happen, because fsync always takes the
> > > > inode's i_mmap_lock (in write/exclusive mode), and btrfs_scan_inode()
> > > > takes the same lock too (in shared/read mode).
> > > > EXTENT_FLAG_LOGGING is only set and cleared during fsync, so how can
> > > > the shrinker race with fsync?
> > > >
> > >
> > > ...and yet, the vmcore shows that the shrinker hit an extent_map that
> > > had LOGGING set while holding the i_mmap_lock.
> >
> > Yes, but we need to understand why...
> >
> > >
> > > I had hoped to add an assertion here that LOGGING couldn't be set on
> > > the em's after taking the i_mmap_lock, but that's not viable:
> > >
> > > The problem is that the i_mmap_lock coverage isn't complete. I had
> > > Claude assist me with some analysis so take it with a hefty grain of
> > > salt but the problem is that the em tree can contain em's from other
> > > inodes where the i_mmap_lock is not held:
> >
> > And that is the problem: an em tree, which belongs to an inode, should
> > only have extent maps for that inode...
> > An extent map can only belong to an inode.
> >
> > So that's what needs investigation. The bug is elsewhere.
> >
> > Thanks.
> >
> > >
> > > -----------------------8<---------------------
> > >
> > > ...that inode is in progress, therefore no em in its tree should have LOGGING set. Your invariant is the right one.
> > >
> > > But it's violated, because the i_mmap_lock interlock is incomplete. Two facts:
> > >
> > > 1. tree-log.c never takes i_mmap_lock — grep returns 0 occurrences. Only btrfs_sync_file takes it (file.c:1608/1610), and only
> > > on the one inode being fsync'd.
> > > 2. A single fsync logs many other inodes, and each can set LOGGING on its own ems via btrfs_log_changed_extents().
>
> A single fsync can log other inodes yes, and without taking their
> i_mmap_lock, but the inodes are logged in LOG_INODE_EXISTS mode, so
> extents are not logged.
> The only exception is for symlink inodes, but those are logged in full
> mode, only have one extent (inlined), and always have
> BTRFS_INODE_NEEDS_FULL_SYNC set, so we never end up in
> btrfs_log_changed_extents() for symlinks.
>
Thanks for the explanation. It sounds like we should wire some warnings
into the shrinker around these invariants. Like, if LOGGING is set and
the inode's not a symlink, do a WARN_ON_ONCE()? I'll plan to look into
that as well.
FWIW, internally at Meta, we're turning the WARN_ON() calls in
btrfs_free_extent_map() into BUG() calls across a swath of machines in
the hopes we can get a vmcore earlier. We'll keep you posted.
Thanks for the help so far!
> > > btrfs_log_inode() is called recursively on conflicting inodes, parent directories, and referenced inodes:
> > > tree-log.c:5982 btrfs_log_inode(di_inode, log_mode) # dir dentries
> > > tree-log.c:6339 btrfs_log_inode(inode, LOG_INODE_ALL) # conflicting inodes
> > > tree-log.c:6900 btrfs_log_inode(di_inode, log_mode) # new dir dentries
> > > tree-log.c:7389 btrfs_log_inode(dir_inode, LOG_INODE_ALL) # log_all_parents
> > > tree-log.c:7488 btrfs_log_inode(inode, ...) # referenced inodes
> > >
> > > 2. For all of these, btrfs_sync_file holds i_mmap_lock on the original fsync'd inode — not on these side inodes. So a side
> > > inode X gets LOGGING set on its ems while X's own i_mmap_lock is free.
>
> Nop, see above.
>
> > >
> > > So the shrinker can down_read_trylock(X->i_mmap_lock) successfully (nobody holds it) and walk X's tree while X's ems are
> > > LOGGING-flagged → it removes+frees one → corruption. The interlock only protects the directly-fsync'd inode, not the inodes
> > > logged as its dependencies.
>
> Nop. Unless you can show an actual code path where while fsyncing one
> inode we log another inode in LOG_INODE_ALL mode
> Again, the exception is symlinks, but for those, the
> BTRFS_INODE_NEEDS_FULL_SYNC flag is always set, so we never end up in
> btrfs_log_changed_extents().
>
> > >
> > >
> > > > Also, there should be no () after modified_extents (it's not a function name).
> > > >
> > > >
> > >
> > > Sorry, early morning typo. Fixed in my local repo.
> > >
> > > >
> > > > >
> > > > > Fix it by having the shrinker skip maps that are being logged, the same
> > > > > way it skips pinned maps. Such a map is owned by the in-flight fsync and
> > > > > will become reclaimable again once logging clears the flag.
> > > > >
> > > > > Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
> > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > ---
> > > > > We've started hitting a number of these problems in our fleet. It
> > > > > seems to mostly happen on ARM64 architecture, but there have been some
> > > > > WARN_ONs that popped on x86_64 too.
> > > > > ---
> > > > > fs/btrfs/extent_map.c | 8 +++++++-
> > > > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > > > > index fce9c5cc0122..128f7800e101 100644
> > > > > --- a/fs/btrfs/extent_map.c
> > > > > +++ b/fs/btrfs/extent_map.c
> > > > > @@ -1166,7 +1166,13 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> > > > > em = rb_entry(node, struct extent_map, rb_node);
> > > > > ctx->scanned++;
> > > > >
> > > > > - if (em->flags & EXTENT_FLAG_PINNED)
> > > > > + /*
> > > > > + * Skip extent maps that are pinned or are being logged. The
> > > > > + * i_mmap_lock should prevent this from seeing LOGGING on extent_maps
> > > > > + * directly associated with inode, but em may be associated with
> > > > > + * other, dependent inodes and their locks are not held.
> > > > > + */
> > > > > + if (em->flags & (EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING))
> > > > > goto next;
> > > > >
> > > > > /*
> > > > >
> > > > > ---
> > > > > base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
> > > > > change-id: 20260629-btrfs-skip-logging-3e31701d9647
> > > > >
> > > > > Best regards,
> > > > > --
> > > > > Jeff Layton <jlayton@xxxxxxxxxx>
> > > > >
> > > > >
> > >
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
--
Jeff Layton <jlayton@xxxxxxxxxx>