Re: [PATCH] btrfs: don't let shrinker touch extent_maps that are being logged
From: Filipe Manana
Date: Tue Jun 30 2026 - 10:17:10 EST
On Tue, Jun 30, 2026 at 1:28 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Mon, 2026-06-29 at 16:47 +0100, Filipe Manana wrote:
> > On Mon, Jun 29, 2026 at 4:06 PM Filipe Manana <fdmanana@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 29, 2026 at 3:56 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, 2026-06-29 at 15:19 +0100, Filipe Manana wrote:
> > > > > On Mon, Jun 29, 2026 at 2:41 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > The extent map shrinker can free an extent map that is still owned by an
> > > > > > in-flight fsync and still linked on the inode's modified_extents list,
> > > > > > corrupting that list and eventually causing an RCU stall.
> > > > > >
> > > > > > btrfs_scan_inode() currently skips EXTENT_FLAG_PINNED maps, then calls
> > > > > > btrfs_remove_extent_mapping() followed by btrfs_free_extent_map():
> > > > > >
> > > > > > if (em->flags & EXTENT_FLAG_PINNED)
> > > > > > goto next;
> > > > > > ...
> > > > > > btrfs_remove_extent_mapping(inode, em);
> > > > > > btrfs_free_extent_map(em);
> > > > > >
> > > > > > But btrfs_remove_extent_mapping() deliberately does NOT unlink a map that
> > > > > > has EXTENT_FLAG_LOGGING set:
> > > > > >
> > > > > > if (!(em->flags & EXTENT_FLAG_LOGGING))
> > > > > > list_del_init(&em->list);
> > > > > > remove_em(inode, em);
> > > > > >
> > > > > > This sets up a UAF situation where a later fsync() can trip over the
> > > > > > now-freed extent_map still on the modified_extents() list.
> > > > >
> > > > > I don't see how that can happen, because fsync always takes the
> > > > > inode's i_mmap_lock (in write/exclusive mode), and btrfs_scan_inode()
> > > > > takes the same lock too (in shared/read mode).
> > > > > EXTENT_FLAG_LOGGING is only set and cleared during fsync, so how can
> > > > > the shrinker race with fsync?
> > > > >
> > > >
> > > > ...and yet, the vmcore shows that the shrinker hit an extent_map that
> > > > had LOGGING set while holding the i_mmap_lock.
> > >
> > > Yes, but we need to understand why...
> > >
> > > >
> > > > I had hoped to add an assertion here that LOGGING couldn't be set on
> > > > the em's after taking the i_mmap_lock, but that's not viable:
> > > >
> > > > The problem is that the i_mmap_lock coverage isn't complete. I had
> > > > Claude assist me with some analysis so take it with a hefty grain of
> > > > salt but the problem is that the em tree can contain em's from other
> > > > inodes where the i_mmap_lock is not held:
> > >
> > > And that is the problem: an em tree, which belongs to an inode, should
> > > only have extent maps for that inode...
> > > An extent map can only belong to an inode.
> > >
> > > So that's what needs investigation. The bug is elsewhere.
> > >
> > > Thanks.
> > >
> > > >
> > > > -----------------------8<---------------------
> > > >
> > > > ...that inode is in progress, therefore no em in its tree should have LOGGING set. Your invariant is the right one.
> > > >
> > > > But it's violated, because the i_mmap_lock interlock is incomplete. Two facts:
> > > >
> > > > 1. tree-log.c never takes i_mmap_lock — grep returns 0 occurrences. Only btrfs_sync_file takes it (file.c:1608/1610), and only
> > > > on the one inode being fsync'd.
> > > > 2. A single fsync logs many other inodes, and each can set LOGGING on its own ems via btrfs_log_changed_extents().
> >
> > A single fsync can log other inodes yes, and without taking their
> > i_mmap_lock, but the inodes are logged in LOG_INODE_EXISTS mode, so
> > extents are not logged.
> > The only exception is for symlink inodes, but those are logged in full
> > mode, only have one extent (inlined), and always have
> > BTRFS_INODE_NEEDS_FULL_SYNC set, so we never end up in
> > btrfs_log_changed_extents() for symlinks.
> >
>
> Thanks for the explanation. It sounds like we should wire some warnings
> into the shrinker around these invariants. Like, if LOGGING is set and
> the inode's not a symlink, do a WARN_ON_ONCE()? I'll plan to look into
> that as well.
>
> FWIW, internally at Meta, we're turning the WARN_ON() calls in
> btrfs_free_extent_map() into BUG() calls across a swath of machines in
> the hopes we can get a vmcore earlier. We'll keep you posted.
To make it as simple as possible:
An fsync can only log extent maps for the inode being fsynced; other
inodes can be logged, but not their extent maps.
In order to log extent maps, we need to have the i_mmap_lock (in
write/exclusive mode) and the vfs lock on the inode too.
Symlinks have an inline extent but no extent map.
Directories never have extents.
So if the shrinker gets an extent map with the LOGGING flag, one of
two things is happening:
1) An fsync of one inode is logging extent maps of another. This bug
can cause corruption and data loss, because we must hold those two
locks to prevent races with writes (buffered, direct IO, mmap) and
other operations (truncate, hole punching, etc.), as previously
stated.
2) Something is corrupting an extent map, either inside or outside
(most likely) the fsync code.
Thanks.
>
> Thanks for the help so far!
>
> > > > btrfs_log_inode() is called recursively on conflicting inodes, parent directories, and referenced inodes:
> > > > tree-log.c:5982 btrfs_log_inode(di_inode, log_mode) # dir dentries
> > > > tree-log.c:6339 btrfs_log_inode(inode, LOG_INODE_ALL) # conflicting inodes
> > > > tree-log.c:6900 btrfs_log_inode(di_inode, log_mode) # new dir dentries
> > > > tree-log.c:7389 btrfs_log_inode(dir_inode, LOG_INODE_ALL) # log_all_parents
> > > > tree-log.c:7488 btrfs_log_inode(inode, ...) # referenced inodes
> > > >
> > > > 2. For all of these, btrfs_sync_file holds i_mmap_lock on the original fsync'd inode — not on these side inodes. So a side
> > > > inode X gets LOGGING set on its ems while X's own i_mmap_lock is free.
> >
> > Nop, see above.
> >
> > > >
> > > > So the shrinker can down_read_trylock(X->i_mmap_lock) successfully (nobody holds it) and walk X's tree while X's ems are
> > > > LOGGING-flagged → it removes+frees one → corruption. The interlock only protects the directly-fsync'd inode, not the inodes
> > > > logged as its dependencies.
> >
> > Nop. Unless you can show an actual code path where while fsyncing one
> > inode we log another inode in LOG_INODE_ALL mode
> > Again, the exception is symlinks, but for those, the
> > BTRFS_INODE_NEEDS_FULL_SYNC flag is always set, so we never end up in
> > btrfs_log_changed_extents().
> >
> > > >
> > > >
> > > > > Also, there should be no () after modified_extents (it's not a function name).
> > > > >
> > > > >
> > > >
> > > > Sorry, early morning typo. Fixed in my local repo.
> > > >
> > > > >
> > > > > >
> > > > > > Fix it by having the shrinker skip maps that are being logged, the same
> > > > > > way it skips pinned maps. Such a map is owned by the in-flight fsync and
> > > > > > will become reclaimable again once logging clears the flag.
> > > > > >
> > > > > > Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
> > > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > > ---
> > > > > > We've started hitting a number of these problems in our fleet. It
> > > > > > seems to mostly happen on ARM64 architecture, but there have been some
> > > > > > WARN_ONs that popped on x86_64 too.
> > > > > > ---
> > > > > > fs/btrfs/extent_map.c | 8 +++++++-
> > > > > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > > > > > index fce9c5cc0122..128f7800e101 100644
> > > > > > --- a/fs/btrfs/extent_map.c
> > > > > > +++ b/fs/btrfs/extent_map.c
> > > > > > @@ -1166,7 +1166,13 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> > > > > > em = rb_entry(node, struct extent_map, rb_node);
> > > > > > ctx->scanned++;
> > > > > >
> > > > > > - if (em->flags & EXTENT_FLAG_PINNED)
> > > > > > + /*
> > > > > > + * Skip extent maps that are pinned or are being logged. The
> > > > > > + * i_mmap_lock should prevent this from seeing LOGGING on extent_maps
> > > > > > + * directly associated with inode, but em may be associated with
> > > > > > + * other, dependent inodes and their locks are not held.
> > > > > > + */
> > > > > > + if (em->flags & (EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING))
> > > > > > goto next;
> > > > > >
> > > > > > /*
> > > > > >
> > > > > > ---
> > > > > > base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
> > > > > > change-id: 20260629-btrfs-skip-logging-3e31701d9647
> > > > > >
> > > > > > Best regards,
> > > > > > --
> > > > > > Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > >
> > > > > >
> > > >
> > > > --
> > > > Jeff Layton <jlayton@xxxxxxxxxx>
>
> --
> Jeff Layton <jlayton@xxxxxxxxxx>