Re: [PATCH] btrfs: don't let shrinker touch extent_maps that are being logged
From: Jeff Layton
Date: Mon Jun 29 2026 - 11:02:07 EST
On Mon, 2026-06-29 at 15:19 +0100, Filipe Manana wrote:
> On Mon, Jun 29, 2026 at 2:41 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > The extent map shrinker can free an extent map that is still owned by an
> > in-flight fsync and still linked on the inode's modified_extents list,
> > corrupting that list and eventually causing an RCU stall.
> >
> > btrfs_scan_inode() currently skips EXTENT_FLAG_PINNED maps, then calls
> > btrfs_remove_extent_mapping() followed by btrfs_free_extent_map():
> >
> > if (em->flags & EXTENT_FLAG_PINNED)
> > goto next;
> > ...
> > btrfs_remove_extent_mapping(inode, em);
> > btrfs_free_extent_map(em);
> >
> > But btrfs_remove_extent_mapping() deliberately does NOT unlink a map that
> > has EXTENT_FLAG_LOGGING set:
> >
> > if (!(em->flags & EXTENT_FLAG_LOGGING))
> > list_del_init(&em->list);
> > remove_em(inode, em);
> >
> > This sets up a UAF situation where a later fsync() can trip over the
> > now-freed extent_map still on the modified_extents() list.
>
> I don't see how that can happen, because fsync always takes the
> inode's i_mmap_lock (in write/exclusive mode), and btrfs_scan_inode()
> takes the same lock too (in shared/read mode).
> EXTENT_FLAG_LOGGING is only set and cleared during fsync, so how can
> the shrinker race with fsync?
>
...and yet, the vmcore shows that the shrinker hit an extent_map that
had LOGGING set while holding the i_mmap_lock.
I had hoped to add an assertion here that LOGGING couldn't be set on
the em's after taking the i_mmap_lock, but that's not viable:
The problem is that the i_mmap_lock coverage isn't complete. I had
Claude assist me with some analysis so take it with a hefty grain of
salt but the problem is that the em tree can contain em's from other
inodes where the i_mmap_lock is not held:
-----------------------8<---------------------
...that inode is in progress, therefore no em in its tree should have LOGGING set. Your invariant is the right one.
But it's violated, because the i_mmap_lock interlock is incomplete. Two facts:
1. tree-log.c never takes i_mmap_lock — grep returns 0 occurrences. Only btrfs_sync_file takes it (file.c:1608/1610), and only
on the one inode being fsync'd.
2. A single fsync logs many other inodes, and each can set LOGGING on its own ems via btrfs_log_changed_extents().
btrfs_log_inode() is called recursively on conflicting inodes, parent directories, and referenced inodes:
tree-log.c:5982 btrfs_log_inode(di_inode, log_mode) # dir dentries
tree-log.c:6339 btrfs_log_inode(inode, LOG_INODE_ALL) # conflicting inodes
tree-log.c:6900 btrfs_log_inode(di_inode, log_mode) # new dir dentries
tree-log.c:7389 btrfs_log_inode(dir_inode, LOG_INODE_ALL) # log_all_parents
tree-log.c:7488 btrfs_log_inode(inode, ...) # referenced inodes
2. For all of these, btrfs_sync_file holds i_mmap_lock on the original fsync'd inode — not on these side inodes. So a side
inode X gets LOGGING set on its ems while X's own i_mmap_lock is free.
So the shrinker can down_read_trylock(X->i_mmap_lock) successfully (nobody holds it) and walk X's tree while X's ems are
LOGGING-flagged → it removes+frees one → corruption. The interlock only protects the directly-fsync'd inode, not the inodes
logged as its dependencies.
> Also, there should be no () after modified_extents (it's not a function name).
>
>
Sorry, early morning typo. Fixed in my local repo.
>
> >
> > Fix it by having the shrinker skip maps that are being logged, the same
> > way it skips pinned maps. Such a map is owned by the in-flight fsync and
> > will become reclaimable again once logging clears the flag.
> >
> > Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps")
> > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > ---
> > We've started hitting a number of these problems in our fleet. It
> > seems to mostly happen on ARM64 architecture, but there have been some
> > WARN_ONs that popped on x86_64 too.
> > ---
> > fs/btrfs/extent_map.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > index fce9c5cc0122..128f7800e101 100644
> > --- a/fs/btrfs/extent_map.c
> > +++ b/fs/btrfs/extent_map.c
> > @@ -1166,7 +1166,13 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_c
> > em = rb_entry(node, struct extent_map, rb_node);
> > ctx->scanned++;
> >
> > - if (em->flags & EXTENT_FLAG_PINNED)
> > + /*
> > + * Skip extent maps that are pinned or are being logged. The
> > + * i_mmap_lock should prevent this from seeing LOGGING on extent_maps
> > + * directly associated with inode, but em may be associated with
> > + * other, dependent inodes and their locks are not held.
> > + */
> > + if (em->flags & (EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING))
> > goto next;
> >
> > /*
> >
> > ---
> > base-commit: dc59e4fea9d83f03bad6bddf3fa2e52491777482
> > change-id: 20260629-btrfs-skip-logging-3e31701d9647
> >
> > Best regards,
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> >
> >
--
Jeff Layton <jlayton@xxxxxxxxxx>