Re: [PATCH] fs/super: skip non-memcg-aware nr_cached_objects in memcg slab shrink
From: Usama Arif
Date: Wed Jun 24 2026 - 11:27:51 EST
On 23/06/2026 19:08, Shakeel Butt wrote:
> On Tue, Jun 09, 2026 at 05:30:47AM -0700, Usama Arif wrote:
>> The super_block shrinker is registered with SHRINKER_MEMCG_AWARE because its
>> dentry and inode LRUs are memcg-aware (via list_lru). But the optional
>> ->nr_cached_objects() hooks that the shrinker also drives are not memcg-aware:
>> btrfs extent maps and xfs inode reclaim operate on filesystem-global
>> state, and shmem's unused-huge shrinker walks a per-superblock shrinklist.
>> None of them filter by sc->memcg.
>
> I see the underlying objects whose count is returned by ->nr_cached_objects()
> hook is memcg charged for shmem and xfs but not for btrfs. Do you envision
> there might be a rare scenario where we have a lot of memory charged to a memcg
> consumed by objects which ->nr_cached_objects() tracks and that memory becomes
> unreclaimable due to this patch?
Hello!
Thanks for the review.
For XFS, xfs_inode is SLAB_ACCOUNT, so a memcg can have memory charged in
XFS inodes sitting in XFS' internal reclaim state. But the current callback
does not target that memcg: xfs_fs_free_cached_objects() calls
xfs_reclaim_inodes_nr(), which walks the mount and reclaims
XFS_ICI_RECLAIM_TAG inodes without checking sc->memcg. So non-root memcg
reclaim was only getting an opportunistic mount-wide reclaim pass here; it
could reclaim the target memcg's inodes by chance, but it could just as well
reclaim inodes charged to other memcgs. This patch removes that cross-memcg
side effect, not a correct memcg-targeted reclaim path.
Those XFS inodes are still reclaimed by XFS' own reclaim worker, which is
queued when reclaimable inodes are tagged and requeues while reclaimable
inodes remain. With the default xfssyncd_centisecs value, that is about every
5 seconds. The root/global superblock shrinker path also continues to call
the XFS callbacks. This will keep the memory reclaimable.
For shmem, shrinklist_len counts inodes whose tail large folio could be
split — splitting itself doesn't free anything; it just lets normal page
LRU reclaim the truncated tail. The folios are on the memcg-aware LRU and
will be aged/reclaimed there independently, so skipping the split from
memcg context just delays the split?
or btrfs extent maps, as you note, these objects are not memcg-charged.
>
>>
>> The mismatch shows up under memcg-heavy slab reclaim. shrink_slab_memcg()
>> calls do_shrink_slab() once per (memcg, NUMA node) pair for every memcg
>> whose bit is set in the per-superblock shrinker bitmap, which on a busy
>> host means hundreds of calls per reclaim pass. Each scan queues the same
>> global shrinker work item that's already kicked from the root path.
>>
>> Because btrfs/xfs global count is typically non-zero on any in-use filesystem,
>> the returned total stays positive even if a memcg's own dentry/inode LRUs
>> are empty. shrink_slab_memcg() therefore never clears the SB shrinker bit
>> in the memcg bitmap, so subsequent reclaim passes from the same memcg
>> re-enter super_cache_count() and pay for the global counter walk again.
>
> What is the main concern? Is it the amount of CPU wasted or are we over
> reclaiming or reclaiming from unrelated memcgs?
>
The primary concern is wasted CPU. On the busy hosts where I saw this,
hundreds of memcgs all repeating that walk per reclaim pass is what made
it visible in profiles.
Another concern is misattribution. Reclaim from memcg X ends up
reclaiming another memcg Y, which could have an affect on Y and
won't provide proper isolation.
>>
>> Restrict ->nr_cached_objects() to the global shrink path (sc->memcg NULL
>> or root). The memcg-aware dentry/inode LRUs keep being counted and
>> scanned per memcg as before; only the global fs-specific hooks are skipped.
>> The root/global shrink path still drives those hooks; only their
>> invocation from non-root memcg slab reclaim is removed.
>>
>> Signed-off-by: Usama Arif <usama.arif@xxxxxxxxx>
>
> I am fine with the stopgap but it would be nice to have proper memcg awareness
> in xfs and shmem callbacks. For btrfs, I am not sure if it makes sense to memcg
> charge btrfs_extent_map objects but at least to decision to skip memcg reclaim
> will be inside the fs callbacks i.e. nr_cached_objects.
>
Agreed that this is the right long term approach. If the preference is to move this down to
the fs own callbacks instead of fs/super.c, I can do that in the revision as well.