Re: [RFC PATCH 0/2] Remove shrinker's nr_deferred
From: Yang Shi
Date: Wed Sep 30 2020 - 13:23:23 EST
On Wed, Sep 30, 2020 at 12:31 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Sat, Sep 26, 2020 at 01:31:36PM -0700, Yang Shi wrote:
> > Hi Dave,
> >
> > I was exploring to make the "nr_deferred" per memcg. I looked into and
> > had some prototypes for two approaches so far:
> > 1. Have per memcg data structure for each memcg aware shrinker, just
> > like what shrinker_map does.
> > 2. Have "nr_deferred" on list_lru_one for memcg aware lists.
> >
> > Both seem feasible, however the latter one looks much cleaner, I just
> > need to add two new APIs for shrinker which gets and sets
> > "nr_deferred" respectively. And, just memcg aware shrinkers need
> > define those two callouts. We just need to care about memcg aware
> > shrinkers, and the most memcg aware shrinkers (inode/dentry, nfs and
> > workingset) use list_lru, so I'd prefer the latter one.
>
> The list_lru is completely separate from the shrinker context. The
> structure that tracks objects in a subsystem is not visible or aware
> of how the high level shrinker scanning algorithms work. Not to
> mention that subsystem shrinkers can be memcg aware without using
> list_lru structures to index objects owned by a given memcg. Hence I
> really don't like the idea of tying the shrinker control data deeply
> into subsystem cache indexing....
I see your points. Yes, makes sense to me. The list_lru is a common
data structure and could be used by other subsystems, not only memcg
aware shrinkers.
Putting shrinker control data in list_lru seems break the layer. So,
option #1 might be more appropriate. The change looks like:
struct mem_cgroup_per_node {
...
struct memcg_shrinker_map __rcu *shrinker_map;
+ struct memcg_shrinker_deferred __rcu *shrinker_deferred;
...
}
>
>
> > But there are two memcg aware shrinkers are not that straightforward
> > to deal with:
> > 1. The deferred split THP. It doesn't use list_lru, but actually I
> > don't worry about this one, since it is not cache just some partial
> > unmapped THPs. I could try to convert it to use list_lru later on or
> > just kill deferred split by making vmscan split partial unmapped THPs.
> > So TBH I don't think it is a blocker.
>
> What a fantastic abuse of the reclaim infrastructure. :/
>
> First it was just defered work. Then it became NUMA_AWARE. THen it
> became MEMCG_AWARE and....
>
> Oh, man what a nasty hack that SHRINKER_NONSLAB flag is so that it
> runs through shrink_slab_memcg() even when memcgs are configured in
> but kmem tracking disabled. We have heaps of shrinkers that reclaim
> from things that aren't slab caches, but this one is just nuts.
>
> > 2. The fs_objects. This one looks weird. It shares the same shrinker
> > with inode/dentry. The only user is XFS currently. But it looks it is
> > not really memcg aware at all, right?
>
> It most definitely is.
>
> The VFS dentry and inode cache reclaim are memcg aware. The
> fs_objects callout is for filesystem level object garbage collection
> that can be done as a result of the dentry and inode caches being
> reclaimed.
>
> i.e. once the VFS has reclaimed the inode attached to the memcg, it
> is no longer attached and accounted to the memcg anymore. It is
> owned by the filesystem at this point, and it is entirely up to the
> filesytem to when it can then be freed. Most filesystems do it in
> the inode cache reclaim via the ->destroy method. XFS, OTOH, tags
> freeable inodes in it's internal radix trees rather than freeing
> them immediately because it still may have to clean the inode before
> it can be freed. Hence we defer freeing of inodes until the
> ->fs_objects pass....
Aha, thanks for elaborating. Now I see what it is doing for...
>
> > They are managed by radix tree
> > which is not per memcg by looking into xfs code, so the "per list_lru
> > nr_deferred" can't work for it. I thought of a couple of ways to
> > tackle it off the top of my head:
> > A. Just ignore it. If the amount of fs_objects are negligible
> > comparing to inode/dentry, then I think it can be just ignored and
> > kept it as is.
>
> Ah, no, they are not negliable. Under memory pressure, the number of
> objects is typically 1/3rd dentries, 1/3rd VFS inodes, 1/3rd fs
> objects to be reclaimed. The dentries and VFS inodes are owned by
> VFS level caches and associated with memcgs, the fs_objects are only
> visible to the filesystem.
>
> > B. Move it out of inode/dentry shrinker. Add a dedicated shrinker
> > for it, for example, sb->s_fs_obj_shrink.
>
> No, they are there because the reclaim has to be kept in exact
> proportion to the dentry and inode reclaim quantities. That's the
> reason I put that code there in the first place: a separate inode
> filesystem cache shrinker just didn't work well at all.
>
> > C. Make it really memcg aware and use list_lru.
>
> Two things. Firstly, objects are owned by the filesystem at this
> point, not memcgs. Memcgs were detatched at the VFS inode reclaim
> layer.
>
> Secondly, list-lru does not scale well enough for the use needed by
> XFS. We use radix trees so we can do lockless batch lookups and
> IO-efficient inode-order reclaim passes. We also have concurrent
> reclaim capabilities because of the lockless tag lookup walks.
> Using a list_lru for this substantially reduces reclaim performance
> and greatly increases CPU usage of reclaim because of contention on
> the internal list lru locks. Been there, measured that....
>
> > I don't have any experience on XFS code, #C seems the most optimal,
> > but should be the most time consuming, I'm not sure if it is worth it
> > or not. So, #B sounds more preferred IMHO.
>
> I think you're going completely in the wrong direction. The problem
> that needs solving is integrating shrinker scanning control state
> with memcgs more tightly, not force every memcg aware shrinker to
> use list_lru for their subsystem shrinker implementations....
Thanks a lot for all the elaboration and advice. Integrating shrinker
scanning control state with memcgs more tightly makes sense to me.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx