Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in case of global reclaim

From: Kirill Tkhai
Date: Mon Mar 19 2018 - 07:25:27 EST


On 19.03.2018 14:06, Kirill Tkhai wrote:
> On 17.03.2018 00:39, Dave Chinner wrote:
>> On Fri, Mar 16, 2018 at 11:55:30AM +0300, Kirill Tkhai wrote:
>>> On 16.03.2018 02:03, Dave Chinner wrote:
>>>> On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
>>>>> On 15.03.2018 20:49, Michal Hocko wrote:
>>>>>> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
>>>>>>> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
>>>>>>> So, it's called for memcg reclaim too, e.g. this list is shrinked
>>>>>>> disproportionality to another lists.
>>>>>>>
>>>>>>> This looks confusing, so I'm reporting about this.
>>>>>>> Consider this patch as RFC.
>>>>>>
>>>>>> Could you be more specific about the problem you are trying to solve?
>>>>>> Because we do skip shrinkers which are not memcg aware by
>>>>>> shrink_slab:
>>>>>> /*
>>>>>> * If kernel memory accounting is disabled, we ignore
>>>>>> * SHRINKER_MEMCG_AWARE flag and call all shrinkers
>>>>>> * passing NULL for memcg.
>>>>>> */
>>>>>> if (memcg_kmem_enabled() &&
>>>>>> !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>>>>> continue;
>>>>>>
>>>>>> Or am I missing something?
>>>>>
>>>>> sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
>>>>> super_cache_count() is owned and only called by superblock's shrinker,
>>>>> which does have SHRINKER_MEMCG_AWARE flag.
>>>>
>>>> Xfs inodes are accounted to memcgs when they are allocated. All the
>>>> memcg reclaim stuff is done at the VFS inode cache level - all the
>>>> XFS inode cache shrinker does is clean up inodes that are not
>>>> referenced by the VFS inode cache because the memcg aware reclaim
>>>> has already freed them.
>>>>
>>>> i.e. what the XFS inode cache is doing is perfectly reasonable -
>>>> memcg aware inode reclaim is occuring at the VFS level, but on XFS
>>>> that does not free the inodes as they are still referenced
>>>> internally by XFS. However, once the inode is removed from the VFS
>>>> LRU, all memcg information about the inode is destroyed, so there's
>>>> nothing in the XFS layers that cares about memcgs.
>>>
>>> So, after inode is removed from LRU, memory still remains accounted
>>> to memcg till the time they are actually freed. I personally don't
>>> care, just to mention.
>>>
>>>> Hence when the XFS inode shrinker then called to run a
>>>> garbage collection pass on unreferenced inodes - the inodes that
>>>> are now unreferenced in the memcg due to the VFS inode shrinker pass
>>>> - it frees inodes regardless of the memcg context it was called from
>>>> because that information is no longer present in the inode cache.
>>>> Hence we just ignore memcgs at this level.
>>>
>>> But xfs_fs_free_cached_objects() returns number of these freed object
>>> as result to super_cache_scan(), so shrinker interprets them as related
>>> to a memcg, while they may be related to another memcg. This introduces
>>> a disproportion relatively to another shrinkers called to memcg.
>>
>> In what way? All memcgs see tha same values from the backing cache
>> and so try to do the same amount of scanning work. The shrinker
>> accounting simply doesn't care where the objects are scanned from,
>> as long as it comes from the same place as the calculation of the
>> number of objects in the cache it's about to scan.
>
> shrinker->count_objects() result is used to count number of objects,
> do_shrink_slab() should shrink during the call:
>
> freeable = shrinker->count_objects(shrinker, shrinkctl);
>
> Then shrinker takes part of this value:
>
> delta = freeable >> priority;
> delta *= 4;
> do_div(delta, shrinker->seeks);
>
> This is a number, the shrinker tries to shrink during the call.
> Let the priority is DEF_PRIORITY. Then, shrinker try to shrink
> freeable*4/(shrinker->seeks * 2^DEF_PRIORITY) enteties from every
> shrinker. Equal percent of every memcg shrinker.
>
> When XFS shows global number of cached objects in count_objects(),
> shrinker also tryes to shrink the same percent of global objects,
> as for other memcg shrinkers. So, when you have small memcg
> with 128Mb memory allowed and small number of tasks related to it,
> you may meet 1Gb of cached objects, which were queued by another
> big cgroup. So, the small cgroup may shrink number of objects of
> size more, then its own. It's not fair. That's all I'm worry in
> this message.
>
>> The memcg accounting, OTOH, is completely divorced from the
>> shrinker, so if it's still got too much slab memory accounted to it,
>> it will simply run the shrinker again and do some more memory
>> reclaim.
>
> This message is not about OOM, it's about imbalance. See above.
>
>> XFS does this for IO efficiency purposes, not because it's ideal
>> behaviour for memcgs. If we start doing exact memcg-only inode
>> reclaim, we're going back to the bad old days where inode reclaim
>> causes really nasty small random write storms that essentially
>> starve the storage subsystem of all other IO for seconds at a time.
>> That is a *far worse problem* from a system performance POV than
>> having the memcg memory reclaim run a couple more shrinker passes
>> and do a little more work....
>
> Ok, this is a point.
>
>>> Is there a problem? This is what my patch about.
>>
>> You've described some theoretical issue, but not described any user
>> visible or performance impacting behaviour that users actually care
>> about. What reproducable, observable behaviour does it fix/change?
>
> Strange question. Do you fix problems only when you meet a reproducable
> BUG()? This is the kernel, and many problem may be racy. But this
> does not mean, it's prohibited to discuss about them.
>
>>>> So, is there a problem you are actually trying to fix, or is this
>>>> simply a "I don't understand how the superblock shrinkers work,
>>>> please explain" patch?
>>>
>>> I work on some shrinker changes, and just want to understand they don't
>>> touch anything.
>>
>> Oh, goody, another person about to learn the hard way that shrinkers
>> are far more varied and complex than page reclaim :P
>
> It may be a surprise, but I have to say, that all memcg-aware shrinkers
> are based on list_lrus. XFS is exception. So, your words are not about
> shrinkers in general, just about XFS.

Just to clarify. I personally do not care about this problem. Consider
this as RFC/possible error report. If you are not interested in this,
let's stop the discussion.

Kirill