Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab

From: Pavel Tikhomirov
Date: Thu Dec 19 2019 - 05:36:06 EST


On 12/10/19 4:20 AM, Dave Chinner wrote:
> On Fri, Dec 06, 2019 at 09:11:25AM -0800, Shakeel Butt wrote:
>> On Thu, Dec 5, 2019 at 6:10 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>> If a shrinker is blocking for a long time, then we need to
>>> work to fix the shrinker implementation because blocking is a much
>>> bigger problem than just register/unregister.
>>>
>>
>> Yes, we should be fixing the implementations of all shrinkers and yes
>> it is a bigger issue, but we can also fix the register/unregister
>> isolation issue in parallel. Fixing all shrinkers would be a tedious
>> and long task, and we should not block fixing the isolation issue on it.
>
> "fixing all shrinkers" is a bit of hyperbole - you've identified
> only one instance where blocking is causing you problems. Indeed,
> most shrinkers are already non-blocking and won't cause you any
> problems at all.
>
>>> IOWs, we already know that cycling a global rwsem on every
>>> individual shrinker invocation is going to cause noticable
>>> scalability problems. Hence I don't think that this sort of "cycle
>>> the global rwsem faster to reduce [un]register latency" solution is
>>> going to fly because of the runtime performance regressions it will
>>> introduce....
>>>
>>
>> I agree with your scalability concern (though others would argue to
>> first demonstrate the issue before adding more sophisticated scalable
>> code).
>
> Look at the git history. We *know* this is a problem, so anyone
> arguing that we have to prove it can go take a long walk off a short
> plank....
>
>> Most memory reclaim code is written without performance or
>> scalability concerns in mind; maybe we should switch our thinking.
>
> I think there's a lot of core mm and other developers that would
> disagree with you there. With respect to shrinkers, we've been
> directly concerned about performance and scalability of the
> individual instances as well as the infrastructure for at least the
> last decade....
>
> Cheers,
>
> Dave.
>

Thanks a lot for your replies; now I see that the core of the problem is
the nfs hang. Before that I was unsure whether it is OK to have such a
hang in do_shrink_slab.

I have a possibly bad idea for how my patch could still work. What if we
use the unlock/refcount approach only for nfs shrinkers? It would still
impose a performance penalty on those who have many nfs mounts, but for
those with only a few nfs mounts the penalty would be smaller. And it
would be a small isolation improvement for nfs users. A rough sketch of
what I mean is below.
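
To make the idea concrete, here is a minimal, untested sketch against
shrink_slab() in mm/vmscan.c. SHRINKER_MAY_BLOCK, the refcnt/done fields
and run_one_shrinker() are made-up names for illustration only; just
shrinker_rwsem, do_shrink_slab() and the refcount/completion primitives
are existing kernel code:

	/* Hypothetical additions to struct shrinker: */
	struct shrinker {
		/* ... existing fields ... */
		unsigned int flags;	/* existing field; add a SHRINKER_MAY_BLOCK bit */
		refcount_t refcnt;	/* new: pins shrinker while rwsem is dropped,
					 * initialized to 1 in register_shrinker() */
		struct completion done;	/* new: unregister_shrinker() waits here */
	};

	static unsigned long run_one_shrinker(struct shrink_control *sc,
					      struct shrinker *shrinker,
					      int priority)
	{
		unsigned long ret;

		/* Fast path: non-blocking shrinkers keep the current
		 * behaviour, so they pay no extra cost. */
		if (!(shrinker->flags & SHRINKER_MAY_BLOCK))
			return do_shrink_slab(sc, shrinker, priority);

		/* Slow path (nfs): pin the shrinker and drop the global
		 * rwsem, so register/unregister can proceed while we are
		 * stuck in do_shrink_slab(). */
		refcount_inc(&shrinker->refcnt);
		up_read(&shrinker_rwsem);

		ret = do_shrink_slab(sc, shrinker, priority);

		if (refcount_dec_and_test(&shrinker->refcnt))
			complete(&shrinker->done);	/* last user wakes unregister */

		/* Re-taking the rwsem can sleep behind a writer, and the
		 * shrinker_list iteration in shrink_slab() must be
		 * restarted or otherwise made safe after the lock was
		 * dropped. */
		down_read(&shrinker_rwsem);
		return ret;
	}

unregister_shrinker() would then drop its initial reference and wait on
shrinker->done before freeing, instead of holding everyone out on
shrinker_rwsem for the whole duration of do_shrink_slab().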

--
Best regards, Tikhomirov Pavel
Software Developer, Virtuozzo.