Re: [PATCH] mm: fix hanging shrinker management on long do_shrink_slab

From: Michal Hocko
Date: Wed Dec 04 2019 - 03:35:20 EST


On Sat 30-11-19 00:45:41, Pavel Tikhomirov wrote:
> We have a problem that shrinker_rwsem can be held for a long time for
> read in shrink_slab, at the same time any process which is trying to
> manage shrinkers hangs.
>
> The shrinker_rwsem is taken in shrink_slab while traversing shrinker_list.
> It tries to shrink something on nfs (hard) but nfs server is dead at
> these moment already and rpc will never succeed. Generally any shrinker
> can take significant time to do_shrink_slab, so it's a bad idea to hold
> the list lock here.

Yes, this is a known problem and people have already tried to address it
in the past. Have you checked previous attempts? SRCU based one
http://lkml.kernel.org/r/153365347929.19074.12509495712735843805.stgit@xxxxxxxxxxxxxxxxxxxxx
but I believe there were others (I only had this one in my notes).
Please make sure to Cc Dave Chinner when posting a next version because
he had some concerns about the change of the behavior.

> We have a similar problem in shrink_slab_memcg, except that we are
> traversing shrinker_map+shrinker_idr there.
>
> The idea of the patch is to inc a refcount to the chosen shrinker so it
> won't disappear and release shrinker_rwsem while we are in
> do_shrink_slab, after that we will reacquire shrinker_rwsem, dec
> the refcount and continue the traversal.

The reference count part makes sense to me. RCU role needs a better
explanation. Also do you have any reason to not use completion for
the final step? Openconding essentially the same concept sounds a bit
awkward to me.

> We also need a wait_queue so that unregister_shrinker can wait for the
> refcnt to become zero. Only after these we can safely remove the
> shrinker from list and idr, and free the shrinker.
[...]
> crash> bt ...
> PID: 18739 TASK: ... CPU: 3 COMMAND: "bash"
> #0 [...] __schedule at ...
> #1 [...] schedule at ...
> #2 [...] rpc_wait_bit_killable at ... [sunrpc]
> #3 [...] __wait_on_bit at ...
> #4 [...] out_of_line_wait_on_bit at ...
> #5 [...] _nfs4_proc_delegreturn at ... [nfsv4]
> #6 [...] nfs4_proc_delegreturn at ... [nfsv4]
> #7 [...] nfs_do_return_delegation at ... [nfsv4]
> #8 [...] nfs4_evict_inode at ... [nfsv4]
> #9 [...] evict at ...
> #10 [...] dispose_list at ...
> #11 [...] prune_icache_sb at ...
> #12 [...] super_cache_scan at ...
> #13 [...] do_shrink_slab at ...

Are NFS people aware of this? Because this is simply not acceptable
behavior. Memory reclaim cannot be block indefinitely or for a long
time. There must be a way to simply give up if the underlying inode
cannot be reclaimed.

I still have to think about the proposed solution. It sounds a bit over
complicated to me.
--
Michal Hocko
SUSE Labs