Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

From: Dave Chinner
Date: Tue May 12 2020 - 21:54:52 EST


On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > > and inefficient mode" APIs for MM to call!
> > >
> > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> > > expect them not to nest (as in if you need them to nest, please let
> > > me know). I would not expect these to be invoked all that often (as in
> > > if you do need them to be fast and scalable, please let me know). >
> > > RCU would then be in fast/inefficient mode if either MM told it to be
> > > or if RCU had detected callback overload on at least one CPU.
> > >
> > > Seem reasonable?
> >
> > Not exactly nested calls, but kswapd threads are per numa node.
> > So, at some level nodes under pressure must be counted.
>
> Easy enough, especially given that RCU already "counts" CPUs having
> excessive numbers of callbacks. But assuming that the transitions to/from
> OOM are rare, I would start by just counting them with a global counter.
> If the counter is non-zero, RCU is in fast and inefficient mode.
>
> > Also forcing rcu calls only for cpus in one numa node might be useful.
>
> Interesting. RCU currently evaluates a given CPU by comparing the
> number of callbacks against a fixed cutoff that can be set at boot using
> rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
> RCU becomes more aggressive about invoking callbacks on that CPU, for
> example, by sacrificing some degree of real-time response. I believe
> that this heuristic would also serve the OOM use case well.

So one of the things that I'm not sure people have connected here is
that memory reclaim done by shrinkers is one of the things that
drives huge numbers of call_rcu() callbacks to free memory via rcu.
If we are reclaiming dentries and inodes, then we can be pushing
thousands to hundreds of thousands of objects into kfree_rcu()
and/or direct call_rcu() calls to free these objects in a single
reclaim pass.

Hence the trigger for RCU going into "excessive callback" mode
might, in fact, be kswapd running a pass over the shrinkers. i.e.
memory reclaim itself can be responsible for pushing RCU into this "OOM
pressure" situation.

So perhaps we've missed a trick here by not having the memory
reclaim routines trigger RCU callbacks at the end of a priority
scan. The shrinkers have queued the objects for freeing, but they
haven't actually been freed yet and so things like slab pages
haven't actually been returned to the free pool even though the
shrinkers have said "freed this many objects"...

i.e. perhaps the right solution here is a "rcu_run_callbacks()"
function that memory reclaim calls before backing off and/or winding
up reclaim priority.

> > I wonder if direct-reclaim should at some stage simply wait for RCU QS.
> > I.e. call rcu_barrier() or similar somewhere before invoking OOM.
>
> The rcu_oom_count() function in the patch starting this thread returns the
> total number of outstanding callbacks queued on all CPUs. So one approach
> would be to invoke this function, and if the return value was truly
> huge (taking size of memory and who knows that all else into account),
> do the rcu_barrier() to wait for RCU to clear its current backlog.

The shrinker scan control structure has a node mask in it to
indicate what node (and hence CPUs) it should be reclaiming from.
This information comes from the main reclaim scan routine, so it
would be trivial to feed straight into the RCU code to have it
act on just the CPUs/node that we are reclaiming memory from...

> On the NUMA point, it would be dead easy for me to supply a function
> that returned the number of callbacks on a given CPU, which would allow
> you to similarly evaluate a NUMA node, a cgroup, or whatever.

I'd think it runs the other way around - we optimisitically call the
RCU layer to do cleanup, and the RCU layer decides if there's enough
queued callbacks on the cpus/node to run callbacks immediately. It
would even be provided with the scan priority to indicate the level
of desperation memory reclaim is under....

> > All GFP_NOFAIL users should allow direct-reclaim, thus this loop
> > in page_alloc shouldn't block RCU and doesn't need special care.
>
> I must defer to you guys on this. The main caution is the duration of
> direct reclaim. After all, if it is too long, the kfree_rcu() instance
> would have been better of just invoking synchronize_rcu().

Individual callers of kfree_rcu() have no idea of the load on RCU,
nor how long direct reclaim is taking. Calling synchronize_rcu()
incorrectly has pretty major downsides to it, so nobody should be
trying to expedite kfree_rcu() unless there is a good reason to do
so (e.g. at unmount to ensure everything allocated by a filesystem
has actually been freed). Hence I'd much prefer the decision to
expedite callbacks is made by the RCU subsystem based on it's known
callback load and some indication of how close memory reclaim is to
declaring OOM...

Cheers,

Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx