Re: [PATCH 1/1] rcu/tree: add emergency pool for headless case

From: Uladzislau Rezki
Date: Sun Apr 05 2020 - 13:21:20 EST


On Sat, Apr 04, 2020 at 03:51:29PM -0400, Joel Fernandes wrote:
> On Fri, Apr 03, 2020 at 07:30:51PM +0200, Uladzislau Rezki (Sony) wrote:
> > Maintain an emergency pool for each CPU with some
> > extra objects. There is read-only sysfs attribute,
> > the name is "rcu_nr_emergency_objs". It reflects
> > the size of the pool. As for now the default value
> > is 3.
> >
> > The pool is populated when low memory condition is
> > detected. Please note it is only for headless case
> > it means when the regular SLAB is not able to serve
> > any request, the pool is used.
> >
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
>
> Hi Vlad,
>
> One concern I have is this moves the problem a bit further down. My belief is
> we should avoid the likelihood of even needing an rcu_head allocated for the
> headless case, to begin with - than trying to do damage-control when it does
> happen. The only way we would end up needing an rcu_head is if we could not
> allocate an array.
>
Let me share my view on all such caching. I think that now it becomes less as
the issue, because of we have now https://lkml.org/lkml/2020/4/2/383 patch.
I see that it does help a lot. I tried to simulate low memory condition and
apply high memory pressure with that. I did not manage to trigger the
"synchronize rcu" path at all. It is because of using much more permissive
parameters when we request a memory from the SLAB(direct reclaim, etc...).

>
> So instead of adding a pool for rcu_head allocations, how do you feel about
> pre-allocation of the per-cpu cache array instead, which has the same effect
> as you are intending?
>
In the v2 i have a list of such objects. It is also per-CPU(it is scaled to CPUs),
but the difference is, those objects require much less memory, it is 8 + sizeof(struct
rcu_head) bytes comparing to one page. Therefore the memory footprint is lower.

I have doubts that we would ever hit this emergency list, because of mentioned
above patch, but from the other hand i can not say and guarantee 100%. Just in
case, we may keep it.

Paul, could you please share your view and opinion? It would be appreciated :)

> This has 3 benefits:
> 1. It scales with number of CPUs, no configuration needed.
> 2. It makes the first kfree_rcu() faster and less dependent on an allocation
> succeeding.
> 3. Much simpler code, no new structures or special handling.
> 4. In the future we can extend it to allocate more than 2 pages per CPU using
> the same caching mechanism.
>
> The obvious drawback being its 2 pages per CPU but at least it scales by
> number of CPUs. Something like the following (just lightly tested):
>
> ---8<-----------------------
>
> From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
> Subject: [PATCH] rcu/tree: Preallocate the per-cpu cache for kfree_rcu()
>
> In recent changes, we have made it possible to use kfree_rcu() without
> embedding an rcu_head in the object being free'd. This requires dynamic
> allocation. In case dynamic allocation fails due to memory pressure, we
> would end up synchronously waiting for an RCU grace period thus hurting
> kfree_rcu() latency.
>
> To make this less probable, let us pre-allocate the per-cpu cache so we
> depend less on the dynamic allocation succeeding. This also has the
> effect of making kfree_rcu() slightly faster at run time.
>
> Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> ---
> kernel/rcu/tree.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 6172e6296dd7d..9fbdeb4048425 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4251,6 +4251,11 @@ static void __init kfree_rcu_batch_init(void)
> krcp->krw_arr[i].krcp = krcp;
> }
>
> + krcp->bkvcache[0] = (struct kvfree_rcu_bulk_data *)
> + __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> + krcp->bkvcache[1] = (struct kvfree_rcu_bulk_data *)
> + __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> krcp->initialized = true;
> }
We pre-allocate it, but differently comparing with your proposal :) I do not see
how it can improve things. The difference is you do it during initializing or booting
phase. In case of current code it will pre-allocate and cache one page after first
calling of the kvfree_call_rcu(), say in one second. So basically both variants are
the same.

But i think that we should allow to be used two pages as cached ones, no matter
whether it is vmalloc ptrs. or SLAB ones. So basically, two cached pages can be
used by vmalloc path and SLAB path. And probably it makes sense because of two
phases: one is when we collect pointers, second one is memory reclaim path. Thus
one page per one phase, i.e. it would be paired.

Thanks, Joel!

--
Vlad Rezki