Re: [PATCH 1/1] rcu/tree: add emergency pool for headless case
From: Uladzislau Rezki
Date: Mon Apr 06 2020 - 08:57:10 EST
Hello, Joel.
> > >
> > > Hi Vlad,
> > >
> > > One concern I have is this moves the problem a bit further down. My belief is
> > > we should avoid the likelihood of even needing an rcu_head allocated for the
> > > headless case, to begin with - than trying to do damage-control when it does
> > > happen. The only way we would end up needing an rcu_head is if we could not
> > > allocate an array.
> > >
> > Let me share my view on all such caching. I think that now it becomes less as
> > the issue, because of we have now https://lkml.org/lkml/2020/4/2/383 patch.
> > I see that it does help a lot. I tried to simulate low memory condition and
> > apply high memory pressure with that. I did not manage to trigger the
> > "synchronize rcu" path at all. It is because of using much more permissive
> > parameters when we request a memory from the SLAB(direct reclaim, etc...).
>
> That's a good sign that we don't hit this path in your tests.
>
Just one request, of course if you have a time :) Could you please
double check on your test environment to stress the system to check
if you also can not hit it?
How i test it. Please apply below patch:
<snip>
t a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5e26145e9ead..25f7ac8583e1 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3203,6 +3203,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
if (head) {
ptr = (void *) head - (unsigned long) func;
+ head = NULL;
} else {
/*
* Please note there is a limitation for the head-less
@@ -3233,16 +3234,18 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
* Under high memory pressure GFP_NOWAIT can fail,
* in that case the emergency path is maintained.
*/
- success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
- if (!success) {
+ /* success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); */
+ /* if (!success) { */
/* Is headless object? */
if (head == NULL) {
/* Drop the lock. */
krc_this_cpu_unlock(krcp, flags);
head = attach_rcu_head_to_object(ptr);
- if (head == NULL)
+ if (head == NULL) {
+ success = false;
goto inline_return;
+ }
/* Take it back. */
krcp = krc_this_cpu_lock(&flags);
@@ -3267,7 +3270,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
*/
expedited_drain = true;
success = true;
- }
+ /* } */
WRITE_ONCE(krcp->count, krcp->count + 1);
@@ -3297,7 +3300,9 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
if (!rcu_kfree_nowarn)
WARN_ON_ONCE(1);
debug_rcu_head_unqueue(ptr);
- synchronize_rcu();
+ /* synchronize_rcu(); */
+ printk(KERN_ERR "-> hit synchronize_rcu() path.\n");
+ trace_printk("-> hit synchronize_rcu() path.\n");
kvfree(ptr);
}
}
<snip>
lower the memory size and run kfree rcu tests. It would be appreciated.
>
> I guess also, with your latest patch on releasing the lock to be in a
> non-atomic context, and then doing the allocation, it became even more
> permissive? If you drop that patch and tried, do you still not hit the
> synchronous path more often?
>
Yep. If i drop the patch, i can hit it.
>
> Could you also try introducing memory pressure by reducing your system's
> total memory and see how it behaves?
>
I simulated low memory condition by setting the system memory to 145MB.
That was the minimum amount the KVM system was capable of booting. After
that i used kfree rcu tests to simulate memory pressure.
> > > So instead of adding a pool for rcu_head allocations, how do you feel about
> > > pre-allocation of the per-cpu cache array instead, which has the same effect
> > > as you are intending?
> > >
> > In the v2 i have a list of such objects. It is also per-CPU(it is scaled to CPUs),
> > but the difference is, those objects require much less memory, it is 8 + sizeof(struct
> > rcu_head) bytes comparing to one page. Therefore the memory footprint is lower.
>
> Yes, true. That is one drawback is it higher memory usage. But if you have at
> least 1 kfree_rcu() request an each CPU, then pre-allocation does not
> increase memory usage any more than it already has right now. Unless, we
> extend my proposal to cache more than 2 pages per-cpu which I think you
> mentioned below.
>
If we cache two pages per-CPU, i think that is fine. When it comes to increasing
it, it can be a bit wasting. For example consider 128 CPUs system.
> > I have doubts that we would ever hit this emergency list, because of mentioned
> > above patch, but from the other hand i can not say and guarantee 100%. Just in
> > case, we may keep it.
>
> You really have to hit OOM in your tests to trigger it I suppose. Basically
> the emergency pool improves situation under OOM, but otherwise does not
> improve it due to the direct-reclaim that happens as you mentioned. Right?
>
See above how i simulated it. Direct reclaim is our GFP_KERNEL + other flags
helper. If even after reclaim process there is no memory, then emergency list
is supposed to be used. But we can drop this patch, i mean "emergency list"
if we agree on it. The good point would be if you could stress your system
by the i did. See above description :)
> > Paul, could you please share your view and opinion? It would be appreciated :)
> >
> > > This has 3 benefits:
> > > 1. It scales with number of CPUs, no configuration needed.
> > > 2. It makes the first kfree_rcu() faster and less dependent on an allocation
> > > succeeding.
> > > 3. Much simpler code, no new structures or special handling.
> > > 4. In the future we can extend it to allocate more than 2 pages per CPU using
> > > the same caching mechanism.
> > >
> > > The obvious drawback being its 2 pages per CPU but at least it scales by
> > > number of CPUs. Something like the following (just lightly tested):
> > >
> > > ---8<-----------------------
> > >
> > > From: "Joel Fernandes (Google)" <joel@xxxxxxxxxxxxxxxxx>
> > > Subject: [PATCH] rcu/tree: Preallocate the per-cpu cache for kfree_rcu()
> > >
> > > In recent changes, we have made it possible to use kfree_rcu() without
> > > embedding an rcu_head in the object being free'd. This requires dynamic
> > > allocation. In case dynamic allocation fails due to memory pressure, we
> > > would end up synchronously waiting for an RCU grace period thus hurting
> > > kfree_rcu() latency.
> > >
> > > To make this less probable, let us pre-allocate the per-cpu cache so we
> > > depend less on the dynamic allocation succeeding. This also has the
> > > effect of making kfree_rcu() slightly faster at run time.
> > >
> > > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> > > ---
> > > kernel/rcu/tree.c | 5 +++++
> > > 1 file changed, 5 insertions(+)
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 6172e6296dd7d..9fbdeb4048425 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -4251,6 +4251,11 @@ static void __init kfree_rcu_batch_init(void)
> > > krcp->krw_arr[i].krcp = krcp;
> > > }
> > >
> > > + krcp->bkvcache[0] = (struct kvfree_rcu_bulk_data *)
> > > + __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > + krcp->bkvcache[1] = (struct kvfree_rcu_bulk_data *)
> > > + __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > +
> > > INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> > > krcp->initialized = true;
> > > }
> > We pre-allocate it, but differently comparing with your proposal :) I do not see
> > how it can improve things. The difference is you do it during initializing or booting
> > phase. In case of current code it will pre-allocate and cache one page after first
> > calling of the kvfree_call_rcu(), say in one second. So basically both variants are
> > the same.
>
> Well, one proposal is only 5 lines extra ;-). That has got to be at least a
> bit appealing ;-) ;-).
>
:)
> > But i think that we should allow to be used two pages as cached ones, no matter
> > whether it is vmalloc ptrs. or SLAB ones. So basically, two cached pages can be
> > used by vmalloc path and SLAB path. And probably it makes sense because of two
> > phases: one is when we collect pointers, second one is memory reclaim path. Thus
> > one page per one phase, i.e. it would be paired.
>
> You are saying this with regard to my proposal right? I agree number of
> pages could be increased. The caching mechanism already in-place could be
> starting point for that extension.
>
We already have two pages. What we need is to allow to use them in both
paths, vmalloc one and SLAB one, i mean reclaim path. At least that fits
well to the collecting/reclaim phases.
Thanks for comments!
--
Vlad Rezki