Re: [RFC-PATCH 1/2] mm: Add __GFP_NO_LOCKS flag
From: Paul E. McKenney
Date: Wed Aug 19 2020 - 19:07:42 EST
On Wed, Aug 19, 2020 at 01:26:02AM +0200, Thomas Gleixner wrote:
> Paul,
>
> On Tue, Aug 18 2020 at 10:13, Paul E. McKenney wrote:
> > On Tue, Aug 18, 2020 at 06:55:11PM +0200, Thomas Gleixner wrote:
> >> On Tue, Aug 18 2020 at 09:13, Paul E. McKenney wrote:
> >> > On Tue, Aug 18, 2020 at 04:43:14PM +0200, Thomas Gleixner wrote:
> >> >> Throttling the flooder is increasing robustness far more than reducing
> >> >> cache misses.
> >> >
> >> > True, but it takes time to identify a flooding event that needs to be
> >> > throttled (as in milliseconds). This time cannot be made up.
> >>
> >> Not really. A flooding event will deplete your preallocated pages very
> >> fast, so you have to go into the allocator and get new ones which
> >> naturally throttles the offender.
> >
> > Should it turn out that we can in fact go into the allocator, completely
> > agreed.
>
> You better can for any user space controllable flooding source.
For the memory being passed to kvfree_rcu(), but of course.
However, that memory has just as good a chance of going deeply into the
allocator when being freed as when being allocated. In contrast, the
page of pointers that kvfree_rcu() attempts to allocate can be handed back
via per-CPU variables, avoiding lengthy time on the allocator's free path.
Yes, yes, in theory we could make a devil's pact with potential flooders
so that RCU directly handed memory back to them via a back channel as
well, but in practice that sounds like an excellent source of complexity
and bugs.
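To be concrete, the handback might look something like the sketch below.
All identifiers are illustrative rather than actual kernel code, and the
real thing would need to match whatever locking design kvfree_rcu() ends
up with:

#include <linux/gfp.h>
#include <linux/irqflags.h>
#include <linux/llist.h>
#include <linux/percpu.h>

#define KFREE_PAGE_CACHE_MAX	2	/* arbitrary cap for this sketch */

struct kfree_page_cache {
	struct llist_head pages;	/* spent pages of pointers */
	int nr;				/* pages currently cached */
};

static DEFINE_PER_CPU(struct kfree_page_cache, kfree_page_cache);

/* Called once a page of pointers has been drained after a grace period. */
static void put_pointer_page(void *page)
{
	struct kfree_page_cache *kpc;
	unsigned long flags;

	local_irq_save(flags);
	kpc = this_cpu_ptr(&kfree_page_cache);
	if (kpc->nr < KFREE_PAGE_CACHE_MAX) {
		/* Cheap per-CPU handback, no trip into the page allocator. */
		llist_add((struct llist_node *)page, &kpc->pages);
		kpc->nr++;
		page = NULL;
	}
	local_irq_restore(flags);
	if (page)
		free_page((unsigned long)page);	/* cache full, back to buddy */
}

The point being that the common-case free path touches only this CPU's
cache, and only the overflow case goes anywhere near the page
allocator's free path.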
> >> So if your open/close thing uses the new single argument free which has
> >> to be called from sleepable context then the allocation either gives you
> >> a page or that thing has to wait. No fancy extras.
> >
> > In the single-argument kvfree_rcu() case, completely agreed.
> >
> >> You still can have a page reserved for your other regular things and
> >> once that is gone, you have to fall back to the linked list for
> >> those. But when that happens the extra cache misses are not your main
> >> problem anymore.
> >
> > The extra cache misses are a problem in that case because they throttle
> > the reclamation, which anti-throttles the producer, especially in the
> > case where callback invocation is offloaded.
>
> You still did not explain which contexts can create flooding. I gave you
> a complete list a few mails ago, but you still did not tell which of the
> contexts can cause flooding.
Message-ID: <87tux4kefm.fsf@xxxxxxxxxxxxxxxxxxxxxxx>, correct?
In case #1 (RCU call flooding), the page of pointers is helpful.
In case #2 (RCU not being able to run and mop up the backlog), allocating
pages of pointers is unhelpful, given that doing so simply speeds up the
potential OOM. My thought is to skip kvfree_rcu()/call_rcu() pointer-page
allocation once the current grace period's duration exceeds one second.
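To be concrete, the gating could be as simple as the sketch below.
The gp_start_jiffies variable and the helper name are hypothetical;
the real code would instead consult RCU's existing grace-period
bookkeeping:

#include <linux/compiler.h>
#include <linux/jiffies.h>
#include <linux/types.h>

static unsigned long gp_start_jiffies;	/* set when the current GP starts */

/* Should kvfree_rcu()/call_rcu() bother allocating a page of pointers? */
static bool want_pointer_page(void)
{
	/*
	 * If the current grace period has been running for more than one
	 * second, callbacks are backing up, and allocating more pages of
	 * pointers only speeds up the potential OOM.  Fall back to the
	 * rcu_head path instead.
	 */
	return !time_after(jiffies, READ_ONCE(gp_start_jiffies) + HZ);
}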
> If it's any context which is not sleepable or controllable in any way,
> then any attempt to mitigate it is a lost battle:
>
> A dependency on allocating memory to free memory is a dead end by
> definition.
Any attempt to mitigate it that -lacks- -a- -fallback- is a losing battle.
> Any flooder which is uncontrollable is a bug and no matter what kind of
> hacks you provide, it will be able to bring the whole thing down.
On to the sources of flooding, based on what reality has managed to
teach me thus far:
1) User space via syscalls, e.g. open/close
These are definitely in scope.
2) Kernel thread
In general, these are out of scope, as you would expect.
For completeness, based on rcutorture-induced callback flooding,
RCU can keep up with a flooding kernel thread provided that the
thread's loop contains at least one cond_resched() for
CONFIG_PREEMPT_NONE=y on the one hand, at least one schedule()
for preemptible kernels running NO_HZ_FULL, and at least one
point during which preemption is possible otherwise.
As discussed at Plumbers last year, the sysadm can always
configure callback offloading in such a way that RCU has no
chance of keeping up with a flood. Then again, the sysadm can
also always crash the system in all sorts of interesting ways,
so what is one more?
But please note that rcutorture does -not- recycle the flooded
memory via kfree(), and thus avoids any problems with kfree()
needing to dive deep into the allocator. What rcutorture
does instead is to pass the memory back to the flooder using
a dedicated queue. At the end of the test, the whole mess
is kfree()ed. (A rough sketch of this recycling scheme appears
just after the list below.)
And all of these will be solved (as you would hope) by "Don't do
that!!!", as laid out in the paragraphs above. Of course, the
exact form of "Don't do that" will no doubt change over time, but
this code is in the kernel and therefore can be changed as needed.
3) Softirq
Flooding from within the confines of a single softirq handler
is of course out of scope. There are all sorts of ways for the
softirq-handler writer to deal with this, for example, dropping
into workqueue context to do the allocation, which brings things
back to #2 (kernel thread).
I have not experimented with this, nor do I intend to.
4) Device interrupt
Likewise, flooding from within the confines of a single device
interrupt handler is out of scope. As with softirq handlers,
there are all sorts of ways for the device-driver writer to deal
with this, for example, dropping into workqueue context to do
the allocation, which brings things back to #2 (kernel thread).
(A sketch of this workqueue deferral appears after the summary
paragraph below.)
Again, as with softirq, I have not experimented with this,
nor do I intend to.
5) System interrupts, deep atomic context, NMI ...
System interrupts have the same callback-flooding constraints
as device interrupts, correct? (I am thinking that by "system
interrupt" you mean things like the scheduling-clock interrupt,
except that I have always thought of this interrupt as being a
form of device interrupt.)
I have no idea what "deep atomic context" is, but it does sound
intriguing. ;-) If you mean the entry/exit code that is not
allowed to be traced, then you cannot allocate, call_rcu(),
kfree_rcu(), or kvfree_rcu() from that context anyway.
NMI handlers are not allowed to allocate or to invoke any of
call_rcu(), kfree_rcu(), or kvfree_rcu(), so they are going to
need help from some other execution context if they are going
to do any flooding at all.
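For reference, the recycling trick mentioned under #2 looks roughly
like the sketch below. The identifiers are illustrative rather than
rcutorture's actual ones, but the shape is the same: the RCU callback
never kfree()s, it just pushes the memory onto a dedicated hand-back
list, and the flooder pops from that list before asking the allocator
for anything new:

#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct flood_item {
	struct rcu_head rh;
	struct llist_node recycle;
};

static LLIST_HEAD(recycle_list);	/* the dedicated hand-back queue */

static void flood_cb(struct rcu_head *rhp)
{
	struct flood_item *fip = container_of(rhp, struct flood_item, rh);

	llist_add(&fip->recycle, &recycle_list);	/* no kfree() here */
}

/* Single flooder kthread, so llist_del_first() has a single consumer. */
static void flood_loop(int iters)
{
	struct flood_item *fip;
	struct llist_node *node;
	int i;

	for (i = 0; i < iters; i++) {
		node = llist_del_first(&recycle_list);
		fip = node ? llist_entry(node, struct flood_item, recycle)
			   : kmalloc(sizeof(*fip), GFP_KERNEL);
		if (!fip)
			break;
		call_rcu(&fip->rh, flood_cb);
		cond_resched();	/* lets RCU and ksoftirqd keep up */
	}
	/* At the end of the test, whatever is on recycle_list is kfree()ed. */
}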
In short, the main concern is flooding driven one way or another from
user space, whether via syscalls, traps, exceptions, or whatever.
Therefore, of the above list, I am worried only about #1.
(OK, OK, I am worried about #2-#4, but only from the perspective of
knowing what to tell the developer not to do.)
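And for completeness, the "drop into workqueue context" suggestion
under #3 and #4 is nothing more exotic than the following sort of
sketch (again, all identifiers are illustrative): the handler queues
the to-be-freed objects along with a work item, and the work function
invokes call_rcu() from process context, where any page-of-pointers
allocation is free to sleep:

#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct victim {
	struct rcu_head rh;
	struct llist_node node;
	/* payload ... */
};

static LLIST_HEAD(victim_list);

static void victim_cb(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct victim, rh));
}

static void victim_workfn(struct work_struct *unused)
{
	struct llist_node *pos, *next;

	/* Process context: sleepable allocation and throttling are fine. */
	llist_for_each_safe(pos, next, llist_del_all(&victim_list))
		call_rcu(&llist_entry(pos, struct victim, node)->rh, victim_cb);
}

static DECLARE_WORK(victim_work, victim_workfn);

/* Called from softirq or device-interrupt context (#3 and #4 above). */
static void handler_free_victim(struct victim *v)
{
	llist_add(&v->node, &victim_list);
	schedule_work(&victim_work);	/* back to #2, kernel-thread context */
}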
> So far this looks like you're trying to cure the symptoms, which is
> wrong to begin with.
>
> If the flooder is controllable then there is no problem with cache
> misses at all unless the RCU free callbacks are not able to catch up
> which is yet another problem which you can't cure by allocating more
> memory.
As much as I might like to agree with that statement, the possibility
of freeing (not just allocation) going deep into the allocator leads me
to believe that reality might have a rather different opinion.
And the only way to find out is to try it. I therefore propose that
the eventual patch series be split into three parts:
1. Maintain a separate per-CPU cache of pages of pointers dedicated
to kfree_rcu() and kvfree_rcu(), and later also to call_rcu().
If a given cache is empty, the code takes the fallback (either
use the rcu_head structure or block on synchronize_rcu(),
depending), but the code also asks for an allocation from a
sleepable context in order to replenish the cache. (We could
use softirq for kfree_rcu() and kvfree_rcu(), but call_rcu()
would have deadlock issues with softirq.) A rough sketch of this
cache appears just after this list.
2. Peter's patch or similar.
3. Use of Peter's patch.
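In case it helps, a rough sketch of #1 appears below, building on the
per-CPU stash shown earlier (with one extra refill_work field, assumed
to be INIT_WORK()ed at init time). As before, every identifier is
illustrative, and a real implementation would need tighter locking
than this sketch pretends to have:

#include <linux/gfp.h>
#include <linux/irqflags.h>
#include <linux/kernel.h>
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

struct kfree_page_cache {
	struct llist_head pages;	/* cached pages of pointers */
	int nr;				/* approximate count, see below */
	struct work_struct refill_work;	/* INIT_WORK()ed at init time */
};

static DEFINE_PER_CPU(struct kfree_page_cache, kfree_page_cache);

/* Sleepable context: replenish the owning CPU's cache. */
static void refill_workfn(struct work_struct *work)
{
	struct kfree_page_cache *kpc =
		container_of(work, struct kfree_page_cache, refill_work);
	unsigned long page = __get_free_page(GFP_KERNEL);	/* may sleep */

	if (page) {
		llist_add((struct llist_node *)page, &kpc->pages);
		kpc->nr++;	/* racy, a real implementation needs better */
	}
}

/* Returns a page of pointers or NULL; NULL means take the fallback. */
static void *get_pointer_page(void)
{
	struct kfree_page_cache *kpc;
	struct llist_node *node;
	unsigned long flags;

	local_irq_save(flags);
	kpc = this_cpu_ptr(&kfree_page_cache);
	node = llist_del_first(&kpc->pages);
	if (node)
		kpc->nr--;
	else
		schedule_work(&kpc->refill_work);	/* replenish, sleepably */
	local_irq_restore(flags);

	/*
	 * NULL: use the rcu_head or synchronize_rcu() fallback, and do
	 * not enter the page allocator from this context.
	 */
	return node;
}

If the cache is empty, the caller takes the fallback immediately, and
the page allocator is entered only from the work item, where it is
free to sleep.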
If the need for #2 and #3 is convincing, well and good. If not, I will
create a tag for them in the -rcu tree so that they can be found quickly
in case reality decides to express its opinion at some inconvenient time
and at some inconvenient scale.
Thoughts?
Thanx, Paul