Re: [RFC-PATCH 1/2] mm: Add __GFP_NO_LOCKS flag
From: Thomas Gleixner
Date:  Sat Aug 15 2020 - 17:55:31 EST
Paul,
On Fri, Aug 14 2020 at 16:41, Paul E. McKenney wrote:
> On Sat, Aug 15, 2020 at 01:14:53AM +0200, Thomas Gleixner wrote:
>> As a matter of fact I assume^Wdeclare that removing struct rcu_head which
>> provides a fallback is not an option at all. I know that you want to,
>> but it wont work ever. Dream on, but as we agreed on recently there is
>> this thing called reality which ruins everything.
>
> For call_rcu(), agreed.  For kfree_rcu(), we know what the callback is
> going to do, plus single-argument kfree_rcu() can only be invoked from
> sleepable context.  (If you want to kfree_rcu() from non-sleepable
> context, that will cost you an rcu_head in the data structure being
> freed.)
kfree_rcu() as of today is just a conveniance wrapper around
call_rcu(obj, rcu) which can be called from any context and it still
takes TWO arguments.
Icepack?
So if you come up with a new kfree_rcu_magic(void *obj) single argument
variant which can only be called from sleepable contexts then this does
not require any of the raw lock vs. non raw hacks at all because you can
simply allocate without holding the raw lock in the rare case that you
run out of storage space. With four 4k pages preallocated per CPU that's
every 2048 invocations per CPU on 64bit.
So if you run into that situation then you drop the lock and then it's
racy because you might be preempted or migrated after dropping the lock
and you might have done a useless allocation, but that does not justify
having a special allocator just for that? You have an extra page, so
what?
To prevent subsequent callers to add to the allocation race you simply
can let them wait on the first allocating attempt to finish That avoids
more pointless allocations and as a side effect prevents all of them to
create more pressure by continuing their open/close loop naturally
without extra work.
> So if the single-argument kfree_rcu() case gets hit with a
> memory-allocation failure, it can fall back to waiting for a grace
> period and doing the free.  Of course, grace-period waits have horrible
> latency, but under OOM life is hard.  If this becomes a problem in
> non-OOM situations due to the lockless caches becoming empty, we will
> have to allocate memory if needed before acquiring the lock with the
> usual backout logic.  Doing that means that we can let the allocator
> acquire locks and maybe even do a little bit of blocking, so that the
> inline grace-period-wait would only happen if the system was well and
> truly OOMed.
No. It dropped the rcu internal lock and does a regular GFP_KENRNEL
allocation which waits for the page to become available. Which is a good
thing in the open/close scenario because it throttles the offender.
>> For normal operations a couple of pages which can be preallocated are
>> enough. What you are concerned of is the case where you run out of
>> pointer storage space.
>
> Agreed.
>
>> There are two reasons why that can happen:
>> 
>>       1) RCU call flooding
>>       2) RCU not being able to run and mop up the backlog
>> 
>> #1 is observable by looking at the remaining storage space and the RCU
>>    call frequency
>> 
>> #2 is uninteresting because it's caused by RCU being stalled / delayed
>>    e.g. by a runaway of some sorts or a plain RCU usage bug.
>>    
>>    Allocating more memory in that case does not solve or improve anything.
>
> Yes, #2 is instead RCU CPU stall warning territory.
>
> If this becomes a problem, one approach is to skip the page-of-pointers
> allocation if the grace period is more than (say) one second old.  If
> the grace period never completes, OOM is unavoidable, but this is a way
> of putting it off for a bit.
Don't even think about optimizing your new thing for #2. It's a
pointless exercise. If the task which runs into the 'can't allocate'
case then is sleeps and waits. End of story.
>> So the interesting case is #1. Which means we need to look at the
>> potential sources of the flooding:
>> 
>>     1) User space via syscalls, e.g. open/close
>>     2) Kernel thread
>>     3) Softirq
>>     4) Device interrupt
>>     5) System interrupts, deep atomic context, NMI ...
>> 
>> #1 trivial fix is to force switching to an high prio thread or a soft
>>    interrupt which does the allocation
>> 
>> #2 Similar to #1 unless that thread loops with interrupts, softirqs or
>>    preemption disabled. If that's the case then running out of RCU
>>    storage space is the least of your worries.
>> 
>> #3 Similar to #2. The obvious candidates (e.g. NET) for monopolizing a
>>    CPU have loop limits in place already. If there is a bug which fails
>>    to care about the limit, why would RCU care and allocate more memory?
>> 
>> #4 Similar to #3. If the interrupt handler loops forever or if the
>>    interrupt is a runaway which prevents task/softirq processing then
>>    RCU free performance is the least of your worries.
>> 
>> #5 Clearly a bug and making RCU accomodate for that is beyond silly.
>> 
>> So if call_rcu() detects that the remaining storage space for pointers
>> goes below the critical point or if it observes high frequency calls
>> then it simply should force a soft interrupt which does the allocation.
>
> Unless call_rcu() has been invoked with scheduler locks held.  But
> eventually call_rcu() should be invoked with interrupts enabled, and at
> that point it would be safe to raise_softirq(), wake_up(), or
> whatever.
If this atomic context corner case is hit within a problematic context
then we talk about the RCU of today and not about the future single
argument thing. And that oldschool RCU has a fallback. We are talking
about pressure corner cases and you really want to squeeze out the last
cache miss?  What for? If there is pressure then these cache misses are
irrelevant.
Thanks,
        tglx