Re: [RFC-PATCH 1/2] mm: Add __GFP_NO_LOCKS flag

From: Paul E. McKenney
Date: Fri Aug 14 2020 - 10:14:35 EST


On Fri, Aug 14, 2020 at 10:30:37AM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2020 at 01:59:04AM +0200, Thomas Gleixner wrote:
> > On Fri, Aug 14 2020 at 00:06, peterz wrote:
> > > I'm still not getting it, how do we end up trying to allocate memory
> > > from under raw spinlocks if you're not allowed to use kfree_rcu() under
> > > one ?
> > >
> > > Can someone please spell out the actual problem?
> >
> > Your actual problem is the heat wave. Get an icepack already :)
>
> Sure, but also much of the below wasn't stated anywhere in the thread I
> got Cc'ed on. All I got was half-arsed solutions without an actual
> problem statement.
>
> > To set things straight:
> >
> > 1) kfree_rcu(), which used to be just a convenience wrapper around
> > call_rcu(), always worked in any context except NMI.
> >
> > Otherwise RT would have never worked or would have needed other
> > horrible hacks to defer kfree in certain contexts including
> > context switch.
>
> I've never used kfree_rcu(), htf would I know.

Doing this to kfree_rcu() is the first step. We will also be doing this
to call_rcu(), which has some long-standing invocations from various
raw contexts, including hardirq handlers.
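
To spell out the actual problem statement requested above, the
troublesome call sites look roughly like the following minimal sketch
(all names here are hypothetical):

	struct my_obj {
		struct rcu_head rh;
		/* ... payload ... */
	};

	static DEFINE_RAW_SPINLOCK(my_lock);	/* hypothetical raw lock */

	static void my_teardown(struct my_obj *p)  /* may run in hardirq */
	{
		raw_spin_lock(&my_lock);
		list_del(&p->list);		/* remove from a list, say */
		kfree_rcu(p, rh);		/* may call the page allocator */
		raw_spin_unlock(&my_lock);
	}

With the new pointer-bulking optimization, that kfree_rcu() invocation
can attempt a page allocation while a raw_spinlock_t is held, which is
exactly the nesting at issue.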

> > 2) RCU grew an optimization for kfree_rcu() which avoids the linked
> > list cache misses when a large number of objects is freed via
> > kfree_rcu(). Paul says it speeds up post grace period processing by
> > a factor of 8, which is impressive.
> >
> > That's done by avoiding the linked list and storing the object
> > pointer in an array of pointers which is page sized. This array is
> > then freed in bulk without having to touch the rcu head linked list
> > which obviously avoids cache misses.
> >
> > Occasionally kfree_rcu() needs to allocate a page for this but it
> > can fall back to the linked list when the allocation fails.
> >
> > Inconveniently this allocation happens with an RCU internal raw
> > lock held, but even if we could do the allocation outside the RCU
> > internal lock this would not solve the problem that kfree_rcu() can
> > be called in any context except NMI even on RT.
> >
> > So RT forbids allocations from truly atomic contexts even with
> > GFP_ATOMIC/NOWAIT. GFP_ATOMIC is kinda meaningless on RT because
> > everything which calls it is in non-atomic context :) But still
> > GFP_ATOMIC or GFP_NOWAIT retain the semantics of !RT and do not go
> > down into deeper levels or wait for pages to become available.
>
> Right, so on RT just cut out the allocation and unconditionally make it
> NULL, since it needs to deal with GFP_ATOMIC|GFP_NOWARN failing the
> allocation anyway.

Except that this is not just an RT problem, due to
CONFIG_PROVE_RAW_LOCK_NESTING=y.
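
For those following along, the pointer-array scheme Thomas describes
in #2 above amounts to roughly the following (a simplified sketch with
made-up names, not the actual code in kernel/rcu/tree.c):

	/* One page's worth of object pointers queued by kfree_rcu(). */
	struct kfree_bulk_page {
		int nr;			/* number of pointers stored */
		void *ptrs[];		/* object pointers, up to a page */
	};

	/* After the grace period: one bulk free, no rcu_head list walk. */
	static void free_batch(struct kfree_bulk_page *bp)
	{
		kfree_bulk(bp->nr, bp->ptrs);
		free_page((unsigned long)bp);
	}

It is the occasional __get_free_page() needed to obtain such a page
that drags the allocator, and hence potentially zone->lock, into
kfree_rcu().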

> > 3) #2 rightfully upsets lockdep (with the raw lock nesting checks
> > enabled) when RCU tries to allocate a page, the lockless page cache
> > is empty, and the allocator acquires zone->lock, which is a regular
> > spinlock.
>
> There was no lockdep splat in the thread.

I don't see the point of including such a splat given that we know that
lockdep is supposed to splat in this situation.

> > 4) As kfree_rcu() can be used from any context except NMI and RT
> > relies on it, we ran into a circular dependency problem.
>
> Well, which actual usage sites are under a raw spinlock? Most of the
> ones I could find are not.

There are some on their way in, but this same optimization will be applied
to call_rcu(), and there is no shortage of such call sites in that case.
And these call sites have been around for a very long time.

> > If we disable that feature for RT then the problem would be solved
> > except that lockdep still would complain about taking zone->lock
> > within a forbidden context on !RT kernels.
> >
> > Preventing that feature because of that is not a feasible option
> > either. Aside from that, we are discussing this post factum; IOW the
> > stuff is upstream already despite the problem being known beforehand.
>
> Well, that's a fail :-( I thought we were supposed to solve known issues
> before shit got merged.

The enforcement and the kfree_rcu() speedup are coming in at the same
time. The call_rcu() speedup will appear later.

> > 5) Ulad posted patches which introduce a new GFP allocation mode
> > which makes the allocation fail if the lockless cache is empty,
> > i.e. it does not try to touch zone->lock in that case.
> >
> > That works perfectly fine on RT and !RT, makes lockdep happy and
> > Paul is happy as well.
> >
> > If the lockless cache, which works perfectly fine on RT, is empty,
> > then the performance of kfree_rcu() post grace period processing is
> > probably not making the situation of the system any worse.
> >
> > Michal is not fond of the new GFP flag and wants to explore options
> > to avoid that completely. But so far there is no real feasible
> > solution.
>
> Not only Michal, those patches looked pretty horrid.

They are the first round, and will be improved.
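
To give an idea of the intended shape, the usage on the kfree_rcu()
side is along these lines (a sketch following the patches in this
thread; details may well change in later rounds):

	/*
	 * With __GFP_NO_LOCKS, fail immediately rather than taking
	 * zone->lock when the per-CPU lockless cache is empty.
	 */
	bnode = (struct kfree_rcu_bulk_data *)
		__get_free_page(GFP_NOWAIT | __GFP_NO_LOCKS | __GFP_NOWARN);
	if (!bnode)
		queue_on_rcu_head_list(ptr);	/* hypothetical fallback */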

> > A) It was suggested to make zone->lock raw, but that's a total
> > disaster latency-wise. With just a kernel compile (!RT kernel),
> > spinwait times are in the hundreds of microseconds.
>
> Yeah, I know, been there done that, like over a decade ago :-)
>
> > B) Michal suggested to have GFP_RT_ATOMIC defined to 0 which would
> > not require a new GFP bit and bail out when RT is enabled and
> > the context is atomic.
>
> I would've written the code like:
>
>	void *mem = NULL;
>
>	...
> #ifndef CONFIG_PREEMPT_RT
>	mem = kmalloc(PAGE_SIZE, GFP_ATOMIC|GFP_NOWAIT);
> #endif
>	if (!mem)
>		....
>
> Seems much simpler and doesn't pollute the GFP_ space.

But this fails in the CONFIG_PROVE_RAW_LOCK_NESTING=y case when lockdep
is enabled.
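
That is, even on a !RT kernel with the allocation compiled in, lockdep
now models RT's nesting rules, so the sequence at issue looks like this
(sketch, with a hypothetical lock name):

	raw_spin_lock(&rcu_internal_lock);	/* raw_spinlock_t held */
	mem = kmalloc(PAGE_SIZE, GFP_ATOMIC);	/* if the lockless cache is
						 * empty, this takes
						 * zone->lock, a spinlock_t */
	raw_spin_unlock(&rcu_internal_lock);

Acquiring a spinlock_t while holding a raw_spinlock_t is exactly what
CONFIG_PROVE_RAW_LOCK_NESTING=y exists to catch, RT or not.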

> > D) To solve the lockdep issue of #B, Michal suggested having magic
> > lockdep annotations which allow !RT kernels to take zone->lock
> > from the otherwise forbidden contexts because on !RT this lock
> > nesting could be considered a false positive.
> >
> > But this would be horrors of some sort because there might be
> > locks further down which then need the same treatment, or some
> > general 'yay, I'm exempt from everything' annotation which is
> > short of disabling lockdep completely.
> >
> > Even if that could be solved and made correct for both RT and
> > !RT then this opens the can of worms that this kind of
> > annotation would show up all over the place within no time for
> > the completely wrong reasons.
>
> So due to this heat crap I've not slept more than a few hours a night
> for the last week; I'm not entirely there, but we already have a bunch
> of lockdep magic for this raw nesting stuff.

Ouch! Here is hoping for cooler weather soon!

> But yes, we need to avoid abuse; grep for lockdep_off() :-( This
> drivers/md/dm-cache-target.c thing is just plain broken. It sorta
> 'works' on mainline (and even there it can behave badly), but on RT it
> will come apart with a bang.
>
> > Paul compiled this options list:
> >
> > > 1. Prohibit invoking allocators from raw atomic context, such
> > > as when holding a raw spinlock.
> >
> > Clearly the simplest solution, but not Paul's favourite, and
> > unfortunately he has a good reason.
>
> Which isn't actually stated anywhere I suppose ?

Several times, but why not one more? ;-)

In CONFIG_PREEMPT_NONE=y kernels, which are heavily used, and for which
the proposed kfree_rcu() and later call_rcu() optimizations are quite
important, there is no way to tell at runtime whether or not you are in
raw atomic context.
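
To expand on that: without preemption enabled there is no meaningful
preemption state to consult, so preemptible() compiles down to a
constant. From memory, the relevant logic in include/linux/preempt.h
is roughly:

	#ifdef CONFIG_PREEMPTION
	#define preemptible()	(preempt_count() == 0 && !irqs_disabled())
	#else
	#define preemptible()	0	/* cannot tell, so assume not */
	#endif

So in such kernels there is nothing for kfree_rcu() to test at runtime.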

> > > 2. Adding a GFP_ flag.
> >
> > Michal does not like the restriction for !RT kernels and tries to
> > avoid the introduction of a new allocation mode.
>
> Like above, I tend to be with Michal on this: just wrap the actual
> allocation in CONFIG_PREEMPT_RT; the code needs to handle a NULL
> pointer there anyway.

That disables the optimization in the CONFIG_PREEMPT_NONE=y case,
where it really is needed.

> > > 3. Reusing existing GFP_ flags/values/whatever to communicate
> > > the raw-context information that was to be communicated by
> > > the new GFP_ flag.
> > >
> > > 4. Making lockdep forgive acquiring spinlocks while holding
> > > raw spinlocks, but only in CONFIG_PREEMPT_NONE=y kernels.
>
> Uhh, !CONFIG_PREEMPT_RT, the rest is 'fine'.

I would be OK with either. In CONFIG_PREEMPT_NONE=n kernels, the
kfree_rcu() code could use preemptible() to determine whether it was safe
to invoke the allocator. The code in kfree_rcu() might look like this:

	mem = NULL;
	if (IS_ENABLED(CONFIG_PREEMPT_NONE) || preemptible())
		mem = __get_free_page(...);

Is your point that the usual mistakes would then be caught by the
usual testing on CONFIG_PREEMPT_NONE=n kernels?

> > These are not separate from each other as #3 requires #4. It is the
> > most horrible solution IMO from a technical POV as it proliferates
> > inconsistency for no good reason.
> >
> > Aside from that, it'd be solving a problem which does not exist simply
> > because kfree_rcu() does not depend on it, and we really don't want to
> > set a precedent and encourage the (ab)use of this in any way.
>
> My preferred solution is 1: if you want to use an allocator, you simply
> don't get to play under raw_spinlock_t. And from a quick grep, most
> kfree_rcu() users are not under raw_spinlock_t context.

There is at least one on its way in, but more to the point, we will
need to apply this same optimization to call_rcu(), which is used in
raw atomic context, including from hardirq handlers.

> This here sounds like someone who wants to have his cake and eat it too.

Just looking for a non-negative sized slice of cake, actually!

> I'll try and think about a lockdep annotation that isn't utter crap, but
> that's probably next week, I need this heat to end and get a few nights
> sleep, I'm an utter wreck atm.

Again, here is hoping for cooler weather!

Thanx, Paul