Re: crazy idea: big percpu lock (Re: task isolation)

From: Andy Lutomirski
Date: Wed Oct 28 2015 - 14:45:52 EST

On Wed, Oct 28, 2015 at 11:42 AM, Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote:
> On 10/08/2015 05:25 PM, Andy Lutomirski wrote:
>> This whole isolation vs vmstat, etc thing made me think:
>> It seems to me that a big part of the problem is that there's all
>> kinds of per-cpu deferred housekeeping work that can be done on the
>> CPU in question without any complicated or heavyweight locking but
>> that can't be done remotely without a mess. This presumably includes
>> vmstat, draining the LRU list, etc. This is a problem for allowing
>> CPUs to spend a long time without any interrupts.
>> I want to propose a new primitive that might go a long way toward
>> solving this issue. The new primitive would be called the "big percpu
>> lock". Non-nohz CPUs would hold their big percpu lock all the time.
>> Nohz CPUs would hold it all the time unless idle. Full nohz cpus
>> would hold it all the time except when idle or in user mode. No CPU
>> promises to hold it while processing an NMI or similar NMI-like work.
>> This should help in a ton of cases.
>> For vunmap global kernel TLB flushes, we could stick the flushes in a
>> list of deferred flushes to be processed on entry, and that list would
>> be protected by the big percpu lock. For any kind of draining of
>> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
>> housekeeping cpu try to do it using the big percpu lock.
>> There's a race here that affects task isolation. On exit to user
>> mode, there's no obvious way to tell that an IPI is already pending.
>> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
>> increment a percpu pending IPI count, then try to get the big percpu
>> lock, and then, if we fail, send the IPI. IOW, we might want a helper
>> that takes a remote big percpu lock or calls a remote function that
>> guards against this race.
>> Thoughts? Am I nuts?
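
The "take the remote lock or send the IPI" helper proposed above could be sketched roughly as follows. This is a user-space C11 model, not kernel code: struct big_percpu, remote_call_or_ipi() and the other names are hypothetical, the lock is modeled with an atomic_flag, and the rule for when the pending count drops again (here, as soon as the work has run remotely) is an assumption the original message does not spell out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU structure; names are illustrative only. */
struct big_percpu {
    atomic_flag lock;        /* held by the owning CPU while in kernel mode */
    atomic_int pending_ipis; /* IPIs announced but not yet handled */
};

typedef void (*remote_fn)(void *arg);

static void big_percpu_init(struct big_percpu *p)
{
    atomic_flag_clear(&p->lock);
    atomic_init(&p->pending_ipis, 0);
}

/* Owner side: take the lock on kernel entry (spins if a remote CPU holds it). */
static void big_percpu_enter_kernel(struct big_percpu *p)
{
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ; /* a remote CPU is doing work on our behalf */
}

/* Owner side: drop the lock on exit to user mode (or when going idle). */
static void big_percpu_exit_to_user(struct big_percpu *p)
{
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/*
 * Remote side: announce the IPI first, then try the lock.
 * Returns true if fn ran remotely under the lock, false if the caller
 * still has to send a real IPI (the count then stays raised until the
 * IPI handler, not modeled here, lowers it).
 */
static bool remote_call_or_ipi(struct big_percpu *p, remote_fn fn, void *arg)
{
    atomic_fetch_add(&p->pending_ipis, 1);
    if (!atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire)) {
        fn(arg); /* target is idle or in user mode; do the work for it */
        atomic_flag_clear_explicit(&p->lock, memory_order_release);
        atomic_fetch_sub(&p->pending_ipis, 1);
        return true;
    }
    return false;
}

/* Example callback: counts how many times it was invoked. */
static void count_call(void *arg)
{
    ++*(int *)arg;
}
```

Incrementing the count before the trylock is what would let the exit-to-user path notice that an IPI is on its way even if the remote CPU loses the race for the lock.
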
> The Tilera code has support for avoiding TLB flushes to kernel VAs
> while running in userspace on nohz_full cores, but I didn't try to
> upstream it yet because it is generally less critical than the other
> stuff.
> The model I chose is to have a per-cpu state that indicates whether
> the core is in kernel space, in user space, or in user space with
> a TLB flush pending. On entry to user space with task isolation
> in effect we just set the state to "user". When doing a remote
> TLB flush we decide whether or not to actually issue the flush by
> doing a cmpxchg() from "user" to "user pending", and if the
> old state was either "user" or "user pending", we don't issue the
> flush. Finally, on entry to the kernel for a task-isolation task we
> do an atomic xchg() to set the state to "kernel", and if we discover
> a flush was pending, we just globally flush the kernel's full VA range
> (no real reason to optimize for this case).

This sounds like it belongs in the generic context tracking code, or
at least the tracking part and the option to handle deferred work
should go there.
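
For reference, the three-state scheme described in the quoted text can be modeled in a few lines. This is only a user-space C11 sketch of the state machine, not the Tilera code or any existing kernel API: the enum values, struct iso_cpu and the function names are all made up for illustration, and the actual flush and IPI are left to the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state values; names are illustrative only. */
enum iso_state { ISO_KERNEL, ISO_USER, ISO_USER_PENDING };

struct iso_cpu {
    _Atomic int state;
};

/* Entry to user space with task isolation in effect: mark the CPU "user". */
static void iso_enter_user(struct iso_cpu *c)
{
    atomic_store(&c->state, ISO_USER);
}

/*
 * Remote flusher: cmpxchg "user" -> "user pending".
 * Returns true if a real flush must be issued to this CPU, false if the
 * old state was "user" or "user pending" and the flush is deferred.
 */
static bool iso_remote_needs_flush(struct iso_cpu *c)
{
    int old = ISO_USER;

    if (atomic_compare_exchange_strong(&c->state, &old, ISO_USER_PENDING))
        return false; /* was "user": flush is now recorded as pending */
    /* On failure, old holds the state that was actually observed. */
    return old != ISO_USER_PENDING;
}

/*
 * Kernel entry: xchg to "kernel".
 * Returns true if a deferred flush was pending and must be done now
 * (the quoted text flushes the kernel's full VA range in that case).
 */
static bool iso_enter_kernel(struct iso_cpu *c)
{
    return atomic_exchange(&c->state, ISO_KERNEL) == ISO_USER_PENDING;
}
```

The xchg on kernel entry is what closes the race: a remote cmpxchg either lands before it, in which case the entry path sees "user pending" and flushes, or after it, in which case the remote side sees "kernel" and issues the flush itself.
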
