Re: crazy idea: big percpu lock (Re: task isolation)

From: Chris Metcalf
Date: Wed Oct 28 2015 - 14:42:34 EST

On 10/08/2015 05:25 PM, Andy Lutomirski wrote:
> This whole isolation vs vmstat, etc thing made me think:
>
> It seems to me that a big part of the problem is that there's all
> kinds of per-cpu deferred housekeeping work that can be done on the
> CPU in question without any complicated or heavyweight locking but
> that can't be done remotely without a mess. This presumably includes
> vmstat, draining the LRU list, etc. This is a problem for allowing
> CPUs to spend a long time without any interrupts.
>
> I want to propose a new primitive that might go a long way toward
> solving this issue. The new primitive would be called the "big percpu
> lock". Non-nohz CPUs would hold their big percpu lock all the time.
> Nohz CPUs would hold it all the time unless idle. Full nohz cpus
> would hold it all the time except when idle or in user mode. No CPU
> promises to hold it while processing an NMI or similar NMI-like work.
>
> This should help in a ton of cases.
>
> For vunmap global kernel TLB flushes, we could stick the flushes in a
> list of deferred flushes to be processed on entry, and that list would
> be protected by the big percpu lock. For any kind of draining of
> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
> housekeeping cpu try to do it using the big percpu lock.
>
> There's a race here that affects task isolation. On exit to user
> mode, there's no obvious way to tell that an IPI is already pending.
> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
> increment a percpu pending IPI count, then try to get the big percpu
> lock, and then, if we fail, send the IPI. IOW, we might want a helper
> that takes a remote big percpu lock or calls a remote function that
> guards against this race.
>
> Thoughts? Am I nuts?

The Tilera code has support for avoiding TLB flushes to kernel VAs
while running in userspace on nohz_full cores, but I didn't try to
upstream it yet because it is generally less critical than the other

The model I chose is to have a per-cpu state that indicates whether
the core is in kernel space, in user space, or in user space with
a TLB flush pending. On entry to user space with task isolation
in effect we just set the state to "user". When doing a remote
TLB flush we decide whether or not to actually issue the flush by
doing a cmpxchg() from "user" to "user pending", and if the
old state was either "user" or "user pending", we don't issue the
flush. Finally, on entry to the kernel for a task-isolation task we
do an atomic xchg() to set the state to "kernel", and if we discover
a flush was pending, we just globally flush the kernel's full VA range
(no real reason to optimize for this case).
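
To make the state machine concrete, here is a minimal userspace sketch
using C11 atomics. It is only a model of the logic described above, not
the Tilera code: the names (iso_cpu, iso_enter_user, and so on) are
made up for illustration, and the default sequentially consistent
ordering is assumed rather than tuned.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-cpu states, mirroring "kernel", "user" and
 * "user with a TLB flush pending" from the description above. */
enum iso_state { ISO_KERNEL, ISO_USER, ISO_USER_PENDING };

struct iso_cpu {
	_Atomic int state;
};

/* Isolated core, on entry to user space: just mark the state "user". */
static inline void iso_enter_user(struct iso_cpu *c)
{
	atomic_store(&c->state, ISO_USER);
}

/* Remote flusher: try to move "user" to "user pending".
 * Returns true only if the flush really has to be issued remotely,
 * i.e. the old state was neither "user" nor "user pending". */
static inline bool iso_remote_flush_needed(struct iso_cpu *c)
{
	int old = ISO_USER;

	if (atomic_compare_exchange_strong(&c->state, &old, ISO_USER_PENDING))
		return false;		/* was "user", flush now deferred */
	/* On failure, 'old' holds the state that was actually observed. */
	return old != ISO_USER_PENDING;
}

/* Isolated core, on entry to the kernel: swap in "kernel" and report
 * whether a deferred flush was pending. The caller would then flush
 * the kernel's full VA range. */
static inline bool iso_enter_kernel(struct iso_cpu *c)
{
	return atomic_exchange(&c->state, ISO_KERNEL) == ISO_USER_PENDING;
}
```

Note that a second remote flush against a core already in "user
pending" costs nothing extra, since the one full flush on kernel entry
covers all of them.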

This is basically equivalent to your lock model, where you would
remotely trylock, and if you succeed, set a bit for the core indicating
it needs a kernel TLB flush, and if you fail, just doing the remote
flush yourself. And, on kernel entry for task isolation, you lock
(possibly waiting while someone updates the kernel TLB flush state)
and then if the kernel TLB flush bit is on, do the flush before
completing the entry to the kernel.
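
For comparison, the lock-based variant might look roughly like the
sketch below. Again this is a userspace model under assumptions: the
"big percpu lock" is stood in for by a C11 atomic_flag spinlock, and
the bpl_* names are hypothetical, not an existing kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one core's big percpu lock. The core holds
 * the lock whenever it is in the kernel and drops it in user mode. */
struct bpl {
	atomic_flag held;	/* set while someone owns the lock */
	bool flush_needed;	/* protected by the lock */
};

/* Remote flusher: trylock. On success, record the deferred flush and
 * unlock; on failure the core is in the kernel, so the caller must do
 * the remote flush itself. Returns true if the remote flush is needed. */
static inline bool bpl_remote_flush_needed(struct bpl *l)
{
	if (atomic_flag_test_and_set(&l->held))
		return true;		/* lock busy: core is in the kernel */
	l->flush_needed = true;
	atomic_flag_clear(&l->held);
	return false;
}

/* Isolated core, on kernel entry: take the lock (possibly waiting for
 * a remote updater), then report and clear any deferred flush. The
 * lock stays held while the core runs in the kernel. */
static inline bool bpl_kernel_entry(struct bpl *l)
{
	bool pending;

	while (atomic_flag_test_and_set(&l->held))
		;			/* spin; a real version would relax the cpu */
	pending = l->flush_needed;
	l->flush_needed = false;
	return pending;
}

/* Isolated core, on return to user mode: drop the lock. */
static inline void bpl_user_exit(struct bpl *l)
{
	atomic_flag_clear(&l->held);
}
```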

But, it turns out you also need to keep track of whether TLB flushes
are currently pending for a given core, since you could start a
remote TLB flush to a task-isolation core just as it was getting
ready to return to userspace. Since the caller issues these flushes
synchronously, we would just bump a counter atomically for the
remote core before issuing, and decrement it when it was done.
Then when returning to userspace, we first flip the bit saying that
we are now in the "user" state, and then we actually spin and wait
for the counter to hit zero as well, in case a TLB flush was in progress.
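
A rough sketch of that in-flight counter, once more as a userspace C11
model with invented names (trk_cpu, trk_remote_flush_begin, and so on).
The begin/end pair brackets whatever the remote side does to issue its
synchronous flush; that part is left out here.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum trk_state { TRK_KERNEL, TRK_USER };

/* Hypothetical per-cpu tracking: the user/kernel state plus a count
 * of remote TLB flushes currently in progress against this core. */
struct trk_cpu {
	_Atomic int state;
	_Atomic int flushes_in_flight;
};

/* Remote flusher: bump the counter before issuing the flush... */
static inline void trk_remote_flush_begin(struct trk_cpu *c)
{
	atomic_fetch_add(&c->flushes_in_flight, 1);
}

/* ...and drop it once the synchronous flush has completed. */
static inline void trk_remote_flush_end(struct trk_cpu *c)
{
	atomic_fetch_sub(&c->flushes_in_flight, 1);
}

/* True once no remote flush is still in progress for this core. */
static inline bool trk_flushes_drained(struct trk_cpu *c)
{
	return atomic_load(&c->flushes_in_flight) == 0;
}

/* Isolated core, returning to user space: first flip to "user" so any
 * new remote flush gets deferred, then wait out flushes that had
 * already started before the flip. */
static inline void trk_return_to_user(struct trk_cpu *c)
{
	atomic_store(&c->state, TRK_USER);
	while (!trk_flushes_drained(c))
		;	/* spin; a real version would relax the cpu */
}
```

The ordering matters: setting "user" before checking the counter means
a flusher either sees "user" and defers, or has already bumped the
counter and will be waited for.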

For the tilegx architecture we had support for modifying how pages
were statically cache homed, and this caused a lot of these TLB
flushes since we needed to adjust the kernel linear mapping as
well as the userspace mappings. It's a lot less common otherwise,
just vunmap and the like, but still a good thing for a follow-up patch.

Chris Metcalf, EZChip Semiconductor
