Re: crazy idea: big percpu lock (Re: task isolation)

From: Christoph Lameter
Date: Thu Oct 08 2015 - 18:01:10 EST


On Thu, 8 Oct 2015, Andy Lutomirski wrote:

> It seems to me that a big part of the problem is that there's all
> kinds of per-cpu deferred housekeeping work that can be done on the
> CPU in question without any complicated or heavyweight locking but
> that can't be done remotely without a mess. This presumably includes
> vmstat, draining the LRU list, etc. This is a problem for allowing
> CPUs to spend a long time without any interrupts.

Well, it's not a problem if the task does a prctl to ask the kernel to
quiet down. In that case we can simply flush all the pending work on the
cpu that owns the percpu section.

> I want to propose a new primitive that might go a long way toward
> solving this issue. The new primitive would be called the "big percpu
> lock". Non-nohz CPUs would hold their big percpu lock all the time.
> Nohz CPUs would hold it all the time unless idle. Full nohz cpus
> would hold it all the time except when idle or in user mode. No CPU
> promises to hold it while processing an NMI or similar NMI-like work.

Not sure that there is an issue to solve. So this is a lock per cpu that
signals that the processor can handle its per cpu data alone. If it's not
held then other cpus can access the percpu data remotely?

> This should help in a ton of cases.
>
> For vunmap global kernel TLB flushes, we could stick the flushes in a
> list of deferred flushes to be processed on entry, and that list would
> be protected by the big percpu lock. For any kind of draining of
> non-NMI-safe percpu data (LRU, vmstat, whatever), we could have a
> housekeeping cpu try to do it using the big percpu lock

Ok, what is the problem with using the cpu that owns the percpu data to
flush it? Or simply ignoring the situation until the cpu enters the
kernel again? Caches can be useful again later when the process wants to
allocate memory etc. We would have to repopulate them if we flush them.

> There's a race here that affects task isolation. On exit to user
> mode, there's no obvious way to tell that an IPI is already pending.
> We could add that, too: whenever we send an IPI to a nohz_full CPU, we
> increment a percpu pending IPI count, then try to get the big percpu
> lock, and then, if we fail, send the IPI. IOW, we might want a helper
> that takes a remote big percpu lock or calls a remote function that
> guards against this race.
>
> Thoughts? Am I nuts?

Generally, having a lock that signals that others can access the per cpu
data may make sense. However, what is the overhead of handling that lock?

One definitely does not want to handle that in latency critical sections.

And one cannot handle the lock in interrupt disabled sections like IPIs.
But if one can remotely acquire that lock then no IPI is needed anymore if
the only thing we want to do is manipulate per cpu data.
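
A rough userspace sketch of that idea, using C11 atomics rather than kernel
primitives. Every name here (big_percpu_trylock, remote_or_ipi, pcp_cache,
ipis_sent) is hypothetical and not an existing kernel interface. The "IPI" is
only modelled as a counter; in the kernel it would be something along the
lines of smp_call_function_single().

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4                      /* hypothetical CPU count */

static atomic_bool big_lock[NR_CPUS];  /* true while the lock is held */
static long pcp_cache[NR_CPUS];        /* stand-in for per cpu data */
static int ipis_sent[NR_CPUS];         /* stand-in for real IPI delivery */

typedef void (*percpu_fn)(int cpu);

/* Try to take the big percpu lock of @cpu without waiting. */
static bool big_percpu_trylock(int cpu)
{
    bool expected = false;

    return atomic_compare_exchange_strong_explicit(&big_lock[cpu],
            &expected, true,
            memory_order_acquire, memory_order_relaxed);
}

static void big_percpu_unlock(int cpu)
{
    atomic_store_explicit(&big_lock[cpu], false, memory_order_release);
}

/* Example of per cpu work: throw away whatever is cached for @cpu. */
static void drain_cache(int cpu)
{
    pcp_cache[cpu] = 0;
}

/*
 * Run @fn against the per cpu data of @cpu. If the owner does not hold
 * its lock (idle or in user mode), do the work from here and return true.
 * Otherwise fall back to the modelled IPI and return false.
 */
static bool remote_or_ipi(int cpu, percpu_fn fn)
{
    if (big_percpu_trylock(cpu)) {
        fn(cpu);
        big_percpu_unlock(cpu);
        return true;
    }
    ipis_sent[cpu]++;
    return false;
}
```

The sketch leaves out the overhead question above: the owning cpu would have
to drop and retake the lock on every transition to and from user mode.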

There is a complication that many of these flushing functions are written
using this_cpu operations that can only be run on the cpu owning the per
cpu section because the per cpu base is different on other processors. If
you want to change that then more expensive instructions have to be used.
So you end up with two different versions of the function.
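
To illustrate the two versions, here is a simplified userspace model, not
kernel code. The names (areas, current_cpu, fold_local, fold_remote) are made
up for the example. The local variant stands in for a this_cpu style function
that relies on the implicit per cpu base; the remote variant takes an explicit
cpu number and, as an assumption of this sketch, uses a full atomic exchange
because the owner could be updating the counter at the same time.

```c
#include <stdatomic.h>

#define NR_CPUS 4                 /* hypothetical CPU count */

struct pcpu_area {
    atomic_long stat_diff;        /* stand-in for a per cpu counter */
};

static struct pcpu_area areas[NR_CPUS];
static int current_cpu;           /* models the per cpu base of "this" cpu */

/*
 * Local version: only valid on the owning cpu. Plain (relaxed) load and
 * store are enough in this model because nobody else touches the data.
 */
static long fold_local(void)
{
    struct pcpu_area *p = &areas[current_cpu];
    long v = atomic_load_explicit(&p->stat_diff, memory_order_relaxed);

    atomic_store_explicit(&p->stat_diff, 0, memory_order_relaxed);
    return v;
}

/*
 * Remote version: address computed from an explicit cpu number, and a
 * more expensive atomic read-modify-write to fetch and clear the counter.
 */
static long fold_remote(int cpu)
{
    return atomic_exchange(&areas[cpu].stat_diff, 0L);
}
```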

