task isolation discussion at Linux Plumbers

From: Chris Metcalf
Date: Sat Nov 05 2016 - 00:19:55 EST


A bunch of people got together this week at the Linux Plumbers
Conference to discuss nohz_full, task isolation, and related stuff.
(Thanks to Thomas for getting everyone gathered at one place and time!)

Here are the notes I took; I welcome any corrections and follow-up.


== rcu_nocbs ==

We started out by discussing this option. It is automatically enabled
by nohz_full, but we spent a little while sidetracked by its
implementation, which currently uses one kthread per rcu flavor per
core. The suggestion was made (by Peter or Andy; I forget) that a
single kthread per core could handle all flavors by using a dedicated
worklist. Removing potentially dozens or hundreds of kthreads from
larger systems certainly seems like a win if this works out.

Paul said he would look into this possibility.
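
To make the suggestion concrete, here's a minimal sketch of my own
(not something written at the session) of one per-core kthread serving
every flavor from a single worklist; struct nocb_work and
invoke_flavor_callbacks() are made up for illustration:

    static DECLARE_WAIT_QUEUE_HEAD(nocb_wq);
    static LLIST_HEAD(nocb_worklist);      /* one per core in reality */

    struct nocb_work {
        struct llist_node node;
        int flavor;                  /* e.g. sched, bh, preempt */
        struct rcu_head *head;       /* callback batch for that flavor */
    };

    static int percpu_nocb_kthread(void *unused)
    {
        while (!kthread_should_stop()) {
            struct nocb_work *work, *tmp;
            struct llist_node *batch;

            wait_event_interruptible(nocb_wq,
                !llist_empty(&nocb_worklist) || kthread_should_stop());

            /* One kthread drains callbacks for every flavor queued on
             * this core, rather than waking one kthread per flavor. */
            batch = llist_del_all(&nocb_worklist);
            llist_for_each_entry_safe(work, tmp, batch, node)
                invoke_flavor_callbacks(work->flavor, work->head);
        }
        return 0;
    }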


== Remote statistics ==

We discussed the possibility of remote statistics gathering, i.e. load
average etc. The idea would be to have housekeeping core(s)
periodically iterate over the nohz cores, access their rq remotely,
and do update_current etc. A single housekeeping core should
presumably be able to handle this for all the nohz_full cores, since
we only need to do it quite infrequently.
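
As a strawman (again mine, not anything written at the session), the
housekeeping side might look roughly like this, where
update_remote_load() stands in for whatever stats folding the real
patch would do:

    static void housekeeping_update_stats(void)
    {
        int cpu;

        for_each_online_cpu(cpu) {
            struct rq *rq;

            if (!tick_nohz_full_cpu(cpu))
                continue;            /* cpu still ticks; nothing to do */

            /* Sample the remote runqueue and fold it into the global
             * load average instead of waking the nohz_full cpu. */
            rq = cpu_rq(cpu);
            raw_spin_lock_irq(&rq->lock);
            update_remote_load(rq);  /* hypothetical */
            raw_spin_unlock_irq(&rq->lock);
        }
    }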

Thomas suggested that this might be the last remaining thing that
needed to be done to allow disabling the current behavior of falling
back to a 1 Hz clock in nohz_full.

I believe Thomas said he had a patch to do this already.


== Remote LRU cache drain ==

One of the issues with task isolation currently is that the LRU cache
drain must be done prior to entering userspace, but it requires
interrupts to be enabled and thus can't be done atomically. My
previous patch series handled this by checking with interrupts
disabled, but then looping around with interrupts enabled to try to
drain the LRU pagevecs. Experimentally this works, but it's not
provable that it terminates, which is worrisome. Andy suggested adding
a percpu flag to disable creation of deferred work like LRU cache pages.

Thomas suggested using an RT "local lock" to guard the LRU cache
flush; he is planning on bringing the concept to mainline in any case.
However, after some discussion we converged on simply using a spinlock
to guard the appropriate resources. As a result, the
lru_add_drain_all() code that currently queues work on each remote cpu
to drain it can instead simply acquire the lock and drain that cpu remotely.
This means that a task isolation task no longer needs to worry about
being interrupted by SMP function call IPIs, so we don't have to deal
with this in the task isolation framework any more.
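
A rough sketch of the remote drain, assuming a hypothetical per-cpu
lock added alongside the existing pagevecs (the real patch would also
need to take it at every local pagevec use):

    static DEFINE_PER_CPU(spinlock_t, lru_drain_lock);

    void lru_add_drain_remote(int cpu)
    {
        spinlock_t *lock = &per_cpu(lru_drain_lock, cpu);

        /* Rather than queueing work on the remote cpu (and thereby
         * IPI'ing a task-isolated core), take its lock and drain its
         * pagevecs from the housekeeping core. */
        spin_lock(lock);
        lru_add_drain_cpu(cpu);
        spin_unlock(lock);
    }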

I don't recall anyone else volunteering to tackle this, so I will plan
to look at it. The patch to do that should be orthogonal to the
revised task isolation patch series.


== Quiescing vmstat ==

Another issue that task isolation handles is ensuring that the vmstat
worker is quiesced before returning to user space. Currently we
cancel the vmstat delayed work, then invoke refresh_cpu_vm_stats().
Neither of these appears safe to do in the interrupts-disabled context
just before return to userspace, because both can call schedule():
refresh_cpu_vm_stats() via a cond_resched() under CONFIG_NUMA, and
cancel_delayed_work_sync() via a schedule() in __cancel_work_timer().
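
For reference, the quiescing in my series amounts to roughly the
following; the two commented calls are the ones that can end up
sleeping:

    void quiet_vmstat_sync(void)
    {
        /* Can schedule() in __cancel_work_timer() while waiting. */
        cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work));

        /* Can cond_resched() under CONFIG_NUMA. */
        refresh_cpu_vm_stats(false);
    }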

Christoph offered to work with me to make sure that we could do the
appropriate quiescing with interrupts disabled, and seemed confident
it should be doable.


== Remote kernel TLB flush ==

Andy then brought up the issue of remote kernel TLB flush, which I've
been trying to sweep under the rug for the initial task isolation
series. Remote TLB flush causes an interrupt on many systems (x86 and
tile, for example, although not arm64), so to the extent that it
occurs frequently, it becomes important to handle for task isolation.
With the recent addition of vmap kernel stacks, this has suddenly
become much more important than it used to be, to the point where we
now really have to handle it for task isolation.

The basic insight here is that you can safely skip interrupting
userspace cores when you are sending remote kernel TLB flushes, since
by definition they can't touch the kernel pages in question anyway.
Then you just need to guarantee that the kernel TLB is flushed the
next time the userspace task re-enters the kernel.

The original Tilera dataplane code handled this by tracking task state
(kernel, user, or user-flushed) and manipulating the state atomically
at TLB flush time and kernel entry time. After some discussion of the
overheads of such atomics, Andy pointed out that there is already an
atomic increment being done in the RCU code, and we should be able to
leverage that word to achieve this effect. The idea is that remote
cores would do a compare-exchange of 0 to 1; success would indicate
that the remote core was in userspace and thus didn't need to be
IPI'd, but that it was now tagged for a kernel flush the next time the
remote task entered the kernel. Then, when the remote task enters the
kernel and does the atomic update of its own dynticks counter, it will
discover the low bit set and do a kernel TLB flush before continuing.
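
Here's a small userspace model of the proposed protocol. The state
encoding is my guess at the simplest version; the real thing would be
folded into the existing RCU dynticks word rather than a separate
counter:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Per-cpu state word: 0 = in userspace, 1 = in userspace with a
     * kernel TLB flush pending, anything else = in the kernel. */
    static _Atomic unsigned int cpu_state;

    /* Remote flusher: tag a userspace cpu instead of IPI'ing it. */
    static bool try_tag_for_deferred_flush(void)
    {
        unsigned int expected = 0;

        if (atomic_compare_exchange_strong(&cpu_state, &expected, 1))
            return true;    /* cpu was in userspace; no IPI needed */
        return false;       /* cpu is in the kernel; send the IPI */
    }

    /* Kernel entry on the target cpu: the atomic update that RCU
     * already does on entry would also observe the low bit. */
    static void kernel_entry(void)
    {
        unsigned int old = atomic_exchange(&cpu_state, 2 /* in kernel */);

        if (old & 1)
            printf("doing the deferred kernel TLB flush on entry\n");
    }

    static void kernel_exit(void)
    {
        atomic_store(&cpu_state, 0);    /* back to userspace */
    }

    int main(void)
    {
        kernel_exit();                  /* start out "in userspace" */
        if (try_tag_for_deferred_flush())
            printf("remote core skipped the IPI\n");
        kernel_entry();                 /* deferred flush happens here */
        return 0;
    }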

It was agreed that this makes sense to do unconditionally, since it's
not just helpful for nohz_full and task isolation, but also for idle,
since interrupting an idle core periodically just to do repeated
kernel TLB flushes isn't good for power consumption.

One open question is whether we discover the low bit set early enough
in kernel entry that we can trust that we haven't tried to touch any
pages that have been invalidated in the TLB.

Paul agreed to take a look at implementing this.


== Optimizing vfree via RCU ==

An orthogonal issue was also brought up, which is whether we could use
RCU to handle the kernel TLB flush from freeing vmaps; presumably if
we have enough vmap space, we can arrange to return the freed VA space
via RCU, and simply defer the TLB flush until the next grace period.
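
For illustration, a sketch of the deferred flush; free_deferred_va()
is a hypothetical helper that returns the range to the vmap allocator,
and the vfree() path would do call_rcu(&df->rcu, vfree_rcu_flush)
instead of flushing immediately:

    struct deferred_vfree {
        struct rcu_head rcu;
        unsigned long start, end;    /* freed VA range, not yet flushed */
    };

    static void vfree_rcu_flush(struct rcu_head *rcu)
    {
        struct deferred_vfree *df =
            container_of(rcu, struct deferred_vfree, rcu);

        /* A grace period has elapsed, so flush the range from the
         * kernel TLB and only then make the VA space reusable. */
        flush_tlb_kernel_range(df->start, df->end);
        free_deferred_va(df->start, df->end);    /* hypothetical */
        kfree(df);
    }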

I'm not sure this is practical if we encounter a high volume of
vfrees, and I don't think we reached a definitive agreement on it
during the discussion either.


== Disabling the dyn tick ==

One issue that the current task isolation patch series encounters is
when we request disabling the dyntick, but it doesn't happen. At the
moment we just busy-wait in the kernel (calling schedule() etc. as
needed) until the tick is properly disabled. No one is particularly
fond of this scheme. The consensus seems to be to try harder to figure
out what is going on, fix whatever problems exist, and then consider
it a regression going forward if something causes the dyntick to
become difficult to disable again in the future. I will take a look at
this and try to gather more data on whether and when this is happening
in 4.9.
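
For context, the busy-wait amounts to something like the following on
the exit-to-userspace path (simplified, with an illustrative function
name; the real loop also re-checks the other quiescing conditions):

    /* Entered with interrupts disabled. */
    static void task_isolation_wait_for_tick_stop(void)
    {
        while (!tick_nohz_tick_stopped()) {
            local_irq_enable();
            if (need_resched())
                schedule();    /* give deferred work a chance to run */
            cpu_relax();
            local_irq_disable();
        }
    }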


== Missing oneshot_stopped callbacks ==

I raised the issue that various clock_event_device sources don't
always support oneshot_stopped, which can cause an additional
final interrupt to occur after the timer infrastructure believes the
interrupt has been stopped. I have patches to fix this for tile and
arm64 in my patch series; Thomas volunteered to look at adding
equivalent support for x86.
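
The fix in a driver is typically just wiring the existing shutdown
handler (which masks the timer) into the oneshot_stopped state as
well; the names below are illustrative rather than taken from any of
the actual patches:

    static struct clock_event_device my_timer_evt = {
        .name                       = "my-timer",
        .features                   = CLOCK_EVT_FEAT_ONESHOT,
        .set_next_event             = my_timer_set_next_event,
        .set_state_shutdown         = my_timer_shutdown,
        /* New: mask the timer when the core switches the device to the
         * ONESHOT_STOPPED state, so no stray final interrupt arrives. */
        .set_state_oneshot_stopped  = my_timer_shutdown,
    };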


Many thanks to all those who participated in the discussion.
Frederic, we wished you had been there!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com