Re: [PATCH v15 04/13] task_isolation: add initial support

From: Chris Metcalf
Date: Mon Sep 12 2016 - 16:59:46 EST


On 9/12/2016 1:41 PM, Andy Lutomirski wrote:
> On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@xxxxxxxxxxxx> wrote:
>> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@xxxxxxxxxxxx> wrote:
>>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@xxxxxxxxxxxx> wrote:
>>>>>> So to pop up a level, what is your actual concern about the existing
>>>>>> "do it in a loop" model? The macrology currently in use means there
>>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>>> maintenance cost seems low since the idioms used for task isolation
>>>>>> in the loop are generally familiar to people reading that code.
>>>>> My concern is that it's not obvious to readers of the code that the
>>>>> loop ever terminates. It really ought to, but it's doing something
>>>>> very odd. Normally we can loop because we get scheduled out, but
>>>>> actually blocking in the return-to-userspace path, especially blocking
>>>>> on a condition that doesn't have a wakeup associated with it, is odd.

>>>> True, although, comments :-)

>>>> Regardless, this doesn't seem at all weird to me in the
>>>> context of the vmstat and lru stuff. It's exactly parallel to
>>>> the fact that we loop around on checking need_resched and signal, and
>>>> in some cases you could imagine going around multiple times: we schedule
>>>> out and get a signal, so we loop around again, and then another
>>>> reschedule event happens during signal processing so we go around
>>>> again, etc. Eventually it settles down. It's the same with the
>>>> vmstat/lru stuff.
>>> Only kind of.

>>> When we say, effectively, while (need_resched()) schedule();, we're
>>> not waiting for an event or condition per se. We're runnable (in the
>>> sense that userspace wants to run and we're not blocked on anything)
>>> the entire time -- we're simply yielding to some other thread that is
>>> also runnable. So if that loop runs forever, it either means that
>>> we're at low priority and we genuinely shouldn't be running or that
>>> there's a scheduler bug.
>>>
>>> If, on the other hand, we say while (not quiesced) schedule(); (or
>>> equivalent), we're saying that we're *not* really ready to run and
>>> that we're waiting for some condition to change. The condition in
>>> question is fairly complicated and won't wake us when we are ready. I
>>> can also imagine the scheduler getting rather confused, since, as far
>>> as the scheduler knows, we are runnable and we are supposed to be
>>> running.
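
Restating that contrast as code, with quiesced() as a made-up predicate
purely for illustration:

	/* Yielding while still runnable: no external condition involved. */
	while (need_resched())
		schedule();

	/* Waiting for a condition that has no wakeup attached to it. */
	while (!quiesced())
		schedule();

The second loop is the questionable one, since nothing promises to wake
us up when the condition finally changes.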

>> So, how about a code structure like this?
>>
>> In the main return-to-userspace loop where we check TIF flags,
>> we keep the notion of our TIF_TASK_ISOLATION flag that causes
>> us to invoke a task_isolation_prepare() routine. This routine
>> does the following things:
>>
>> 1. As you suggested, set a new TIF bit (or equivalent) that says the
>> system should no longer create deferred work on this core, and then
>> flush any necessary already-deferred work (currently, the LRU cache
>> and the vmstat stuff). We never have to go flush the deferred work
>> again during this task's return to userspace. Note that this bit can
>> only be set on a core marked for task isolation, so it can't be used
>> for denial-of-service type attacks on normal cores that are trying to
>> multitask normal Linux processes.
> I think it can't be a TIF flag unless you can do the whole mess with
> preemption off because, if you get preempted, other tasks on the CPU
> won't see the flag. You could do it with a percpu flag, I think.

Yes, a percpu flag - you're right. I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.
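
Just to make that concrete, a strawman sketch (all of the names here
are placeholders, not a proposal for the real header):

	/* linux/isolation.h (sketch) */
	#ifdef CONFIG_TASK_ISOLATION
	DECLARE_PER_CPU(bool, task_isolation_defer_off);

	static inline bool task_isolation_no_deferred_work(void)
	{
		return this_cpu_read(task_isolation_defer_off);
	}
	#else
	static inline bool task_isolation_no_deferred_work(void)
	{
		return false;
	}
	#endif

	/* ...and in task_isolation_prepare(), roughly: */
	this_cpu_write(task_isolation_defer_off, true);
	lru_add_drain();	/* flush the per-cpu LRU pagevecs */
	quiet_vmstat();		/* fold this cpu's vmstat deltas */

Then vmstat, LRU, etc. can test task_isolation_no_deferred_work()
before queueing new deferred work on the core.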

>> 2. Check if the dyntick is stopped, and if not, wait on a completion
>> that will be set when it does stop. This means we may schedule out at
>> this point, but when we return, the deferred work stuff is still safe
>> since your bit is still set, and in principle the dyntick is stopped.
>>
>> Then, after we disable interrupts and re-read the thread-info flags,
>> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
>> set that would keep us in the loop. This will always end up happening
>> on each return to userspace, since the only thing that actually clears
>> the bit is a prctl() call. When that happens we know we are about to
>> return to userspace, so we call task_isolation_ready(), which now has
>> two things to do:
> Would it perhaps be more straightforward to do the stuff before the
> loop and not check TIF_TASK_ISOLATION in the loop?

We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task. We'd
clearly like that other task to run before the isolated task returns to
userspace. But then, that other task might do something to re-enable
the dyntick. That's why we'd like to recheck that the dyntick is off in
the loop after each potential call to schedule().
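
To sketch the loop shape I have in mind (very rough; OTHER_WORK_MASK is
an invented name and the details are elided):

	u32 cached_flags = READ_ONCE(current_thread_info()->flags);

	while (true) {
		local_irq_enable();

		if (cached_flags & _TIF_NEED_RESCHED)
			schedule();

		/* ... signal handling, notify-resume, etc. ... */

		if (cached_flags & _TIF_TASK_ISOLATION)
			task_isolation_prepare();  /* may block on the completion */

		local_irq_disable();
		cached_flags = READ_ONCE(current_thread_info()->flags);

		/*
		 * Only head back to userspace once no other work is
		 * pending and task_isolation_ready() confirms the
		 * dyntick really is stopped.
		 */
		if (!(cached_flags & OTHER_WORK_MASK) &&
		    (!(cached_flags & _TIF_TASK_ISOLATION) ||
		     task_isolation_ready()))
			break;
	}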

>> 1. We check that the dyntick is in fact stopped, since it's possible
>> that a race condition led to it being somehow restarted by an interrupt.
>> If it is not stopped, we go around the loop again so we can go back
>> into the completion discussed above and wait some more. This may merit
>> a WARN_ON or other notice since it seems like people aren't convinced
>> there are things that could restart it, but honestly the dyntick stuff
>> is complex enough that I think a belt-and-suspenders kind of test here
>> at the last minute is just defensive programming.
> Seems reasonable. But maybe this could go after the loop and, if the
> dyntick is back, it could be treated like any other kernel bug that
> interrupts an isolated task? That would preserve more of the existing
> code structure.

Well, we can certainly try it that way. If I move it out and my testing
doesn't trigger the bug, that's at least an initial sign that it might be
OK. But I worry/suspect that it will trip up at some point in some use
case and we'll have to fix it at that point.
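
In other words, roughly this after the loop instead of inside it
(sketch only):

	/* After the work loop, just before returning to userspace: */
	if (test_thread_flag(TIF_TASK_ISOLATION) &&
	    WARN_ON_ONCE(!tick_nohz_tick_stopped())) {
		/*
		 * The tick came back despite everything; treat it like
		 * any other kernel interruption of an isolated task and
		 * report it, rather than looping and waiting again.
		 */
	}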

> If that works, it could go in user_enter().

Presumably with trace_user_enter() and vtime_user_enter()
in __context_tracking_enter()?
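
For concreteness, that would put it roughly here (abridged from
kernel/context_tracking.c, so details may be off; the
task_isolation_user_enter() hook is just a placeholder name):

	/* in __context_tracking_enter(), in the CONTEXT_USER case: */
	if (state == CONTEXT_USER) {
		trace_user_enter(0);
		vtime_user_enter(current);
		task_isolation_user_enter();	/* hypothetical hook */
	}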

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com