On Sep 9, 2016 1:40 PM, "Chris Metcalf" <cmetcalf@xxxxxxxxxxxx> wrote:
> On 9/2/2016 1:28 PM, Andy Lutomirski wrote:
>> On Sep 2, 2016 7:04 AM, "Chris Metcalf" <cmetcalf@xxxxxxxxxxxx> wrote:
>>> On 8/30/2016 3:50 PM, Andy Lutomirski wrote:
>>>> On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf <cmetcalf@xxxxxxxxxxxx> wrote:
>>>>> So to pop up a level, what is your actual concern about the existing
>>>>> "do it in a loop" model?  The macrology currently in use means there
>>>>> is zero cost if you don't configure TASK_ISOLATION, and the software
>>>>> maintenance cost seems low, since the idioms used for task isolation
>>>>> in the loop are generally familiar to people reading that code.
>>>> My concern is that it's not obvious to readers of the code that the
>>>> loop ever terminates.  It really ought to, but it's doing something
>>>> very odd.  Normally we can loop because we get scheduled out, but
>>>> actually blocking in the return-to-userspace path, especially blocking
>>>> on a condition that doesn't have a wakeup associated with it, is odd.
>>> True, although, comments :-)
>>>
>>> Regardless, this doesn't seem at all weird to me in the context of
>>> the vmstat and lru stuff.  It's exactly parallel to the fact that we
>>> loop around on checking need_resched and signals; in some cases you
>>> can imagine multiple trips around the loop: we schedule out and then
>>> get a signal, so we loop around again, and then another reschedule
>>> event happens during signal processing, so we go around again, etc.
>>> Eventually it settles down.  It's the same with the vmstat/lru stuff.
>> Only kind of.
>>
>> When we say, effectively, while (need_resched()) schedule();, we're
>> not waiting for an event or condition per se.  We're runnable (in the
>> sense that userspace wants to run and we're not blocked on anything)
>> the entire time -- we're simply yielding to some other thread that is
>> also runnable.  So if that loop runs forever, it either means that
>> we're at low priority and we genuinely shouldn't be running, or that
>> there's a scheduler bug.
>>
>> If, on the other hand, we say while (!quiesced) schedule(); (or
>> equivalent), we're saying that we're *not* really ready to run and
>> that we're waiting for some condition to change.  The condition in
>> question is fairly complicated and won't wake us when we are ready.  I
>> can also imagine the scheduler getting rather confused, since, as far
>> as the scheduler knows, we are runnable and we are supposed to be
>> running.
> So, how about a code structure like this?
>
> In the main return-to-userspace loop where we check TIF flags,
> we keep the notion of our TIF_TASK_ISOLATION flag that causes
> us to invoke a task_isolation_prepare() routine.  This routine
> does the following things:
>
> 1. As you suggested, set a new TIF bit (or equivalent) that says the
> system should no longer create deferred work on this core, and then
> flush any necessary already-deferred work (currently, the LRU cache
> and the vmstat stuff).  We never have to go flush the deferred work
> again during this task's return to userspace.  Note that this bit can
> only be set on a core marked for task isolation, so it can't be used
> for denial-of-service attacks on normal cores that are trying to
> multitask normal Linux processes.
I think it can't be a TIF flag unless you can do the whole mess with
preemption off because, if you get preempted, other tasks on the CPU
won't see the flag.  You could do it with a percpu flag, I think.
> 2. Check if the dyntick is stopped, and if not, wait on a completion
> that will be set when it does stop.  This means we may schedule out at
> this point, but when we return, the deferred work stuff is still safe,
> since your bit is still set, and in principle the dyntick is
> stopped.
> Then, after we disable interrupts and re-read the thread-info flags,
> we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
> set that would keep us in the loop.  This will always end up happening
> on each return to userspace, since the only thing that actually clears
> the bit is a prctl() call.  When that happens, we know we are about to
> return to userspace, so we call task_isolation_ready(), which now has
> two things to do:
Would it perhaps be more straightforward to do the stuff before the
loop and not check TIF_TASK_ISOLATION in the loop?
> 1. We check that the dyntick is in fact stopped, since it's possible
> that a race condition led to it being somehow restarted by an interrupt.
> If it is not stopped, we go around the loop again so we can go back in
> to the completion discussed above and wait some more.  This may merit
> a WARN_ON or other notice, since it seems like people aren't convinced
> there are things that could restart it, but honestly the dyntick stuff
> is complex enough that I think a belt-and-suspenders kind of test here
> at the last minute is just defensive programming.
Seems reasonable.  But maybe this could go after the loop and, if the
dyntick is back, it could be treated like any other kernel bug that
interrupts an isolated task?  That would preserve more of the existing
code structure.

If that works, it could go in user_enter().