On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
Good, although that quiescing on kernel return must be an option.
So what happens if an interrupt does occur?
But then in this mode, what happens when an interrupt triggers?
On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
TL;DR: Let's make an explicit decision about whether task isolation
should be "persistent" or "one-shot". Both have some advantages.
=====
An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:
"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely. It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.
In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process. This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.
Can you spell out why you think turning it off is helpful? I'll admit
this is the default mode in the commercial version of task isolation
that we ship, and was also the default in the first LKML patch series.
But on consideration I haven't found scenarios where skipping the
quiescing is helpful. Admittedly you get out of the kernel faster,
but then you're back in userspace and vulnerable to yet more
unexpected interrupts until the timer quiesces. If you're asking for
task isolation, this is surely not what you want.
I just feel that quiescing on the way back to userspace after an
unwanted interruption is awkward. The quiescing should work once and
for all on return from the prctl. If we still get disturbed afterward,
either the quiescing is buggy or incomplete, or something is in the
way that cannot be quiesced.
I'm not actually sure what you're recommending we do to avoid
exceptions. Since they're synchronous and deterministic, we can't
really avoid them if the program wants to issue them. For example,
mmap() some anonymous memory and then start running, and you'll take
exceptions each time you touch a page in that mapped region. I'd argue
it's an application bug; one should enable "strict" mode to catch and
deal with such bugs.
They are not all deterministic. For example a breakpoint, a step, a trap
can be set up by another process. So this is not entirely under the control
of the user.
That's true, but I'd argue the behavior in that case should be that you can
raise that kind of exception validly (so you can debug), and then you should
quiesce on return to userspace so the application doesn't see additional
exceptions.
I don't see how we can quiesce such things.
There are two ways you could handle debugging:
1. Require the program to set the flag that says it doesn't want a signal
when it is interrupted (so you can interrupt it to debug it, and not kill it);
That's rather about exceptions, right?
Here's what I am inclined towards:
- Default mode (hard isolation / "strict") - leave userspace, get a signal, no exceptions.
Ok.
- "No signal" mode - leave userspace synchronously (syscall/exception), get quiesced on
return, no signals. But asynchronous interrupts still cause a signal since they are
not expected to occur.
So only interrupts cause a signal in this mode? Exceptions and syscalls are permitted, right?
- Soft mode (I don't think we want this) - like "no signal" except you don't even quiesce
on return to userspace, and asynchronous interrupts don't even cause a signal.
It's basically "best effort", just nohz_full plus the code that tries to get things
like LRU or vmstat to run before returning to userspace. I think there isn't enough
"value add" to make this a separate mode, though.
I can imagine HPC users wanting this mode.
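The three modes discussed above can be summarized as a small decision table. The following is only a sketch of the semantics as described in this thread, not code from the patch series: the names iso_mode, iso_entry, iso_action and iso_on_kernel_entry are all hypothetical, and the choice not to quiesce after a signal has been sent is an assumption, since the thread does not say.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; none come from the patch series. */
enum iso_mode  { ISO_STRICT, ISO_NO_SIGNAL, ISO_SOFT };
enum iso_entry { ENTRY_SYSCALL, ENTRY_EXCEPTION, ENTRY_INTERRUPT };

struct iso_action {
    bool send_signal; /* tell the task its isolation was broken */
    bool quiesce;     /* settle timers etc. before returning to userspace */
};

/* What happens when an isolated task enters the kernel, per mode. */
struct iso_action iso_on_kernel_entry(enum iso_mode mode, enum iso_entry entry)
{
    struct iso_action act = { false, false };
    bool synchronous = (entry == ENTRY_SYSCALL || entry == ENTRY_EXCEPTION);

    switch (mode) {
    case ISO_STRICT:
        /* Any departure from userspace results in a signal. */
        act.send_signal = true;
        break;
    case ISO_NO_SIGNAL:
        /* Syscalls and exceptions are allowed and quiesced on return;
         * asynchronous interrupts are unexpected and still signal. */
        if (synchronous)
            act.quiesce = true;
        else
            act.send_signal = true;
        break;
    case ISO_SOFT:
        /* Best effort: no signal and no quiescing on return. */
        break;
    default:
        assert(!"unknown isolation mode");
    }
    return act;
}
```

Written this way, the "soft" row is simply the empty action, which matches the remark that it adds little beyond nohz_full itself.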
You're right that migration conflicts with task isolation. But
certainly, if a task has enabled "strict" semantics, it can't migrate;
it will lose task isolation entirely and get a signal instead,
regardless of whether it calls sched_setaffinity() on itself, or if
someone else changes its affinity and it gets a kick.
Yes.
However, if a task doesn't have strict mode enabled, it can call
sched_setaffinity() and force itself onto a non-task_isolation cpu and
it won't get any isolation until it schedules itself back onto a
task_isolation cpu, at which point it wakes up on the new cpu with
hard isolation still in effect. I can make up reasons why this sort
of thing might be useful, but it's probably a corner case.
That doesn't look sane. The user asks the kernel to get away as much
as it can but if we are in a non-nohz-full CPU we know we can't provide that
service (or rather that non-service).
So we would refuse to enter task isolation mode if the task doesn't run on a
full dynticks CPU, whereas we accept that it migrates later to a periodic
CPU? This isn't consistent.
Yes, and originally I made that consistent by not checking when it started
up, either, but I was subsequently convinced that the checks were good for
sanity.
Sure, sanity checks are good, but if you refuse the prctl() by returning an error
on the basis of this sanity condition, the task shouldn't be able to later reach
that insane state without being properly kicked out of the feature provided by
the prctl().
Otherwise perhaps just drop a warning.
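To make the consistency argument concrete, here is a minimal userspace model of the "kick out on migration" option. It is a sketch under stated assumptions: struct iso_task, iso_enter(), iso_migrate() and the isolable-CPU bitmask are invented for illustration, and returning -EINVAL from the entry check is an assumption rather than something the thread specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8 /* illustrative size for the toy CPU mask */

/* Hypothetical per-task state; not a kernel structure. */
struct iso_task {
    int  cpu;            /* CPU the task currently runs on */
    bool isolated;       /* task isolation currently in effect */
    bool signal_pending; /* task was notified that isolation was lost */
};

/* Bit n of isolable_mask stands in for "CPU n is a nohz_full CPU". */
bool cpu_is_isolable(unsigned int isolable_mask, int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    return (isolable_mask >> cpu) & 1u;
}

/* Entry check: refuse isolation on a CPU that cannot provide it. */
int iso_enter(struct iso_task *t, unsigned int isolable_mask)
{
    if (!cpu_is_isolable(isolable_mask, t->cpu))
        return -EINVAL; /* assumed error code */
    t->isolated = true;
    return 0;
}

/* Migration applies the same condition, so the state refused at entry
 * cannot be reached later without the task being kicked out. */
void iso_migrate(struct iso_task *t, unsigned int isolable_mask, int new_cpu)
{
    t->cpu = new_cpu;
    if (t->isolated && !cpu_is_isolable(isolable_mask, new_cpu)) {
        t->isolated = false;
        t->signal_pending = true;
    }
}
```

The point of the model is that both paths test one predicate; the alternative mentioned above, dropping the entry error in favor of a warning, would remove the check from iso_enter() instead.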
Googling "Zero-Overhead Linux" does take you to some discussions
of customers that have used this functionality.
So those workloads couldn't stand an interrupt? Like they would like a signal
and exit the strict mode if it happens?
Correct, they couldn't tolerate interrupts. If one happened, it would cause packets to
be dropped and some kind of logging would fire to report the problem.
Ok. And is it this mode you're interested in? Isn't quiescing an issue in this mode?
So maybe something like this:
PR_TASK_ISOLATION_ENABLE - turn on basic strict/signaling mode
PR_TASK_ISOLATION_ALLOW_SYSCALLS - for syscalls, no signal, just quiesce before return
PR_TASK_ISOLATION_ALLOW_EXCEPTIONS - for all exceptions, no signal, quiesce before return
It might make sense to say you would allow page faults, for example, but not general
exceptions. But my guess is that the exception-related stuff really does need an
application use case to account for it. I would say for the initial support of task
isolation, we have a clearly-understood model for allowing syscalls (e.g. stuff
like generating diagnostics on error or slow paths), but not really a model for
understanding why users would want to take exceptions, so I'd say let's omit
that initially, and maybe just add the _ALLOW_SYSCALLS flag.
Ok. That interface looks better. At least we can start with just PR_TASK_ISOLATION_ENABLE which
does strict pure isolation mode and have future flags for more granularity.
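As an illustration of how the proposed flags might compose, the sketch below models them as a bitmask. The flag names are the ones proposed in this thread, but the bit values, the enum iso_entry type, and the helpers iso_flags_valid() and iso_entry_signals() are assumptions made up for this example; no actual prctl() call is made.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names as proposed in the thread; the bit values are assumed. */
#define PR_TASK_ISOLATION_ENABLE           (1u << 0)
#define PR_TASK_ISOLATION_ALLOW_SYSCALLS   (1u << 1)
#define PR_TASK_ISOLATION_ALLOW_EXCEPTIONS (1u << 2)

#define ISO_ALL_FLAGS (PR_TASK_ISOLATION_ENABLE | \
                       PR_TASK_ISOLATION_ALLOW_SYSCALLS | \
                       PR_TASK_ISOLATION_ALLOW_EXCEPTIONS)

enum iso_entry { ENTRY_SYSCALL, ENTRY_EXCEPTION, ENTRY_INTERRUPT };

/* The ALLOW_* bits only refine ENABLE, so they are rejected without it. */
bool iso_flags_valid(unsigned int flags)
{
    if (flags & ~ISO_ALL_FLAGS)
        return false;
    if (flags == 0)
        return true; /* isolation off */
    return (flags & PR_TASK_ISOLATION_ENABLE) != 0;
}

/* Does this kind of kernel entry deliver a signal to the task? */
bool iso_entry_signals(unsigned int flags, enum iso_entry entry)
{
    assert(iso_flags_valid(flags));
    if (!(flags & PR_TASK_ISOLATION_ENABLE))
        return false;
    switch (entry) {
    case ENTRY_SYSCALL:
        return !(flags & PR_TASK_ISOLATION_ALLOW_SYSCALLS);
    case ENTRY_EXCEPTION:
        return !(flags & PR_TASK_ISOLATION_ALLOW_EXCEPTIONS);
    case ENTRY_INTERRUPT:
        return true; /* asynchronous interrupts are never expected */
    }
    return true;
}
```

Starting with only PR_TASK_ISOLATION_ENABLE corresponds to passing a mask with just the first bit set; the other bits can be added later without changing how existing callers behave.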