Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality

From: Andy Lutomirski
Date: Tue Sep 29 2015 - 13:58:18 EST

On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote:
> On 09/28/2015 06:43 PM, Andy Lutomirski wrote:
>> Why are we treating alarms as something that should defer entry to
>> userspace? I think it would be entirely reasonable to set an alarm
>> for ten minutes, ask for isolation, and then think hard for ten
>> minutes.
>> A bigger issue would be if there's an RT task that asks for isolation
>> and a bunch of other stuff (most notably KVM hosts) running with
>> unconstrained affinity at full load. If task_isolation_enter always
>> sleeps, then your KVM host will get scheduled, and it'll ask for a
>> user return notifier on the way out, and you might just loop forever.
>> Can this happen?
> task_isolation_enter() doesn't sleep - it spins. This is intentional,
> because the point is that there should be nothing else that
> could be scheduled on that cpu. We're just waiting for any
> pending kernel management timer interrupts to fire.
> In any case, you normally wouldn't have a KVM host running
> on an isolcpus, nohz_full cpu, unless it was the only thing
> running there, I imagine (just as would be true for any other
> host process).

The problem is that AFAICT systemd (and possibly other things) makes
it rather painful to guarantee that nothing low-priority (such as
systemd itself) would schedule on an arbitrary CPU. I would hope that merely
setting affinity and RT would be enough to get isolation without
playing fancy cgroup games. Maybe not.
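
As a rough illustration of what "merely setting affinity and RT" amounts
to from userspace, here is a minimal sketch. The helper name
pin_and_make_rt and the default priority of 1 are made up for
illustration; the os.sched_* calls are the standard Linux wrappers in
Python.

```python
import os

def pin_and_make_rt(cpu, priority=1):
    """Pin the calling process to one CPU and request SCHED_FIFO.

    Hypothetical helper, for illustration only.
    Returns a (pinned, realtime) pair of booleans.
    """
    pinned = realtime = False
    try:
        os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
        pinned = True
    except OSError:
        pass  # e.g. the CPU is not in the allowed cpuset
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        realtime = True
    except OSError:
        pass  # typically EPERM without CAP_SYS_NICE or an RLIMIT_RTPRIO allowance
    return pinned, realtime
```

Note that this only constrains the calling task. It does nothing to keep
other tasks with unconstrained affinity off the same CPU, which is
exactly the concern here.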

>> ISTM something's suboptimal with the inner workings of all this if
>> task_isolation_enter needs to sleep to wait for an event that isn't
>> scheduled for the immediate future (e.g. already queued up as an
>> interrupt).
> Scheduling a timer for 10 minutes away is typically done by
> scheduling timers for the max timer granularity (which could
> be just a few seconds) and then waking up a couple of hundred
> times between now and now+10 minutes. Doing this breaks
> the task isolation guarantee, so we can't return to userspace
> while something like that is pending. You'd have to do it
> by polling in userspace to avoid the unexpected interrupts.

Really? That sucks. Hopefully we can fix it.
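
For a sense of scale, the re-arm model described above can be sketched
as simple arithmetic. The function name rearm_wakeups is hypothetical,
and the 3-second granularity is only an assumed stand-in for "a few
seconds"; the real limit depends on the hardware timer.

```python
def rearm_wakeups(delay_s, max_granularity_s):
    """Number of timer interrupts needed to cover delay_s seconds when
    each one-shot timer can be armed at most max_granularity_s seconds
    ahead. Both arguments are integers; the value is a ceiling division.
    """
    assert delay_s >= 0 and max_granularity_s > 0
    return -(-delay_s // max_granularity_s)
```

With a ten-minute alarm and the assumed 3-second limit,
rearm_wakeups(600, 3) gives 200, in line with the "couple of hundred"
wakeups quoted above. A timer that could be armed for the full ten
minutes in one shot would need a single interrupt.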

> I suppose if your hardware supported it, you could imagine
> a mode where userspace can request an alarm a specific
> amount of time in the future, and the task isolation code
> would then ignore an alarm that was going off at that
> specific time. But I'm not sure what hardware does support
> that (I know tile uses the "few seconds and re-arm" model),
> and it seems like a pretty corner use-case. We could
> certainly investigate adding such support later, but I don't see
> it as part of the core value proposition for task isolation.

Intel chips Sandy Bridge and newer certainly support this. Other chips
might support it as well. Whether the kernel is able to program the
TSC deadline timer like that is a different question.
