Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality

From: Andy Lutomirski
Date: Tue Sep 29 2015 - 13:58:18 EST


On Tue, Sep 29, 2015 at 10:42 AM, Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote:
> On 09/28/2015 06:43 PM, Andy Lutomirski wrote:
>>
>> Why are we treating alarms as something that should defer entry to
>> userspace? I think it would be entirely reasonable to set an alarm
>> for ten minutes, ask for isolation, and then think hard for ten
>> minutes.
>>
>> A bigger issue would be if there's an RT task that asks for isolation
>> and a bunch of other stuff (most notably KVM hosts) running with
>> unconstrained affinity at full load. If task_isolation_enter always
>> sleeps, then your KVM host will get scheduled, and it'll ask for a
>> user return notifier on the way out, and you might just loop forever.
>> Can this happen?
>
>
> task_isolation_enter() doesn't sleep - it spins. This is intentional,
> because the point is that there should be nothing else that
> could be scheduled on that cpu. We're just waiting for any
> pending kernel management timer interrupts to fire.
>
> In any case, you normally wouldn't have a KVM host running
> on an isolcpus, nohz_full cpu, unless it was the only thing
> running there, I imagine (just as would be true for any other
> host process).

The problem is that AFAICT systemd (and possibly other things) makes
it rather painful to guarantee that nothing low-priority (such as
systemd itself) would schedule on an arbitrary CPU. I would hope that merely
setting affinity and RT would be enough to get isolation without
playing fancy cgroup games. Maybe not.
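
For concreteness, "merely setting affinity and RT" would look roughly
like the sketch below. It is written in Python for brevity and uses the
standard os.sched_* wrappers; the function name and the default priority
are made up for illustration, and nothing here touches the task
isolation patches themselves.

```python
import os

def pin_and_make_rt(cpu, priority=1):
    """Pin the calling process to one CPU, then try to switch it to SCHED_FIFO.

    Returns True if the RT policy was applied, False if the kernel refused
    (typically because the caller lacks CAP_SYS_NICE).
    """
    # Restrict this process (pid 0 means "self") to a single CPU.
    os.sched_setaffinity(0, {cpu})
    try:
        # Request the FIFO real-time policy at the given priority.
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        return False
    return True
```

Note that this only constrains the calling task. It does nothing to keep
other tasks off that CPU, which is the part that gets painful.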

>
>> ISTM something's suboptimal with the inner workings of all this if
>> task_isolation_enter needs to sleep to wait for an event that isn't
>> scheduled for the immediate future (e.g. already queued up as an
>> interrupt).
>
>
> Scheduling a timer for 10 minutes away is typically done by
> scheduling timers for the max timer granularity (which could
> be just a few seconds) and then waking up a couple of hundred
> times between now and now+10 minutes. Doing this breaks
> the task isolation guarantee, so we can't return to userspace
> while something like that is pending. You'd have to do it
> by polling in userspace to avoid the unexpected interrupts.
>

Really? That sucks. Hopefully we can fix it.
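
To put rough numbers on the re-arm model quoted above: the 3-second
maximum period below is an assumed value (the quoted text only says "a
few seconds"), and the helper name is made up.

```python
import math

def timer_expirations(total_delay_s, max_period_s):
    """Count the timer interrupts needed to cover total_delay_s when a
    single timer can be programmed at most max_period_s into the future."""
    if total_delay_s <= 0:
        return 0
    if max_period_s <= 0:
        raise ValueError("max_period_s must be positive")
    return math.ceil(total_delay_s / max_period_s)

# A 10-minute alarm with an assumed 3-second maximum period.
expirations = timer_expirations(10 * 60, 3)
print(expirations)  # 200; all but the last are wakeups the task did not ask for
```

That matches the "couple of hundred" figure, and every one of those
intermediate expirations is an interrupt on the isolated CPU.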

> I suppose if your hardware supported it, you could imagine
> a mode where userspace can request an alarm a specific
> amount of time in the future, and the task isolation code
> would then ignore an alarm that was going off at that
> specific time. But I'm not sure what hardware does support
> that (I know tile uses the "few seconds and re-arm" model),
> and it seems like a pretty corner use-case. We could
> certainly investigate adding such support later, but I don't see
> it as part of the core value proposition for task isolation.
>

Intel chips Sandy Bridge and newer certainly support this. Other chips
might support it as well. Whether the kernel is able to program the
TSC deadline timer like that is a different question.
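
Checking whether a given x86 box advertises the feature is the easy
part. On Linux it shows up as the tsc_deadline_timer flag in
/proc/cpuinfo. The helper names below are made up, and the sketch says
nothing about how far ahead the kernel will actually program the timer.

```python
def has_tsc_deadline(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists
    the tsc_deadline_timer feature."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "tsc_deadline_timer" in value.split():
            return True
    return False

def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo text, or return an empty string if it is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""

# Result depends on the machine this runs on.
print(has_tsc_deadline(read_cpuinfo()))
```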

--Andy