Re: [PATCH v7 07/11] arch/x86: enable task isolation functionality

From: Andy Lutomirski
Date: Mon Sep 28 2015 - 18:44:20 EST


On Mon, Sep 28, 2015 at 2:57 PM, Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote:
> On 09/28/2015 04:59 PM, Andy Lutomirski wrote:
>>
>> On Mon, Sep 28, 2015 at 11:17 AM, Chris Metcalf <cmetcalf@xxxxxxxxxx>
>> wrote:
>>>
>>> In prepare_exit_to_usermode(), we would like to call
>>> task_isolation_enter() on every return to userspace, and like
>>> other work items, we would like to recheck for more work after
>>> calling it, since it will enable interrupts internally.
>>>
>>> However, if task_isolation_enter() is the only work item,
>>> and it has already been called once, we don't want to continue
>>> calling it in a loop. We don't have a dedicated TIF flag for
>>> task isolation, and it wouldn't make sense to have one, since
>>> we'd want to set it before starting exit every time, and then
>>> clear it the first time around the loop.
>>>
>>> Instead, we change the loop structure somewhat, so that we
>>> have a more inclusive set of flags that are tested for on the
>>> first entry to the function (including TIF_NOHZ), and if any
>>> of those flags are set, we enter the loop. And, we do the
>>> task_isolation() test unconditionally at the bottom of the loop,
>>> but then when making the decision to loop back, we just use the
>>> set of flags that doesn't include TIF_NOHZ. That way we only
>>> loop if there is other work to do, but then if that work
>>> is done, we again unconditionally call task_isolation_enter().
>>>
>>> In syscall_trace_enter_phase1(), we try to add the necessary
>>> support for strict-mode detection of syscalls in an optimized
>>> way, by letting the code remain unchanged if we are not using
>>> TASK_ISOLATION, but otherwise calling enter_from_user_mode()
>>> the first time we see _TIF_NOHZ, and then waiting until
>>> after we do the secure computing work to actually clear the bit
>>> from the "work" variable and call task_isolation_syscall().
>>>
>>> Signed-off-by: Chris Metcalf <cmetcalf@xxxxxxxxxx>
>>> ---
>>> arch/x86/entry/common.c | 47 ++++++++++++++++++++++++++++++++++++-----------
>>> 1 file changed, 36 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>>> index 80dcc9261ca3..0f74389c6f3b 100644
>>> --- a/arch/x86/entry/common.c
>>> +++ b/arch/x86/entry/common.c
>>> @@ -21,6 +21,7 @@
>>> #include <linux/context_tracking.h>
>>> #include <linux/user-return-notifier.h>
>>> #include <linux/uprobes.h>
>>> +#include <linux/isolation.h>
>>>
>>> #include <asm/desc.h>
>>> #include <asm/traps.h>
>>> @@ -81,7 +82,8 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>> */
>>> if (work & _TIF_NOHZ) {
>>> enter_from_user_mode();
>>> - work &= ~_TIF_NOHZ;
>>> + if (!IS_ENABLED(CONFIG_TASK_ISOLATION))
>>> + work &= ~_TIF_NOHZ;
>>> }
>>> #endif
>>>
>>> @@ -131,6 +133,13 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
>>> }
>>> #endif
>>>
>>> + /* Now check task isolation, if needed. */
>>> + if (IS_ENABLED(CONFIG_TASK_ISOLATION) && (work & _TIF_NOHZ)) {
>>> + work &= ~_TIF_NOHZ;
>>> + if (task_isolation_strict())
>>> + task_isolation_syscall(regs->orig_ax);
>>> + }
>>> +
>>
>> This is IMO rather nasty. Can you try to find a way to do this
>> without making the control flow depend on config options?
>
>
> Well, I suppose this is the best argument for testing for task
> isolation before seccomp :-)
>
> Honestly, if not, it's tricky to see how to do better; I did spend
> some time looking at it. One possibility is to just unconditionally
> clear _TIF_NOHZ before testing "work == 0", so that we can
> test (work & TIF_NOHZ) once early and once after seccomp.
> This presumably costs a cycle in the no-nohz-full case.
>
> So maybe just do it before seccomp...
>
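
Purely as a rough, untested sketch of what I read that as: fold the
task-isolation check into the existing _TIF_NOHZ test before the seccomp
work, reusing the helpers from your patch, so the flag can be cleared
unconditionally and the control flow no longer depends on the config
option:

        if (work & _TIF_NOHZ) {
                enter_from_user_mode();
                /*
                 * Sketch only: do the strict-mode syscall check right
                 * here, before the seccomp work, at the point where we
                 * already test _TIF_NOHZ for context tracking, and
                 * then clear the flag unconditionally.
                 */
                if (IS_ENABLED(CONFIG_TASK_ISOLATION) &&
                    task_isolation_strict())
                        task_isolation_syscall(regs->orig_ax);
                work &= ~_TIF_NOHZ;
        }
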
>> What guarantees that TIF_NOHZ is an acceptable thing to check?
>
>
> Well, TIF_NOHZ is set on all tasks whenever we are running with
> nohz_full enabled anywhere, so testing it lets us do stuff on
> the fastpath without slowing down the fastpath much.
> See context_tracking_cpu_set().
>
>
>>> /* Do our best to finish without phase 2. */
>>> if (work == 0)
>>> return ret; /* seccomp and/or nohz only (ret == 0 here) */
>>> @@ -217,10 +226,26 @@ static struct thread_info *pt_regs_to_thread_info(struct pt_regs *regs)
>>> /* Called with IRQs disabled. */
>>> __visible void prepare_exit_to_usermode(struct pt_regs *regs)
>>> {
>>> + u32 cached_flags;
>>> +
>>> if (WARN_ON(!irqs_disabled()))
>>> local_irq_disable();
>>>
>>> /*
>>> + * We may want to enter the loop here unconditionally to make
>>> + * sure to do some work at least once. Test here for all
>>> + * possible conditions that might make us enter the loop,
>>> + * and return immediately if none of them are set.
>>> + */
>>> + cached_flags = READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>>> + if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
>>> + _TIF_UPROBE | _TIF_NEED_RESCHED |
>>> + _TIF_USER_RETURN_NOTIFY | _TIF_NOHZ))) {
>>> + user_enter();
>>> + return;
>>> + }
>>> +
>>
>> Too complicated and too error prone.
>>
>> In any event, I don't think that the property you actually want is for
>> the loop to be entered once. I think the property you want is that
>> we're isolated by the time we're finished. Why not just check that
>> directly in the loop condition?
>
>
> So something like this (roughly):
>
> if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
> _TIF_UPROBE | _TIF_NEED_RESCHED |
> _TIF_USER_RETURN_NOTIFY)) &&
> + task_isolation_done())
> break;
>
> i.e. just add the one extra call? That could work, I suppose.
> In the body we would then keep the proposed logic that unconditionally
> calls task_isolation_enter().

Yeah, I think so.
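
Roughly like this, I'd imagine; untested sketch only, with
task_isolation_done() and task_isolation_enter() as in your series, and
assuming task_isolation_done() simply returns true for tasks that never
asked for isolation:

        while (true) {
                u32 cached_flags =
                        READ_ONCE(pt_regs_to_thread_info(regs)->flags);

                /*
                 * Exit only when there is no other work left *and*
                 * task isolation says we are quiesced.
                 */
                if (!(cached_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME |
                                      _TIF_UPROBE | _TIF_NEED_RESCHED |
                                      _TIF_USER_RETURN_NOTIFY)) &&
                    task_isolation_done())
                        break;

                /* We have work to do.  Let's do it. */
                local_irq_enable();

                if (cached_flags & _TIF_NEED_RESCHED)
                        schedule();

                /* ... the other existing work items go here ... */

                /*
                 * Unconditionally give task isolation a chance to wait
                 * for timers etc. to quiesce; per your description it
                 * may fiddle with interrupts internally, which is why
                 * we loop back and re-read the flags.
                 */
                task_isolation_enter();

                local_irq_disable();
        }

That keeps the property we actually want (isolated by the time we're
finished) in the loop condition itself, rather than trying to guarantee
one trip through the loop.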

>> Does anything here guarantee forward progress or at least give
>> reasonable confidence that we'll make forward progress?
>
>
> A given task can get stuck in the kernel if it has a lengthy far-future
> alarm() type situation, or if there are multiple task-isolated tasks
> scheduled onto the same core, but that only affects those tasks;
> other tasks on the same core, and the system as a whole, are OK.

Why are we treating alarms as something that should defer entry to
userspace? I think it would be entirely reasonable to set an alarm
for ten minutes, ask for isolation, and then think hard for ten
minutes.

A bigger issue would be if there's an RT task that asks for isolation
and a bunch of other stuff (most notably KVM hosts) running with
unconstrained affinity at full load. If task_isolation_enter always
sleeps, then your KVM host will get scheduled, and it'll ask for a
user return notifier on the way out, and you might just loop forever.
Can this happen?

ISTM something's suboptimal with the inner workings of all this if
task_isolation_enter needs to sleep to wait for an event that isn't
scheduled for the immediate future (e.g. already queued up as an
interrupt).

--Andy