Re: [PATCH v10 09/12] arch/x86: enable task isolation functionality

From: Andy Lutomirski
Date: Thu Mar 03 2016 - 18:46:59 EST


On Thu, Mar 3, 2016 at 11:52 AM, Chris Metcalf <cmetcalf@xxxxxxxxxxxx> wrote:
> On 03/02/2016 07:36 PM, Andy Lutomirski wrote:
>>
>> On Mar 2, 2016 12:10 PM, "Chris Metcalf" <cmetcalf@xxxxxxxxxx> wrote:
>>>
>>> In prepare_exit_to_usermode(), call task_isolation_ready()
>>> when we are checking the thread-info flags, and after we've handled
>>> the other work, call task_isolation_enter() unconditionally.
>>>
>>> In syscall_trace_enter_phase1(), we add the necessary support for
>>> strict-mode detection of syscalls.
>>>
>>> We add strict reporting for the kernel exception types that do
>>> not result in signals, namely non-signalling page faults and
>>> non-signalling MPX fixups.
>>>
>>> Signed-off-by: Chris Metcalf <cmetcalf@xxxxxxxxxx>
>>> ---
>>> arch/x86/entry/common.c | 18 ++++++++++++++++--
>>> arch/x86/kernel/traps.c | 2 ++
>>> arch/x86/mm/fault.c | 2 ++
>>> 3 files changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
>>> index 03663740c866..27c71165416b 100644
>>> --- a/arch/x86/entry/common.c
>>> +++ b/arch/x86/entry/common.c
>>> @@ -21,6 +21,7 @@
>>> #include <linux/context_tracking.h>
>>> #include <linux/user-return-notifier.h>
>>> #include <linux/uprobes.h>
>>> +#include <linux/isolation.h>
>>>
>>> #include <asm/desc.h>
>>> #include <asm/traps.h>
>>> @@ -91,6 +92,10 @@ unsigned long syscall_trace_enter_phase1(struct
>>> pt_regs *regs, u32 arch)
>>> */
>>> if (work & _TIF_NOHZ) {
>>> enter_from_user_mode();
>>> + if (task_isolation_check_syscall(regs->orig_ax)) {
>>> + regs->orig_ax = -1;
>>> + return 0;
>>> + }
>>
>> This needs a comment indicating the intended semantics.
>> And I've still heard no explanation of why this part can't use seccomp.
>
>
> Here's an excerpt from my earlier reply to you from:
>
> https://lkml.kernel.org/r/55AE9EAC.4010202@xxxxxxxxxx
>
> Admittedly this patch series has been moving very slowly through
> review, so it's not surprising we have to revisit some things!
>
> On 07/21/2015 03:34 PM, Chris Metcalf wrote:
>>
>> On 07/13/2015 05:47 PM, Andy Lutomirski wrote:
>>>
>>> If a user wants a syscall to kill them, use
>>> seccomp. The kernel isn't at fault if the user does a syscall when it
>>> didn't want to enter the kernel.
>>
>>
>> Interesting! I didn't realize how close SECCOMP_SET_MODE_STRICT
>> was to what I wanted here. One concern is that there doesn't seem
>> to be a way to "escape" from seccomp strict mode, i.e. you can't
>> call seccomp() again to turn it off - which makes sense for seccomp
>> since it's a security issue, but not so much sense with cpu_isolated.
>>
>> So, do you think there's a good role for the seccomp() API to play
>> in achieving this goal? It's certainly not a question of "the kernel at
>> fault" but rather "asking the kernel to help catch user mistakes"
>> (typically third-party libraries in our customers' experience). You
>> could imagine a SECCOMP_SET_MODE_ISOLATED or something.
>>
>> Alternatively, we could stick with the API proposed in my patch
>> series, or something similar, and just try to piggy-back on the seccomp
>> internals to make it happen. It would require Kconfig to ensure
>> that SECCOMP was enabled though, which obviously isn't currently
>> required to do cpu isolation.
>
>
> On looking at this again just now, one thing that strikes me is that
> it may not be necessary to forbid the syscall like seccomp does.
> It may be sufficient just to trigger the task isolation strict signal
> and then allow the syscall to complete. After all, we don't "fail"
> any of the other things that upset strict mode, like page faults; we
> let them complete, but add a signal. So for consistency, I think it
> may in fact make sense to simply trigger the signal but let the
> syscall do its thing. After all, perhaps the signal is handled
> and logged and we don't mind having the application continue; the
> signal handler can certainly choose to fail hard, or in the usual
> case of no signal handler, that kills the task just fine too.
> Allowing the syscall to complete is really kind of incidental.

No, don't do that. First, if you have a signal pending, a lot of
syscalls will abort with -EINTR. Second, if you fire a signal on
entry via sigreturn, you're not going to like the results.


>
> After the changes proposed lower down in this email, this call
> site will end up looking like:
>
> /* Generate a task isolation strict signal if necessary. */
> if ((work & _TIF_TASK_ISOLATION) && task_isolation_strict())
> task_isolation_syscall(regs->orig_ax);
>
> But do let me know if you think more specifically that there is
> a way seccomp can be helpful.

Let task isolation users who want to detect when they screw up and do
a syscall do it with seccomp.

>
>
>>> work &= ~_TIF_NOHZ;
>>> }
>>> #endif
>>> @@ -254,17 +259,26 @@ static void exit_to_usermode_loop(struct pt_regs
>>> *regs, u32 cached_flags)
>>> if (cached_flags & _TIF_USER_RETURN_NOTIFY)
>>> fire_user_return_notifiers();
>>>
>>> + task_isolation_enter();
>>> +
>>> /* Disable IRQs and retry */
>>> local_irq_disable();
>>>
>>> cached_flags =
>>> READ_ONCE(pt_regs_to_thread_info(regs)->flags);
>>>
>>> - if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>>> + if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
>>> + task_isolation_ready())
>>> break;
>>>
>>> }
>>> }
>>>
>>> +#ifdef CONFIG_TASK_ISOLATION
>>> +# define EXIT_TO_USERMODE_FLAGS (EXIT_TO_USERMODE_LOOP_FLAGS |
>>> _TIF_NOHZ)
>>> +#else
>>> +# define EXIT_TO_USERMODE_FLAGS EXIT_TO_USERMODE_LOOP_FLAGS
>>> +#endif
>>> +
>>
>> TIF_NOHZ is already a hack and IMO this just makes it worse. At the
>> very least this should have a comment. It really ought to be either a
>> static_branch or a flag that's actually switched per cpu.
>> But this is also a confusing change. Don't override the definition
>> here -- stick it in the header where it belongs.
>
>
> The EXIT_TO_USERMODE_xxx stuff is only defined in common.c, not in a header.
>
> The more I look at this, though, the more I think it would be cleanest
> to add a new flag _TIF_TASK_ISOLATION that is set just for tasks that have
> enabled task isolation. That can be used unconditionally to check to see
> if we need to call exit_to_usermode_loop() from prepare_exit_to_usermode(),
> and also means that non-task-isolation tasks don't need to go into the
> exit loop unconditionally, which is obviously a performance drag for them.

Sounds better to me.

--Andy