Re: [PATCH v8 -tip 13/26] kernel/entry: Add support for core-wide protection of kernel-mode

From: Joel Fernandes
Date: Mon Nov 02 2020 - 20:20:12 EST


Hi Alexandre,

Sorry for the late reply as I was working on the snapshotting patch...

On Fri, Oct 30, 2020 at 11:29:26AM +0100, Alexandre Chartre wrote:
>
> On 10/20/20 3:43 AM, Joel Fernandes (Google) wrote:
> > Core-scheduling prevents hyperthreads in usermode from attacking each
> > other, but it does not do anything about one of the hyperthreads
> > entering the kernel for any reason. This leaves the door open for MDS
> > and L1TF attacks with concurrent execution sequences between
> > hyperthreads.
> >
> > This patch therefore adds support for protecting all syscall and IRQ
> > kernel mode entries. Care is taken to track the outermost usermode exit
> > and entry using per-cpu counters. In cases where one of the hyperthreads
> > enters the kernel, no additional IPIs are sent. Further, IPIs are avoided
> > when not needed - example: idle and non-cookie HTs do not need to be
> > forced into kernel mode.
>
> Hi Joel,
>
> In order to protect syscall/IRQ kernel mode entries, shouldn't we have a
> call to sched_core_unsafe_enter() in the syscall/IRQ entry code? I don't
> see such a call. Am I missing something?

Yes, this is a known bug and it is fixed in v9, which I'll post soon. Meanwhile,
the updated patch is appended below:
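
For reference, a minimal sketch of the missing piece (hypothetical helper name,
not the actual v9 hunk; the appended patch has the real changes). The idea is to
mark the core unsafe from the common syscall/IRQ entry paths in
kernel/entry/common.c, pairing with the unsafe-exit done in
exit_to_user_get_work() on the return path:

        /* Sketch only: called on entry from user mode, for both syscalls and
         * IRQs. Nesting (tracking the outermost entry/exit) is handled inside
         * sched_core_unsafe_enter() via the per-cpu counters.
         */
        static __always_inline void sched_core_entry_protect(void)
        {
                if (IS_ENABLED(CONFIG_SCHED_CORE) && sched_core_kernel_protected())
                        sched_core_unsafe_enter();
        }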

> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index 3236427e2215..48567110f709 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -4678,6 +4678,13 @@
> > sbni= [NET] Granch SBNI12 leased line adapter
> > + sched_core_protect_kernel=
> > + [SCHED_CORE] Pause SMT siblings of a core running in
> > + user mode, if at least one of the siblings of the core
> > + is running in kernel mode. This is to guarantee that
> > + kernel data is not leaked to tasks which are not trusted
> > + by the kernel.
> > +
> > sched_debug [KNL] Enables verbose scheduler debug messages.
> > schedstats= [KNL,X86] Enable or disable scheduled statistics.
> > diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> > index 474f29638d2c..260216de357b 100644
> > --- a/include/linux/entry-common.h
> > +++ b/include/linux/entry-common.h
> > @@ -69,7 +69,7 @@
> > #define EXIT_TO_USER_MODE_WORK \
> > (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> > - _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | \
> > + _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_UNSAFE_RET | \
> > ARCH_EXIT_TO_USER_MODE_WORK)
> > /**
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d38e904dd603..fe6f225bfbf9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2071,4 +2071,16 @@ int sched_trace_rq_nr_running(struct rq *rq);
> > const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
> > +#ifdef CONFIG_SCHED_CORE
> > +void sched_core_unsafe_enter(void);
> > +void sched_core_unsafe_exit(void);
> > +bool sched_core_wait_till_safe(unsigned long ti_check);
> > +bool sched_core_kernel_protected(void);
> > +#else
> > +#define sched_core_unsafe_enter() do { } while (0)
> > +#define sched_core_unsafe_exit() do { } while (0)
> > +#define sched_core_wait_till_safe(ti_check) false
> > +#define sched_core_kernel_protected() false
> > +#endif
> > +
> > #endif
> > diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> > index 0a1e20f8d4e8..c8dc6b1b1f40 100644
> > --- a/kernel/entry/common.c
> > +++ b/kernel/entry/common.c
> > @@ -137,6 +137,26 @@ static __always_inline void exit_to_user_mode(void)
> > /* Workaround to allow gradual conversion of architecture code */
> > void __weak arch_do_signal(struct pt_regs *regs) { }
> > +unsigned long exit_to_user_get_work(void)
> > +{
> > + unsigned long ti_work = READ_ONCE(current_thread_info()->flags);
> > +
> > + if (IS_ENABLED(CONFIG_SCHED_CORE) && !sched_core_kernel_protected())
> > + return ti_work;
> > +
> > +#ifdef CONFIG_SCHED_CORE
> > + ti_work &= EXIT_TO_USER_MODE_WORK;
> > + if ((ti_work & _TIF_UNSAFE_RET) == ti_work) {
> > + sched_core_unsafe_exit();
> > + if (sched_core_wait_till_safe(EXIT_TO_USER_MODE_WORK)) {
>
> If we call sched_core_unsafe_exit() before sched_core_wait_till_safe() then we
> expose ourselves during the entire wait period in sched_core_wait_till_safe(). It
> would be better to call sched_core_unsafe_exit() once we know for sure we are
> going to exit.

The way the algorithm works right now, it requires the current task to get
out of the unsafe state while waiting, otherwise it will lock up. Note that we
wait with interrupts enabled, so new interrupts could come in while waiting.
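
To illustrate why the ordering matters, here is a toy model (hypothetical names,
plain userspace C, not the actual kernel/sched/core.c implementation): each
sibling keeps an "unsafe" nesting count, and the core only counts as safe once
every sibling's count is zero. If the returning HT waited before dropping its
own count, two siblings returning to user mode at the same time would each see
a non-zero count and spin on each other forever:

        #include <stdatomic.h>
        #include <stdbool.h>

        #define NR_SIBLINGS 2
        static atomic_int unsafe_nest[NR_SIBLINGS];     /* per-HT nesting counts */

        static bool core_safe(void)
        {
                /* Safe only when no sibling of this core is in kernel mode. */
                for (int i = 0; i < NR_SIBLINGS; i++)
                        if (atomic_load(&unsafe_nest[i]))
                                return false;
                return true;
        }

        /* Return-to-user path for sibling @me: drop our own count first
         * (sched_core_unsafe_exit()), then spin until the whole core is safe
         * (sched_core_wait_till_safe()). Waiting before the decrement would
         * deadlock when both siblings return at the same time.
         */
        static void return_to_user(int me)
        {
                atomic_fetch_sub(&unsafe_nest[me], 1);
                while (!core_safe())
                        ;       /* irqs stay enabled here in the real code */
        }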

TBH this code is very tricky to get right and it took a long time to get it
working properly. For now I am content with the way it works. We can improve
it further incrementally in the future.

If there are no other comments, let me know if I may add your Reviewed-by tag
for this patch; I'd appreciate it. The updated patch is appended below.

thanks,

- Joel

---8<-----------------------