Re: [PATCH] sched/x86: Save [ER]FLAGS on context switch

From: Will Deacon
Date: Tue Feb 19 2019 - 07:48:18 EST


On Tue, Feb 19, 2019 at 10:04:09AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote:
> > On 2/16/19 2:30 AM, Peter Zijlstra wrote:
> > > On Fri, Feb 15, 2019 at 08:06:56PM -0800, hpa@xxxxxxxxx wrote:
> > >> This implies we invoke schedule -- a restricted operation (consider
> > >> may_sleep) during execution of STAC-enabled code, but *not* as an
> > >> exception or interrupt, since those preserve the flags.
> > >
> > > Meet preempt_enable().
> >
> > I believe this falls under "doctor, it hurts when I do that." And it hurts for
> > very good reasons. See below.
>
> I disagree; the basic rule is that if you're preemptible you must also
> be able to schedule and vice-versa. These AC regions violate this.
>
> And, like illustrated, it is _very_ easy to get all sorts inside that AC
> crap.
>
> > >> I have serious concerns about this. This is more or less saying that
> > >> we have left an unlimited gap and have had AC escape.
> > >
> > > Yes; by allowing normal C in between STAC and CLAC this is so.
> > >
> > >> Does *anyone* see a need to allow this? I got a question at LPC from
> > >> someone about this, and what they were trying to do once all the
> > >> layers had been unwound was so far down the wrong track for a root
> > >> problem that actually has a very simple solution.
> > >
> > > Have you read the rest of the thread?
> > >
> > > All it takes for this to explode is a call to a C function that has
> > > tracing on it in between the user_access_begin() and user_access_end()
> > > things. That is a _very_ easy thing to do.
> > >
> > > Heck, GCC is allowed to generate that broken code compiling
> > > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user (esp.
> > > with CONFIG_OPTIMIZE_INLINING), and making that a function call would
> > > get us fentry hooks and ftrace and *BOOM*.
> > >
> > > (Now, of course, its a static function with a single caller, and GCC
> > > isn't _THAT_ stupid, but it could, if it wanted to)
> > >
> > > Since the bar is _that_ low for mistakes, I figure we'd better fix it.
> > >
> >
> > The question is what "fix it" means.
>
> It means that when we do schedule, the next task doesn't have AC set,
> and when we schedule back, we'll restore our AC when we had it. Unlike
> the current situation, where the next task will run with AC set.
>
> IOW, it confines AC to the task context where it was set.
>
> > I'm really concerned about AC escapes,
> > and everyone else should be, too.
>
> Then _that_ should be asserted.
>
> > For an issue specific to tracing, it would be more appropriate to do the
> > appropriate things in the tracing entry/exit than in schedule. Otherwise, we
> > don't want to silently paper over mistakes which could mean that we run a
> > large portion of the kernel without protection we thought we had.
> >
> > In that sense, calling schedule() with AC set is in fact a great place to have
> > a WARN() or even BUG(), because such an event really could imply that the
> > kernel has been security compromised.
>
> It is not specific to tracing, tracing is just one of the most trivial
> and non-obvious ways to make it go splat.
>
> There's lot of fairly innocent stuff that does preempt_disable() /
> preempt_enable(). And just a warning in schedule() isn't sufficient,
> you'd have to actually trigger a reschedule before you know your code is
> bad.
>
> And like I said; the invariant is: if you're preemptible you can
> schedule and v.v.
>
> Now, clearly you don't want to mark these whole regions as !preemptible,
> because then you can also not take faults, but somehow you're not
> worried about the whole fault handler, but you are worried about the
> scheduler ?!? How does that work? The fault handler can touch a _ton_
> more code. Heck, the fault handler can schedule.
>
> So either pre-fault, and run the whole AC crap with preemption disabled
> and retry, or accept that we have to schedule.

I think you'll still hate this, but could we not disable preemption during
the uaccess-enabled region, re-enabling it on the fault path after we've
toggled uaccess off and disable it again when we return back to the
uaccess-enabled region? Doesn't help with tracing, but it should at least
handle the common case.

Will