Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

From: Andy Lutomirski
Date: Wed Nov 12 2014 - 10:48:43 EST


On Nov 12, 2014 2:30 AM, "Borislav Petkov" <bp@xxxxxxxxx> wrote:
>
> On Tue, Nov 11, 2014 at 06:06:48PM -0800, Tony Luck wrote:
> > > Innocent bystanders have RIPV=1, EIPV=0 in MCG_STATUS ... so they
> > > are quite easy to spot.
> >
> > Bother ... except for the SRAO cases where *everyone* is an innocent
> > bystander - but someone should go look for the error and queue up
> > a page offline event. Perhaps for this we'd do the self-ipi trick and
> > then scan the banks to look for SRAO signatures to process as well
> > as clearing MCIP
>
> First of all, good morning :-)
>
> So I'm reading all those notes again and I can't say I'm convinced about
> the change. And all those "but but, what might happen if" cases are what
> bother me.
>
> First, the stack: switching to a possibly almost full kernel stack
> where we have other stuff already allocated is a bad idea IMO. There's
> a good reason to have a known-good, solely-for-our-purpose stack which
> we get to use. It is always there and it gives us the optimal conditions
> possible for recovery. Switching back to the possibly crowded kernel
> stack and causing stack overflows while handling an MCE is not an
> improvement in my book.

I only switch stacks on entry from userspace, and the kernel stack is
completely empty if that happens.

>
> Tony:
> > The current code has an ugly hole at the moment. End of
> > do_machine_check() clears MCG_STATUS. At that point we are still
> > running on the magic stack for machine check exceptions ... if we take
> > a machine check in the small window from there until we get off this
> > stack (iret) ... then we will enter the handler back on the same stack
> > that we haven't finished using yet. Bad things ensue.
>
> I'm not crazy about it either but how often does this actually happen in
> practice? And if it does happen, then, well, we die, so be it. It can
> happen with other MCEs coming in back-to-back like poisoned data (1st
> MCE) being consumed on another core (2nd MCE). I know, Intel does the
> broadcasting but that is going to change, I hear and AMD raises an MCE
> on the core which detects it only.
>
> Another big issue I see with this is verifying the locking works and
> generally testing this serious change. MCA is core kernel infrastructure
> and we cannot break this left and right. And changing stacks we're
> running on and the whole dynamics of the code might open a whole other
> can of worms.

One nice thing for testing is that my patch applies to int3 from
userspace as well, and that's easy to test.

>
> Oh, and we shouldn't forget why we're doing this: just to save two
> members to task_struct and getting rid of paranoid_userspace. Yep, still
> not convinced the effort is worth it.

I think I want to make this change anyway, though, since it may
simplify fsgsbase support enough to justify it solely on that account.
I don't think that the machine check code needs to change at all to
accommodate a stack switch, but I think it makes some simplifications
possible.

>
> Now, I think Andy had a much better/simpler/cleaner suggestion
> yesterday: task_work_add(). It is basically what _TIF_MCE_NOTIFY does
> and it is called in the same path - do_notify_resume() - but generic and
> we can drop _TIF_MCE_NOTIFY then and use the generic thing.
>
> Much less intrusive change.

Less intrusive is certainly true.

--Andy

>
> --
> Regards/Gruss,
> Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/