Re: [RFC PATCH v3 00/16] Core scheduling v3

From: Peter Zijlstra
Date: Thu Aug 29 2019 - 10:38:59 EST


On Thu, Aug 29, 2019 at 10:30:51AM -0400, Phil Auld wrote:
> On Wed, Aug 28, 2019 at 06:01:14PM +0200 Peter Zijlstra wrote:
> > On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> > > On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> >
> > > > And given MDS, I'm still not entirely convinced it all makes sense. If
> > > > it were just L1TF, then yes, but now...
> > >
> > > I was thinking MDS is really the reason for this. L1TF has mitigations but
> > > the only current mitigation for MDS for smt is ... nosmt.
> >
> > L1TF has no known mitigation that is SMT safe. The moment you have
> > something in your L1, the other sibling can read it using L1TF.
> >
> > The nice thing about L1TF is that only (malicious) guests can exploit
> > it, and therefore the synchronizatin context is VMM. And it so happens
> > that VMEXITs are 'rare' (and already expensive and thus lots of effort
> > has already gone into avoiding them).
> >
> > If you don't use VMs, you're good and SMT is not a problem.
> >
> > If you do use VMs (and do/can not trust them), _then_ you need
> > core-scheduling; and in that case, the implementation under discussion
> > misses things like synchronization on VMEXITs due to interrupts and
> > things like that.
> >
> > But under the assumption that VMs don't generate high scheduling rates,
> > it can work.
> >
> > > The current core scheduler implementation, I believe, still has (theoretical?)
> > > holes involving interrupts, once/if those are closed it may be even less
> > > attractive.
> >
> > No; so MDS leaks anything the other sibling (currently) does, this makes
> > _any_ privilidge boundary a synchronization context.
> >
> > Worse still, the exploit doesn't require a VM at all, any other task can
> > get to it.
> >
> > That means you get to sync the siblings on lovely things like system
> > call entry and exit, along with VMM and anything else that one would
> > consider a privilidge boundary. Now, system calls are not rare, they
> > are really quite common in fact. Trying to sync up siblings at the rate
> > of system calls is utter madness.
> >
> > So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> > it _might_ work because a 'pure' host doesn't schedule that often
> > (maybe, same assumption as for L1TF).
> >
> > Now, there have been proposals of moving the privilidge boundary further
> > into the kernel. Just like PTI exposes the entry stack and code to
> > Meltdown, the thinking is, lets expose more. By moving the priv boundary
> > the hope is that we can do lots of common system calls without having to
> > sync up -- lots of details are 'pending'.
>
>
> Thanks for clarifying. My understanding is (somewhat) less fuzzy now. :)
>
> I think, though, that you were basically agreeing with me that the current
> core scheduler does not close the holes, or am I reading that wrong.

Agreed; the missing bits for L1TF are ugly but doable (I've actually
done them before, Tim has that _somewhere_), but I've not seen a
'workable' solution for MDS yet.