Re: [RFC PATCH v3 00/16] Core scheduling v3

From: Peter Zijlstra
Date: Wed Aug 28 2019 - 12:01:50 EST


On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:

> > And given MDS, I'm still not entirely convinced it all makes sense. If
> > it were just L1TF, then yes, but now...
>
> I was thinking MDS is really the reason for this. L1TF has mitigations but
> the only current mitigation for MDS for smt is ... nosmt.

L1TF has no known mitigation that is SMT safe. The moment you have
something in your L1, the other sibling can read it using L1TF.

The nice thing about L1TF is that only (malicious) guests can exploit
it, and therefore the synchronizatin context is VMM. And it so happens
that VMEXITs are 'rare' (and already expensive and thus lots of effort
has already gone into avoiding them).

If you don't use VMs, you're good and SMT is not a problem.

If you do use VMs (and do/can not trust them), _then_ you need
core-scheduling; and in that case, the implementation under discussion
misses things like synchronization on VMEXITs due to interrupts and
things like that.

But under the assumption that VMs don't generate high scheduling rates,
it can work.

> The current core scheduler implementation, I believe, still has (theoretical?)
> holes involving interrupts, once/if those are closed it may be even less
> attractive.

No; so MDS leaks anything the other sibling (currently) does, this makes
_any_ privilidge boundary a synchronization context.

Worse still, the exploit doesn't require a VM at all, any other task can
get to it.

That means you get to sync the siblings on lovely things like system
call entry and exit, along with VMM and anything else that one would
consider a privilidge boundary. Now, system calls are not rare, they
are really quite common in fact. Trying to sync up siblings at the rate
of system calls is utter madness.

So under MDS, SMT is completely hosed. If you use VMs exclusively, then
it _might_ work because a 'pure' host doesn't schedule that often
(maybe, same assumption as for L1TF).

Now, there have been proposals of moving the privilidge boundary further
into the kernel. Just like PTI exposes the entry stack and code to
Meltdown, the thinking is, lets expose more. By moving the priv boundary
the hope is that we can do lots of common system calls without having to
sync up -- lots of details are 'pending'.