Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP

From: Andy Lutomirski
Date: Wed Dec 12 2018 - 13:50:47 EST

On Wed, Dec 12, 2018 at 10:45 AM Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Wed, Dec 12, 2018 at 10:07:03AM -0800, Andy Lutomirski wrote:
> > You're proving my point, I think. CPUID, IRET, MOV to CR, etc. are
> > "serializing". LFENCE, on many CPUs and depending on MSRs, is a
> > different kind of serializing. MFENCE is something else. All LOCK
> > instructions are some kind of barrier, but I don't think anyone calls
> > them "serializing".
> Yeah, peterz and I hashed it out a bit today on IRC about the different
> meanings of serializing. I see your point now.
> > The uaccess users of barrier_nospec() are presumably looking for a
> > speculation barrier in the sense of "CPU, please don't execute the
> > code after this until you're sure that this code should be executed
> > for real and until all inputs are known, not guessed."
> Yeah, I believe AMD's paper has this nicely written:
> "Description: Set an MSR in the processor so that LFENCE is a dispatch
> serializing instruction and then use LFENCE in code streams to
> serialize dispatch (LFENCE is faster than RDTSCP which is also dispatch
> serializing). This mode of LFENCE may be enabled by setting MSR
> C001_1029[1]=1.
> Effect: Upon encountering an LFENCE when the MSR bit is set, dispatch
> will stop until the LFENCE instruction becomes the oldest instruction in
> the machine."
> which is basically what you want for the whole mitigation crap if you
> want to kill speculation - you simply hold dispatch until the LFENCE
> retires.
> > The property I want for RDTSC ordering is much weaker: I want it to be
> > ordered like a load. Imagine that, instead of an on-chip TSC, the TSC
> > is literally a location in main memory that gets incremented by an
> > extra dedicated CPU every nanosecond or so. I want users of RDTSC to
> > work as if they were reading such a location in memory using an
> > ordinary load. I believe this gives the real desired property that it
> > should be impossible to observe the TSC going backwards. This is a
> > much weaker form of serialization.
> Well, in that case you need something new.
> Because, the moment you have a RDTSC in flight and a second RDTSC comes
> in and that second RDTSC must *not* bypass the first one and execute
> earlier due to OoO, you need to impose some ordering. And that's pretty
> much uarch-dependent, I'd say.
> And I guess on AMD the way to do that is to stop dispatch until the
> first RDTSC retires.
> Can it be done faster? Sure. And I'm pretty sure there's a lot of pesky
> little hw details we're not even hearing of, which get in the way.

As far as I know, RDTSCP gets the job done, as does LFENCE;RDTSC on
Intel. There was a big discussion a few years ago where we changed it
from LFENCE;RDTSC;LFENCE to just LFENCE;RDTSC after everyone was
reasonably convinced that the uarch would not dispatch two RDTSCs
backwards if the first one was immediately preceded by LFENCE.
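
For concreteness, here is a rough userspace sketch of the two sequences
I mean. This is not the kernel's rdtsc_ordered(); the helper names are
made up for illustration, it assumes GCC or Clang inline asm on x86,
and the RDTSCP variant assumes the CPU advertises RDTSCP. The loop at
the end only counts how often a later read comes out smaller than an
earlier one on whatever CPU the thread happens to run on, so it is a
sanity check, not a proof of ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: LFENCE;RDTSC. The LFENCE keeps RDTSC from
 * executing before earlier instructions have completed locally. */
static inline uint64_t tsc_read_lfence(void)
{
    uint32_t lo, hi;

    __asm__ __volatile__("lfence\n\trdtsc"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: RDTSCP. It also writes IA32_TSC_AUX to ECX,
 * which is declared as an output here and then ignored. */
static inline uint64_t tsc_read_rdtscp(void)
{
    uint32_t lo, hi, aux;

    __asm__ __volatile__("rdtscp"
                         : "=a"(lo), "=d"(hi), "=c"(aux)
                         :
                         : "memory");
    (void)aux;
    return ((uint64_t)hi << 32) | lo;
}

/* Read the TSC iters times in a row with the given reader and count
 * how many times a later read is smaller than the previous one. */
static int tsc_count_backwards(uint64_t (*read_tsc)(void), int iters)
{
    int backwards = 0;
    uint64_t prev = read_tsc();

    for (int i = 0; i < iters; i++) {
        uint64_t now = read_tsc();

        if (now < prev)
            backwards++;
        prev = now;
    }
    return backwards;
}
```

Something like assert(tsc_count_backwards(tsc_read_lfence, 1000) == 0)
is the property we care about. If the thread migrates between CPUs
whose TSCs are not synchronized, that check can fail for reasons that
have nothing to do with instruction ordering.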