Re: [RFC PATCH 4/4] x86/TSC: Use RDTSCP
From: Borislav Petkov
Date: Wed Dec 12 2018 - 13:45:11 EST
On Wed, Dec 12, 2018 at 10:07:03AM -0800, Andy Lutomirski wrote:
> You're proving my point, I think. CPUID, IRET, MOV to CR, etc. are
> "serializing". LFENCE, on many CPUs and depending on MSRs, is a
> different kind of serializing. MFENCE is something else. All LOCK
> instructions are some kind of barrier, but I don't think anyone calls
> them "serializing".
Yeah, peterz and I hashed it out a bit today on IRC about the different
meanings of serializing. I see your point now.
> The uaccess users of barrier_nospec() are presumably looking for a
> speculation barrier in the sense of "CPU, please don't execute the
> code after this until you're sure that this code should be executed
> for real and until all inputs are known, not guessed."
Yeah, I believe AMD's paper has this nicely written:
"MITIGATION G-2
Description: Set an MSR in the processor so that LFENCE is a dispatch
serializing instruction and then use LFENCE in code streams to
serialize dispatch (LFENCE is faster than RDTSCP which is also dispatch
serializing). This mode of LFENCE may be enabled by setting MSR
C001_1029[1]=1.
Effect: Upon encountering an LFENCE when the MSR bit is set, dispatch
will stop until the LFENCE instruction becomes the oldest instruction in
the machine."
https://developer.amd.com/wp-content/resources/90343-B_SoftwareTechniquesforManagingSpeculation_WP_7-18Update_FNL.pdf
which is basically what you want for the whole mitigation crap if you
want to kill speculation - you simply hold dispatch until the LFENCE
retires.
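For reference, the MSR bit named in the AMD text can be written down as a small C sketch. Only the MSR number (C001_1029) and the bit position (1) come from the quoted paper; the macro and function names are illustrative assumptions. Actually setting the bit needs a privileged WRMSR, which this sketch does not attempt:

```c
#include <stdint.h>

/* MSR number and bit position as given in the quoted AMD text.
 * The names are illustrative, not taken from the paper. */
#define DECFG_MSR                  0xc0011029u
#define DECFG_LFENCE_SERIALIZE_BIT 1

/* Return the MSR value with the "LFENCE is dispatch serializing" bit set. */
static inline uint64_t decfg_enable_lfence_serialize(uint64_t decfg)
{
    return decfg | (UINT64_C(1) << DECFG_LFENCE_SERIALIZE_BIT);
}

/* Return 1 if the given MSR value has the bit set, 0 otherwise. */
static inline int decfg_lfence_is_serializing(uint64_t decfg)
{
    return (int)((decfg >> DECFG_LFENCE_SERIALIZE_BIT) & 1);
}
```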
> The property I want for RDTSC ordering is much weaker: I want it to be
> ordered like a load. Imagine that, instead of an on-chip TSC, the TSC
> is literally a location in main memory that gets incremented by an
> extra dedicated CPU every nanosecond or so. I want users of RDTSC to
> work as if they were reading such a location in memory using an
> ordinary load. I believe this gives the real desired property that it
> should be impossible to observe the TSC going backwards. This is a
> much weaker form of serialization.
Well, in that case you need something new.
Because, the moment you have a RDTSC in flight and a second RDTSC comes
in and that second RDTSC must *not* bypass the first one and execute
earlier due to OoO, you need to impose some ordering. And that's pretty
much uarch-dependent, I'd say.
And I guess on AMD the way to do that is to stop dispatch until the
first RDTSC retires.
Can it be done faster? Sure. And I'm pretty sure there's a lot of pesky
little hw details we're not even hearing of, which get in the way.
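As a rough user-space illustration of the two options discussed in this thread (a fence in front of RDTSC versus RDTSCP), here is a minimal sketch. It assumes GCC or Clang on x86 and the intrinsics from <x86intrin.h>; the helper names are hypothetical:

```c
#include <stdint.h>
#include <x86intrin.h>  /* GCC/Clang: __rdtsc(), __rdtscp(), _mm_lfence() */

/* Hypothetical helper: put an LFENCE in front of RDTSC so the counter
 * read is not executed ahead of earlier instructions. On AMD this
 * relies on LFENCE being dispatch serializing (MSR C001_1029[1]=1). */
static inline uint64_t tsc_read_lfence(void)
{
    _mm_lfence();
    return __rdtsc();
}

/* Hypothetical helper: RDTSCP waits for earlier instructions before
 * reading the counter, so no separate fence is placed in front of it. */
static inline uint64_t tsc_read_rdtscp(void)
{
    unsigned int aux;  /* receives IA32_TSC_AUX, unused here */
    return __rdtscp(&aux);
}
```

Two back-to-back calls to tsc_read_lfence() on the same CPU should then never return a decreasing pair, which is the property asked for above.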
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.