Re: [RFC][PATCH 2/2] x86: add extra serialization for non-serializing MSRs
From: Andrew Cooper
Date: Fri Feb 05 2021 - 07:16:33 EST
On 05/02/2021 10:02, Peter Zijlstra wrote:
> On Thu, Feb 04, 2021 at 04:11:12PM -0800, Andy Lutomirski wrote:
>> I'm wondering if a more mild violation is possible:
>>
>> Initialize *addr = 0.
>>
>> mov $1, (addr)
>> wrmsr
>>
>> remote cpu's IDT vector:
>>
>> mov (addr), %rax
>> %rax == 0!
>>
>> There's no speculative-execution-becoming-visible-even-if-it-doesn't-retire
>> here -- there's just an ordering violation. For Linux, this would
>> presumably only manifest as a potential deadlock or confusion if the
>> IPI vector code looks at the list of pending work and doesn't find the
>> expected work in it.
>>
>> Dave? hpa? What is the SDM trying to tell us?
> [ Big caveat, I've not spoken to any hardware people about this. The
> below is purely my own understanding. ]
>
> This is my interpretation as well. Without the MFENCE+LFENCE there is no
> guarantee the store is out of the store-buffer and the remote load isn't
> guaranteed to observe it.
>
> What I think the SDM is trying to tell us, is that the IPI, even if it
> goes on the same regular coherency fabric as memory transfers, is not
> subject to the regular memory ordering rules.
>
> Normal TSO rules tell us that when:
>
> P1() {
> x = 1;
> y = 1;
> }
>
> P2() {
> r1 = y;
> r2 = x;
> }
>
> r2 must not be 0 when r1 is 1. Because if we see store to y, we must
> also see store to x. But the IPI thing doesn't behave like a store. The
> (fast) wrmsr isn't even considered a memop.
>
> The thing is, the above ordering does not guarantee we have r2 != 0.
> r2==0 is allowed when r1==0. And that's an entirely sane outcome even if
> we run the instructions like:
>
>          CPU1                    CPU2
>
> cycle-1  mov $1, ([x])
> cycle-2  mov $1, ([y])
> cycle-3                          mov ([y]), rax
> cycle-4                          mov ([x]), rbx
>
> There is no guarantee _any_ of the stores will have made it out. And
> that's exactly the issue. The IPI might make it out of the core before
> any of the stores will.
>
> Furthermore, since there is no dependency between:
>
> mov $1, ([x])
> wrmsr
>
> The CPU is allowed to reorder the execution and retire the wrmsr before
> the store. Very much like it would for normal non-dependent
> instructions.
Execution, sure (for details which don't escape the core, just like any
other speculative execution). Retirement, surely not - it is inherently
tied to the program order of things.
Causality would also be broken if the WRMSR retired ahead of the MOV,
and an interrupt were to hit the boundary between them.
> And presumably it is still allowed to do that when we write it like:
>
> mov $1, ([x])
> mfence
> wrmsr
>
> because, mfence only has dependencies to memops and (fast) wrmsr is not
> a memop.
>
> Which then brings us to:
>
> mov $1, ([x])
> mfence
> lfence
> wrmsr
>
> In this case, the lfence acts like the newly minted ifence (see
> spectre), and will block execution of (any) later instructions until
> completion of all prior instructions. This, and only this ensures the
> wrmsr happens after the mfence, which in turn ensures the store to x is
> globally visible.
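For reference, here is a minimal sketch of where the fences in that last
sequence sit relative to the store and the "send" step. It is only an
illustration: store_fence_send(), send_fn, struct observer, observe() and
demo_same_cpu() are hypothetical names, not kernel APIs, and the WRMSR is
replaced by a callback so the code can run unprivileged.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the WRMSR step (e.g. the IPI write). */
typedef void (*send_fn)(void *ctx);

/* Hypothetical observer: records what a load of *addr returns. */
struct observer {
	volatile uint64_t *addr;
	uint64_t seen;
};

void observe(void *ctx)
{
	struct observer *obs = ctx;

	obs->seen = *obs->addr;		/* mov (addr), %rax */
}

/*
 * Store, then MFENCE;LFENCE, then the "send" step, mirroring the
 * sequence quoted above. Not a kernel helper.
 */
static inline void store_fence_send(volatile uint64_t *addr, uint64_t val,
				    send_fn send, void *ctx)
{
	*addr = val;			/* mov $val, (addr) */
#if defined(__x86_64__) || defined(__i386__)
	/* MFENCE orders the store; LFENCE holds back later instructions. */
	__asm__ volatile("mfence\n\tlfence" ::: "memory");
#else
	/* Assumption: portable stand-in for non-x86 builds. */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
	send(ctx);			/* wrmsr would go here */
}

/* Same-CPU usage example; returns the value the observer loaded. */
uint64_t demo_same_cpu(void)
{
	volatile uint64_t slot = 0;
	struct observer obs = { &slot, 0 };

	store_fence_send(&slot, 1, observe, &obs);
	assert(obs.seen == 1);
	return obs.seen;
}
```

Note that this only shows the placement of the fences. The observer runs
on the same CPU, where a load always sees the CPU's own earlier store, so
it says nothing about whether a remote CPU taking the interrupt could
observe the old value.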
I understand that "what the architecture guarantees" differs from "how
parts behave in practice".
And I also understand the reasoning behind declaring MFENCE;LFENCE the
only architecturally guaranteed way of ensuring all stores are
globally visible before the WRMSR starts.
However, what is missing is an explanation of how it is possible to build
a causality-preserving part where those fences (for plain stores) are
necessary in practice.
That sequence is a large set of pipeline stalls in practice for what
appears to be a problem in theory only.
~Andrew