Re: [RFC][PATCH 2/2] x86: add extra serialization for non-serializing MSRs
From: Peter Zijlstra
Date: Fri Feb 05 2021 - 19:53:39 EST
On Thu, Feb 04, 2021 at 04:11:12PM -0800, Andy Lutomirski wrote:
> I'm wondering if a more mild violation is possible:
>
> Initialize *addr = 0.
>
> mov $1, (addr)
> wrmsr
>
> remote cpu's IDT vector:
>
> mov (addr), %rax
> %rax == 0!
>
> There's no speculative-execution-becoming-visible-even-if-it-doesn't-retire
> here -- there's just an ordering violation. For Linux, this would
> presumably only manifest as a potential deadlock or confusion if the
> IPI vector code looks at the list of pending work and doesn't find the
> expected work in it.
>
> Dave? hpa? What is the SDM trying to tell us?
[ Big caveat, I've not spoken to any hardware people about this. The
below is purely my own understanding. ]
This is my interpretation as well. Without the MFENCE+LFENCE there is no
guarantee the store has left the store buffer, so the remote load isn't
guaranteed to observe it.
What I think the SDM is trying to tell us, is that the IPI, even if it
goes on the same regular coherency fabric as memory transfers, is not
subject to the regular memory ordering rules.
Normal TSO rules tell us that when:
	P1() {
		x = 1;
		y = 1;
	}

	P2() {
		r1 = y;
		r2 = x;
	}
r2 must not be 0 when r1 is 1, because if we see the store to y, we must
also see the store to x. But the IPI thing doesn't behave like a store.
The (fast) wrmsr isn't even considered a memop.
The thing is, the above ordering does not guarantee we have r2 != 0.
r2==0 is allowed when r1==0. And that's an entirely sane outcome even if
we run the instructions like:
                CPU1                    CPU2

cycle-1         mov $1, ([x])
cycle-2         mov $1, ([y])
cycle-3                                 mov ([y]), rax
cycle-4                                 mov ([x]), rbx
There is no guarantee _any_ of the stores will have made it out. And
that's exactly the issue. The IPI might make it out of the core before
any of the stores will.
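
The allowed outcomes of the P1/P2 example can be enumerated mechanically.
What follows is a minimal sketch in C, assuming a simplified view of TSO in
which P1's stores become globally visible in program order and P2's loads
execute in program order. The names enumerate_tso_outcomes() and seen[][]
are made up for this illustration; nothing here models real hardware
timing.

```c
#include <stdbool.h>

/* Count the set bits in the low four bits of mask. */
static int count_bits4(unsigned int mask)
{
	int n = 0;

	for (int bit = 0; bit < 4; bit++)
		n += (mask >> bit) & 1u;
	return n;
}

/*
 * Toy model (an assumption, not a hardware description): four global
 * events, two per CPU, each CPU's events kept in program order:
 *   P1: x becomes visible, then y becomes visible
 *   P2: r1 = y, then r2 = x
 * A 4-bit mask with exactly two bits set picks the time slots that
 * belong to P1; the remaining slots belong to P2.
 *
 * On return, seen[r1][r2] is true iff some interleaving yields (r1, r2).
 */
static void enumerate_tso_outcomes(bool seen[2][2])
{
	for (int a = 0; a < 2; a++)
		for (int b = 0; b < 2; b++)
			seen[a][b] = false;

	for (unsigned int mask = 0; mask < 16; mask++) {
		int x = 0, y = 0, r1 = 0, r2 = 0;
		int p1_step = 0, p2_step = 0;

		if (count_bits4(mask) != 2)
			continue;

		for (int slot = 0; slot < 4; slot++) {
			if (mask & (1u << slot)) {
				/* P1's next event: x first, then y. */
				if (p1_step++ == 0)
					x = 1;
				else
					y = 1;
			} else {
				/* P2's next event: load y first, then x. */
				if (p2_step++ == 0)
					r1 = y;
				else
					r2 = x;
			}
		}
		seen[r1][r2] = true;
	}
}
```

In this model seen[1][0] stays false while seen[0][0], seen[0][1] and
seen[1][1] all become true: r1 == 1 with r2 == 0 never shows up, but
r1 == 0 with r2 == 0 does, which is the case where P2 ran before either
store made it out.
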
Furthermore, since there is no dependency between:
	mov $1, ([x])
	wrmsr
the CPU is allowed to reorder the execution and retire the wrmsr before
the store, very much like it would for normal non-dependent
instructions.
And presumably it is still allowed to do that when we write it like:
	mov $1, ([x])
	mfence
	wrmsr
because mfence only has dependencies on memops and (fast) wrmsr is not
a memop.
Which then brings us to:
	mov $1, ([x])
	mfence
	lfence
	wrmsr
In this case, the lfence acts like the newly minted ifence (see
Spectre) and will block execution of (any) later instructions until
completion of all prior instructions. This, and only this, ensures the
wrmsr happens after the mfence, which in turn ensures the store to x is
globally visible.
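
To compare the three instruction sequences side by side, here is a small
toy model in C. It only encodes the interpretation given above and is not
a statement about what any real CPU does: the assumptions are that a
(fast) wrmsr may execute ahead of anything that precedes it until it hits
an lfence, that a store sits in the store buffer until an mfence drains
it, and that the remote IPI handler reads memory at the moment the wrmsr
executes. The enum op values and ipi_sees_store() are hypothetical names
invented for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

enum op { OP_STORE_X, OP_MFENCE, OP_LFENCE, OP_WRMSR };

/*
 * Returns true if, in the worst case this toy model allows, the remote
 * IPI handler is still guaranteed to load x == 1.
 */
static bool ipi_sees_store(const enum op *prog, size_t n)
{
	size_t wrmsr = n;	/* index of the first wrmsr, n if none */
	size_t start = 0;	/* earliest point the wrmsr may execute */
	bool buffered = false;	/* store to x is in the store buffer */
	bool visible = false;	/* store to x is globally visible */

	for (size_t i = 0; i < n; i++) {
		if (prog[i] == OP_WRMSR) {
			wrmsr = i;
			break;
		}
	}
	if (wrmsr == n)
		return false;	/* no IPI is ever sent */

	/* Assumption: the wrmsr can be hoisted up to the last lfence. */
	for (size_t i = 0; i < wrmsr; i++) {
		if (prog[i] == OP_LFENCE)
			start = i + 1;
	}

	/* Only the ops before 'start' have completed by then. */
	for (size_t i = 0; i < start; i++) {
		if (prog[i] == OP_STORE_X) {
			buffered = true;
		} else if (prog[i] == OP_MFENCE && buffered) {
			/* Assumption: mfence drains the store buffer. */
			buffered = false;
			visible = true;
		}
	}
	return visible;
}
```

Under these assumptions only the store, mfence, lfence, wrmsr sequence
returns true. Store followed directly by wrmsr, store + mfence + wrmsr,
and store + lfence + wrmsr all return false: in the first two the wrmsr
is free to run first, and in the last the store has completed locally but
may still be sitting in the store buffer.
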