Re: [PATCH] x86: Add an explicit barrier() to clflushopt()
From: Andy Lutomirski
Date: Tue Jan 12 2016 - 21:43:14 EST
On Tue, Jan 12, 2016 at 6:06 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> The double clflush() remains a mystery.
>
> Actually, I think it's explainable.
>
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
>
> Why?
>
> If the GPU really isn't cache coherent, what can happen is:
>
> - the CPU has the line cached
>
> - the GPU writes the data
>
> - you do the clflushopt to invalidate the cacheline
>
> - you expect to see the GPU data.
>
> Right?
>
> Wrong. The above is complete crap.
>
> Why?
>
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.
>
> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be true. The CPU may
> have written to it quite a long time ago.
>
> So if you are doing a GPU write, and you want to see the data that the
> GPU wrote, you had better do the clflushopt long *before* the GPU ever
> writes to memory.
>
> Your pattern of doing "flush and read" is simply fundamentally buggy.
> There are only two valid CPU flushing patterns:
>
> - write and flush (to make the writes visible to the GPU)
>
> - flush before starting GPU accesses, and then read
>
> At no point can "flush and read" be right.
>
> Now, I haven't actually seen your code, so I'm just going by your
> high-level description of where the CPU flush and CPU read were done,
> but it *sounds* like you did that invalid "flush and read" behavior.
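For what it's worth, I read those two valid patterns as roughly the
following (clflush_cache_range() is the existing kernel helper; the
gpu_*() calls are just placeholders for whatever i915 actually does,
not real functions):

/* Pattern 1: CPU writes, then flushes, so the GPU sees the data. */
memcpy(shared_buf, src, len);
clflush_cache_range(shared_buf, len);  /* push dirty lines to memory */
gpu_start_read(shared_buf, len);

/* Pattern 2: flush *before* the GPU writes, then read afterwards. */
clflush_cache_range(shared_buf, len);  /* no dirty lines left to write back */
gpu_start_write(shared_buf, len);
gpu_wait_idle();
memcpy(dst, shared_buf, len);          /* these reads have to go to memory */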
Since barriers are on my mind: how strong a barrier is needed to
prevent cache fills from being speculated across the barrier?
Concretely, if I do:
clflush A
clflush B
load A
load B
the architecture guarantees that (unless store forwarding happens) the
value I see for B is at least as new as the value I see for A *with
respect to other accesses within the coherency domain*. But the GPU
isn't in the coherency domain at all.
Is:
clflush A
clflush B
load A
MFENCE
load B
good enough? If it is, and if
clflush A
clflush B
load A
LOCK whatever
load B
is *not*, then this might account for the performance difference.
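Spelled out with the kernel's helpers, the three sequences I'm asking
about look something like this (just a sketch; a and b are pointers to
the two cachelines, and whether any of these actually fences
speculative cache fills is exactly my question):

u32 va, vb;

/* No barrier at all between the loads. */
clflush(a);
clflush(b);
va = READ_ONCE(*a);
vb = READ_ONCE(*b);

/* Variant 1: MFENCE between the loads. */
clflush(a);
clflush(b);
va = READ_ONCE(*a);
asm volatile("mfence" ::: "memory");
vb = READ_ONCE(*b);

/* Variant 2: a LOCK-prefixed RMW instead of MFENCE. */
clflush(a);
clflush(b);
va = READ_ONCE(*a);
asm volatile("lock; addl $0,(%%rsp)" ::: "memory", "cc");
vb = READ_ONCE(*b);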
In any event, it seems to me that what i915 is trying to do isn't
really intended to be supported for WB memory. i915 really wants to
force a read from main memory and to simultaneously prevent the CPU
from writing back to main memory. Ick. I'd assume that:
clflush A
clflush B
load A
serializing instruction here
load B
is good enough, as long as you make sure that the GPU does its writes
after the clflushes make it all the way out to main memory (which
might require a second serializing instruction in the case of
clflushopt), but this still relies on the hardware prefetcher not
prefetching B too early, which it's permitted to do even in the
absence of any explicit access at all.
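Concretely, the "serializing instruction" version I have in mind would
be something like this (sketch only; gpu_do_writes() is a placeholder
for kicking the GPU, and CPUID is just the obvious way to get a fully
serializing instruction):

u32 va, vb;
unsigned int eax = 0, ebx, ecx = 0, edx;

clflush(a);
clflush(b);
mb();                   /* try to get the flushes out before the GPU writes */
gpu_do_writes();        /* placeholder: GPU writes the two lines */
va = READ_ONCE(*a);
asm volatile("cpuid"    /* serializing: hopefully kills any early fill of B */
             : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
             : : "memory");
vb = READ_ONCE(*b);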
Presumably this is good enough on any implementation:
clflush A
clflush B
load A
clflush B
load B
But that will be really, really slow. And you're still screwed if the
hardware is permitted to arbitrarily change cache lines from S to M.
In other words, I'm not really convinced that x86 was ever intended to
have well-defined behavior if something outside the coherency domain
writes to a page of memory while that page is mapped WB. Of course,
I'm also not sure how to reliably switch a page from WB to any other
memory type short of remapping it and doing CLFLUSH after remapping.
SDM Volume 3 11.12.4 seems to agree with me.
Could the driver be changed to use WC or UC and to use MOVNTDQA on
supported CPUs to get the performance back? It sounds like i915 is
effectively doing PIO here, and reasonably modern CPUs have a nice set
of fast PIO instructions.
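Something along these lines is what I'm picturing, at least for the
read side (purely a sketch: the buffer would have to be mapped WC and
16-byte aligned, a kernel version needs kernel_fpu_begin()/end()
around it, and copy_from_wc() is a name I just made up):

#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#include <stddef.h>

static void copy_from_wc(void *dst, const void *wc_src, size_t len)
{
        __m128i *d = dst;
        const __m128i *s = wc_src;
        size_t i;

        /* len assumed to be a multiple of 16 */
        for (i = 0; i < len / 16; i++)
                d[i] = _mm_stream_load_si128((__m128i *)&s[i]);

        _mm_mfence();   /* order the streaming loads with later accesses */
}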
--Andy