Re: [PATCH] x86: Add an explicit barrier() to clflushopt()

From: Chris Wilson
Date: Tue Jan 12 2016 - 16:15:42 EST

On Tue, Jan 12, 2016 at 09:05:19AM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 8:37 AM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, Jan 11, 2016 at 09:05:06PM +0000, Chris Wilson wrote:
> >> I can narrow down the principal buggy path by doing the clflush(vend-1)
> >> in the callers at least.
> >
> > That leads to the suspect path being a read back of a cache line from
> > main memory that was just written to by the GPU.
> How do you know it was written by the GPU?

Test construction: write some data, copy it on the GPU, read it back.
Repeat for various data, sequences of copy (differing GPU instructions,
intermediates etc), and accessors.

> Maybe it's a memory ordering issue on the GPU. Say it writes something
> to memory, then sets the "I'm done" flag (or whatever you check), but
> because of ordering on the GPU the "I'm done" flag is visible before.

That is a continual worry. To try and assuage that fear, I sent 8x
flush gpu writes between the end of the copy and setting the "I'm done"
flag. The definition of the GPU flush is that it both flushes all
previous writes before it completes and only after it completes does it
do the post-sync write (before moving onto the next command). The spec
is always a bit hazy on what order the memory writes will be visible on
the CPU though.

Sending the 8x GPU flushes before marking "I'm done" did not fix the

> So the reason you see the old content may just be that the GPU writes
> are still buffered on the GPU. And you adding a clflushopt on the same
> address just changes the timing enough that you don't see the memory
> ordering any more (or it's just much harder to see, it might still be
> there).

Indeed. So I replaced the post-clflush_cache_range() clflush() with a
udelay(10) instead, and the corruption vanished. Putting the udelay(10)
before the clflush_cache_range() does not fix the corruption.

> Maybe the reason you only see the problem with the last cacheline is
> simply that the "last" cacheline is also the last that was written by
> the GPU, and it's still in the GPU write buffers.

Exactly the fear.

> Also, did you ever print out the value of clflush_size? Maybe we just
> got it wrong and it's bogus data.

It's 64 bytes as expected. And fudging it to any other value quickly
explodes :)


/* lots of GPU flushes + GPU/CPU sync point */
clflush_cache_range(vaddr, size);
memcpy(user, vaddr, size);

fails, but

/* lots of GPU flushes + GPU/CPU sync point */
clflush_cache_range(vaddr, size);
memcpy(user, vaddr, size);

passes, I'm inclined to point the finger at the mb() following the

Chris Wilson, Intel Open Source Technology Centre