Re: [PATCH] x86: Add an explicit barrier() to clflushopt()

From: Chris Wilson
Date: Wed Jan 13 2016 - 07:35:55 EST


On Tue, Jan 12, 2016 at 06:06:34PM -0800, Linus Torvalds wrote:
> On Tue, Jan 12, 2016 at 4:55 PM, Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > The double clflush() remains a mystery.
>
> Actually, I think it's explainable.
>
> It's wrong to do the clflush *after* the GPU has done the write, which
> seems to be what you are doing.
>
> Why?
>
> If the GPU really isn't cache coherent, what can happen is:
>
> - the CPU has the line cached
>
> - the GPU writes the data
>
> - you do the clflushopt to invalidate the cacheline
>
> - you expect to see the GPU data.
>
> Right?
>
> Wrong. The above is complete crap.
>
> Why?
>
> Very simple reason: the CPU may have had the cacheline dirty at some
> level in its caches, so when you did the clflushopt, it didn't just
> invalidate the CPU cacheline, it wrote it back to memory. And in the
> process over-wrote the data that the GPU had written.

Forgive me for being dense, but if we overwrite the GPU data in the
backing struct page with the cacheline from the CPU, how do we see the
results from the GPU afterwards?
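
That is, the sequence you describe, in rough userspace terms (a sketch
only, assuming a 64-byte cacheline and the clflushopt intrinsic; the
device write is a stand-in comment), would be:

#include <immintrin.h>
#include <stddef.h>

#define CACHELINE 64    /* assumed line size, for the sketch only */

/* buf is a buffer a non-coherent device writes into */
static void broken_readback(volatile char *buf, size_t len)
{
        size_t i;

        buf[0] = 1;     /* CPU dirtied the first line, possibly long ago */

        /* ... device writes fresh data into buf behind the CPU's back ... */

        for (i = 0; i < len; i += CACHELINE)
                _mm_clflushopt((void *)&buf[i]);
        _mm_sfence();   /* clflushopt is weakly ordered, so fence it */

        /*
         * The clflushopt of the dirty first line has just written the stale
         * CPU data back to memory, clobbering what the device wrote:
         * buf[0] now reads back the CPU's old value, not the device's.
         */
}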

> Now you can say "but the CPU never wrote to the cacheline, so it's not
> dirty in the CPU caches". That may or may not be true. The CPU may
> have written to it quite a long time ago.

What we do is clflush the entire object after it is written to by the
CPU (including newly allocated objects from shmemfs, or pages being
returned to us by shmemfs) before any operation on that object by the
GPU. We have to, so that the GPU sees the correct page contents.
(If I change the clflush of the written objects to a wbinvd, that is
still not sufficient for the tests to pass.)
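
In other words, the write-side flush is morally equivalent to something
like this (a simplified sketch, not the actual i915 code;
clflush_cache_range() is the existing x86 helper, the function and
parameter names here are made up for illustration):

#include <linux/highmem.h>      /* kmap_atomic(), kunmap_atomic() */
#include <asm/cacheflush.h>     /* clflush_cache_range() */

/*
 * Writeback + invalidate every line backing the object before the GPU
 * touches it, so that no dirty CPU line can be written back over the
 * GPU's data later.
 */
static void flush_object_for_gpu(struct page **pages, int npages)
{
        int i;

        for (i = 0; i < npages; i++) {
                void *vaddr = kmap_atomic(pages[i]);

                clflush_cache_range(vaddr, PAGE_SIZE);
                kunmap_atomic(vaddr);
        }

        /*
         * clflush_cache_range() should already fence around the flushes;
         * the explicit wmb() only documents the ordering we depend on
         * before the GPU is allowed to start.
         */
        wmb();
}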

We do not clflush the object after we read the backing pages on the CPU
before the next GPU operation, even if it is a GPU write. This leaves us
with clean but stale cachelines, or at least it should. That is why we
then clflush_cache_range() prior to the next read of the object; it is
intended to be a pure cacheline invalidation.
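
So the read side is intended to be nothing more than (again a rough
sketch, made-up name):

/*
 * Before the CPU reads data the GPU has just written: the lines covering
 * the range should already be clean, so this acts as a pure invalidation
 * and the next CPU read refetches from memory.
 */
static void invalidate_before_cpu_read(void *vaddr, unsigned int len)
{
        clflush_cache_range(vaddr, len);
}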

If we clflush the entire object between every CPU readback and the *next*
GPU operation, it fails. If we clflush the object before every GPU write
to it, it passes. And to recap, we always clflush after a CPU write.

I am reasonably confident that any cachelines we dirty (or inherit)
are flushed. What you are suggesting is that there are dirty cachelines
regardless. I am also reasonably confident that even if we clflush the
entire object after touching it and before the GPU write, and clflush the
individual cachelines again after the GPU write, we still see the errors.

I haven't found the hole yet, nor have I been able to convincingly
explain the differences between gens.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre