On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote:
On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@xxxxxxxxx> wrote:[...]
On reviewing the documentation and code for smp_load_acquire() it occuredSo I don't hate the concept, but. there's a couple of reasons to think
to me that implementing something similar for CPU <-> device interraction
would be worth while. This commit provides just the load/read side of this
in the form of read_acquire().
this is broken.
One is just the name. Why do we have "smp_load_acquire()", but then
call the non-smp version "read_acquire()"? That makes very little
sense to me. Why did "load" become "read"?
But we do have a very real difference between "smp_rmb()" (inter-cpuRight, so now I see what's going on here. This isn't actually anything
cache coherency read barrier) and "rmb()" (full memory barrier that
synchronizes with IO).
And your patch is very confused about this. In *some* places you use
"rmb()", and in other places you just use "smp_load_acquire()". Have
you done extensive verification to check that this is actually ok?
Because the performance difference you quote very much seems to be
about your x86 testing now akipping the IO-synchronizing "rmb()", and
depending on DMA being ordered even without it.
And I'm pretty sure that's actually fine on x86. The real
IO-synchronizing rmb() (which translates into a lfence) is only needed
for when you have uncached accesses (ie mmio) on x86. So I don't think
your code is wrong, I just want to verify that everybody understands
the issues. I'm not even sure DMA can ever really have weaker memory
ordering (I really don't see how you'd be able to do a read barrier
without DMA stores being ordered natively), so maybe I worry too much,
but the ppc people in particular should look at this, because the ppc
memory ordering rules and serialization are some completely odd ad-hoc
black magic....
to do with acquire/release (I don't know of any architectures that have
a read-barrier-acquire instruction), it's all about DMA to main memory.
If a device is DMA'ing data *and* control information (e.g. 'descriptor
valid') to memory, then it must be maintaining order between those writes
with respect to memory. In that case, using the usual MMIO barriers can
be overkill because we really just want to enforce read-ordering on the CPU
side. In fact, I think you could even do this with a fake address dependency
on ARM (although I'm not actually suggesting we do that).
In light of that, it actually sounds like we want a new set of barrier
macros that apply only to DMA buffer accesses by the CPU -- they wouldn't
enforce ordering against things like MMIO registers. I wonder whether any
architectures would implement them differently to the smp_* flavours?
But anything with non-cache-coherent DMA is obviously very suspect too.I think non-cache-coherent DMA should work too (at least, on ARM), but
only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable
mapping).
Will