Re: [PATCH] arch: Introduce read_acquire()

From: Alexander Duyck
Date: Wed Nov 12 2014 - 10:44:18 EST

In reply to: Will Deacon: "Re: [PATCH] arch: Introduce read_acquire()"
Next in thread: Will Deacon: "Re: [PATCH] arch: Introduce read_acquire()"

On 11/12/2014 02:10 AM, Will Deacon wrote:

On Tue, Nov 11, 2014 at 07:40:22PM +0000, Linus Torvalds wrote:

On Tue, Nov 11, 2014 at 10:57 AM, <alexander.duyck@xxxxxxxxx> wrote:

On reviewing the documentation and code for smp_load_acquire() it occurred
to me that implementing something similar for CPU <-> device interaction
would be worthwhile. This commit provides just the load/read side of this
in the form of read_acquire().

So I don't hate the concept, but there's a couple of reasons to think
this is broken.

One is just the name. Why do we have "smp_load_acquire()", but then
call the non-smp version "read_acquire()"? That makes very little
sense to me. Why did "load" become "read"?

[...]

But we do have a very real difference between "smp_rmb()" (inter-cpu
cache coherency read barrier) and "rmb()" (full memory barrier that
synchronizes with IO).

And your patch is very confused about this. In *some* places you use
"rmb()", and in other places you just use "smp_load_acquire()". Have
you done extensive verification to check that this is actually ok?
Because the performance difference you quote very much seems to be
about your x86 testing now skipping the IO-synchronizing "rmb()", and
depending on DMA being ordered even without it.

And I'm pretty sure that's actually fine on x86. The real
IO-synchronizing rmb() (which translates into an lfence) is only needed
for when you have uncached accesses (ie mmio) on x86. So I don't think
your code is wrong, I just want to verify that everybody understands
the issues. I'm not even sure DMA can ever really have weaker memory
ordering (I really don't see how you'd be able to do a read barrier
without DMA stores being ordered natively), so maybe I worry too much,
but the ppc people in particular should look at this, because the ppc
memory ordering rules and serialization are some completely odd ad-hoc
black magic....

Right, so now I see what's going on here. This isn't actually anything
to do with acquire/release (I don't know of any architectures that have
a read-barrier-acquire instruction), it's all about DMA to main memory.

Actually, it sort of is. I just hadn't realized it until I read some explanations of the C11 acquire/release memory-order semantics. I believe most network drivers are already doing acquire/release logic, because we usually pass data back and forth between the device and the system through something like a lockless descriptor ring. The net win for device drivers is that we can replace some of the heavyweight barriers we currently have to use with lighter barriers or primitives, such as lwsync instead of sync on PowerPC, or ldar instead of dsb(ld) on arm64.
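
As a rough illustration of the acquire/release pattern on a descriptor ring, here is a userspace sketch using C11 atomics. It is not the code from the patch: the ring layout, DESC_DONE, fake_device_write() and rx_poll() are all hypothetical names, and the "device" is just a function that publishes the descriptor with a release store. The point is only that the CPU side needs a single load-acquire on the status word before it reads the rest of the descriptor.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 4u   /* hypothetical ring size */
#define DESC_DONE 0x1u /* hypothetical "descriptor valid" bit */

struct rx_desc {
	uint32_t length;         /* payload field, written first */
	_Atomic uint32_t status; /* control word, written last */
};

static struct rx_desc ring[RING_SIZE];

/* Stand-in for the device: write the data, then publish the status. */
static void fake_device_write(unsigned int idx, uint32_t length)
{
	assert(idx < RING_SIZE);
	ring[idx].length = length;
	atomic_store_explicit(&ring[idx].status, DESC_DONE,
			      memory_order_release);
}

/*
 * CPU side: load-acquire the status word. If DESC_DONE is set, the
 * acquire ordering guarantees the read of length below cannot be
 * satisfied before the status read.
 */
static bool rx_poll(unsigned int idx, uint32_t *length)
{
	uint32_t status;

	assert(idx < RING_SIZE);
	status = atomic_load_explicit(&ring[idx].status,
				      memory_order_acquire);
	if (!(status & DESC_DONE))
		return false;

	*length = ring[idx].length;

	/* Hand the descriptor back; nothing depends on this store here. */
	atomic_store_explicit(&ring[idx].status, 0, memory_order_relaxed);
	return true;
}
```

Calling rx_poll() before fake_device_write() returns false; after the write it returns true once and reports the published length.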

If a device is DMA'ing data *and* control information (e.g. 'descriptor
valid') to memory, then it must be maintaining order between those writes
with respect to memory. In that case, using the usual MMIO barriers can
be overkill because we really just want to enforce read-ordering on the CPU
side. In fact, I think you could even do this with a fake address dependency
on ARM (although I'm not actually suggesting we do that).

In light of that, it actually sounds like we want a new set of barrier
macros that apply only to DMA buffer accesses by the CPU -- they wouldn't
enforce ordering against things like MMIO registers. I wonder whether any
architectures would implement them differently to the smp_* flavours?

My concern would be the cost of the barriers versus the acquire/release primitives. In the case of arm64, I am assuming there is a reason for wanting to use ldar rather than dsb instructions. I would imagine device drivers would want to get the same kind of advantage.
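
For comparison, a barrier-macro flavour of the same idea might look like the userspace sketch below. The names dma_buf_rmb(), dma_buf_wmb(), struct dma_desc, desc_publish() and desc_consume() are made up for illustration, and the macros are mapped onto C11 fences only so the example can be compiled outside the kernel. A real implementation would pick per-architecture instructions, and nothing here orders accesses against MMIO registers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical DMA-buffer-only barriers, modelled with C11 fences. */
#define dma_buf_rmb() atomic_thread_fence(memory_order_acquire)
#define dma_buf_wmb() atomic_thread_fence(memory_order_release)

#define DESC_DONE 0x1u /* hypothetical "descriptor valid" bit */

struct dma_desc {
	_Atomic uint32_t status; /* control word */
	uint32_t length;         /* data fields */
	uint32_t csum;
};

/* Producer (stand-in for the device): data first, then the status. */
static void desc_publish(struct dma_desc *d, uint32_t len, uint32_t csum)
{
	assert(d);
	d->length = len;
	d->csum = csum;
	dma_buf_wmb(); /* order the data writes before the status write */
	atomic_store_explicit(&d->status, DESC_DONE, memory_order_relaxed);
}

/* Consumer (CPU): check the status, then one barrier covers all reads. */
static int desc_consume(struct dma_desc *d, uint32_t *len, uint32_t *csum)
{
	assert(d && len && csum);
	if (!(atomic_load_explicit(&d->status, memory_order_relaxed) &
	      DESC_DONE))
		return 0;

	dma_buf_rmb(); /* order the status read before the data reads */
	*len = d->length;
	*csum = d->csum;
	return 1;
}
```

The difference from a load-acquire is where the ordering is attached: here a standalone barrier sits between the status check and the data reads, whereas a load-acquire ties the ordering to the status load itself.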

But anything with non-cache-coherent DMA is obviously very suspect too.

I think non-cache-coherent DMA should work too (at least, on ARM), but
only for buffers mapped via dma_alloc_coherent (i.e. a non-cacheable
mapping).

Will

For now my plan is to focus only on coherent memory with this. Specifically, it is really only intended for use with memory allocated via dma_alloc_coherent.

Thanks,

Alex
