Re: [PATCH 20/28] ARCv2: barriers

From: Vineet Gupta
Date: Fri Jun 19 2015 - 09:14:52 EST

On Thursday 11 June 2015 07:09 PM, Will Deacon wrote:
> On Thu, Jun 11, 2015 at 01:13:28PM +0100, Vineet Gupta wrote:
>> On Wednesday 10 June 2015 06:31 PM, Will Deacon wrote:
>>> On Wed, Jun 10, 2015 at 11:58:40AM +0100, Peter Zijlstra wrote:
>>>> On Wed, Jun 10, 2015 at 09:34:18AM +0000, Vineet Gupta wrote:
>>>>> On Tuesday 09 June 2015 06:10 PM, Peter Zijlstra wrote:
>>>> I think the most interesting part is the device side.
>>>>>>> +/*
>>>>>>> + * DSYNC:
>>>>>>> + * - Waits for completion of all outstanding memory operations before any new
>>>>>>> + * operations can begin
>>>>>>> + * - Includes implicit memory operations such as cache/TLB/BPU maintenance ops
>>>>>>> + * - Lighter version of SYNC as it doesn't wait for non-memory operations
>>>>>>> + */
>>>>>>> +#define mb() asm volatile("dsync\n" : : : "memory")
>>>>>> So mb() is supposed to order against things like DMA memory ops, is DMA
>>>>>> part of point 1 or 3, if 3, this is not a suitable instruction.
>>>>> Can u please explain the DMA case a bit more ? From what I understood and used in
>>>>> say ethernet driver, it is more of a line drawn between say cpu updating a shared
>>>>> buffer descriptor and kicking a MMIO register (which in turn could initiate a DMA)
>>>>> but I'm not sure how mb() can possibly order with DMA per se (unless there's some
>>>>> advanced form of IO-coherency)
>>>> I'm afraid I might not be the best of sources here, I tend to stay away
>>>> from actual device stuff like that. I've Cc'ed Will Deacon who might be
>>>> able to shed a bit more light on this aspect.
>>> I'd definitely expect mb() to order arbitrary memory accesses against each
>>> other (i.e. regardless of whether or not they're to RAM or MMIO devices).
>>> Some drivers use it to "flush the writebuffer" but I don't think that makes
>>> a whole lot of sense. Certainly, on ARM, if we want to know that something
>>> reached an MMIO endpoint then we'll need a read-back as well as the barrier
>>> for the general case.
>>> You also need that guarantee in your readl/writel family of macros. It's
>>> extremely heavy and rarely needed, which is why I added the _relaxed
>>> versions to all architectures.
>> Wow - adding that to these accessors will really be heavy - given that a whole
>> bunch of drivers still use the stock API (or perhaps don't know / care whether
>> they need the readl or the relaxed api). And it is practically impossible to switch
>> them over - after all, if it ain't broken how can u fix it. So far we've been testing this
>> implementation (readl/writel - w/o any explicit barrier) on slower FPGA builds and
>> this includes a whole bunch of designware IP - mmc, eth, gpio.... and don't see
>> any ill effects - do you reckon we still need to add it.
> Unfortunately, yes, as that's effectively what the kernel requires:

Oh great - thanks for those!

> The conclusion is that x86 *does* provide this ordering in its accessors
> and drivers are written to assume that, so either you go round fixing all
> the drivers by adding the missing barriers or you implement it in your
> accessors (like we have done on ARM). Subtle I/O ordering issues are no
> fun to debug.
> That's also the reason I added the _relaxed versions, so you can port
> drivers one-by-one to the weaker semantics whilst having the potentially
> broken drivers continue to work.

OK, so given that regular/mmio is also weakly ordered, it would seem that we need
a full mb() *before* and *after* the IO access in the non-relaxed API. ARM code
seems to put an rmb() after the readl and a wmb() before the writel. Is that based
on ordering the h/w already provides for some of these cases?
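
To make that placement concrete, here is a minimal user-space sketch of the
"rmb() after the read, wmb() before the write" shape described above. It is not
the actual ARM (or ARC) implementation: every sketch_* name is hypothetical, and
the C11 fences are only stand-ins for the arch barriers so that it builds
outside the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: C11 fences stand in for the arch's rmb()/wmb(). */
#define sketch_rmb() atomic_thread_fence(memory_order_acquire)
#define sketch_wmb() atomic_thread_fence(memory_order_release)

/* Relaxed accessors: a bare volatile access, no ordering against RAM. */
static inline uint32_t sketch_readl_relaxed(const volatile uint32_t *addr)
{
	return *addr;
}

static inline void sketch_writel_relaxed(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
}

/* Ordered read: the barrier comes AFTER the load, so later accesses
 * (e.g. reading a DMA buffer) cannot be satisfied before the register read. */
static inline uint32_t sketch_readl(const volatile uint32_t *addr)
{
	uint32_t v = sketch_readl_relaxed(addr);

	sketch_rmb();
	return v;
}

/* Ordered write: the barrier comes BEFORE the store, so earlier writes
 * (e.g. filling a descriptor) are ordered ahead of the register write. */
static inline void sketch_writel(uint32_t val, volatile uint32_t *addr)
{
	sketch_wmb();
	sketch_writel_relaxed(val, addr);
}

/* Tiny self-check against a fake "register" living on the stack. */
static uint32_t sketch_roundtrip(uint32_t val)
{
	volatile uint32_t fake_reg = 0;

	sketch_writel(val, &fake_reg);
	assert(fake_reg == val);
	return sketch_readl(&fake_reg);
}
```

With a full mb() on both sides instead, each accessor would simply gain a second
barrier call; the relaxed variants would stay as they are.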

In one of the links you posted above, Catalin posed the same question, but I
didn't see a response to that.

| If we are to make the writel/readl on ARM fully ordered with both IO
| (enforced by hardware) and uncached memory, do we add barriers on each
| side of the writel/readl etc.? The common cases would require a barrier
| before writel (write buffer flushing) and a barrier after readl (in case
| of polling for a "DMA complete" state).
| So if io_wmb() just orders to IO writes (writel_relaxed), does it mean
| that we still need a mighty wmb() that orders any type of accesses (i.e.
| uncached memory vs IO)? Can drivers not use the strict writel() and no
| longer rely on wmb() (wondering whether we could simplify it on ARM with
| fully ordered IO accessors)?
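
The two common cases named in that quote (a barrier before the write that kicks
the device, a barrier after the read that polls for "DMA complete") can be
sketched from the driver's side as below. This is a single-threaded user-space
model under stated assumptions: struct fake_dev, fake_dev_run() and the
strict_* helpers are made up for illustration, and the C11 fences merely mark
where the real barriers would sit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define DMA_DONE 0x1u

/* Hypothetical device model: two registers plus DMA buffers in RAM. */
struct fake_dev {
	volatile uint32_t status;
	volatile uint32_t doorbell;
	uint8_t tx_buf[16];
	uint8_t rx_buf[16];
};

/* Strict write: barrier first, then the register store. */
static void strict_writel(uint32_t val, volatile uint32_t *reg)
{
	atomic_thread_fence(memory_order_seq_cst); /* stand-in for wmb() */
	*reg = val;
}

/* Strict read: register load first, then the barrier. */
static uint32_t strict_readl(const volatile uint32_t *reg)
{
	uint32_t v = *reg;

	atomic_thread_fence(memory_order_seq_cst); /* stand-in for rmb() */
	return v;
}

/* Pretend hardware: on a doorbell, loop tx back to rx and flag completion. */
static void fake_dev_run(struct fake_dev *d)
{
	if (d->doorbell) {
		memcpy(d->rx_buf, d->tx_buf, sizeof(d->rx_buf));
		d->status = DMA_DONE;
	}
}

static int dma_loopback(const char *msg, char *out, size_t outlen)
{
	struct fake_dev d = {0};
	size_t n = strlen(msg);

	if (outlen == 0 || n >= sizeof(d.tx_buf) || n >= outlen)
		return -1;

	memcpy(d.tx_buf, msg, n);        /* 1. fill the buffer in RAM */
	strict_writel(1, &d.doorbell);   /* 2. barrier, then kick the device */
	fake_dev_run(&d);                /*    (the "hardware" runs here) */

	while (!(strict_readl(&d.status) & DMA_DONE))
		;                        /* 3. poll; barrier after each read */

	memcpy(out, d.rx_buf, n);        /* 4. only now touch the DMA data */
	out[n] = '\0';
	return 0;
}
```

If the accessors in steps 2 and 3 were the relaxed variants, the driver would
have to supply those two barriers itself.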

Further, would readl/writel then be no different from ioread32/iowrite32?

FWIW, h/w folks tell me that DMB guarantees local barrier semantics so we don't
need to use DSYNC. The latter only provides a full r+w+TLB/BPU barrier, while DMB
allows finer-grained r/w/r+w. But if we need a full mb then using one vs. the
other becomes a moot point.


>>> The "ordering against DMA" is something like reading an MMIO register to
>>> determine whether the DMA has completed, then going off to read the contents
>>> out of the DMA buffer. The comment you have about DSYNC makes it sound like
>>> it's not sufficient for this case.
>> IMHO this use case is slightly pedantic - since DMA completion will typically
>> follow up with an interrupt (I understand it's still possible to poll a dma status
>> reg). At any rate when it comes to drawing a line between memory accesses -
>> regular or mmio, DSYNC is all we got in the ISA so ARCV2 mb() has to use it -
>> there's no better option.
> Does taking an interrupt ensure visibility of the data on your
> architecture? Most non-pci device architectures allow that to race, so
> you end up relying on the readX in the irq handler to order the buffer
> access.
> If you don't have an instruction for this, then I don't understand how
> you can perform DMA to/from regions of memory that are mapped as weakly
> ordered by the CPU (e.g. how would you write a data buffer then tell the
> device to go read from it?).
> Will
