Re: [PATCH 4/7] RISC-V: arch/riscv/include

From: Palmer Dabbelt
Date: Tue Jun 06 2017 - 15:09:43 EST


On Tue, 06 Jun 2017 01:54:23 PDT (-0700), Arnd Bergmann wrote:
> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>> On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
>>> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>>>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>>>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>>>>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..d942555a7a08
>>>>>> --- /dev/null
>>>>>> +++ b/arch/riscv/include/asm/io.h
>>>>>> @@ -0,0 +1,36 @@
>>>>>
>>>>>> +#ifndef _ASM_RISCV_IO_H
>>>>>> +#define _ASM_RISCV_IO_H
>>>>>> +
>>>>>> +#include <asm-generic/io.h>
>>>>>
>>>>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>>>>> helpers using inline assembly, to prevent the compiler from breaking
>>>>> up accesses into byte accesses.
>>>>>
>>>>> Also, most architectures require some synchronization after a
>>>>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>>>>> before a writel() to flush write buffers when a DMA gets triggered.
>>>>
>>>> Makes sense. These were all OK on existing implementations (as there are
>>>> no writable PMAs, so all MMIO regions are strictly ordered), but that's not
>>>> actually what the RISC-V ISA says. I patterned this on arm64
>>>>
>>>> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>>>
>>>> where I think the only odd thing is our definition of mmiowb
>>>>
>>>> +/* IO barriers. These only fence on the IO bits because they're only required
>>>> + * to order device access. We're defining mmiowb because our AMO instructions
>>>> + * (which are used to implement locks) don't specify ordering. From Chapter 7
>>>> + * of v2.2 of the user ISA:
>>>> + * "The bits order accesses to one of the two address domains, memory or I/O,
>>>> + * depending on which address domain the atomic instruction is accessing. No
>>>> + * ordering constraint is implied to accesses to the other domain, and a FENCE
>>>> + * instruction should be used to order across both domains."
>>>> + */
>>>> +
>>>> +#define __iormb() __asm__ __volatile__ ("fence i,io" : : : "memory");
>>>> +#define __iowmb() __asm__ __volatile__ ("fence io,o" : : : "memory");
>>>
>>> Looks ok, yes.
>>>
>>>> +#define mmiowb() __asm__ __volatile__ ("fence io,io" : : : "memory");
>>>>
>>>> which I think is correct.
>>>
>>> I can never remember what exactly this one does.
>>
>> I can't find the reference again, but what I found said that if your atomics
>> (or whatever's used for locking) don't stay ordered with your MMIO accesses,
>> then you should define mmiowb to ensure ordering. I managed to screw this up,
>> as there's no "w" in the successor set (to actually enforce the AMO ordering).
>> This is somewhat confirmed by
>>
>> https://lkml.org/lkml/2006/8/31/174
>> Subject: Re: When to use mmiowb()?
>> AFAICT, they're both right. Generally, mmiowb() should be used prior to
>> unlock in a critical section whose last PIO operation is a writeX.
>>
>> Thus, I think the actual fence should be at least
>>
>> fence o,w
> ...
>> which matches what's above. I think "fence o,w" is sufficient for a mmiowb on
>> RISC-V. I'll make the change.
>
> This sounds reasonable according to the documentation, but with your
> longer explanation of the barriers, I think the __iormb/__iowmb definitions
> above are wrong. What you actually need, I think, is
>
> void writel(u32 v, volatile void __iomem *addr)
> {
>         asm volatile("fence w,o" : : : "memory");
>         writel_relaxed(v, addr);
> }
>
> u32 readl(volatile void __iomem *addr)
> {
>         u32 ret = readl_relaxed(addr);
>         asm volatile("fence i,r" : : : "memory");
>         return ret;
> }
>
> to synchronize between DMA and I/O. The barriers you listed above,
> in contrast, appear to be directed at synchronizing I/O with other I/O.
> We normally assume that this is not required when you have
> subsequent MMIO accesses on the same device (on PCI) or the
> same address region (per the ARM architecture and others). If you do
> need to enforce ordering between MMIO, you might even need to
> add those barriers in the relaxed version to be portable with drivers
> written for ARM SoCs:
>
> void writel_relaxed(u32 v, volatile void __iomem *addr)
> {
>         __raw_writel((__force u32)cpu_to_le32(v), addr);
>         asm volatile("fence o,io" : : : "memory");
> }
>
> u32 readl_relaxed(volatile void __iomem *addr)
> {
>         asm volatile("fence i,io" : : : "memory");
>         return le32_to_cpu((__force __le32)__raw_readl(addr));
> }
>
> You then end up with a barrier before and after each regular
> readl/writel in order to synchronize both with DMA and MMIO
> instructions, and you still need the extra mmiowb() to
> synchronize against the spinlock.

Ah, thanks. I guess I just had those wrong. I'll fix the non-relaxed versions
and add relaxed versions that have the fence.
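Concretely, something like this (an untested sketch: the relaxed versions
go first so the full-barrier versions can layer on top of them, and
mmiowb() picks up the "w" that was missing from its successor set):

static inline void writel_relaxed(u32 v, volatile void __iomem *addr)
{
        /* Order this MMIO write before later MMIO accesses. */
        __raw_writel((__force u32)cpu_to_le32(v), addr);
        __asm__ __volatile__ ("fence o,io" : : : "memory");
}

static inline u32 readl_relaxed(volatile void __iomem *addr)
{
        /* Complete earlier MMIO accesses before this load issues. */
        __asm__ __volatile__ ("fence i,io" : : : "memory");
        return le32_to_cpu((__force __le32)__raw_readl(addr));
}

static inline void writel(u32 v, volatile void __iomem *addr)
{
        /* Order memory writes (e.g. a DMA descriptor) before the MMIO store. */
        __asm__ __volatile__ ("fence w,o" : : : "memory");
        writel_relaxed(v, addr);
}

static inline u32 readl(volatile void __iomem *addr)
{
        /* Order the MMIO load before later reads of a DMA buffer. */
        u32 ret = readl_relaxed(addr);
        __asm__ __volatile__ ("fence i,r" : : : "memory");
        return ret;
}

#define mmiowb() __asm__ __volatile__ ("fence o,w" : : : "memory")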

> My memory on mmiowb is still a bit cloudy, but I think we don't
> need that on ARM, and while PowerPC originally needed it, it is
> now implied by the spin_unlock(). If this is actually right, you might
> want to do the same here. Very few drivers actually use mmiowb(),
> but there might be more drivers that would need it if your
> spin_unlock() doesn't synchronize against writel() or writel_relaxed().
> Maybe the mmiowb() should really be implied by writel() but not
> writel_relaxed()? That might be sensible, but only if we do it
> the same way on powerpc, whose writel_relaxed() is currently no
> more relaxed than writel().

I like the idea of putting this in spin_unlock() better, particularly if
PowerPC does it that way. I was worried that, mmiowb() being such an
esoteric thing, it would end up wrong all over the place; this feels much
better.
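
For reference, my understanding is that PowerPC does this with a per-CPU
flag: the MMIO write accessors set it, and unlock only pays for the heavy
fence when an MMIO write is actually outstanding. A rough, hypothetical
sketch of how that might look here (mmio_pending is a made-up name, and
this ignores the rest of the accessor and spinlock plumbing):

static DEFINE_PER_CPU(int, mmio_pending);

static inline void writel(u32 v, volatile void __iomem *addr)
{
        __asm__ __volatile__ ("fence w,o" : : : "memory");
        writel_relaxed(v, addr);
        /* Note the outstanding MMIO write for the next unlock. */
        this_cpu_write(mmio_pending, 1);
}

static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
        /* Order any outstanding MMIO write before releasing the lock. */
        if (this_cpu_read(mmio_pending)) {
                __asm__ __volatile__ ("fence o,w" : : : "memory");
                this_cpu_write(mmio_pending, 0);
        }
        smp_store_release(&lock->lock, 0);
}

That way drivers that never call mmiowb() still get the ordering, and
code that doesn't touch MMIO doesn't pay for the extra fence.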