Re: [PATCH 4/7] RISC-V: arch/riscv/include

From: Arnd Bergmann
Date: Tue Jun 06 2017 - 04:54:35 EST


On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
> On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
>> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>> - Many architectures have cache line prefetch, flush, zero or copy
>> instructions that are used for important performance optimizations
>> but that are typically defined on a cacheline granularity. I don't
>> think you currently have any of them, but it seems likely that there
>> will be demand for them later.
>
> We actually have an implicit prefetch (loads to x0, the zero register), but
> it still has all the load side-effects so nothing uses it.
>
>> Having a larger than necessary alignment can waste substantial amounts
>> of memory for arrays of cache line aligned structures (typically
>> per-cpu arrays), but otherwise should not cause harm.
>
> I bugged our L1 guy and he says 64-byte lines are a bit of a magic number
> because of how they line up with DIMMs. Since there's no spec to define this,
> there's no correct answer. I'd be amenable to making this a Kconfig option,
> but I think we'll leave it alone for now. It does match the extant
> implementations.

Hmm, this sounds like a hole in the architecture definition: if you have an
instruction that performs a prefetch (even one that is not easily usable),
I would argue that the cache line size has become a feature of the
architecture and is no longer strictly an implementation detail of the
microarchitecture.

Regarding the memory interface, a lot of systems use two DIMMs on
each memory channel for 128-bit parallel buses, and with LP-DDRx
controllers, you might have a width as small as 16 bits.

>>>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>>>> new file mode 100644
>>>>> index 000000000000..d942555a7a08
>>>>> --- /dev/null
>>>>> +++ b/arch/riscv/include/asm/io.h
>>>>> @@ -0,0 +1,36 @@
>>>>
>>>>> +#ifndef _ASM_RISCV_IO_H
>>>>> +#define _ASM_RISCV_IO_H
>>>>> +
>>>>> +#include <asm-generic/io.h>
>>>>
>>>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>>>> helpers using inline assembly, to prevent the compiler from breaking
>>>> up accesses into byte accesses.
>>>>
>>>> Also, most architectures require to some synchronization after a
>>>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>>>> before a writel() to flush write buffers when a DMA gets triggered.
>>>
>>> Makes sense. These were all OK on existing implementations (as there's no
>>> writable PMAs, so all MMIO regions are strictly ordered), but that's not
>>> actually what the RISC-V ISA says. I patterned this on arm64
>>>
>>> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>>
>>> where I think the only odd thing is our definition of mmiowb
>>>
>>> +/* IO barriers. These only fence on the IO bits because they're only required
>>> + * to order device access. We're defining mmiowb because our AMO instructions
>>> + * (which are used to implement locks) don't specify ordering. From Chapter 7
>>> + * of v2.2 of the user ISA:
>>> + * "The bits order accesses to one of the two address domains, memory or I/O,
>>> + * depending on which address domain the atomic instruction is accessing. No
>>> + * ordering constraint is implied to accesses to the other domain, and a FENCE
>>> + * instruction should be used to order across both domains."
>>> + */
>>> +
>>> +#define __iormb() __asm__ __volatile__ ("fence i,io" : : : "memory");
>>> +#define __iowmb() __asm__ __volatile__ ("fence io,o" : : : "memory");
>>
>> Looks ok, yes.
>>
>>> +#define mmiowb() __asm__ __volatile__ ("fence io,io" : : : "memory");
>>>
>>> which I think is correct.
>>
>> I can never remember what exactly this one does.
>
> I can't find the reference again, but what I found said that if your atomics
> (or whatever's used for locking) don't stay ordered with your MMIO accesses,
> then you should define mmiowb to ensure ordering. I managed to screw this up,
> as there's no "w" in the successor set (to actually enforce the AMO ordering).
> This is somewhat confirmed by
>
> https://lkml.org/lkml/2006/8/31/174
> Subject: Re: When to use mmiowb()?
> AFAICT, they're both right. Generally, mmiowb() should be used prior to
> unlock in a critical section whose last PIO operation is a writeX.
>
> Thus, I think the actual fence should be at least
>
> fence o,w
...
> which matches what's above. I think "fence o,w" is sufficient for a mmiowb on
> RISC-V. I'll make the change.

This sounds reasonable according to the documentation, but with your
longer explanation of the barriers, I think the __iormb/__iowmb definitions
above are wrong. What you actually need I think is

void writel(u32 v, volatile void __iomem *addr)
{
	asm volatile("fence w,o" : : : "memory");
	writel_relaxed(v, addr);
}

u32 readl(volatile void __iomem *addr)
{
	u32 ret = readl_relaxed(addr);
	asm volatile("fence i,r" : : : "memory");
	return ret;
}

to synchronize between DMA and I/O. The barriers you listed above,
in contrast, appear to be directed at synchronizing I/O with other I/O.
We normally assume that this is not required for subsequent MMIO
accesses to the same device (on PCI) or the same address region
(per the ARM architecture and others). If you do need to enforce
ordering between MMIO accesses, you might even need to add those
barriers in the relaxed versions to stay portable with drivers
written for ARM SoCs:

void writel_relaxed(u32 v, volatile void __iomem *addr)
{
	__raw_writel((__force u32)cpu_to_le32(v), addr);
	asm volatile("fence o,io" : : : "memory");
}

u32 readl_relaxed(volatile void __iomem *addr)
{
	asm volatile("fence i,io" : : : "memory");
	return le32_to_cpu((__force __le32)__raw_readl(addr));
}

You then end up with a barrier before and after each regular
readl/writel in order to synchronize both with DMA and MMIO
instructions, and you still need the extra mmiowb() to
synchronize against the spinlock.

My memory on mmiowb() is still a bit cloudy, but I think we don't
need it on ARM, and while PowerPC originally needed it, it is
now implied by spin_unlock(). If that is actually right, you might
want to do the same here. Very few drivers actually use mmiowb(),
but there might be more drivers that would need it if your
spin_unlock() doesn't synchronize against writel() or writel_relaxed().
Maybe the mmiowb() should really be implied by writel() but not
writel_relaxed()? That might be sensible, but only if we do it
the same way on powerpc, which currently doesn't have a
writel_relaxed() any more relaxed than writel().

> The arguments in x86 land for moving everything to "struct thread_struct" were
> that they always know it's in "struct task_struct", but since that's all we're
> supporting on RISC-V I don't think it counts. There were also discussions of
> eliminating "struct thread_info", but unless there's something at the start of
> "struct task_struct" then RISC-V will have to pay a bunch of cycles on a
> context switch.

Ok. Since you have THREAD_INFO_IN_TASK, it probably doesn't matter
too much then. It's a bit inconsistent with the other architectures, but
the effect is the same, so just do it the way that is more efficient for you.

Arnd