Re: [PATCH 4/7] RISC-V: arch/riscv/include

From: Arnd Bergmann
Date: Thu Jun 01 2017 - 05:00:49 EST


On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:


>>> +#ifndef _ASM_RISCV_CACHE_H
>>> +#define _ASM_RISCV_CACHE_H
>>> +
>>> +#define L1_CACHE_SHIFT 6
>>> +
>>> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>>
>> Is this the only valid cache line size on riscv, or just the largest
>> one that is allowed?
>
> The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no
> restriction on L1 cache line size (we tried to keep microarchitecture out of
> the ISA specification). We provide the actual cache parameters as part of the
> device tree, but it looks like this needs to be known statically in some places
> so we can't use that everywhere.
>
> We could always make this a Kconfig parameter.

The cache line size is used in a couple of places. Let's go through the most
common ones to see where that abstraction might be leaky and you actually
get an architectural effect:

- On SMP machines, ____cacheline_aligned_in_smp is used to annotate
data structures used in lockless algorithms, typically with one CPU writing
to some members of a structure, and another CPU reading from it but
not writing the same members. Depending on the architecture, having an
actual cache line size larger than L1_CACHE_BYTES will either lead to
bad performance from cache line ping pong, or actual data corruption.

- On systems with DMA masters that are not fully coherent,
____cacheline_aligned is used to annotate data structures used
for DMA buffers, to make sure that the cache maintenance operations
in dma_sync_*_for_*() helpers don't corrupt data outside of the
DMA buffer. You don't seem to support noncoherent DMA masters
or the cache maintenance operations required to use those, so this
might not be a problem until someone adds an extension for those.

- Depending on the bus interconnect, a coherent DMA master might
not be able to update partial cache lines, so you need the same
annotation.

- The kmalloc() family of memory allocators aligns data to the cache
line size, for both the DMA and SMP synchronization cases above.

- Many architectures have cache line prefetch, flush, zero or copy
instructions that are used for important performance optimizations
but that are typically defined on a cacheline granularity. I don't
think you currently have any of them, but it seems likely that there
will be demand for them later.

Having a larger than necessary alignment can waste substantial amounts
of memory for arrays of cache line aligned structures (typically
per-cpu arrays), but otherwise should not cause harm.
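
To put a rough number on that memory cost, here is a minimal user-space
sketch. It is not kernel code: the SKETCH_* names and the 8-entry array
are made up for illustration, C11 alignas stands in for
____cacheline_aligned, and only the shift of 6 is taken from the patch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors L1_CACHE_SHIFT from the patch; everything else is hypothetical. */
#define SKETCH_L1_CACHE_SHIFT 6
#define SKETCH_L1_CACHE_BYTES (1 << SKETCH_L1_CACHE_SHIFT)
#define SKETCH_NR_CPUS 8 /* assumed CPU count, for illustration only */

/* Eight bytes of payload, packed back to back in an array. */
struct sketch_counter_packed {
	uint64_t hits;
};

/* Same payload, but each element starts on its own cache line. */
struct sketch_counter_aligned {
	alignas(SKETCH_L1_CACHE_BYTES) uint64_t hits;
};

/* The alignment pads every element up to a full cache line. */
static_assert(sizeof(struct sketch_counter_aligned) == SKETCH_L1_CACHE_BYTES,
	      "aligned element should occupy exactly one cache line");

struct sketch_counter_packed sketch_packed[SKETCH_NR_CPUS];
struct sketch_counter_aligned sketch_aligned[SKETCH_NR_CPUS];

/* Bytes spent purely on padding when the array is cache line aligned. */
size_t sketch_wasted_bytes(void)
{
	return sizeof(sketch_aligned) - sizeof(sketch_packed);
}
```

With these assumed numbers the padding is seven eighths of the aligned
array, which is the price of keeping each CPU's counter on a separate
line. Picking a larger L1_CACHE_BYTES than the hardware needs scales that
padding up without any further benefit.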

>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>> new file mode 100644
>>> index 000000000000..d942555a7a08
>>> --- /dev/null
>>> +++ b/arch/riscv/include/asm/io.h
>>> @@ -0,0 +1,36 @@
>>
>>> +#ifndef _ASM_RISCV_IO_H
>>> +#define _ASM_RISCV_IO_H
>>> +
>>> +#include <asm-generic/io.h>
>>
>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>> helpers using inline assembly, to prevent the compiler from breaking
>> up accesses into byte accesses.
>>
>> Also, most architectures require some synchronization after a
>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>> before a writel() to flush write buffers when a DMA gets triggered.
>
> Makes sense. These were all OK on existing implementations (as there's no
> writable PMAs, so all MMIO regions are strictly ordered), but that's not
> actually what the RISC-V ISA says. I patterned this on arm64
>
> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>
> where I think the only odd thing is our definition of mmiowb
>
> +/* IO barriers. These only fence on the IO bits because they're only required
> + * to order device access. We're defining mmiowb because our AMO instructions
> + * (which are used to implement locks) don't specify ordering. From Chapter 7
> + * of v2.2 of the user ISA:
> + * "The bits order accesses to one of the two address domains, memory or I/O,
> + * depending on which address domain the atomic instruction is accessing. No
> + * ordering constraint is implied to accesses to the other domain, and a FENCE
> + * instruction should be used to order across both domains."
> + */
> +
> +#define __iormb() __asm__ __volatile__ ("fence i,io" : : : "memory");
> +#define __iowmb() __asm__ __volatile__ ("fence io,o" : : : "memory");

Looks ok, yes.

> +#define mmiowb() __asm__ __volatile__ ("fence io,io" : : : "memory");
>
> which I think is correct.

I can never remember what exactly this one does.
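
For reference, a rough sketch of how the relaxed accessors and the two
fences quoted above could be combined into ordered readl()/writel()
helpers. The sketch_* names are hypothetical and this is not the code in
the linked commit. The non-RISC-V branch is only an assumption so that
the sketch builds on other hosts: it is a compiler barrier plus a
volatile access, and gives no hardware ordering.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__riscv)
/* Fence operands copied from the quoted patch. */
#define sketch_iormb() __asm__ __volatile__ ("fence i,io" : : : "memory")
#define sketch_iowmb() __asm__ __volatile__ ("fence io,o" : : : "memory")
#else
/* Assumption: compiler-only barrier as a stand-in on other hosts. */
#define sketch_iormb() __asm__ __volatile__ ("" : : : "memory")
#define sketch_iowmb() __asm__ __volatile__ ("" : : : "memory")
#endif

/* A single 32-bit load that the compiler cannot split into byte loads. */
static inline uint32_t sketch_readl_relaxed(const volatile void *addr)
{
	uint32_t val;
#if defined(__riscv)
	__asm__ __volatile__ ("lw %0, 0(%1)" : "=r" (val) : "r" (addr));
#else
	val = *(const volatile uint32_t *)addr;
#endif
	return val;
}

/* A single 32-bit store that the compiler cannot split into byte stores. */
static inline void sketch_writel_relaxed(uint32_t val, volatile void *addr)
{
#if defined(__riscv)
	__asm__ __volatile__ ("sw %0, 0(%1)" : : "r" (val), "r" (addr));
#else
	*(volatile uint32_t *)addr = val;
#endif
}

/* Barrier after the load, so later accesses to a DMA buffer stay behind it. */
static inline uint32_t sketch_readl(const volatile void *addr)
{
	uint32_t val = sketch_readl_relaxed(addr);

	sketch_iormb();
	return val;
}

/* Barrier before the store, so earlier writes are visible when DMA starts. */
static inline void sketch_writel(uint32_t val, volatile void *addr)
{
	sketch_iowmb();
	sketch_writel_relaxed(val, addr);
}
```

The placement follows the usual pattern: the barrier goes after the load
in readl() and before the store in writel(), while the _relaxed variants
carry no barrier at all.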

>>> +#ifdef __KERNEL__
>>> +
>>> +#ifdef CONFIG_MMU
>>> +
>>> +extern void __iomem *ioremap(phys_addr_t offset, unsigned long size);
>>> +
>>> +#define ioremap_nocache(addr, size) ioremap((addr), (size))
>>> +#define ioremap_wc(addr, size) ioremap((addr), (size))
>>> +#define ioremap_wt(addr, size) ioremap((addr), (size))
>>
>> Is this a hard architecture limitation? Normally you really want
>> write-combined access on frame buffer memory and a few other
>> cases for performance reasons, and ioremap_wc() gets used
>> by memremap() for addressing RAM in some cases, and you
>> normally don't want to have PTEs for the same memory using
>> cached and uncached page flags.
>
> This is currently an architecture limitation. In RISC-V these properties are
> known as PMAs (Physical Memory Attributes). While the supervisor spec mentions
> PMAs, it doesn't provide a mechanism to read or write them so they are
> essentially unspecified. PMAs will be properly defined as part of the platform
> specification, which isn't written yet.

Ok. Maybe add that as a comment above these definitions then.

>>> +/*
>>> + * low level task data that entry.S needs immediate access to
>>> + * - this struct should fit entirely inside of one cache line
>>> + * - this struct resides at the bottom of the supervisor stack
>>> + * - if the members of this struct change, the assembly constants
>>> + * in asm-offsets.c must be updated accordingly
>>> + */
>>> +struct thread_info {
>>> + struct task_struct *task; /* main task structure */
>>> + unsigned long flags; /* low level flags */
>>> + __u32 cpu; /* current CPU */
>>> + int preempt_count; /* 0 => preemptable, <0 => BUG */
>>> + mm_segment_t addr_limit;
>>> +};
>>
>> Please see 15f4eae70d36 ("x86: Move thread_info into task_struct")
>> and try to do the same.
>
> OK, here's my attempt
>
> https://github.com/riscv/riscv-linux/commit/c618553e7aa65c85564a5d0a868ec7e6cf634afd
>
> Since there's some actual meat, I left a commit message (these are more just
> notes for me for my v2, I'll be squashing everything)
>
> "
> This is patterned more off the arm64 move than the x86 one, since we
> still need to have at least addr_limit to emulate FS.
>
> The patch itself changes sscratch from holding SP to holding TP, which
> contains a pointer to task_struct. thread_info must be at a 0 offset
> from task_struct, but it looks like that's already enforced with a big
> comment. We now store both the user and kernel SP in task_struct, but
> those are really acting more as extra scratch space than permanent
> storage.
> "

I haven't looked at all the details of the x86 patch, but it seems they
decided to put the arch specific members into 'struct thread_struct'
rather than 'struct thread_info', so I'd suggest you do the same here for
consistency, unless there is a strong reason against doing it.
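
A rough sketch of the layout I have in mind, using heavily reduced
stand-in structures. All sketch_* names and the member selection are
hypothetical; the real task_struct and thread_struct contain far more,
and which fields end up where is for the actual patch to decide.

```c
#include <assert.h>
#include <stddef.h>

/* Generic part only: flags and preempt count, no arch-specific state. */
struct sketch_thread_info {
	unsigned long flags;
	int preempt_count;
};

/* Arch-specific state, including the SP scratch slots and addr_limit. */
struct sketch_thread_struct {
	unsigned long kernel_sp;
	unsigned long user_sp;
	unsigned long addr_limit;
};

struct sketch_task_struct {
	struct sketch_thread_info thread_info;	/* must stay the first member */
	int pid;
	struct sketch_thread_struct thread;
};

/* Entry code that holds a task pointer relies on this offset being zero. */
static_assert(offsetof(struct sketch_task_struct, thread_info) == 0,
	      "thread_info must be the first member of task_struct");

/* With a zero offset, the task pointer doubles as the thread_info pointer. */
struct sketch_thread_info *
sketch_task_thread_info(struct sketch_task_struct *task)
{
	return (struct sketch_thread_info *)task;
}
```

The static_assert turns the "big comment" into a build failure if someone
moves the member, and keeping the arch-specific fields in the
thread_struct stand-in leaves thread_info identical across architectures.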

>>> +#else /* !CONFIG_MMU */
>>> +
>>> +static inline void flush_tlb_all(void)
>>> +{
>>> + BUG();
>>> +}
>>> +
>>> +static inline void flush_tlb_mm(struct mm_struct *mm)
>>> +{
>>> + BUG();
>>> +}
>>
>> The NOMMU support is rather incomplete and CONFIG_MMU is
>> hard-enabled, so I'd just drop any !CONFIG_MMU #ifdefs.
>
> OK. I've left in the "#ifdef CONFIG_MMU" blocks as the #ifdef/#endif doesn't
> really add any code, but I can go ahead and drop the #ifdef if you think that's
> better.
>
> https://github.com/riscv/riscv-linux/commit/e98ca23adfb9422bebc87cbfb58f70d4a63cf067

Ok.

>>> +#include <asm-generic/unistd.h>
>>> +
>>> +#define __NR_sysriscv __NR_arch_specific_syscall
>>> +#ifndef __riscv_atomic
>>> +__SYSCALL(__NR_sysriscv, sys_sysriscv)
>>> +#endif
>>
>> Please make this a straight cmpxchg syscall and remove the multiplexer.
>> Why does the definition depend on __riscv_atomic rather than the
>> Kconfig symbol?
>
> I think that was just an oversight: that's not the right switch. Either you or
> someone else pointed out some problems with this. There's going to be an
> interposer in the VDSO, and then we'll always enable the system call.
>
> I can change this to two system calls: sysriscv_cmpxchg32 and
> sysriscv_cmpxchg64.

Sounds good.

Arnd