Re: [PATCH 4/7] RISC-V: arch/riscv/include
From: Palmer Dabbelt
Date: Tue Jun 06 2017 - 00:56:50 EST
On Thu, 01 Jun 2017 02:00:22 PDT (-0700), Arnd Bergmann wrote:
> On Thu, Jun 1, 2017 at 2:56 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>> On Tue, 23 May 2017 05:55:15 PDT (-0700), Arnd Bergmann wrote:
>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>
>
>>>> +#ifndef _ASM_RISCV_CACHE_H
>>>> +#define _ASM_RISCV_CACHE_H
>>>> +
>>>> +#define L1_CACHE_SHIFT 6
>>>> +
>>>> +#define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>>>
>>> Is this the only valid cache line size on riscv, or just the largest
>>> one that is allowed?
>>
>> The RISC-V ISA manual doesn't actually mention caches anywhere, so there's no
>> restriction on L1 cache line size (we tried to keep microarchitecture out of
>> the ISA specification). We provide the actual cache parameters as part of the
>> device tree, but it looks like this needs to be known statically in some places
>> so we can't use that everywhere.
>>
>> We could always make this a Kconfig parameter.
>
> The cache line size is used in a couple of places, let's go through the most
> common ones to see where that abstraction might be leaky and you actually
> get an architectural effect:
>
> - On SMP machines, ____cacheline_aligned_in_smp is used to annotate
> data structures used in lockless algorithms, typically with one CPU writing
> to some members of a structure, and another CPU reading from it but
> not writing the same members. Depending on the architecture, having a
> larger actual alignment than L1_CACHE_BYTES will either lead to
> bad performance from cache line ping pong, or actual data corruption.
On RISC-V it's just a performance problem, so at least it's not catastrophic.
> - On systems with DMA masters that are not fully coherent,
> ____cacheline_aligned is used to annotate data structures used
> for DMA buffers, to make sure that the cache maintenance operations
> in dma_sync_*_for_*() helpers don't corrupt data outside of the
> DMA buffer. You don't seem to support noncoherent DMA masters
> or the cache maintenance operations required to use those, so this
> might not be a problem until someone adds an extension for those.
>
> - Depending on the bus interconnect, a coherent DMA master might
> not be able to update partial cache lines, so you need the same
> annotation.
Well, our (SiFive's) bus is easy to master, so hopefully we won't end up doing
that. There is, of course, the rest of the world -- but that's a bridge
we'll have to cross later (if such an implementation arises).
> - The kmalloc() family of memory allocators aligns data to the cache
> line size, for both DMA and SMP synchronization above.
Ya, but luckily it's just a performance problem on RISC-V.
> - Many architectures have cache line prefetch, flush, zero or copy
> instructions that are used for important performance optimizations
> but that are typically defined on a cacheline granularity. I don't
> think you currently have any of them, but it seems likely that there
> will be demand for them later.
We actually have an implicit prefetch (loads to x0, the zero register), but
it still has all the load side-effects so nothing uses it.
> Having a larger than necessary alignment can waste substantial amounts
> of memory for arrays of cache line aligned structures (typically
> per-cpu arrays), but otherwise should not cause harm.
I bugged our L1 guy and he says 64-byte lines are a bit of a magic number
because of how they line up with DIMMs. Since there's no spec to define this,
there's no correct answer. I'd be amenable to making this a Kconfig option,
but I think we'll leave it alone for now. It does match the extant
implementations.
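For concreteness, a hypothetical Kconfig knob (not part of this patch set, just
a sketch of what "make it a Kconfig parameter" could look like, with asm/cache.h
deriving L1_CACHE_SHIFT from it):

config RISCV_L1_CACHE_SHIFT
	int "L1 cache line size as a power of two"
	default 6
	help
	  Sets L1_CACHE_BYTES to 2^N. 64-byte lines (N=6) match the
	  extant implementations.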
>>>> diff --git a/arch/riscv/include/asm/io.h b/arch/riscv/include/asm/io.h
>>>> new file mode 100644
>>>> index 000000000000..d942555a7a08
>>>> --- /dev/null
>>>> +++ b/arch/riscv/include/asm/io.h
>>>> @@ -0,0 +1,36 @@
>>>
>>>> +#ifndef _ASM_RISCV_IO_H
>>>> +#define _ASM_RISCV_IO_H
>>>> +
>>>> +#include <asm-generic/io.h>
>>>
>>> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
>>> helpers using inline assembly, to prevent the compiler for breaking
>>> up accesses into byte accesses.
>>>
>>> Also, most architectures require to some synchronization after a
>>> non-relaxed readl() to prevent prefetching of DMA buffers, and
>>> before a writel() to flush write buffers when a DMA gets triggered.
>>
>> Makes sense. These were all OK on existing implementations (as there's no
>> writable PMAs, so all MMIO regions are strictly ordered), but that's not
>> actually what the RISC-V ISA says. I patterned this on arm64
>>
>> https://github.com/riscv/riscv-linux/commit/e200fa29a69451ef4d575076e4d2af6b7877b1fa
>>
>> where I think the only odd thing is our definition of mmiowb
>>
>> +/* IO barriers. These only fence on the IO bits because they're only required
>> + * to order device access. We're defining mmiowb because our AMO instructions
>> + * (which are used to implement locks) don't specify ordering. From Chapter 7
>> + * of v2.2 of the user ISA:
>> + * "The bits order accesses to one of the two address domains, memory or I/O,
>> + * depending on which address domain the atomic instruction is accessing. No
>> + * ordering constraint is implied to accesses to the other domain, and a FENCE
>> + * instruction should be used to order across both domains."
>> + */
>> +
>> +#define __iormb() __asm__ __volatile__ ("fence i,io" : : : "memory");
>> +#define __iowmb() __asm__ __volatile__ ("fence io,o" : : : "memory");
>
> Looks ok, yes.
>
>> +#define mmiowb() __asm__ __volatile__ ("fence io,io" : : : "memory");
>>
>> which I think is correct.
>
> I can never remember what exactly this one does.
I can't find the reference again, but what I found said that if your atomics
(or whatever's used for locking) don't stay ordered with your MMIO accesses,
then you should define mmiowb to ensure ordering. I managed to screw this up,
as there's no "w" in the successor set (to actually enforce the AMO ordering).
This is somewhat confirmed by
https://lkml.org/lkml/2006/8/31/174
Subject: Re: When to use mmiowb()?
AFAICT, they're both right. Generally, mmiowb() should be used prior to
unlock in a critical section whose last PIO operation is a writeX.
Thus, I think the actual fence should be at least
fence o,w
Documentation/memory-barriers.txt says
"
The Linux kernel also has a special barrier for use with memory-mapped I/O
writes:
mmiowb();
This is a variation on the mandatory write barrier that causes writes to weakly
ordered I/O regions to be partially ordered. Its effects may go beyond the
CPU->Hardware interface and actually affect the hardware at some level.
See the subsection "Acquires vs I/O accesses" for more information.
"
"
ACQUIRES VS I/O ACCESSES
------------------------
Under certain circumstances (especially involving NUMA), I/O accesses within
two spinlocked sections on two different CPUs may be seen as interleaved by the
PCI bridge, because the PCI bridge does not necessarily participate in the
cache-coherence protocol, and is therefore incapable of issuing the required
read memory barriers.
For example:
	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	writel(1, DATA);
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					writel(5, DATA);
					spin_unlock(Q);
may be seen by the PCI bridge as follows:
STORE *ADDR = 0, STORE *ADDR = 4, STORE *DATA = 1, STORE *DATA = 5
which would probably cause the hardware to malfunction.
What is necessary here is to intervene with an mmiowb() before dropping the
spinlock, for example:
	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	writel(1, DATA);
	mmiowb();
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					writel(5, DATA);
					mmiowb();
					spin_unlock(Q);
this will ensure that the two stores issued on CPU 1 appear at the PCI bridge
before either of the stores issued on CPU 2.
Furthermore, following a store by a load from the same device obviates the need
for the mmiowb(), because the load forces the store to complete before the load
is performed:
	CPU 1				CPU 2
	===============================	===============================
	spin_lock(Q)
	writel(0, ADDR)
	a = readl(DATA);
	spin_unlock(Q);
					spin_lock(Q);
					writel(4, ADDR);
					b = readl(DATA);
					spin_unlock(Q);
See Documentation/driver-api/device-io.rst for more information.
"
which matches what's above. I think "fence o,w" is sufficient for a mmiowb on
RISC-V. I'll make the change.
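Concretely, the change would be something like this (a sketch in the same style
as the other barrier definitions above, not the final patch):

#define mmiowb() __asm__ __volatile__ ("fence o,w" : : : "memory")

i.e. order all prior MMIO writes before any subsequent memory writes, so the
store that releases the spinlock can't be observed before the PIO writes it
protects.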
>>>> +#ifdef __KERNEL__
>>>> +
>>>> +#ifdef CONFIG_MMU
>>>> +
>>>> +extern void __iomem *ioremap(phys_addr_t offset, unsigned long size);
>>>> +
>>>> +#define ioremap_nocache(addr, size) ioremap((addr), (size))
>>>> +#define ioremap_wc(addr, size) ioremap((addr), (size))
>>>> +#define ioremap_wt(addr, size) ioremap((addr), (size))
>>>
>>> Is this a hard architecture limitation? Normally you really want
>>> write-combined access on frame buffer memory and a few other
>>> cases for performance reasons, and ioremap_wc() gets used
>>> for by memremap() for addressing RAM in some cases, and you
>>> normally don't want to have PTEs for the same memory using
>>> cached and uncached page flags
>>
>> This is currently an architecture limitation. In RISC-V these properties are
>> known as PMAs (Physical Memory Attributes). While the supervisor spec mentions
>> PMAs, it doesn't provide a mechanism to read or write them so they are
>> essentially unspecified. PMAs will be properly defined as part of the platform
>> specification, which isn't written yet.
>
> Ok. Maybe add that as a comment above these definitions then.
OK.
>>>> +/*
>>>> + * low level task data that entry.S needs immediate access to
>>>> + * - this struct should fit entirely inside of one cache line
>>>> + * - this struct resides at the bottom of the supervisor stack
>>>> + * - if the members of this struct changes, the assembly constants
>>>> + * in asm-offsets.c must be updated accordingly
>>>> + */
>>>> +struct thread_info {
>>>> + struct task_struct *task; /* main task structure */
>>>> + unsigned long flags; /* low level flags */
>>>> + __u32 cpu; /* current CPU */
>>>> + int preempt_count; /* 0 => preemptable, <0 => BUG */
>>>> + mm_segment_t addr_limit;
>>>> +};
>>>
>>> Please see 15f4eae70d36 ("x86: Move thread_info into task_struct")
>>> and try to do the same.
>>
>> OK, here's my attempt
>>
>> https://github.com/riscv/riscv-linux/commit/c618553e7aa65c85564a5d0a868ec7e6cf634afd
>>
>> Since there's some actual meat, I left a commit message (these are more just
>> notes for me for my v2, I'll be squashing everything)
>>
>> "
>> This is patterned more off the arm64 move than the x86 one, since we
>> still need to have at least addr_limit to emulate FS.
>>
>> The patch itself changes sscratch from holding SP to holding TP, which
>> contains a pointer to task_struct. thread_info must be at a 0 offset
>> from task_struct, but it looks like that's already enforced with a big
>> comment. We now store both the user and kernel SP in task_struct, but
>> those are really acting more as extra scratch space than permanent
>> storage.
>> "
>
> I haven't looked at all the details of the x86 patch, but it seems they
> decided to put the arch specific members into 'struct thread_struct'
> rather than 'struct thread_info', so I'd suggest you do the same here for
> consistency, unless there is a strong reason against doing it.
We actually can't put them in "struct thread_struct" in a sane manner. On
context switches on RISC-V no registers are saved; instead we use the
instruction
csrrw REG, sscratch, REG
which swaps some register (used to be sp, now tp) with the "sscratch" CSR (a
register only visible to the supervisor). At this point we only have one
register to work with in order to save the user state. Since "struct
thread_info" is 0-offset from "struct task_struct", we're guaranteed to be able
to access it using our one addressing mode (a 12-bit signed constant offset
from a register).
Unfortunately, "struct thread_struct" is at a potentially large offset from the
saved TP. We could do something silly like
addi tp, tp, OFFSET_1
addi tp, tp, OFFSET_2
addi tp, tp, OFFSET_3
but for an "allyesconfig" we end up with offsets of about 9KiB, which would
require 4 additional instructions to find "struct thread_struct". While this
is possible, I'd prefer to avoid the extra cycles. We usually handle long
immediates with a two-instruction sequence
li REG, IMM31-12 (loads the high bits of REG, zeroing the low bits)
addi REG, REG, IMM11-0 (loads the low bits of REG, leaving the high bits alone)
but there's no scratch register here so we can't use that. We don't have a
long-immediate-add instruction.
The arguments in x86 land for moving everything to "struct thread_struct" were
that they always know it's in "struct task_struct", but since that's all we're
supporting on RISC-V I don't think it counts. There were also discussions of
eliminating "struct thread_info", but unless there's something at the start of
"struct task_struct" then RISC-V will have to pay a bunch of cycles on a
context switch.
>>>> +#else /* !CONFIG_MMU */
>>>> +
>>>> +static inline void flush_tlb_all(void)
>>>> +{
>>>> + BUG();
>>>> +}
>>>> +
>>>> +static inline void flush_tlb_mm(struct mm_struct *mm)
>>>> +{
>>>> + BUG();
>>>> +}
>>>
>>> The NOMMU support is rather incomplete and CONFIG_MMU is
>>> hard-enabled, so I'd just drop any !CONFIG_MMU #ifdefs.
>>
>> OK. I've left in the "#ifdef CONFIG_MMU" blocks as the #ifdef/#endif doesn't
>> really add any code, but I can go ahead and drop the #ifdef if you think that's
>> better.
>>
>> https://github.com/riscv/riscv-linux/commit/e98ca23adfb9422bebc87cbfb58f70d4a63cf067
>
> Ok.
>
>>>> +#include <asm-generic/unistd.h>
>>>> +
>>>> +#define __NR_sysriscv __NR_arch_specific_syscall
>>>> +#ifndef __riscv_atomic
>>>> +__SYSCALL(__NR_sysriscv, sys_sysriscv)
>>>> +#endif
>>>
>>> Please make this a straight cmpxchg syscall and remove the multiplexer.
>>> Why does the definition depend on __riscv_atomic rather than the
>>> Kconfig symbol?
>>
>> I think that was just an oversight: that's not the right switch. Either you or
>> someone else pointed out some problems with this. There's going to be an
>> interposer in the VDSO, and then we'll always enable the system call.
>>
>> I can change this to two system calls: sysriscv_cmpxchg32 and
>> sysriscv_cmpxchg64.
>
> Sounds good.
>
> Arnd