Re: [PATCH v2 2/3] LoongArch: Add barrier between set_pte and memory access
From: Huacai Chen
Date: Tue Oct 15 2024 - 08:28:59 EST
On Tue, Oct 15, 2024 at 10:54 AM maobibo <maobibo@xxxxxxxxxxx> wrote:
>
>
>
> On 2024/10/14 2:31 PM, Huacai Chen wrote:
> > Hi, Bibo,
> >
> > On Mon, Oct 14, 2024 at 11:59 AM Bibo Mao <maobibo@xxxxxxxxxxx> wrote:
> >>
> >> It is possible to return a spurious fault if memory is accessed
> >> right after the pte is set. For user address space, pte is set
> >> in kernel space and memory is accessed in user space, there is
> >> long time for synchronization, no barrier needed. However for
> >> kernel address space, it is possible that memory is accessed
> >> right after the pte is set.
> >>
> >> Here flush_cache_vmap/flush_cache_vmap_early is used for
> >> synchronization.
> >>
> >> Signed-off-by: Bibo Mao <maobibo@xxxxxxxxxxx>
> >> ---
> >> arch/loongarch/include/asm/cacheflush.h | 14 +++++++++++++-
> >> 1 file changed, 13 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/loongarch/include/asm/cacheflush.h b/arch/loongarch/include/asm/cacheflush.h
> >> index f8754d08a31a..53be231319ef 100644
> >> --- a/arch/loongarch/include/asm/cacheflush.h
> >> +++ b/arch/loongarch/include/asm/cacheflush.h
> >> @@ -42,12 +42,24 @@ void local_flush_icache_range(unsigned long start, unsigned long end);
> >> #define flush_cache_dup_mm(mm) do { } while (0)
> >> #define flush_cache_range(vma, start, end) do { } while (0)
> >> #define flush_cache_page(vma, vmaddr, pfn) do { } while (0)
> >> -#define flush_cache_vmap(start, end) do { } while (0)
> >> #define flush_cache_vunmap(start, end) do { } while (0)
> >> #define flush_icache_user_page(vma, page, addr, len) do { } while (0)
> >> #define flush_dcache_mmap_lock(mapping) do { } while (0)
> >> #define flush_dcache_mmap_unlock(mapping) do { } while (0)
> >>
> >> +/*
> >> + * It is possible for a kernel virtual mapping access to return a spurious
> >> + * fault if it's accessed right after the pte is set. The page fault handler
> >> + * does not expect this type of fault. flush_cache_vmap is not exactly the
> >> + * right place to put this, but it seems to work well enough.
> >> + */
> >> +static inline void flush_cache_vmap(unsigned long start, unsigned long end)
> >> +{
> >> + smp_mb();
> >> +}
> >> +#define flush_cache_vmap flush_cache_vmap
> >> +#define flush_cache_vmap_early flush_cache_vmap
> > From the history of flush_cache_vmap_early(), it seems only archs with
> > a "virtual cache" (VIVT or VIPT) need this API, so it can be a no-op
> > on LoongArch.
OK, flush_cache_vmap_early() also needs smp_mb().
>
> Here is how flush_cache_vmap_early is used in linux/mm/percpu.c: the
> page is mapped and then accessed immediately. Do you think it should be
> a no-op on LoongArch?
>
> rc = __pcpu_map_pages(unit_addr, &pages[unit * unit_pages],
> unit_pages);
> if (rc < 0)
> panic("failed to map percpu area, err=%d\n", rc);
> flush_cache_vmap_early(unit_addr, unit_addr + ai->unit_size);
> /* copy static data */
> memcpy((void *)unit_addr, __per_cpu_load, ai->static_size);
> }
>
>
> >
> > And I still think flush_cache_vunmap() should be a smp_mb(). A
> > smp_mb() in flush_cache_vmap() prevents subsequent accesses from being
> > reordered before pte_set(), and a smp_mb() in flush_cache_vunmap()
> smp_mb() in flush_cache_vmap() does not prevent reordering. It flushes the
> pipeline and lets the page table walker HW sync with the data cache.
>
> For the following example.
> rb = vmap(pages, nr_meta_pages + 2 * nr_data_pages,
> VM_MAP | VM_USERMAP, PAGE_KERNEL);
> if (rb) {
> <<<<<<<<<<< * the if (rb) statement can prevent reordering. Otherwise, with
> any API such as kmalloc/vmap/vmalloc followed by a memory access, there
> would be a reordering issue. *
> kmemleak_not_leak(pages);
> rb->pages = pages;
> rb->nr_pages = nr_pages;
> return rb;
> }
>
> > prevents preceding accesses from being reordered after pte_clear(). This
> Can you give an example of such a use of flush_cache_vunmap()? Then
> we can continue to discuss it; otherwise it is just guessing.
Since we cannot reach a consensus, and the flush_cache_* APIs look very
strange for this purpose (yes, I know PowerPC does it like this, but
ARM64 doesn't), I prefer to still use the ARM64 method, which means adding
a dbar in set_pte(). Of course the performance will be a little worse,
but still better than the old version, and it is more robust.
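
For illustration only, here is a minimal userspace sketch of the "barrier
inside set_pte()" idea. It is not the actual patch: pte_sketch_t,
set_pte_sketch() and pte_val_sketch() are made-up names, and
atomic_thread_fence() is only a stand-in for the dbar instruction. A
userspace fence also cannot model the hardware page table walker, so this
shows where the barrier would sit, not what it does on LoongArch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a page table entry; not the kernel's pte_t. */
typedef struct {
	_Atomic uint64_t val;
} pte_sketch_t;

/*
 * Write the new entry, then issue a full fence so that the store is
 * ordered before any later memory access made by this thread.
 * Assumption: in the kernel the fence would be a dbar placed in
 * set_pte(); atomic_thread_fence() is only a userspace analogue.
 */
static inline void set_pte_sketch(pte_sketch_t *ptep, uint64_t new_val)
{
	assert(ptep != NULL);
	atomic_store_explicit(&ptep->val, new_val, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Read back the current value of the entry. */
static inline uint64_t pte_val_sketch(pte_sketch_t *ptep)
{
	assert(ptep != NULL);
	return atomic_load_explicit(&ptep->val, memory_order_relaxed);
}
```

With the barrier in the setter, every caller that maps and then touches the
memory gets the ordering without relying on a flush_cache_* hook, at the
cost of a fence on every pte update.
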
I know you are very busy, so if you have no time you don't need to
send V3; I can just make a small modification to the 3rd patch.
Huacai
>
> Regards
> Bibo Mao
> > potential problem may not be seen in experiments, but the barrier is
> > needed in theory.
> >
> > Huacai
> >
> >> +
> >> #define cache_op(op, addr) \
> >> __asm__ __volatile__( \
> >> " cacop %0, %1 \n" \
> >> --
> >> 2.39.3
> >>
> >>
>
>