Re: [PATCH] riscv: Add support for BATCHED_UNMAP_TLB_FLUSH

From: Jisheng Zhang
Date: Sat Jan 06 2024 - 09:18:12 EST

Next message: Andy Shevchenko: "Re: [PATCH] list: Add hlist_count_nodes()"
Previous message: Dipam Turkar: "[PATCH 1/1] Refactor radeon driver to use drm_gem_create_map_offset() instead of its custom implementation for associating GEM object with a fake offset. Since, we already have a generic implementation, we don't need the custom function and it is better to standardize the code."
In reply to: Jisheng Zhang: "Re: [PATCH] riscv: Add support for BATCHED_UNMAP_TLB_FLUSH"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Jan 06, 2024 at 09:47:04PM +0800, Jisheng Zhang wrote:
> On Fri, Jan 05, 2024 at 02:36:44PM +0100, Alexandre Ghiti wrote:
> > On Thu, Jan 4, 2024 at 6:42 PM Alexandre Ghiti <alexghiti@xxxxxxxxxxxx> wrote:
> > >
> > > Hi Jisheng,
> > >
> > > On Wed, Jan 3, 2024 at 12:10 PM Jisheng Zhang <jszhang@xxxxxxxxxx> wrote:
> > > >
> > > > On Tue, Jan 02, 2024 at 03:18:51PM +0100, Alexandre Ghiti wrote:
> > > > > Allow to defer the flushing of the TLB when unmapping pges, which allows
> > > > > to reduce the numbers of IPI and the number of sfence.vma.
> > > > >
> > > > > The ubenchmarch used in commit 43b3dfdd0455 ("arm64: support
> > > > > batched/deferred tlb shootdown during page reclamation/migration") shows
> > > > > good performance improvement and perf reports an important decrease in
> > > > > time spent flushing the tlb (results come from qemu):
> > > >
> > > > Hi Alex,
> > > >
> > > > I tried this micro benchmark with your patch on T-HEAD TH1520 platform, I
> > > > didn't see any performance improvement for the micro benchmark. Per
> > > > myunderstanding, the micro benchmark is special case for arm64 because
> > > > in a normal tlb flush flow, below sequence is necessary:
> > > >
> > > > tlbi
> > > > dsb
> > > >
> > > >
> > > > while with BATCHED_UNMAP_TLB_FLUSH, the arm64 just does 'tlbi', leaving
> > > > the dsb to the arch_tlbbatch_flush(). So the final result is
> > > >
> > > > several 'tlbi + dsb' sequence VS. several 'tlbi' instructions + only one dsb
> > > > The performance improvement comes from the unnecessary dsb eliminations.
> > >
> > > Some batching should take place, and with this patch, we only send one
> > > "full" sfence.vma instead of a "local" sfence.vma for each page, it
> > > seems weird that you don't see any improvement, I would have thought
> > > that one "full" sfence.vma would be better.
> > >
> > > >
> > > > Do you have suitable benchmark(s) for BATCHED_UNMAP_TLB_FLUSH on riscv?
> > >
> > > Can you give the following benchmark a try? I simply created threads
> > > and dispatched them on all the cpus to force IPI usage, that should be
> > > way better if the batching of the first ubenchmark is not enough to
> > > exacerbate performance improvements, let me know and thanks for your
> > > tests!
> > >
> > > #define _GNU_SOURCE
> > > #include <pthread.h>
> > > #include <sys/types.h>
> > > #include <unistd.h>
> > > #include <sys/mman.h>
> > > #include <string.h>
> > > #include <errno.h>
> > > #include <sched.h>
> > > #include <time.h>
> > >
> > > int stick_this_thread_to_core(int core_id) {
> > > int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
> > > if (core_id < 0 || core_id >= num_cores)
> > > return EINVAL;
> > >
> > > cpu_set_t cpuset;
> > > CPU_ZERO(&cpuset);
> > > CPU_SET(core_id, &cpuset);
> > >
> > > pthread_t current_thread = pthread_self();
> > > return pthread_setaffinity_np(current_thread,
> > > sizeof(cpu_set_t), &cpuset);
> > > }
> > >
> > > static void *fn_thread (void *p_data)
> > > {
> > > int ret;
> > > pthread_t thread;
> > >
> > > stick_this_thread_to_core((int)p_data);
> > >
> > > while (1) {
> > > sleep(1);
> > > }
> > >
> > > return NULL;
> > > }
> > >
> > > int main()
> > > {
> > > #define SIZE (1 * 1024 * 1024)
> > > volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> > > MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> > > pthread_t threads[4];
> > > int ret;
> > >
> > > for (int i = 0; i < 4; ++i) {
> > > ret = pthread_create(&threads[i], NULL, fn_thread, (void *)i);
> > > if (ret)
> > > {
> > > printf("%s", strerror (ret));
> > > }
> > > }
> > >
> > > memset(p, 0x88, SIZE);
> > >
> > > for (int k = 0; k < 500 /* 10000 */; k++) {
> > > /* swap in */
> > > for (int i = 0; i < SIZE; i += 4096) {
> > > (void)p[i];
> > > }
> > >
> > > /* swap out */
> > > madvise(p, SIZE, MADV_PAGEOUT);
> > > }
> > >
> > > for (int i = 0; i < 4; i++)
> > > {
> > > pthread_cancel(threads[i]);
> > > }
> > >
> > > for (int i = 0; i < 4; i++)
> > > {
> > > pthread_join(threads[i], NULL);
> > > }
> > >
> > > return 0;
> > >
> > > }
> > >
> >
> > So I removed the dust from my unmatched and ran the benchmarks I proposed:
> >
> > Without this patch:
> > * benchmark from commit 43b3dfdd0455 (4 runs) : ~20.3s
> > * same benchmark with threads (4 runs) : ~27.4s
> >
> > With this patch:
> > * benchmark from commit 43b3dfdd0455 (4 runs) : ~17.9s
> > * same benchmark with threads (4 runs) : ~18.1s
> >
> > So a small improvement for the single thread benchmark, but it depends
> > on the number of pages that get flushed, so to me that's not
> > applicable for the general case. For the same benchmark with multiple
> > threads, that's ~34% improvement. I'll add those numbers to the v2,
> > and JIsheng if you can provide some too, I'll add them too!
>
> Hi Alex,
>
> the threaded version show ~78% improvement! impressive!

One more thing when you cook v2: it's better to patch the riscv entry
in Documentation/features/vm/TLB/arch-support.txt

>
> So for the patch:
>
> Reviewed-by: Jisheng Zhang <jszhang@xxxxxxxxxx>
> Tested-by: Jisheng Zhang <jszhang@xxxxxxxxxx>
>
> Thanks
> >
> > Thanks,
> >
> > Alex
> >
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > Before this patch:
> > > > >
> > > > > real 2m1.135s
> > > > > user 0m0.980s
> > > > > sys 2m0.096s
> > > > >
> > > > > 4.83% batch_tlb [kernel.kallsyms] [k] __flush_tlb_range
> > > > >
> > > > > After this patch:
> > > > >
> > > > > real 1m0.543s
> > > > > user 0m1.059s
> > > > > sys 0m59.489s
> > > > >
> > > > > 0.14% batch_tlb [kernel.kallsyms] [k] __flush_tlb_range
> > > > >
> > > > > Signed-off-by: Alexandre Ghiti <alexghiti@xxxxxxxxxxxx>
> > > > > ---
> > > > > arch/riscv/Kconfig | 1 +
> > > > > arch/riscv/include/asm/tlbbatch.h | 15 +++++++
> > > > > arch/riscv/include/asm/tlbflush.h | 10 +++++
> > > > > arch/riscv/mm/tlbflush.c | 71 ++++++++++++++++++++++---------
> > > > > 4 files changed, 77 insertions(+), 20 deletions(-)
> > > > > create mode 100644 arch/riscv/include/asm/tlbbatch.h
> > > > >
> > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > > > index 7603bd8ab333..aa07bd43b138 100644
> > > > > --- a/arch/riscv/Kconfig
> > > > > +++ b/arch/riscv/Kconfig
> > > > > @@ -53,6 +53,7 @@ config RISCV
> > > > > select ARCH_USE_MEMTEST
> > > > > select ARCH_USE_QUEUED_RWLOCKS
> > > > > select ARCH_USES_CFI_TRAPS if CFI_CLANG
> > > > > + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if SMP && MMU
> > > > > select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
> > > > > select ARCH_WANT_FRAME_POINTERS
> > > > > select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
> > > > > diff --git a/arch/riscv/include/asm/tlbbatch.h b/arch/riscv/include/asm/tlbbatch.h
> > > > > new file mode 100644
> > > > > index 000000000000..46014f70b9da
> > > > > --- /dev/null
> > > > > +++ b/arch/riscv/include/asm/tlbbatch.h
> > > > > @@ -0,0 +1,15 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0-only */
> > > > > +/*
> > > > > + * Copyright (C) 2023 Rivos Inc.
> > > > > + */
> > > > > +
> > > > > +#ifndef _ASM_RISCV_TLBBATCH_H
> > > > > +#define _ASM_RISCV_TLBBATCH_H
> > > > > +
> > > > > +#include <linux/cpumask.h>
> > > > > +
> > > > > +struct arch_tlbflush_unmap_batch {
> > > > > + struct cpumask cpumask;
> > > > > +};
> > > > > +
> > > > > +#endif /* _ASM_RISCV_TLBBATCH_H */
> > > > > diff --git a/arch/riscv/include/asm/tlbflush.h b/arch/riscv/include/asm/tlbflush.h
> > > > > index 8f3418c5f172..f0b731ccc0c2 100644
> > > > > --- a/arch/riscv/include/asm/tlbflush.h
> > > > > +++ b/arch/riscv/include/asm/tlbflush.h
> > > > > @@ -46,6 +46,16 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end);
> > > > > void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > > > unsigned long end);
> > > > > #endif
> > > > > +
> > > > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > > > +bool arch_tlbbatch_should_defer(struct mm_struct *mm);
> > > > > +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > > > > + struct mm_struct *mm,
> > > > > + unsigned long uaddr);
> > > > > +void arch_flush_tlb_batched_pending(struct mm_struct *mm);
> > > > > +void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
> > > > > +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> > > > > +
> > > > > #else /* CONFIG_SMP && CONFIG_MMU */
> > > > >
> > > > > #define flush_tlb_all() local_flush_tlb_all()
> > > > > diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
> > > > > index e6659d7368b3..bb623bca0a7d 100644
> > > > > --- a/arch/riscv/mm/tlbflush.c
> > > > > +++ b/arch/riscv/mm/tlbflush.c
> > > > > @@ -93,29 +93,23 @@ static void __ipi_flush_tlb_range_asid(void *info)
> > > > > local_flush_tlb_range_asid(d->start, d->size, d->stride, d->asid);
> > > > > }
> > > > >
> > > > > -static void __flush_tlb_range(struct mm_struct *mm, unsigned long start,
> > > > > - unsigned long size, unsigned long stride)
> > > > > +static void __flush_tlb_range(struct cpumask *cmask, unsigned long asid,
> > > > > + unsigned long start, unsigned long size,
> > > > > + unsigned long stride)
> > > > > {
> > > > > struct flush_tlb_range_data ftd;
> > > > > - const struct cpumask *cmask;
> > > > > - unsigned long asid = FLUSH_TLB_NO_ASID;
> > > > > bool broadcast;
> > > > >
> > > > > - if (mm) {
> > > > > - unsigned int cpuid;
> > > > > + if (cpumask_empty(cmask))
> > > > > + return;
> > > > >
> > > > > - cmask = mm_cpumask(mm);
> > > > > - if (cpumask_empty(cmask))
> > > > > - return;
> > > > > + if (cmask != cpu_online_mask) {
> > > > > + unsigned int cpuid;
> > > > >
> > > > > cpuid = get_cpu();
> > > > > /* check if the tlbflush needs to be sent to other CPUs */
> > > > > broadcast = cpumask_any_but(cmask, cpuid) < nr_cpu_ids;
> > > > > -
> > > > > - if (static_branch_unlikely(&use_asid_allocator))
> > > > > - asid = atomic_long_read(&mm->context.id) & asid_mask;
> > > > > } else {
> > > > > - cmask = cpu_online_mask;
> > > > > broadcast = true;
> > > > > }
> > > > >
> > > > > @@ -135,25 +129,34 @@ static void __flush_tlb_range(struct mm_struct *mm, unsigned long start,
> > > > > local_flush_tlb_range_asid(start, size, stride, asid);
> > > > > }
> > > > >
> > > > > - if (mm)
> > > > > + if (cmask != cpu_online_mask)
> > > > > put_cpu();
> > > > > }
> > > > >
> > > > > +static inline unsigned long get_mm_asid(struct mm_struct *mm)
> > > > > +{
> > > > > + return static_branch_unlikely(&use_asid_allocator) ?
> > > > > + atomic_long_read(&mm->context.id) & asid_mask : FLUSH_TLB_NO_ASID;
> > > > > +}
> > > > > +
> > > > > void flush_tlb_mm(struct mm_struct *mm)
> > > > > {
> > > > > - __flush_tlb_range(mm, 0, FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
> > > > > + __flush_tlb_range(mm_cpumask(mm), get_mm_asid(mm),
> > > > > + 0, FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
> > > > > }
> > > > >
> > > > > void flush_tlb_mm_range(struct mm_struct *mm,
> > > > > unsigned long start, unsigned long end,
> > > > > unsigned int page_size)
> > > > > {
> > > > > - __flush_tlb_range(mm, start, end - start, page_size);
> > > > > + __flush_tlb_range(mm_cpumask(mm), get_mm_asid(mm),
> > > > > + start, end - start, page_size);
> > > > > }
> > > > >
> > > > > void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr)
> > > > > {
> > > > > - __flush_tlb_range(vma->vm_mm, addr, PAGE_SIZE, PAGE_SIZE);
> > > > > + __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
> > > > > + addr, PAGE_SIZE, PAGE_SIZE);
> > > > > }
> > > > >
> > > > > void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > > > @@ -185,18 +188,46 @@ void flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > > > }
> > > > > }
> > > > >
> > > > > - __flush_tlb_range(vma->vm_mm, start, end - start, stride_size);
> > > > > + __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
> > > > > + start, end - start, stride_size);
> > > > > }
> > > > >
> > > > > void flush_tlb_kernel_range(unsigned long start, unsigned long end)
> > > > > {
> > > > > - __flush_tlb_range(NULL, start, end - start, PAGE_SIZE);
> > > > > + __flush_tlb_range((struct cpumask *)cpu_online_mask, FLUSH_TLB_NO_ASID,
> > > > > + start, end - start, PAGE_SIZE);
> > > > > }
> > > > >
> > > > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > > void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
> > > > > unsigned long end)
> > > > > {
> > > > > - __flush_tlb_range(vma->vm_mm, start, end - start, PMD_SIZE);
> > > > > + __flush_tlb_range(mm_cpumask(vma->vm_mm), get_mm_asid(vma->vm_mm),
> > > > > + start, end - start, PMD_SIZE);
> > > > > }
> > > > > #endif
> > > > > +
> > > > > +#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> > > > > +bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> > > > > +{
> > > > > + return true;
> > > > > +}
> > > > > +
> > > > > +void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
> > > > > + struct mm_struct *mm,
> > > > > + unsigned long uaddr)
> > > > > +{
> > > > > + cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
> > > > > +}
> > > > > +
> > > > > +void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> > > > > +{
> > > > > + flush_tlb_mm(mm);
> > > > > +}
> > > > > +
> > > > > +void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > > > > +{
> > > > > + __flush_tlb_range(&batch->cpumask, FLUSH_TLB_NO_ASID, 0,
> > > > > + FLUSH_TLB_MAX_SIZE, PAGE_SIZE);
> > > > > +}
> > > > > +#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
> > > > > --
> > > > > 2.39.2
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > linux-riscv mailing list
> > > > > linux-riscv@xxxxxxxxxxxxxxxxxxx
> > > > > http://lists.infradead.org/mailman/listinfo/linux-riscv

Next message: Andy Shevchenko: "Re: [PATCH] list: Add hlist_count_nodes()"
Previous message: Dipam Turkar: "[PATCH 1/1] Refactor radeon driver to use drm_gem_create_map_offset() instead of its custom implementation for associating GEM object with a fake offset. Since, we already have a generic implementation, we don't need the custom function and it is better to standardize the code."
In reply to: Jisheng Zhang: "Re: [PATCH] riscv: Add support for BATCHED_UNMAP_TLB_FLUSH"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]