Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation

From: Barry Song
Date: Mon Sep 19 2022 - 00:53:58 EST


On Mon, Sep 19, 2022 at 4:24 PM Anshuman Khandual
<anshuman.khandual@xxxxxxx> wrote:
>
>
>
> On 9/15/22 12:12, Barry Song wrote:
> > On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
> > <anshuman.khandual@xxxxxxx> wrote:
> >>
> >>
> >>
> >> On 9/9/22 11:05, Barry Song wrote:
> >>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
> >>> <anshuman.khandual@xxxxxxx> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 8/22/22 13:51, Yicong Yang wrote:
> >>>>> From: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>>
> >>>>> on x86, batched and deferred tlb shootdown has lead to 90%
> >>>>> performance increase on tlb shootdown. on arm64, HW can do
> >>>>> tlb shootdown without software IPI. But sync tlbi is still
> >>>>> quite expensive.
> >>>>>
> >>>>> Even running a simplest program which requires swapout can
> >>>>> prove this is true,
> >>>>> #include <sys/types.h>
> >>>>> #include <unistd.h>
> >>>>> #include <sys/mman.h>
> >>>>> #include <string.h>
> >>>>>
> >>>>> int main()
> >>>>> {
> >>>>> #define SIZE (1 * 1024 * 1024)
> >>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
> >>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> >>>>>
> >>>>> memset(p, 0x88, SIZE);
> >>>>>
> >>>>> for (int k = 0; k < 10000; k++) {
> >>>>> /* swap in */
> >>>>> for (int i = 0; i < SIZE; i += 4096) {
> >>>>> (void)p[i];
> >>>>> }
> >>>>>
> >>>>> /* swap out */
> >>>>> madvise(p, SIZE, MADV_PAGEOUT);
> >>>>> }
> >>>>> }
> >>>>>
> >>>>> Perf result on snapdragon 888 with 8 cores by using zRAM
> >>>>> as the swap block device.
> >>>>>
> >>>>> ~ # perf record taskset -c 4 ./a.out
> >>>>> [ perf record: Woken up 10 times to write data ]
> >>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
> >>>>> ~ # perf report
> >>>>> # To display the perf.data header info, please use --header/--header-only options.
> >>>>> # To display the perf.data header info, please use --header/--header-only options.
> >>>>> #
> >>>>> #
> >>>>> # Total Lost Samples: 0
> >>>>> #
> >>>>> # Samples: 60K of event 'cycles'
> >>>>> # Event count (approx.): 35706225414
> >>>>> #
> >>>>> # Overhead Command Shared Object Symbol
> >>>>> # ........ ....... ................. .............................................................................
> >>>>> #
> >>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
> >>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
> >>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
> >>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
> >>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
> >>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
> >>>>> 3.49% a.out [kernel.kallsyms] [k] memset64
> >>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page
> >>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
> >>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
> >>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load
> >>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
> >>>>>
> >>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
> >>>>> swapping in/out a page mapped by only one process. If the
> >>>>> page is mapped by multiple processes, typically, like more
> >>>>> than 100 on a phone, the overhead would be much higher as
> >>>>> we have to run tlb flush 100 times for one single page.
> >>>>> Plus, tlb flush overhead will increase with the number
> >>>>> of CPU cores due to the bad scalability of tlb shootdown
> >>>>> in HW, so those ARM64 servers should expect much higher
> >>>>> overhead.
> >>>>>
> >>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush
> >>>>> is actually used by the final dsb() to wait for the completion
> >>>>> of tlb flush. This provides us a very good chance to leverage
> >>>>> the existing batched tlb in kernel. The minimum modification
> >>>>> is that we only send async tlbi in the first stage and we send
> >>>>> dsb while we have to sync in the second stage.
> >>>>>
> >>>>> With the above simplest micro benchmark, collapsed time to
> >>>>> finish the program decreases around 5%.
> >>>>>
> >>>>> Typical collapsed time w/o patch:
> >>>>> ~ # time taskset -c 4 ./a.out
> >>>>> 0.21user 14.34system 0:14.69elapsed
> >>>>> w/ patch:
> >>>>> ~ # time taskset -c 4 ./a.out
> >>>>> 0.22user 13.45system 0:13.80elapsed
> >>>>>
> >>>>> Also, Yicong Yang added the following observation.
> >>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server,
> >>>>> observed an improvement around 12.5% with command
> >>>>> `time ./swap_bench`.
> >>>>> w/o w/
> >>>>> real 0m13.460s 0m11.771s
> >>>>> user 0m0.248s 0m0.279s
> >>>>> sys 0m12.039s 0m11.458s
> >>>>>
> >>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush()
> >>>>> which has been eliminated by this patch:
> >>>>>
> >>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report
> >>>>> [...]
> >>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
> >>>>>
> >>>>> Cc: Jonathan Corbet <corbet@xxxxxxx>
> >>>>> Cc: Nadav Amit <namit@xxxxxxxxxx>
> >>>>> Cc: Mel Gorman <mgorman@xxxxxxx>
> >>>>> Tested-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
> >>>>> Tested-by: Xin Hao <xhao@xxxxxxxxxxxxxxxxx>
> >>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>> Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
> >>>>> ---
> >>>>> .../features/vm/TLB/arch-support.txt | 2 +-
> >>>>> arch/arm64/Kconfig | 1 +
> >>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
> >>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
> >>>>> 4 files changed, 40 insertions(+), 3 deletions(-)
> >>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h
> >>>>>
> >>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
> >>>>> index 1c009312b9c1..2caf815d7c6c 100644
> >>>>> --- a/Documentation/features/vm/TLB/arch-support.txt
> >>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt
> >>>>> @@ -9,7 +9,7 @@
> >>>>> | alpha: | TODO |
> >>>>> | arc: | TODO |
> >>>>> | arm: | TODO |
> >>>>> - | arm64: | TODO |
> >>>>> + | arm64: | ok |
> >>>>> | csky: | TODO |
> >>>>> | hexagon: | TODO |
> >>>>> | ia64: | TODO |
> >>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >>>>> index 571cc234d0b3..09d45cd6d665 100644
> >>>>> --- a/arch/arm64/Kconfig
> >>>>> +++ b/arch/arm64/Kconfig
> >>>>> @@ -93,6 +93,7 @@ config ARM64
> >>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
> >>>>> select ARCH_SUPPORTS_NUMA_BALANCING
> >>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK
> >>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
> >>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
> >>>>> select ARCH_WANT_DEFAULT_BPF_JIT
> >>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
> >>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
> >>>>> new file mode 100644
> >>>>> index 000000000000..fedb0b87b8db
> >>>>> --- /dev/null
> >>>>> +++ b/arch/arm64/include/asm/tlbbatch.h
> >>>>> @@ -0,0 +1,12 @@
> >>>>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H
> >>>>> +#define _ARCH_ARM64_TLBBATCH_H
> >>>>> +
> >>>>> +struct arch_tlbflush_unmap_batch {
> >>>>> + /*
> >>>>> + * For arm64, HW can do tlb shootdown, so we don't
> >>>>> + * need to record cpumask for sending IPI
> >>>>> + */
> >>>>> +};
> >>>>> +
> >>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */
> >>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> >>>>> index 412a3b9a3c25..23cbc987321a 100644
> >>>>> --- a/arch/arm64/include/asm/tlbflush.h
> >>>>> +++ b/arch/arm64/include/asm/tlbflush.h
> >>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
> >>>>> dsb(ish);
> >>>>> }
> >>>>>
> >>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> >>>>> +
> >>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
> >>>>> unsigned long uaddr)
> >>>>> {
> >>>>> unsigned long addr;
> >>>>>
> >>>>> dsb(ishst);
> >>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
> >>>>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
> >>>>> __tlbi(vale1is, addr);
> >>>>> __tlbi_user(vale1is, addr);
> >>>>> }
> >>>>>
> >>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
> >>>>> + unsigned long uaddr)
> >>>>> +{
> >>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
> >>>>> +}
> >>>>> +
> >>>>> static inline void flush_tlb_page(struct vm_area_struct *vma,
> >>>>> unsigned long uaddr)
> >>>>> {
> >>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
> >>>>> dsb(ish);
> >>>>> }
> >>>>>
> >>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> >>>>> +{
> >>>>> + return true;
> >>>>> +}
> >>>>
> >>>> Always defer and batch up TLB flush, unconditionally ?
> >>>
> >>> My understanding is we actually don't need tlbbatch for a machine with one
> >>> or two cores as the tlb flush is not expensive. even for a system with four
> >>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%.
> >>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with
> >>> 100 crores, the cost is incredibly huge.
> >>
> >> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still
> >> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that
> >> expensive as well, while queuing up individual TLB flushes ?
> >
> > This one is much much cheaper as it is not waiting for the
> > completion of tlbi. waiting for the completion of tlbi is a big
> > deal in arm64, thus, similar optimization can be seen here
> >
> > 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation
> > when page aging").
> >
> >
> >>
> >> The very idea behind TLB deferral is the opportunity it (might) provide
> >> to accumulate address ranges and cpu masks so that individual TLB flush
> >> can be replaced with a more cost effective range based TLB flush. Hence
> >> I guess unless address range or cpumask based cost effective TLB flush
> >> is available, deferral does not improve the unmap performance as much.
> >
> >
> > After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
> > from all cpus in the system, tlbi is not scalable. The point here is that we
> > avoid waiting for each individual TLBi. Alternatively, they are batched. If
> > you read the benchmark in the commit log, you can find the great decline
> > in the cost to swap out a page.
>
> Alright, although collecting and deferring 'dsb(ish)' to the very end, does
> not feel like a direct fit case for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH but I
> guess it can be used to improve unmap performance on arm64.
>
> But is this 'dsb(ish)' deferral architecturally valid ?

yes as dsb(ish) ensures the completion of tlbi.
https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers

We are even depending on the dsb(ish) during context switch in commit:
3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation
when page aging").

Before the context switch, lots of tlbi could have been sent.

>
> Let's examine single page unmap path via try_to_unmap_one().
>
> should_defer_flush() {
> ptep_get_and_clear()
> set_tlb_ubc_flush_pending()
> arch_tlbbatch_add_mm()
> __flush_tlb_page_nosync()
> } else {
> ptep_clear_flush()
> ptep_get_and_clear()
> flush_tlb_page()
> flush_tlb_page_nosync()
> __flush_tlb_page_nosync()
> dsb(ish)
> }
>
> __flush_tlb_page_nosync()
> {
> dsb(ishst);
> addr = __TLBI_VADDR(uaddr, ASID(mm));
> __tlbi(vale1is, addr);
> __tlbi_user(vale1is, addr);
> }
>
> Currently without TLB deferral, 'dsb(ish)' gets executed just after __tlbi()
> and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function.
>
> #define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \
> "tlbi " #op "\n" \
> ALTERNATIVE("nop\n nop", \
> "dsb ish\n tlbi " #op, \
> ARM64_WORKAROUND_REPEAT_TLBI, \
> CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
> : : )
>
> #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \
> "tlbi " #op ", %0\n" \
> ALTERNATIVE("nop\n nop", \
> "dsb ish\n tlbi " #op ", %0", \
> ARM64_WORKAROUND_REPEAT_TLBI, \
> CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
> : : "r" (arg))
>
> #define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
>
> #define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0)
>
> #define __tlbi_user(op, arg) do { \
> if (arm64_kernel_unmapped_at_el0()) \
> __tlbi(op, (arg) | USER_ASID_FLAG); \
> } while (0)
>
> There is already a 'dsb(ish)' in between two subsequent TLB operations in
> case ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess
> deferral should not enabled on such systems ?
>
> But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush()
> via try_to_unmap_flush[_dirty](). There might be random number of instructions
> in between __tlbi()/__tlbi_user() i.e 'tlbi' instructions and final 'dsb(ish)'.
> Just wondering, if such 'detached in time with other instructions in between'
> 'tlbi' and 'dsb(ish)', is architecturally valid ?

yes. I think so, arm64 even depends on the implicit dsb in context switch.

>
> There is a comment in 'struct tlbflush_unmap_batch'.
>
> /*
> * The arch code makes the following promise: generic code can modify a
> * PTE, then call arch_tlbbatch_add_mm() (which internally provides all
> * needed barriers), then call arch_tlbbatch_flush(), and the entries
> * will be flushed on all CPUs by the time that arch_tlbbatch_flush()
> * returns.
> */
>
>
> It expects arch_tlbbatch_add_mm() to provide all barriers, hence wondering if
> that would include just the first 'dsb(isht)' not the subsequent 'dsb(ish)' ?

yes. include/asm/tlbflush.h, we see the below comments:

* TLB Invalidation
* ================
*
* This header file implements the low-level TLB invalidation routines
* (sometimes referred to as "flushing" in the kernel) for arm64.
*
* Every invalidation operation uses the following template:
*
* DSB ISHST // Ensure prior page-table updates have completed
* TLBI ... // Invalidate the TLB
* DSB ISH // Ensure the TLB invalidation has completed
* if (invalidated kernel mappings)
* ISB // Discard any instructions fetched from the old mapping
*

Clearly dsb(ishst) has ensured page-table updates are visible to all CPUs.

Thanks
Barry