Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
From: Anshuman Khandual
Date: Mon Sep 19 2022 - 00:24:58 EST
On 9/15/22 12:12, Barry Song wrote:
> On Thu, Sep 15, 2022 at 6:07 PM Anshuman Khandual
> <anshuman.khandual@xxxxxxx> wrote:
>>
>>
>>
>> On 9/9/22 11:05, Barry Song wrote:
>>> On Fri, Sep 9, 2022 at 5:24 PM Anshuman Khandual
>>> <anshuman.khandual@xxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 8/22/22 13:51, Yicong Yang wrote:
>>>>> From: Barry Song <v-songbaohua@xxxxxxxx>
>>>>>
>>>>> on x86, batched and deferred tlb shootdown has lead to 90%
>>>>> performance increase on tlb shootdown. on arm64, HW can do
>>>>> tlb shootdown without software IPI. But sync tlbi is still
>>>>> quite expensive.
>>>>>
>>>>> Even running a simplest program which requires swapout can
>>>>> prove this is true,
>>>>> #include <sys/types.h>
>>>>> #include <unistd.h>
>>>>> #include <sys/mman.h>
>>>>> #include <string.h>
>>>>>
>>>>> int main()
>>>>> {
>>>>> #define SIZE (1 * 1024 * 1024)
>>>>> volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>>>>> MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>>>
>>>>> memset(p, 0x88, SIZE);
>>>>>
>>>>> for (int k = 0; k < 10000; k++) {
>>>>> /* swap in */
>>>>> for (int i = 0; i < SIZE; i += 4096) {
>>>>> (void)p[i];
>>>>> }
>>>>>
>>>>> /* swap out */
>>>>> madvise(p, SIZE, MADV_PAGEOUT);
>>>>> }
>>>>> }
>>>>>
>>>>> Perf result on snapdragon 888 with 8 cores by using zRAM
>>>>> as the swap block device.
>>>>>
>>>>> ~ # perf record taskset -c 4 ./a.out
>>>>> [ perf record: Woken up 10 times to write data ]
>>>>> [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
>>>>> ~ # perf report
>>>>> # To display the perf.data header info, please use --header/--header-only options.
>>>>> # To display the perf.data header info, please use --header/--header-only options.
>>>>> #
>>>>> #
>>>>> # Total Lost Samples: 0
>>>>> #
>>>>> # Samples: 60K of event 'cycles'
>>>>> # Event count (approx.): 35706225414
>>>>> #
>>>>> # Overhead Command Shared Object Symbol
>>>>> # ........ ....... ................. .............................................................................
>>>>> #
>>>>> 21.07% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irq
>>>>> 8.23% a.out [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
>>>>> 6.67% a.out [kernel.kallsyms] [k] filemap_map_pages
>>>>> 6.16% a.out [kernel.kallsyms] [k] __zram_bvec_write
>>>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
>>>>> 3.71% a.out [kernel.kallsyms] [k] _raw_spin_lock
>>>>> 3.49% a.out [kernel.kallsyms] [k] memset64
>>>>> 1.63% a.out [kernel.kallsyms] [k] clear_page
>>>>> 1.42% a.out [kernel.kallsyms] [k] _raw_spin_unlock
>>>>> 1.26% a.out [kernel.kallsyms] [k] mod_zone_state.llvm.8525150236079521930
>>>>> 1.23% a.out [kernel.kallsyms] [k] xas_load
>>>>> 1.15% a.out [kernel.kallsyms] [k] zram_slot_lock
>>>>>
>>>>> ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
>>>>> swapping in/out a page mapped by only one process. If the
>>>>> page is mapped by multiple processes, typically, like more
>>>>> than 100 on a phone, the overhead would be much higher as
>>>>> we have to run tlb flush 100 times for one single page.
>>>>> Plus, tlb flush overhead will increase with the number
>>>>> of CPU cores due to the bad scalability of tlb shootdown
>>>>> in HW, so those ARM64 servers should expect much higher
>>>>> overhead.
>>>>>
>>>>> Further perf annonate shows 95% cpu time of ptep_clear_flush
>>>>> is actually used by the final dsb() to wait for the completion
>>>>> of tlb flush. This provides us a very good chance to leverage
>>>>> the existing batched tlb in kernel. The minimum modification
>>>>> is that we only send async tlbi in the first stage and we send
>>>>> dsb while we have to sync in the second stage.
>>>>>
>>>>> With the above simplest micro benchmark, collapsed time to
>>>>> finish the program decreases around 5%.
>>>>>
>>>>> Typical collapsed time w/o patch:
>>>>> ~ # time taskset -c 4 ./a.out
>>>>> 0.21user 14.34system 0:14.69elapsed
>>>>> w/ patch:
>>>>> ~ # time taskset -c 4 ./a.out
>>>>> 0.22user 13.45system 0:13.80elapsed
>>>>>
>>>>> Also, Yicong Yang added the following observation.
>>>>> Tested with benchmark in the commit on Kunpeng920 arm64 server,
>>>>> observed an improvement around 12.5% with command
>>>>> `time ./swap_bench`.
>>>>> w/o w/
>>>>> real 0m13.460s 0m11.771s
>>>>> user 0m0.248s 0m0.279s
>>>>> sys 0m12.039s 0m11.458s
>>>>>
>>>>> Originally it's noticed a 16.99% overhead of ptep_clear_flush()
>>>>> which has been eliminated by this patch:
>>>>>
>>>>> [root@localhost yang]# perf record -- ./swap_bench && perf report
>>>>> [...]
>>>>> 16.99% swap_bench [kernel.kallsyms] [k] ptep_clear_flush
>>>>>
>>>>> Cc: Jonathan Corbet <corbet@xxxxxxx>
>>>>> Cc: Nadav Amit <namit@xxxxxxxxxx>
>>>>> Cc: Mel Gorman <mgorman@xxxxxxx>
>>>>> Tested-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>>>>> Tested-by: Xin Hao <xhao@xxxxxxxxxxxxxxxxx>
>>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
>>>>> Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>>>>> ---
>>>>> .../features/vm/TLB/arch-support.txt | 2 +-
>>>>> arch/arm64/Kconfig | 1 +
>>>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++
>>>>> arch/arm64/include/asm/tlbflush.h | 28 +++++++++++++++++--
>>>>> 4 files changed, 40 insertions(+), 3 deletions(-)
>>>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>>>
>>>>> diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
>>>>> index 1c009312b9c1..2caf815d7c6c 100644
>>>>> --- a/Documentation/features/vm/TLB/arch-support.txt
>>>>> +++ b/Documentation/features/vm/TLB/arch-support.txt
>>>>> @@ -9,7 +9,7 @@
>>>>> | alpha: | TODO |
>>>>> | arc: | TODO |
>>>>> | arm: | TODO |
>>>>> - | arm64: | TODO |
>>>>> + | arm64: | ok |
>>>>> | csky: | TODO |
>>>>> | hexagon: | TODO |
>>>>> | ia64: | TODO |
>>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>>> index 571cc234d0b3..09d45cd6d665 100644
>>>>> --- a/arch/arm64/Kconfig
>>>>> +++ b/arch/arm64/Kconfig
>>>>> @@ -93,6 +93,7 @@ config ARM64
>>>>> select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
>>>>> select ARCH_SUPPORTS_NUMA_BALANCING
>>>>> select ARCH_SUPPORTS_PAGE_TABLE_CHECK
>>>>> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>>>>> select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
>>>>> select ARCH_WANT_DEFAULT_BPF_JIT
>>>>> select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>>>> diff --git a/arch/arm64/include/asm/tlbbatch.h b/arch/arm64/include/asm/tlbbatch.h
>>>>> new file mode 100644
>>>>> index 000000000000..fedb0b87b8db
>>>>> --- /dev/null
>>>>> +++ b/arch/arm64/include/asm/tlbbatch.h
>>>>> @@ -0,0 +1,12 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +#ifndef _ARCH_ARM64_TLBBATCH_H
>>>>> +#define _ARCH_ARM64_TLBBATCH_H
>>>>> +
>>>>> +struct arch_tlbflush_unmap_batch {
>>>>> + /*
>>>>> + * For arm64, HW can do tlb shootdown, so we don't
>>>>> + * need to record cpumask for sending IPI
>>>>> + */
>>>>> +};
>>>>> +
>>>>> +#endif /* _ARCH_ARM64_TLBBATCH_H */
>>>>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>>>>> index 412a3b9a3c25..23cbc987321a 100644
>>>>> --- a/arch/arm64/include/asm/tlbflush.h
>>>>> +++ b/arch/arm64/include/asm/tlbflush.h
>>>>> @@ -254,17 +254,24 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
>>>>> dsb(ish);
>>>>> }
>>>>>
>>>>> -static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>>>>> +
>>>>> +static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
>>>>> unsigned long uaddr)
>>>>> {
>>>>> unsigned long addr;
>>>>>
>>>>> dsb(ishst);
>>>>> - addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
>>>>> + addr = __TLBI_VADDR(uaddr, ASID(mm));
>>>>> __tlbi(vale1is, addr);
>>>>> __tlbi_user(vale1is, addr);
>>>>> }
>>>>>
>>>>> +static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
>>>>> + unsigned long uaddr)
>>>>> +{
>>>>> + return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
>>>>> +}
>>>>> +
>>>>> static inline void flush_tlb_page(struct vm_area_struct *vma,
>>>>> unsigned long uaddr)
>>>>> {
>>>>> @@ -272,6 +279,23 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
>>>>> dsb(ish);
>>>>> }
>>>>>
>>>>> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
>>>>> +{
>>>>> + return true;
>>>>> +}
>>>>
>>>> Always defer and batch up TLB flush, unconditionally ?
>>>
>>> My understanding is we actually don't need tlbbatch for a machine with one
>>> or two cores as the tlb flush is not expensive. even for a system with four
>>> cortex-a55 cores, i didn't see obvious cost. it was less than 1%.
>>> when we have 8 cores, we see the obvious cost of tlb flush. for a server with
>>> 100 crores, the cost is incredibly huge.
>>
>> Although dsb(ish) is deferred via arch_tlbbatch_flush(), there is still
>> one dsb(isht) instruction left in __flush_tlb_page_nosync(). Is not that
>> expensive as well, while queuing up individual TLB flushes ?
>
> This one is much much cheaper as it is not waiting for the
> completion of tlbi. waiting for the completion of tlbi is a big
> deal in arm64, thus, similar optimization can be seen here
>
> 3403e56b41c1("arm64: mm: Don't wait for completion of TLB invalidation
> when page aging").
>
>
>>
>> The very idea behind TLB deferral is the opportunity it (might) provide
>> to accumulate address ranges and cpu masks so that individual TLB flush
>> can be replaced with a more cost effective range based TLB flush. Hence
>> I guess unless address range or cpumask based cost effective TLB flush
>> is available, deferral does not improve the unmap performance as much.
>
>
> After sending tlbi, if we wait for the completion of tlbi, we have to get Ack
> from all cpus in the system, tlbi is not scalable. The point here is that we
> avoid waiting for each individual TLBi. Alternatively, they are batched. If
> you read the benchmark in the commit log, you can find the great decline
> in the cost to swap out a page.
Alright, although collecting and deferring 'dsb(ish)' to the very end, does
not feel like a direct fit case for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH but I
guess it can be used to improve unmap performance on arm64.
But is this 'dsb(ish)' deferral architecturally valid ?
Let's examine single page unmap path via try_to_unmap_one().
should_defer_flush() {
ptep_get_and_clear()
set_tlb_ubc_flush_pending()
arch_tlbbatch_add_mm()
__flush_tlb_page_nosync()
} else {
ptep_clear_flush()
ptep_get_and_clear()
flush_tlb_page()
flush_tlb_page_nosync()
__flush_tlb_page_nosync()
dsb(ish)
}
__flush_tlb_page_nosync()
{
dsb(ishst);
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
}
Currently without TLB deferral, 'dsb(ish)' gets executed just after __tlbi()
and __tlbi_user(), because __flush_tlb_page_nosync() is an inline function.
#define __TLBI_0(op, arg) asm (ARM64_ASM_PREAMBLE \
"tlbi " #op "\n" \
ALTERNATIVE("nop\n nop", \
"dsb ish\n tlbi " #op, \
ARM64_WORKAROUND_REPEAT_TLBI, \
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
: : )
#define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE \
"tlbi " #op ", %0\n" \
ALTERNATIVE("nop\n nop", \
"dsb ish\n tlbi " #op ", %0", \
ARM64_WORKAROUND_REPEAT_TLBI, \
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI) \
: : "r" (arg))
#define __TLBI_N(op, arg, n, ...) __TLBI_##n(op, arg)
#define __tlbi(op, ...) __TLBI_N(op, ##__VA_ARGS__, 1, 0)
#define __tlbi_user(op, arg) do { \
if (arm64_kernel_unmapped_at_el0()) \
__tlbi(op, (arg) | USER_ASID_FLAG); \
} while (0)
There is already a 'dsb(ish)' in between two subsequent TLB operations in
case ARM64_WORKAROUND_REPEAT_TLBI is detected on the system. Hence I guess
deferral should not enabled on such systems ?
But with deferral enabled, 'dsb(ish)' will be executed in arch_tlbbatch_flush()
via try_to_unmap_flush[_dirty](). There might be random number of instructions
in between __tlbi()/__tlbi_user() i.e 'tlbi' instructions and final 'dsb(ish)'.
Just wondering, if such 'detached in time with other instructions in between'
'tlbi' and 'dsb(ish)', is architecturally valid ?
There is a comment in 'struct tlbflush_unmap_batch'.
/*
* The arch code makes the following promise: generic code can modify a
* PTE, then call arch_tlbbatch_add_mm() (which internally provides all
* needed barriers), then call arch_tlbbatch_flush(), and the entries
* will be flushed on all CPUs by the time that arch_tlbbatch_flush()
* returns.
*/
It expects arch_tlbbatch_add_mm() to provide all barriers, hence wondering if
that would include just the first 'dsb(isht)' not the subsequent 'dsb(ish)' ?