On Tue, Jul 19, 2022 at 1:28 AM Yicong Yang <yangyicong@xxxxxxxxxx> wrote:
On 2022/7/14 12:51, Barry Song wrote:
I guess that is because "./Run -c 1 -i 1 shell1" isn't an application
On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@xxxxxxxxxxxxxxxxx> wrote:
flush_tlb_batched_pending() looks like the critical path for this issue then the code
Hi Barry.
I am really pleased to see the 30%+ improvement on unixbench on single core.
I ran some tests on a Kunpeng arm64 machine using Unixbench.
The test results are below.
With one core, we can see a performance improvement of more than 30%.
./Run -c 1 -i 1 shell1
That is sad as we might get more concurrency between mprotect(), madvise(),
w/o
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Shell Scripts (1 concurrent)                     42.4       5481.0   1292.7
                                                                   ========
System Benchmarks Index Score (Partial Only)                         1292.7
w/
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Shell Scripts (1 concurrent)                     42.4       6974.6   1645.0
                                                                   ========
System Benchmarks Index Score (Partial Only)                         1645.0
But with all cores, there is a slight performance degradation of more than 5%.
mremap(), zap_pte_range() and the deferred tlbi.
./Run -c 96 -i 1 shell1
I was guessing the problem might be flush_tlb_batched_pending()
w/o
Shell Scripts (1 concurrent)                  80765.5 lpm   (60.0 s, 1 samples)
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Shell Scripts (1 concurrent)                     42.4      80765.5  19048.5
                                                                   ========
System Benchmarks Index Score (Partial Only)                        19048.5
w/
Shell Scripts (1 concurrent)                  76333.6 lpm   (60.0 s, 1 samples)
System Benchmarks Partial Index              BASELINE       RESULT    INDEX
Shell Scripts (1 concurrent)                     42.4      76333.6  18003.2
                                                                   ========
System Benchmarks Index Score (Partial Only)                        18003.2
----------------------------------------------------------------------------------------------
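For reference, the relative change implied by the RESULT columns above can be recomputed directly. This is only an arithmetic sketch; the helper name percent_change() is made up for illustration and is not part of UnixBench or the kernel.

```c
/*
 * Relative change of `after` versus `before`, in percent.
 * Hypothetical helper for illustration only.
 */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the numbers reported above, percent_change(5481.0, 6974.6) comes out at roughly +27 for the single-core run, and percent_change(80765.5, 76333.6) at roughly -5.5 for the 96-core run.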
After discussing with you, I made some changes to the patch.
index a52381a680db..1ecba81f1277 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
 	int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
 	if (pending != flushed) {
+#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
 		flush_tlb_mm(mm);
+#else
+		dsb(ish);
+#endif
So I asked you to change this to verify my guess.
above can mitigate this.
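For readers following along, the pending/flushed comparison in the hunk above can be modeled in userspace roughly as follows. This is a sketch under assumptions, not the kernel code: the 16-bit split of the counter, the struct mm_sketch type, and the helpers set_pending(), fake_flush() and flush_if_pending() are all illustrative, and counter overflow is ignored.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed layout: low 16 bits count queued flushes, upper bits count completed ones. */
#define FLUSHED_SHIFT 16
#define PENDING_MASK  ((1 << FLUSHED_SHIFT) - 1)

/* Hypothetical stand-in for the relevant part of struct mm_struct. */
struct mm_sketch {
	atomic_int tlb_flush_batched;
	int full_flushes;	/* how often the stand-in flush ran */
};

/* Stand-in for flush_tlb_mm(); it only counts invocations. */
static void fake_flush(struct mm_sketch *mm)
{
	mm->full_flushes++;
}

/* Record that one more deferred flush has been queued. */
static void set_pending(struct mm_sketch *mm)
{
	atomic_fetch_add(&mm->tlb_flush_batched, 1);
}

/* Flush only if queued and completed counts differ; returns true if it flushed. */
static bool flush_if_pending(struct mm_sketch *mm)
{
	int batch = atomic_load(&mm->tlb_flush_batched);
	int pending = batch & PENDING_MASK;
	int flushed = batch >> FLUSHED_SHIFT;
	int expected = batch;

	if (pending == flushed)
		return false;

	fake_flush(mm);
	/* Mark everything seen so far as flushed, unless a new flush was queued meanwhile. */
	atomic_compare_exchange_strong(&mm->tlb_flush_batched, &expected,
				       pending | (pending << FLUSHED_SHIFT));
	return true;
}
```

In this model the cost in question is whatever fake_flush() stands for: the experiment discussed here swaps the full flush for a barrier on that path when the two counts differ.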
I cannot reproduce this on a 2P 128C Kunpeng920 server. The kernel is based on
v5.19-rc6 and unixbench is version 5.1.3. The result of `./Run -c 128 -i 1 shell1` is:
        iter-1    iter-2    iter-3
w/o     17708.1   17637.1   17630.1
w       17766.0   17752.3   17861.7
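One way to judge whether a difference like this is just fluctuation is to compare the gap between the per-configuration means with the run-to-run spread. The helpers below are a minimal sketch; mean() and spread_percent() are hypothetical names, and three iterations are too few for a real statistical test.

```c
#include <stddef.h>

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	if (n == 0)
		return 0.0;
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* (max - min) as a percentage of the mean: a crude measure of run-to-run noise. */
static double spread_percent(const double *v, size_t n)
{
	double lo, hi;
	size_t i;

	if (n == 0)
		return 0.0;
	lo = hi = v[0];
	for (i = 1; i < n; i++) {
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}
	return (hi - lo) / mean(v, n) * 100.0;
}
```

Fed with the three iterations above, the means are about 17658.4 (w/o) and 17793.3 (w), a gap of under 1%, which is the same order as the spread within each row.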
And flush_tlb_batched_pending() isn't the hot spot with the patch:
7.00% sh [kernel.kallsyms] [k] ptep_clear_flush
4.17% sh [kernel.kallsyms] [k] ptep_set_access_flags
2.43% multi.sh [kernel.kallsyms] [k] ptep_clear_flush
1.98% sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.69% sh [kernel.kallsyms] [k] next_uptodate_page
1.66% sort [kernel.kallsyms] [k] ptep_clear_flush
1.56% multi.sh [kernel.kallsyms] [k] ptep_set_access_flags
1.27% sh [kernel.kallsyms] [k] page_counter_cancel
1.11% sh [kernel.kallsyms] [k] page_remove_rmap
1.06% sh [kernel.kallsyms] [k] perf_event_alloc
Hi Xin Hao,
I'm not sure whether my test setup and config are the same as yours (96C vs 128C
should not be the reason, I think). Did you check whether the 5% is just
fluctuation? It would be helpful if you could provide more information for
reproducing this issue.
Thanks.
stressed on
memory. Hi Xin, in what kinds of configurations can we reproduce your test
result?
As I suppose tlbbatch will mainly affect the performance of user scenarios
which require memory page-out/page-in like reclaiming file/anon pages.
"./Run -c 1 -i 1 shell1" on a system with sufficient free memory won't be
affected by tlbbatch at all, I believe.
Thanks
Barry