Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec 63.0% regression

From: Qi Zheng
Date: Tue Jan 28 2025 - 12:07:29 EST


Hi,

On 2025/1/28 21:42, David Hildenbrand wrote:
> On 28.01.25 14:28, Peter Zijlstra wrote:
>> On Tue, Jan 28, 2025 at 12:39:51PM +0100, David Hildenbrand wrote:
>>> On 28.01.25 12:31, Peter Zijlstra wrote:
>>>>> I recall a recent series to select MMU_GATHER_RCU_TABLE_FREE on x86
>>>>> unconditionally (@Peter, @Rik).
>>>>
>>>> Those changes should not have made it to Linus yet.
>>>>
>>>> /me updates git and checks...
>>>>
>>>> nope, nothing changed there ... yet
>>>
>>> Sorry, I wasn't quite clear. CONFIG_PT_RECLAIM made it upstream, which has
>>> "select MMU_GATHER_RCU_TABLE_FREE" in kconfig.
>>>
>>> So I'm wondering if the degradation we see in this report is due to
>>> MMU_GATHER_RCU_TABLE_FREE being selected by CONFIG_PT_RECLAIM, and we'd get
>>> the same result (degradation) when unconditionally enabling
>>> MMU_GATHER_RCU_TABLE_FREE.
>>
>> Ah, yes, but a RHEL based config (as is the case here) should already
>> have it selected due to PARAVIRT.
>
> Ah, right. Most distros will just have it enabled either way.
>
> But that would then mean that MMU_GATHER_RCU_TABLE_FREE is not the cause
> for the regression here, and something else is going wrong.


I did reproduce the performance regression using the following test
program:

stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64

The results are as follows:

1) Enable CONFIG_PT_RECLAIM

stress-ng: info: [826] dispatching hogs: 64 mmapaddr
stress-ng: info: [826] successful run completed in 60.29s (1 min, 0.29 secs)
stress-ng: info: [826] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info: [826]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info: [826] mmapaddr       17233711     60.01    238.47   1128.46    287178.92       12607.60
stress-ng: info: [826] for a 60.29s run time:
stress-ng: info: [826] 1447.07s available CPU time
stress-ng: info: [826] 238.85s user time ( 16.51%)
stress-ng: info: [826] 1128.87s system time ( 78.01%)
stress-ng: info: [826] 1367.72s total time ( 94.52%)
stress-ng: info: [826] load average: 48.64 20.73 7.82

2) Disable CONFIG_PT_RECLAIM

stress-ng: info: [704] dispatching hogs: 64 mmapaddr
stress-ng: info: [704] successful run completed in 60.05s (1 min, 0.05 secs)
stress-ng: info: [704] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
stress-ng: info: [704]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info: [704] mmapaddr       28440843     60.02    343.93   1090.70    473882.98       19824.51
stress-ng: info: [704] for a 60.05s run time:
stress-ng: info: [704] 1441.23s available CPU time
stress-ng: info: [704] 344.30s user time ( 23.89%)
stress-ng: info: [704] 1091.12s system time ( 75.71%)
stress-ng: info: [704] 1435.42s total time ( 99.60%)
stress-ng: info: [704] load average: 40.03 11.51 3.96

Then I found that, after enabling CONFIG_PT_RECLAIM, an additional perf
hotspot function appeared:

16.35% [kernel] [k] _raw_spin_unlock_irqrestore
9.09% [kernel] [k] clear_page_rep
6.92% [kernel] [k] do_syscall_64
3.76% [kernel] [k] _raw_spin_lock
3.27% [kernel] [k] __slab_free
2.07% [kernel] [k] rcu_cblist_dequeue
1.94% [kernel] [k] flush_tlb_mm_range
1.87% [kernel] [k] lruvec_stat_mod_folio.part.130
1.79% [kernel] [k] get_page_from_freelist
1.61% [kernel] [k] tlb_remove_table_rcu
1.58% [kernel] [k] kmem_cache_alloc_noprof
1.43% [kernel] [k] mtree_range_walk

And its call stack is as follows:

bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'

@[
_raw_spin_unlock_irqrestore+5
free_one_page+85
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+113
asm_sysvec_apic_timer_interrupt+26
_raw_spin_unlock_irqrestore+29
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2283
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
pte_alloc_one+30
__pte_alloc+42
do_pte_missing+2499
__handle_mm_fault+1862
handle_mm_fault+195
__get_user_pages+690
populate_vma_page_range+127
__mm_populate+159
vm_mmap_pgoff+329
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2443
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5184
@[
_raw_spin_unlock_irqrestore+5
free_one_page+85
tlb_remove_table_rcu+140
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+113
asm_sysvec_apic_timer_interrupt+26
_raw_spin_unlock_irqrestore+29
get_page_from_freelist+2014
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5301
@Error looking up stack id 4294967279 (pid -1): -1
[, stress-ng-mmapa]: 53366

It seems to be related to CONFIG_MMU_GATHER_RCU_TABLE_FREE?
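(For context: with CONFIG_MMU_GATHER_RCU_TABLE_FREE, the mmu_gather code defers
freeing of page-table pages to an RCU callback instead of freeing them
directly, so the actual free_one_page() work runs from softirq context. A
simplified sketch of that path, abridged from my reading of mm/mmu_gather.c
and not verbatim:)

```c
/* Simplified sketch, not compilable on its own: freed page tables are
 * batched and released from an RCU callback, which is why the profile
 * shows rcu_do_batch -> tlb_remove_table_rcu -> free_one_page taking
 * the zone lock with IRQs disabled (_raw_spin_unlock_irqrestore). */
static void tlb_remove_table_rcu(struct rcu_head *head)
{
	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
}

static void tlb_table_flush(struct mmu_gather *tlb)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch) {
		tlb_table_invalidate(tlb);
		/* Defer the actual free until after a grace period. */
		call_rcu(&(*batch)->rcu, tlb_remove_table_rcu);
		*batch = NULL;
	}
}
```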

I will continue to investigate further.

Thanks!