Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec 63.0% regression

From: Rik van Riel
Date: Wed Jan 29 2025 - 10:23:56 EST

Next message: Steven Price: "Re: [PATCH v6 06/43] arm64: RME: Check for RME support at KVM init"
Previous message: Igor Mammedov: "Re: [PATCH v2 03/13] acpi/ghes: add a firmware file with HEST address"
In reply to: Qi Zheng: "Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec 63.0% regression"
Next in thread: Rik van Riel: "Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec 63.0% regression"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 2025-01-29 at 16:14 +0800, Qi Zheng wrote:
> On 2025/1/29 02:35, Rik van Riel wrote:
> >
> > That looks like the RCU freeing somehow bypassing the
> > per-cpu-pages, and hitting the zone->lock at page free
> > time, while regular freeing usually puts pages in the
> > CPU-local free page cache, without the lock?
>
> Take the following call stack as an example:
>
> @[
> _raw_spin_unlock_irqrestore+5
> free_one_page+85
> tlb_remove_table_rcu+140
> rcu_do_batch+424
> rcu_core+401
> handle_softirqs+204
> irq_exit_rcu+208
> sysvec_apic_timer_interrupt+113
> asm_sysvec_apic_timer_interrupt+26
> _raw_spin_unlock_irqrestore+29
> get_page_from_freelist+2014
> __alloc_frozen_pages_noprof+364
> alloc_pages_mpol+123
> alloc_pages_noprof+14
> get_free_pages_noprof+17
> __x64_sys_mincore+141
> do_syscall_64+98
> entry_SYSCALL_64_after_hwframe+118
> , stress-ng-mmapa]: 5301
>
> It looks like the following happened:
>
> get_page_from_freelist
> --> rmqueue
>      --> rmqueue_pcplist
>          --> pcp_spin_trylock (hold the pcp lock)
>              __rmqueue_pcplist
>              --> rmqueue_bulk
>                  --> spin_lock_irqsave(&zone->lock)
>                      __rmqueue
>                      spin_unlock_irqrestore(&zone->lock)
>
>                      <run softirq at this time>
>
>                      tlb_remove_table_rcu
>                      --> free_frozen_pages
>                          --> pcp = pcp_spin_trylock (failed!!!)
>                              if (!pcp)
>                                  free_one_page
>
> It seems that the pcp lock is held when doing tlb_remove_table_rcu(),
> so trylock fails, then bypassing PCP and calling free_one_page()
> directly, which leads to the hot spot of zone lock.
>
> As for the regular freeing, since the freeing operation will not be
> performed in the softirq, the above situation will not occur.
>
> Right?

You are absolutely right!
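
To make the sequence above concrete, here is a small userspace
model of it. This is only a sketch: every name in it (struct model,
model_alloc_page(), model_free_page(), and so on) is hypothetical
and not a kernel API, the pcp lock is reduced to a single
atomic_flag, and the softirq is simulated by calling the free path
directly from inside the allocation path while the flag is still
set.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model {
	atomic_flag pcp_lock;	/* stands in for the per-CPU pcp lock */
	int pcp_frees;		/* frees that reached the CPU-local list */
	int zone_lock_frees;	/* frees that fell back to zone->lock */
};

static void model_init(struct model *m)
{
	atomic_flag_clear(&m->pcp_lock);
	m->pcp_frees = 0;
	m->zone_lock_frees = 0;
}

/* test_and_set returns the old value: false means we took the lock. */
static bool model_pcp_trylock(struct model *m)
{
	return !atomic_flag_test_and_set(&m->pcp_lock);
}

static void model_pcp_unlock(struct model *m)
{
	atomic_flag_clear(&m->pcp_lock);
}

/* Models the free path: try the pcp list, else take the slow path. */
static void model_free_page(struct model *m)
{
	if (model_pcp_trylock(m)) {
		m->pcp_frees++;
		model_pcp_unlock(m);
	} else {
		/* corresponds to free_one_page() under zone->lock */
		m->zone_lock_frees++;
	}
}

/*
 * Models the allocation path. softirq_frees is the number of RCU
 * callbacks assumed to run in the window after zone->lock has been
 * dropped (irqs reenabled) but before the pcp lock is released.
 */
static bool model_alloc_page(struct model *m, int softirq_frees)
{
	if (!model_pcp_trylock(m))
		return false;

	/* zone->lock would be taken and released here to refill the list */

	for (int i = 0; i < softirq_frees; i++)
		model_free_page(m);	/* "softirq" runs with pcp lock held */

	model_pcp_unlock(m);
	return true;
}
```

In this model, a free issued outside of model_alloc_page() lands in
pcp_frees, while every free issued from the simulated softirq
window lands in zone_lock_frees, which is the pattern the call
stack shows.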

This raises an interesting question: should we keep
RCU from running callbacks while the pcp lock is
held, and what would be the best way to do that?

Are there other corner cases where RCU callbacks
should not be running from softirq context at
irq reenable time?

Should RCU callbacks perhaps only run when
the current process has no locks held,
or should they simply always run from some
kernel thread?

I'm really not sure what the right answer is...

--
All Rights Reversed.
