Re: [PATCH 2/2] x86/resctrl: Don't workqueue local event counter reads

From: Fenghua Yu
Date: Mon Nov 04 2024 - 22:28:59 EST


Hi, Tony,

On 11/4/24 16:12, Luck, Tony wrote:
>> Whenever this function is called, the performance is degraded rather
>> than improved because extra get_cpu()/put_cpu() are called in the fast
>> path in the current patch.

> But get_cpu()/put_cpu() aren't high overhead. Maybe costs less than the
> cpumask_any_housekeeping() call that is avoided by Peter's patch.

Quote from Peter:

"AMD EPYC 7B12 64-Core Processor (250 mon groups)

Local Domain: 3.25M -> 1.22M (-62.5%)
Remote Domain: 7.91M -> 8.05M (+2.9%)

Intel(R) Xeon(R) Gold 6268CL CPU @ 2.80GHz (190 mon groups)

Local Domain: 2.98M -> 2.21M (-25.8%)
Remote Domain: 4.49M -> 4.62M (+3.1%)

Note that there is a small increase in overhead for remote domains,
which results from the introduction of a put_cpu() call to reenable
preemption after determining whether the fast path can be used."

As his data shows, when the fast path is not taken, the extra put_cpu() by itself costs 2.9% extra time on the AMD machine and 3.1% extra time on the Intel machine.

And this ~3% overhead comes on top of the queued work, which is more expensive than cpumask_any_housekeeping() IIUC.


> Note that if Peter's patch doesn't take its fast path because the calling
> CPU was on the wrong domain, then the subsequent code is going to
> do an IPI whichever of the if/else path is taken.

Actually, in this case an IPI is sent only on the smp_call_function_any() path. smp_call_on_cpu() queues work instead of sending an IPI.

My proposed change logically keeps Peter's fast path and its performance for the nohz_full/smp_call_on_cpu() case. It just uses the fast path that is already built into smp_call_function_any() to save the extra get_cpu() and put_cpu(). Hopefully the saved get_cpu() and put_cpu() can offset the cost of cpumask_any_housekeeping().

From Peter's commit message, it seems the nohz_full case was measured rarely, if at all. If a large system has only one or a very few housekeeping CPUs, the nohz_full case will be hit frequently and the fast path will fail most of the time, so the extra get_cpu()/put_cpu() around the fast path might have more impact on both local and total domain.
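
To make the comparison concrete, here is a rough userspace model of the two orderings discussed above. It is only a sketch: the struct and function names are made up for illustration, it is not kernel code, and it only counts which operations the caller would perform. It assumes smp_call_function_any() runs the function on the current CPU when that CPU is in the mask, and it does not count the get_cpu()/put_cpu() that smp_call_function_any() does internally.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. Every name below is hypothetical; nothing here is
 * kernel code. Each counter records how often the modeled read path would
 * perform the corresponding kernel operation in the caller.
 */
struct op_counts {
	int caller_preempt_pairs; /* get_cpu()/put_cpu() pairs in the caller */
	int housekeeping_picks;   /* cpumask_any_housekeeping() calls */
	int local_reads;          /* counter read on the calling CPU */
	int ipis;                 /* remote read via smp_call_function_any() */
	int queued_work;          /* remote read via smp_call_on_cpu() */
};

struct scenario {
	bool caller_in_domain; /* calling CPU is in the domain's cpumask */
	bool target_nohz_full; /* CPU from the housekeeping pick is nohz_full */
};

/* Ordering 1: explicit fast-path check before anything else. */
struct op_counts model_fast_path_first(struct scenario s)
{
	struct op_counts c = {0};

	c.caller_preempt_pairs++;      /* get_cpu() ... put_cpu() */
	if (s.caller_in_domain) {
		c.local_reads++;       /* fast path, no housekeeping pick */
		return c;
	}

	c.housekeeping_picks++;
	if (s.target_nohz_full)
		c.queued_work++;       /* smp_call_on_cpu() */
	else
		c.ipis++;              /* smp_call_function_any(), remote */
	return c;
}

/* Ordering 2: explicit fast-path check only in the nohz_full branch. */
struct op_counts model_fast_path_in_nohz_branch(struct scenario s)
{
	struct op_counts c = {0};

	c.housekeeping_picks++;
	if (s.target_nohz_full) {
		c.caller_preempt_pairs++;  /* get_cpu() ... put_cpu() */
		if (s.caller_in_domain)
			c.local_reads++;   /* explicit fast path */
		else
			c.queued_work++;   /* smp_call_on_cpu() */
	} else if (s.caller_in_domain) {
		c.local_reads++;   /* smp_call_function_any() runs locally */
	} else {
		c.ipis++;          /* smp_call_function_any(), remote */
	}
	return c;
}
```

In this model, when the picked CPU is not nohz_full, the second ordering drops the caller's get_cpu()/put_cpu() pair for both local and remote callers, but it always pays for the cpumask_any_housekeeping() call, which the first ordering skips for a local caller. When the picked CPU is nohz_full and the caller is remote, the two orderings do the same work.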

Thanks.

-Fenghua