Re: [PATCH] x86/alternatives: remove false sharing in poke_int3_handler()

From: Eric Dumazet
Date: Mon Mar 24 2025 - 04:22:42 EST


On Mon, Mar 24, 2025 at 9:02 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> * Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> > Do you have a specific case in mind that I can test on these big
> > platforms ?
>
> No. I was thinking of large-scale kprobes or ftrace patching - but you
> are right that the text_mutex should naturally serialize all the
> write-side code here.
>
> Mind adding your second round of test results to the changelog as well,
> which improved per call overhead from 36 to 28 nsecs?

Sure thing, thanks!

Note that the 36 to 28 nsec improvement was measured on a test host, not under real production stress.

Since all of our production still runs the old code, I cannot really tell
what the effective change will be once new kernels are rolled out.

When updating a single shared atomic_t from 480 CPUs (worst-case
scenario), we need more than 40000 cycles per operation.

perf stat atomic_bench -T480

The atomic counter is 21904528, total_cycles=2095231571464, 95652 avg cycles per update
[05] 7866 in [32,64[ cycles (53 avg)
[06] 2196 in [64,128[ cycles (81 avg)
[07] 2942 in [128,256[ cycles (202 avg)
[08] 1865 in [256,512[ cycles (383 avg)
[09] 4251 in [512,1024[ cycles (780 avg)
[10] 72248 in [1024,2048[ cycles (1722 avg)
[11] *** 438110 in [2048,4096[ cycles (3217 avg)
[12] *********** 1703927 in [4096,8192[ cycles (6199 avg)
[13] ************************** 3869889 in [8192,16384[ cycles (12320 avg)
[14] *************************** 4040952 in [16384,32768[ cycles (25185 avg)
[15] ************************************************** 7261596 in [32768,65536[ cycles (46884 avg)
[16] ****************** 2688791 in [65536,131072[ cycles (83552 avg)
[17] * 253104 in [131072,262144[ cycles (189642 avg)
[18] ** 326075 in [262144,524288[ cycles (349319 avg)
[19] ****** 901293 in [524288,1048576[ cycles (890724 avg)
[20] ** 321711 in [1048576,2097152[ cycles (1205250 avg)
[21] 6616 in [2097152,4194304[ cycles (2436096 avg)

Performance counter stats for './atomic_bench -T480':

        964,194.88 msec task-clock                #  467.120 CPUs utilized
            13,795      context-switches          #   14.307 M/sec
               480      cpu-migrations            #    0.498 M/sec
             1,605      page-faults               #    1.665 M/sec
 3,182,241,468,867      cycles                    # 3300416.170 GHz
    11,077,646,267      instructions              #    0.00 insn per cycle
     1,711,894,269      branches                  # 1775466.627 M/sec
         3,747,877      branch-misses             #    0.22% of all branches

2.064128692 seconds time elapsed
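
For reference, atomic_bench is an internal tool, but a minimal userspace
sketch of this kind of worst-case test could look like the following.
This is hypothetical code, not the actual tool: the thread count comes
from argv[1] instead of -T, and the per-bucket cycle histogram is
omitted. All names here are made up for illustration.

/*
 * Minimal contention sketch (hypothetical, not the internal atomic_bench):
 * N threads hammer one shared atomic counter and report the average TSC
 * cycles per update, showing what a single shared cache line costs at
 * high CPU counts. Build with: gcc -O2 -pthread.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>          /* __rdtsc() */

#define OPS_PER_THREAD 1000000

static atomic_long shared_counter;      /* the single contended cache line */

struct worker_res {
        unsigned long long cycles;      /* total cycles spent in updates */
};

static void *worker(void *arg)
{
        unsigned long long total = 0;

        for (long i = 0; i < OPS_PER_THREAD; i++) {
                unsigned long long t0 = __rdtsc();

                atomic_fetch_add(&shared_counter, 1);
                total += __rdtsc() - t0;
        }
        ((struct worker_res *)arg)->cycles = total;
        return NULL;
}

int main(int argc, char **argv)
{
        int nthreads = argc > 1 ? atoi(argv[1]) : 4;
        pthread_t *tid = calloc(nthreads, sizeof(*tid));
        struct worker_res *res = calloc(nthreads, sizeof(*res));
        unsigned long long cycles = 0;

        for (int i = 0; i < nthreads; i++)
                pthread_create(&tid[i], NULL, worker, &res[i]);
        for (int i = 0; i < nthreads; i++) {
                pthread_join(tid[i], NULL);
                cycles += res[i].cycles;
        }
        printf("counter=%ld total_cycles=%llu avg=%llu cycles per update\n",
               atomic_load(&shared_counter), cycles,
               cycles / ((unsigned long long)nthreads * OPS_PER_THREAD));
        return 0;
}

Every thread hits the same cache line, which is exactly the pathology
that moving poke_int3_handler() off a shared refcount avoids.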


I said that the atomic_cond_read_acquire(refs, !VAL) path was not reached in
my tests, but it is a valid situation, so we should not add a WARN_ON_ONCE() there.

I will simply add the unlikely(), as in the sketch below.
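
Concretely, the wait-for-zero path in text_poke_bp_batch() would end up
looking roughly like this with per-CPU refs. This is only a sketch
assuming the per-CPU bp_refs layout from this patch; the actual diff is
authoritative and details may differ:

        /*
         * The INT3 handler may legitimately still hold a reference,
         * so no WARN_ON_ONCE() here; just mark the wait as unlikely.
         */
        for_each_possible_cpu(i) {
                atomic_t *refs = per_cpu_ptr(&bp_refs, i);

                if (unlikely(!atomic_dec_and_test(refs)))
                        atomic_cond_read_acquire(refs, !VAL);
        }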

Thanks.