[PATCH 0/4] Micro optimise to get this_cpu once

From: Shrikanth Hegde

Date: Mon Mar 23 2026 - 16:00:26 EST


It was observed that the compiler doesn't hoist the this_cpu calculation
out of the loop in blocks like the one below, even in preempt-disabled
sections.

	for_each_cpu(c, mask) {
		if (c == smp_processor_id())
			do_something
		do_something_else
	}

smp_processor_id() can be compiled with CONFIG_DEBUG_PREEMPT=y, where it
is also used to emit warnings, so that may be one reason the compiler
can't optimize it. __smp_processor_id() is arch specific, which may be
another reason.

Even with CONFIG_DEBUG_PREEMPT=n, the compiler didn't hoist it out of the
loop.

find_new_ilb disassembly on powerpc (CONFIG_DEBUG_PREEMPT=n):
c00000000028cc7c: bl c000000000a93c98 <_find_next_and_bit>
c00000000028cc80: nop
c00000000028cc84: lwz r5,0(r29)
c00000000028cc88: extsw r30,r3
c00000000028cc8c: mr r31,r3
c00000000028cc90: mr r26,r3
c00000000028cc94: cmplw r5,r3
c00000000028cc98: mr r3,r30
c00000000028cc9c: ble c00000000028ccf8 <kick_ilb+0x10c>
c00000000028cca0: lhz r9,8(r13)
# This is where smp_processor_id() is fetched, i.e. within the loop body.
c00000000028cca4: cmpw r9,r31
c00000000028cca8: beq c00000000028ccc0 <kick_ilb+0xd4>
c00000000028ccac: bl c0000000002cd938 <idle_cpu+0x8>
c00000000028ccb0: nop
c00000000028ccb4: cmpwi r3,0
c00000000028ccb8: bne c00000000028cd30 <kick_ilb+0x144>

find_new_ilb disassembly on x86 (CONFIG_DEBUG_PREEMPT=n):
ffffffff813588eb: call ffffffff81367b30 <housekeeping_cpumask>
ffffffff813588f0: xor %ecx,%ecx
ffffffff813588f2: mov $0xffffffffffffffff,%rsi
ffffffff813588f9: mov %rax,%r8
ffffffff813588fc: mov %rsi,%rdx
ffffffff813588ff: mov 0x29258ba(%rip),%rax # ffffffff83c7e1c0 <nohz>
ffffffff81358906: and (%r8),%rax
ffffffff81358909: shl %cl,%rdx
ffffffff8135890c: and %rdx,%rax
ffffffff8135890f: je ffffffff81358952 <sched_balance_trigger+0x142>
ffffffff81358911: tzcnt %rax,%rbx
ffffffff81358916: cmp $0x3f,%ebx
ffffffff81358919: ja ffffffff81358952 <sched_balance_trigger+0x142>
ffffffff8135891b: cmp %ebx,%gs:0x28e7712(%rip) # ffffffff83c40034 <cpu_number>
# This is smp_processor_id(), read inside the loop.
ffffffff81358922: mov %ebx,%edi
ffffffff81358924: je ffffffff81358946 <sched_balance_trigger+0x136>
ffffffff81358926: mov %r8,0x8(%rsp)
ffffffff8135892b: mov %ebx,(%rsp)
ffffffff8135892e: call ffffffff81365140 <idle_cpu>
ffffffff81358933: mov $0xffffffffffffffff,%rsi
ffffffff8135893a: mov (%rsp),%edi
ffffffff8135893d: mov 0x8(%rsp),%r8
ffffffff81358942: test %eax,%eax
ffffffff81358944: jne ffffffff813589a4 <sched_balance_trigger+0x194>
ffffffff81358946: lea 0x1(%rbx),%ecx

find_new_ilb disassembly on powerpc with the patched kernel:
c00000000028cc5c: 08 00 4d a3 lhz r26,8(r13)
# This is where smp_processor_id() is fetched, once, outside the loop.
...
c00000000028cc94: bl c000000000a93cd8 <_find_next_and_bit>
c00000000028cc98: nop
c00000000028cc9c: lwz r5,0(r29)
c00000000028cca0: extsw r30,r3
c00000000028cca4: mr r31,r3
...
c00000000028cca8: cmpw cr7,r26,r3
c00000000028ccb8: ble c00000000028cd14 <kick_ilb+0x118>
c00000000028ccbc: nop
c00000000028ccc0: beq cr7,c00000000028ccd8 <kick_ilb+0xdc>
c00000000028ccc4: bl c0000000002cd958 <idle_cpu+0x8>


With CONFIG_DEBUG_PREEMPT=y, smp_processor_id() does not print any
warning if preemption or IRQs are disabled.

With CONFIG_DEBUG_PREEMPT=n, it does nothing apart from evaluating
__smp_processor_id().

So with either CONFIG_DEBUG_PREEMPT=y or =n, it is better to cache the
value in a preemption-disabled section. It could save a few cycles per
iteration. Though tiny, that can add up when repeated in a loop.
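
To illustrate the idea outside the kernel, here is a minimal userspace
sketch. It is not taken from the patches: model_smp_processor_id(),
cpu_number_model, id_reads and NR_CPUS_MODEL are hypothetical names made
up for this example, and a volatile load merely stands in for a read the
compiler is not allowed to hoist. The counter makes the number of
fetches visible for both forms of the loop.

```c
/* Userspace model only: none of these helpers are kernel APIs. */

#define NR_CPUS_MODEL 8				/* hypothetical CPU count */

static volatile int cpu_number_model = 3;	/* stands in for the per-CPU cpu number */
static unsigned long id_reads;			/* how many times the id was fetched */

/* Hypothetical stand-in for smp_processor_id(): one volatile load per call. */
static int model_smp_processor_id(void)
{
	id_reads++;
	return cpu_number_model;
}

/* Before: the id is fetched again for every CPU set in the mask. */
static int count_others_uncached(unsigned long mask)
{
	int c, others = 0;

	for (c = 0; c < NR_CPUS_MODEL; c++) {
		if (!(mask & (1UL << c)))
			continue;
		if (c == model_smp_processor_id())
			continue;
		others++;
	}
	return others;
}

/* After: fetch once, then compare against the local copy in the loop. */
static int count_others_cached(unsigned long mask)
{
	int this_cpu = model_smp_processor_id();
	int c, others = 0;

	for (c = 0; c < NR_CPUS_MODEL; c++) {
		if (!(mask & (1UL << c)))
			continue;
		if (c == this_cpu)
			continue;
		others++;
	}
	return others;
}
```

Both functions return the same result for the same mask. Only the number
of id fetches differs: one per set bit in the first form, exactly one in
the second. The cached form is only valid while the task cannot migrate,
which is why the series restricts itself to preemption-disabled sections.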

This is done only for hot paths or functions that get called quite often.
It is skipped for init code and for conditionally hot paths such as
tracing/events.

While this was first sent out[1] along with another scheduler change, it
made more sense to send it as a separate series after spotting a few more
cases that fall in the same bucket.
[1]: https://lore.kernel.org/all/20260319065314.343932-1-sshegde@xxxxxxxxxxxxx/


Shrikanth Hegde (4):
sched/fair: get this cpu once in find_new_ilb
sched/core: get this cpu once in ttwu_queue_cond
smp: get this_cpu once in smp_call_function
timers: Get this_cpu once while clearing idle timer

kernel/sched/core.c | 6 ++++--
kernel/sched/fair.c | 4 ++--
kernel/smp.c | 4 ++--
kernel/time/timer.c | 5 +++--
4 files changed, 11 insertions(+), 8 deletions(-)

--
2.47.3