Re: [PATCH 2/2] clocksource: Defer preempt_disable() after clocksource_verify_choose_cpus()

From: Waiman Long
Date: Fri Jan 24 2025 - 15:41:58 EST


On 1/24/25 3:37 PM, Paul E. McKenney wrote:
On Fri, Jan 24, 2025 at 12:25:19PM -0800, Paul E. McKenney wrote:
On Fri, Jan 24, 2025 at 01:59:23PM -0500, Waiman Long wrote:
The following bug report happened in a PREEMPT_RT kernel.

[ 30.957705] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[ 30.957711] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2012, name: kwatchdog
[ 30.962673] preempt_count: 1, expected: 0
[ 30.962676] RCU nest depth: 0, expected: 0
[ 30.962680] 3 locks held by kwatchdog/2012:
[ 30.962684] #0: ffffffff8af2da60 (clocksource_mutex){+.+.}-{3:3}, at: clocksource_watchdog_kthread+0x13/0x50
[ 30.967703] #1: ffffffff8aa8d4d0 (cpu_hotplug_lock){++++}-{0:0}, at: clocksource_verify_percpu.part.0+0x5c/0x330
[ 30.972774] #2: ffff9fe02f5f33e0 ((batched_entropy_u32.lock)){+.+.}-{2:2}, at: get_random_u32+0x4f/0x110
[ 30.977827] Preemption disabled at:
[ 30.977830] [<ffffffff88c1fe56>] clocksource_verify_percpu.part.0+0x66/0x330
[ 30.982837] CPU: 33 PID: 2012 Comm: kwatchdog Not tainted 5.14.0-503.23.1.el9_5.x86_64+rt-debug #1
[ 30.982843] Hardware name: HPE ProLiant DL385 Gen10 Plus/ProLiant DL385 Gen10 Plus, BIOS A42 04/29/2021
[ 30.982846] Call Trace:
[ 30.982850] <TASK>
[ 30.983821] dump_stack_lvl+0x57/0x81
[ 30.983821] __might_resched.cold+0xf4/0x12f
[ 30.983824] rt_spin_lock+0x4c/0x100
[ 30.988833] get_random_u32+0x4f/0x110
[ 30.988833] clocksource_verify_choose_cpus+0xab/0x1a0
[ 30.988833] clocksource_verify_percpu.part.0+0x6b/0x330
[ 30.993894] __clocksource_watchdog_kthread+0x193/0x1a0
[ 30.993898] clocksource_watchdog_kthread+0x18/0x50
[ 30.993898] kthread+0x114/0x140
[ 30.993898] ret_from_fork+0x2c/0x50
[ 31.002864] </TASK>

It is due to the fact that get_random_u32() is called in
clocksource_verify_choose_cpus() with preemption disabled.
If crng_ready() is true by the time get_random_u32() is called, The
batched_entropy_32 local lock will be acquired. In PREEMPT_RT kernel,
it is a rtmutex and we can't acquire it with preemption disabled.

To avoid this problem, we can't call get_random_u32() with preemption
disabled. However, smp_processor_id() has to be called with preemption
disabled though and we have to exclude the current CPU from the
cpus_chosen list to be tested.

Extract current CPU removal code out from
clocksource_verify_choose_cpus() and defer the preempt_disable()
call to after clocksource_verify_choose_cpus() and before
current CPU removal. Also use raw_smp_processor_id() in
clocksource_verify_choose_cpus().

Fixes: 7560c02bdffb ("clocksource: Check per-CPU clock synchronization when marked unstable")
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
Good catch!

But we don't need cryptographically secure random numbers (or blistering
speed) here. Substituting something like torture_random()%nr_cpu_ids
for get_random_u32_below() work?
I suppose I should add my concern... If we don't have preemption disabled
across this code, we cannot reliably avoid attempting to IPI ourselves.

Does the CPU choosing process itsself needs to have preemption disabled? I thought it was because of need to use smp_processor_id() and have the current CPU excluded. Preemption is still disabled after that.

Cheers,
Longman