Re: [PATCH] workqueue: fix selection of wake_cpu in kick_pool()

From: Sven Schnelle
Date: Fri Apr 19 2024 - 04:28:10 EST


Hi Tejun,

Tejun Heo <tj@xxxxxxxxxx> writes:

> On Wed, Apr 17, 2024 at 05:36:38PM +0200, Sven Schnelle wrote:
>> > This generally seems like a good idea but isn't this still racy? The CPU may
>> > go down between setting p->wake_cpu and wake_up_process().
>>
>> Don't know without reading the source, but how does this code normally
>> protect against that?
>
> Probably by wrapping determining the wake_cpu and the wake_up inside
> cpu_read_lock() section.

Do you mean rcu_read_lock()? cpus_read_lock() takes a sleeping lock
(cpu_hotplug_lock), and the crash happens in softirq context, so
cpus_read_lock() can't be the correct lock.

If I read the code correctly, CPU hotplug uses stop_machine_cpuslocked(),
so rcu_read_lock() should be sufficient for non-atomic context.

Looking at the backtrace, the crash actually happens in
arch_vcpu_is_preempted(). I don't know the semantics of that function:
whether it is OK to call it for offline CPUs, or whether the calling
code should make sure that the CPU is online (which would be my guess).

Following the backtrace from my initial mail, I can't find any place
that checks whether p->wake_cpu is actually online. Eventually,
available_idle_cpu() calls vcpu_is_preempted(). Should
available_idle_cpu() do a cpu_online() check right at the beginning?
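To make the suggestion concrete, here is a minimal userspace model of it.
It assumes that available_idle_cpu() currently checks idle_cpu() and then
vcpu_is_preempted(), and it adds an online check in front of both. All
*_model helpers, the NR_CPUS_MODEL constant and the state arrays are
hypothetical stand-ins for cpu_online(), idle_cpu() and the arch hook,
not kernel APIs; a real patch would simply call cpu_online(cpu).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state: stands in for cpu_online_mask, the idle
 * state and the hypervisor's preemption info. CPU 2 is offline but
 * still looks idle, which is the case that reaches the arch hook. */
#define NR_CPUS_MODEL 4

static bool model_online[NR_CPUS_MODEL]    = { true, true, false, true };
static bool model_idle[NR_CPUS_MODEL]      = { true, false, true, true };
static bool model_preempted[NR_CPUS_MODEL] = { false, false, false, true };

/* Counts queries of the (modelled) preemption hook for offline CPUs. */
static int preempt_queries_offline;

/* Hypothetical stand-in for cpu_online(). */
static bool cpu_online_model(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	return model_online[cpu];
}

/* Hypothetical stand-in for idle_cpu(). */
static bool idle_cpu_model(int cpu)
{
	return model_idle[cpu];
}

/* Hypothetical stand-in for vcpu_is_preempted(); records whether it
 * was reached for an offline CPU, as in the reported crash. */
static bool vcpu_is_preempted_model(int cpu)
{
	if (!model_online[cpu])
		preempt_queries_offline++;
	return model_preempted[cpu];
}

/* Same shape as available_idle_cpu(), with the proposed online
 * check done first so offline CPUs never reach the preemption hook. */
static int available_idle_cpu_model(int cpu)
{
	if (!cpu_online_model(cpu))
		return 0;

	if (!idle_cpu_model(cpu))
		return 0;

	if (vcpu_is_preempted_model(cpu))
		return 0;

	return 1;
}
```

Without the first check, CPU 2 (offline but idle) would fall through to
the preemption query; with it, the function returns 0 before that.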

Adding Peter to Cc; he probably knows.

Thanks,
Sven