Re: [PATCH] sched,workqueue: Use READ_ONCE()/WRITE_ONCE() for wake_cpu accesses
From: Peter Zijlstra
Date: Mon Mar 30 2026 - 10:47:55 EST
On Fri, Mar 27, 2026 at 07:04:20AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Mar 27, 2026 at 10:27:39AM +0100, Peter Zijlstra wrote:
> > On Fri, Mar 27, 2026 at 03:30:07PM +0800, Yu Peng wrote:
> > > task_struct->wake_cpu is used as a wake placement hint by scheduler code
> > > and workqueue's non-strict affinity repatriation path.
> > >
> > > These accesses are intentionally lockless and stale values are tolerated,
> > > affecting only wakeup placement.
> >
> > The scheduler usage isn't lockless. And I'm not at all sure why and how
> > workqueue is poking at this stuff :/ It has no business poking at it.
>
> Oh, you were involved in the discussion. It's workqueue using ->wake_cpu as
> a way to implement opportunistic affinity at work boundaries, since past
> history doesn't mean anything at these boundaries.
>
> https://lore.kernel.org/all/20230519001709.2563-1-tj@xxxxxxxxxx/
Bah, clearly I didn't remember ...
> Hmmm... it would trigger KCSAN. Maybe a better way to do it is exposing a
> sched interface that does WRITE_ONCE() wrapping? How do you want it resolved?
Perhaps a try_to_wake_up() variant that takes a cpumask (see the
completely untested code below). But I'm not entirely sure. The current
thing very much relies on a bunch of implementation details that aren't
guaranteed.
The below will 'fake' the task's cpumask for a while; and note that the
caller has to be careful to provide a mask that only contains CPUs that
are set in the task's current mask, otherwise 'funny' things will happen.
Also, it will only use this mask if it ends up doing CPU selection at
all.
Ooh, I suppose I should make sure wake_cpu is inside the provided mask
too...
That said, I don't think there is any urgency here, this code has been
like this for a long time, it can stay this way a little longer.
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 496dff740dca..81006b5c881d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4089,7 +4089,8 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
* Return: %true if @p->state changes (an actual wakeup was done),
* %false otherwise.
*/
-int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags,
+ const struct cpumask *wake_mask)
{
guard(preempt)();
int cpu, success = 0;
@@ -4128,6 +4129,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
* in set_current_state() that the waiting thread does.
*/
scoped_guard (raw_spinlock_irqsave, &p->pi_lock) {
+ const struct cpumask *prev_mask = NULL;
+
smp_mb__after_spinlock();
if (!ttwu_state_match(p, state, &success))
break;
@@ -4227,7 +4230,14 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
+ if (wake_mask && !is_migration_disabled(p)) {
+ prev_mask = p->cpus_ptr;
+ p->cpus_ptr = wake_mask;
+ }
cpu = select_task_rq(p, p->wake_cpu, &wake_flags);
+ if (prev_mask)
+ p->cpus_ptr = prev_mask;
+
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
delayacct_blkio_end(p);
@@ -4371,13 +4381,13 @@ struct task_struct *cpu_curr_snapshot(int cpu)
*/
int wake_up_process(struct task_struct *p)
{
- return try_to_wake_up(p, TASK_NORMAL, 0);
+ return try_to_wake_up(p, TASK_NORMAL, 0, NULL);
}
EXPORT_SYMBOL(wake_up_process);
int wake_up_state(struct task_struct *p, unsigned int state)
{
- return try_to_wake_up(p, state, 0);
+ return try_to_wake_up(p, state, 0, NULL);
}
/*
@@ -7247,7 +7257,7 @@ int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flag
void *key)
{
WARN_ON_ONCE(wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
- return try_to_wake_up(curr->private, mode, wake_flags);
+ return try_to_wake_up(curr->private, mode, wake_flags, NULL);
}
EXPORT_SYMBOL(default_wake_function);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cca..4c9d475da072 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3760,7 +3760,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
-extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags,
+ const struct cpumask *wake_mask);
#ifdef CONFIG_PREEMPT_DYNAMIC
extern int preempt_dynamic_mode;
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 0fef6496c4c8..3eb2002fc847 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -27,7 +27,7 @@ void swake_up_locked(struct swait_queue_head *q, int wake_flags)
return;
curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
- try_to_wake_up(curr->task, TASK_NORMAL, wake_flags);
+ try_to_wake_up(curr->task, TASK_NORMAL, wake_flags, NULL);
list_del_init(&curr->task_list);
}
EXPORT_SYMBOL(swake_up_locked);