Re: Loadavg accounting error on arm64
From: Peter Zijlstra
Date: Mon Nov 16 2020 - 09:20:17 EST
On Mon, Nov 16, 2020 at 01:37:21PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote:
> > Anyway, setting all that aside, I do agree with you that the bitfield usage
> > in task_struct looks pretty suspicious. For example, in __schedule() we
> > have:
> >
> > rq_lock(rq, &rf);
> > smp_mb__after_spinlock();
> > ...
> > prev_state = prev->state;
> >
> > if (!preempt && prev_state) {
> > if (signal_pending_state(prev_state, prev)) {
> > prev->state = TASK_RUNNING;
> > } else {
> > prev->sched_contributes_to_load =
> > (prev_state & TASK_UNINTERRUPTIBLE) &&
> > !(prev_state & TASK_NOLOAD) &&
> > !(prev->flags & PF_FROZEN);
> > ...
> > deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> >
> > where deactivate_task() updates p->on_rq directly:
> >
> > p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
> >
>
> It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> document ttwu()") which changed it. Not sure why that is and didn't
> think about it too deep as it didn't appear to be directly related to
> the problem and didn't have ordering consequences.
I'm confused; that commit didn't change deactivate_task(). Anyway,
->on_rq should be strictly under rq->lock. That said, since there is a
READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
WRITE_ONCE().
> > __ttwu_queue_wakelist() we have:
> >
> > p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> >
> > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > updates can race and cause the flags to be corrupted?
> >
>
> I think this is at least one possibility. I think at least that one
> should only be explicitly set on WF_MIGRATED and explicitly cleared in
> sched_ttwu_pending. While I haven't audited it fully, it might be enough
> to avoid a double write outside of the rq lock on the bitfield but I
> still need to think more about the ordering of sched_contributes_to_load
> and whether it's ordered by p->on_cpu or not.
The scenario you're worried about is something like:
CPU0 CPU1
schedule()
prev->sched_contributes_to_load = X;
deactivate_task(prev);
try_to_wake_up()
if (p->on_rq &&) // false
if (smp_load_acquire(&p->on_cpu) && // true
ttwu_queue_wakelist())
p->sched_remote_wakeup = Y;
smp_store_release(prev->on_cpu, 0);
And then the stores of X and Y clobber one another.. Hummph, seems
reasonable. One quick thing to test would be something like this:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..9844e541c94c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,9 @@ struct task_struct {
unsigned sched_reset_on_fork:1;
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
+ unsigned :0;
unsigned sched_remote_wakeup:1;
+ unsigned :0;
#ifdef CONFIG_PSI
unsigned sched_psi_wake_requeue:1;
#endif