Re: [Bug, sched, 5.8-rc2]: PREEMPT kernels crashing in check_preempt_wakeup() running fsx on XFS
From: Peter Zijlstra
Date: Wed Jul 01 2020 - 09:42:51 EST
On Wed, Jul 01, 2020 at 09:06:44PM +1000, Dave Chinner wrote:
> > +/*
> > + * Serialization rules:
> > + *
> > + * Lock order:
> > + *
> > + * p->pi_lock
> > + * rq->lock
> > + * hrtimer_clock_base::cpu_base->lock
>
> So these are accessed by the task_rq_lock() and rq_lock()
> interfaces, right? And we take clock locks inside these because we
> need to update timestamps/task clocks when tasks are first
> scheduled?
That's the hrtimer_start(), which we use for bandwidth control.
sched_clock() is lockless.
> What are all the rq_*pin_[un]lock() calls and how do they
> factor into this?
Ah, that's a lockdep annotation to ensure the rq->lock doesn't get
unlocked by accident. We've had a bunch of races over the years where
code accidentally unlocked rq->lock.
Basically lockdep_pin_lock() allows you to 'pin' the lock and it will
return a cookie. Any unlock of a pinned lock will generate a warning.
lockdep_unpin_lock() requires the previously provided cookie, or again
it will complain.
Not sure where to put words on that, but I'll find something.
> > + *
> > + * rq1->lock
> > + * rq2->lock where: &rq1 < &rq2
> > + *
>
> Ok, I'm guessing this is when task migration is being performed?
Yeah.
> > + * Regular state:
> > + *
> > + * Normal scheduling state is serialized by rq->lock. __schedule() takes the
> > + * local CPU's rq->lock, it optionally removes the task from the runqueue and
> > + * always looks at the local rq data structures to find the most elegible task
> > + * to run next.
> > + *
> > + * Task enqueue is also under rq->lock, possibly taken from another CPU.
> > + * Wakeups from another LLC domain might use an IPI to transfer the enqueue
> > + * to the local CPU to avoid bouncing the runqueue state around.
>
> That's ttwu_queue_wakelist()?
Yes, added a reference to that.
> I'm a little confused as to where ttwu_remote() fits into remote
> wakeups, though. That piece of the puzzle hasn't dropped into place
> yet...
Ah, so try_to_wake_up() is about another task/context changing our
p->state.
Assume:
for (;;) {
set_current_task(TASK_UNINTERRUPTIBLE)
if (COND)
break;
schedule();
}
__set_current_task(TASK_RUNNING);
Now, suppose another CPU/IRQ/etc.. does a wakeup between
set_current_task() and schedule(). At that point our @p is still a
running/runnable task, ttwu_remote() deals with this case.
I'll improve the ttwu_remote() comment.
> > + * Task wakeup, specifically wakeups that involve migration, are horribly
> > + * complicated to avoid having to take two rq->locks.
> > + *
> > + * Special state:
> > + *
> > + * p->state <- TASK_*
> > + * p->on_cpu <- { 0, 1 }
> > + * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }
> > + * task_cpu(p) <- set_task_cpu()
> > + *
> > + * System-calls and anything external will use task_rq_lock() which acquires
> > + * both p->lock and rq->lock. As a consequence things like p->cpus_ptr are
> > + * stable while holding either lock.
>
> OK. Might be worthwhile iterating the objects that have this "stable
> with one lock, modified under both locks" structures...
Agreed, that was on the todo list. I called out p->cpus_ptr because
that's directly used in the ttwu() code, but there's more. I'll dig them
all out.
> > + *
> > + * p->state is changed locklessly using set_current_state(),
> > + * __set_current_state() or set_special_state(), see their respective comments.
>
> /me goes code spelunking
>
> Ok, so those comments talk about "using a barrier" to order p->state
> changes against external logic to avoid wakeup races, not about how
> they are ordered or used internally.
Correct. Although we also make internal use of these memory barriers, us
being frugal folk :-) You'll find it referenced in the ttwu() ordering
comments, also some architectural code (switch_to()) relies on it.
> But it also mentions that the barrier matches with a full barrier in
> wake_up_state() and that calls straight into ttwu(). Ah, so that's
> what this "CONDITION" is - it's the external wakeup checks!
>
> /*
> * If we are going to wake up a thread waiting for CONDITION we
> * need to ensure that CONDITION=1 done by the caller can not be
> * reordered with p->state check below. This pairs with mb() in
> * set_current_state() the waiting thread does.
> */
> raw_spin_lock_irqsave(&p->pi_lock, flags);
> smp_mb__after_spinlock();
> if (!(p->state & state))
> goto unlock;
>
> My brain now connects this straight to well known task sleep/wakeup
> code patterns, and this bit now makes a lot more sense....
Just so, I'll see if I can make the comments with
set_current_state()/try_to_wake_up()/__schedule() more consistent with
one another. They're all trying to describe much the same as they
co-operate to make that sleep/wakeup pattern work.
> > + * p->on_rq is set by activate_task() and cleared by deactivate_task(), under
> > + * rq->lock. Non-zero indicates the task is runnable, the special
> > + * ON_RQ_MIGRATING state is used for migration without holding both rq->locks.
> > + * It indicates task_cpu() is not stable, see *task_rq_lock().
> > + *
> > + * task_cpu(p) is changed by set_task_cpu(), the rules are intricate but basically:
> > + *
> > + * - don't call set_task_cpu() on a blocked task
>
> why? It's good to document the rule, but I haven't been able to find
> an obvious explanation of why this rule exists....
Ah, I forgot if there was a techinical reason for it, but the conceptual
reason is that if a task isn't running, it's CPU assignment is
irrelevant, it doesn't matter what CPU you're not running on.
So I created rules where we only care about the CPU assignment for
runnable tasks. I'll add this.
(it makes hotplug easier, we don't have to even care if the cpu
assignment is even possible)
> > + *
> > + * - for try_to_wake_up(), called under p->pi_lock
>
> What's the reason for it being under different locks if it's already
> in runnable state? i.e. the lock that needs to be held is spelled
> out on set_task_cpu():
>
> * The caller should hold either p->pi_lock or rq->lock, when changing
> *a task's CPU. ->pi_lock for waking tasks, rq->lock for runnable tasks.
>
> But there's no explanation there or here of why the different locks
> need to be used or how the two different situations don't
> overlap/contend/race because they aren't holding the same locks. So
> while I know rule exists, I don't understand the relationship
> or logic that the rule emerges from.
There's a hint here:
+ * Task wakeup, specifically wakeups that involve migration, are horribly
+ * complicated to avoid having to take two rq->locks.
So ttwu() used to first acquire the 'old' rq->lock, then do
CPU-selection and if new-cpu != old-cpu (a fairly common case) do
set_task_cpu(), drop the old rq->lock, acquire new rq->lock and do the
enqueue. Or something along those lines.
It was found that the initial rq->lock in case of migration added
significantly to the rq->lock contention for a number of workloads, so
we made things complicated...
I'll go explain it better.
Thanks Dave!