Re: [RFC][PATCH 5/5] sched: Reduce ttwu rq->lock contention

From: Peter Zijlstra
Date: Fri Dec 17 2010 - 13:25:31 EST


On Fri, 2010-12-17 at 18:50 +0100, Oleg Nesterov wrote:
> On 12/17, Oleg Nesterov wrote:
> >
> > On 12/16, Peter Zijlstra wrote:
> > >
> > > +	if (p->se.on_rq && ttwu_force(p, state, wake_flags))
> > > +		return 1;
> >
> > ----- WINDOW -----
> >
> > > +	for (;;) {
> > > +		unsigned int task_state = p->state;
> > > +
> > > +		if (!(task_state & state))
> > > +			goto out;
> > > +
> > > +		load = task_contributes_to_load(p);
> > > +
> > > +		if (cmpxchg(&p->state, task_state, TASK_WAKING) == task_state)
> > > +			break;
> >
> > Suppose that we have a task T sleeping in TASK_INTERRUPTIBLE state,
> > and this cpu does try_to_wake_up(TASK_INTERRUPTIBLE). on_rq == false.
> > try_to_wake_up() starts the "for (;;)" loop.
> >
> > However, in the WINDOW above, it is possible that somebody else wakes
> > it up, and then this task changes its state to TASK_INTERRUPTIBLE again.
> >
> > Then we set ->state = TASK_WAKING, but this (still running) T restores
> > TASK_RUNNING after us.
>
> Even simpler. This can race with, say, __migrate_task() which does
> deactivate_task + activate_task and temporarily clears on_rq. Although
> this is simple to fix, I think.

Yes, another hole..
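
To make the hole concrete, here is a user-space sketch of the claim loop
from the quoted hunk, using C11 atomics in place of the kernel's cmpxchg().
The names (claim_task, try_claim_wakeup) and the state values are made up
for illustration; this is a model under those assumptions, not the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative values only; not the kernel's definitions. */
enum {
	M_TASK_RUNNING       = 0,
	M_TASK_INTERRUPTIBLE = 1,
	M_TASK_WAKING        = 256,
};

/* Hypothetical stand-in for the one task_struct field the loop touches. */
struct claim_task {
	_Atomic unsigned int state;
};

/*
 * Try to move the task from a state matching @mask to M_TASK_WAKING.
 * Returns true if this caller won the transition, false if the task
 * was not in a matching state when sampled.
 */
static bool try_claim_wakeup(struct claim_task *p, unsigned int mask)
{
	assert(mask != 0);

	for (;;) {
		unsigned int old = atomic_load(&p->state);

		if (!(old & mask))
			return false;

		/* Succeeds only if ->state still holds the value sampled above. */
		if (atomic_compare_exchange_strong(&p->state, &old, M_TASK_WAKING))
			return true;
	}
}
```

Note that the compare-and-exchange only proves ->state holds the same
value it held a moment ago. It cannot tell a task that stayed asleep from
one that was woken and went back to TASK_INTERRUPTIBLE in between, which
is exactly the WINDOW case.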

> Also. Afaics, without rq->lock, we can't trust "while (p->oncpu)", at
> least we need rmb() after that.

I think Linus once argued that loops like that should be fine without an
rmb(); at worst they'll have to spin a few more times to observe the
1->0 switch (we don't care about the 0->1 switch in this case because
that's ruled out by the ->state test).
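
A user-space sketch of such a loop, again in C11 atomics and with
hypothetical names (spin_task, wait_until_off_cpu). The spin bound is an
addition so the function can be exercised in isolation; the kernel loop
has none. The acquire fence is only a rough analogue of the rmb() Oleg
asks for: dropping it does not stop the loop from eventually seeing
1->0, it only stops ordering later reads after that observation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one field the loop polls. */
struct spin_task {
	_Atomic int oncpu;	/* 1 while the task is still running on some cpu */
};

/*
 * Poll ->oncpu until it reads 0 or @max_spins polls have been made.
 * Returns true once the 1->0 switch has been observed.
 */
static bool wait_until_off_cpu(struct spin_task *p, unsigned long max_spins)
{
	assert(max_spins > 0);

	for (unsigned long i = 0; i < max_spins; i++) {
		if (!atomic_load_explicit(&p->oncpu, memory_order_relaxed)) {
			/* Order subsequent reads after the observed 1->0 switch. */
			atomic_thread_fence(memory_order_acquire);
			return true;
		}
	}
	return false;
}
```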

> Interestingly, I can't really understand the current meaning of smp_wmb()
> in finish_lock_switch(). Do you know what exactly it buys?

I _think_ it's meant to ensure the full context switch happened and we've
stored all changes to the rq structure (destroying all references to
prev); in particular, that we've finished writing the new value of current.

> In any case,
> task_running() (or its callers) do not have the corresponding rmb().
> Say, currently try_to_wake_up()->task_waking() can miss all changes
> starting from prepare_lock_switch(). Hopefully this is OK, but I am
> confused ;)

So I thought I saw how we are OK there, but then I got myself confused
too :-)

My argument was something along these lines: there must be some
serialization between the task going to sleep and another task waking it
(the task sets TASK_UNINTERRUPTIBLE and enqueues itself on a waitqueue,
and the waker finds it on the waitqueue). This should be sufficient to
make ->state visible to the waker.

If the waker observes a !TASK_RUNNING ->state, then by definition it
must see all the changes previous to it (including the ->oncpu 0->1
transition).
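
As a user-space sketch of that argument: below, a release store and an
acquire load on ->state stand in for whatever serialization the
waitqueue provides. That pairing is an assumption of the model, not
something the scheduler code is shown to do, and the names (mp_task,
sleeper_publish, waker_observe) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { S_RUNNING = 0, S_SLEEPING = 2 };	/* illustrative values */

struct mp_task {
	int oncpu;			/* plain field, written before ->state */
	_Atomic unsigned int state;
};

/* Sleeper side: the task is on a cpu when it publishes its sleep state. */
static void sleeper_publish(struct mp_task *p)
{
	p->oncpu = 1;
	/* Release: everything written above is visible to an acquiring reader. */
	atomic_store_explicit(&p->state, S_SLEEPING, memory_order_release);
}

/*
 * Waker side: returns false if the task still looks runnable. Otherwise
 * stores the ->oncpu value it can see into *oncpu and returns true.
 */
static bool waker_observe(struct mp_task *p, int *oncpu)
{
	assert(oncpu != NULL);

	if (atomic_load_explicit(&p->state, memory_order_acquire) == S_RUNNING)
		return false;

	*oncpu = p->oncpu;
	return true;
}
```

With that pairing, a waker that sees the non-running ->state is
guaranteed to also see oncpu == 1; without some such ordering the plain
read of ->oncpu carries no guarantee.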

But like I said, I got my brain in a twist too.