Re: [PATCH 1/4] sched/wakeup: Strengthen current_save_and_set_rtlock_wait_state()

From: Boqun Feng
Date: Sun Sep 12 2021 - 00:01:51 EST


On Thu, Sep 09, 2021 at 04:27:46PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 09, 2021 at 02:45:24PM +0100, Will Deacon wrote:
> > On Thu, Sep 09, 2021 at 12:59:16PM +0200, Peter Zijlstra wrote:
> > > While looking at current_save_and_set_rtlock_wait_state() I'm thinking
> > > it really ought to use smp_store_mb(), because something like:
> > >
> > > 	current_save_and_set_rtlock_wait_state();
> > > 	for (;;) {
> > > 		if (try_lock())
> > > 			break;
> > >
> > > 		raw_spin_unlock_irq(&lock->wait_lock);
> > > 		schedule();
> > > 		raw_spin_lock_irq(&lock->wait_lock);
> > >
> > > 		set_current_state(TASK_RTLOCK_WAIT);
> > > 	}
> > > 	current_restore_rtlock_saved_state();
> > >
> > > which is the advertised usage in the comment, is actually broken,
> > > since trylock() will only need a load-acquire in general and that
> > > could be re-ordered against the state store, which could lead to a
> > > missed wakeup -> BAD (tm).
> >
> > Why doesn't the UNLOCK of pi_lock in current_save_and_set_rtlock_wait_state()
> > order the state change before the successful try_lock? I'm just struggling
> > to envisage how this actually goes wrong.
>
> Moo yes, so the earlier changelog I wrote was something like:
>
> 	current_save_and_set_rtlock_wait_state();
> 	for (;;) {
> 		if (try_lock())
> 			break;
>
> 		raw_spin_unlock_irq(&lock->wait_lock);
> 		if (!cond)
> 			schedule();
> 		raw_spin_lock_irq(&lock->wait_lock);
>
> 		set_current_state(TASK_RTLOCK_WAIT);
> 	}
> 	current_restore_rtlock_saved_state();
>
> which is more what the code looks like before these patches, and in that
> case the @cond load can be lifted before __state.
>
> It all sorta works in the current application because most things are
> serialized by ->wait_lock, but given the 'normal' wait pattern I got
> highly suspicious of there not being a full barrier around.

Hmm.. I think ->pi_lock actually protects us here. IIUC, a missed
wake-up would happen if try_to_wake_up() failed to observe the __state
change by the about-to-wait task, and the about-to-wait task didn't
observe the condition set by the waker task, for example:

    TASK 0                                  TASK 1
    ======                                  ======
                                            cond = 1;
                                            ...
                                            try_to_wake_up(t0, TASK_RTLOCK_WAIT, ..):
                                              ttwu_state_match(...)
                                                if (t0->__state & TASK_RTLOCK_WAIT) // false
                                                  ..
                                                return false; // don't wake up
    ...
    current->__state = TASK_RTLOCK_WAIT
    ...
    if (!cond) // !cond is true because of memory reordering
      schedule(); // sleep, and may never be woken up again.
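
FWIW, the race above has the store-buffering (SB) shape: each side
stores to one variable and then loads the other. Below is a rough
userspace sketch of that shape using C11 relaxed atomics and pthreads,
purely for illustration. The names (state, cond, waiter(), waker(),
count_missed_wakeups()) are made up for this sketch, and relaxed atomics
are only an approximation of the kernel accesses. Both sides reading 0
is permitted here, but whether a given machine ever shows it in such a
loop is another matter, so the count may well be 0.

```c
/* Userspace analogue of the diagram above; NOT kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int state;	/* stand-in for t0->__state */
static atomic_int cond;		/* stand-in for the wake-up condition */
static int r0, r1;		/* what each side observed */

/* TASK 0: publish the wait state, then check the condition. */
static void *waiter(void *unused)
{
	(void)unused;
	atomic_store_explicit(&state, 1, memory_order_relaxed);
	r0 = atomic_load_explicit(&cond, memory_order_relaxed);
	return NULL;
}

/* TASK 1: set the condition, then look at the waiter's state. */
static void *waker(void *unused)
{
	(void)unused;
	atomic_store_explicit(&cond, 1, memory_order_relaxed);
	r1 = atomic_load_explicit(&state, memory_order_relaxed);
	return NULL;
}

/*
 * Run the race 'iterations' times and count the runs in which both
 * sides read 0: the waker saw no waiter and the waiter saw no condition.
 */
int count_missed_wakeups(int iterations)
{
	int missed = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t t0, t1;
		int rc;

		atomic_store(&state, 0);
		atomic_store(&cond, 0);

		rc = pthread_create(&t0, NULL, waiter, NULL);
		assert(rc == 0);
		rc = pthread_create(&t1, NULL, waker, NULL);
		assert(rc == 0);
		(void)rc;

		/* Joining makes the threads' writes to r0/r1 visible here. */
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		if (r0 == 0 && r1 == 0)
			missed++;
	}
	return missed;
}
```

Call count_missed_wakeups() from a small main() and build with
-pthread. A non-zero count demonstrates the reordering; a zero count
proves nothing.
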

But let's add ->pi_lock critical sections into the example:

    TASK 0                                  TASK 1
    ======                                  ======
                                            cond = 1;
                                            ...
                                            try_to_wake_up(t0, TASK_RTLOCK_WAIT, ..):
                                              raw_spin_lock_irqsave(->pi_lock,...);
                                              ttwu_state_match(...)
                                                if (t0->__state & TASK_RTLOCK_WAIT) // false
                                                  ..
                                                return false; // don't wake up
                                              raw_spin_unlock_irqrestore(->pi_lock,...); // A
    ...
    raw_spin_lock_irqsave(->pi_lock, ...); // B
    current->__state = TASK_RTLOCK_WAIT
    raw_spin_unlock_irqrestore(->pi_lock, ...);
    if (!cond)
      schedule();

Now the read of cond on TASK0 must observe the store of cond on TASK1:
accesses to __state are serialized by ->pi_lock, so if TASK1's read of
__state didn't observe TASK0's write to __state, then the lock B must
read from the unlock A (or another unlock co-after A), and we have a
release-acquire pair guaranteeing that the read of cond on TASK0 sees
the write of cond on TASK1. This can be simplified into the litmus
test below:

C unlock-lock
{
}

P0(spinlock_t *s, int *cond, int *state)
{
	int r1;

	spin_lock(s);
	WRITE_ONCE(*state, 1);
	spin_unlock(s);
	r1 = READ_ONCE(*cond);
}

P1(spinlock_t *s, int *cond, int *state)
{
	int r1;

	WRITE_ONCE(*cond, 1);
	spin_lock(s);
	r1 = READ_ONCE(*state);
	spin_unlock(s);
}

exists (0:r1=0 /\ 1:r1=0)

and the result is:

Test unlock-lock Allowed
States 3
0:r1=0; 1:r1=1;
0:r1=1; 1:r1=0;
0:r1=1; 1:r1=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (0:r1=0 /\ 1:r1=0)
Observation unlock-lock Never 0 3
Time unlock-lock 0.01
Hash=e1f914505f07e380405f65d3b0fb6940
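
For anyone who wants to poke at this outside herd7, here is a rough
userspace analogue of the litmus test above. It is only a sketch under
stated assumptions: a pthread mutex stands in for ->pi_lock, C11 relaxed
atomics stand in for READ_ONCE()/WRITE_ONCE(), and all names
(wait_state, wake_cond, locked_waiter(), locked_waker(),
count_forbidden_outcomes()) are hypothetical. In this userspace version,
if the waker reads wait_state == 0 then its critical section ran first,
so its unlock synchronizes with the waiter's lock and the waiter must
see wake_cond == 1.

```c
/* Userspace analogue of the unlock-lock litmus test; NOT kernel code. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int wait_state;	/* stand-in for t0->__state */
static atomic_int wake_cond;	/* stand-in for the wake-up condition */
static int seen_cond;		/* P0's r1 */
static int seen_state;		/* P1's r1 */

/* P0: write the state under the lock, read the condition after unlock. */
static void *locked_waiter(void *unused)
{
	(void)unused;
	pthread_mutex_lock(&pi_lock);
	atomic_store_explicit(&wait_state, 1, memory_order_relaxed);
	pthread_mutex_unlock(&pi_lock);
	seen_cond = atomic_load_explicit(&wake_cond, memory_order_relaxed);
	return NULL;
}

/* P1: set the condition, then read the state under the lock. */
static void *locked_waker(void *unused)
{
	(void)unused;
	atomic_store_explicit(&wake_cond, 1, memory_order_relaxed);
	pthread_mutex_lock(&pi_lock);
	seen_state = atomic_load_explicit(&wait_state, memory_order_relaxed);
	pthread_mutex_unlock(&pi_lock);
	return NULL;
}

/*
 * Run both sides 'iterations' times and count the runs matching the
 * "exists" clause (0:r1=0 /\ 1:r1=0). The expected count is 0.
 */
int count_forbidden_outcomes(int iterations)
{
	int forbidden = 0;

	for (int i = 0; i < iterations; i++) {
		pthread_t t0, t1;
		int rc;

		atomic_store(&wait_state, 0);
		atomic_store(&wake_cond, 0);

		rc = pthread_create(&t0, NULL, locked_waiter, NULL);
		assert(rc == 0);
		rc = pthread_create(&t1, NULL, locked_waker, NULL);
		assert(rc == 0);
		(void)rc;

		/* Joining makes seen_cond/seen_state visible here. */
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		if (seen_cond == 0 && seen_state == 0)
			forbidden++;
	}
	return forbidden;
}
```

Build with -pthread and call count_forbidden_outcomes() from a small
main(). A stress loop like this can only fail to find the outcome; the
herd7 result above is the actual argument.
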

In short, since we write to __state with ->pi_lock held, I don't
think we need smp_store_mb() for __state. But maybe I'm missing
something subtle here ;-)

Regards,
Boqun