Re: [PATCH v2] rcu: Fix missed wakeup of exp_wq waiters

From: Neeraj Upadhyay
Date: Thu Nov 21 2019 - 22:41:00 EST




On 11/21/2019 8:31 PM, Paul E. McKenney wrote:
On Thu, Nov 21, 2019 at 09:48:05AM +0530, Neeraj Upadhyay wrote:
On 11/21/2019 9:37 AM, Paul E. McKenney wrote:
On Wed, Nov 20, 2019 at 10:28:38AM +0530, Neeraj Upadhyay wrote:

On 11/20/2019 1:08 AM, Paul E. McKenney wrote:
On Tue, Nov 19, 2019 at 03:17:07AM +0000, Neeraj Upadhyay wrote:
For the tasks waiting in exp_wq inside exp_funnel_lock(),
there is a chance that they might be indefinitely blocked
in below scenario:

1. There is a task waiting on exp sequence 0b'100' inside
exp_funnel_lock(). This task blocks at wq index 1.

synchronize_rcu_expedited()
s = 0b'100'
exp_funnel_lock()
wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3]

2. The expedited grace period (which above task blocks for)
completes and task (task1) holding exp_mutex queues
worker and schedules out.

synchronize_rcu_expedited()
s = 0b'100'
queue_work(rcu_gp_wq, &rew.rew_work)
wake_up_worker()
schedule()

3. kworker A picks up the queued work and completes the exp gp
sequence and then blocks on exp_wake_mutex, which is held
by another kworker, which is doing wakeups for expedited_sequence
0.

rcu_exp_wait_wake()
rcu_exp_wait_wake()
rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
// to 0b'100'
mutex_lock(&rcu_state.exp_wake_mutex)

4. task1 does not enter wait queue, as sync_exp_work_done() returns true,
and releases exp_mutex.

wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
sync_exp_work_done(rsp, s));
mutex_unlock(&rsp->exp_mutex);

5. Next exp GP completes, and sequence number is incremented:

rcu_exp_wait_wake()
rcu_exp_wait_wake()
rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'200'

6. kworker A acquires exp_wake_mutex. As it uses current
expedited_sequence, it wakes up workers from wrong wait queue
index - it should have worken wait queue corresponding to
0b'100' sequence, but wakes up the ones for 0b'200' sequence.
This results in task at step 1 indefinitely blocked.

rcu_exp_wait_wake()
wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);

This issue manifested as DPM device timeout during suspend, as scsi
device was stuck in _synchronize_rcu_expedited().

schedule()
synchronize_rcu_expedited()
synchronize_rcu()
scsi_device_quiesce()
scsi_bus_suspend()
dpm_run_callback()
__device_suspend()

Fix this by using the correct exp sequence number, the one which
owner of the exp_mutex initiated and passed to kworker,
to index the wait queue, inside rcu_exp_wait_wake().

Signed-off-by: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>

Queued, thank you!

I reworked the commit message to make it easier to follow the sequence
of events. Please see below and let me know if I messed anything up.

Thanx, Paul

------------------------------------------------------------------------

commit d887fd2a66861f51ed93b5dde894b9646a5569dd
Author: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
Date: Tue Nov 19 03:17:07 2019 +0000

rcu: Fix missed wakeup of exp_wq waiters
Tasks waiting within exp_funnel_lock() for an expedited grace period to
elapse can be starved due to the following sequence of events:
1. Tasks A and B both attempt to start an expedited grace
period at about the same time. This grace period will have
completed when the lower four bits of the rcu_state structure's
->expedited_sequence field are 0b'0100', for example, when the
initial value of this counter is zero. Task A wins, and thus
does the actual work of starting the grace period, including
acquiring the rcu_state structure's .exp_mutex and sets the
counter to 0b'0001'.
2. Because task B lost the race to start the grace period, it
waits on ->expedited_sequence to reach 0b'0100' inside of
exp_funnel_lock(). This task therefore blocks on the rcu_node
structure's ->exp_wq[1] field, keeping in mind that the
end-of-grace-period value of ->expedited_sequence (0b'0100')
is shifted down two bits before indexing the ->exp_wq[] field.
3. Task C attempts to start another expedited grace period,
but blocks on ->exp_mutex, which is still held by Task A.
4. The aforementioned expedited grace period completes, so that
->expedited_sequence now has the value 0b'0100'. A kworker task
therefore acquires the rcu_state structure's ->exp_wake_mutex
and starts awakening any tasks waiting for this grace period.
5. One of the first tasks awakened happens to be Task A. Task A
therefore releases the rcu_state structure's ->exp_mutex,
which allows Task C to start the next expedited grace period,
which causes the lower four bits of the rcu_state structure's
->expedited_sequence field to become 0b'0101'.
6. Task C's expedited grace period completes, so that the lower four
bits of the rcu_state structure's ->expedited_sequence field now
become 0b'1000'.
7. The kworker task from step 4 above continues its wakeups.
Unfortunately, the wake_up_all() refetches the rcu_state
structure's .expedited_sequence field:

This might not be true. I think wake_up_all(), which internally calls
__wake_up(), will use a single wq_head for doing all wakeups. So, a single
.expedited_sequence value would be used to get wq_head.

void __wake_up(struct wait_queue_head *wq_head, ...)

The wake_up_all() really is selecting among four different ->exp_wq[]
array entries:

wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

So I lost you here. Are you saying that the wake_up_all() will somehow
be operating on ->exp_wq[1], which is where Task B is blocked? Or that
Task B would instead be blocked on ->exp_wq[2]? Or that failing to wake
up Task B is OK for some reason? Or something else entirely?

My bad; I was thinking only of the case where there is only one
rnp node (which is also the root) in RCU tree. In case of only
one rnp node also, issue can be seen. Please ignore my
comment. The commit description looks good to me.

Thank you for checking!

And the sequence of events below looks greatly improved over your
original. I suspect that there are more similar bugs to find in
Linux-kernel RCU, so please use a carefully labeled style like that
below when reporting the next one.

Thanx, Paul


Ok, thanks. I will keep that in mind, while reporting any issues in future.


Thanks
Neeraj

Thanks
Neeraj

Thanx, Paul

However, below sequence of events would result in problem:

1. Tasks A starts an expedited grace period at about the same time.
This grace period will have completed when the lower four bits
of the rcu_state structure's ->expedited_sequence field are 0b'0100',
for example, when the initial value of this counter is zero.
Task A wins, acquires the rcu_state structure's .exp_mutex and
sets the counter to 0b'0001'.

2. The aforementioned expedited grace period completes, so that
->expedited_sequence now has the value 0b'0100'. A kworker task
therefore acquires the rcu_state structure's ->exp_wake_mutex
and starts awakening any tasks waiting for this grace period.
This kworker gets preempted while unlocking wq_head lock

wake_up_all()
__wake_up()
__wake_up_common_lock()
spin_unlock_irqrestore()
__raw_spin_unlock_irqrestore()
preempt_enable()
__preempt_schedule()

3. One of the first tasks awakened happens to be Task A. Task A
therefore releases the rcu_state structure's ->exp_mutex,

4. Tasks B and C both attempt to start an expedited grace
period at about the same time. This grace period will have
completed when the lower four bits of the rcu_state structure's
->expedited_sequence field are 0b'1000'. Task B wins, and thus
does the actual work of starting the grace period, including
acquiring the rcu_state structure's .exp_mutex and sets the
counter to 0b'0101'.

5. Because task C lost the race to start the grace period, it
waits on ->expedited_sequence to reach 0b'1000' inside of
exp_funnel_lock(). This task therefore blocks on the rcu_node
structure's ->exp_wq[2] field, keeping in mind that the
end-of-grace-period value of ->expedited_sequence (0b'1000')
is shifted down two bits before indexing the ->exp_wq[] field.

6. Task B queues work to complete expedited grace period. This
task is preempted just before wait_event call. Kworker task picks
up the work queued by task B and and completes grace period, so
that the lower four bits of the rcu_state structure's
->expedited_sequence field now become 0b'1000'. This kworker starts
waiting on the exp_wake_mutex, which is owned by kworker doing
wakeups for expedited sequence initiated by task A.

7. Task B schedules in and finds its expedited sequence snapshot has
completed; so, it does not enter waitq and releases exp_mutex. This
allows Task D to start the next expedited grace period,
which causes the lower four bits of the rcu_state structure's
->expedited_sequence field to become 0b'1001'.

8. Task D's expedited grace period completes, so that the lower four
bits of the rcu_state structure's ->expedited_sequence field now
become 0b'1100'.

9. kworker from step 2 is scheduled in and releases exp_wake_mutex;
kworker correspnding to Task B's expedited grace period acquires
exp_wake_mutex and starts wakeups. Unfortunately, it used the
rcu_state structure's .expedited_sequence field for determining
the waitq index.


wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

This results in the wakeup being applied to the rcu_node
structure's ->exp_wq[3] field, which is unfortunate given that
Task C is instead waiting on ->exp_wq[2].


wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
This results in the wakeup being applied to the rcu_node
structure's ->exp_wq[2] field, which is unfortunate given that
Task B is instead waiting on ->exp_wq[1].
On a busy system, no harm is done (or at least no permanent harm is done).
Some later expedited grace period will redo the wakeup. But on a quiet
system, such as many embedded systems, it might be a good long time before
there was another expedited grace period. On such embedded systems,
this situation could therefore result in a system hang.
This issue manifested as DPM device timeout during suspend (which
usually qualifies as a quiet time) due to a SCSI device being stuck in
_synchronize_rcu_expedited(), with the following stack trace:
schedule()
synchronize_rcu_expedited()
synchronize_rcu()
scsi_device_quiesce()
scsi_bus_suspend()
dpm_run_callback()
__device_suspend()
This commit therefore prevents such delays, timeouts, and hangs by
making rcu_exp_wait_wake() use its "s" argument consistently instead of
refetching from rcu_state.expedited_sequence.

Do we need a "fixes" tag here?

Signed-off-by: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6ce598d..4433d00a 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
spin_unlock(&rnp->exp_lock);
}
smp_mb(); /* All above changes before wakeup. */
- wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
+ wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
}
trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
mutex_unlock(&rcu_state.exp_wake_mutex);


--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
the Code Aurora Forum, hosted by The Linux Foundation

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of
the Code Aurora Forum, hosted by The Linux Foundation

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation