Re: [PATCH] rcu: Fix missed wakeup of exp_wq waiters

From: Neeraj Upadhyay
Date: Tue Nov 19 2019 - 02:03:24 EST


Hi Paul,

On 11/19/2019 9:35 AM, Paul E. McKenney wrote:
On Tue, Nov 19, 2019 at 03:35:15AM +0000, Neeraj Upadhyay wrote:
Hi Paul,

On 11/18/2019 10:54 PM, Paul E. McKenney wrote:
On Mon, Nov 18, 2019 at 04:41:47PM +0000, Neeraj Upadhyay wrote:
Hi Paul,


On 11/18/2019 8:38 PM, Paul E. McKenney wrote:
On Mon, Nov 18, 2019 at 09:28:39AM +0000, Neeraj Upadhyay wrote:
Hi Paul,

On 11/18/2019 3:06 AM, Paul E. McKenney wrote:
On Fri, Nov 15, 2019 at 10:58:14PM +0530, Neeraj Upadhyay wrote:
For the tasks waiting in exp_wq inside exp_funnel_lock(),
there is a chance that they might be blocked indefinitely
in the scenario below:

1. There is a task waiting on exp sequence 0b'100' inside
exp_funnel_lock().

_synchronize_rcu_expedited()

This symbol went away a few versions back, but let's see how this
plays out in current -rcu.


Sorry; for us this problem was observed on the 4.19 stable kernel; I had
checked against the -rcu code, and the relevant portions were present
there.

s = 0b'100
exp_funnel_lock()
wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
           sync_exp_work_done(rsp, s));

All of the above could still happen if the expedited grace
period number was zero (or a bit less) when that task invoked

Yes

synchronize_rcu_expedited(). What is the relation, if any,
between this task and "task1" below? Seems like you want them to
be different tasks.


This task is the one waiting for the expedited sequence which
"task1" completes ("task1" holds the exp_mutex for it). "task1" would
wake up this task on exp GP completion.

Does this task actually block, or is it just getting ready
to block? Seems like you need it to have actually blocked.


Yes, it actually blocked in the wait queue.

2. The exp GP completes, and the task (task1) holding exp_mutex queues
a worker and schedules out.

"The Exp GP" being the one that was initiated when the .expedited_sequence
counter was zero, correct? (Looks that way below.)

Yes, correct.

_synchronize_rcu_expedited()
s = 0b'100
queue_work(rcu_gp_wq, &rew.rew_work)
wake_up_worker()
schedule()

3. kworker A picks up the queued work and completes the exp GP
sequence.

rcu_exp_wait_wake()
rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence is incremented
// to 0b'100'

4. task1 does not enter the wait queue, as sync_exp_work_done() returns true,
and releases exp_mutex.

wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
sync_exp_work_done(rsp, s));
mutex_unlock(&rsp->exp_mutex);
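
For illustration, here is a minimal user-space model of the sequence arithmetic
behind this early exit (only the rcu_seq_snap()/rcu_seq_done() arithmetic is
taken from kernel/rcu/rcu.h; the standalone helpers and main() are purely
illustrative, not kernel code). Once the GP for s has ended, the wait_event()
condition is already true, so task1 never sleeps and immediately drops exp_mutex:

/* Toy model, not kernel code. */
#include <stdio.h>

#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1UL << RCU_SEQ_CTR_SHIFT) - 1)

static unsigned long seq_snap(unsigned long seq)	/* mirrors rcu_seq_snap() */
{
	return (seq + 2 * RCU_SEQ_STATE_MASK + 1) & ~RCU_SEQ_STATE_MASK;
}

static int seq_done(unsigned long seq, unsigned long s)	/* mirrors rcu_seq_done() */
{
	return (long)(seq - s) >= 0;
}

int main(void)
{
	unsigned long expedited_sequence = 0;
	unsigned long s = seq_snap(expedited_sequence);	/* 0b'100 = 4 */

	expedited_sequence += 1UL << RCU_SEQ_CTR_SHIFT;	/* kworker A ends the GP */
	/* This is the condition sync_exp_work_done() checks; it prints 1,
	 * so task1 skips the wait_event() and releases exp_mutex. */
	printf("done(s=%#lx) = %d\n", s, seq_done(expedited_sequence, s));
	return 0;
}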

So task1 is the one that initiated the expedited grace period that
started when .expedited_sequence was zero, right?


Yes, right.

5. The next exp GP completes, and the sequence number is incremented:

rcu_exp_wait_wake()
rcu_exp_gp_seq_end(rsp) // rsp->expedited_sequence = 0b'1000'

6. As kworker A uses the current expedited_sequence, it wakes up waiters
from the wrong wait queue index - it should have woken the wait queue
corresponding to the 0b'100' sequence, but wakes up the one for the
0b'1000' sequence. This results in the task at step 1 being blocked indefinitely.

rcu_exp_wait_wake()
wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
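
To make the mismatch concrete, here is a minimal user-space model of the
wait-queue index computation (only the rcu_seq_ctr() shift is taken from
kernel/rcu/rcu.h; the rest is purely illustrative):

/* Toy model, not kernel code. */
#include <stdio.h>

static unsigned long seq_ctr(unsigned long s)	/* mirrors rcu_seq_ctr() */
{
	return s >> 2;
}

int main(void)
{
	unsigned long s = 0x4;			/* snapshot taken at step 1 (0b'100) */
	unsigned long expedited_sequence = 0x8;	/* after the second GP end (0b'1000) */

	printf("waiter sleeps on exp_wq[%lu]\n", seq_ctr(s) & 0x3);	/* prints 1 */
	printf("kworker A wakes exp_wq[%lu]\n",
	       seq_ctr(expedited_sequence) & 0x3);			/* prints 2 */
	return 0;
}

Indexing the wake_up_all() with the snapshot s, as the patch does, keeps the
two indices equal by construction.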

So the issue is that the next expedited RCU grace period might
have completed before the completion of the wakeups for the previous
expedited RCU grace period, correct? Then expedited grace periods have

Yes. Actually, from the ftraces, I saw that the next expedited RCU grace
period completed while kworker A was in D state, waiting for
exp_wake_mutex. This led to kworker A using sequence 2 (instead of 1) for
its wake_up_all() call; so the task (point 1) was never woken up, as it was
waiting on wq index 1.

to have stopped to prevent any future wakeup from happening, correct?
(Which would make it harder for rcutorture to trigger this, though it
really does have code that attempts to trigger this sort of thing.)

Is this theoretical in nature, or have you actually triggered it?
If actually triggered, what did you do to make this happen?

We had seen this issue previously in customer-reported issues - one instance in
May 2018 (on a 4.9 kernel) and another in Nov 2018 (on a 4.14 kernel). Both
instances were in downstream drivers and we didn't have RCU traces. Two days
back it was reported again on a 4.19 kernel, this time with RCU traces enabled,
in a suspend scenario where we observe "DPM device timeout" [1], as a scsi
device is stuck in _synchronize_rcu_expedited().

schedule+0x70/0x90
_synchronize_rcu_expedited+0x590/0x5f8
synchronize_rcu+0x50/0xa0
scsi_device_quiesce+0x50/0x120
scsi_bus_suspend+0x70/0xe8
dpm_run_callback+0x148/0x388
__device_suspend+0x430/0x8a8

[1]
https://github.com/torvalds/linux/blob/master/drivers/base/power/main.c#L489

What have you done to test the change?


I have given this for testing; I will share the results. The current analysis
and patch are based on going through ftrace and code review.

OK, very good. Please include the failure information in the changelog
of the next version of this patch.

Done.


I prefer your original patch, that just uses "s", over the one below
that moves the rcu_exp_gp_seq_end(). The big advantage of your original
patch is that it allows more concurrency between a consecutive pair of
expedited RCU grace periods. Plus it would not be easy to convince
myself that moving rcu_exp_gp_seq_end() down is safe, so your original
is also conceptually simpler with a more manageable state space.

The reason for highlighting the alternate approach of doing GP end inside
exp_wake_mutex is the requirement of at most 3 pending wqs. Now, this is a
theoretical case; please correct me if I am wrong here:

1. task0 holds exp_wake_mutex, and is preempted.

Presumably after it has awakened the kthread that initiated the prior
expedited grace period (the one with seq number = -4).

2. task1 initiates new GP (current seq number = 0).

Yes, this can happen.

3. task1 queues worker kworker1 and schedules out.

And thus still holds .exp_mutex, but yes.

4. kworker1 sets exp GP to 1 and waits on exp_wake_mutex

And thus cannot yet have awakened task1.

5. task1 releases exp mutex, w/o entering waitq.

So I do not believe that we can get to #5. What am I missing here?


As mentioned in this patch, task1 could have scheduled out after queuing work:

queue_work(rcu_gp_wq, &rew.rew_work)
wake_up_worker()
schedule()

kworker1 runs and picks up this queued work, and sets exp GP to 1 and waits on exp_wake_mutex.

task1 gets scheduled in and checks sync_exp_work_done(rsp, s), which returns true, so it does not enter the wait queue and releases exp_mutex.

wait_event(rnp->exp_wq[rcu_seq_ctr(s) & 0x3],
sync_exp_work_done(rsp, s));

Thanks
Neeraj

6. task2 initiates a new GP (current seq number = 1).
7. task2 queues worker kworker2 and schedules out.
8. kworker2 sets exp GP to 2 and waits on exp_wake_mutex.
...

This sequence would break the requirement of at most 3 pending wqs.
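
As a toy illustration of why more than three pending GPs is a problem with only
four exp_wq[] slots (this is just the modulo-4 mapping, not kernel code):

#include <stdio.h>

int main(void)
{
	/* exp_funnel_lock() picks a slot with rcu_seq_ctr(s) & 0x3, so
	 * waiters for GP counters n and n+4 would share a slot. */
	for (unsigned long ctr = 0; ctr < 6; ctr++)
		printf("GP counter %lu -> exp_wq[%lu]\n", ctr, ctr & 0x3);
	return 0;
}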

If we hold the exp_wake_mutex before exp seq end, there will be at most three
pending GPs - 1. the GP which the current owner is doing wakeups for,
2. the GP which the waiter of exp_wake_mutex would complete, 3. the next GP's
waiters, which started after the GP at point 2 had started. This is also in
line with the statement in [1]:

"The key point is that the ->exp_mutex is not released until
the first wakeup is complete, which means that the ->exp_wake_mutex
has already been acquired at that point"

[1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev

And I believe that this still holds. The task holding .exp_mutex cannot
release it until it has been awakened, and it won't be awakened until after
the .exp_wake_mutex has been acquired. Again, what am I missing here?

Please also add the WARN_ON(), though at first glance your change seems
to have lost the wakeup. (But it is early, so maybe it is just that I
am not yet fully awake.)

My bad, I posted an incomplete diff in the previous mail:

static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
{
struct rcu_node *rnp;
+ unsigned long exp_low;
+ unsigned long s_low = rcu_seq_ctr(s) & 0x3;

synchronize_sched_expedited_wait(rsp);
rcu_exp_gp_seq_end(rsp);
@@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
spin_unlock(&rnp->exp_lock);
}
smp_mb(); /* All above changes before wakeup. */
- wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+ exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
+ WARN_ON(s_low != exp_low);
+ wake_up_all(&rnp->exp_wq[exp_low]);

Much better!

But I suggest using s_low in the wake_up_all. This hunk is of course
only for testing purposes, not for actual inclusion. (My earlier email
didn't make that clear.)

Thanx, Paul


Ok sure, got it. I will share the results once the issue is reproduced.

Sounds good!

Thanx, Paul

Thanks
Neeraj

}


Thanks
Neeraj


Thanx, Paul

I was thinking of another way of addressing this problem: doing exp seq end
inside exp_wake_mutex. This will also ensure that, even if we extend the
current scenario and multiple expedited GP sequences have completed before
exp_wake_mutex is held, we preserve the requirement of 3 wq
entries [2].


[2] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/tree/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.rst?h=dev


@@ -595,8 +595,6 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
struct rcu_node *rnp;

synchronize_sched_expedited_wait(rsp);
- rcu_exp_gp_seq_end(rsp);
- trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));

/*
* Switch over to wakeup mode, allowing the next GP, but -only- the
@@ -604,6 +602,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
*/
mutex_lock(&rsp->exp_wake_mutex);

+ rcu_exp_gp_seq_end(rsp);
+ trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
+



(Using a WARN_ON() to check for the lower bits of the counter portion
of rcu_state.expedited_sequence differing from the same bits of s
would be one way to detect this problem.)

Thanx, Paul


I have also given a patch for this for testing:

static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
{
struct rcu_node *rnp;
+ unsigned long exp_low;
+ unsigned long s_low = rcu_seq_ctr(s) & 0x3;

synchronize_sched_expedited_wait(rsp);
rcu_exp_gp_seq_end(rsp);
@@ -613,7 +615,9 @@ static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
spin_unlock(&rnp->exp_lock);
}
smp_mb(); /* All above changes before wakeup. */
- wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
+ exp_low = rcu_seq_ctr(rsp->expedited_sequence) & 0x3;
+ WARN_ON(s_low != exp_low);
+

Thanks
Neeraj

Fix this by using the correct sequence number for wake_up_all() inside
rcu_exp_wait_wake().

Signed-off-by: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
---
kernel/rcu/tree_exp.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index e4b77d3..28979d3 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -557,7 +557,7 @@ static void rcu_exp_wait_wake(unsigned long s)
spin_unlock(&rnp->exp_lock);
}
smp_mb(); /* All above changes before wakeup. */
- wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);
+ wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
}
trace_rcu_exp_grace_period(rcu_state.name, s, TPS("endwake"));
mutex_unlock(&rcu_state.exp_wake_mutex);
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation