Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions in rcu_read_unlock_special()

From: Paul E. McKenney
Date: Mon Apr 01 2019 - 15:03:33 EST


On Mon, Apr 01, 2019 at 10:22:57AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 01, 2019 at 10:32:11AM +0200, Peter Zijlstra wrote:
> > On Fri, Mar 29, 2019 at 11:26:34AM -0700, Paul E. McKenney wrote:
> > > When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
> > > kthreads, a full and unconditional wakeup is required to initiate RCU
> > > core processing. In contrast, when RCU core processing is carried
> > > out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are
> > > situations where raise_softirq() does a full wakeup, but these do not
> > > occur with normal usage of rcu_read_unlock().
> >
> > Do we have a comment somewhere explaining why?
>
> First, thank you for reviewing this!
>
> The "why" is because people normally don't do things like the code
> sequence shown below, but where the scheduler holds locks across the
> second RCU read-side critical section. (If they did, lockdep would
> complain. Nevertheless, it is good to avoid this potential problem.)
>
> > > The initial solution to this problem was to use set_tsk_need_resched() and
> > > set_preempt_need_resched() to force a future context switch, which allows
> > > rcu_preempt_note_context_switch() to report the deferred quiescent state
> > > to RCU's core processing. Unfortunately for expedited grace periods,
> > > there can be a significant delay between the call for a context switch
> > > and the actual context switch.
> >
> > This is all PREEMPT=y kernels, right? Where is the latency coming from?
> > Because PREEMPT=y _should_ react quite quickly.
>
> Yes, PREEMPT=y. It happens like this:
>
> 1. rcu_read_lock() with everything enabled.
>
> 2. Preemption then resumption.
>
> 3. local_irq_disable().
>
> 4. rcu_read_unlock().
>
> 5. local_irq_enable().
>
> From what I know, the scheduler doesn't see anything until the next
> interrupt, local_bh_enable(), return to userspace, etc. Because this
> is PREEMPT=y, preempt_enable() and cond_resched() do nothing. So
> it could be some time (milliseconds, depending on HZ, NO_HZ_FULL, and
> so on) until the scheduler responds. With NO_HZ_FULL, last I knew,
> the delay can be extremely long.
>
> Or am I missing something that gets the scheduler on the job faster?
>
> Hmmm... If your point is that this amount of delay matters only for
> expedited grace periods, you are quite right. So perhaps I shouldn't be
> doing any of the expensive stuff unless there is an expedited grace period
> in flight. Or if NO_HZ_FULL. See below for an updated (and untested)
> patch to this effect.
>
> > > This commit therefore introduces a ->deferred_qs flag to the task_struct
> > > structure's rcu_special structure. This flag is initially false, and
> > > is set to true by the first call to rcu_read_unlock() requiring special
> > > attention, then finally reset back to false when the quiescent state is
> > > finally reported. Then rcu_read_unlock() attempts full wakeups only when
> > > ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
> > > special attention. Note that a chain of RCU readers linked by some other
> > > sort of reader may find that a later rcu_read_unlock() is once again able
> > > to do a full wakeup, courtesy of an intervening preemption:
> > >
> > > rcu_read_lock();
> > > /* preempted */
> > > local_irq_disable();
> > > rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
> > > rcu_read_lock();
> > > local_irq_enable();
> > > preempt_disable()
> > > rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
> > > rcu_read_lock();
> > > preempt_enable();
> > > /* preempted, >deferred_qs reset. */
> >
> > As it would have without ->deferred_sq and just having done the above
> > which was deemed insufficient.
> >
> > So I'm really puzzled by the need for all this.
>
> On the first round, without the ->deferred_qs, we know the scheduler
> cannot be holding any of its pi or rq locks because if it did, it would
> have disabled interrupts across the entire RCU read-side critical section.
> Wakeups are therefore safe in this case, whether via softirq or wakeup.
> Afterwards, we don't have that guarantee because an earlier critical
> section might have been preempted and the scheduler might have held one
> of its locks across the entire just-ended critical section.
>
> And I believe you are right that we should avoid the wakeups unless
> there is an expedited grace period in flight or in a NO_HZ_FULL kernel.
> Hence the patch shown below.

Which is a stupid patch. It assumes ancient times when expedited grace
periods were guaranteed to preempted pre-existing RCU read-side critical
sections. Back to the drawing board...

Thanx, Paul

> -----------------------------------------------------------------------
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 2e52a77af6be..582c6d88aaa0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -606,20 +606,26 @@ static void rcu_read_unlock_special(struct task_struct *t)
> local_irq_save(flags);
> irqs_were_disabled = irqs_disabled_flags(flags);
> if (preempt_bh_were_disabled || irqs_were_disabled) {
> + bool exp;
> +
> t->rcu_read_unlock_special.b.exp_hint = false;
> + exp = !!READ_ONCE(this_cpu_ptr(&rcu_data)->mynode->exp_tasks);
> // Need to defer quiescent state until everything is enabled.
> - if (irqs_were_disabled && use_softirq &&
> + if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
> + irqs_were_disabled && use_softirq &&
> (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
> // Using softirq, safe to awaken, and we get
> // no help from enabling irqs, unlike bh/preempt.
> raise_softirq_irqoff(RCU_SOFTIRQ);
> - } else if (irqs_were_disabled && !use_softirq &&
> + } else if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
> + irqs_were_disabled && !use_softirq &&
> !t->rcu_read_unlock_special.b.deferred_qs) {
> // Safe to awaken and we get no help from enabling
> // irqs, unlike bh/preempt.
> invoke_rcu_core();
> } else {
> // Enabling BH or preempt does reschedule, so...
> + // Also if no expediting or NO_HZ_FULL, slow is OK.
> set_tsk_need_resched(current);
> set_preempt_need_resched();
> }