Re: [PATCH v5 net 1/3] rcu: add a helper to report consolidated flavor QS

From: Paul E. McKenney
Date: Fri Apr 05 2024 - 14:13:16 EST


On Fri, Apr 05, 2024 at 03:49:46PM +0200, Sebastian Andrzej Siewior wrote:
> On 2024-03-22 21:02:02 [-0500], Yan Zhai wrote:
> > On Fri, Mar 22, 2024 at 4:31 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 12:24:13PM +0100, Sebastian Andrzej Siewior wrote:
> > > > On 2024-03-19 13:44:34 [-0700], Yan Zhai wrote:
> > > > > + * The macro is not needed when CONFIG_PREEMPT_RT is defined. RT kernels would
> > > > > + * have more chance to invoke schedule() calls and provide necessary quiescent
> > > > > + * states. As a contrast, calling cond_resched() only won't achieve the same
> > > > > + * effect because cond_resched() does not provide RCU-Tasks quiescent states.
> > > > > + */
> > > >
> > > > Paul, so CONFIG_PREEMPTION is affected but CONFIG_PREEMPT_RT is not.
> > > > Why does RT have more scheduling points?
> > >
> > > In RT, isn't BH-disabled code preemptible? But yes, this would not help
> > > RCU Tasks.
> Yes, it is but why does it matter? This is used in the NAPI thread which
> fully preemptible and does cond_resched(). This should be enough.

By the time it gets to RCU, a cond_resched()-induced context switch looks
like a preemption. This is fine for normal RCU, but Tasks RCU needs
a voluntary context switch for a quiescent state. Which makes sense,
given that code running in a trampoline that is invoked from preemptible
code can itself be preempted.

So that additional call to rcu_softirq_qs() is needed. (Which invokes
rcu_tasks_qs() which in turn invokes rcu_tasks_classic_qs().

And maybe it is also needed for RT. The argument against would be that
RT applications have significant idle time on the one hand or spend lots
of time in nohz_full userspace on the other, both of which are quiescent
states for RCU Tasks. But you tell me!

> > By "more chance to invoke schedule()", my thought was that
> > cond_resched becomes no op on RT or PREEMPT kernel. So it will not
> > call __schedule(SM_PEREEMPT), which clears the NEED_RESCHED flag. On a
> It will nop cond_resched(), correct. However once something sends
> NEED_RESCHED then the receiver of this flag will __schedule(SM_PEREEMPT)
> as soon as possible. That is either because the scheduler sends an IPI
> and the CPU will do it in the irq-exit path _or_ the thread does
> preempt_enable() (which includes local_bh_enable()) and the counter hits
> zero at which point the same context switch happens.
>
> Therefore I don't see a difference between CONFIG_PREEMPT and
> CONFIG_PREEMPT_RT.

True, but again RCU Tasks needs a voluntary context switch and the
resulting preemption therefore does not qualify.

> > normal irq exit like timer, when NEED_RESCHED is on,
> > schedule()/__schedule(0) can be called time by time then.
>
> This I can not parse. __schedule(0) means the task gives up on its own
> and goes to sleep. This does not happen for the NAPI-thread loop,
> kworker loop or any other loop that consumes one work item after the
> other and relies on cond_resched() in between.
>
> > __schedule(0) is good for RCU tasks, __schedule(SM_PREEMPT) is not.
> Okay and that is why? This means you expect that every thread gives up
> on its own which may take some time depending on the workload. This
> should not matter.
>
> If I see this right, the only difference is rcu_tasks_classic_qs() and I
> didn't figure out yet what it does.

It marks the task as having passed through a Tasks RCU quiescent state.
Which works because this is called from task level (as opposed to from
irq, softirq, or NMI), so it cannot be returning to a trampoline that
is protected by Tasks RCU.

Later on, the Tasks RCU grace-period kthread will see the marking, and
remove this task from the list that is blocking the current Tasks-RCU
grace period.

> > But I think this code comment does not take into account frequent
> > preempt_schedule and irqentry_exit_cond_resched on a PREEMPT kernel.
> > When returning to these busy kthreads, irqentry_exit_cond_resched is
> > in fact called now, not schedule(). So likely __schedule(PREEMPT) is
> > still called frequently, or even more frequently. So the code comment
> > looks incorrect on the RT argument part. We probably should remove the
> > "IS_ENABLED" condition really. Paul and Sebastian, does this sound
> > reasonable to you?
>
> Can you walk me through it? Why is it so important for a task to give up
> voluntary? There is something wrong here with how RCU tasks works.
> We want to get rid of the sprinkled cond_resched(). This looks like a
> another variant of it that might be required in places with no
> explanation except it takes too long.

Hmmm... I would normally point you at the Requirements.rst [1] document
for a walkthrough, but it does not cover all of this.

How about the upgrade shown below?

I agree that it would be nice if Tasks RCU did not have to depend on
voluntary context switches. In theory, one alternative would be to
examine each task's stack, looking for return addresses in trampolines.
In practice, we recently took a look at this and other alternatives,
but none were feasible [2]. If you know of a better way, please do not
keep it a secret!

Note that kernel live patching has similar needs, and may need
similar annotations/innovations.

Thanx, Paul

[1] Documentation/RCU/Design/Requirements/Requirements.rst
[2] https://docs.google.com/document/d/1kZY6AX-AHRIyYQsvUX6WJxS1LsDK4JA2CHuBnpkrR_U/edit?usp=sharing

------------------------------------------------------------------------
As written:
------------------------------------------------------------------------

Tasks RCU
~~~~~~~~~

Some forms of tracing use “trampolines” to handle the binary rewriting
required to install different types of probes. It would be good to be
able to free old trampolines, which sounds like a job for some form of
RCU. However, because it is necessary to be able to install a trace
anywhere in the code, it is not possible to use read-side markers such
as rcu_read_lock() and rcu_read_unlock(). In addition, it does
not work to have these markers in the trampoline itself, because there
would need to be instructions following rcu_read_unlock(). Although
synchronize_rcu() would guarantee that execution reached the
rcu_read_unlock(), it would not be able to guarantee that execution
had completely left the trampoline. Worse yet, in some situations
the trampoline's protection must extend a few instructions *prior* to
execution reaching the trampoline. For example, these few instructions
might calculate the address of the trampoline, so that entering the
trampoline would be pre-ordained a surprisingly long time before execution
actually reached the trampoline itself.

The solution, in the form of `Tasks
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
critical sections that are delimited by voluntary context switches, that
is, calls to schedule(), cond_resched(), and
synchronize_rcu_tasks(). In addition, transitions to and from
userspace execution also delimit tasks-RCU read-side critical sections.
Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to
interact with them.

Note well that involuntary context switches are *not* Tasks-RCU quiescent
states. After all, in preemptible kernels, a task executing code in a
trampoline might be preempted. In this case, the Tasks-RCU grace period
clearly cannot end until that task resumes and its execution leaves that
trampoline. This means, among other things, that cond_resched() does
not provide a Tasks RCU quiescent state. (Instead, use rcu_softirq_qs()
from softirq or rcu_tasks_classic_qs() otherwise.)

The tasks-RCU API is quite compact, consisting only of
call_rcu_tasks(), synchronize_rcu_tasks(), and
rcu_barrier_tasks(). In ``CONFIG_PREEMPTION=n`` kernels, trampolines
cannot be preempted, so these APIs map to call_rcu(),
synchronize_rcu(), and rcu_barrier(), respectively. In
``CONFIG_PREEMPTION=y`` kernels, trampolines can be preempted, and these
three APIs are therefore implemented by separate functions that check
for voluntary context switches.

------------------------------------------------------------------------
As patch:
------------------------------------------------------------------------

diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index cccafdaa1f849..f511476b45506 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -2357,6 +2357,7 @@ section.
#. `Sched Flavor (Historical)`_
#. `Sleepable RCU`_
#. `Tasks RCU`_
+#. `Tasks Trace RCU`_

Bottom-Half Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -2610,6 +2611,16 @@ critical sections that are delimited by voluntary context switches, that
is, calls to schedule(), cond_resched(), and
synchronize_rcu_tasks(). In addition, transitions to and from
userspace execution also delimit tasks-RCU read-side critical sections.
+Idle tasks are ignored by Tasks RCU, and Tasks Rude RCU may be used to
+interact with them.
+
+Note well that involuntary context switches are *not* Tasks-RCU quiescent
+states. After all, in preemptible kernels, a task executing code in a
+trampoline might be preempted. In this case, the Tasks-RCU grace period
+clearly cannot end until that task resumes and its execution leaves that
+trampoline. This means, among other things, that cond_resched() does
+not provide a Tasks RCU quiescent state. (Instead, use rcu_softirq_qs()
+from softirq or rcu_tasks_classic_qs() otherwise.)

The tasks-RCU API is quite compact, consisting only of
call_rcu_tasks(), synchronize_rcu_tasks(), and
@@ -2632,6 +2643,11 @@ moniker. And this operation is considered to be quite rude by real-time
workloads that don't want their ``nohz_full`` CPUs receiving IPIs and
by battery-powered systems that don't want their idle CPUs to be awakened.

+Once kernel entry/exit and deep-idle functions have been properly tagged
+``noinstr``, Tasks RCU can start paying attention to idle tasks (except
+those that are idle from RCU's perspective) and then Tasks Rude RCU can
+be removed from the kernel.
+
The tasks-rude-RCU API is also reader-marking-free and thus quite compact,
consisting of call_rcu_tasks_rude(), synchronize_rcu_tasks_rude(),
and rcu_barrier_tasks_rude().