Re: [PATCH] net: raise RCU qs after each threaded NAPI poll
From: Paul E. McKenney
Date: Sat Mar 02 2024 - 19:25:31 EST
On Fri, Mar 01, 2024 at 09:24:15PM -0500, Joel Fernandes wrote:
> (Shrinking CC a bit)
>
> On Thu, Feb 29, 2024 at 1:29 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Thu, Feb 29, 2024 at 12:41:55PM -0500, Joel Fernandes wrote:
> > > > On Feb 29, 2024, at 11:57 AM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > On Thu, Feb 29, 2024 at 09:21:48AM -0500, Joel Fernandes wrote:
> > > >>> On 2/28/2024 5:58 PM, Paul E. McKenney wrote:
> > > >>> On Wed, Feb 28, 2024 at 02:48:44PM -0800, Alexei Starovoitov wrote:
> > > >>>> On Wed, Feb 28, 2024 at 2:31 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> > > >>>>>
> > > >>>>> On Wed, 28 Feb 2024 14:19:11 -0800
> > > >>>>> "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
> > > >>>>>
> > > >>>>>>>>
> > > >>>>>>>> Well, to your initial point, cond_resched() does eventually invoke
> > > >>>>>>>> preempt_schedule_common(), so you are quite correct that as far as
> > > >>>>>>>> Tasks RCU is concerned, cond_resched() is not a quiescent state.
> > > >>>>>>>
> > > >>>>>>> Thanks for confirming. :-)
> > > >>>>>>
> > > >>>>>> However, given that the current Tasks RCU use cases wait for trampolines
> > > >>>>>> to be evacuated, Tasks RCU could make the choice that cond_resched()
> > > >>>>>> be a quiescent state, for example, by adjusting rcu_all_qs() and
> > > >>>>>> .rcu_urgent_qs accordingly.
> > > >>>>>>
> > > >>>>>> But this seems less pressing given the chance that cond_resched() might
> > > >>>>>> go away in favor of lazy preemption.
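For anyone following along, here is a minimal sketch of what that
choice might look like.  The helper names (__cond_resched(),
rcu_all_qs(), rcu_tasks_classic_qs()) follow my recollection of
current mainline; treat this as untested illustration, not a proposed
patch:

	/* Sketch: have cond_resched() report a Tasks RCU quiescent
	 * state even when it does not actually reschedule.
	 */
	int __sched __cond_resched(void)
	{
		if (should_resched(0)) {
			preempt_schedule_common();
			return 1;
		}
	#ifndef CONFIG_PREEMPT_RCU
		rcu_all_qs();
	#endif
		/* New: mark this task as no longer holding up the
		 * current Tasks RCU grace period.
		 */
		rcu_tasks_classic_qs(current, false);
		return 0;
	}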
> > > >>>>>
> > > >>>>> Although cond_resched() is technically a "preemption point" and not truly a
> > > >>>>> voluntary schedule, I would be happy to state that it's not allowed to be
> > > >>>>> called from trampolines, or their callbacks. Now the question is, do BPF
> > > >>>>> programs ever call cond_resched()? I don't think they do.
> > > >>>>>
> > > >>>>> [ Added Alexei ]
> > > >>>>
> > > >>>> I'm a bit lost in this thread :)
> > > >>>> Just answering the above question.
> > > >>>> bpf progs never call cond_resched() directly.
> > > >>>> But there are sleepable (aka faultable) bpf progs that
> > > >>>> can call some helper or kfunc that may call cond_resched()
> > > >>>> in some path.
> > > >>>> sleepable bpf progs are protected by rcu_tasks_trace.
> > > >>>> That's a very different one vs rcu_tasks.
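To make that concrete, a hedged example of my own (not code from this
thread): a sleepable program is marked with the ".s" section suffix
and may call helpers such as bpf_copy_from_user(), which can fault and
therefore sleep:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char LICENSE[] SEC("license") = "GPL";

	SEC("fentry.s/do_sys_openat2")	/* ".s" marks the prog sleepable */
	int BPF_PROG(probe_openat2, int dfd, const char *filename)
	{
		char buf[16];

		/* May fault, hence may sleep, hence may reach
		 * cond_resched() somewhere in the fault path.
		 */
		bpf_copy_from_user(buf, sizeof(buf), filename);
		return 0;
	}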
> > > >>>
> > > >>> Suppose that the various cond_resched() invocations scattered throughout
> > > >>> the kernel acted as RCU Tasks quiescent states, so that as soon as a
> > > >>> given task executed a cond_resched(), synchronize_rcu_tasks() might
> > > >>> return or call_rcu_tasks() might invoke its callback.
> > > >>>
> > > >>> Would that cause BPF any trouble?
> > > >>>
> > > >>> My guess is "no", because it looks like BPF is using RCU Tasks (as you
> > > >>> say, as opposed to RCU Tasks Trace) only to wait for execution to leave a
> > > >>> trampoline. But I trust you much more than I trust myself on this topic!
> > > >>
> > > >> But it uses RCU Tasks Trace as well (for sleepable bpf programs), not just
> > > >> Tasks? Looks like that's what Alexei said above as well, and I confirmed it in
> > > >> kernel/bpf/trampoline.c:
> > > >>
> > > >> /* The trampoline without fexit and fmod_ret progs doesn't call original
> > > >> * function and doesn't use percpu_ref.
> > > >> * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> > > >> * Then use call_rcu_tasks() to wait for the rest of trampoline asm
> > > >> * and normal progs.
> > > >> */
> > > >> call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
> > > >>
> > > >> The code comment says it uses both.
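For completeness, the callback named above chains the two flavors:
once the Tasks Trace grace period ends, it re-queues the image on
ordinary Tasks RCU, so both grace periods elapse before the trampoline
is freed.  Roughly, from the same file (quoted from memory):

	static void __bpf_tramp_image_put_rcu_tasks(struct rcu_head *rcu)
	{
		struct bpf_tramp_image *im;

		im = container_of(rcu, struct bpf_tramp_image, rcu);
		/* Sleepable progs are done; now wait for trampoline
		 * asm and normal progs via ordinary Tasks RCU.
		 */
		call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu);
	}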
> > > >
> > > > BPF does quite a few interesting things with these.
> > > >
> > > > But would you like to look at the update-side uses of RCU Tasks Rude
> > > > to see if lazy preemption affects them? I don't believe that there
> > > > are any problems here, but we do need to check.
> > >
> > > Sure, I will be happy to. I am planning to look at it in detail over the 3-day weekend. Too much fun! ;-)
> >
> > Thank you, and looking forward to seeing what you come up with!
> >
> > The canonical concern would be that someone somewhere is using either
> > call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() to wait for
> > non-preemptible regions of code that do not account for the possibility
> > of preemption in CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY kernels.
> >
> > I *think* that these are used only to handle the possibility
> > of tracepoints on functions on the entry/exit path and on the
> > RCU-not-watching portions of the idle loop. If so, then there is no
> > difference in behavior for lazy preemption. But who knows?
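For reference, the rude flavor's grace period works by forcing a
context switch on every online CPU, which is exactly why regions with
preemption disabled are what it waits for.  Roughly, from
kernel/rcu/tasks.h (again from memory):

	// Empty function to allow workqueues to force a context switch.
	static void rcu_tasks_be_rude(struct work_struct *work)
	{
	}

	// Wait for one rude RCU-tasks grace period.
	static void rcu_tasks_rude_wait_gp(struct rcu_tasks *rtp)
	{
		rtp->n_ipis += cpumask_weight(cpu_online_mask);
		schedule_on_each_cpu(rcu_tasks_be_rude);
	}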
>
> Hi Paul, regarding CONFIG_PREEMPT_AUTO, for Tasks RCU rude, I think
> the following patch will address your concern about quiescent states
> on CPUs spinning away in kernel mode:
>
> "sched/fair: handle tick expiry under lazy preemption"
> Link: https://lore.kernel.org/all/20240213055554.1802415-24-ankur.a.arora@xxxxxxxxxx/
>
> In this patch Ankur makes sure that the scheduling-clock interrupt
> will reschedule the CPU after a tick and not let queued tasks starve
> due to lazy re-scheduling. So my impression is that
> schedule_on_each_cpu() should schedule a worker thread in time to
> apply the implied Tasks RCU quiescent state even if the rescheduling
> was a lazy reschedule.
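To spell out the mechanism being relied on here, a sketch of the idea
(mine, not Ankur's actual code): under CONFIG_PREEMPT_AUTO a wakeup
sets only TIF_NEED_RESCHED_LAZY, and the scheduling-clock tick
upgrades that to an immediate reschedule once the current task has run
long enough:

	/* Sketch: called from the scheduling-clock interrupt. */
	static void tick_upgrade_lazy_resched(struct task_struct *curr)
	{
		/* A lazy request still pending at tick expiry becomes
		 * a real one, so queued tasks cannot be starved.
		 */
		if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
			set_tsk_need_resched(curr);
	}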
>
> Also, I am not sure if the "voluntary mode" of CONFIG_PREEMPT_AUTO
> behaves differently. My feeling is that, regardless of preemption
> mode, CONFIG_PREEMPT_AUTO should always preempt after a tick if
> something else needs to run. It just will not preempt immediately as
> it did before (although CFS did already have some wakeup-preemption
> logic to slow it down a bit). I am reviewing Ankur's patches further
> to confirm that and to see how this could affect Tasks RCU.
Thank you for the info!
As you noted, one thing that Ankur's series changes is that preemption
can occur anywhere that it is not specifically disabled in kernels
built with CONFIG_PREEMPT_NONE=y or CONFIG_PREEMPT_VOLUNTARY=y. This in
turn changes Tasks Rude RCU's definition of a quiescent state for these
kernels, adding all code regions where preemption is not specifically
disabled to the list of such quiescent states.
Although from what I know this is OK, it would be good to check that
the calls to call_rcu_tasks_rude() or synchronize_rcu_tasks_rude() are
set up so as to expect these new quiescent states. One example where it
would definitely be OK is if there were a call to synchronize_rcu_tasks()
right before or after that call to synchronize_rcu_tasks_rude().
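For example, roughly this sort of pairing (a sketch of the pattern I
have in mind, not a pointer to any specific call site):

	/* Wait for all tasks to pass through a voluntary context
	 * switch, user space, or idle (ordinary Tasks RCU)...
	 */
	synchronize_rcu_tasks();
	/* ...and also force a context switch on each CPU to catch
	 * code running with preemption disabled (rude flavor).
	 */
	synchronize_rcu_tasks_rude();
A caller like this already tolerates quiescent states anywhere that
the ordinary flavor permits them, so the additional quiescent states
that lazy preemption gives the rude flavor cannot weaken whatever such
a caller relies on.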
Would you be willing to check the call sites to verify that they
are OK with this change in semantics?
Thanx, Paul