Re: [PATCH 1/1] Reduce synchronize_rcu() waiting time

From: Paul E. McKenney
Date: Thu Mar 30 2023 - 17:16:50 EST


On Thu, Mar 30, 2023 at 09:18:44PM +0200, Uladzislau Rezki wrote:
> On Thu, Mar 30, 2023 at 11:58:41AM -0700, Paul E. McKenney wrote:
> > On Thu, Mar 30, 2023 at 05:43:15PM +0200, Uladzislau Rezki wrote:
> > > On Thu, Mar 30, 2023 at 03:09:33PM +0000, Joel Fernandes wrote:
> > > > On Tue, Mar 28, 2023 at 08:26:13AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Mar 27, 2023 at 10:29:31PM -0400, Joel Fernandes wrote:
> > > > > > Hello,
> > > > > >
> > > > > > > On Mar 27, 2023, at 9:06 PM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 27, 2023 at 11:21:23AM +0000, Zhang, Qiang1 wrote:
> > > > > > >>>> From: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > > > > > >>>> Sent: Tuesday, March 21, 2023 6:28 PM
> > > > > > >>>> [...]
> > > > > > >>>> Subject: [PATCH 1/1] Reduce synchronize_rcu() waiting time
> > > > > > >>>>
> > > > > > >>>> A call to a synchronize_rcu() can be expensive from time point of view.
> > > > > > >>>> Different workloads can be affected by this especially the ones which use this
> > > > > > >>>> API in its time critical sections.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> This is interesting and meaningful research. ;-)
> > > > > > >>>
> > > > > > >>>> For example in case of NOCB scenario the wakeme_after_rcu() callback
> > > > > > >>>> invocation depends on where in a nocb-list it is located. Below is an example
> > > > > > >>>> when it was the last out of ~3600 callbacks:
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> Can it be implemented separately as follows? it seems that the code is simpler
> > > > > > >> (only personal opinion) 😊.
> > > > > > >>
> > > > > > >> But I didn't test whether this reduce synchronize_rcu() waiting time
> > > > > > >>
> > > > > > >> +static void rcu_poll_wait_gp(struct rcu_tasks *rtp)
> > > > > > >> +{
> > > > > > >> + unsigned long gp_snap;
> > > > > > >> +
> > > > > > >> + gp_snap = start_poll_synchronize_rcu();
> > > > > > >> + while (!poll_state_synchronize_rcu(gp_snap))
> > > > > > >> + schedule_timeout_idle(1);
> > > > > > >
> > > > > > > I could be wrong, but my guess is that the guys working with
> > > > > > > battery-powered devices are not going to be very happy with this loop.
> > > > > > >
> > > > > > > All those wakeups by all tasks waiting for a grace period end up
> > > > > > > consuming a surprisingly large amount of energy.
> > > > > >
> > > > > > Is that really the common case? On the general topic of wake-ups:
> > > > > > Most of the time there should be only one
> > > > > > task waiting synchronously on a GP to end. If that is
> > > > > > true, then it feels like waking
> > > > > > up nocb Kthreads which indirectly wake other threads is doing more work than usual?
> > > > >
> > > > > A good question, and the number of outstanding synchronize_rcu()
> > > > > calls will of course be limited by the number of tasks in the system.
> > > > > But I myself have raised the ire of battery-powered embedded folks with
> > > > > a rather small number of wakeups, so...
> > > >
> > > > But unless I am missing something, even if there is single synchronize_rcu(),
> > > > you have a flurry of potential wakeups right now, instead of the bare minimum
> > > > I think. I have not measured how many wake ups, but I'd love to when I get
> > > > time. Maybe Vlad has some numbers.
> > > >
> > > I will measure and have a look at wake-ups. But, what we have for now is
> > > if there are two callers of synchronize_rcu() on different CPUs, i guess
> > > two nocb-kthreads have to handle it, thus two nocb-kthreads have to be
> > > awaken to do the work. This patch needs only one wake-up to serve all
> > > users.
> >
> > One wakeup per synchronize_rcu(), right?
> >
> The gp-kthread wake-ups only one work, in its turn a worker wake-ups all
> registered users of synchronize_rcu() for which a gp was passed. How many
> users of synchonize_rcu() awaken by one worker depends on how many were
> registered before initiating a new GP by the gp-kthread.
>
> > > Anyway, i will provide some data and analysis of it.
> >
> > Looking forward to seeing it!
> >
> Good. I will switch fully on it soon. I need to sort out some perf.
> issues at work.

And if you are looking for reduced wakeups instead of lower latency for
synchronize_rcu(), I could see where the extra workqueue wakeup might
be a problem for you.

Assuming that this is all default-off, you could keep a count of the
number of required wakeups for each grace period (indexed as usual by
the bottom few bits of the grace-period counter without the low-order
state bits), and do the wakeups directly from the grace-period kthread
if there are not all that many of them.

Except that, given that workqueues try hard to make the handler be on the
same CPU as the one that did the corresponding schedule_work() invocation,
it is not clear that this particular wakeup is really costing you enough
to notice. (That CPU is not idle, after all.) But there is nothing
quite like measuring the actual energy consumption on real hardware!

Thanx, Paul