Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes

From: Joel Fernandes
Date: Tue Aug 30 2022 - 14:40:47 EST


Hi Frederic,

On 8/30/2022 6:26 AM, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
[..]
>>>>> 1) NOCB without nohz_full has an extremely rare usecase (RT niche:
>>>>> https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@xxxxxxxxxxxxxx/
>>>>> )
>>>>
>>>> Really? Android has been using it for a long time. It seems to be quite popular
>>>> in the battery-powered space.
>>>
>>> It's really sad that this is the first time I'm hearing about that. I've been
>>> working on this code for years now without this usecase in mind. And yet it's
>>> fundamental.
>>>
>>> I asked around several times about use cases of rcu_nocbs= other than
>>> nohz_full= and nobody reported this one. I can hardly even google a
>>> significant link between power saving and rcu_nocbs=
>>>
>>> If this has really been used that way for a long time, then it's a cruel
>>> disconnect between users and developers.
>>
>> I was not involved with Android or RCU back then. But my guess is that it
>> was not enabled with the intent of saving power; it is just that using
>> RCU_NO_CB_CPU has become the way to go to keep dyntick-idle
>> CPUs undisturbed: https://paulmck.livejournal.com/66807.html . Paul,
>> +Dietmar Eggemann can probably provide more history.
>
> Thanks for the pointer!

No problem. Paul proves to us that any and every bit of writing effort pays off
one way or another :-D.

>>>>> 2) NOCB implies performance issues.
>>>>
>>>> Which kinds of? Boot times are slightly worse, but I'm guessing that's to do
>>>> with the extra scheduling overhead of the extra threads, which is usually not a
>>>> problem except that RCU is used in the critical path of boot up (on ChromeOS).
>>>
>>> I never measured it myself, but executing callbacks on other CPUs, with
>>> context switches and locking, can only cause significant performance issues
>>> if callbacks are frequent. So it's a tradeoff between power and performance.
>>
>> In my testing of benchmarks on real systems with 8-16 CPUs, the
>> performance hit is down in the noise. It is possible that one could write
>> an unrealistic synthetic test to force performance issues, but I've not
>> seen them in the real world. Maybe on networking-heavy servers with lots
>> of cores you'll see it, but their batteries, if any, would be pretty
>> big :-).
>
> Yeah I suspect this should have an impact on servers. And even servers may
> deserve idleness sometimes.

Sure. I am pretty sure you will see power savings on servers, but I guess the
big question is whether you will regress server workloads with those savings. I
do remember Eric Dumazet from Google posting to LKML that callback flooding can
happen on servers.

>>>>> 3) We are mixing up two very different things in a single list of callbacks:
>>>>> lazy callbacks and flooding callbacks; as a result we are adding lots of
>>>>> off-topic corner cases all around:
>>>>> * a separate lazy len field to struct rcu_cblist, whose purpose is much more
>>>>> general than just bypass/lazy
>>>>> * "lazy" specialized parameters to general purpose cblist management
>>>>> functions
>>>>
>>>> I think just 1 or 2 functions have a new lazy param. It didn't seem too
>>>> intrusive to me.
>>>
>>> What bothers me is that struct rcu_cblist has a general purpose and we are
>>> adding a field and a parameter that are relevant to only one specialized user.
>>
>> To Paul's point, we can change it to a flag, I think. The 3 states are:
>> - No CBs.
>> - All lazy.
>> - All non-lazy.
>
> Yeah that makes sense, should we take the generic direction!

Sure, thanks.
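
To make that concrete, here is a minimal sketch of what such a flag could look
like (the enum and its names are hypothetical, not from the current patchset):

	/* Hypothetical tri-state flag replacing the separate lazy length count. */
	enum cblist_lazy_state {
		CBLIST_NO_CBS,		/* No callbacks queued. */
		CBLIST_ALL_LAZY,	/* All queued callbacks are lazy. */
		CBLIST_ALL_NONLAZY,	/* A non-lazy callback was queued, so
					 * the whole list is treated as
					 * non-lazy. */
	};

Such a flag could live either in struct rcu_cblist or, as mentioned below, in
the per-cpu rcu_data structure.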

>>
>> Or, worst case, we can move the flag to the per-cpu rcu_data
>> structure, I think. Does that alleviate your concern?
>>
>>>>> 4) This further complicates bypass core code, nocb timer management, and
>>>>> core nocb group management, all of which are already very complicated.
>>>>
>>>> True, I agree, a few more cases to handle for sure, but I think I got them all
>>>> now (hopefully).
>>>
>>> Now I'm worried about maintainability. Hence why I'd rather see generic
>>> code for them all if possible.
>>
>> Maintainability is a concern for any new code. I rely a lot on testing,
>> both synthetic and real-world. I have spent a lot of time on test
>> code for this.
>
> And I thank you for that. I've always been terrible at performance/power/latency
> testing so I'm always relieved when others take over on that :)

Thanks, appreciate it.

>>>
>>> Note I'm not strictly opposed to the current approach. But I can't say I'm
>>> comfortable with it.
>>
>> Maybe we can find a way in the code to see what can be changed. One
>> concern you rightly raised is the length variables, but I think
>> that's not a blocker.
>
> That's just a detail indeed. If we proceed with the current NOCB approach I'll
> give a deeper review.
>
>>
>>> Can we do a simple test? Would it be possible to affine every rcuo%c/%d
>>> kthread to the corresponding CPU%d? For example, affine rcuop/1 to CPU 1,
>>> rcuop/2 to CPU 2, etc., and then relaunch your measurements on top of that?
>>
>> We already did that and it does not help; it makes things worse. The
>> decision of where to run RCU threads is best left to the scheduler
>> (for both performance and power).
>
> That's the point: we want to artificially remove NOCB's advantages to get
> behaviour close to !NOCB, so that we know whether it's worth expanding the
> patchset to !NOCB.
> Let me clarify. The point of NOCB, powerwise, is that the RCU kthreads
> track and execute the callbacks on an optimized selection of CPUs, thanks to
> the scheduler. And the new lazy flow optimizes that further.
>
> Now we would like to know if call_rcu_lazy() also works on !NOCB. But we have
> no way to test that right now because we only have an implementation for NOCB.

I am happy to try this, but I want to make a counter-suggestion. You can 'hack'
a behavior similar to call_rcu_lazy() by just increasing jiffies_till_first_fqs
to 24 or 48 (via boot param). The downside of doing this is that you'll slow
down synchronize_rcu() and tracing will be slow. That can be remedied by passing
the rcupdate.rcu_expedited boot parameter to speed up synchronize_rcu(). However,
it will still slow down *all* call_rcu(). Nevertheless, it's a neat trick to see
the light at the end of the tunnel, but not one for production.
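
For example, something like this on the kernel command line (48 is just an
illustrative value, in jiffies):

	rcutree.jiffies_till_first_fqs=48 rcupdate.rcu_expedited=1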

Would it be possible for you to try this out with some workloads on a server,
or even with the server just idle?

Just to mention - the power savings between NO_CB and !NO_CB are enormous for
us (40% !!). We attributed this directly to the tick not getting turned off and
the hardware SoC thus not entering a deeper SoC-wide low-power state (known as
package C-state). While a server does not typically have an SoC, I suspect
there would be *some* power savings at least.

> But if we affine each rcuop/%d to its corresponding CPU%d, we get behaviour
> that is close to !NOCB, because then the scheduler doesn't help anymore and
> callbacks are executed locally (the only differences are that we do context
> switches and locking, and we still have rcuog around, but still, that
> should give a rough idea of what we get with !NOCB).
>
> Therefore if you run your measurements again before and after your patchset,
> both with each rcuop/%d affined to its CPU%d counterpart, you should get a
> before-vs-after comparison of your patchset on what would be an implementation
> for !NOCB.

My suspicion is that if I did this experiment on my hardware, it would still
not give you a good idea of what a server will see, because I don't have those
kinds of workloads or many-CPU hardware. So it may not be that representative
of what the !NOCB world wants. Further, my hardware is an Intel SoC, which is
more integrated than a typical server board, so SoC-wide power savings might
not be representative of server hardware. However, I am happy to try your
suggestion out of respect for you, if you still feel it's worth doing.
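
For reference, I would probably pin the kthreads with something along these
lines (an untested sketch; it assumes the rcuop kthreads accept affinity
changes from taskset):

	# Pin each rcuop/N kthread to CPU N.
	for pid in $(pgrep '^rcuop'); do
		comm=$(cat /proc/$pid/comm)	# e.g. "rcuop/3"
		taskset -cp "${comm#rcuop/}" "$pid"
	done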

Thanks,

- Joel