Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)

From: Odin Ugedal
Date: Tue Nov 02 2021 - 16:20:19 EST


On Tue, Nov 2, 2021 at 16:02, Michal Koutný <mkoutny@xxxxxxxx> wrote:
>
>
> I've observed that the window between unregister_fair_sched_group() and
> free_fair_sched_group() is commonly around 15 ms (based on kprobe
> tracing).
>
> I have a reproducer (attached) that can hit this window quite easily
> after tuning. I can observe consequences of it even with a recent 5.15
> kernel. (And I also have reports from real world workloads failing due
> to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
> unthrottle").)
>
> My original patch was really an uninformed attempt given the length of
> the window.


Thanks for digging into this! I don't have time to run the reproducer
myself at the moment, but it looks good. I will look more deeply into
it this weekend.

> On Wed, Oct 13, 2021 at 07:45:59PM +0100, Odin Ugedal <odin@xxxxxxx> wrote:
>
> I say no to reverting 31bc6aeaab1d ("sched/fair: Optimize
> update_blocked_averages()") (it solves reported performance issues, it's
> way too old :-).

Yeah, I am fine with not reverting that one.


> 1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
> (Not exclusive with the other suggestions, rather a stop-gap for the
> time being.)

I am fine with reverting this commit for the short term. Can you post
a revert patch of it together with your bug details, as well as a
reference to this discussion? Feel free to add:
Acked-by: Odin Ugedal <odin@xxxxxxx>

I do not think we need to revert its fixups, as they help the
consistency of the list in other cases as well. I might be wrong about
this one, so it would probably need some testing, but they should at
least not cause _this_ race issue.

The performance regression from reverting that commit should also be
_fairly_ small, even though it can cause quite severe fairness issues
in some cases. That is, however, preferable to a hard lockup.

We can then look at solutions for that issue and take some time to
figure out the best way to mitigate it.

> 2) Don't add offlined task_groups into the undecayed list
> - Your proposal with overloaded on_list=2 could serve as mark of that,
> but it's a hack IMO.
> - Proper way (tm) would be to use css_tryget_online() and css_put() when
> dealing with the list (my favorite at the moment).

Ahh, I am not familiar with css_tryget_online() and css_put(), but
they might be useful here. Maybe cc the cgroups list about this? But
yeah, this is my favorite medium- to long-term solution as well.
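
If I understand those helpers correctly, something like the below is
roughly how I read 2) (just a rough, untested sketch on my side, not a
tested patch; the exact placement in unthrottle_cfs_rq() and the
reference lifetime are assumptions):

	/*
	 * Only put the cfs_rq back on the leaf list if its task_group is
	 * still online, so a group that is concurrently being removed is
	 * never re-added.
	 */
	if (css_tryget_online(&cfs_rq->tg->css)) {
		list_add_leaf_cfs_rq(cfs_rq);
		css_put(&cfs_rq->tg->css);
	}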

> 3) Narrowing the race-window dramatically
> - that is by moving list removal from unregister_fair_sched_group() to
> free_fair_sched_group(),
> - <del>or use list_empty(tg->list) as indicator whether we're working
> with onlined task_group.</del> (won't work for RCU list)

This is outside what I can wrap my head around this late, but I vote
for a solution that does _not_ include the phrase "narrowing a race
window" if we can eliminate it instead.

> 4) Rework how throttled load is handled (hand waving)
> - There is remove_entity_load_avg() that moves the load to parent upon
> final removal. Maybe it could be generalized for temporary removals by
> throttling (so that unthrottling could again add only non-empty
> cfs_rqs to the list and undecayed load won't skew fairness).
> - or the way of [1].

Yeah, we can do this if the performance impact of always grabbing the
rq lock on control group delete (essentially your initial patch here)
turns out to be a problem. My hunch is that the impact of the lock is
essentially zero, and that doing something like 2) is better if we can
do it. I have not sketched out a proper solution like this, so I am not
sure how it will look.
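
For reference, my reading of "always grabbing the rq lock on control
group delete" is roughly the following in unregister_fair_sched_group()
(untested sketch; the exact lock helpers differ between kernel
versions, so treat it as an illustration rather than a patch):

	for_each_possible_cpu(cpu) {
		struct rq *rq = cpu_rq(cpu);
		unsigned long flags;

		/*
		 * Check on_list and remove the cfs_rq under the rq lock so
		 * this cannot race with unthrottle re-adding it to the list.
		 */
		raw_spin_lock_irqsave(&rq->lock, flags);
		if (tg->cfs_rq[cpu]->on_list)
			list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
		raw_spin_unlock_irqrestore(&rq->lock, flags);
	}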

> 5) <your ideas>

Nah, I think you found all the possible solutions I have thought of. :)

> Opinions?

Again, thanks for working on this, and for the nice script to reproduce!

> [1] https://lore.kernel.org/lkml/CAFpoUr1AO_qStNOYrFWGnFfc=uSFrXSYD8A5cQ8h0t2pioQzDA@xxxxxxxxxxxxxx/

Thanks
Odin