task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs_rq list presence)
From: Michal Koutný
Date: Tue Nov 02 2021 - 12:37:06 EST
Hello.
(Getting back to this after some more analysis.)
On Wed, Oct 13, 2021 at 04:26:43PM +0200, Michal Koutný <mkoutny@xxxxxxxx> wrote:
> On Wed, Oct 13, 2021 at 09:57:17AM +0200, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> > This seems to closes the race window in your case but this could still
> > happen AFAICT.
>
> You seem to be right.
> Hopefully, I'll be able to collect more data evaluating this.
I've observed that the window between unregister_fair_sched_group() and
free_fair_sched_group() is commonly around 15 ms (based on kprobe
tracing).
I have a reproducer (attached) that can hit this window quite easily
after tuning. I can observe consequences of it even with a recent 5.15
kernel. (And I also have reports from real world workloads failing due
to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
unthrottle").)
My original patch was really an uninformed attempt given the length of
the window.
[snip]
On Wed, Oct 13, 2021 at 07:45:59PM +0100, Odin Ugedal <odin@xxxxxxx> wrote:
> Ref. your comment about reverting a7b359fc6a37
> ("sched/fair: Correctly insert cfs_rq's to list on unthrottle"), I
> think that is fine as long as we revert the commit it fixes as well,
> to avoid a regression of that (but yeah, that regression itself is
> less bad than your discovery).
I'd say no to reverting 31bc6aeaab1d ("sched/fair: Optimize
update_blocked_averages()") (it solves reported performance issues, and
it's way too old by now :-).
> set cfs_rq->on_list=2 inside that lock under your code change? If we
> then treat on_list=2
> as "not on list, and do not add"?
The possibilities I see for the current problem:
1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
(Not exclusive with the other suggestions, rather a stop-gap for the
time being.)
2) Don't add offlined task_groups into the undecayed list
- Your proposal with the overloaded on_list=2 could serve as a mark of
that, but it's a hack IMO.
- The proper way (tm) would be to use css_tryget_online() and css_put()
when dealing with the list (my favorite at the moment).
3) Narrowing the race-window dramatically
- that is, by moving list removal from unregister_fair_sched_group() to
free_fair_sched_group(),
- <del>or use list_empty(tg->list) as an indicator of whether we're
working with an onlined task_group.</del> (won't work for an RCU list)
4) Rework how throttled load is handled (hand waving)
- There is remove_entity_load_avg() that moves the load to the parent
upon final removal. Maybe it could be generalized for temporary removals
due to throttling (so that unthrottling would again add only non-empty
cfs_rqs to the list and undecayed load wouldn't skew fairness).
- or the way of [1].
5) <your ideas>
Opinions?
Thanks,
Michal
[1] https://lore.kernel.org/lkml/CAFpoUr1AO_qStNOYrFWGnFfc=uSFrXSYD8A5cQ8h0t2pioQzDA@xxxxxxxxxxxxxx/
Attachment:
run2.sh
Description: Bourne shell script