Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

From: bsegall
Date: Tue Mar 05 2019 - 15:45:42 EST


Phil Auld <pauld@xxxxxxxxxx> writes:

> On Tue, Mar 05, 2019 at 10:49:01AM -0800 bsegall@xxxxxxxxxx wrote:
>> Phil Auld <pauld@xxxxxxxxxx> writes:
>>
>> >> >
>> >> > raw_spin_lock(&cfs_b->lock);
>> >> > for (;;) {
>> >> > overrun = hrtimer_forward_now(timer, cfs_b->period);
>> >> > if (!overrun)
>> >> > break;
>> >> >
>> >> > idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> >> > }
>> >> > if (idle)
>> >> > cfs_b->period_active = 0;
>> >> > raw_spin_unlock(&cfs_b->lock);
>> >> >
>> >> >
>> >> > There's no guarantee that hrtimer_forward_now() will ever return 0 and also
>> >> > do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call
>> >> > distribute_cfs_runtime.
>> >> >
>> >> > I considered code to tune the period and quota up (proportionally so they have the same
>> >> > relative effect) but I did not like that because of the above statement and the fact that
>> >> > it would be changing the valid values the user configured. I have two versions that do that
>> >> > differently but they both still call out for protection from the forever loop. I think they
>> >> > add complexity for no real gain.
>> >> >
>> >> > For my current testing I'm using a more direct but less elegant approach of simply limiting
>> >> > the number of times the loop can execute. This has the potential to skip handling a period
>> >> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop
>> >> > was expected to take. I'm seeing numbers in the low thousands at times.
>> >>
>> >> I mean the answer should be "do_sched_cfs_period_timer runs once" the
>> >> vast majority of the time; if it's not then your machine/setup can't
>> >> handle whatever super low period you've set, or there's some bad
>> >> behavior somewhere in the period timer handling.
>> >> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
>> >> mean that you've already spent an entire period inside the handler.
>> >>
>> >
>> > Thanks for looking at this.
>> >
>> > That's sort of what I though, too. Would it not make more sense to get rid
>> > of the loop? I find forever loops inherently suspect.
>>
>> I mean, probably better to have a loop than copy paste the code a couple
>> of times; you could rework the condition in your patch to be part of the
>> loop condition in general, though it might not be any clearer (heck,
>> the loop /could/ be for (;overrun = hrtimer_forward_now(...);) { ... },
>> it's just that's kinda questionable about whether it's clearer).
>>
>
> So I'm not clear on this loop thing, though. If it's okay for the loop
> to run 2-3 times that implies it's okay for do_sched_cfs_period_timer()
> to take longer than a period to run. So if that's expected twice or
> thrice what would ensure it does not continually take that long?

So, the loop was originally copied from the rt period timer, and I'm not
sure we really thought that much about it. hrtimer_forward_now returing
!0 twice is probably best to allow, because the hrtimer could be
delayed most-but-not-all of a period; the third time means that the
fault is definitely in do_sched_cfs_period_timer though and if it was
too slow once there's no reason it could not continue to be slow. I was
just being vague on which of 2 or 3 winds up with that exact meaning.

>
>
>> >
>> > I think the fact that we drop the lock to do the distribue is the real cuplrit.
>> > It's not do_sched_cfs_period_timer()'s code itself that takes the time,
>> > I think, it's the re-acquire of the cfs_b->lock and the contention on it due
>> > to all the cgroups. Lock_stat smaples during that run show pretty high contention
>> > on that lock and some really long wait times.
>>
>> Tons of cgroups shouldn't increase contention; cfs_b->lock is
>> per-cgroup, and I don't see or remember anything offhand where it even
>> should be worse for deep nesting. Large machines with many cpus where
>> threads in the cgroup are running on each make it worse of course, and
>> tons of cgroups with throttling enabled have O(log n) cost on hrtimers
>> insertion and of course O(n) cost in actual runtime in irq handlers.
>
> The lock is per cgroup but it's taken in assign_cfs_rq_runtime() via
> __account_cfs_rq_runtime() from every non-idle cpu, I think.

Yes, that's proportional to cpus though, not cgroups.

>
>>
>> If you're seeing increasing contention as the cgroup depth or cgroup
>> count increases, that may be a good thing to fix regardless.
>
> I'll look at it some more.
>
> Interestingly, if I limit the number of child cgroups to the number of
> them I'm actually putting processes into (16 down from 2500) the problem
> does not reproduce.

That is indeed interesting, and definitely not something we'd want to
matter. (Particularly if it's not root->a->b->c...->throttled_cgroup or
root->throttled->a->...->thread vs root->throttled_cgroup, which is what
I was originally thinking of)

>
> Fwiw this is a 80 cpu (incl. smt) 4-numa system. So there are a lot
> of cfs_rqs that might be doing __account_cfs_rq_runtime().
>
> Here is lock_stat output from a short bit of the run when the pr_warn has come a couple of times.
> Edited for screen width. This is the 8th lock on the list. (I wonder in this case of the really long
> holds are the pr_warn firing.)

Yeah, it's probably best to check without the pr_warn.

>
> class name &cfs_b->lock:
> con-bounces 33770
> contentions 33776
> waittime-min 0.10
> waittime-max 48088.04
> waittime-total 856027.70
> waittime-avg 25.34
> acq-bounces 148379
> acquisitions 162184
> holdtime-min 0.00
> holdtime-max 48240.79
> holdtime-total 354683.04
> holdtime-avg 2.19
>
> ------------
> &cfs_b->lock 29414 [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
> &cfs_b->lock 4195 [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
> &cfs_b->lock 166 [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
> &cfs_b->lock 1 [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0
> ------------
> &cfs_b->lock 28602 [<00000000cfc57971>] __account_cfs_rq_runtime+0xd5/0x1a0
> &cfs_b->lock 3215 [<00000000754af0b8>] throttle_cfs_rq+0x193/0x2c0
> &cfs_b->lock 1938 [<00000000b6333ad0>] unthrottle_cfs_rq+0x54/0x1d0
> &cfs_b->lock 21 [<00000000fc0c15d2>] sched_cfs_period_timer+0xe6/0x1d0
>
>
>>
>> The lock has to be dropped due to lock ordering vs rq locks, and the
>> reverse order wouldn't be possible. That said, each cfs_rq unthrottle in
>> distribute grabs the lock, and then that cpu will grab the lock when it
>> wakes up, which can be while we're still in distribute. I'm not sure if
>> it would be possible to move the resched_curr calls until after doing
>> all the rest of unthrottle, and even if we did then we'd be taking each
>> rq lock twice, which wouldn't be great either. It might at least be
>> possible to coalesce all the cfs_b accounting in unthrottle to be done
>> in a single locked section, but I don't know if that would actually
>> help; it would still all have to be serialized regardless after all.
>
> Yeah, that makes sense, thanks.
>
> Cheers,
> Phil
>
>>
>> >
>> > My reproducer is artificial, but is designed to trigger the issue as has
>> > been hit in various customer workloads. So yes, it's exaggerated, but only
>> > so I don't have to wait weeks between hits :)
>> >
>> >
>> > Thanks,
>> > Phil
>> >
>> >> >
>> >> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period
>> >> > would be good but I'm not sure what that would be and we still have this potential forever
>> >> > loop.
>> >> >
>> >> > Below is the bandaid version.
>> >> >
>> >> > Thoughts?
>> >> >
>> >> >
>> >> > Cheers,
>> >> > Phil
>> >> >
>> >> > ---
>> >> >
>> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> > index 310d0637fe4b..33e55620f891 100644
>> >> > --- a/kernel/sched/fair.c
>> >> > +++ b/kernel/sched/fair.c
>> >> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>> >> > return HRTIMER_NORESTART;
>> >> > }
>> >> >
>> >> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
>> >> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
>> >> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
>> >> > static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> >> > {
>> >> > struct cfs_bandwidth *cfs_b =
>> >> > container_of(timer, struct cfs_bandwidth, period_timer);
>> >> > int overrun;
>> >> > int idle = 0;
>> >> > + int count = 0;
>> >> >
>> >> > raw_spin_lock(&cfs_b->lock);
>> >> > for (;;) {
>> >> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> >> > if (!overrun)
>> >> > break;
>> >> >
>> >> > + if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
>> >> > + pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
>> >> > + smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
>> >> > + // Make sure we restart the timer.
>> >> > + idle = 0;
>> >> > + break;
>> >> > + }
>> >> > +
>> >> > idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> >> > }
>> >> > if (idle)