Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

From: bsegall
Date: Tue Mar 05 2019 - 13:49:57 EST


Phil Auld <pauld@xxxxxxxxxx> writes:

>> >
>> > raw_spin_lock(&cfs_b->lock);
>> > for (;;) {
>> > overrun = hrtimer_forward_now(timer, cfs_b->period);
>> > if (!overrun)
>> > break;
>> >
>> > idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> > }
>> > if (idle)
>> > cfs_b->period_active = 0;
>> > raw_spin_unlock(&cfs_b->lock);
>> >
>> >
>> > There's no guarantee that hrtimer_forward_now() will ever return 0 and also
>> > do_sched_cfs_period_timer() will drop and re-acquire the cfs_b->lock to call
>> > distribute_cfs_runtime.
>> >
>> > I considered code to tune the period and quota up (proportionally so they have the same
>> > relative effect) but I did not like that because of the above statement and the fact that
>> > it would be changing the valid values the user configured. I have two versions that do that
>> > differently but they both still call out for protection from the forever loop. I think they
>> > add complexity for no real gain.
>> >
>> > For my current testing I'm using a more direct but less elegant approach of simply limiting
>> > the number of times the loop can execute. This has the potential to skip handling a period
>> > I think but will prevent the problem reliably. I'm not sure how many iterations this loop
>> > was expected to take. I'm seeing numbers in the low thousands at times.
>>
>> I mean the answer should be "do_sched_cfs_period_timer runs once" the
>> vast majority of the time; if it's not then your machine/setup can't
>> handle whatever super low period you've set, or there's some bad
>> behavior somewhere in the period timer handling.
>> CFS_PERIOD_TIMER_EXIT_COUNT could reasonably be like 2 or 3 - this would
>> mean that you've already spent an entire period inside the handler.
>>
>
> Thanks for looking at this.
>
> That's sort of what I though, too. Would it not make more sense to get rid
> of the loop? I find forever loops inherently suspect.

I mean, probably better to have a loop than copy paste the code a couple
of times; you could rework the condition in your patch to be part of the
loop condition in general, though it might not be any clearer (heck,
the loop /could/ be for (;overrun = hrtimer_forward_now(...);) { ... },
it's just that's kinda questionable about whether it's clearer).

>
> I think the fact that we drop the lock to do the distribue is the real cuplrit.
> It's not do_sched_cfs_period_timer()'s code itself that takes the time,
> I think, it's the re-acquire of the cfs_b->lock and the contention on it due
> to all the cgroups. Lock_stat smaples during that run show pretty high contention
> on that lock and some really long wait times.

Tons of cgroups shouldn't increase contention; cfs_b->lock is
per-cgroup, and I don't see or remember anything offhand where it even
should be worse for deep nesting. Large machines with many cpus where
threads in the cgroup are running on each make it worse of course, and
tons of cgroups with throttling enabled have O(log n) cost on hrtimers
insertion and of course O(n) cost in actual runtime in irq handlers.

If you're seeing increasing contention as the cgroup depth or cgroup
count increases, that may be a good thing to fix regardless.

The lock has to be dropped due to lock ordering vs rq locks, and the
reverse order wouldn't be possible. That said, each cfs_rq unthrottle in
distribute grabs the lock, and then that cpu will grab the lock when it
wakes up, which can be while we're still in distribute. I'm not sure if
it would be possible to move the resched_curr calls until after doing
all the rest of unthrottle, and even if we did then we'd be taking each
rq lock twice, which wouldn't be great either. It might at least be
possible to coalesce all the cfs_b accounting in unthrottle to be done
in a single locked section, but I don't know if that would actually
help; it would still all have to be serialized regardless after all.

>
> My reproducer is artificial, but is designed to trigger the issue as has
> been hit in various customer workloads. So yes, it's exaggerated, but only
> so I don't have to wait weeks between hits :)
>
>
> Thanks,
> Phil
>
>> >
>> > A more complete fix to make sure do_sched_cfs_period_timer never takes longer than period
>> > would be good but I'm not sure what that would be and we still have this potential forever
>> > loop.
>> >
>> > Below is the bandaid version.
>> >
>> > Thoughts?
>> >
>> >
>> > Cheers,
>> > Phil
>> >
>> > ---
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 310d0637fe4b..33e55620f891 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -4859,12 +4859,16 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
>> > return HRTIMER_NORESTART;
>> > }
>> >
>> > +/* This is pretty arbitrary just to escape the "forever" loop before the watchdog
>> > + * kills us as there is no guarantee hrtimer_forward_now ever returns 0. */
>> > +#define CFS_PERIOD_TIMER_EXIT_COUNT 100
>> > static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> > {
>> > struct cfs_bandwidth *cfs_b =
>> > container_of(timer, struct cfs_bandwidth, period_timer);
>> > int overrun;
>> > int idle = 0;
>> > + int count = 0;
>> >
>> > raw_spin_lock(&cfs_b->lock);
>> > for (;;) {
>> > @@ -4872,6 +4876,14 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> > if (!overrun)
>> > break;
>> >
>> > + if (++count > CFS_PERIOD_TIMER_EXIT_COUNT) {
>> > + pr_warn_ratelimited("cfs_period_timer(%d): too many overruns. Consider raising cfs_period_us (%lld)\n",
>> > + smp_processor_id(), cfs_b->period/NSEC_PER_USEC);
>> > + // Make sure we restart the timer.
>> > + idle = 0;
>> > + break;
>> > + }
>> > +
>> > idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> > }
>> > if (idle)