Re: [PATCH] sched/fair: Limit sched_cfs_period_timer loop to avoid hard lockup

From: Phil Auld
Date: Fri Mar 15 2019 - 09:51:29 EST


On Fri, Mar 15, 2019 at 11:33:57AM +0100 Peter Zijlstra wrote:
> On Fri, Mar 15, 2019 at 11:11:50AM +0100, Peter Zijlstra wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ea74d43924b2..b71557be6b42 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> > return HRTIMER_NORESTART;
> > }
> >
> > +extern const u64 max_cfs_quota_period;
> > +
> > static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > {
> > struct cfs_bandwidth *cfs_b =
> > @@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > unsigned long flags;
> > int overrun;
> > int idle = 0;
> > + int count = 0;
> >
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > for (;;) {
> > @@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > if (!overrun)
> > break;
> >
> > + if (++count > 3) {
> > + u64 new, old = ktime_to_ns(cfs_b->period);
> > +
> > + new = (old * 147) / 128; /* ~115% */
> > + new = min(new, max_cfs_quota_period);
>
> Also, we can still engineer things to come unstuck; if we explicitly
> configure period at 1e9 and then set a really small quota and then
> create this insane amount of cgroups you have..
>
> this code has no room to manoeuvre left.
>
> Do we want to do anything about that? Or leave it as is, don't do that
> then?
>

With a 1s period it would be hard to make this loop fire repeatedly. I don't think it
depends much on the quota, other than needing some rqs to be throttled. A small quota
would also mean fewer of them get unthrottled per distribute call. You'd probably
need _significantly_ more cgroups than my insane 2500 to hit it.

Right now, with 2500 cgroups, it settles out with a new period of ~12-15ms. So getting
to 1s would take ~200,000 cgroups?
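
The ~200,000 figure is a rough linear extrapolation from the 2500-cgroup case. A
minimal sketch of that arithmetic, assuming the period at which the loop settles
scales linearly with the number of cgroups (cgroups_needed() is a hypothetical
helper for illustration, not kernel code):

```c
#include <stdint.h>

/*
 * Hypothetical helper: given that `observed_cgroups` cgroups settle at
 * `settled_period_ns`, estimate how many cgroups it would take before the
 * loop only settles at `target_period_ns`.
 *
 * Assumption: the settled period grows linearly with the cgroup count.
 */
static uint64_t cgroups_needed(uint64_t observed_cgroups,
			       uint64_t settled_period_ns,
			       uint64_t target_period_ns)
{
	return observed_cgroups * target_period_ns / settled_period_ns;
}
```

With 2500 cgroups and a 12.5ms settled period, cgroups_needed(2500, 12500000,
1000000000) gives 200000. The 12ms and 15ms ends of the range give roughly 208,000
and 167,000.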

Ben and I talked a little about this in another thread. I think hitting this is enough of
an edge case that this approach will make the problem go away. The only alternative we
came up with to reduce the time taken in unthrottle involved adding a fair bit of
complexity to the everyday code paths, and it might not help if the children all had
their own quota/period settings active.

Thoughts?


Cheers,
Phil



> > +
> > + cfs_b->period = ns_to_ktime(new);
> > +
> > + /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
> > + cfs_b->quota *= new;
> > + cfs_b->quota /= old;
> > +
> > + pr_warn_ratelimited(
> > + "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > + smp_processor_id(),
> > + new/NSEC_PER_USEC,
> > + cfs_b->quota/NSEC_PER_USEC);
> > +
> > + /* reset count so we don't come right back in here */
> > + count = 0;
> > + }
> > +
> > idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
> > }
> > if (idle)
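
To illustrate the scaling step in the hunk quoted above, here is a user-space sketch,
not the kernel code itself. struct bw and scale_up() are hypothetical names; the
147/128 factor and the 1s cap are taken from the quoted patch. It also shows the case
Peter raised: once the period is already at the cap, the step changes nothing.

```c
#include <stdint.h>

/* 1s cap, per the "since max is 1s" comment in the quoted hunk. */
#define MAX_PERIOD_NS 1000000000ULL

/* Hypothetical stand-in for the period/quota pair in struct cfs_bandwidth. */
struct bw {
	uint64_t period_ns;
	uint64_t quota_ns;
};

/* Grow the period by ~115% (147/128), clamp to the cap, keep the ratio. */
static void scale_up(struct bw *b)
{
	uint64_t old = b->period_ns;
	uint64_t new = old * 147 / 128;

	if (new > MAX_PERIOD_NS)
		new = MAX_PERIOD_NS;

	b->period_ns = new;

	/* Assumes quota <= 1s, so quota * new stays below 1e18 and fits in u64. */
	b->quota_ns = b->quota_ns * new / old;
}
```

Starting from a 1ms period and 0.5ms quota, one call yields 1148437ns and 574218ns.
Starting from a period of 1s, the call leaves both values unchanged.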
