Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

From: Phil Auld
Date: Wed Mar 13 2019 - 14:50:31 EST


On Wed, Mar 13, 2019 at 10:44:09AM -0700 bsegall@xxxxxxxxxx wrote:
> Phil Auld <pauld@xxxxxxxxxx> writes:
>
> > On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
> >> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@xxxxxxxxxx wrote:
> >> > Letting it spin for 100ms and then only increasing by 6% seems extremely
> >> > generous. If we went this route I'd probably say "after looping N
> >> > times, set the period to time taken / N + X%" where N is like 8 or
> >> > something. I think I'd probably prefer something like this to the
> >> > previous "just abort and let it happen again next interrupt" one.
> >>
> >> Okay. I'll try to spin something up that does this. It may be a little
> >> trickier to keep the quota proportional to the new period. I think that's
> >> important since we'll be changing the user's setting.
> >>
> >> Do you mean to have it break when it hits N and recalculates the period or
> >> reset the counter and keep going?
> >>
> >
> > Let me know what you think of the below. It's working nicely. I like your
> > suggestion to limit it quickly based on number of loops and use that to
> > scale up. I think it is best to break out and let it fire again if needed.
> > The warning fires once, very occasionally twice, and then things are quiet.
> >
> > If that looks reasonable I'll do some more testing and spin it up as a real
> > patch submission.
>
> Yeah, this looks reasonable. I should probably see how unreasonable the
> other thing would be, but if your previous periods were kinda small (and
> it's just that the machine crashing isn't an ok failure mode) I suppose
> it's not a big deal.
>

I posted it a little while ago. The periods were tiny (~2000us vs a minimum
of 1000us), with 2500 mostly unused child cgroups (I didn't narrow that
down much, but it still reproduced with 1250 children). That's why I was
thinking of this as an edge case. It also requires a fairly small quota and
load to make sure cfs_rqs get throttled.

I'm still wrapping my head around the scheduler code but I'd be happy to
try it the other way if you can give me a bit more description of what
you have in mind. Also happy to test a patch with my repro.


Cheers,
Phil


> >
> > Cheers,
> > Phil
> > ---
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 310d0637fe4b..54b30adfc89e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> >  	return HRTIMER_NORESTART;
> >  }
> >
> > +extern const u64 max_cfs_quota_period;
> > +int cfs_period_autotune_loop_limit = 8;
> > +int cfs_period_autotune_cushion_pct = 15; /* percentage added to period recalculation */
> > +
> >  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> >  {
> >  	struct cfs_bandwidth *cfs_b =
> >  		container_of(timer, struct cfs_bandwidth, period_timer);
> > +	s64 nsstart, nsnow, new_period;
> >  	int overrun;
> >  	int idle = 0;
> > +	int count = 0;
> >
> >  	raw_spin_lock(&cfs_b->lock);
> > +	nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
> >  	for (;;) {
> >  		overrun = hrtimer_forward_now(timer, cfs_b->period);
> >  		if (!overrun)
> >  			break;
> >
> > +		if (++count > cfs_period_autotune_loop_limit) {
> > +			ktime_t old_period = ktime_to_ns(cfs_b->period);
> > +
> > +			nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> > +			new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> > +
> > +			/* Make sure new period will be larger than old. */
> > +			if (new_period < old_period) {
> > +				new_period = old_period;
> > +			}
> > +			new_period += (new_period * cfs_period_autotune_cushion_pct) / 100;
>
> This ordering means that it will always increase by at least 15%. This
> is a bit odd but probably a good thing; I'd just change the comment to
> make it clear this is deliberate.
>
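For reference, the recalculation quoted above can be modelled in plain
userspace C. This is only a sketch, not kernel code: autotune_period_ns()
and its parameter names are made up for illustration, and the cushion and
maximum are passed in as arguments instead of being read from the patch's
globals.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the period recalculation in the quoted
 * patch. All names here are illustrative; none of them exist in the kernel.
 *
 * elapsed_ns:    time spent looping in the timer callback
 * loops:         number of loop iterations that time covers
 * old_period_ns: the period currently configured
 * cushion_pct:   percentage added on top of the chosen base
 * max_period_ns: upper bound for the result
 */
static int64_t autotune_period_ns(int64_t elapsed_ns, int loops,
				  int64_t old_period_ns, int cushion_pct,
				  int64_t max_period_ns)
{
	int64_t new_period;

	assert(loops > 0);
	assert(old_period_ns > 0);

	/* Average time per loop iteration. */
	new_period = elapsed_ns / loops;

	/* Never start from less than the old period... */
	if (new_period < old_period_ns)
		new_period = old_period_ns;

	/* ...so the cushion always grows the period by at least cushion_pct. */
	new_period += (new_period * cushion_pct) / 100;

	/* Clamp to the allowed maximum. */
	if (new_period > max_period_ns)
		new_period = max_period_ns;

	return new_period;
}
```

With elapsed_ns = 8000000, loops = 8, old_period_ns = 2000000 and
cushion_pct = 15, the measured average (1000000) is below the old period,
so the result is 2000000 plus 15%, i.e. 2300000, assuming the maximum is
larger than that.
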
> > +
> > +			if (new_period > max_cfs_quota_period)
> > +				new_period = max_cfs_quota_period;
> > +
> > +			cfs_b->period = ns_to_ktime(new_period);
> > +			cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;
>
> In general it makes sense to do fixed point via 1024 or something that
> can be optimized into shifts (and a larger number is better in general
> for better precision).
>
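One way the fixed-point suggestion could look, as a userspace sketch.
scale_quota_ns() and SCALE_SHIFT are hypothetical names, not taken from the
patch. The idea is to express new_period / old_period as a ratio scaled by
1024, so the final division becomes a right shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: 1 << 10 == 1024. */
#define SCALE_SHIFT 10

/*
 * Scale quota_ns by new_period_ns / old_period_ns using a 1024-based
 * fixed-point ratio. Illustrative only: this assumes new_period_ns is
 * small enough to be shifted left by SCALE_SHIFT without overflow and
 * that quota_ns * ratio fits in 64 bits.
 */
static uint64_t scale_quota_ns(uint64_t quota_ns, uint64_t old_period_ns,
			       uint64_t new_period_ns)
{
	uint64_t ratio;

	assert(old_period_ns > 0);

	/* ratio == 1024 means "unchanged", 2048 means "doubled", etc. */
	ratio = (new_period_ns << SCALE_SHIFT) / old_period_ns;

	/* Multiply first, then drop the fixed-point scale with a shift. */
	return (quota_ns * ratio) >> SCALE_SHIFT;
}
```

Doubling the period (2000000 to 4000000) gives a ratio of 2048, so a quota
of 1000000 becomes 2000000. An unchanged period gives a ratio of 1024 and
leaves the quota as it was.
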
> > +			pr_warn_ratelimited(
> > +				"cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > +				smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> > +
> > +			idle = 0;
> > +			break;
> > +		}
> > +
> >  		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> >  	}
> >  	if (idle)
