Re: [RFC] sched/fair: hard lockup in sched_cfs_period_timer

From: bsegall
Date: Wed Mar 13 2019 - 13:44:14 EST


Phil Auld <pauld@xxxxxxxxxx> writes:

> On Mon, Mar 11, 2019 at 04:25:36PM -0400 Phil Auld wrote:
>> On Mon, Mar 11, 2019 at 10:44:25AM -0700 bsegall@xxxxxxxxxx wrote:
>> > Letting it spin for 100ms and then only increasing by 6% seems extremely
>> > generous. If we went this route I'd probably say "after looping N
>> > times, set the period to time taken / N + X%" where N is like 8 or
>> > something. I think I'd probably perfer something like this to the
>> > previous "just abort and let it happen again next interrupt" one.
>>
>> Okay. I'll try to spin something up that does this. It may be a little
>> trickier to keep the quota proportional to the new period. I think that's
>> important since we'll be changing the user's setting.
>>
>> Do you mean to have it break when it hits N and recalculates the period or
>> reset the counter and keep going?
>>
>
> Let me know what you think of the below. It's working nicely. I like your
> suggestion to limit it quickly based on number of loops and use that to
> scale up. I think it is best to break out and let it fire again if needed.
> The warning fires once, very occasionally twice, and then things are quiet.
>
> If that looks reasonable I'll do some more testing and spin it up as a real
> patch submission.

Yeah, this looks reasonable. I should probably see how unreasonable the
other thing would be, but if your previous periods were kinda small (and
it's just that the machine crashing isn't an ok failure mode) I suppose
it's not a big deal.

>
> Cheers,
> Phil
> ---
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 310d0637fe4b..54b30adfc89e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4859,19 +4859,51 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
> return HRTIMER_NORESTART;
> }
>
> +extern const u64 max_cfs_quota_period;
> +int cfs_period_autotune_loop_limit = 8;
> +int cfs_period_autotune_cushion_pct = 15; /* percentage added to period recalculation */
> +
> static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> {
> struct cfs_bandwidth *cfs_b =
> container_of(timer, struct cfs_bandwidth, period_timer);
> + s64 nsstart, nsnow, new_period;
> int overrun;
> int idle = 0;
> + int count = 0;
>
> raw_spin_lock(&cfs_b->lock);
> + nsstart = ktime_to_ns(hrtimer_cb_get_time(timer));
> for (;;) {
> overrun = hrtimer_forward_now(timer, cfs_b->period);
> if (!overrun)
> break;
>
> + if (++count > cfs_period_autotune_loop_limit) {
> + ktime_t old_period = ktime_to_ns(cfs_b->period);
> +
> + nsnow = ktime_to_ns(hrtimer_cb_get_time(timer));
> + new_period = (nsnow - nsstart)/cfs_period_autotune_loop_limit;
> +
> + /* Make sure new period will be larger than old. */
> + if (new_period < old_period) {
> + new_period = old_period;
> + }
> + new_period += (new_period * cfs_period_autotune_cushion_pct) / 100;

This ordering means that it will always increase by at least 15%. This
is a bit odd but probably a good thing; I'd just change the comment to
make it clear this is deliberate.

> +
> + if (new_period > max_cfs_quota_period)
> + new_period = max_cfs_quota_period;
> +
> + cfs_b->period = ns_to_ktime(new_period);
> + cfs_b->quota += (cfs_b->quota * ((new_period - old_period) * 100)/old_period)/100;

In general it makes sense to do fixed point via 1024 or something that
can be optimized into shifts (and a larger number is better in general
for better precision).

> + pr_warn_ratelimited(
> + "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
> + smp_processor_id(), cfs_b->period/NSEC_PER_USEC, cfs_b->quota/NSEC_PER_USEC);
> +
> + idle = 0;
> + break;
> + }
> +
> idle = do_sched_cfs_period_timer(cfs_b, overrun);
> }
> if (idle)