Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock

From: Peter Zijlstra
Date: Wed May 21 2014 - 03:59:04 EST

Next message: Wolfram Sang: "Re: [PATCH v4 1/2] i2c: add DMA support for freescale i2c driver"
Previous message: çéæ: "ååï [PATCH] arch: unicore32: kernel: ksyms: remove 'bswapsi2' and 'muldi3' to avoid compiling failure"
In reply to: bsegall: "Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock"
Next in thread: Peter Zijlstra: "Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, May 20, 2014 at 12:08:34PM -0700, bsegall@xxxxxxxxxx wrote:
> Yeah, I believe the whole timer_active/cancel-before-start was to not
> lose the timer while still avoiding the HRTIMER_RESTART/hrtimer_start
> race. I'm not sure what, if anything, all the other ~35 users of
> HRTIMER_RESTART do about this. At least some do hrtimer_cancel();
> hrtimer_start(); though.

Yeah, I've not yet had the courage to go look through the tree :-/

That said, I've got a growing suspicion that I'm missing something
obvious and simple here.

> So, this and the other changes will ensure that a period timer is always
> queued. However, this also means that if we race so that we take
> cfs_b->lock before the period timer, but late enough that the interrupt
> has fired and we see !queued, we will hrtimer_forward the entire period
> here.

Right you are, missed that.

> What assign_cfs_rq_runtime does is insufficient - it doesn't
> unthrottle any other cfs_rqs, and since start_bandwidth_timer
> hrtimer_forwarded the period away, neither will
> (do_)sched_cfs_period_timer. (In addition throttle doesn't even do that,
> but that's probably easy enough to fix)

So I had a closer look at the bandwidth bits because of this, and it
strikes me as odd that __refill_cfs_bandwidth_runtime() does an
unconditional fill of cfs_b->runtime.

What I would have expected to see is something like:

cfs_b->runtime += overruns * cfs_b->quota;
if (cfs_b->runtime > cfs_b->quota)
cfs_b->runtime = cfs_b->quota;

Which refills the runtime based on the overruns that happened and
respects the negative runtime accrued from previous inaccuracies.

If you unconditionally top-up like currently happens its possible to
consistently use more time than specified; due to the 1 jiffy throttle
granularity.

> I suppose we could check hrtimer_active and if so record any periods
> start_bandwidth_timer used for sched_cfs_period_timer, which could then
> force a do_sched_cfs_period_timer call. Perhaps something like this
> (equally untested) patch on top of your patch.

You left the new ->stored_periods unused but the idea was clear enough.

Anyway, given that you don't actually use overruns for anything except
pointless stats we could cheat and assume any trigger of the handler is
due to at least 1 overrun, that should trigger the unthrottle code
methinks.

In any case, I think your tracking overruns like that does make sense,
and its something the rt.c code needs too, as it suffers from the same
problem, so maybe we can roll that into the generic
start_bandwidth_timer() thing.

/me goes see if he can track down his suspicion and find shiny simple
truth at the end of it.

Attachment: pgpSeFxJJEzco.pgp
Description: PGP signature

Next message: Wolfram Sang: "Re: [PATCH v4 1/2] i2c: add DMA support for freescale i2c driver"
Previous message: çéæ: "ååï [PATCH] arch: unicore32: kernel: ksyms: remove 'bswapsi2' and 'muldi3' to avoid compiling failure"
In reply to: bsegall: "Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock"
Next in thread: Peter Zijlstra: "Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]