Re: INFO: rcu detected stall in do_idle

From: Peter Zijlstra
Date: Wed Oct 31 2018 - 13:39:12 EST


On Wed, Oct 31, 2018 at 05:18:00PM +0100, Daniel Bristot de Oliveira wrote:
> Brazilian part of the Ph.D we are dealing with probabilistic worst case
> execution time, and to be able to use probabilistic methods, we need to remove
> the noise of the IRQs in the execution time [1]. So, IMHO, using
> CONFIG_IRQ_TIME_ACCOUNTING is a good thing.

> With this in mind: we do *not* use/have an exact admission test for all cases.
> By not having an exact admission test, we assume the user knows what he/she is
> doing. In this case, if they have a high load of IRQs... they need to know that:

So I mostly agree with the things you said; IRQs are a different
'context' or 'task' from the normal scheduled task. For AC (admission
control) we can consider an average IRQ load, etc.

But even if we get AC sufficient with an average IRQ load, there are
still the cases where the IRQs cluster. So, esp. for very short
runtimes, you can get this scenario.
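
To make the clustering point concrete, here is a toy user-space sketch. It
is not the kernel's admission test; the function names and the per-period
IRQ samples are made up for illustration. It shows a task that passes an
average-load check and still has a period in which IRQs leave it less than
its runtime.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical average-load admission check: accept the task if its
 * runtime plus the mean IRQ time per period fits in the period.
 * irq[i] is the IRQ time observed in period i (same unit as period).
 */
static bool admit_on_average(uint64_t runtime, uint64_t period,
			     const uint64_t *irq, size_t n)
{
	uint64_t sum = 0;

	if (n == 0)
		return runtime <= period;

	for (size_t i = 0; i < n; i++)
		sum += irq[i];

	return runtime + sum / n <= period;
}

/*
 * Count the periods in which clustered IRQs leave less than 'runtime'
 * of CPU time, i.e. where the task cannot get its budget no matter
 * what the scheduler does.
 */
static size_t periods_short_of_budget(uint64_t runtime, uint64_t period,
				      const uint64_t *irq, size_t n)
{
	size_t shortfalls = 0;

	for (size_t i = 0; i < n; i++) {
		if (irq[i] > period || period - irq[i] < runtime)
			shortfalls++;
	}

	return shortfalls;
}
```

With runtime = 100, period = 1000 and IRQ samples {50, 50, 950, 50}, the
average check accepts the task, yet the third period has only 50 units
left for it.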

> 1) Their periods should be consistent with the "interference" they might receive.
> 2) Their tasks can miss the deadline because of IRQs (and there is no way to
> avoid this without "throttling" IRQs...)

True.

> So, is it worth to put a duct tape for this case?

My point was mostly to not misbehave. Maybe I got it wrong, that
happens ;-)

> >> @@ -1171,6 +1162,17 @@ static void update_curr_dl(struct rq *rq)
> >>  		return;
> >>  	}
> >>
> >> +	wall = rq_clock();
> >> +	delta_wall = wall - dl_se->wallstamp;
> >> +	if (delta_wall > 0) {
> >> +		dl_se->walltime += delta_wall;
> >> +		dl_se->wallstamp = wall;
> >> +	}
> >> +
> >> +	/* check if rq_clock_task() has been too slow */
> >> +	if (unlikely(dl_se->walltime > dl_se->period))
> >> +		goto throttle;
> >> +
>
> If I got it correctly, it can be the case that we would throttle a thread that,
> because of IRQs, received less CPU time than expected, right?

So the thinking was that when the actual wall-clock runtime (which
includes IRQs) exceeds the period, our whole CBS model breaks and we
need to replenish and push forward the deadline.

Maybe it should do that and not throttle.

Also, the walltime -= period thing should be fixed to never go negative
I think.
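
Roughly, and only as a user-space sketch of the two points above (this is
not the patch and not kernel code; struct dl_sketch and both helpers are
hypothetical stand-ins, and the replenish step is reduced to "refill the
budget, move the deadline one period ahead"):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the deadline entity. */
struct dl_sketch {
	uint64_t wallstamp;	/* last wall-clock sample */
	uint64_t walltime;	/* wall time accumulated, IRQs included */
	uint64_t period;
	uint64_t dl_runtime;	/* budget granted per period */
	int64_t  runtime;	/* budget left */
	uint64_t deadline;	/* absolute deadline */
};

/* Consume one period of wall time, clamped so it never goes negative. */
static void walltime_consume_period(struct dl_sketch *se)
{
	if (se->walltime > se->period)
		se->walltime -= se->period;
	else
		se->walltime = 0;
}

/*
 * Account wall-clock time up to 'wall'. If the accumulated wall time
 * exceeds the period, replenish and push the deadline forward instead
 * of throttling. Returns true when a replenishment happened.
 */
static bool account_walltime(struct dl_sketch *se, uint64_t wall)
{
	if (wall > se->wallstamp) {
		se->walltime += wall - se->wallstamp;
		se->wallstamp = wall;
	}

	if (se->walltime <= se->period)
		return false;

	se->runtime = (int64_t)se->dl_runtime;
	se->deadline += se->period;
	walltime_consume_period(se);

	return true;
}
```

The clamp in walltime_consume_period() is what keeps the "walltime -=
period" step from underflowing if it is ever reached with less than a
full period accumulated.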