Re: INFO: rcu detected stall in do_idle

From: Juri Lelli
Date: Wed Oct 24 2018 - 08:03:57 EST


On 19/10/18 22:50, luca abeni wrote:
> On Fri, 19 Oct 2018 13:39:42 +0200
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> > On Thu, Oct 18, 2018 at 01:08:11PM +0200, luca abeni wrote:
> > > Ok, I see the issue now: the problem is that the "while
> > > (dl_se->runtime <= 0)" loop is executed at replenishment time, but
> > > the deadline should be postponed at enforcement time.
> > >
> > > I mean: in update_curr_dl() we do:
> > > dl_se->runtime -= scaled_delta_exec;
> > > if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> > > ...
> > > enqueue replenishment timer at dl_next_period(dl_se)
> > > But dl_next_period() is based on a "wrong" deadline!
> > >
> > >
> > > I think that inserting a
> > > while (dl_se->runtime <= -pi_se->dl_runtime) {
> > > dl_se->deadline += pi_se->dl_period;
> > > dl_se->runtime += pi_se->dl_runtime;
> > > }
> > > immediately after "dl_se->runtime -= scaled_delta_exec;" would fix
> > > the problem, no?
> >
> > That certainly makes sense to me.
>
> Good; I'll try to work on this idea in the weekend.

So, we (me and Luca) managed to spend some more time on this and found a
few more things worth sharing. I'll try to summarize what we have got so
far (including what already discussed) because impression is that each
point might deserve a fix or at least consideration (just amazing how a
simple random fuzzer thing can highlight all that :). Apologies for the
long email.

Reproducer runs on a CONFIG_HZ=100, CONFIG_IRQ_TIME_ACCOUNTING kernel
and does something like this (only the bits that seems to matter here)

int main(void)
{
[...]
[setup stuff at 0x2001d000]
syscall(__NR_perf_event_open, 0x2001d000, 0, -1, -1, 0);
*(uint32_t*)0x20000000 = 0;
*(uint32_t*)0x20000004 = 6;
*(uint64_t*)0x20000008 = 0;
*(uint32_t*)0x20000010 = 0;
*(uint32_t*)0x20000014 = 0;
*(uint64_t*)0x20000018 = 0x9917; <-- ~40us
*(uint64_t*)0x20000020 = 0xffff; <-- ~65us (~60% bandwidth)
*(uint64_t*)0x20000028 = 0;
syscall(__NR_sched_setattr, 0, 0x20000000, 0);
[busy loop]
return 0;
}

And this causes problems because the task is actually never throttled.

Pain points:

1. Granularity of enforcement (at each tick) is huge compared with
the task runtime. This makes starting the replenishment timer,
when runtime is depleted, always to fail (because old deadline
is way in the past). So, the task is fully replenished and put
back to run.

- Luca's proposal should help here, since the deadline is postponed
at throttling time, and replenishment timer set to that (and it
should be in the future)

1.1 Even if we fix 1. in a configuration like this, the task would
still be able to run for ~10ms (worst case) and potentially starve
other tasks. It doesn't seem a too big interval maybe, but there
might be other very short activities that might miss an occasion
to run "quickly".

- Might be fixed by imposing (via sysctl) reasonable defaults for
minimum runtime (w.r.t. HZ, like HZ/2) and maximum for period
(as also a very small bandwidth task can have a big runtime if
period is big as well)

(1.2) When runtime becomes very negative (because delta_exec was big)
we seem to spend lot of time inside the replenishment loop.

- Not sure it's such a big problem, might need more profiling.
Feeling is that once the other points will be addressed this
won't matter anymore

2. This is related to perf_event_open syscall reproducer does before
becoming DEADLINE and entering the busy loop. Enabling of perf
swevents generates lot of hrtimers load that happens in the
reproducer task context. Now, DEADLINE uses rq_clock() for setting
deadlines, but rq_clock_task() for doing runtime enforcement.
In a situation like this it seems that the amount of irq pressure
becomes pretty big (I'm seeing this on kvm, real hw should maybe do
better, pain point remains I guess), so rq_clock() and
rq_clock_task() might become more a more skewed w.r.t. each other.
Since rq_clock() is only used when setting absolute deadlines for
the first time (or when resetting them in certain cases), after a
bit the replenishment code will start to see postponed deadlines
always in the past w.r.t. rq_clock(). And this brings us back to the
fact that the task is never stopped, since it can't keep up with
rq_clock().

- Not sure yet how we want to address this [1]. We could use
rq_clock() everywhere, but tasks might be penalized by irq
pressure (theoretically this would mandate that irqs are
explicitly accounted for I guess). I tried to use the skew between
the two clocks to "fix" deadlines, but that puts us at risks of
de-synchronizing userspace and kernel views of deadlines.

3. HRTICK is not started for new entities.

- Already got a patch for it.

This should be it, I hope. Luca (thanks a lot for your help) and please
add or correct me if I was wrong.

Thoughts?

Best,

- Juri

1 - https://elixir.bootlin.com/linux/latest/source/kernel/sched/deadline.c#L1162