Re: INFO: rcu detected stall in do_idle

From: luca abeni
Date: Tue Oct 30 2018 - 07:08:16 EST


Hi Peter,

On Tue, 30 Oct 2018 11:45:54 +0100
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
[...]
> > 2. This is related to the perf_event_open syscall the reproducer does
> > before becoming DEADLINE and entering the busy loop. Enabling
> > perf swevents generates a lot of hrtimer load that happens in the
> > reproducer task context. Now, DEADLINE uses rq_clock() for
> > setting deadlines, but rq_clock_task() for doing runtime
> > enforcement. In a situation like this it seems that the amount of
> > irq pressure becomes pretty big (I'm seeing this on kvm, real hw
> > should maybe do better, pain point remains I guess), so rq_clock()
> > and rq_clock_task() might become more and more skewed w.r.t. each
> > other. Since rq_clock() is only used when setting absolute
> > deadlines for the first time (or when resetting them in certain
> > cases), after a bit the replenishment code will start to see
> > postponed deadlines always in the past w.r.t. rq_clock(). And this
> > brings us back to the fact that the task is never stopped, since it
> > can't keep up with rq_clock().
> >
> > - Not sure yet how we want to address this [1]. We could use
> > rq_clock() everywhere, but tasks might be penalized by irq
> > pressure (theoretically this would mandate that irqs are
> > explicitly accounted for I guess). I tried to use the skew
> > between the two clocks to "fix" deadlines, but that puts us at
> > risk of de-synchronizing userspace and kernel views of deadlines.
>
> Hurm.. right. We knew of this issue back when we did it.
> I suppose now it hurts and we need to figure something out.
>
> By virtue of being a real-time class, we do indeed need to have
> deadline on the wall-clock. But if we then don't account runtime on
> that same clock, but on a potentially slower clock, we get the
> problem that we can run longer than our period/deadline, which is
> what we're running into here I suppose.

I might be hugely misunderstanding something here, but my impression
is that the issue is just this: if the IRQ time is not accounted to the
-deadline task, then the non-deadline tasks might be starved.

I do not see this as a skew between two clocks, but as an accounting
thing:
- if we decide that the IRQ time is accounted to the -deadline
  task (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING disabled),
  then the non-deadline tasks are not starved (but of course the
  -deadline task executes for less than its reserved time in the
  period);
- if we decide that the IRQ time is not accounted to the -deadline task
  (this is what happens with CONFIG_IRQ_TIME_ACCOUNTING enabled), then
  the -deadline task executes for the expected amount of time (about
  60% of the CPU time), but an IRQ load of 40% will starve non-deadline
  tasks (this is what happens in the bug that triggered this discussion).

I think this might be seen as an admission control issue: when
CONFIG_IRQ_TIME_ACCOUNTING is disabled, the IRQ time is accounted for
in the admission control (because it ends up in the task's runtime),
but when CONFIG_IRQ_TIME_ACCOUNTING is enabled the IRQ time is not
accounted for in the admission test (the IRQ handler becomes some sort
of entity with a higher priority than -deadline tasks, on which no
accounting or enforcement is performed).
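
To make the arithmetic of the two cases concrete, here is a rough
userspace sketch (not kernel code). The struct cpu_share and the
split_cpu() helper are hypothetical names invented for this
illustration, and the model is deliberately crude: everything is a
percentage of one CPU, and all of the IRQ load is assumed to arrive
while the -deadline task is running.

```c
/* Hypothetical model: every field is a percentage of one CPU. */
struct cpu_share {
	int dl_task;	/* time the -deadline task actually executes */
	int irq;	/* time spent serving IRQs */
	int others;	/* time left for non-deadline tasks */
};

/*
 * reserved:            runtime/period of the -deadline task, in percent
 * irq:                 IRQ load, in percent
 * irq_time_accounting: models CONFIG_IRQ_TIME_ACCOUNTING (0 = disabled)
 */
static struct cpu_share split_cpu(int reserved, int irq,
				  int irq_time_accounting)
{
	struct cpu_share s;

	s.irq = irq;
	if (irq_time_accounting) {
		/* IRQ time is not charged: the task gets its full budget. */
		s.dl_task = reserved;
	} else {
		/* IRQ time is charged to the task and eats its budget. */
		s.dl_task = reserved > irq ? reserved - irq : 0;
	}

	/* Whatever is not used by the task or by IRQs is left to others. */
	s.others = 100 - s.dl_task - s.irq;
	if (s.others < 0)
		s.others = 0;

	return s;
}
```

With a 60% reservation and a 40% IRQ load, split_cpu(60, 40, 0) leaves
40% to the non-deadline tasks (the -deadline task only executes for
20%), while split_cpu(60, 40, 1) leaves them 0%, which is the
starvation described above.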



> And yes, at some point RT workloads need to be aware of the jitter
> injected by things like IRQs and such. But I believe the rationale was
> that for soft real-time workloads this current semantic was 'easier'
> because we get to ignore IRQ overhead for workload estimation etc.
>
> What we could maybe do is track runtime in both rq_clock_task() and
> rq_clock() and detect where the rq_clock based one exceeds the period
> and then push out the deadline (and add runtime).
>
> Maybe something along such lines; does that make sense?

Uhm... I have to study and test your patch... I'll comment on this
later.



Thanks,
Luca


>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/deadline.c | 53 ++++++++++++++++++++++++++++++++-----------------
> 2 files changed, 38 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8f8a5418b627..6aec81cb3d2e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -522,6 +522,9 @@ struct sched_dl_entity {
> u64 deadline; /* Absolute deadline for this instance */
> unsigned int flags; /* Specifying the scheduler behaviour */
> + u64 wallstamp;
> + s64 walltime;
> +
> /*
> * Some bool flags:
> *
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 91e4202b0634..633c8f36c700 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -683,16 +683,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
> if (dl_se->dl_yielded && dl_se->runtime > 0)
> dl_se->runtime = 0;
>
> - /*
> - * We keep moving the deadline away until we get some
> - * available runtime for the entity. This ensures correct
> - * handling of situations where the runtime overrun is
> - * arbitrary large.
> - */
> - while (dl_se->runtime <= 0) {
> - dl_se->deadline += pi_se->dl_period;
> - dl_se->runtime += pi_se->dl_runtime;
> - }
> + /* XXX what do we do with pi_se */
>
> /*
> * At this point, the deadline really should be "in
> @@ -1148,9 +1139,9 @@ static void update_curr_dl(struct rq *rq)
> {
> struct task_struct *curr = rq->curr;
> struct sched_dl_entity *dl_se = &curr->dl;
> - u64 delta_exec, scaled_delta_exec;
> + u64 delta_exec, scaled_delta_exec, delta_wall;
> int cpu = cpu_of(rq);
> - u64 now;
> + u64 now, wall;
>
> if (!dl_task(curr) || !on_dl_rq(dl_se))
> return;
> @@ -1171,6 +1162,17 @@ static void update_curr_dl(struct rq *rq)
> return;
> }
>
> + wall = rq_clock();
> + delta_wall = wall - dl_se->wallstamp;
> + if (delta_wall > 0) {
> + dl_se->walltime += delta_wall;
> + dl_se->wallstamp = wall;
> + }
> +
> + /* check if rq_clock_task() has been too slow */
> + if (unlikely(dl_se->walltime > dl_se->period))
> + goto throttle;
> +
> schedstat_set(curr->se.statistics.exec_max,
> max(curr->se.statistics.exec_max, delta_exec));
>
> @@ -1204,14 +1206,27 @@ static void update_curr_dl(struct rq *rq)
>
> dl_se->runtime -= scaled_delta_exec;
>
> -throttle:
> if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> +throttle:
> dl_se->dl_throttled = 1;
>
> - /* If requested, inform the user about runtime overruns. */
> - if (dl_runtime_exceeded(dl_se) &&
> - (dl_se->flags & SCHED_FLAG_DL_OVERRUN))
> - dl_se->dl_overrun = 1;
> + if (dl_runtime_exceeded(dl_se)) {
> + /* If requested, inform the user about runtime overruns. */
> + if (dl_se->flags & SCHED_FLAG_DL_OVERRUN)
> + dl_se->dl_overrun = 1;
> +
> + }
> +
> + /*
> + * We keep moving the deadline away until we get some available
> + * runtime for the entity. This ensures correct handling of
> + * situations where the runtime overrun is arbitrary large.
> + */
> + while (dl_se->runtime <= 0 || dl_se->walltime > dl_se->period) {
> + dl_se->deadline += dl_se->dl_period;
> + dl_se->runtime += dl_se->dl_runtime;
> + dl_se->walltime -= dl_se->dl_period;
> + }
>
> __dequeue_task_dl(rq, curr, 0);
> if (unlikely(dl_se->dl_boosted || !start_dl_timer(curr)))
> @@ -1751,9 +1766,10 @@ pick_next_task_dl(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> p = dl_task_of(dl_se);
> p->se.exec_start = rq_clock_task(rq);
> + dl_se->wallstamp = rq_clock(rq);
>
> /* Running task will never be pushed. */
> - dequeue_pushable_dl_task(rq, p);
> + dequeue_pushable_dl_task(rq, p);
>
> if (hrtick_enabled(rq))
> start_hrtick_dl(rq, p);
> @@ -1811,6 +1827,7 @@ static void set_curr_task_dl(struct rq *rq)
> struct task_struct *p = rq->curr;
>
> p->se.exec_start = rq_clock_task(rq);
> + p->dl_se.wallstamp = rq_clock(rq);
>
> /* You can't push away the running task */
> dequeue_pushable_dl_task(rq, p);