Re: [RFC v3 2/6] Improve the tracking of active utilisation

From: luca abeni
Date: Tue Nov 01 2016 - 17:46:48 EST


Hi Juri,

On Tue, 1 Nov 2016 16:46:04 +0000
Juri Lelli <juri.lelli@xxxxxxx> wrote:
[...]
> > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> > index 3d95c1d..80d1541 100644
> > --- a/kernel/sched/deadline.c
> > +++ b/kernel/sched/deadline.c
> > @@ -47,6 +47,7 @@ static void add_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > {
> > u64 se_bw = dl_se->dl_bw;
> >
> > + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
>
> This and the one below go in 1/6.
Ok; I'll move them there.


>
> > dl_rq->running_bw += se_bw;
> > }
> >
> > @@ -54,11 +55,52 @@ static void sub_running_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> > {
> > u64 se_bw = dl_se->dl_bw;
> >
> > + lockdep_assert_held(&(rq_of_dl_rq(dl_rq))->lock);
> > dl_rq->running_bw -= se_bw;
> > if (WARN_ON(dl_rq->running_bw < 0))
> > dl_rq->running_bw = 0;
> > }
> >
> > +static void task_go_inactive(struct task_struct *p)
> > +{
> > + struct sched_dl_entity *dl_se = &p->dl;
> > + struct hrtimer *timer = &dl_se->inactive_timer;
> > + struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > + struct rq *rq = rq_of_dl_rq(dl_rq);
> > + s64 zerolag_time;
> > +
> > + WARN_ON(dl_se->dl_runtime == 0);
> > +
> > + /* If the inactive timer is already armed, return immediately */
> > + if (hrtimer_active(&dl_se->inactive_timer))
> > + return;
> > +
> > + zerolag_time = dl_se->deadline -
> > + div64_long((dl_se->runtime * dl_se->dl_period),
> > + dl_se->dl_runtime);
> > +
> > + /*
> > + * Using relative times instead of the absolute "0-lag time"
> > + * allows to simplify the code
> > + */
> > + zerolag_time -= rq_clock(rq);
> > +
> > + /*
> > + * If the "0-lag time" already passed, decrease the active
> > + * utilization now, instead of starting a timer
> > + */
> > + if (zerolag_time < 0) {
> > + sub_running_bw(dl_se, dl_rq);
> > + if (!dl_task(p))
> > + __dl_clear_params(p);
> > +
> > + return;
> > + }
> > +
> > + get_task_struct(p);
> > + hrtimer_start(timer, ns_to_ktime(zerolag_time), HRTIMER_MODE_REL);
> > +}
> > +
> > static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> > {
> > struct sched_dl_entity *dl_se = &p->dl;
> > @@ -514,7 +556,20 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
> > struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> > struct rq *rq = rq_of_dl_rq(dl_rq);
> >
> > - add_running_bw(dl_se, dl_rq);
> > + if (hrtimer_is_queued(&dl_se->inactive_timer)) {
> > + hrtimer_try_to_cancel(&dl_se->inactive_timer);
>
> Why we are OK with just trying to cancel the inactive timer?
This is my understanding: if the timer cannot be cancelled, it has either:
1) Already decreased the active utilisation (the timer handler is just
   finishing its execution). In this case, whether we cancel it or not
   makes no difference.
2) Still to decrease the active utilisation (it is probably waiting for
   the runqueue lock). In this case, the timer handler will find the
   task RUNNING, and will not decrease the active utilisation.

>
> > + WARN_ON(dl_task_of(dl_se)->nr_cpus_allowed > 1);
>
> What's wrong with nr_cpus_allowed > 1 tasks?
If nr_cpus_allowed > 1, then select_task_rq_dl() was executed before
the task was enqueued, and it already took care of cancelling the
inactive timer. I added this WARN_ON() to check for possible races
between the timer handler and select_task_rq_dl() / enqueue_task_dl()...
I tried hard to trigger such a race, but the WARN_ON() never fired
(it only fired when I had tasks bound to a single CPU - a combination
I tried for testing purposes).

[...]
> > @@ -1074,6 +1161,14 @@ select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
> > }
> > rcu_read_unlock();
> >
> > + rq = task_rq(p);
> > + raw_spin_lock(&rq->lock);
> > + if (hrtimer_active(&p->dl.inactive_timer)) {
> > + sub_running_bw(&p->dl, &rq->dl);
> > + hrtimer_try_to_cancel(&p->dl.inactive_timer);
>
> Can't we subtract twice if it happens that after we grabbed rq_lock the timer
> fired, so it's now waiting for that lock and it goes ahead and sub_running_bw
> again after we release the lock?
Uhm... I somehow convinced myself that this could not happen, but I do
not remember the details, sorry :(
I'll check again in the next few days (BTW, wouldn't this trigger the
WARN_ON you mentioned above?)


>
> > + }
> > + raw_spin_unlock(&rq->lock);
> > +
> > out:
> > return cpu;
> > }
> > @@ -1244,6 +1339,11 @@ static void task_dead_dl(struct task_struct *p)
> > /* XXX we should retain the bw until 0-lag */
> > dl_b->total_bw -= p->dl.dl_bw;
> > raw_spin_unlock_irq(&dl_b->lock);
> > + if (hrtimer_active(&p->dl.inactive_timer)) {
> > + raw_spin_lock_irq(&task_rq(p)->lock);
> > + sub_running_bw(&p->dl, dl_rq_of_se(&p->dl));
>
> Don't we still need to wait for the 0-lag? Or maybe since the task is dying we
> can release its bw instantaneously? In this case I'd add a comment about it.
Uhmm... I think you are right; I'll double-check this.


> > static void set_curr_task_dl(struct rq *rq)
> > @@ -1720,15 +1820,22 @@ void __init init_sched_dl_class(void)
> > static void switched_from_dl(struct rq *rq, struct task_struct *p)
> > {
> > /*
> > - * Start the deadline timer; if we switch back to dl before this we'll
> > - * continue consuming our current CBS slice. If we stay outside of
> > - * SCHED_DEADLINE until the deadline passes, the timer will reset the
> > - * task.
> > + * task_go_inactive() can start the "inactive timer" (if the 0-lag
> > + * time is in the future). If the task switches back to dl before
> > + * the "inactive timer" fires, it can continue to consume its current
> > + * runtime using its current deadline. If it stays outside of
> > + * SCHED_DEADLINE until the 0-lag time passes, inactive_task_timer()
> > + * will reset the task parameters.
> > */
> > - if (!start_dl_timer(p))
> > - __dl_clear_params(p);
> > + if (task_on_rq_queued(p) && p->dl.dl_runtime)
> > + task_go_inactive(p);
> >
> > - if (task_on_rq_queued(p))
> > + /*
> > + * We cannot use inactive_task_timer() to invoke sub_running_bw()
> > + * at the 0-lag time, because the task could have been migrated
> > + * while SCHED_OTHER in the meanwhile.
>
> But, from a theoretical pov, we very much should, right?
Yes, we should.

> Is this taken care of in next patch?
No, I left this small divergence between theory and practice. The problem is
that the inactive timer handler looks at the task's runqueue, and decreases
the active utilisation of that runqueue. If the task is a SCHED_DEADLINE
task, this is not an issue, because the inactive timer is only armed when
the task is not on a runqueue, and the timer is cancelled when the task is
inserted in a runqueue again. But when the task becomes SCHED_NORMAL, it
can be enqueued (and migrated to different runqueues) without the inactive
timer being cancelled... So, when the inactive timer fires we have no
guarantee that the task's runqueue is still the one from which we need to
subtract the utilisation (I have actually seen this bug in action, before
changing the code to its current form).

If this is not acceptable, I'll try to find another solution to this issue.




Thanks,
Luca