Re: [PATCH 06/11] sched/irq: add irq utilization tracking

From: Wanpeng Li
Date: Mon Jul 30 2018 - 23:32:50 EST


On Tue, 31 Jul 2018 at 00:43, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> Hi Wanpeng,
>
> On Thu, 26 Jul 2018 at 05:09, Wanpeng Li <kernellwp@xxxxxxxxx> wrote:
> >
> > Hi Vincent,
> > On Fri, 29 Jun 2018 at 03:07, Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx> wrote:
> > >
> > > Interrupt and steal time are the only remaining activities tracked by
> > > rt_avg. Like for sched classes, we can use PELT to track their average
> > > utilization of the CPU. But unlike sched classes, we don't track when
> > > entering/leaving interrupt; instead, we take into account the time spent
> > > under interrupt context when we update the rqs' clock (rq_clock_task).
> > > This also means that we have to decay the normal context time and account
> > > for interrupt time during the update.
> > >
> > > It's also important to note that because
> > > rq_clock == rq_clock_task + interrupt time
> > > and rq_clock_task is used by a sched class to compute its utilization, the
> > > util_avg of a sched class only reflects the utilization of the time spent
> > > in normal context and not of the whole time of the CPU. The utilization of
> > > interrupt gives a more accurate level of utilization of the CPU.
> > > The CPU utilization is :
> > > avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
> > >
> > > Most of the time, avg_irq is small and negligible, so the
> > > approximation CPU utilization = /Sum avg_rq was good enough.
> > >
> > > Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > > ---
> > > kernel/sched/core.c | 4 +++-
> > > kernel/sched/fair.c | 13 ++++++++++---
> > > kernel/sched/pelt.c | 40 ++++++++++++++++++++++++++++++++++++++++
> > > kernel/sched/pelt.h | 16 ++++++++++++++++
> > > kernel/sched/sched.h | 3 +++
> > > 5 files changed, 72 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 78d8fac..e5263a4 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -18,6 +18,8 @@
> > > #include "../workqueue_internal.h"
> > > #include "../smpboot.h"
> > >
> > > +#include "pelt.h"
> > > +
> > > #define CREATE_TRACE_POINTS
> > > #include <trace/events/sched.h>
> > >
> > > @@ -186,7 +188,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
> > >
> > > #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
> > > if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
> > > - sched_rt_avg_update(rq, irq_delta + steal);
> > > + update_irq_load_avg(rq, irq_delta + steal);
> >
> > I think we should not add steal time into irq load tracking: steal
> > time is always 0 on a native kernel, where it doesn't matter, but
> > what happens when a guest disables IRQ_TIME_ACCOUNTING and enables
> > PARAVIRT_TIME_ACCOUNTING? Steal time is not real irq util_avg. In
> > addition, we haven't exposed power management for performance, which
> > means that e.g. the schedutil governor cannot cooperate with the
> > passive-mode intel_pstate driver to tune the OPP. Decaying the old
> > steal time avg and adding the new one just wastes CPU cycles.
>
> In fact, I have kept the same behavior as with rt_avg, which already
> added steal time when computing scale_rt_capacity, which is used to
> reflect the remaining capacity for FAIR tasks and is used in load
> balance. I'm not sure that it's worth using different variables for
> irq and steal.
> That being said, I see a possible optimization in schedutil when
> PARAVIRT_TIME_ACCOUNTING is enabled and IRQ_TIME_ACCOUNTING is
> disabled. With this kind of config, scale_irq_capacity can be a nop
> for schedutil but still scale the utilization for scale_rt_capacity.

Yeah, this is what was in my mind before; you can make a patch for that. :)

Regards,
Wanpeng Li