Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue

From: Vincent Guittot
Date: Tue Apr 12 2016 - 07:56:54 EST

Next message: Sudip Mukherjee: "Re: [PATCH v2] mwifiex: fix possible NULL dereference"
Previous message: Vlastimil Babka: "Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers"
In reply to: Yuyang Du: "Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue"
Next in thread: Yuyang Du: "Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Le Tuesday 12 Apr 2016 à 03:41:41 (+0800), Yuyang Du a écrit :
> Hi Vincent,
>
> On Mon, Apr 11, 2016 at 11:08:04AM +0200, Vincent Guittot wrote:
> > > @@ -2704,11 +2694,14 @@ static __always_inline int
> > > __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > > {
> > > - u64 delta, scaled_delta, periods;
> > > - u32 contrib;
> > > - unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > + u64 delta;
> > > + u32 contrib, periods;
> > > unsigned long scale_freq, scale_cpu;
> > >
> > > + /*
> > > + * now rolls down to a period boundary
> > > + */
> > > + now = now && (u64)(~0xFFFFF);
> > > delta = now - sa->last_update_time;
> > > /*
> > > * This should only happen when time goes backwards, which it
> > > @@ -2720,89 +2713,56 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > > }
> > >
> > > /*
> > > - * Use 1024ns as the unit of measurement since it's a reasonable
> > > - * approximation of 1us and fast to compute.
> > > + * Use 1024*1024ns as an approximation of 1ms period, pretty close.
> > > */
> > > - delta >>= 10;
> > > - if (!delta)
> > > + periods = delta >> 20;
> > > + if (!periods)
> > > return 0;
> > > sa->last_update_time = now;
> >
> > The optimization looks quite interesting but I see one potential issue
> > with migration as we will lose the part of the ongoing period that is
> > now not saved anymore. This lost part can be quite significant for a
> > short task that ping pongs between CPUs.
>
> Yes, basically, it is we lose precision (~1ms scale in contrast with ~1us scale).

But with a HZ set to 1000 and a sched-slice in the same range, having a precision
of 1ms instead of 1us makes the precision of load tracking of short task quite
random IMHO.

you can keep recording this partial period without using it in the load tracking.
Something like below keep precision without sacrifying the optimization.

---
kernel/sched/fair.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68273e8..b234169 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -674,6 +674,12 @@ void init_entity_runnable_average(struct sched_entity *se)
struct sched_avg *sa = &se->avg;

sa->last_update_time = 0;
+ /*
+ * sched_avg's period_contrib should be strictly less then 1024 * 1024, so
+ * we give it 1023 * 1024 to make sure it is almost a period (1024us), and
+ * will definitely be updated (after enqueue).
+ */
+ sa->period_contrib = 1023*1024;
sa->load_avg = scale_load_down(se->load.weight);
sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
/*
@@ -2698,10 +2704,6 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
u32 contrib, periods;
unsigned long scale_freq, scale_cpu;

- /*
- * now rolls down to a period boundary
- */
- now = now && (u64)(~0xFFFFF);
delta = now - sa->last_update_time;
/*
* This should only happen when time goes backwards, which it
@@ -2712,6 +2714,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
return 0;
}

+ /* Add how much left for the current period */
+ delta += sa->period_contrib;
+
/*
* Use 1024*1024ns as an approximation of 1ms period, pretty close.
*/
@@ -2720,6 +2725,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
return 0;
sa->last_update_time = now;

+ /* Get how much left for the next period */
+ sa->period_contrib = delta & (u64)(0xFFFFF);
+
scale_freq = arch_scale_freq_capacity(NULL, cpu);
scale_cpu = arch_scale_cpu_capacity(NULL, cpu);

> But as I wrote, we may either lose a sub-1ms, or gain a sub-1ms, statistically,
> they should even out, given the load/util updates are quite a large number of
> samples, and we do want a lot of samples for the metrics, this is the point of
> the entire average thing. Plus, as you also said, the incomplete current period
> also plays a (somewhat) negative role here.
>
> In addition, excluding the flat hierarchical util patch, we gain quite some
> efficiency.

Next message: Sudip Mukherjee: "Re: [PATCH v2] mwifiex: fix possible NULL dereference"
Previous message: Vlastimil Babka: "Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers"
In reply to: Yuyang Du: "Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue"
Next in thread: Yuyang Du: "Re: [PATCH 2/4] sched/fair: Drop out incomplete current period when sched averages accrue"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]