Re: [RFC][PATCH] sched: attach extra runtime to the right avg

From: Josef Bacik
Date: Mon Jul 03 2017 - 10:44:42 EST


On Mon, Jul 03, 2017 at 09:26:10AM +0200, Vincent Guittot wrote:
> Hi Josef,
>
> On 30 June 2017 at 03:56, <josef@xxxxxxxxxxxxxx> wrote:
> > From: Josef Bacik <jbacik@xxxxxx>
> >
> > We only track the load avg of a se in 1024 ns chunks, so in order to
> > make up for the loss of the < 1024 ns part of a run/sleep delta we only
> > add the time we processed to the se->avg.last_update_time. The problem
> > is there is no way to know if this extra time was while we were asleep
> > or while we were running. Instead keep track of the remainder and apply
> > it in the appropriate place. If the remainder was while we were
> > running, add it to the delta the next time we update the load avg while
> > running, and the same for sleeping. This (coupled with other fixes)
> > mostly fixes the regression to my workload introduced by Peter's
> > experimental runnable load propagation patches.
>
> IIUC, your workload is sensible to the fact that the min granularity
> of the load tracking is 1us ?
> The contribution seems to be quite small to have a real impact on the load_avg.
> May be rounding last_update_time to the closest value policy instead
> of the bottom value would be enough ? we would have 512ns precision
>
> Have you got details about your use case that needs this sub
> microsecond precision ?
>

Yup, here's the artificial reproducer:

https://github.com/josefbacik/debug-scripts/tree/master/unbalanced-reproducer

The setup is two sets of tasks in two different cgroups that have equal
weight. One group is a cpu hog; it's never taken off of the runqueue because it
never sleeps. The other is a process that does actual work; the reproducer has
an rt-app config file that is a rough analog of the real workload. This one
repeatedly goes to sleep and wakes up. The task that sleeps and wakes up will
end up with about 75% of the cpu time the cpu hog gets. But this patch is
only 1/3 of the solution. I'm on top of peterz's sched/experimental branch +
some fixes for the regression those patches introduce to my workload.

This patch is needed because the 'interactive' tasks will slowly lose load
average, which means that every time they go onto the cpu they contribute less
and less to the load of the cpu and thus screw up the load balancing. With this
fix and all of my other fixes in place I get an even 50-50 split between the two
groups.
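
To make the rounding issue concrete, here is a toy model in Python. It is
not the kernel code: the function names, the run/sleep state labels and the
segment lengths are all made up for illustration, and the numbers are picked
to exaggerate the effect. The first function carries the sub-1024 ns leftover
into whatever update comes next; the second keeps a separate remainder per
state, which is roughly the idea behind the patch.

```python
CHUNK_SHIFT = 10  # load tracking works in 1024 ns units
CHUNK_MASK = (1 << CHUNK_SHIFT) - 1


def account_floor(segments):
    """Leftover ns are credited to whichever state the next update is for."""
    credited = {"run": 0, "sleep": 0}
    last_update = 0
    now = 0
    for state, ns in segments:
        now += ns
        chunks = (now - last_update) >> CHUNK_SHIFT
        credited[state] += chunks
        # Only advance by the time actually processed; the rest carries over.
        last_update += chunks << CHUNK_SHIFT
    return credited


def account_split(segments):
    """Hypothetical fix: keep one remainder per state and apply it there."""
    credited = {"run": 0, "sleep": 0}
    remainder = {"run": 0, "sleep": 0}
    for state, ns in segments:
        total = ns + remainder[state]
        credited[state] += total >> CHUNK_SHIFT
        remainder[state] = total & CHUNK_MASK
    return credited


# A task that runs for 1000 ns, then sleeps for 1048 ns, 1024 times over.
segments = [("run", 1000), ("sleep", 1048)] * 1024
print("floor:", account_floor(segments))
print("split:", account_split(segments))
```

In this contrived case every run segment is shorter than one chunk, so the
first function credits all of it to the following sleep and the task gets no
running time at all, while the second credits 1000 chunks of running and 1048
of sleeping. Real workloads are nowhere near this extreme, but the drift goes
in the same direction.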

Note this is only for two groups with disparate levels of interactivity. If I
put two copies of my sample workload in two different groups everything works
out fine, and the same goes for two cpu hogs in different groups. We only see
this huge difference if one group is on the CPU more, thus losing less load
average over time. Thanks,

Josef