Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

From: Dietmar Eggemann
Date: Tue Oct 18 2016 - 07:15:25 EST


On 18/10/16 10:07, Peter Zijlstra wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:

[...]

>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
>> of system.slice tg is 0 after startup.
>
> Right, so the reason for using present_mask is that it avoids having to
> deal with hotplug, also all the per-cpu memory is allocated and present
> for !online CPUs anyway, so might as well set it up properly anyway.
>
> (You might want to start booting your laptop with "possible_cpus=4" to
> save some memory FWIW.)

The question for me is whether this could also be the reason on the X1
Carbon platform.

The initial pastebin from Joseph (http://paste.ubuntu.com/23312351)
showed .tg_load_avg : 381697 on a machine with 4 logical cpus. With
somewhat more than 80 services this might well be the problem.
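
For reference, the experiment was essentially this change in
online_fair_sched_group() (rough sketch from memory, the surrounding
function body may not match v4.8-rc1 exactly):

void online_fair_sched_group(struct task_group *tg)
{
        struct sched_entity *se;
        struct rq *rq;
        int i;

        /* was: for_each_possible_cpu(i) */
        for_each_online_cpu(i) {
                rq = cpu_rq(i);
                se = tg->se[i];

                raw_spin_lock_irq(&rq->lock);
                post_init_entity_util_avg(se);
                sync_throttle(tg, i);
                raw_spin_unlock_irq(&rq->lock);
        }
}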

>
> But yes, we have a bug here too... /me ponders
>
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.

Yes.

> On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> online_fair_sched_group() cures this, this would actually match with
> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> a few instructions on the enqueue path, so that's all good.

Yes, I was able to recreate a similar problem (not related to the cpu
masks) on ARM64 (6 logical cpus). I created 100 second-level tg's but
put only one task (no cpu affinity, so it could run on multiple cpus)
into one of these tg's (mainly to see the related cfs_rq's in
/proc/sched_debug).

I get a remaining .tg_load_avg : 49898 for cfs_rq[x]:/tg_1.

> I'm just not immediately seeing how that cures things. The only relevant
> user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> is called from the balance code (idle_balance() and
> rebalance_domains()). But neither should call that for offline (or
> !present) CPUs.

Assuming this is load from the 99 second-level tg's which never had a
task running, putting list_add_leaf_cfs_rq() into
online_fair_sched_group() for all cpus makes sure that all of this
'blocked load' gets decayed.
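
I.e. something like this in the per-cpu loop of
online_fair_sched_group() (again only a sketch, on top of the loop
shown above but over all cpus as in mainline):

        for_each_possible_cpu(i) {
                rq = cpu_rq(i);
                se = tg->se[i];

                raw_spin_lock_irq(&rq->lock);
                post_init_entity_util_avg(se);
                /*
                 * New: put the group cfs_rq on the leaf list so that
                 * update_blocked_averages() decays its blocked load,
                 * matching the list_del_leaf_cfs_rq() in
                 * unregister_fair_sched_group().
                 */
                list_add_leaf_cfs_rq(tg->cfs_rq[i]);
                sync_throttle(tg, i);
                raw_spin_unlock_irq(&rq->lock);
        }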

Doing what Vincent just suggested, i.e. initializing the tg se's with a
load of 0 instead of 1024, would make this unnecessary.
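
That would roughly mean something like the following in
init_entity_runnable_average() (sketch of the idea only, not an actual
patch):

        struct sched_avg *sa = &se->avg;

        /*
         * Sketch: only tasks start with their full weight as initial
         * load; group entities start with 0 since nothing has been
         * attached to the task group yet.
         */
        if (entity_is_task(se))
                sa->load_avg = scale_load_down(se->load.weight);
        sa->load_sum = sa->load_avg * LOAD_AVG_MAX;

That way group se's never contribute an initial 1024 of load to the
parent tg in the first place, so there is nothing left over to decay.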

[...]