Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

From: Vincent Guittot
Date: Tue Oct 18 2016 - 05:46:22 EST


On 18 October 2016 at 11:07, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Mon, Oct 17, 2016 at 11:52:39PM +0100, Dietmar Eggemann wrote:
>>
>> Something looks weird related to the use of for_each_possible_cpu(i) in
>> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
>>
>> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)
>> I get:
>>
>> [ 5.462368] cpu=0 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>
> OK, you have a buggy BIOS :-) It enumerates too many CPU slots. There is
> no reason to have 4 empty CPU slots on a machine that cannot do physical
> hotplug.
>
> This also explains why it doesn't show on many machines; most machines
> will not have this, and possible_mask == present_mask == online_mask in
> almost all cases.
>
> x86 folk, can we detect the lack of physical hotplug capability, FW_WARN
> on it, and lower possible_mask to present_mask?
>
>> [ 5.462370] cpu=1 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462370] cpu=2 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462371] cpu=3 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>>
>> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
>> 80
>>
>> /proc/sched_debug:
>>
>> cfs_rq[0]:/system.slice
>> ...
>> .tg_load_avg : 323584
>> ...
>>
>> 80 * 1024 * 4 (non-existent cpu4-cpu7) = 327680 (with a little bit of
>> decay, this could account for the extra load on the system.slice tg)
>>
>> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
>> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
>> of system.slice tg is 0 after startup.
>
> Right, so the reason for using present_mask is that it avoids having to
> deal with hotplug; also all the per-cpu memory is allocated and present
> for !online CPUs anyway, so we might as well set it up properly.
>
> (You might want to start booting your laptop with "possible_cpus=4" to
> save some memory FWIW.)
>
> But yes, we have a bug here too... /me ponders
>
> So aside from funny BIOSes, this should also show up when creating
> cgroups when you have offlined a few CPUs, which is far more common I'd
> think.

The problem is also that the load of the tg->se[cpu] that represents
the tg->cfs_rq[cpu] is initialized to 1024 in:
alloc_fair_sched_group()
    for_each_possible_cpu(i) {
        ...
        init_entity_runnable_average(se)
            sa->load_avg = scale_load_down(se->load.weight);

Initializing sa->load_avg to 1024 for a newly created task makes
sense, as we don't yet know what its real load will be, but I'm not
sure we have to do the same for a se that represents a task group.
That load should be initialized to 0 and will increase as tasks are
moved/attached into the task group.

>
> On IRC you mentioned that adding list_add_leaf_cfs_rq() to
> online_fair_sched_group() cures this, this would actually match with
> unregister_fair_sched_group() doing list_del_leaf_cfs_rq() and avoid
> a few instructions on the enqueue path, so that's all good.
>
> I'm just not immediately seeing how that cures things. The only relevant
> user of the leaf_cfs_rq list seems to be update_blocked_averages() which
> is called from the balance code (idle_balance() and
> rebalance_domains()). But neither should call that for offline (or
> !present) CPUs.
>
>
> Humm..