Re: [PATCH -v2 15/18] sched/fair: Align PELT windows between cfs_rq and its se
From: Dietmar Eggemann
Date: Mon Oct 09 2017 - 08:15:13 EST
On 06/10/17 14:02, Peter Zijlstra wrote:
> On Wed, Oct 04, 2017 at 08:27:01PM +0100, Dietmar Eggemann wrote:
>> On 01/09/17 14:21, Peter Zijlstra wrote:
>>> The PELT _sum values are a saw-tooth function, dropping on the decay
>>> edge and then growing back up again during the window.
>>>
>>> When these window-edges are not aligned between cfs_rq and se, we can
>>> have the situation where, for example, on dequeue, the se decays
>>> first.
>>>
>>> Its _sum values will be small(er), while the cfs_rq _sum values will
>>> still be on their way up. Because of this, the subtraction:
>>> cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
>>> will then, once the cfs_rq reaches an edge, translate into its _avg
>>> value jumping up.
>>>
>>> This is especially visible with the runnable_load bits, since they get
>>> added/subtracted a lot.
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>>> ---
>>> kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++--------------
>>> 1 file changed, 31 insertions(+), 14 deletions(-)
>>
>> [...]
>>
>>> @@ -3644,7 +3634,34 @@ update_cfs_rq_load_avg(u64 now, struct c
>>> */
>>> static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> {
>>> + u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
>>> +
>>> + /*
>>> + * When we attach the @se to the @cfs_rq, we must align the decay
>>> + * window because without that, really weird and wonderful things can
>>> + * happen.
>>> + *
>>> + * XXX illustrate
>>> + */
>>> se->avg.last_update_time = cfs_rq->avg.last_update_time;
>>> + se->avg.period_contrib = cfs_rq->avg.period_contrib;
>>> +
>>> + /*
>>> + * Hell(o) Nasty stuff.. we need to recompute _sum based on the new
>>> + * period_contrib. This isn't strictly correct, but since we're
>>> + * entirely outside of the PELT hierarchy, nobody cares if we truncate
>>> + * _sum a little.
>>> + */
>>> + se->avg.util_sum = se->avg.util_avg * divider;
>>> +
>>> + se->avg.load_sum = divider;
>>> + if (se_weight(se)) {
>>> + se->avg.load_sum =
>>> + div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
>>> + }
>>
>> Can scale_load_down(se->load.weight) ever become 0 here?
>
> Yeah, don't see why not.
Tasks should be safe since for them scale_load_down(se->load.weight)
should be >= 15 when they get attached.
task groups get attached via attach_entity_cfs_rq()
# mkdir /sys/fs/cgroup/foo
[ 63.885333] [<ffff00000811a558>] attach_entity_load_avg+0x138/0x558
[ 63.885345] [<ffff0000081208e0>] attach_entity_cfs_rq+0x298/0x998
[ 63.885357] [<ffff00000812d660>] online_fair_sched_group+0x70/0xb0
[ 63.885369] [<ffff0000081182d4>] sched_online_group+0x94/0xb0
[ 63.885381] [<ffff000008118318>] cpu_cgroup_css_online+0x28/0x38
[ 63.885393] [<ffff0000081a5e80>] online_css+0x38/0xd0
[ 63.885406] [<ffff0000081acae0>] cgroup_apply_control_enable+0x260
[ 63.885418] [<ffff0000081afc54>] cgroup_mkdir+0x314/0x4e8
mkdir-2501 [004] 63.689455: bprint:
attach_entity_load_avg: cpu=1 se=0xffff800072fffc00 load.weight=1048576
If I apply the smallest shares possible to that task group, it is
already attched.
# echo 2 > /sys/fs/cgroup/foo/cpu.shares
create_cpu_cgro-2500 [003] 64.591878: bprint:
update_cfs_group: cpu=1 se=0xffff800072fffc00 tg->shares=2048 shares=0
load=0 tg_weight=0
create_cpu_cgro-2500 [003] 64.591904: bprint:
reweight_entity: cpu=1 se=0xffff800072fffc00 se->load.weight=2
I can't see right now how a task group can get attached with
se->load.weight < 1024 (32 bit)/1048576 (64 bit)? Do I miss something?