Re: [PATCH] sched/fair: Update rq_clock, cfs_rq before migrating for asym cpu capacity

From: Chris Redpath
Date: Tue Jul 09 2019 - 11:23:49 EST


Hi Peter,

On 09/07/2019 14:50, Peter Zijlstra wrote:
> On Tue, Jul 09, 2019 at 12:57:59PM +0100, Chris Redpath wrote:
>> The ancient workaround to avoid the cost of updating rq clocks in the
>> middle of a migration causes some issues on asymmetric CPU capacity
>> systems where we use task utilization to determine which cpus fit a task.
>> On quiet systems we can inflate task util after a migration which
>> causes misfit to fire and force-migrate the task.
>>
>> This occurs when:
>>
>> (a) a task has util close to the non-overutilized capacity limit of a
>> particular cpu (cpu0 here); and
>> (b) the prev_cpu was quiet otherwise, such that rq clock is
>> sufficiently out of date (cpu1 here).
>>
>> e.g.
>> _____
>> cpu0: ________________________| |______________
>>
>> |<- misfit happens
>> ______ ___ ___
>> cpu1: ____| |______________|___| |_________|
>>
>> ->| |<- wakeup migration time
>> last rq clock update
>>
>> When the task util is in just the right range for the system, we can end
>> up migrating an unlucky task back and forth many times until we are lucky
>> and the source rq happens to be updated close to the migration time.
>>
>> In order to address this, lets update both rq_clock and cfs_rq where
>> this could be an issue.
>
> Can you quantify how much of a problem this really is? It is really sad,
> but this is already the second place where we take rq->lock on
> migration. We worked so hard to avoid having to acquire it :/
>

I think you're familiar with the way we test the EAS and misfit stuff,
but some might not be, so I'll just outline them.

We have performance and placement tests for a suite of simple synthetic
scenarios selected to trigger the EAS & misfit mechanisms. The
performance tests use rt-app's slack metric, and we try to minimise
negative slack (i.e. missed deadlines).

In the placement tests we estimate the minimum energy consumed to run a
particular synthetic test job and we calculate the energy consumed in
the actual execution according to a trace. We pass the test if our
estimate of actual is less than ideal+20%.

We enter this code quite often in our testing, most individual runs of a
test which has small tasks involved have at least one hit where we make
a change to the clock with this patch in.

That said - despite the relatively high number of hits only about 5% of
runs see enough additional energy consumed to trigger a test failure. We
do try to keep a quiet system as much as possible and only run for a few
seconds so the impact we see in testing is also probably higher than in
the real world.

I totally appreciate the reluctance to add this - I don't much like it
either, but I was hoping that sticking it behind the asym_cpucapacity
key might be a reasonable compromise.

At least only those people who select a CPU using task util and capacity
see this cost, and we have smaller systems so in theory the cost is smaller.

I'm very open to exploring alternatives :)

>> Signed-off-by: Chris Redpath <chris.redpath@xxxxxxx>
>> ---
>> kernel/sched/fair.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b798fe7ff7cd..51791db26a2a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6545,6 +6545,21 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>> * wakee task is less decayed, but giving the wakee more load
>> * sounds not bad.
>> */
>> + if (static_branch_unlikely(&sched_asym_cpucapacity) &&
>> + p->state == TASK_WAKING) {
>
> nit: indent fail.
>

oops, will tweak it

--Chris
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.