Re: [PATCH v3 1/2] sched: smart wake-affine foundation

From: Michael Wang
Date: Tue Jul 09 2013 - 22:12:57 EST


On 07/10/2013 09:52 AM, Sam Ben wrote:
> On 07/08/2013 10:36 AM, Michael Wang wrote:
>> Hi, Sam
>>
>> On 07/07/2013 09:31 AM, Sam Ben wrote:
>>> On 07/04/2013 12:55 PM, Michael Wang wrote:
>>>> The wake-affine logic always tries to pull the wakee close to the waker.
>>>> In theory, this helps when the waker's cpu has cached data that is hot
>>>> for the wakee, or in the extreme ping-pong case.
>>> What does the ping-pong case mean?
>> PeterZ explained it well here:
>>
>> https://lkml.org/lkml/2013/3/7/332
>>
>> And you could try to compare:
>> taskset 1 perf bench sched pipe
>> with
>> perf bench sched pipe
>
> Why is sched pipe special?

I think the link already explains the reason well. Alternatively, you can
read the code of that pipe benchmark, and you will find that it has a
high chance of matching the ping-pong case :)
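
For anyone who would rather not dig through the perf source, the ping-pong
pattern can be sketched in a few lines of userspace C. This is only an
illustration under my own assumptions, not the code behind 'perf bench
sched pipe': the helper name pingpong_rounds() is made up, and it uses
fork() plus two pipes so that each task sleeps in read() until the other
one writes.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from perf): bounce one byte between a
 * parent and a forked child over two pipes. Each side sleeps in read()
 * until the other side's write() wakes it, so every task has exactly
 * one wakee. Returns the number of completed round trips, -1 on error.
 */
long pingpong_rounds(long rounds)
{
	int to_child[2], to_parent[2];
	char token = 'x';
	long done = 0;
	pid_t pid;

	if (pipe(to_child) < 0)
		return -1;
	if (pipe(to_parent) < 0) {
		close(to_child[0]);
		close(to_child[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(to_child[0]);
		close(to_child[1]);
		close(to_parent[0]);
		close(to_parent[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: echo every byte straight back until EOF. */
		close(to_child[1]);
		close(to_parent[0]);
		while (read(to_child[0], &token, 1) == 1)
			if (write(to_parent[1], &token, 1) != 1)
				break;
		_exit(0);
	}

	/* Parent: send a byte, then block until the echo arrives. */
	close(to_child[0]);
	close(to_parent[1]);
	while (done < rounds) {
		if (write(to_child[1], &token, 1) != 1)
			break;
		if (read(to_parent[0], &token, 1) != 1)
			break;
		done++;
	}

	close(to_child[1]);	/* EOF lets the child leave its loop */
	close(to_parent[0]);
	waitpid(pid, NULL, 0);
	return done;
}
```

Calling this from a small main(), once under taskset and once without,
is one way to repeat the comparison by hand. I have not measured this
sketch, so treat any numbers from it with care.
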

Regards,
Michael Wang

>
>>
>> to confirm it ;-)
>>
>> Regards,
>> Michael Wang
>>
>>>> And testing shows it can benefit hackbench by up to 15%.
>>>>
>>>> However, the whole mechanism is somewhat blind and time-consuming, so
>>>> some workloads suffer.
>>>>
>>>> And testing shows it can hurt pgbench by up to 50%.
>>>>
>>>> Thus, wake-affine should be smarter and realise when to stop its
>>>> thankless effort.
>>>>
>>>> This patch introduces 'nr_wakee_switch', which is increased each time
>>>> the task switches its wakee.
>>>>
>>>> So a high 'nr_wakee_switch' means the task has more than one wakee, and
>>>> the bigger the number, the higher the wakeup frequency.
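
To make the bookkeeping concrete, here is a rough userspace sketch of the
counter. It mirrors the idea of record_wakee() in the patch but is not
the kernel code: struct wake_stats, the integer task ids and the
'now' / 'ticks_per_sec' parameters are stand-ins I made up for
task_struct, task pointers, jiffies and HZ.

```c
#include <stdint.h>

/* Illustrative stand-in for the three fields the patch adds to task_struct. */
struct wake_stats {
	int last_wakee;			/* id of the last task woken, -1 if none */
	unsigned long nr_wakee_switch;	/* how often the wakee changed */
	uint64_t last_switch_decay;	/* tick of the last counter reset */
};

/*
 * Called on behalf of the waker each time it wakes 'wakee'.
 * 'now' and 'ticks_per_sec' play the role of jiffies and HZ.
 */
void record_wakee_sketch(struct wake_stats *waker, int wakee,
			 uint64_t now, uint64_t ticks_per_sec)
{
	/* Rough decay: wipe the counter once more than a second has passed. */
	if (now > waker->last_switch_decay + ticks_per_sec) {
		waker->nr_wakee_switch = 0;
		waker->last_switch_decay = now;
	}

	/* Only a change of wakee counts as a switch. */
	if (waker->last_wakee != wakee) {
		waker->last_wakee = wakee;
		waker->nr_wakee_switch++;
	}
}
```

With this sketch, a waker that wakes the same task ten times in a row
ends up with a count of 1, while a waker that alternates between two
tasks for ten wakeups inside one second ends up with a count of 10.
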
>>>>
>>>> Now when deciding whether to pull or not, pay attention to a wakee
>>>> with a high 'nr_wakee_switch'. Pulling such a task may benefit the
>>>> wakee, but it also implies that the waker will face cruel competition
>>>> later. It could be very cruel or very fast, depending on the story
>>>> behind 'nr_wakee_switch'; either way, the waker suffers.
>>>>
>>>> Furthermore, if the waker also has a high 'nr_wakee_switch', this
>>>> implies that multiple tasks rely on it. The waker's higher latency will
>>>> then damage all of them, so pulling the wakee seems to be a bad deal.
>>>>
>>>> Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes
>>>> higher and higher, the deal seems to be worse and worse.
>>>>
>>>> The patch therefore helps wake-affine stop its work when:
>>>>
>>>> 	wakee->nr_wakee_switch > factor &&
>>>> 	waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>>>>
>>>> The factor here is the node size of the current cpu, so a bigger node
>>>> leads to more pulls since the test becomes more severe.
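
The condition can be tried out in isolation with a small userspace
function. This is a sketch, not the wake_wide() from the patch: the name
wake_wide_sketch() is made up, the two counters are passed in as plain
numbers, and 'factor' is an ordinary parameter instead of being read
from nr_cpus_node().

```c
/*
 * Returns 1 when the affine pull should be skipped, 0 otherwise.
 * 'factor' stands in for the number of cpus in the current node.
 */
int wake_wide_sketch(unsigned long waker_switches,
		     unsigned long wakee_switches,
		     unsigned long factor)
{
	if (wakee_switches > factor &&
	    waker_switches > factor * wakee_switches)
		return 1;

	return 0;
}
```

Taking a hypothetical factor of 12: a wakee with 13 switches and a waker
with 200 switches skips the pull (200 > 12 * 13 = 156), a waker with
only 100 switches still pulls, and a wakee with 5 switches is always
pulled no matter how busy the waker is.
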
>>>>
>>>> After applying the patch, pgbench shows up to 40% improvement.
>>>>
>>>> Test:
>>>> Tested on a 12-cpu x86 server with tip 3.10.0-rc7.
>>>>
>>>> pgbench                  base          smart
>>>>
>>>> | db_size | clients |  tps  |   |  tps  |
>>>> +---------+---------+-------+   +-------+
>>>> | 22 MB   |       1 | 10598 |   | 10796 |
>>>> | 22 MB   |       2 | 21257 |   | 21336 |
>>>> | 22 MB   |       4 | 41386 |   | 41622 |
>>>> | 22 MB   |       8 | 51253 |   | 57932 |
>>>> | 22 MB   |      12 | 48570 |   | 54000 |
>>>> | 22 MB   |      16 | 46748 |   | 55982 | +19.75%
>>>> | 22 MB   |      24 | 44346 |   | 55847 | +25.93%
>>>> | 22 MB   |      32 | 43460 |   | 54614 | +25.66%
>>>> | 7484 MB |       1 |  8951 |   |  9193 |
>>>> | 7484 MB |       2 | 19233 |   | 19240 |
>>>> | 7484 MB |       4 | 37239 |   | 37302 |
>>>> | 7484 MB |       8 | 46087 |   | 50018 |
>>>> | 7484 MB |      12 | 42054 |   | 48763 |
>>>> | 7484 MB |      16 | 40765 |   | 51633 | +26.66%
>>>> | 7484 MB |      24 | 37651 |   | 52377 | +39.11%
>>>> | 7484 MB |      32 | 37056 |   | 51108 | +37.92%
>>>> | 15 GB   |       1 |  8845 |   |  9104 |
>>>> | 15 GB   |       2 | 19094 |   | 19162 |
>>>> | 15 GB   |       4 | 36979 |   | 36983 |
>>>> | 15 GB   |       8 | 46087 |   | 49977 |
>>>> | 15 GB   |      12 | 41901 |   | 48591 |
>>>> | 15 GB   |      16 | 40147 |   | 50651 | +26.16%
>>>> | 15 GB   |      24 | 37250 |   | 52365 | +40.58%
>>>> | 15 GB   |      32 | 36470 |   | 50015 | +37.14%
>>>>
>>>> CC: Ingo Molnar <mingo@xxxxxxxxxx>
>>>> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>>>> CC: Mike Galbraith <efault@xxxxxx>
>>>> Signed-off-by: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>
>>>> ---
>>>>  include/linux/sched.h |    3 +++
>>>>  kernel/sched/fair.c   |   47 +++++++++++++++++++++++++++++++++++++++++++++++
>>>>  2 files changed, 50 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>>> index 178a8d9..1c996c7 100644
>>>> --- a/include/linux/sched.h
>>>> +++ b/include/linux/sched.h
>>>> @@ -1041,6 +1041,9 @@ struct task_struct {
>>>>  #ifdef CONFIG_SMP
>>>>  	struct llist_node wake_entry;
>>>>  	int on_cpu;
>>>> +	struct task_struct *last_wakee;
>>>> +	unsigned long nr_wakee_switch;
>>>> +	unsigned long last_switch_decay;
>>>>  #endif
>>>>  	int on_rq;
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index c61a614..a4ddbf5 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -2971,6 +2971,23 @@ static unsigned long cpu_avg_load_per_task(int cpu)
>>>>  	return 0;
>>>>  }
>>>> +static void record_wakee(struct task_struct *p)
>>>> +{
>>>> +	/*
>>>> +	 * Rough decay(wiping) for cost saving, don't worry
>>>> +	 * about the boundary, really active task won't care
>>>> +	 * the loose.
>>>> +	 */
>>>> +	if (jiffies > current->last_switch_decay + HZ) {
>>>> +		current->nr_wakee_switch = 0;
>>>> +		current->last_switch_decay = jiffies;
>>>> +	}
>>>> +
>>>> +	if (current->last_wakee != p) {
>>>> +		current->last_wakee = p;
>>>> +		current->nr_wakee_switch++;
>>>> +	}
>>>> +}
>>>>  static void task_waking_fair(struct task_struct *p)
>>>>  {
>>>> @@ -2991,6 +3008,7 @@ static void task_waking_fair(struct task_struct *p)
>>>>  #endif
>>>>  	se->vruntime -= min_vruntime;
>>>> +	record_wakee(p);
>>>>  }
>>>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>>> @@ -3109,6 +3127,28 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>>>>  #endif
>>>> +static int wake_wide(struct task_struct *p)
>>>> +{
>>>> +	int factor = nr_cpus_node(cpu_to_node(smp_processor_id()));
>>>> +
>>>> +	/*
>>>> +	 * Yeah, it's the switching-frequency, could means many wakee or
>>>> +	 * rapidly switch, use factor here will just help to automatically
>>>> +	 * adjust the loose-degree, so bigger node will lead to more pull.
>>>> +	 */
>>>> +	if (p->nr_wakee_switch > factor) {
>>>> +		/*
>>>> +		 * wakee is somewhat hot, it needs certain amount of cpu
>>>> +		 * resource, so if waker is far more hot, prefer to leave
>>>> +		 * it alone.
>>>> +		 */
>>>> +		if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
>>>> +			return 1;
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>>  static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>>>  {
>>>>  	s64 this_load, load;
>>>> @@ -3118,6 +3158,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>>>  	unsigned long weight;
>>>>  	int balanced;
>>>> +	/*
>>>> +	 * If we wake multiple tasks be careful to not bounce
>>>> +	 * ourselves around too much.
>>>> +	 */
>>>> +	if (wake_wide(p))
>>>> +		return 0;
>>>> +
>>>>  	idx = sd->wake_idx;
>>>>  	this_cpu = smp_processor_id();
>>>>  	prev_cpu = task_cpu(p);
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-kernel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>>