Re: [RFC PATCH] sched: smart wake-affine

From: Michael Wang
Date: Wed Jun 12 2013 - 23:09:33 EST


Hi, Peter

Would you like to comment on this approach, or maybe just give some hint
about what the concern is? The damage to pgbench is still there...

Regards,
Michael Wang

On 05/28/2013 01:05 PM, Michael Wang wrote:
> The wake-affine logic always tries to pull the wakee close to the waker. In
> theory, this is beneficial if the waker's cpu has cached hot data for the
> wakee, or in the extreme ping-pong case.
>
> And testing shows it can benefit hackbench by up to 15%.
>
> However, the whole thing is somewhat blind and time-consuming, so some
> workloads suffer.
>
> And testing shows it can damage pgbench by up to 50%.
>
> Thus, the wake-affine logic should be smarter, and realise when to stop
> its thankless effort.
>
> This patch introduces a per-task 'nr_wakee_switch', which is incremented
> each time the task switches its wakee.
>
> So a high 'nr_wakee_switch' means the task has more than one wakee, and
> the fewer the wakees, the higher the wakeup frequency.
>
> Now when deciding whether to pull or not, pay attention to a wakee with
> a high 'nr_wakee_switch': pulling such a task may benefit the wakee, but it
> implies the waker will face cruel competition later. That competition may
> be very cruel or over very quickly, depending on the story behind
> 'nr_wakee_switch', but either way the waker suffers.
>
> Furthermore, if the waker also has a high 'nr_wakee_switch', that implies
> multiple tasks rely on it, so the waker's higher latency will damage all of
> those tasks; pulling the wakee in such cases seems to be a bad deal.
>
> Thus, as 'waker->nr_wakee_switch / wakee->nr_wakee_switch' becomes higher
> and higher, the deal seems to be worse and worse.
>
> This patch therefore helps the wake-affine logic stop its work when:
>
> wakee->nr_wakee_switch > factor &&
> waker->nr_wakee_switch > (factor * wakee->nr_wakee_switch)
>
> The factor here is the number of online cpus, so more cpus lead to more
> pulls, since the condition for stopping becomes stricter.
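
To make the condition above concrete, here is a minimal userspace sketch of
it. The function name should_skip_pull() and its plain integer arguments are
made up for this illustration; they only stand in for the per-task
'nr_wakee_switch' counters and the online cpu count, and this is not the
kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the stop condition quoted above.
 * waker_switches / wakee_switches stand in for the per-task
 * 'nr_wakee_switch' counters, 'factor' for the number of online cpus.
 * The function name is hypothetical.
 */
static bool should_skip_pull(unsigned long waker_switches,
			     unsigned long wakee_switches,
			     unsigned long factor)
{
	assert(factor > 0);

	/* Wakee does not switch wakees often enough: keep pulling. */
	if (wakee_switches <= factor)
		return false;

	/* Skip only when the waker is more than 'factor' times busier. */
	return waker_switches > factor * wakee_switches;
}
```

With factor = 12, a wakee count of 13 and a waker count of 157 the pull is
skipped (157 > 12 * 13 = 156), while a waker count of 156 still pulls. With
factor = 24 and the same counters the pull happens, because 13 does not
exceed 24; that is the "more cpus lead to more pulls" effect.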
>
> With the patch applied, pgbench shows up to 42% improvement.
>
> Test:
> Tested on a 12-cpu x86 server with tip 3.10.0-rc1.
>
>                       base       smart
>
> | db_size | clients |  tps  |   |  tps  |
> +---------+---------+-------+   +-------+
> | 21 MB   |       1 | 10749 |   | 10337 |
> | 21 MB   |       2 | 21382 |   | 21391 |
> | 21 MB   |       4 | 41570 |   | 41808 |
> | 21 MB   |       8 | 52828 |   | 58792 |
> | 21 MB   |      12 | 48447 |   | 54553 |
> | 21 MB   |      16 | 46246 |   | 56726 | +22.66%
> | 21 MB   |      24 | 43850 |   | 56853 | +29.65%
> | 21 MB   |      32 | 43455 |   | 55846 | +28.51%
> | 7483 MB |       1 |  9290 |   |  8848 |
> | 7483 MB |       2 | 19347 |   | 19351 |
> | 7483 MB |       4 | 37135 |   | 37511 |
> | 7483 MB |       8 | 47310 |   | 50210 |
> | 7483 MB |      12 | 42721 |   | 49396 |
> | 7483 MB |      16 | 41016 |   | 51826 | +26.36%
> | 7483 MB |      24 | 37540 |   | 52579 | +40.06%
> | 7483 MB |      32 | 36756 |   | 51332 | +39.66%
> | 15 GB   |       1 |  8758 |   |  8670 |
> | 15 GB   |       2 | 19204 |   | 19249 |
> | 15 GB   |       4 | 36997 |   | 37199 |
> | 15 GB   |       8 | 46578 |   | 50681 |
> | 15 GB   |      12 | 42141 |   | 48671 |
> | 15 GB   |      16 | 40518 |   | 51280 | +26.56%
> | 15 GB   |      24 | 36788 |   | 52329 | +42.24%
> | 15 GB   |      32 | 36056 |   | 50350 | +39.64%
>
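
For anyone re-checking the percentages in the last column, they look like the
plain relative change of the 'smart' tps over the 'base' tps. A small sketch
of that arithmetic follows; the helper name improvement_percent() is only
for illustration.

```c
/*
 * Relative tps change of the patched ("smart") kernel over the base
 * kernel, in percent. Positive means the patched kernel is faster.
 * The helper name is hypothetical.
 */
static double improvement_percent(double base_tps, double smart_tps)
{
	return (smart_tps - base_tps) / base_tps * 100.0;
}
```

For the 15 GB / 24 clients row, improvement_percent(36788, 52329) gives about
42.24, which matches the table. The same helper applied to the 21 MB /
1 client row (10749 vs 10337) gives roughly -3.8, so the single-client case
is slightly slower in this run.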
>
>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> CC: Mike Galbraith <efault@xxxxxx>
> Signed-off-by: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/fair.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 48 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 178a8d9..1c996c7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1041,6 +1041,9 @@ struct task_struct {
>  #ifdef CONFIG_SMP
>  	struct llist_node wake_entry;
>  	int on_cpu;
> +	struct task_struct *last_wakee;
> +	unsigned long nr_wakee_switch;
> +	unsigned long last_switch_decay;
>  #endif
>  	int on_rq;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f62b16d..eaaceb7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3127,6 +3127,45 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
>
>  #endif
>
> +static void record_wakee(struct task_struct *p)
> +{
> +	/*
> +	 * Rough decay, don't worry about the boundary, really active
> +	 * tasks won't care about the loss.
> +	 */
> +	if (jiffies > current->last_switch_decay + HZ) {
> +		current->nr_wakee_switch = 0;
> +		current->last_switch_decay = jiffies;
> +	}
> +
> +	if (current->last_wakee != p) {
> +		current->last_wakee = p;
> +		current->nr_wakee_switch++;
> +	}
> +}
> +
> +static int nasty_pull(struct task_struct *p)
> +{
> +	int factor = cpumask_weight(cpu_online_mask);
> +
> +	/*
> +	 * Yeah, it's the switching frequency, which could mean many wakees
> +	 * or rapid switching; using factor here just helps to automatically
> +	 * adjust the looseness, so more cpus will lead to more pulls.
> +	 */
> +	if (p->nr_wakee_switch > factor) {
> +		/*
> +		 * wakee is somewhat hot, it needs a certain amount of cpu
> +		 * resource, so if waker is far hotter, prefer to leave
> +		 * it alone.
> +		 */
> +		if (current->nr_wakee_switch > (factor * p->nr_wakee_switch))
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  {
>  	s64 this_load, load;
> @@ -3136,6 +3175,9 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  	unsigned long weight;
>  	int balanced;
>
> +	if (nasty_pull(p))
> +		return 0;
> +
>  	idx = sd->wake_idx;
>  	this_cpu = smp_processor_id();
>  	prev_cpu = task_cpu(p);
> @@ -3428,6 +3470,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
>  		/* while loop will break here if sd == NULL */
>  	}
>  unlock:
> +	if (sd_flag & SD_BALANCE_WAKE)
> +		record_wakee(p);
> +
>  	rcu_read_unlock();
>
>  	return new_cpu;
>
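
To illustrate how the counter in record_wakee() behaves over time, here is a
rough userspace sketch of the same bookkeeping. Everything prefixed with
sketch_/SKETCH_ is hypothetical: the struct only mirrors the three fields the
patch adds, 'now' stands in for jiffies, and SKETCH_HZ is an assumed tick
rate. One deliberate difference from the patch: the sketch uses a signed
difference for the timeout check so that it survives counter wraparound,
which is what the kernel's time_after() helper is normally used for; the
patch itself compares jiffies directly.

```c
#define SKETCH_HZ 1000UL	/* assumed ticks per second */

/* Mirrors only the three fields the patch adds to task_struct. */
struct sketch_task {
	const struct sketch_task *last_wakee;
	unsigned long nr_wakee_switch;
	unsigned long last_switch_decay;
};

/*
 * Called on behalf of 'waker' each time it wakes 'wakee' at tick 'now'.
 * The counter is wiped about once per second, then bumped whenever the
 * waker targets a different task than last time.
 */
static void sketch_record_wakee(struct sketch_task *waker,
				const struct sketch_task *wakee,
				unsigned long now)
{
	unsigned long deadline = waker->last_switch_decay + SKETCH_HZ;

	/* Wrap-safe "now > deadline" (an assumption, see lead-in). */
	if ((long)(now - deadline) > 0) {
		waker->nr_wakee_switch = 0;
		waker->last_switch_decay = now;
	}

	if (waker->last_wakee != wakee) {
		waker->last_wakee = wakee;
		waker->nr_wakee_switch++;
	}
}
```

Waking the same task repeatedly leaves the counter alone, alternating
between two tasks bumps it on every wakeup, and once more than SKETCH_HZ
ticks have passed since the last wipe the count restarts from zero before
the current wakeup is recorded.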
