Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

From: Waiman Long
Date: Thu Dec 03 2015 - 14:56:45 EST


On 12/03/2015 06:12 AM, Peter Zijlstra wrote:

I made this:

---
Subject: sched/fair: Move hot load_avg into its own cacheline
From: Waiman Long <Waiman.Long@xxxxxxx>
Date: Wed, 2 Dec 2015 13:41:49 -0500

On a system with a large number of sockets driven to full
utilization, clock tick handling was found to occupy a rather
significant proportion of CPU time when fair group scheduling and
autogroup were enabled.

While running a Java benchmark on a 16-socket IvyBridge-EX system,
the perf profile looked like this:

10.52% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
9.66% 0.05% java [kernel.vmlinux] [k] hrtimer_interrupt
8.65% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
8.56% 0.00% java [kernel.vmlinux] [k] update_process_times
8.07% 0.03% java [kernel.vmlinux] [k] scheduler_tick
6.91% 1.78% java [kernel.vmlinux] [k] task_tick_fair
5.24% 5.04% java [kernel.vmlinux] [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contains the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se, which are accessed
rather frequently during clock tick processing.

This patch moves the load_avg variable into its own cacheline,
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all allocated task_group structures are cacheline aligned.

By doing so, the perf profile became:

9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares

The %CPU time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

Before patch - Max-jOPs: 907533 Critical-jOps: 134877
After patch - Max-jOPs: 916011 Critical-jOps: 142366

Cc: Scott J Norton <scott.norton@xxxxxxx>
Cc: Douglas Hatch <doug.hatch@xxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Yuyang Du <yuyang.du@xxxxxxxxx>
Cc: Paul Turner <pjt@xxxxxxxxxx>
Cc: Ben Segall <bsegall@xxxxxxxxxx>
Cc: Morten Rasmussen <morten.rasmussen@xxxxxxx>
Signed-off-by: Waiman Long <Waiman.Long@xxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@xxxxxxx
---
kernel/sched/core.c | 10 +++++++---
kernel/sched/sched.h | 7 ++++++-
2 files changed, 13 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
*/
struct task_group root_task_group;
LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7402,11 +7405,12 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
+ task_group_cache = KMEM_CACHE(task_group, 0);
+
Thanks for making that change.

Do we need to add the SLAB_HWCACHE_ALIGN flag? Or we could make a helper flag that defines SLAB_HWCACHE_ALIGN when CONFIG_FAIR_GROUP_SCHED is defined. Other than that, I am fine with the change.
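Something along these lines, say (the TG_SLAB_FLAGS helper name is
just illustrative):

/*
 * Cacheline alignment only matters when fair group scheduling is
 * configured in, since that is when load_avg sits in its own
 * cacheline.
 */
#ifdef CONFIG_FAIR_GROUP_SCHED
#define TG_SLAB_FLAGS	SLAB_HWCACHE_ALIGN
#else
#define TG_SLAB_FLAGS	0
#endif

	task_group_cache = KMEM_CACHE(task_group, TG_SLAB_FLAGS);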

Cheers,
Longman