Re: [RFC PATCH 3/3] sched/fair: Use different cachelines for readers and writers of load_avg

From: Waiman Long
Date: Mon Nov 30 2015 - 14:13:40 EST


On 11/30/2015 05:22 AM, Peter Zijlstra wrote:
Please always Cc the people who wrote the code.

+CC pjt, ben, morten, yuyang

Sorry for that. Their names didn't show up when I did get_maintainer.pl.

On Wed, Nov 25, 2015 at 02:09:40PM -0500, Waiman Long wrote:
The load_avg statistical counter is only changed if the load on a CPU
deviates significantly from the previous tick. So there are usually
more readers than writers of load_avg. Still, on a large system,
the cacheline contention can cause significant slowdown and impact
performance.

This patch attempts to separate those load_avg readers
(update_cfs_shares) and writers (task_tick_fair) to use different
cachelines instead. Writers of load_avg will now accumulates the
load delta into load_avg_delta which sits in a different cacheline.
If load_avg_delta is sufficiently large (> load_avg/64), it will then
be added back to load_avg.

Running a java benchmark on a 16-socket IvyBridge-EX system (240 cores,
480 threads), the perf profile before the patch was:

9.44% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
8.74% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
7.83% 0.03% java [kernel.vmlinux] [k] tick_sched_timer
7.74% 0.00% java [kernel.vmlinux] [k] update_process_times
7.27% 0.03% java [kernel.vmlinux] [k] scheduler_tick
5.94% 1.74% java [kernel.vmlinux] [k] task_tick_fair
4.15% 3.92% java [kernel.vmlinux] [k] update_cfs_shares

After the patch, it became:

2.94% 0.00% java [kernel.vmlinux] [k] smp_apic_timer_interrupt
2.52% 0.01% java [kernel.vmlinux] [k] hrtimer_interrupt
2.25% 0.02% java [kernel.vmlinux] [k] tick_sched_timer
2.21% 0.00% java [kernel.vmlinux] [k] update_process_times
1.70% 0.03% java [kernel.vmlinux] [k] scheduler_tick
0.96% 0.34% java [kernel.vmlinux] [k] task_tick_fair
0.61% 0.48% java [kernel.vmlinux] [k] update_cfs_shares
This begs the question tough; why are you running a global load in a
cgroup; and do we really need to update this for the root cgroup? It
seems to me we don't need calc_tg_weight() for the root cgroup, it
doesn't need to normalize its weight numbers.

That is; isn't this simply a problem we should avoid?

I didn't use any cgroup in my test setup. Autogroup was enabled, though. Booting up a 4.4-rc2 kernel caused sched_create_group() to be called 56 times.

The benchmark results before and after the patch were as follows:

Before patch - Max-jOPs: 916011 Critical-jOps: 142366
AFter patch - Max-jOPs: 939130 Critical-jOps: 211937

There was significant improvement in Critical-jOps which was latency
sensitive.

This patch does introduce additional delay in getting the real load
average reflected in load_avg. It may also incur additional overhead
if the number of CPUs in a task group is small. As a result, this
change is only activated when running on a 4-socket or larger systems
which can get the most benefit from it.
So I'm not particularly charmed by this; it rather makes a mess of
things. Also this really wants a run of the cgroup fairness test thingy
pjt/ben have somewhere.

I will be glad to run any additional tests, if necessary. I do need pointers to those test, though.

Signed-off-by: Waiman Long<Waiman.Long@xxxxxxx>
---
kernel/sched/core.c | 9 +++++++++
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
kernel/sched/sched.h | 8 ++++++++
3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..f3075da 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7356,6 +7356,12 @@ void __init sched_init(void)
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);

+#ifdef CONFIG_SMP
+ /*
+ * Use load_avg_delta if not 2P or less
+ */
+ root_task_group.use_la_delta = (num_possible_nodes()> 2);
+#endif /* CONFIG_SMP */
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7691,6 +7697,9 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!alloc_rt_sched_group(tg, parent))
goto err;

+#if defined(CONFIG_FAIR_GROUP_SCHED)&& defined(CONFIG_SMP)
+ tg->use_la_delta = root_task_group.use_la_delta;
+#endif
return tg;

err:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f1eccc..44732cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2663,15 +2663,41 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,

#ifdef CONFIG_FAIR_GROUP_SCHED
/*
- * Updating tg's load_avg is necessary before update_cfs_share (which is done)
+ * Updating tg's load_avg is necessary before update_cfs_shares (which is done)
* and effective_load (which is not done because it is too costly).
+ *
+ * The tg's use_la_delta flag, if set, will cause the load_avg delta to be
+ * accumulated into the load_avg_delta variable instead to reduce cacheline
+ * contention on load_avg at the expense of more delay in reflecting the real
+ * load_avg. The tg's load_avg and load_avg_delta variables are in separate
+ * cachelines. With that flag set, load_avg will be read mostly whereas
+ * load_avg_delta will be write mostly.
*/
static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
{
long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

if (force || abs(delta)> cfs_rq->tg_load_avg_contrib / 64) {
- atomic_long_add(delta,&cfs_rq->tg->load_avg);
+ struct task_group *tg = cfs_rq->tg;
+ long load_avg, tot_delta;
+
+ if (!tg->use_la_delta) {
+ /*
+ * If the use_la_delta isn't set, just add the
+ * delta directly into load_avg.
+ */
+ atomic_long_add(delta,&tg->load_avg);
+ goto set_contrib;
+ }
+
+ tot_delta = atomic_long_add_return(delta,&tg->load_avg_delta);
+ load_avg = atomic_long_read(&tg->load_avg);
+ if (abs(tot_delta)> load_avg / 64) {
+ tot_delta = atomic_long_xchg(&tg->load_avg_delta, 0);
+ if (tot_delta)
+ atomic_long_add(tot_delta,&tg->load_avg);
+ }
+set_contrib:
cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
}
}
I'm thinking that its now far too big to retain the inline qualifier.

I can take the inline keyword out.


diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e679895..aef4e4e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -252,8 +252,16 @@ struct task_group {
* load_avg can be heavily contended at clock tick time, so put
* it in its own cacheline separated from the fields above which
* will also be accessed at each tick.
+ *
+ * The use_la_delta flag, if set, will enable the use of load_avg_delta
+ * to accumulate the delta and only change load_avg when the delta
+ * is big enough. This reduces the cacheline contention on load_avg.
+ * This flag will be set at allocation time depending on the system
+ * configuration.
*/
+ int use_la_delta;
atomic_long_t load_avg ____cacheline_aligned;
+ atomic_long_t load_avg_delta ____cacheline_aligned;
This would only work if the structure itself is allocated with cacheline
alignment, and looking at sched_create_group(), we use a plain kzalloc()
for this, which doesn't guarantee any sort of alignment beyond machine
word size IIRC.

With a RHEL 6 derived .config file, the size of the task_group structure was 460 bytes on a 32-bit x86 kernel. Adding a ____cacheline_aligned tag increase the size to 512 bytes. So it did make the structure a multiple of the cacheline size. With both slub and slab, the allocated task group pointers from kzalloc() in sched_create_group() were all multiples of 0x200. So they were properly aligned for the ____cacheline_aligned tag to work.

Also, you unconditionally grow the structure by a whole cacheline.

I know it is a drawback of using ____cacheline_aligned tag. However, we probably won't create too many task groups in normal use. So the increase in memory consumption shouldn't be noticeable.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/