[PATCH] numa: x86_64: Cacheline aliasing makes for_each_populated_zone extremely expensive -V2

From: Robin Holt
Date: Wed Aug 18 2010 - 14:30:34 EST



While testing on a 256-node, 4096-cpu system, Jack Steiner noticed
that we were spending between 0.08% (average) and 0.8% (max) of every
second in vmstat_update. This could be reduced by raising the
stat_interval sysctl, but that only lessens the impact of the problem
without fixing it.

When I investigated, I noticed that all the node_data[] structures are
allocated precisely at the beginning of each node's physical memory,
so they all alias to the same cachelines. By simply staggering them
based upon nodeid, I reduced the average down to 0.0006% of every second.
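
For illustration only, here is a small user-space sketch (not kernel
code) of why identically placed pg_data_t structures land in the same
cache sets, and how a per-node stagger of nodeid * SMP_CACHE_BYTES
spreads them out. The cache geometry and per-node memory size below
are assumptions chosen for the example:

#include <stdio.h>

#define CACHE_LINE_BYTES	64UL		/* assumed SMP_CACHE_BYTES */
#define CACHE_SETS		8192UL		/* assumed number of cache sets */
#define NODE_MEM_BYTES		(64UL << 30)	/* assumed 64GB per node */

static unsigned long cache_set(unsigned long phys)
{
	return (phys / CACHE_LINE_BYTES) % CACHE_SETS;
}

int main(void)
{
	unsigned long nid;

	for (nid = 0; nid < 8; nid++) {
		/* unstaggered: pg_data_t at the very start of the node's memory */
		unsigned long base = nid * NODE_MEM_BYTES;
		/* staggered: offset by one cacheline per nodeid, as the patch does */
		unsigned long staggered = base + nid * CACHE_LINE_BYTES;

		printf("node %lu: set %4lu unstaggered, set %4lu staggered\n",
		       nid, cache_set(base), cache_set(staggered));
	}
	return 0;
}

Without the stagger every node's structure maps to set 0; with it they
spread across consecutive sets, so a walk over all nodes no longer
thrashes a single set.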

With this patch, the max value did not change. I believe that is a
combination of cacheline contention while updating the zone's vmstat
information and round_jiffies_common lumping unrelated cpus onto the
same jiffy for their next update. I will investigate those issues
separately.
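
As a rough illustration of the second effect, here is a simplified
user-space model. This is not the kernel's actual round_jiffies_common()
logic; HZ, the cpu count, and the per-cpu skew are assumptions for the
example. It only shows how rounding every cpu's next update to a
whole-second jiffy boundary collapses many unrelated cpus onto the same
jiffy:

#include <stdio.h>

#define HZ	1000UL
#define NCPUS	4096UL

/* round j up to the next whole-second jiffy boundary */
static unsigned long round_to_second(unsigned long j)
{
	return j - (j % HZ) + HZ;
}

int main(void)
{
	unsigned long now = 123456;		/* arbitrary "current jiffies" */
	static unsigned long count[4 * HZ];	/* cpus per target jiffy, relative to now */
	unsigned long cpu, j, worst = 0;

	for (cpu = 0; cpu < NCPUS; cpu++) {
		/* each cpu wants to run about one second from now, with some skew */
		unsigned long want = now + HZ + (cpu * 7) % HZ;
		unsigned long target = round_to_second(want);

		count[target - now]++;
	}

	for (j = 0; j < 4 * HZ; j++)
		if (count[j] > worst)
			worst = count[j];

	printf("worst jiffy is shared by %lu of %lu cpus\n", worst, NCPUS);
	return 0;
}

Even with per-cpu skew before rounding, roughly half of the 4096 cpus
end up on the same jiffy in this model.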

Signed-off-by: Robin Holt <holt@xxxxxxx>
Signed-off-by: Jack Steiner <steiner@xxxxxxx>
To: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
To: Ingo Molnar <mingo@xxxxxxx>
Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
Cc: x86@xxxxxxxxxx
Cc: Yinghai Lu <yinghai@xxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Joerg Roedel <joerg.roedel@xxxxxxx>
Cc: Linux Kernel <linux-kernel@xxxxxxxxxxxxxxx>
Cc: Stable Maintainers <stable@xxxxxxxxxx>

---

Changes from -V1: cleaned up a compile warning.

This patch applies cleanly to v2.6.34 and later. It can be applied
manually to earlier kernels, but the x86 bootmem fixes introduce
differences in the surrounding code.

I was unsure whether to ask stable@xxxxxxxxxx to pull this back into the
stable releases. My reading of stable_kernel_rules.txt leaves it fuzzy
whether this meets the "oh, that's not good" standard. I personally
think it does, but I am unwilling to defend that position too
stridently, so I punted and added them to the Cc list. We will be asking
both SuSE and Red Hat to add this to their upcoming update releases, as
we expect it to affect their customers.

arch/x86/mm/numa_64.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

Index: round_jiffies/arch/x86/mm/numa_64.c
===================================================================
--- round_jiffies.orig/arch/x86/mm/numa_64.c 2010-08-18 11:39:20.495141178 -0500
+++ round_jiffies/arch/x86/mm/numa_64.c 2010-08-18 13:24:42.679080103 -0500
@@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
 	unsigned long start_pfn, last_pfn, nodedata_phys;
 	const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
 	int nid;
+	unsigned long cache_alias_offset;
 #ifndef CONFIG_NO_BOOTMEM
 	unsigned long bootmap_start, bootmap_pages, bootmap_size;
 	void *bootmap;
@@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
 	start_pfn = start >> PAGE_SHIFT;
 	last_pfn = end >> PAGE_SHIFT;
 
-	node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
+	/*
+	 * Allocate an extra cacheline per node to reduce cacheline
+	 * aliasing when scanning all node's node_data.
+	 */
+	cache_alias_offset = nodeid * SMP_CACHE_BYTES;
+	node_data[nodeid] = cache_alias_offset +
+			    early_node_mem(nodeid, start, end,
+					   pgdat_size + cache_alias_offset,
 					   SMP_CACHE_BYTES);
-	if (node_data[nodeid] == NULL)
+	if (node_data[nodeid] == (void *)cache_alias_offset)
 		return;
 	nodedata_phys = __pa(node_data[nodeid]);
 	reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");
--