Re: [Patch] numa:x86_64: Cacheline aliasing makesfor_each_populated_zone extremely expensive -V2.

From: Roedel, Joerg
Date: Thu Aug 19 2010 - 13:29:02 EST


On Wed, Aug 18, 2010 at 02:30:24PM -0400, Robin Holt wrote:
> When I investigated, I noticed that all the zone_data[] structures are
> allocated precisely at the beginning of the individual node's physical
> memory. By simply staggering based upon nodeid, I reduced the average
> down to 0.0006% of every second.

Interesting. Have you measured cache misses for both cases?

> With this patch, the max value did not change. I believe that is a
> combination of cacheline contention updating the zone's vmstat information
> combined with round_jiffies_common spattering unrelated cpus onto the same
> jiffie for their next update. I will investigate those issues seperately.

The max value probably the case where none of the data the code accesses
is actually in the cache. Cache aliasing does not help then so I would
not expect a change in the maximum amount here.

> I had no idea whether to ask stable@xxxxxxxxxx to pull this back to the
> stable releases. My reading of the stable_kernel_rules.txt criteria is
> only fuzzy as to whether this meets the "oh, that's not good" standard.
> I personally think this meets that criteria, but I am unwilling to defend
> that position too stridently. In the end, I punted and added them to
> the Cc list. We will be asking both SuSE and RedHat to add this to
> their upcoming update releases as we expect it to affect their customers.

I don't think this is stable material. It improves performance and does
not fix a bug. But I am not the one to decide this :-)

>
> arch/x86/mm/numa_64.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> Index: round_jiffies/arch/x86/mm/numa_64.c
> ===================================================================
> --- round_jiffies.orig/arch/x86/mm/numa_64.c 2010-08-18 11:39:20.495141178 -0500
> +++ round_jiffies/arch/x86/mm/numa_64.c 2010-08-18 13:24:42.679080103 -0500
> @@ -198,6 +198,7 @@ setup_node_bootmem(int nodeid, unsigned
> unsigned long start_pfn, last_pfn, nodedata_phys;
> const int pgdat_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
> int nid;
> + unsigned long cache_alias_offset;
> #ifndef CONFIG_NO_BOOTMEM
> unsigned long bootmap_start, bootmap_pages, bootmap_size;
> void *bootmap;
> @@ -221,9 +222,16 @@ setup_node_bootmem(int nodeid, unsigned
> start_pfn = start >> PAGE_SHIFT;
> last_pfn = end >> PAGE_SHIFT;
>
> - node_data[nodeid] = early_node_mem(nodeid, start, end, pgdat_size,
> + /*
> + * Allocate an extra cacheline per node to reduce cacheline
> + * aliasing when scanning all node's node_data.
> + */
> + cache_alias_offset = nodeid * SMP_CACHE_BYTES;
> + node_data[nodeid] = cache_alias_offset +
> + early_node_mem(nodeid, start, end,
> + pgdat_size + cache_alias_offset,
> SMP_CACHE_BYTES);
> - if (node_data[nodeid] == NULL)
> + if (node_data[nodeid] == (void *)cache_alias_offset)

It is cleaner to keep the NULL check here and add the cache_alias_offset
to the pointer after that check.

> return;
> nodedata_phys = __pa(node_data[nodeid]);
> reserve_early(nodedata_phys, nodedata_phys + pgdat_size, "NODE_DATA");
>

--
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/