Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.

From: Zi Yan
Date: Wed Mar 13 2019 - 22:39:33 EST


On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

> On 02/19/2019 06:26 PM, Matthew Wilcox wrote:
>> On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:
>>> But the location of this temp page matters as well because you would like to
>>> saturate the inter node interface. It needs to be either of the nodes where
>>> the source or destination page belongs. Any other node would generate two
>>> internode copy process which is not what you intend here I guess.
>> That makes no sense. It should be allocated on the local node of the CPU
>> performing the copy. If the CPU is in node A, the destination is in node B
>> and the source is in node C, then you're doing 4k worth of reads from node C,
>> 4k worth of reads from node B, 4k worth of writes to node C followed by
>> 4k worth of writes to node B. Eventually the 4k of dirty cachelines on
>> node A will be written back from cache to the local memory (... or not,
>> if that page gets reused for some other purpose first).
>>
>> If you allocate the page on node B or node C, that's an extra 4k of writes
>> to be sent across the inter-node link.
>
> That's right, there will be an extra remote write. My assumption was that the CPU
> performing the copy belongs to either node B or node C.


I have some interesting throughput results comparing exchanging per u64 with exchanging per 4KB page.
What I discovered is that using a 4KB page as the temporary storage for exchanging
2MB THPs does not improve the throughput. On the contrary, when exchanging more than 2^4 = 16 THPs,
exchanging per 4KB page has lower throughput than exchanging per u64. Please see the results below.

The experiments are done on a two-socket machine with two Intel Xeon E5-2640 v3 CPUs.
All exchanges go across the QPI link between the two sockets.
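The throughput numbers below are the total number of bytes exchanged divided by the elapsed time.
The actual harness is in the repositories linked at the end; the following is only a minimal sketch
of such a measurement, assuming a ktime_get()-based timer and a hypothetical exchange_one_thp()
helper standing in for whichever exchange routine (per u64 or per 4KB page) is being measured:

static u64 measure_exchange_mbps(struct page **a, struct page **b,
				 unsigned long nr_thp)
{
	/* Each exchange moves both THPs of a pair, i.e. 2 * 2MB. */
	u64 bytes = (u64)nr_thp * HPAGE_PMD_SIZE * 2;
	ktime_t start, end;
	unsigned long i;

	start = ktime_get();
	for (i = 0; i < nr_thp; i++)
		exchange_one_thp(a[i], b[i]);	/* hypothetical helper */
	end = ktime_get();

	/* Bytes per ns is numerically GB/s; scale by 1000 to report MB/s. */
	return div64_u64(bytes * 1000, ktime_to_ns(ktime_sub(end, start)));
}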


Results
===

Throughput (GB/s) of exchanging 2^N 2MB THPs between two NUMA nodes (N = 2mb_page_order below)

| 2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |
| u64            | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62 |
| per_page       | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31 |

Throughput of u64 normalized to per_page (values above 1.00 mean exchanging per u64 is faster)

| 2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9 |
| u64            | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31 |



Exchange page code
===

For exchanging per u64, I use the following function:

static void exchange_page(char *to, char *from)
{
	u64 tmp;
	int i;

	/* Swap the two 4KB pages one u64 at a time, no temporary page needed. */
	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		tmp = *((u64 *)(from + i));
		*((u64 *)(from + i)) = *((u64 *)(to + i));
		*((u64 *)(to + i)) = tmp;
	}
}
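For a whole 2MB THP, the per-u64 routine above is applied to each 4KB subpage in turn. The real
outer loop is in the tree linked below; this is only a hypothetical sketch of it, with
exchange_huge_page() and its use of kmap()/kunmap() being illustrative:

static void exchange_huge_page(struct page *dst, struct page *src)
{
	int i;

	/* Walk the subpages of each 2MB THP and swap them pairwise. */
	for (i = 0; i < HPAGE_PMD_NR; i++) {
		char *to = kmap(dst + i);
		char *from = kmap(src + i);

		exchange_page(to, from);

		kunmap(src + i);
		kunmap(dst + i);
	}
}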


For exchanging per 4KB page, I use the following function:

static void exchange_page2(char *to, char *from)
{
	int cpu = smp_processor_id();

	VM_BUG_ON(!in_atomic());

	/*
	 * page_tmp[] normally holds a page pre-allocated on this CPU's
	 * local node; the allocation below only runs for hot-added CPUs.
	 */
	if (!page_tmp[cpu]) {
		int nid = cpu_to_node(cpu);
		struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);

		if (!page_tmp_page) {
			/* No temporary page: fall back to the per-u64 exchange. */
			exchange_page(to, from);
			return;
		}
		page_tmp[cpu] = kmap(page_tmp_page);
	}

	/* Exchange the two pages through the node-local temporary page. */
	copy_page(page_tmp[cpu], to);
	copy_page(to, from);
	copy_page(from, page_tmp[cpu]);
}

where page_tmp[] holds a temporary page pre-allocated on each CPU's local node; the
alloc_pages_node() call above only covers hot-added CPUs and is not exercised in these tests.
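For reference, a minimal sketch of how that per-CPU pre-allocation could look at init time; the
function name exchange_tmp_init() and the use of a late_initcall() are assumptions, not the actual
code from the tree linked below:

static char *page_tmp[NR_CPUS];

static int __init exchange_tmp_init(void)
{
	int cpu;

	/* Give each online CPU a temporary page on its local node. */
	for_each_online_cpu(cpu) {
		struct page *page = alloc_pages_node(cpu_to_node(cpu),
						     GFP_KERNEL, 0);

		if (!page)
			return -ENOMEM;
		page_tmp[cpu] = kmap(page);
	}
	return 0;
}
late_initcall(exchange_tmp_init);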


The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc
To do a comparison, you can clone this repo: https://gitlab.com/ziy/thp-migration-bench,
then run make, ./run_test.sh, and ./get_results.sh on a machine booted with the kernel above.

Let me know if I missed anything or did something wrong. Thanks.


--
Best Regards,
Yan Zi