[PATCH 0/5] Parallel hugepage migration optimization

From: Zi Yan
Date: Tue Nov 22 2016 - 11:26:43 EST

From: Zi Yan <zi.yan@xxxxxxxxxxxxxx>

Hi all,

This patchset boosts hugepage migration throughput and helps THP migration,
which is added by Naoya's patches: https://lwn.net/Articles/705879/.


On x86, 4KB page migrations underutilize the memory bandwidth compared
to 2MB THP migrations. I did some page migration benchmarking on a two-socket
Intel Xeon E5-2640v3 box, which has 23.4GB/s bandwidth, and discovered
a big throughput gap, ~3x, between 4KB and 2MB page migrations.

Here are the throughput numbers for different page sizes and page counts:

        | 512 4KB pages | 1 2MB THP | 1 4KB page
x86_64  | 0.98GB/s      | 2.97GB/s  | 0.06GB/s

As Linux currently uses single-threaded page migration, the throughput is still
much lower than the hardware bandwidth, 2.97GB/s vs 23.4GB/s. So I parallelized
the copy_page() part of THP migration with a workqueue and achieved 2.8x throughput.

Here are the throughput numbers of 2MB page migration:

            | single-threaded | 8-thread
x86_64 2MB  | 2.97GB/s        | 8.58GB/s
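
The ratios behind the two tables can be re-derived with a few lines of
user-space C. This is only a convenience sketch: the helper names are made up
for illustration, and the constants are copied from the tables above, not
re-measured.

```c
#include <stdio.h>

/* Figures copied from the tables above, in GB/s. */
static const double GBPS_512_4K = 0.98;   /* 512 4KB pages           */
static const double GBPS_THP_ST = 2.97;   /* 1 2MB THP, 1 thread     */
static const double GBPS_THP_MT = 8.58;   /* 1 2MB THP, 8 threads    */
static const double GBPS_PEAK   = 23.4;   /* quoted memory bandwidth */

/* Hypothetical helper: how many times faster "after" is than "before". */
static double speedup(double after, double before)
{
	return after / before;
}

/* Hypothetical helper: fraction of the peak bandwidth that is used. */
static double bandwidth_fraction(double gbps, double peak)
{
	return gbps / peak;
}

/* Print the derived ratios; 512 * 4KB is the same 2MB of data as one THP. */
static void print_summary(void)
{
	printf("THP vs 512 4KB pages:  %.2fx\n",
	       speedup(GBPS_THP_ST, GBPS_512_4K));
	printf("8-thread vs 1-thread:  %.2fx\n",
	       speedup(GBPS_THP_MT, GBPS_THP_ST));
	printf("1-thread share of bw:  %.1f%%\n",
	       100.0 * bandwidth_fraction(GBPS_THP_ST, GBPS_PEAK));
	printf("8-thread share of bw:  %.1f%%\n",
	       100.0 * bandwidth_fraction(GBPS_THP_MT, GBPS_PEAK));
}
```

Calling print_summary() from a main() prints the four ratios; even the
8-thread case stays well below the quoted 23.4GB/s.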

Here is the benchmark you can use to compare page migration time:

As this patchset requires Naoya's patches, this repo has both patchsets applied:

Patchset description

This patchset adds a new migrate_mode, MIGRATE_MT, which selects a parallelized
page migration routine. Only copy_huge_page() is parallelized. MIGRATE_MT is
enabled by a sysctl knob, vm.accel_page_copy, or by an additional flag,
MPOL_MF_MOVE_MT, to the move_pages() system call.
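
As a rough illustration of the move_pages() path, a user-space caller could
OR the new flag into the existing MPOL_MF_MOVE flag. This is a sketch under
assumptions: the numeric value of MPOL_MF_MOVE_MT below is a placeholder (the
real one comes from the patched include/uapi/linux/mempolicy.h), and the
helper names mt_move_flags() and move_one_page() are invented for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Existing uapi flag: move pages owned by the calling process. */
#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)
#endif

/*
 * ASSUMPTION: placeholder value for illustration only. Use the definition
 * from the patched include/uapi/linux/mempolicy.h when building for real.
 */
#ifndef MPOL_MF_MOVE_MT
#define MPOL_MF_MOVE_MT (1 << 6)
#endif

/* Hypothetical helper: build the flags word, optionally asking for MT copy. */
static int mt_move_flags(int use_mt)
{
	return MPOL_MF_MOVE | (use_mt ? MPOL_MF_MOVE_MT : 0);
}

/*
 * Hypothetical helper: ask the kernel to move the page containing addr
 * (in the calling process) to target_node. Returns the per-page status
 * (the node number, or a negative errno) on success, or -errno if the
 * system call itself failed.
 */
static long move_one_page(void *addr, int target_node, int use_mt)
{
	void *pages[1] = { addr };
	int nodes[1] = { target_node };
	int status[1] = { -1 };
	long rc;

	rc = syscall(SYS_move_pages, 0, 1UL, pages, nodes, status,
		     mt_move_flags(use_mt));
	if (rc < 0)
		return -errno;
	return status[0];
}
```

On a kernel without this patchset the extra flag bit is expected to be
rejected with -EINVAL, so a caller can fall back to use_mt = 0.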

The parallelized page copy routine splits a single huge page across 4
workqueue threads and waits until they finish.
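
The split-and-wait idea can be sketched in user space with pthreads standing
in for workqueue threads. This is an analogy only, not the code in
mm/copy_page.c: copy_huge_page_mt(), struct copy_chunk and the constants are
names chosen for this example, and memcpy() stands in for the kernel's page
copy.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define HPAGE_BYTES     (2UL << 20)  /* 2MB, the x86_64 THP size */
#define NR_COPY_THREADS 4            /* the hard-coded count from the text */

/* One contiguous slice of the huge page, handled by one thread. */
struct copy_chunk {
	char *dst;
	const char *src;
	size_t len;
};

static void *copy_chunk_fn(void *arg)
{
	struct copy_chunk *c = arg;

	memcpy(c->dst, c->src, c->len);
	return NULL;
}

/*
 * Split [src, src + bytes) into NR_COPY_THREADS slices, copy each slice in
 * its own thread, then wait for all of them. Returns 0 on success or the
 * pthread_create() error code; on error the copy is incomplete.
 */
static int copy_huge_page_mt(char *dst, const char *src, size_t bytes)
{
	pthread_t tid[NR_COPY_THREADS];
	struct copy_chunk chunk[NR_COPY_THREADS];
	size_t per = bytes / NR_COPY_THREADS;
	int i, started = 0, err = 0;

	for (i = 0; i < NR_COPY_THREADS; i++) {
		chunk[i].dst = dst + i * per;
		chunk[i].src = src + i * per;
		/* The last slice also takes any remainder. */
		chunk[i].len = (i == NR_COPY_THREADS - 1) ?
			       bytes - i * per : per;
		err = pthread_create(&tid[i], NULL, copy_chunk_fn, &chunk[i]);
		if (err)
			break;
		started++;
	}

	/* Wait until every started worker has finished its slice. */
	for (i = 0; i < started; i++)
		pthread_join(tid[i], NULL);

	return err;
}
```

Build with -pthread. Because the slices do not overlap, the workers need no
locking; the only synchronization point is the final join.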

1. For testing purposes, I chose to use a sysctl to enable and disable
parallel huge page migration. I need comments on how to enable and disable it,
or whether to just enable it for all huge page migrations.

2. The hard-coded "4" workqueue threads is not adaptive. Any suggestions?
Perhaps a boot-time benchmark to find an appropriate number?

3. The parallel huge page migration works best with threads placed on
different physical cores, not all on the same hyper-threaded core. Is there
any way to find out the core topology easily?

Any comments are welcome. Thanks.

Best Regards,
Zi Yan

Zi Yan (5):
  mm: migrate: Add mode parameter to support additional page copy
  mm: migrate: Change migrate_mode to support combination migration
  migrate: Add copy_page_mt to use multi-threaded page migration.
  mm: migrate: Add copy_page_mt into migrate_pages.
  mm: migrate: Add vm.accel_page_copy in sysfs to control whether to use
    multi-threaded to accelerate page copy.

 fs/aio.c                       |  2 +-
 fs/hugetlbfs/inode.c           |  2 +-
 fs/ubifs/file.c                |  2 +-
 include/linux/highmem.h        |  2 +
 include/linux/migrate.h        |  6 ++-
 include/linux/migrate_mode.h   |  7 +--
 include/uapi/linux/mempolicy.h |  2 +
 kernel/sysctl.c                | 12 ++++++
 mm/Makefile                    |  2 +
 mm/compaction.c                | 20 ++++-----
 mm/copy_page.c                 | 96 ++++++++++++++++++++++++++++++++++++++++++
 mm/migrate.c                   | 61 ++++++++++++++++++---------
 12 files changed, 175 insertions(+), 39 deletions(-)
 create mode 100644 mm/copy_page.c