Re: [PATCH 0/6] Enable parallel page migration

From: Mel Gorman
Date: Thu Mar 09 2017 - 17:15:38 EST


On Thu, Mar 09, 2017 at 11:38:00AM -0600, David Nellans wrote:
> On 03/09/2017 09:09 AM, Mel Gorman wrote:
> > I didn't look into the patches in detail except to get a general feel
> > for how it works and I'm not convinced that it's a good idea at all.
> >
> > I accept that memory bandwidth utilisation may be higher as a result but
> > consider the impact. THP migrations are relatively rare and when they
> > occur, it's in the context of a single thread. To parallelise the copy,
> > an allocation, kmap and workqueue invocation are required. There may be a
> > long delay before the workqueue item can start which may exceed the time
> > to do a single copy if the CPUs on a node are saturated. Furthermore, a
> > single thread can preempt operations of other unrelated threads and incur
> > CPU cache pollution and future misses on unrelated CPUs. It's compounded by
> > the fact that a high priority system workqueue is used to do the operation,
> > one that is used for CPU hotplug operations and rolling back when a netdevice
> > fails to be registered. It treats a hugepage copy as an essential operation
> > that can preempt all other work which is very questionable.
> >
> > The series leader has no details on a workload that is bottlenecked by
> > THP migrations and even if it is, the primary question should be *why*
> > THP migrations are so frequent and alleviating that instead of
> > preempting multiple CPUs to do the work.
> >
> >
>
> Mel - I sense ongoing frustration around some of the THP migration,
> migration acceleration, CDM, and other patches. Here is a 10k foot
> description that I hope adds to what John & Anshuman have said in other
> threads.
>

Hi David,

I recognise the motivation for some of these patches but disagree with
the mechanisms used; more on that later.

> Vendors are currently providing systems that have both traditional
> DDR3/4 memory (let's call it 100GB/s) and high bandwidth memory (HBM)
> (let's call it 1TB/s) within a single system. GPUs have been doing this
> with HBM on the GPU and DDR on the CPU complex, but they've been
> attached via PCIe and thus HBM has been GPU private memory.

I completely understand, although I'd point out that HBM is slightly
different in that it could be expressed as a hierarchical node system
whereby pages migrate between nodes -- from faster to slower via a
"migrate on LRU reclaim" step and from slower to faster with automatic
NUMA balancing using sampling. However, HBM is extremely specific and
dealing with it is not necessarily compatible with devices that are not
coherent.

> <SNIP>

Again, I understand the motivation and have no further comment to make.
In the interest of trying to be helpful, I'll propose an alternative to
this series and expand upon why I think it's problematic.

> the HBM node from the DDR node. The expectation is that on such systems
> either the user, a daemon, or kernel/autonuma is going to be migrating
> (TH)pages between the NUMA zones to optimize overall system
> bandwidth/throughput. Because of the 10x discrepancy in memory
> bandwidth, despite the best paging policies to optimize for page
> locality in the HBM nodes, pages will often still be moving at a high
> rate between zones. This differs from a traditional NUMA system where
> moving a page from one 100GB/s node to the other 100GB/s node has
> dubious value, like you say.
>
> To your specific question - what workloads benefit from this improved
> migration throughput and why THPs?

So the key potential issue here in my mind is that THP migration is too
slow in some cases. What I object to is improving that with a
high-priority workqueue that potentially starves other CPUs and pollutes
their caches, which is generally very expensive.

Let's look at the core of what copy_huge_page does in mm/migrate.c,
which is the function that gets parallelised by the series in question.
For a !HIGHMEM system, it's woefully inefficient. Historically, it was
an implementation that would work generically, which was fine but maybe
not for future systems. It was also fine back when hugetlbfs was the
only huge page implementation and COW operations were incredibly rare
due to the risk that they could terminate the process with prejudice.

The function takes a huge page, splits it into PAGE_SIZE chunks,
kmap_atomics the source and destination for each PAGE_SIZE chunk and
copies it. The parallelised version does one kmap and copies in chunks,
assuming the THP is fully mapped and accessible. Fundamentally, this is
broken in the generic sense as a single kmap is not guaranteed to map
the whole page, but it happens to work on !highmem systems. What is more
important to note is that there are multiple preempt and pagefault
enables and disables on a per-page basis, repeated 512 times (for THP on
x86-64 at least), all of which are expensive operations depending on the
kernel config, and I suspect that the parallelisation is actually
masking that stupid overhead.
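For reference, the pattern I'm describing looks roughly like the below
(abbreviated from mm/migrate.c and include/linux/highmem.h of recent
kernels, with the hugetlbfs and gigantic page handling omitted):

static void copy_huge_page(struct page *dst, struct page *src)
{
	int i;
	int nr_pages = hpage_nr_pages(src);

	for (i = 0; i < nr_pages; i++) {
		cond_resched();
		/* 512 iterations for a 2M THP on x86-64 */
		copy_highpage(dst + i, src + i);
	}
}

static inline void copy_highpage(struct page *to, struct page *from)
{
	char *vfrom, *vto;

	/*
	 * Each kmap_atomic disables pagefaults and preemption, even on
	 * !HIGHMEM where it resolves to little more than page_address().
	 */
	vfrom = kmap_atomic(from);
	vto = kmap_atomic(to);
	copy_page(vto, vfrom);
	kunmap_atomic(vto);
	kunmap_atomic(vfrom);
}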

At the very least, I would have expected an initial attempt, in a
single patch, that optimised for !highmem systems: skip the kmap, simply
disable preempt (if that is even necessary, I didn't check) and copy a
pinned physical->physical page as one flat copy instead of looping on a
PAGE_SIZE basis, and see how much that gained. Do it initially for THP
only and worry about gigantic pages when or if that becomes a problem.
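As a purely illustrative sketch of what I mean (not a tested patch, and
the function name is made up), the !highmem case could be a single flat
copy:

/*
 * Hypothetical: on !CONFIG_HIGHMEM every subpage of the THP is already
 * in the kernel direct map, so there is no need for a per-subpage
 * kmap_atomic/kunmap_atomic cycle. Whether preemption must be disabled
 * around the copy is left unverified, as noted above.
 */
static void copy_huge_page_flat(struct page *dst, struct page *src)
{
	int nr_pages = hpage_nr_pages(src);

	/* src and dst are pinned by the migration core at this point */
	memcpy(page_address(dst), page_address(src),
	       nr_pages * PAGE_SIZE);
}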

That would be patch 1 of a series. Maybe that'll be enough, maybe not,
but I feel it's important to optimise the serialised case as much as
possible before considering parallelisation, to highlight and justify
why it's necessary[1]. If nothing else, what happens if two CPUs both
parallelise a migration at the same time and end up preempting each
other? Between that and the workqueue setup, it's potentially much
slower than an optimised serial copy.

It would be tempting to experiment but the test case was not even included
with the series (maybe it's somewhere else)[2]. While it's obvious how
such a test case could be constructed, it feels unnecessary to construct
it when it should be in the changelog.
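For what it's worth, the sort of thing I'd expect is small. A purely
illustrative userspace sketch (not the series' missing test case; node
numbers and the region size are placeholders, build with -lnuma) would
be along these lines:

#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SZ (1UL << 30)	/* 1G of anonymous memory */

/* Move the whole region to the given node via mbind(MPOL_MF_MOVE) */
static void bind_to_node(void *buf, int node)
{
	unsigned long nodemask = 1UL << node;

	if (mbind(buf, SZ, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
		  MPOL_MF_MOVE | MPOL_MF_STRICT))
		perror("mbind");
}

int main(void)
{
	struct timespec t0, t1;
	void *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	madvise(buf, SZ, MADV_HUGEPAGE);
	memset(buf, 1, SZ);		/* fault in, hopefully as THP */

	bind_to_node(buf, 0);		/* start on node 0 */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	bind_to_node(buf, 1);		/* migrate everything to node 1 */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("migrated %lu MB in %.3f ms\n", SZ >> 20,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}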

To some extent, CDM suffered from the same pushback. Like this series,
it introduced something new, complex and with a high maintenance
overhead, without considering whether the existing mechanisms (cpuset,
mempolicies or some combination of both) would be sufficient, or whether
out-of-tree proposals such as HMM could be added to and finalised[3].

[1] If nothing else, it would make it clear to the reviewer that the
additional complexity is 100% justified -- a justification that is
woefully missing from this series.

[2] For a complex series with an alleged performance improvement, it
should always be possible to supply a test case that demonstrates it.
Granted, this does not always happen, but at least for my own series I
have the test case in question automated, can point people to the
repository that stores it and, if necessary, can supply instructions on
how to reproduce the results.

[3] Preferably with an in-tree user because the lack of such a user was
one of the major factors that gave HMM a kicking.

--
Mel Gorman
SUSE Labs