Re: [PATCH 0/6] Enable parallel page migration
From: David Nellans
Date: Thu Mar 09 2017 - 12:38:17 EST
On 03/09/2017 09:09 AM, Mel Gorman wrote:
> I didn't look into the patches in detail except to get a general feel
> for how it works and I'm not convinced that it's a good idea at all.
>
> I accept that memory bandwidth utilisation may be higher as a result but
> consider the impact. THP migrations are relatively rare and when they
> occur, it's in the context of a single thread. To parallelise the copy,
> an allocation, kmap and workqueue invocation are required. There may be a
> long delay before the workqueue item can start which may exceed the time
> to do a single copy if the CPUs on a node are saturated. Furthermore, a
> single thread can preempt operations of other unrelated threads and incur
> CPU cache pollution and future misses on unrelated CPUs. It's compounded by
> the fact that a high priority system workqueue is used to do the operation,
> one that is used for CPU hotplug operations and rolling back when a netdevice
> fails to be registered. It treats a hugepage copy as an essential operation
> that can preempt all other work which is very questionable.
>
> The series leader has no details on a workload that is bottlenecked by
> THP migrations and even if it is, the primary question should be *why*
> THP migrations are so frequent and alleviating that instead of
> preempting multiple CPUs to do the work.
>
>

Mel - I sense ongoing frustration around some of the THP migration,
migration acceleration, CDM, and other patches. Here is a 10k-foot
description that I hope adds to what John & Anshuman have said in other
threads.

Vendors are currently providing systems that have both traditional
DDR3/4 memory (let's call it 100GB/s) and high bandwidth memory (HBM)
(let's call it 1TB/s) within a single system. GPUs have been doing this
with HBM on the GPU and DDR on the CPU complex, but they've been
attached via PCIe and thus the HBM has been GPU-private memory. The GPU
has managed this memory by effectively mlocking pages on the CPU and
copying the data into the GPU while it's being computed on, then copying
it back to the CPU when the CPU faults trying to touch it or the GPU is
done. Because HBM is limited in capacity (10's of GB max) versus DDR3/4
(100's+ GB), runtimes like Nvidia's unified memory dynamically page
memory in and out of the GPU to get the benefits of high bandwidth while
still allowing access to the total footprint of system memory. It's
effectively page-protection-based CPU/GPU memory coherence.
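
For anyone unfamiliar with that mechanism, here is a minimal userspace
sketch of the page-protection-based coherence idea: protect the CPU
mapping, take the fault on first touch, copy the data back, and
unprotect. This is only an illustration of the concept, not NVIDIA's
runtime or driver code; the buffer names and sizes are made up and error
checking is omitted.

/*
 * Sketch only: the "remote" buffer stands in for device-resident HBM.
 * The local mapping is mprotect()ed to PROT_NONE so the first CPU touch
 * faults; the SIGSEGV handler "migrates" the page back and unprotects it.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;
static char *local;   /* CPU-visible mapping, initially inaccessible */
static char *remote;  /* stand-in for device-resident memory */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)((unsigned long)si->si_addr & ~(page_size - 1));
    size_t off = addr - local;

    /* Unprotect the faulting page, then copy it back from the remote copy. */
    mprotect(addr, page_size, PROT_READ | PROT_WRITE);
    memcpy(addr, remote + off, page_size);
}

int main(void)
{
    page_size = sysconf(_SC_PAGESIZE);
    size_t len = 4 * page_size;

    local  = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    remote = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(remote, 'G', len);          /* pretend the GPU wrote this */
    mprotect(local, len, PROT_NONE);   /* CPU access will now fault */

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* First touch faults, handler copies the page, access then succeeds. */
    printf("first byte after fault-driven copy: %c\n", local[0]);
    return 0;
}
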
PCIe-attached GPUs+HBM are the bulk of what's out there today and will
continue to be, so there are efforts to try to improve how GPUs (and
other devices in the same PCIe boat) interact with -mm given the
limitations of PCIe (see HMM).

Jumping to what is essentially a different platform - there will be
systems where that same GPU HBM memory is now part of the OS-controlled
memory (aka a NUMA node) because these systems have a cache-coherent
link attaching them (could be NVLINK, QPI, CAPI, HT, or something else).
This HBM zone might have CPU cores in it, it might have GPU cores in it,
or an FPGA; it's not necessarily GPU-specific. NVIDIA has talked about
systems that look like this, as has Intel (KNL with flat memory), and
there are likely others. Systems like this can be thought of (just for
example) as a 2-NUMA-node box where you've got 100GB/s of bandwidth on
one node and 1TB/s on the other, connected via some cache-coherent link.
That link is probably on the order of 100GB/s max too (maybe lower, but
certainly not 1TB/s yet).
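
To make the "HBM is just another NUMA node" point concrete, here's a
small sketch (assuming libnuma, and assuming the HBM node may simply
show up as a memory-only node with no CPUs) of how a placement daemon
might enumerate what it has to work with. Build with -lnuma; the output
format is made up for illustration.

/* Sketch: enumerate NUMA nodes, their sizes, and how many CPUs each has. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    struct bitmask *cpus = numa_allocate_cpumask();
    for (int node = 0; node <= numa_max_node(); node++) {
        long long free_bytes;
        long long size = numa_node_size64(node, &free_bytes);
        numa_node_to_cpus(node, cpus);
        /* A CPU-less node here could be the HBM zone described above. */
        printf("node %d: %lld MB total, %lld MB free, %u cpus\n",
               node, size >> 20, free_bytes >> 20,
               numa_bitmask_weight(cpus));
    }
    numa_free_cpumask(cpus);
    return 0;
}
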
Cores (CPU/GPU/FPGA) can access either NUMA node via the coherent link
(just like a multi-socket CPU box), but you also want to be able to
optimize page placement so that hot pages physically get migrated into
the HBM node from the DDR node. The expectation is that on such systems
either the user, a daemon, or kernel/autonuma is going to be migrating
(TH)pages between the NUMA zones to optimize overall system
bandwidth/throughput. Because of the 10x discrepancy in memory
bandwidth, despite the best paging policies to optimize for page
locality in the HBM nodes, pages will often still be moving at a high
rate between zones. This differs from a traditional NUMA system, where
moving a page from one 100GB/s node to the other 100GB/s node has
dubious value, like you say.
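
As a concrete (and hedged) example of the migration piece, below is
roughly what moving a range of anonymous pages from the DDR node to the
HBM node looks like from userspace via move_pages(2). The node numbers
(0 = DDR, 1 = HBM) and page count are assumptions for illustration, and
error handling is minimal. Build with -lnuma.

/* Sketch: migrate a range of already-faulted pages to an assumed HBM node. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t npages = 256;
    size_t len = npages * page_size;

    /* Allocate and touch the buffer so the pages are actually instantiated. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(buf, 0, len);

    void **pages = calloc(npages, sizeof(void *));
    int *nodes   = calloc(npages, sizeof(int));
    int *status  = calloc(npages, sizeof(int));
    for (size_t i = 0; i < npages; i++) {
        pages[i] = buf + i * page_size;
        nodes[i] = 1;                  /* assumed HBM node number */
    }

    /* Ask the kernel to migrate each page to the target node. */
    if (move_pages(0 /* self */, npages, pages, nodes, status,
                   MPOL_MF_MOVE) != 0)
        perror("move_pages");
    else
        printf("first page now on node %d\n", status[0]);
    return 0;
}
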
To your specific question - what workloads benefit from this improved
migration throughput, and why THPs? We have seen that there can be a
1.7x improvement in GPU performance by improving NVLink page migration
bandwidth from 6GB/s to 32.5GB/s. In comparison, 4KB page migration on
x86 over QPI (today) gets < 100MB/s of throughput even though QPI and
NVLink can provide 32GB/s+. We couldn't cripple the link enough to get
down to ~100MB/s, but obviously using small base page sizes at 100MB/s
of migration throughput would kill performance. So good THP
functionality + good migration throughput appear critical to us (and
maybe to KNL too?).
https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
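
For what it's worth, migration throughput in the spirit of the numbers
above can be eyeballed from userspace with something like the sketch
below: back a buffer with THPs via madvise(MADV_HUGEPAGE), then time an
mbind(MPOL_MF_MOVE) of the whole range to another node. The target node,
buffer size, and the use of mbind here are assumptions for illustration;
this is not the methodology behind the numbers quoted above. Build with
-lnuma.

/* Sketch: time migrating a THP-backed buffer to an assumed node 1. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (512UL << 20)   /* 512 MB, assumed to fit in the target node */

int main(void)
{
    char *buf = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(buf, SZ, MADV_HUGEPAGE);   /* ask for THP backing */
    memset(buf, 1, SZ);                /* fault pages in on the local node */

    /* Migrate the whole range to (assumed) node 1 and time it. */
    unsigned long nodemask = 1UL << 1;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (mbind(buf, SZ, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
              MPOL_MF_MOVE) != 0) {
        perror("mbind");
        return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("migrated %lu MB in %.3f s (%.1f MB/s)\n",
           SZ >> 20, secs, (SZ >> 20) / secs);
    return 0;
}
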
There's a laundry list of things needed to make this work well, related
to THPs, migration policy and eviction, NUMA zone isolation, etc. We
hope building this functionality into mm + autonuma is useful for
everyone so that less of it needs to live in proprietary runtimes.

Hope that helps with the 10k-foot view and gives a specific use case for
why at least NVIDIA is very interested in optimizing kernel NUMA+THP
functionality.