Re: [RFC PATCH v2 0/5] Reduce NUMA balance caused TLB-shootdowns in a VM
From: John Hubbard
Date: Fri Aug 11 2023 - 15:35:51 EST
On 8/11/23 11:39, David Hildenbrand wrote:
...
> Should we want to disable NUMA hinting for such VMAs instead (for
> example, by the QEMU/hypervisor, which knows that any NUMA hinting
> activity on these ranges would be a complete waste of time)? I recall
> that John H. once mentioned that there are similar issues with GPU
> memory: NUMA hinting is actually counter-productive, and they end up
> disabling it.
Yes, NUMA balancing is incredibly harmful to performance for GPUs and
accelerators that map memory... and VMs as well, it seems. Basically,
anything that has its own processors and page tables needs to be left
strictly alone by NUMA balancing, because the kernel is (still, even
today) unaware of what those processors are doing, and so it has no way
to do productive NUMA balancing.
> Is there any existing way we could handle that better on a per-VMA
> level, or on the process level? Any magic toggles?
>
> MMF_HAS_PINNED might be too restrictive. MMF_HAS_PINNED_LONGTERM might
> be better, but with things like io_uring it would eventually still be
> too restrictive.
>
> I recall that setting a mempolicy can prevent auto-NUMA from becoming
> active, but that might be undesirable.
>
> CCing Mel.
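(To make the process-level idea concrete, here is a rough illustration
only, not existing kernel code: a gate keyed off the mm flags. Only
MMF_HAS_PINNED exists today; the narrower MMF_HAS_PINNED_LONGTERM
mentioned above would slot in the same way, but as noted it would
likely still be too coarse.)

#include <linux/bitops.h>
#include <linux/mm_types.h>
#include <linux/sched/coredump.h>	/* MMF_HAS_PINNED */

/*
 * Illustration only: a process-wide opt-out check that something like
 * task_numa_work() could consult before scanning any VMAs.
 */
static bool mm_numab_opted_out(struct mm_struct *mm)
{
	/*
	 * MMF_HAS_PINNED is set on the first FOLL_PIN pin and never
	 * cleared, so on its own this is too coarse: one transient pin
	 * (e.g. O_DIRECT I/O) would disable balancing for the rest of
	 * the process lifetime.
	 */
	return test_bit(MMF_HAS_PINNED, &mm->flags);
}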
Let's distinguish between page-pinning situations and HMM-style
situations.

Pinning of CPU memory is unnecessary when setting that memory up for
use by modern GPUs or accelerators, because those devices can handle
replayable page faults. So in such cases the pages are in use by a GPU
or accelerator, but unpinned.
The performance problem occurs because NUMA balancing unmaps those
pages, which generates callbacks to the device driver, which dutifully
unmaps the pages from the GPU or accelerator, even though the GPU may
be busy using them. The device promptly takes a page fault, and the
driver then re-establishes the device page table mapping, which holds
only until the next round of unmapping from the NUMA balancer.
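To make that round trip concrete, here is a rough sketch shaped like
the pattern documented in Documentation/mm/hmm.rst. The my_* struct
and helpers are made-up stand-ins for a real driver's device page
table code, not anything that exists today:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

struct my_range {
	struct mmu_interval_notifier	notifier;
	struct mm_struct		*mm;
	struct mutex			lock;	/* protects device PTEs */
};

/* Hypothetical device-side helpers standing in for real GPU PTE updates. */
void my_device_unmap(struct my_range *mr, unsigned long start, unsigned long end);
void my_device_map(struct my_range *mr, unsigned long addr, unsigned long pfn);

/* Invoked when, e.g., the NUMA balancer changes protections on the range. */
static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	struct my_range *mr = container_of(mni, struct my_range, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&mr->lock);
	mmu_interval_set_seq(mni, cur_seq);
	/* Tear down the device PTEs; a busy device will immediately refault. */
	my_device_unmap(mr, range->start, range->end);
	mutex_unlock(&mr->lock);
	return true;
}

static const struct mmu_interval_notifier_ops my_mni_ops = {
	.invalidate = my_invalidate,
};

/* Device page fault handler: (re)establish the mapping for one page. */
static int my_handle_device_fault(struct my_range *mr, unsigned long addr)
{
	unsigned long pfn;
	struct hmm_range range = {
		.notifier	= &mr->notifier,
		.start		= addr & PAGE_MASK,
		.end		= (addr & PAGE_MASK) + PAGE_SIZE,
		.hmm_pfns	= &pfn,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(&mr->notifier);

	mmap_read_lock(mr->mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mr->mm);
	if (ret) {
		if (ret == -EBUSY)	/* raced with an invalidation */
			goto again;
		return ret;
	}

	mutex_lock(&mr->lock);
	if (mmu_interval_read_retry(&mr->notifier, range.notifier_seq)) {
		mutex_unlock(&mr->lock);
		goto again;
	}
	/* Write device PTEs from pfn; valid until the next invalidation. */
	my_device_map(mr, addr, pfn);
	mutex_unlock(&mr->lock);
	return 0;
}

Every NUMA balancing pass on such a range just spins that loop: device
unmap, device fault, hmm_range_fault(), device map, with nothing gained.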
hmm_range_fault()-based memory management in particular might benefit
from having NUMA balancing disabled entirely for the memremap_pages()
region, come to think of it. That seems relatively easy and clean at
first glance anyway.
For other regions (allocated by the device driver), a per-VMA flag
seems about right: VM_NO_NUMA_BALANCING?
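As a sketch of what that could look like (VM_NO_NUMA_BALANCING does not
exist, it is only being proposed here, so the flag definition below is a
placeholder): the VMA scan in task_numa_work() in kernel/sched/fair.c,
which already skips non-migratable, hugetlb, and VM_MIXEDMAP mappings,
would simply also honor the new opt-out:

#include <linux/bits.h>
#include <linux/hugetlb.h>
#include <linux/mempolicy.h>
#include <linux/mm.h>

/* Hypothetical flag; a real patch would claim a free VM_* bit (64-bit only here). */
#define VM_NO_NUMA_BALANCING	BIT(40)

/*
 * Would be called from the VMA scan in task_numa_work(); true means
 * "do not set up NUMA hinting faults for this VMA".
 */
static bool vma_skip_numa_balancing(struct vm_area_struct *vma)
{
	/* Filters the scanner already applies today. */
	if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
	    is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP))
		return true;

	/*
	 * Proposed addition: an explicit per-VMA opt-out, set for
	 * example by a GPU driver on hmm_range_fault()-managed ranges,
	 * or by QEMU on guest memory where hinting is known to be a
	 * waste of effort.
	 */
	return !!(vma->vm_flags & VM_NO_NUMA_BALANCING);
}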
thanks,
--
John Hubbard
NVIDIA