Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
From: Christian König
Date: Thu Jun 25 2026 - 08:38:38 EST

On 6/25/26 12:59, Yitao Jiang wrote:
> Hi,
>
> This series fixes a THP policy problem I found while debugging
> frequent ROCm GPU failures on an AMD Radeon 780M system during ML
> training.
>
> Some AMDGPU/KFD user mappings are registered through interval
> notifiers and cannot safely tolerate the backing VMA changing from base
> pages to a transparent huge page after registration.

That's certainly not correct. This is a must-have for a whole lot of use cases.

Why exactly isn't that working for your use case?

Regards,
Christian.

> Userspace can
> still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
> collapse the range, after the GPU mapping has been registered.
>
> On my system this showed up as asynchronous ROCm/HIP kernel launch
> failures, often reported later at a synchronization or copy point. I
> expect the issue to be relevant to AMDGPU/KFD mappings on
> XNACK-disabled GPUs more generally, because those mappings cannot rely
> on replayable GPU faults after a CPU-side THP remap. I have validated
> the failure and fix on AMD Radeon 780M / gfx1103.
>
> Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
> users can ask the MM core to keep the covered VMA range out of THP
> while the notifier is active. The MM core applies VM_NOHUGEPAGE and
> clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
> over an active opt-in range is treated as an ignored hint, and
> MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
>
> Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
> HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
> GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
> current behavior.
>
> This does not disable THP globally and does not add work to GPU
> command submission or kernel launch paths. Additional work is limited
> to opt-in notifier registration, opt-in notifier flag transitions, and
> MADV_HUGEPAGE attempts that overlap an active opt-in range.
>
> I tested this on top of torvalds/linux commit ab9de95c9cf9 with:
>
> - scripts/checkpatch.pl --strict --no-tree
> - git apply --check
> - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
>   DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
> - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
>   originally exposed the failure on my Radeon 780M system
>
> The standalone reproducers depend on ROCm userspace libraries, so I
> have not included them in this series. I can send them separately if
> useful.
>
> This series was prepared with assistance from OpenAI Codex (GPT-5.5).
> I reviewed the resulting code and take responsibility for the
> submission.
>
> Yitao Jiang (3):
>   mm/mmu_notifier: let interval notifiers block THP
>   drm/amdgpu: block THP for HSA userptr notifiers
>   drm/amdkfd: block THP for non-replayable SVM ranges
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
>  include/linux/huge_mm.h                 |   5 +-
>  include/linux/mmu_notifier.h            |  28 ++++
>  mm/khugepaged.c                         |   9 +-
>  mm/madvise.c                            |   3 +-
>  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
>  7 files changed, 286 insertions(+), 24 deletions(-)
>
>
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.53.0
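
The flag handling that the quoted cover letter describes for patch 1 can be modelled in a few lines of userspace C. This is only a sketch of the described policy, not code from the series: the struct, the `model_*` function names, the `thp_blocked` field and the bit values are all hypothetical, and -EINVAL is used as a stand-in because the cover letter does not say which error MADV_COLLAPSE returns.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative bit values only; the real flags are defined by the kernel. */
#define VM_HUGEPAGE   0x1UL
#define VM_NOHUGEPAGE 0x2UL

/* Hypothetical stand-in for the part of a VMA that matters here. */
struct vma_model {
	unsigned long vm_flags;
	bool thp_blocked;	/* an opt-in interval notifier covers the range */
};

/* Registration of an opt-in notifier: set VM_NOHUGEPAGE, clear VM_HUGEPAGE. */
static void model_block_thp(struct vma_model *vma)
{
	vma->vm_flags &= ~VM_HUGEPAGE;
	vma->vm_flags |= VM_NOHUGEPAGE;
	vma->thp_blocked = true;
}

/* MADV_HUGEPAGE: succeeds, but is an ignored hint on a blocked range. */
static int model_madv_hugepage(struct vma_model *vma)
{
	if (vma->thp_blocked)
		return 0;
	vma->vm_flags &= ~VM_NOHUGEPAGE;
	vma->vm_flags |= VM_HUGEPAGE;
	return 0;
}

/* MADV_COLLAPSE: rejected whenever VM_NOHUGEPAGE is set (assumed -EINVAL). */
static int model_madv_collapse(const struct vma_model *vma)
{
	return (vma->vm_flags & VM_NOHUGEPAGE) ? -EINVAL : 0;
}
```

In this model a range that already had VM_HUGEPAGE loses it at registration, a later MADV_HUGEPAGE returns success without changing the flags, and MADV_COLLAPSE fails. A range with no opt-in notifier keeps today's behaviour.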