Re: Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings

From: Kuehling, Felix

Date: Thu Jun 25 2026 - 16:51:47 EST


If there are MES queue eviction failures, then the root cause is most likely an MES firmware problem or some bug in the driver's interaction with MES. Your application dies in the GPU reset that follows. The MMU notifier handling and the THP change are not the root cause; they are only what happens to trigger the MES problem. The same thing could happen with NUMA migrations, applications forking, or applications being terminated with Ctrl+C. In all of these scenarios the driver depends on MES to preempt the user mode queues before the MMU notifier returns.
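To make that ordering dependency concrete, here is a minimal userspace C model, not driver code. Every name in it (`model_invalidate`, `mes_preempt_fn`, the trigger enum) is hypothetical. It only encodes the claim above: whatever caused the invalidation, the callback can only return normally once MES has preempted the queues, and a failed preemption is followed by a GPU reset.

```c
#include <stdbool.h>

/* Hypothetical triggers; from the driver's side each one arrives as the same
 * MMU interval notifier invalidation. */
enum inval_trigger {
    TRIG_THP_COLLAPSE,
    TRIG_NUMA_MIGRATION,
    TRIG_FORK,
    TRIG_PROCESS_EXIT,
};

enum inval_result {
    INVAL_OK,        /* queues preempted, the callback may return */
    INVAL_GPU_RESET, /* preemption failed, recovery falls back to a reset */
};

/* Stand-in for the MES round trip: true if the firmware acknowledged the
 * preemption request (e.g. REMOVE_QUEUE / SUSPEND) in time. */
typedef bool (*mes_preempt_fn)(void);

/* Model of the contract: the outcome depends only on whether the queues
 * could be preempted. The trigger is deliberately ignored. */
static enum inval_result model_invalidate(enum inval_trigger trigger,
                                          mes_preempt_fn mes_preempt)
{
    (void)trigger;
    return mes_preempt() ? INVAL_OK : INVAL_GPU_RESET;
}

/* Two fake firmware behaviours for exercising the model. */
static bool mes_responds(void) { return true; }
static bool mes_times_out(void) { return false; }
```

In this model `model_invalidate(TRIG_FORK, mes_times_out)` ends in a reset exactly like `TRIG_THP_COLLAPSE` does, which is the point being made: suppressing one trigger leaves the others.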

Regards,
  Felix


On 2026-06-25 09:06, Christian König wrote:
Hi Yitao,

adding Philip Yang.

Thanks for the investigation, that sounds like some kind of bug in the KFD SVM handling. The driver should be perfectly capable of handling this.

I strongly suggest opening a bug report for ROCm that describes how to reproduce this; Philip can probably point you to the right place for that.

Regards,
Christian.

On 6/25/26 15:01, 蒋 亦韬 wrote:
Hi Christian,

I agree that my previous approach was wrong. Sorry about that. Please let me clarify the problem I was seeing and how I ended up with that incorrect conclusion.

The original problem was not a synthetic THP test. I was running ROCm/PyTorch ML training on an AMD Radeon 780M system, and the workload frequently failed with asynchronous HIP kernel launch failures. The userspace error usually surfaced later in PyTorch, for example around a copy/to_device/SetDevice path, but the kernel log showed GPU resets and KFD/MES queue eviction failures.

The relevant kernel messages I repeatedly saw were along these lines:

  MES failed to respond to msg=REMOVE_QUEUE
  MES failed to respond to msg=SUSPEND
  failed to suspend all gangs
  failed to remove hardware queue from MES
  Failed to evict queue
  Failed to evict process queues
  GPU reset begin

While trying to reduce the issue, I saw memory invalidations and THP-related page-table/backing-page activity driving the AMDGPU/KFD path through SVM eviction. On this system, the path I was looking at was roughly:

  svm_range_cpu_invalidate_pagetables()
    -> svm_range_evict()
    -> kgd2kfd_quiesce_mm()
    -> KFD process queue eviction
    -> MES REMOVE_QUEUE / SUSPEND

One thing that misled me was the XNACK-disabled path. Since the issue appeared on an XNACK-disabled APU, and that path requires queue eviction/quiesce when CPU page table invalidations affect GPU mappings, I incorrectly thought the backing-page change itself was something the driver had to prevent.
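A rough model of the branch described above, written as a userspace C sketch rather than the actual kfd_svm.c logic. The step strings mirror the call chain quoted earlier. The condition in `model_needs_queue_eviction()` (queue eviction when XNACK is off or the range is GPU_ALWAYS_MAPPED, a plain GPU unmap otherwise) is an assumption based on the range types the cover letter lists, not something checked against the driver source; all function names here are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Illustrative step labels taken from the call chain above. */
static const char *const COMMON_STEPS[] = {
    "svm_range_cpu_invalidate_pagetables",
    "svm_range_evict",
};
static const char *const EVICT_STEPS[] = {
    "kgd2kfd_quiesce_mm",
    "KFD process queue eviction",
    "MES REMOVE_QUEUE / SUSPEND",
};
static const char *const UNMAP_STEPS[] = {
    /* With replayable faults the GPU can fault the range back in later. */
    "unmap range from GPU page tables",
};

/* Assumed condition: the GPU cannot replay faults (XNACK off), or the range
 * is expected to stay mapped, so the queues must be stopped first. */
static bool model_needs_queue_eviction(bool xnack_enabled, bool always_mapped)
{
    return !xnack_enabled || always_mapped;
}

/* Appends up to count steps to out[], never exceeding cap entries. */
static size_t append(const char *out[], size_t n, size_t cap,
                     const char *const steps[], size_t count)
{
    for (size_t i = 0; i < count && n < cap; i++)
        out[n++] = steps[i];
    return n;
}

/* Writes the modeled steps for one CPU invalidation into out[] and returns
 * how many were written. */
static size_t model_invalidate_steps(bool xnack_enabled, bool always_mapped,
                                     const char *out[], size_t cap)
{
    size_t n = append(out, 0, cap, COMMON_STEPS, ARRAY_LEN(COMMON_STEPS));

    if (model_needs_queue_eviction(xnack_enabled, always_mapped))
        return append(out, n, cap, EVICT_STEPS, ARRAY_LEN(EVICT_STEPS));
    return append(out, n, cap, UNMAP_STEPS, ARRAY_LEN(UNMAP_STEPS));
}
```

On an XNACK-disabled APU every invalidation of a GPU-mapped range takes the longer path that ends in the MES requests, which is why a firmware problem there shows up regardless of what caused the invalidation.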

Another thing that misled me was that the application was not intentionally asking for THP behavior. From the workload’s point of view, these page transitions looked unrelated to the model computation. I therefore incorrectly assumed that userspace should not be able to change backing-page characteristics in a way that affects a driver mapping already registered with MMU interval notifiers. I now understand from the MM feedback that this is expected behavior, and that the notifier user must handle unmap/remap correctly.

So the more precise problem is that THP/remap is only one way to trigger the invalidation path. What is failing for my workload is the AMDGPU/KFD/MES queue quiesce/eviction path during those invalidations. When that fails, the GPU resets, and userspace later observes an asynchronous HIP failure.

Please allow me to continue investigating a more appropriate fix for this problem. I will try to keep the fix boundary within AMDGPU/KFD/MES and avoid changing MM-core or THP policy semantics.

Regards,
Yitao
----------------------------------------------------------------------
*From:* Christian König <christian.koenig@xxxxxxx>
*Sent:* June 25, 2026 8:35
*To:* Yitao Jiang <jytscientist@xxxxxxxxxxx>; Alex Deucher <alexander.deucher@xxxxxxx>; David Airlie <airlied@xxxxxxxxx>; Simona Vetter <simona@xxxxxxxx>; Felix Kuehling <Felix.Kuehling@xxxxxxx>; Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>; David Hildenbrand <david@xxxxxxxxxx>; Lorenzo Stoakes <ljs@xxxxxxxxxx>
*Cc:* Zi Yan <ziy@xxxxxxxxxx>; Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>; Liam R . Howlett <liam@xxxxxxxxxxxxx>; Nico Pache <npache@xxxxxxxxxx>; Ryan Roberts <ryan.roberts@xxxxxxx>; Dev Jain <dev.jain@xxxxxxx>; Barry Song <baohua@xxxxxxxxxx>; Lance Yang <lance.yang@xxxxxxxxx>; Vlastimil Babka <vbabka@xxxxxxxxxx>; Mike Rapoport <rppt@xxxxxxxxxx>; Suren Baghdasaryan <surenb@xxxxxxxxxx>; Michal Hocko <mhocko@xxxxxxxx>; Jann Horn <jannh@xxxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; dri-devel@xxxxxxxxxxxxxxxxxxxxx <dri-devel@xxxxxxxxxxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx <linux-kernel@xxxxxxxxxxxxxxx>; linux-mm@xxxxxxxxx <linux-mm@xxxxxxxxx>
*Subject:* Re: [PATCH 0/3] mm/mmu_notifier, drm/amdgpu: block THP for GPU user mappings
On 6/25/26 12:59, Yitao Jiang wrote:
Hi,

This series fixes a THP policy problem I found while debugging
frequent ROCm GPU failures on an AMD Radeon 780M system during ML
training.

Some AMDGPU/KFD user mappings are registered through interval
notifiers and cannot safely tolerate the backing VMA changing from base
pages to a transparent huge page after registration.
That's certainly not correct. This is a must-have for a whole lot of use cases.

Why exactly isn't that working for your use case?

Regards,
Christian.

Userspace can
still apply MADV_HUGEPAGE or MADV_COLLAPSE, and khugepaged can also
collapse the range, after the GPU mapping has been registered.
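For illustration, a small Linux userspace C program doing what this paragraph describes: it maps anonymous memory, faults it in, and then asks for THP with MADV_HUGEPAGE followed by MADV_COLLAPSE. The helper name `request_thp` is hypothetical, and the `MADV_COLLAPSE` fallback value of 25 is taken from the generic Linux UAPI header (the flag exists since 6.1). The outcome depends on the kernel's THP configuration, so the function reports the errno instead of assuming success.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* value from the generic Linux UAPI header (6.1+) */
#endif

/* Maps len bytes of anonymous memory, faults it in, then asks for THP twice:
 * first as a hint (MADV_HUGEPAGE), then as a synchronous collapse
 * (MADV_COLLAPSE). Returns 0 on success or the errno of the first failure. */
static int request_thp(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    memset(p, 0xab, len); /* fault the range in before asking for a collapse */

    int err = 0;
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        err = errno;  /* e.g. EINVAL when the kernel has no THP support */
    else if (madvise(p, len, MADV_COLLAPSE) != 0)
        err = errno;  /* e.g. EINVAL before 6.1, EAGAIN or ENOMEM if no huge page */

    munmap(p, len);
    return err;
}
```

Neither call needs special privileges, so a driver holding an interval notifier on such a range cannot assume its backing pages stay the same size; it has to handle the resulting invalidation.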

On my system this showed up as asynchronous ROCm/HIP kernel launch
failures, often reported later at a synchronization or copy point. I
expect the issue to be relevant to AMDGPU/KFD mappings on
XNACK-disabled GPUs more generally, because those mappings cannot rely
on replayable GPU faults after a CPU-side THP remap. I have validated
the failure and fix on AMD Radeon 780M / gfx1103.

Patch 1 adds MMU_INTERVAL_NOTIFIER_BLOCK_THP so interval notifier
users can ask the MM core to keep the covered VMA range out of THP
while the notifier is active. The MM core applies VM_NOHUGEPAGE and
clears VM_HUGEPAGE under mmap_lock for write. A later MADV_HUGEPAGE
over an active opt-in range is treated as an ignored hint, and
MADV_COLLAPSE is rejected by the existing VM_NOHUGEPAGE checks.
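The flag semantics proposed in this paragraph can be modeled in a few lines of userspace C. This sketches the proposal as worded here, not the patch itself: `struct model_vma`, the `MODEL_VM_*` bits, and the helper names are hypothetical, locking is not modeled, and the `-EINVAL` from the collapse path is an assumption about how the existing VM_NOHUGEPAGE check reports the rejection.

```c
#include <errno.h>

/* Hypothetical bit values; the kernel's real VM_HUGEPAGE and VM_NOHUGEPAGE
 * values are intentionally not used. */
#define MODEL_VM_HUGEPAGE   (1UL << 0)
#define MODEL_VM_NOHUGEPAGE (1UL << 1)

struct model_vma {
    unsigned long flags;
    int thp_blockers; /* active opt-in interval notifiers covering the VMA */
};

/* Opt-in notifier registration (the proposal does this under mmap_lock for
 * write). Restoring the old flags on unregistration is not modeled. */
static void model_block_thp(struct model_vma *vma)
{
    vma->thp_blockers++;
    vma->flags |= MODEL_VM_NOHUGEPAGE;
    vma->flags &= ~MODEL_VM_HUGEPAGE;
}

/* MADV_HUGEPAGE: normally swaps NOHUGEPAGE for HUGEPAGE, but is treated as
 * an ignored hint while an opt-in notifier is active. */
static int model_madv_hugepage(struct model_vma *vma)
{
    if (vma->thp_blockers > 0)
        return 0; /* accepted, but the flags are left untouched */
    vma->flags &= ~MODEL_VM_NOHUGEPAGE;
    vma->flags |= MODEL_VM_HUGEPAGE;
    return 0;
}

/* MADV_COLLAPSE: refused for VM_NOHUGEPAGE VMAs (assumed -EINVAL). */
static int model_madv_collapse(const struct model_vma *vma)
{
    return (vma->flags & MODEL_VM_NOHUGEPAGE) ? -EINVAL : 0;
}
```

The model also makes the reviewers' objection visible: the opt-in only removes one source of invalidations for the covered VMA, while the queue preemption path is still exercised by every other trigger.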

Patches 2 and 3 opt in the AMDGPU/KFD paths that need this behavior:
HSA userptr BOs, KFD SVM ranges when XNACK is disabled, and
GPU_ALWAYS_MAPPED SVM ranges. Other interval notifier users keep their
current behavior.

This does not disable THP globally and does not add work to GPU
command submission or kernel launch paths. Additional work is limited
to opt-in notifier registration, opt-in notifier flag transitions, and
MADV_HUGEPAGE attempts that overlap an active opt-in range.

I tested this on top of torvalds/linux commit ab9de95c9cf9 with:

   - scripts/checkpatch.pl --strict --no-tree
   - git apply --check
   - x86_64 defconfig build with TRANSPARENT_HUGEPAGE=y,
     DRM_AMDGPU=m, and HSA_AMD=y for mm/ and AMDGPU/KFD objects
   - standalone HSA/HIP reproducers and the ROCm/PyTorch workload that
     originally exposed the failure on my Radeon 780M system

The standalone reproducers depend on ROCm userspace libraries, so I
have not included them in this series. I can send them separately if
useful.

This series was prepared with assistance from OpenAI Codex (GPT-5.5).
I reviewed the resulting code and take responsibility for the
submission.

Yitao Jiang (3):
   mm/mmu_notifier: let interval notifiers block THP
   drm/amdgpu: block THP for HSA userptr notifiers
   drm/amdkfd: block THP for non-replayable SVM ranges

  drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c |  25 ++-
  drivers/gpu/drm/amd/amdkfd/kfd_svm.c    |  36 ++++-
  include/linux/huge_mm.h                 |   5 +-
  include/linux/mmu_notifier.h            |  28 ++++
  mm/khugepaged.c                         |   9 +-
  mm/madvise.c                            |   3 +-
  mm/mmu_notifier.c                       | 204 +++++++++++++++++++++++-
  7 files changed, 286 insertions(+), 24 deletions(-)


base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
--
2.53.0