Re: [PATCH RFT RFC] usb: xhci: Kill hosts with HCE or HSE on command timeout

From: Desnes Nunes

Date: Tue May 26 2026 - 23:48:22 EST


Hello Michal,

On Sat, May 23, 2026 at 5:28 AM Michal Pecio <michal.pecio@xxxxxxxxx> wrote:
> We can make a guess that the faulting address is the ERST, which
> definitely should be accessible to the host controller.
>
> This simple patch logs ERST allocation and freeing; as far as I see
> nothing else touches that mapping.
>
> If the ERST is somehow freed before starting the HC, that's a bug.

Tested the patch and only saw the allocation messages:

# grep "alloc ERST\|free ERST\|ERST\|Device context\|fault addr" vmcore-dmesg.txt
[ 6.582282] xhci_hcd 0000:00:0d.0: Device context base array address = 0x000000010aca8000 (DMA), 00000000282a4aa1 (virt)
[ 6.582287] xhci_hcd 0000:00:0d.0: alloc ERST at 0x000000010acac000
[ 6.598137] xhci_hcd 0000:00:0d.0: ERST deq = 64'h10acaa000
[ 6.715052] xhci_hcd 0000:80:14.0: Device context base array address = 0x0000000102ec9000 (DMA), 00000000d1c656e7 (virt)
[ 6.715057] xhci_hcd 0000:80:14.0: alloc ERST at 0x0000000102ecd000
[ 6.730919] xhci_hcd 0000:80:14.0: ERST deq = 64'h102ecb000

# grep "alloc ERST\|free ERST\|ERST\|Device context\|fault addr" kexec-dmesg.log
[Tue May 26 08:41:56 2026] DMAR: [DMA Write NO_PASID] Request device [80:1f.6] fault addr 0x106f06000 [fault reason 0x39] SM: Present bit in Root Entry is clear
[Tue May 26 08:41:56 2026] DMAR: [DMA Write NO_PASID] Request device [80:1f.6] fault addr 0x106f19000 [fault reason 0x39] SM: Present bit in Root Entry is clear
[Tue May 26 08:41:57 2026] DMAR: [DMA Write NO_PASID] Request device [80:1f.6] fault addr 0x106f1c000 [fault reason 0x39] SM: Present bit in Root Entry is clear
[Tue May 26 08:42:01 2026] xhci_hcd 0000:00:0d.0: Device context base array address = 0x00000010750bf000 (DMA), 00000000fcec19e7 (virt)
[Tue May 26 08:42:01 2026] xhci_hcd 0000:00:0d.0: alloc ERST at 0x00000010750c5000
[Tue May 26 08:42:01 2026] xhci_hcd 0000:00:0d.0: ERST deq = 64'h10750c3000
[Tue May 26 08:42:01 2026] xhci_hcd 0000:80:14.0: Device context base array address = 0x000000107513c000 (DMA), 000000008803b985 (virt)
[Tue May 26 08:42:01 2026] xhci_hcd 0000:80:14.0: alloc ERST at 0x0000001075140000
[Tue May 26 08:42:01 2026] xhci_hcd 0000:80:14.0: ERST deq = 64'h107513e000
[Tue May 26 08:42:02 2026] DMAR: [DMA Read NO_PASID] Request device [80:14.0] fault addr 0x1075140000 [fault reason 0x39] SM: Present bit in Root Entry is clear

^ PS: The kdump kernel allocates the ERST at different addresses, though.
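
For anyone who wants to cross-check this kind of log mechanically, here is a minimal sketch that counts DMAR faults whose address matches an ERST allocation logged earlier. The helper names (parse_hex_after, count_faults_on_erst) are made up for illustration, and it assumes each dmesg entry sits on a single unwrapped line and uses the "alloc ERST at" wording from Michal's debug patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ERST 16	/* arbitrary cap for this sketch */

/* If `line` contains `key`, parse the hex number right after it into *out. */
static bool parse_hex_after(const char *line, const char *key, uint64_t *out)
{
	const char *p = strstr(line, key);
	char *end;

	if (!p)
		return false;
	p += strlen(key);
	/* Base 16 accepts an optional "0x" prefix. */
	*out = (uint64_t)strtoull(p, &end, 16);
	return end != p;
}

/*
 * Count "fault addr" lines whose address equals an address seen in an
 * earlier "alloc ERST at" line. Allocations logged after a fault are
 * deliberately not considered.
 */
static size_t count_faults_on_erst(const char *const lines[], size_t n)
{
	uint64_t erst[MAX_ERST];
	size_t n_erst = 0, hits = 0;
	uint64_t addr;

	for (size_t i = 0; i < n; i++) {
		if (parse_hex_after(lines[i], "alloc ERST at ", &addr)) {
			if (n_erst < MAX_ERST)
				erst[n_erst++] = addr;
		} else if (parse_hex_after(lines[i], "fault addr ", &addr)) {
			for (size_t j = 0; j < n_erst; j++) {
				if (erst[j] == addr) {
					hits++;
					break;
				}
			}
		}
	}
	return hits;
}
```

On the kexec log above, the only fault this would flag is the DMA read from 80:14.0 at 0x1075140000, which is the address logged for that controller's ERST allocation.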

> Otherwise, it seems you were right that you have some IOMMU problem.

So I have now started investigating on that front. This time I paid
closer attention to these DMAR messages:

[Tue May 19 08:17:49 2026] DMAR: Intel-IOMMU force enabled due to platform opt in
[Tue May 19 08:17:49 2026] DMAR: No RMRR found
[Tue May 19 08:17:49 2026] DMAR: No ATSR found
[Tue May 19 08:17:49 2026] DMAR: dmar0: Using Queued invalidation
=> [Tue May 19 08:17:49 2026] DMAR: Translation already enabled - trying to copy translation structures
=> [Tue May 19 08:17:49 2026] DMAR: Copied translation tables from previous kernel for dmar0
[Tue May 19 08:17:49 2026] DMAR: dmar1: Using Queued invalidation
=> [Tue May 19 08:17:49 2026] DMAR: Translation already enabled - trying to copy translation structures
=> [Tue May 19 08:17:49 2026] DMAR: Copied translation tables from previous kernel for dmar1

I started wondering whether, on my system, these translation tables
can't be fully trusted for some reason during kdump.
Maybe the IOMMU driver is copying root entries with the Present bit
clear, which would then produce fault reason 0x39?
-> Perhaps the ones for bus 0x80? Both the ethernet and the xhci_hcd
faults came from devices on this bus.
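
To make the theory concrete, here is a rough userspace sketch of how a scalable-mode root entry is selected for a given bus/devfn. It is only a model: the struct layout and the split at devfn 0x80 follow my understanding of struct root_entry and iommu_context_addr() in the Intel IOMMU driver, the function names are invented, and the "SM:" prefix in the fault message is what makes me assume scalable mode here.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of one scalable-mode root entry (128 bits per bus).
 * Assumption: bit 0 of each half is that half's present bit.
 */
struct root_entry {
	uint64_t lo;	/* covers devfn 0x00-0x7f */
	uint64_t hi;	/* covers devfn 0x80-0xff */
};

/* Pack PCI device and function numbers into a devfn byte. */
static uint8_t pci_devfn(unsigned int dev, unsigned int fn)
{
	return (uint8_t)(((dev & 0x1f) << 3) | (fn & 0x7));
}

/*
 * Report whether the root-entry half that covers bus:devfn is marked
 * present. `table` must point to 256 entries, indexed by bus number.
 */
static bool sm_root_present(const struct root_entry *table,
			    uint8_t bus, uint8_t devfn)
{
	const struct root_entry *re = &table[bus];
	uint64_t half = (devfn >= 0x80) ? re->hi : re->lo;

	return (half & 1) != 0;
}
```

With this encoding, 80:14.0 and 80:1f.6 map to devfn 0xa0 and 0xfe, so both would be covered by the upper half of the bus 0x80 root entry, while 00:0d.0 (devfn 0x68) falls in the lower half of the bus 0x00 entry. That is just arithmetic on the addresses in the logs, not something I have verified against the copied tables yet.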

So, to test this theory, I tried disabling translation and allocating
a clean root-entry table right away when running a kdump kernel:

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index e236c7ec221f..de673f34f4e1 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2135,24 +2135,31 @@ static int __init init_dmars(void)
 		if (translation_pre_enabled(iommu)) {
 			pr_info("Translation already enabled - trying to copy translation structures\n");
 
-			ret = copy_translation_tables(iommu);
-			if (ret) {
-				/*
-				 * We found the IOMMU with translation
-				 * enabled - but failed to copy over the
-				 * old root-entry table. Try to proceed
-				 * by disabling translation now and
-				 * allocating a clean root-entry table.
-				 * This might cause DMAR faults, but
-				 * probably the dump will still succeed.
-				 */
-				pr_err("Failed to copy translation tables from previous kernel for %s\n",
-				       iommu->name);
+			if (is_kdump_kernel()) {
+				pr_info("DESNES V2 IOMMU kdump kernel, disabilng translation and allocating clean root-entry for %s\n",
+					iommu->name);
 				iommu_disable_translation(iommu);
 				clear_translation_pre_enabled(iommu);
 			} else {
-				pr_info("Copied translation tables from previous kernel for %s\n",
-					iommu->name);
+				ret = copy_translation_tables(iommu);
+				if (ret) {
+					/*
+					 * We found the IOMMU with translation
+					 * enabled - but failed to copy over the
+					 * old root-entry table. Try to proceed
+					 * by disabling translation now and
+					 * allocating a clean root-entry table.
+					 * This might cause DMAR faults, but
+					 * probably the dump will still succeed.
+					 */
+					pr_err("DESNES V2 Failed to copy translation tables from previous kernel for %s\n",
+					       iommu->name);
+					iommu_disable_translation(iommu);
+					clear_translation_pre_enabled(iommu);
+				} else {
+					pr_info("DESNES V2 Copied translation tables from previous kernel for %s\n",
+						iommu->name);
+				}
 			}
 		}

I haven't had time to check the ERST or HSE yet, but with this change I
didn't get any DMAR faults, the vmcore was collected normally, and the
system rebooted smoothly afterwards:

[Tue May 26 22:52:58 2026] DMAR: Intel-IOMMU force enabled due to platform opt in
[Tue May 26 22:52:58 2026] DMAR: No RMRR found
[Tue May 26 22:52:58 2026] DMAR: No ATSR found
[Tue May 26 22:52:58 2026] DMAR: dmar0: Using Queued invalidation
=> [Tue May 26 22:52:58 2026] DMAR: Translation already enabled - trying to copy translation structures
=> [Tue May 26 22:52:58 2026] DMAR: DESNES V2 IOMMU kdump kernel, disabilng translation and allocating clean root-entry for dmar0
[Tue May 26 22:52:58 2026] DMAR: dmar1: Using Queued invalidation
=> [Tue May 26 22:52:58 2026] DMAR: Translation already enabled - trying to copy translation structures
=> [Tue May 26 22:52:58 2026] DMAR: DESNES V2 IOMMU kdump kernel, disabilng translation and allocating clean root-entry for dmar1

This seems like a lead on the IOMMU front.

The funny thing is that the comment in this section literally says that
doing this could cause faults, but here clearing the tables actually
seemed to resolve them and let kdump succeed. See commit
091d42e43d21b6ca7ec39bf5f9e17bc0bd8d4312 ("iommu/vt-d: Copy
translation tables from old kernel").

Let me run some more tests to dump and check the root-entry table
before clearing it, as well as to check the ERST allocations and the
HSE value, and I'll get back to you, Michal.

Best Regards,

Desnes