Re: [PATCH v2 2/2] iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode

From: Baolu Lu

Date: Mon Dec 22 2025 - 23:05:43 EST


On 12/22/25 19:19, Jinhui Guo wrote:
On Thu, Dec 18, 2025 at 08:04:20AM +0000, Tian, Kevin wrote:
From: Jinhui Guo<guojinhui.liam@xxxxxxxxxxxxx>
Sent: Thursday, December 11, 2025 12:00 PM

Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation
request when device is disconnected") relies on
pci_dev_is_disconnected() to skip ATS invalidation for
safely-removed devices, but it does not cover link-down caused
by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup
with surprise removal. For safe removal the device is not removed
before invalidation is done:

"
For safe removal, device wouldn't be removed until the whole software
handling process is done, it wouldn't trigger the hard lock up issue
caused by too long ATS Invalidation timeout wait.
"

Can you help articulate the problem, especially the part about
'link-down caused by faults'? What are those faults? How do they
differ from the surprise removal described in the commit msg, such
that pci_dev_is_disconnected() is not set?

Hi Kevin, sorry for the delayed reply.

A normal or surprise removal of a PCIe device on a hot-plug port normally
triggers an interrupt from the PCIe switch.

We have, however, observed cases where no interrupt is generated when the
device suddenly loses its link; the behaviour is identical to setting the
Link Disable bit in the switch’s Link Control register (offset 10h). Exactly
what goes wrong in the LTSSM between the PCIe switch and the endpoint remains
unknown.

In this scenario, the hardware has effectively vanished, yet the device
driver remains bound and the IOMMU resources haven't been released. Could
this stale state trigger issues in other places before the kernel fully
realizes the device is gone? I'm not objecting to the fix; I'm just
interested in whether this 'zombie' state creates risks elsewhere.
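To make the gap concrete, here is a simplified user-space model of the two checks (the struct and helpers are stand-ins, not the kernel API): pci_dev_is_disconnected() only reflects a software flag set by the hotplug path, while pci_device_is_present() actually probes config space, which fails when the link is silently down.

```c
#include <stdbool.h>

/* Stand-in for struct pci_dev; only the fields this sketch needs. */
struct fake_pci_dev {
	bool disconnected_flag;	/* set by pciehp on surprise removal  */
	bool link_up;		/* actual electrical link state       */
};

/* Models pci_dev_is_disconnected(): tests a software flag only. */
static bool dev_is_disconnected(const struct fake_pci_dev *dev)
{
	return dev->disconnected_flag;
}

/* Models pci_device_is_present(): a config-space read fails when
 * the link is down, whether or not a hotplug interrupt ever fired. */
static bool device_is_present(const struct fake_pci_dev *dev)
{
	return dev->link_up;
}
```

In the silent link-down case described above, no interrupt fires, so the flag stays false and the flag-based check lets the flush proceed; only the presence check catches it.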


For example, if a VM fails to connect to the PCIe device,
'failed' for what reason?

"virsh destroy" is executed to release resources and isolate
the fault, but a hard-lockup occurs while releasing the group fd.

Call Trace:
qi_submit_sync
qi_flush_dev_iotlb
intel_pasid_tear_down_entry
device_block_translation
blocking_domain_attach_dev
__iommu_attach_device
__iommu_device_set_domain
__iommu_group_set_domain_internal
iommu_detach_group
vfio_iommu_type1_detach_group
vfio_group_detach_container
vfio_group_fops_release
__fput

Although pci_device_is_present() is slower than
pci_dev_is_disconnected(), it still takes only ~70 µs on a
ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed
and width increase.

Besides, devtlb_invalidation_with_pasid() is called only in the
paths below, which are far less frequent than memory map/unmap.

1. mm-struct release
2. {attach,release}_dev
3. set/remove PASID
4. dirty-tracking setup
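The guard being proposed, reduced to a stand-alone sketch (devtlb_flush(), device_present(), and the counter below are stand-ins for the kernel functions, not the actual patch):

```c
#include <stdbool.h>

/* Counts how many (simulated) dev-IOTLB flushes were actually issued. */
static int flushes_issued;

static void devtlb_flush(void)	/* stand-in for qi_flush_dev_iotlb() */
{
	flushes_issued++;
}

/* Stand-in for pci_device_is_present(): ~70 us of config-space reads
 * in the real kernel, cheap relative to these four rare call paths. */
static bool device_present(bool link_up)
{
	return link_up;
}

/* Sketch of the guarded devtlb_invalidation_with_pasid() logic: skip
 * the ATS invalidation when the device is unreachable, so the queue
 * never waits on a completion the hardware can no longer send. */
static void devtlb_invalidation(bool link_up)
{
	if (!device_present(link_up))
		return;	/* device gone: flushing would hang */
	devtlb_flush();
}
```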

surprise removal can happen at any time, e.g. after the check of
pci_device_is_present(). In the end we need the logic in
qi_check_fault() to check the presence when an ITE timeout error
is received, to break the infinite loop. So in your case, even with
that logic in place, you still observe a lockup (probably because the
hardware ITE timeout is longer than the lockup detection window on
the CPU?)
Are you referring to the timeout added in patch
https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@xxxxxxxxxxxxxxx/ ?

This doesn't appear to be a deterministic solution, because ...

Our lockup-detection timeout is the default 10 s.

We see ITE-timeout messages in the kernel log. Yet the system still
hard-locks, probably because, as you mentioned, the hardware ITE timeout
is longer than the CPU's lockup-detection window. I'll reproduce the
case and follow up with a deeper analysis.

... as you see, neither the PCI nor the VT-d specifications mandate a
specific device-TLB invalidation timeout value for hardware
implementations. Consequently, the ITE timeout value may exceed the CPU
watchdog threshold, meaning a hard lockup will be detected before the
ITE even occurs.
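The race above boils down to a comparison of two unrelated deadlines; a toy model (function and enum names are made up for illustration, and the 10000 ms default matches the watchdog threshold mentioned earlier):

```c
enum wait_result { GOT_ITE_FAULT, HARD_LOCKUP };

/* Toy model of the qi_submit_sync() wait: the CPU spins until either
 * the hardware raises an ITE fault (after hw_ite_ms) or the hard-lockup
 * detector fires (after watchdog_ms). Since neither the PCIe nor the
 * VT-d spec bounds hw_ite_ms, whichever deadline is shorter wins. */
static enum wait_result wait_for_invalidation(unsigned hw_ite_ms,
					      unsigned watchdog_ms)
{
	return hw_ite_ms <= watchdog_ms ? GOT_ITE_FAULT : HARD_LOCKUP;
}
```

So the qi_check_fault() path only helps when the device signals ITE inside the watchdog window; an implementation with a longer (or unbounded) timeout still hard-locks the CPU.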

Thanks,
baolu