RE: [PATCH rc v6] iommu: Fix nested pci_dev_reset_iommu_prepare/done()

From: Tian, Kevin

Date: Fri Apr 17 2026 - 04:26:43 EST


> From: Nicolin Chen <nicolinc@xxxxxxxxxx>
> Sent: Wednesday, April 8, 2026 3:47 AM
>
> Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
> internally while both are calling pci_dev_reset_iommu_prepare/done().
>
> As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
> will trigger a WARN_ON and return -EBUSY, resulting in failing the entire
> device reset.
>
> On the other hand, removing the outer calls in the PCI callers is unsafe.
> As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
> execute custom firmware waits after their inner pcie_flr() completes. If
> the IOMMU protection relies solely on the inner reset, the IOMMU will be
> unblocked prematurely while the device is still resetting.
>
> Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.
>
> Given the IOMMU core tracks the resetting state per iommu_group while the
> reset is per device, this has to track at the group_device level as well.
>
> Introduce a 'reset_depth' and a 'blocked' flag to struct group_device, to
> handle the re-entries on the same device. This allows multi-device groups
> to isolate concurrent device resets independently.
>
> Note that iommu_deferred_attach() and iommu_driver_get_domain_for_dev()
> both now check the per-device 'gdev->blocked' flag instead of a per-group
> flag like 'group->resetting_domain'. This is actually more precise. Also,
> this 'gdev->blocked' will be useful in the future work to flag the device
> blocked by an ongoing/failed reset or quarantine.
>
> As the reset routine is per gdev, it cannot clear group->resetting_domain
> without iterating over the device list to ensure no other device is being
> reset. Simplify it by replacing the resetting_domain with a 'recovery_cnt'
> in the struct iommu_group.
>
> Since both helpers are now per gdev, call the per-device set_dev_pasid op
> to recover PASID domains. And add 'max_pasids > 0' checks in both helpers.
>
> Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
> Cc: stable@xxxxxxxxxxxxxxx
> Reported-by: Shuai Xue <xueshuai@xxxxxxxxxxxxxxxxx>
> Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
> Suggested-by: Kevin Tian <kevin.tian@xxxxxxxxx>
> Signed-off-by: Nicolin Chen <nicolinc@xxxxxxxxxx>

I still have a question about whether iommu_driver_get_domain_for_dev() is
actually required, but that's orthogonal to what this patch fixes.

btw, Sashiko [1] raised several comments.

one is that iommu_detach_device_pasid() is not blocked, which can trigger
a devTLB invalidation in the middle of a reset. But since it cannot fail,
the right fix is to skip the blocked device in __iommu_remove_group_pasid().
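
For illustration, a rough user-space sketch of that suggested skip. The
struct layout and names below are made up for the example; the real
__iommu_remove_group_pasid() walks the group's device list and invokes the
driver's remove_dev_pasid op, which is what issues the devTLB invalidation:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a group_device: 'blocked' stands in
 * for the per-device flag set by pci_dev_reset_iommu_prepare(), and
 * 'pasid_detached' counts how many times the (modeled) remove_dev_pasid
 * op was called for this device.
 */
struct group_device {
	bool blocked;
	int pasid_detached;
	struct group_device *next;
};

/* Sketch of __iommu_remove_group_pasid() with the proposed skip:
 * a device under reset is left alone so no devTLB invalidation is
 * sent to it mid-reset.
 */
static void remove_group_pasid(struct group_device *head)
{
	struct group_device *gdev;

	for (gdev = head; gdev; gdev = gdev->next) {
		if (gdev->blocked)
			continue;	/* resetting: no devTLB flush */
		gdev->pasid_detached++;	/* models remove_dev_pasid() */
	}
}
```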

another is a use-after-free concern with iommu_detach_device() in the
middle of a reset. In my view it will hit the WARN_ON before any UAF:

static void __iommu_group_set_domain_nofail(struct iommu_group *group,
					    struct iommu_domain *new_domain)
{
	WARN_ON(__iommu_group_set_domain_internal(
		group, new_domain, IOMMU_SET_DOMAIN_MUST_SUCCEED));
}

but I haven't got time to think about the fix carefully.

the last one is trivial: goto and guard() shouldn't be mixed in one
function, per the cleanup guidelines.
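
For context, a minimal user-space model of why those guidelines exist.
The kernel's guard() from <linux/cleanup.h> is built on
__attribute__((cleanup)), so the release runs automatically at scope exit;
a goto that jumps around the guarded scope obscures where that exit (and
thus the unlock) happens. FAKE_GUARD, fake_unlock, and lock_depth below
are made-up stand-ins, not the real kernel API:

```c
static int lock_depth;	/* models whether the "lock" is held */

static void fake_unlock(int *unused)
{
	(void)unused;
	lock_depth--;	/* runs at scope exit, like guard()'s release */
}

/* Rough stand-in for guard(mutex)(&m): take the lock, arm the cleanup. */
#define FAKE_GUARD(name) \
	__attribute__((cleanup(fake_unlock))) int name = ++lock_depth

static int guarded_fn(void)
{
	FAKE_GUARD(g);		/* "lock" taken here */
	return lock_depth;	/* still held inside the scope */
}				/* released here; no goto label to audit */
```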

the former two are pre-existing issues which could be fixed in a follow-up
patch if you want to fix this nesting issue first. If that's the case
(with the 3rd issue fixed):

Reviewed-by: Kevin Tian <kevin.tian@xxxxxxxxx>

[1] https://sashiko.dev/#/patchset/20260407194644.171304-1-nicolinc%40nvidia.com