Re: [PATCH rc v2] iommu: Fix nested pci_dev_reset_iommu_prepare/done()

From: Nicolin Chen

Date: Fri Mar 27 2026 - 17:08:54 EST

On Fri, Mar 27, 2026 at 08:27:12AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@xxxxxxxxxx>
> > Sent: Friday, March 20, 2026 5:35 AM
> >
> > On Thu, Mar 19, 2026 at 07:14:21PM +0800, Shuai Xue wrote:
> > > On 3/19/26 12:31 PM, Nicolin Chen wrote:
> > > > @@ -3961,9 +3962,10 @@ int pci_dev_reset_iommu_prepare(struct
> > pci_dev *pdev)
> > > > guard(mutex)(&group->mutex);
> > > > - /* Re-entry is not allowed */
> > > > - if (WARN_ON(group->resetting_domain))
> > > > - return -EBUSY;
> > > > + if (group->resetting_domain) {
> > > > + group->reset_cnt++;
> > > > + return 0;
> > > > + }
> > > > ret = __iommu_group_alloc_blocking_domain(group);
> > >
> > > pci_dev_reset_iommu_prepare/done() have NO singleton group check.
> > > They operate on the specific pdev passed in, but use group-wide
> > > state (resetting_domain, reset_cnt) to track the reset lifecycle.
> > >
> > > Interestingly, the broken_worker in patch 3 of the ATC timeout
> > > series DOES have an explicit singleton check:
> > >
> > > if (list_is_singular(&group->devices)) {
> > > /* Note: only support group with a single device */
> > >
> > > This reveals an implicit assumption: the entire prepare/done
> > > mechanism works correctly only for singleton groups. For
> > > multi-device groups:
> > >
> > > - prepare() only detaches the specific pdev, leaving other
> > > devices in the group still attached to the original domain
>
> no, attach is per-group. all devices are changed together.

I think prepare() should use the per-device ops:
ops->attach_dev (via __iommu_attach_device)
ops->set_dev_pasid
right?

And it is intended to limit to the resetting device only. Other
devices in the group can still stay in the attached domain, but
they cannot do new attachment (-EBUSY) because that's per-group.

> > > - The group-wide resetting_domain/reset_cnt state can be
> > > corrupted by concurrent resets on different devices (as
> > > discussed above)
> >
> > That's a phenomenal catch!
> >
> > I think we shall have a reset_ndevs in the group and a reset_depth
> > in the gdev.
> >
>
> I didn't get how the state might be corrupted.
>
> If there are two devices in a group both being reset, you only
> want to switch back to the original domain after both devices
> have finished reset.
>
> so a reset_cnt is sufficient to cover both scenarios:
>
> - nested prepare/done in a single device resetting path
> - concurrent prepare/done in multiple devices resetting paths

Since prepare() is per-device, the per-group reset_cnt wouldn't
be sufficient if one device is resetting while the other isn't.

iommu_driver_get_domain_for_dev() for example would return the
blocking_domain for both devices, but physically only the first
device is attached to blocking_domain.

> v4 is way too complex. Probably it's required in the following
> ATS timeout series (which I haven't thought about carefully).
> but for the fix here seems a simple logic like v2 does is ok?

I think v4 is the way to go, unless we fundamentally limit this
API (and the followup ATS timeout series) to singleton groups.

FWIW, I have the ATS timeout series updated (waiting for some
finalization of this base patch):
https://github.com/nicolinc/iommufd/commits/smmuv3_atc_timeout-v3/
The new iommu_report_device_broken() is changed to per-gdev.

Thanks
Nicolin