Re: [PATCH v2 11/12] iommu/amd: Add support for nested domain attach/detach
From: Jason Gunthorpe
Date: Tue Oct 07 2025 - 19:43:19 EST
On Tue, Oct 07, 2025 at 02:22:30PM -0500, Suthikulpanit, Suravee wrote:
>
>
> On 10/6/2025 9:59 AM, Jason Gunthorpe wrote:
> > On Wed, Oct 01, 2025 at 06:09:53AM +0000, Suravee Suthikulpanit wrote:
> > > ....
> > > + if (WARN_ON(!ndom || !pdom || (pdom->iop.mode == PAGE_MODE_NONE)))
> > > + return;
> > > +
> > > + amd_iommu_make_clear_dte(dev_data, &new);
> > > +
> > > + new.data[0] |= iommu_virt_to_phys(pdom->iop.root);
> > > + new.data[0] |= FIELD_PREP(DTE_MODE_MASK, pdom->iop.mode);
> > > + new.data[0] |= DTE_FLAG_IR | DTE_FLAG_IW | DTE_FLAG_TV;
> > > + new.data[0] |= (DTE_FLAG_PPR & gdte->data[0]);
> >
> > > + if (pdom->dirty_tracking)
> > > + new.data[0] |= DTE_FLAG_HAD;
> > > +
> > > + if (dev_data->ats_enabled)
> > > + new.data[1] |= DTE_FLAG_IOTLB;
> >
> > This sequence should be in some set_dte_gcr3() ??
>
> Not sure what you mean. This logic was in set_dte_entry(), and duplicated
> here in the set_dte_nested() since we no longer calling set_dte_entry() from
> the nested path. Also, it's not really related to GCR3 table.
Sorry some make_ste_v1()
This logic should be the same for any v1 page table.
> > > + /*
> > > + * Restore cached persistent DTE bits, which can be set by information
> > > + * in IVRS table. See set_dev_entry_from_acpi().
> > > + */
> > > + initial_dte = amd_iommu_get_ivhd_dte_flags(iommu->pci_seg->id, dev_data->devid);
> > > + if (initial_dte) {
> > > + new.data128[0] |= initial_dte->data128[0];
> > > + new.data128[1] |= initial_dte->data128[1];
> > > + }
> >
> > This should go into amd_iommu_make_clear_dte() I think, and refactor
> > it out of iommu_update_dte256() ?
> > Every created DTE needs these bits set, right?
>
> Currently, the amd_iommu_make_clear_dte() clears all the DTE bits and set
> the DTE[V] (valid) bit. This is used when preparing the DTE for programming,
> detaching domain, and when setting the blocking domain. Putting this logic
> in the function would change the behavior.
Yes, but it seems like that is the right behavior?
> These bits affect Interrupt remapping (Lint0/Lint1/NMI/INIT/ExtInt interrupt
> pass-through) and System management message behavior.
Because these bits shouldn't be cleared on a blocking domain!
Interrupts are still expected to work.
> > > +
> > > + /* Guest translation stuff */
> > > + new.data[0] |= (gdte->data[0] &
> > > + (DTE_GLX | DTE_FLAG_GV | DTE_FLAG_GIOV));
> > > +
> > > + /* GCR3 table */
> > > + new.data[0] |= (gdte->data[0] & DTE_GCR3_14_12);
> > > + new.data[1] |= (gdte->data[1] & (DTE_GCR3_30_15 | DTE_GCR3_51_31));
> > > +
> > > + /* Guest paging mode */
> > > + new.data[2] |= (gdte->data[2] & DTE_GPT_LEVEL_MASK);
> >
> > I didn't see anything validating gdte has only permitted set bits in
> > the prior patch?
>
> Not sure which on are you referring to.
You have to validate gdte has only valid bits.
> > If this is goint to decode array item by item then why not use struct
> > iommu_hwpt_amd_guest in the nested_domain ?
>
> The struct dev_table_entry *gdte is basically the same information as in the
> struct iommu_hwpt_amd_guest.dte that we copied from the userspace into the
> more appropriate in-kernel data structure type, which is used within the
> driver.
The appropriate type is the userspace type if you don't actually ever
use anything unique to the kernel type. You can avoid the special
assert/etc since this way of accessing it is not sensitive to layout.
> Here, we just select only what we needed for configuring guest page table
> specifically to be programmed onto the host DTE.
Everything you don't pick up should be checked to be 0. VMM needs to
filter out unsuopported things or generate a bad DTE error.
Jason