RE: [PATCH v3 00/10] Add Intel VT-d nested translation

From: Tian, Kevin
Date: Fri May 26 2023 - 07:25:54 EST

> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Friday, May 26, 2023 2:07 AM
> On Wed, 24 May 2023 08:59:43 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > > From: Liu, Yi L <yi.l.liu@xxxxxxxxx>
> > > Sent: Thursday, May 11, 2023 10:51 PM
> > >
> > > The first Intel platform supporting nested translation is Sapphire
> > > Rapids which, unfortunately, has a hardware errata [2] requiring special
> > > treatment. This errata happens when a stage-1 page table page (either
> > > level) is located in a stage-2 read-only region. In that case the IOMMU
> > > hardware may ignore the stage-2 RO permission and still set the A/D bit
> > > in stage-1 page table entries during page table walking.
> > >
> > > A flag IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is introduced to report
> > > this errata to userspace. With that restriction the user should either
> > > disable nested translation to favor RO stage-2 mappings or ensure no
> > > RO stage-2 mapping to enable nested translation.
> > >
> > > Intel-iommu driver is armed with necessary checks to prevent such mix
> > > in patch10 of this series.
> > >
> > > Qemu currently does add RO mappings though. The vfio agent in Qemu
> > > simply maps all valid regions in the GPA address space which certainly
> > > includes RO regions e.g. vbios.
> > >
> > > In reality we don't know a usage relying on DMA reads from the BIOS
> > > region. Hence finding a way to allow user opt-out RO mappings in
> > > Qemu might be an acceptable tradeoff. But how to achieve it cleanly
> > > needs more discussion in Qemu community. For now we just hacked Qemu
> > > to test.
> > >
> >
> > Hi, Alex,
> >
> > Want to touch base on your thoughts about this errata before we
> > actually go to discuss how to handle it in Qemu.
> >
> > Overall it affects all Sapphire Rapids platforms. Fully disabling nested
> > translation in the kernel just for this rare vulnerability sounds an overkill.
> >
> > So we decide to enforce the exclusive check (RO in stage-2 vs. nesting)
> > in the kernel and expose the restriction to userspace so the VMM can
> > choose which one to enable based on its own requirement.
> >
> > At least this looks a reasonable tradeoff to some proprietary VMMs
> > which never adds RO mappings in stage-2 today.
> >
> > But we do want to get Qemu support nested translation on those
> > platform as the widely-used reference VMM!
> >
> > Do you see any major oversight before pursuing such change in Qemu
> > e.g. having a way for the user to opt-out adding RO mappings in stage-2? 😊
>
> I don't feel like I have enough info to know what common scenarios are
> going to make use of 2-stage and nested configurations and how likely a
> user is to need such an opt-out. If it's likely that a user is going
> to encounter this configuration, an opt-out is at best a workaround.
> It's a significant support issue if a user needs to generate a failure
> in QEMU, notice and decipher any log messages that failure may have
> generated, and take action to introduce specific changes in their VM
> configuration to support a usage restriction.

Thanks. This is a good point.

> For QEMU I might lean more towards an effort to better filter the
> mappings we create to avoid these read-only ranges that likely don't
> require DMA mappings anyway.

We thought about having intel-viommu register a discard memory
manager to filter out such mappings in case the kernel reports this
errata.

Our original thought was that even with it we may still want to
make this an explicit user choice, given that this configuration
doesn't match bare metal. But with your explanation, doing so would
probably cause more trouble than it solves.
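
A minimal sketch of the filtering idea above, only to make the intended
behavior concrete. This is not the intel-viommu or vfio code: struct
gpa_region, should_map_stage2() and filter_stage2_regions() are
hypothetical names, and the bit value given to the errata flag is a
placeholder (only the flag name comes from this series).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag name is from this series; the bit value is a placeholder. */
#define IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 (1u << 0)

/* Hypothetical description of one region in the GPA address space. */
struct gpa_region {
    uint64_t gpa;
    uint64_t size;
    bool readonly;          /* e.g. a vbios region */
};

/*
 * Decide whether a region should get a stage-2 mapping. An RO region
 * is skipped only when the errata is reported and nesting is wanted.
 */
static bool should_map_stage2(const struct gpa_region *r,
                              uint32_t hw_flags, bool want_nesting)
{
    bool errata = (hw_flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) != 0;

    return !(r->readonly && errata && want_nesting);
}

/* Copy the regions that should be mapped into out[]; return the count. */
static size_t filter_stage2_regions(const struct gpa_region *in, size_t n,
                                    uint32_t hw_flags, bool want_nesting,
                                    struct gpa_region *out)
{
    size_t kept = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (should_map_stage2(&in[i], hw_flags, want_nesting))
            out[kept++] = in[i];
    }
    return kept;
}
```

With the flag set and nesting wanted, only the RO regions are dropped;
in every other combination all regions are kept, so existing setups
would see no change in behavior.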

> How much does this affect arbitrary userspace vfio drivers? For
> example are there scenarios where running in a VM with a vIOMMU
> introduces nested support that's unknown to the user which now prevents
> this usage? An example might be running an L2 guest with a version of
> QEMU that does create read-only mappings. If necessary, how would lack
> of read-only mapping support be conveyed to those nested use cases?

To enable nested translation it's expected to have the guest use
stage-1 while the host uses stage-2. So the L0 QEMU will expose
a vIOMMU with only stage-1 capability to L1.

In that case it's perfectly fine to have RO mappings in stage-1, no
matter whether L1 further creates an L2 guest inside.

Then only L0 QEMU needs to care about this RO thing in stage-2.

In case L0 QEMU exposes a legacy vIOMMU which supports only stage-2,
nesting cannot be enabled. Instead it will fall back to the old
shadowing path, in which case RO mappings from the guest don't matter
either.

Exposing a vIOMMU which supports all of stage-1/stage-2/nesting
is another story. But I believe we are still far from the point where
this becomes useful, and it's reasonable to just have L0 QEMU not
support this configuration before this errata is fixed. 😊
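
To summarize the three cases above on an affected platform, here is a
rough sketch. The enum and function names are made up for illustration
and do not correspond to QEMU or kernel code; the sketch only models
hosts that report the errata.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits of the vIOMMU that L0 QEMU exposes to L1. */
enum viommu_caps {
    VIOMMU_CAP_STAGE1 = 1 << 0,
    VIOMMU_CAP_STAGE2 = 1 << 1,
};

/* How L0 would back that vIOMMU on a host reporting the errata. */
enum l0_mode {
    L0_MODE_NESTED,       /* guest owns stage-1, host owns stage-2 */
    L0_MODE_SHADOW,       /* legacy stage-2-only vIOMMU, shadowing path */
    L0_MODE_UNSUPPORTED,  /* stage-1 + stage-2 + nesting exposed to L1 */
};

static enum l0_mode l0_mode_for_viommu(unsigned int caps)
{
    /* At least one stage must be exposed. */
    assert(caps & (VIOMMU_CAP_STAGE1 | VIOMMU_CAP_STAGE2));

    bool s1 = (caps & VIOMMU_CAP_STAGE1) != 0;
    bool s2 = (caps & VIOMMU_CAP_STAGE2) != 0;

    if (s1 && s2)
        return L0_MODE_UNSUPPORTED;
    return s1 ? L0_MODE_NESTED : L0_MODE_SHADOW;
}

/* Only the nested case requires L0 to keep RO mappings out of stage-2. */
static bool l0_must_avoid_ro_stage2(enum l0_mode mode)
{
    return mode == L0_MODE_NESTED;
}
```

In this model RO mappings created by L1 in its own stage-1 never
appear as an input, which matches the point that only L0 QEMU has to
care about RO in stage-2.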