Re: [PATCH v3 6/7] iommu: Introduce IOMMU_RESV_DIRECT_RELAXABLE reserved memory regions
From: Alex Williamson
Date: Thu May 16 2019 - 13:08:41 EST
On Thu, 16 May 2019 14:58:08 +0200
Auger Eric <eric.auger@xxxxxxxxxx> wrote:
> Hi Jean-Philippe,
>
> On 5/16/19 2:43 PM, Jean-Philippe Brucker wrote:
> > On 16/05/2019 12:45, Auger Eric wrote:
> >> Hi Jean-Philippe,
> >>
> >> On 5/16/19 1:16 PM, Jean-Philippe Brucker wrote:
> >>> On 16/05/2019 11:08, Eric Auger wrote:
> >>>> Note: At the moment the sysfs ABI is not changed. However I wonder
> >>>> whether it wouldn't be preferable to report the direct region as
> >>>> "direct_relaxed" there. At the moment, in case the same direct
> >>>> region is used by 2 devices, one USB/GFX and another not belonging
> >>>> to the previous categories, the direct region will be output twice
> >>>> with "direct" type.
> >>>>
> >>>> This would unblock Shameer's series:
> >>>> [PATCH v6 0/7] vfio/type1: Add support for valid iova list management
> >>>> https://patchwork.kernel.org/patch/10425309/
> >>>
> >>> Thanks for doing this!
> >>>
> >>>> which failed to get pulled for 4.18 merge window due to IGD
> >>>> device assignment regression.
> >>>>
> >>>> v2 -> v3:
> >>>> - fix direct type check
> >>>> ---
> >>>> drivers/iommu/iommu.c | 12 +++++++-----
> >>>> include/linux/iommu.h | 6 ++++++
> >>>> 2 files changed, 13 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >>>> index ae4ea5c0e6f9..28c3d6351832 100644
> >>>> --- a/drivers/iommu/iommu.c
> >>>> +++ b/drivers/iommu/iommu.c
> >>>> @@ -73,10 +73,11 @@ struct iommu_group_attribute {
> >>>> };
> >>>>
> >>>> static const char * const iommu_group_resv_type_string[] = {
> >>>> - [IOMMU_RESV_DIRECT] = "direct",
> >>>> - [IOMMU_RESV_RESERVED] = "reserved",
> >>>> - [IOMMU_RESV_MSI] = "msi",
> >>>> - [IOMMU_RESV_SW_MSI] = "msi",
> >>>> + [IOMMU_RESV_DIRECT] = "direct",
> >>>> + [IOMMU_RESV_DIRECT_RELAXABLE] = "direct",
> >>>> + [IOMMU_RESV_RESERVED] = "reserved",
> >>>> + [IOMMU_RESV_MSI] = "msi",
> >>>> + [IOMMU_RESV_SW_MSI] = "msi",
> >>>> };
> >>>>
> >>>> #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store) \
> >>>> @@ -573,7 +574,8 @@ static int iommu_group_create_direct_mappings(struct iommu_group *group,
> >>>> start = ALIGN(entry->start, pg_size);
> >>>> end = ALIGN(entry->start + entry->length, pg_size);
> >>>>
> >>>> - if (entry->type != IOMMU_RESV_DIRECT)
> >>>> + if (entry->type != IOMMU_RESV_DIRECT &&
> >>>> + entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
> >>>
> >>> I'm trying to understand why you need to create direct mappings at all
> >>> for these relaxable regions. In the host the region is needed for legacy
> >>> device features, which are disabled (and cannot be re-enabled) when
> >>> assigning the device to a guest?
> >> This follows Kevin's comment in the thread below:
> >> https://patchwork.kernel.org/patch/10449103/#21957279
> >>
> >> In normal DMA API host path, those regions need to be 1-1 mapped. They
> >> are likely to be accessed by the driver or FW at early boot phase or
> >> even during execution, depending on features being used.
> >>
> >> That's the reason, according to Kevin we couldn't hide them.
> >>
> >> We just know that, in general, they are not used anymore when assigning
> >> the device or if accesses are attempted this generally does not block
> >> the assignment use case. For example, it is said in
> >> https://github.com/qemu/qemu/blob/master/docs/igd-assign.txt that in
> >> legacy IGD assignment use case, there may be "a small numbers of DMAR
> >> faults when initially assigned".
> >
> > Hmm, fair enough. That doesn't sound too good, if the device might
> > perform arbitrary writes into guest memory once new IOMMU mappings are
> > in place. I was wondering if we could report some IOVA ranges as
> > "available but avoid if possible".
> In Shameer's series we currently reject any vfio dma_map that would fall
> into an RMRR (hence the regression on existing USB/GFX use case). With
> the relaxable RMRR info we could imagine to let the userspace choose
> whether we want to proceed with the dma_map despite the risk or
> introduce a vfio_iommu_type1 module option (turned off by default for
> not regressing existing USB/GFX passthrough) that would forbid dma_map
> on relaxable RMRR regions.
Yep, the risk that Jean-Philippe mentions is real, the IGD device has
the stolen memory addresses latched into the hardware and we're unable
to change that. What we try to do now is trap page table writes to the
device and translate them to a VM allocated stolen memory range, which
is sufficient for getting a BIOS splash screen, but we really want to
assume that the OS level driver just doesn't use the stolen memory
range. There was a time when it seemed like we could assume the Intel
drivers were heading in that direction, but it seems that's no longer
an actual goal. To fully support IGD assignment in a way that isn't as
fragile as it is today, we'd want to re-export the RMRR out to
userspace so that QEMU could identity map it into the VM address
space. That's not trivial, it's only one of several issues around
IGD assignment, and we've got GVT-g (Intel vGPUs) now that don't impose
these requirements, so motivation to tackle the issue is somewhat
reduced.
With the changes here, we might want vfio to issue a warning when one
of these relaxed reserved regions is ignored and we'd probably want a
module option to opt-in to strict enforcement, where downstreams that
don't claim to support IGD assignment might enforce this by default.
Thanks,
Alex