Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication
From: Niklas Schnelle
Date: Tue Nov 29 2022 - 09:40:57 EST
On Tue, 2022-11-29 at 12:53 +0000, Robin Murphy wrote:
> On 2022-11-29 12:00, Niklas Schnelle wrote:
> > On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
> > > On 2022-11-16 17:16, Niklas Schnelle wrote:
> > > > When RPCIT indicates that the underlying hypervisor has run out of
> > > > resources it often means that its IOVA space is exhausted and IOVAs need
> > > > to be freed before new ones can be created. By triggering a flush of the
> > > > IOVA queue we can get the queued IOVAs freed and also get the new
> > > > mapping established during the global flush.
> > >
> > > Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
> > > exhausted and fail the DMA API call before even getting as far as
> > > iommu_map(), though? Or is there some less obvious limitation like a
> > > maximum total number of distinct IOVA regions regardless of size?
> >
> > Well, yes and no. Your thinking is of course correct: if the
> > advertised available IOVA space can be fully utilized without
> > exhausting hypervisor resources, we won't trigger this case. However,
> > sadly there are complications. The most obvious one is that in
> > QEMU/KVM the restriction of the IOVA space to what QEMU can actually
> > have mapped at once was only added recently[0]; prior to that we
> > would regularly go through this "I'm out of resources, free me some
> > IOVAs" dance with our existing DMA API implementation, where it just
> > triggers an early cycle of freeing all unused IOVAs followed by a
> > global flush. On z/VM I know of no situation where this is triggered.
> > That said, this signalling is architected, so z/VM may have corner
> > cases where it does this. On our bare metal hypervisor (no paging)
> > this return code is unused and IOTLB flushes are simply hardware
> > cache flushes, as on bare metal platforms.
> >
> > [0]
> > https://lore.kernel.org/qemu-devel/20221028194758.204007-4-mjrosato@xxxxxxxxxxxxx/
>
> That sheds a bit more light, thanks, although I'm still not confident I
> fully understand the whole setup. AFAICS that patch looks to me like
> it's putting a fixed limit on the size of the usable address space. That
> in turn implies that "free some IOVAs and try again" might be a red
> herring and never going to work; for your current implementation, what
> that presumably means in reality is "free some IOVAs, resetting the
> allocator to start allocating lower down in the address space where it
> will happen to be below that limit, and try again", but the iommu-dma
> allocator won't do that. If it doesn't know that some arbitrary range
> below the top of the driver-advertised aperture is unusable, it will
> just keep allocating IOVAs up there and mappings will always fail.
>
> If the driver can't accurately represent the usable IOVA space via the
> aperture and/or reserved regions, then this whole approach seems doomed.
> If on the other hand I've misunderstood and you can actually still use
> any address, just not all of them at the same time,

This is exactly it: the problem is a limit on the number of IOVAs that
are concurrently mapped. In QEMU pass-through the tightest limit is
usually the one set by the host kernel parameter
vfio_iommu_type1.dma_entry_limit, which defaults to 65535 mappings.
With IOMMU_DOMAIN_DMA we stay under this limit without extra action,
but once there is a flush queue (including the existing per-CPU one),
where each entry may keep many pages lazily unmapped, this limit is
easily hit with fio bandwidth tests on an NVMe. For this case the
patch works reliably because, of course, the number of actually active
mappings excluding the lazily freed ones is similar to the number of
active mappings with IOMMU_DOMAIN_DMA.

> then it might in
> fact be considerably easier to skip the flush queue mechanism entirely
> and implement this internally to the driver - basically make .iotlb_sync
> a no-op for non-strict DMA domains,

I'm assuming you mean .iotlb_sync_map above.

> put the corresponding RPCIT flush
> and retry in .sync_map, then allow that to propagate an error back to
> iommu_map() if the new mapping still hasn't taken.
>
> Thanks,
> Robin.

Hmm, interesting. This would leave the IOVAs in the flush queue lazily
unmapped, and thus still block their re-use, but free their host
resources via a global RPCIT, allowing the guest to use a different
portion of the IOVA space with those resources. It could work, though
I need to test it, but it feels a bit clunky.

Maybe we can go cleaner while keeping this idea of not having to flush
the queue but just freeing the host-side resources. If we allowed
.iotlb_sync_map to return an error that fails the mapping operation,
then we could do it all in there. In the normal case it just does the
RPCIT, but if that indicates that the hypervisor ran out of resources,
it does another, global RPCIT that lets the hypervisor free the IOVAs
we lazily unmapped and also establishes the new mapping. If the latter
succeeds all is good, if not the mapping operation fails. Logically it
makes sense too: .iotlb_sync_map is the final step of syncing the
mapping to the host, which can fail just like the mapping operation
itself.
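
As a rough, untested sketch of what I have in mind (assuming
.iotlb_sync_map were allowed to return an int;
rpcit_refresh_range()/rpcit_refresh_all() are just placeholder names
for the driver's RPCIT helpers, not actual functions):

static int s390_iommu_iotlb_sync_map(struct iommu_domain *domain,
                                     unsigned long iova, size_t size)
{
        struct s390_domain *s390_domain = to_s390_domain(domain);
        int rc;

        /* Normal case: tell the hypervisor about the newly mapped range. */
        rc = rpcit_refresh_range(s390_domain, iova, size);
        /* -ENOMEM here stands in for the RPCIT out of resource indication. */
        if (rc != -ENOMEM)
                return rc;

        /*
         * The hypervisor ran out of resources. A global RPCIT lets it
         * free the resources backing IOVAs we only lazily unmapped and
         * also establishes the new mapping, so its result decides
         * whether the mapping operation succeeds or fails.
         */
        return rpcit_refresh_all(s390_domain);
}

That keeps the retry logic entirely inside the driver; iommu_map()
would just see a failed mapping if even the global flush doesn't help.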

Apart from the out-of-active-IOVAs case this would also handle the
other useful error case when using .iotlb_sync_map for shadowing,
where it fails because the host ran into a pinned-pages limit or out
of actual memory. Not by fixing it, but at least we would get a failed
mapping operation.

The other callbacks, .flush_iotlb_all and .iotlb_sync, could stay the
same as they are only used for unmapped pages, where we can't
reasonably run out of host resources, neither active IOVAs nor pinned
pages.