Re: [PATCH v2 7/7] iommu/s390: flush queued IOVAs on RPCIT out of resource indication

From: Matthew Rosato
Date: Tue Nov 29 2022 - 08:52:20 EST


On 11/29/22 7:00 AM, Niklas Schnelle wrote:
> On Mon, 2022-11-28 at 14:52 +0000, Robin Murphy wrote:
>> On 2022-11-16 17:16, Niklas Schnelle wrote:
>>> When RPCIT indicates that the underlying hypervisor has run out of
>>> resources, it often means that its IOVA space is exhausted and IOVAs
>>> need to be freed before new ones can be created. By triggering a
>>> flush of the IOVA queue we can get the queued IOVAs freed and also
>>> get the new mapping established during the global flush.
>>
>> Shouldn't iommu_dma_alloc_iova() already see that the IOVA space is
>> exhausted and fail the DMA API call before even getting as far as
>> iommu_map(), though? Or is there some less obvious limitation like a
>> maximum total number of distinct IOVA regions regardless of size?
>
> Well, yes and no. Your thinking is of course correct: if the
> advertised available IOVA space can be fully utilized without
> exhausting hypervisor resources, we won't trigger this case. Sadly,
> however, there are complications. The most obvious is that in
> QEMU/KVM the restriction of the IOVA space to what QEMU can actually
> have mapped at once was only added recently[0]; prior to that we
> would regularly go through this "I'm out of resources, free me some
> IOVAs" dance with our existing DMA API implementation, where the
> indication just triggers an early cycle of freeing all unused IOVAs
> followed by a global flush. On z/VM I know of no situations where
> this is triggered. That said this signalling is
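
As a rough illustration of the flush-queue behaviour Niklas describes above (all names here are placeholders for the sketch, not the actual dma-iommu or s390 code): freed IOVAs are parked in a queue rather than returned to the allocator immediately, and a drain does one global IOTLB flush before handing everything back at once.

#define FQ_DEPTH 256

struct fq_entry {
        unsigned long pfn;      /* first page frame of the freed IOVA range */
        unsigned long pages;    /* length of the freed range in pages */
};

struct flush_queue {
        struct fq_entry entries[FQ_DEPTH];
        unsigned int head, tail;        /* simple ring buffer */
};

/* placeholders standing in for the real flush/free primitives */
void global_iotlb_flush(void *domain);
void free_iova_range(void *domain, unsigned long pfn, unsigned long pages);

static void flush_queue_drain(void *domain, struct flush_queue *fq)
{
        /* a single global flush covers every queued range */
        global_iotlb_flush(domain);

        /* only now is it safe to hand the IOVAs back to the allocator */
        while (fq->head != fq->tail) {
                struct fq_entry *e = &fq->entries[fq->head];

                free_iova_range(domain, e->pfn, e->pages);
                fq->head = (fq->head + 1) % FQ_DEPTH;
        }
}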

While the QEMU case made for an easily reproducible scenario, the indication was really provided to handle the case where you have multiple pageable guests whose total memory, taken together, is overcommitted compared to the hypervisor's resources. The intent is for the entire advertised guest aperture to be usable, and generally speaking it is. However, it is possible to end up in a (poorly tuned) scenario where the hypervisor is unable to pin additional pages (basically an OOM condition). The hypervisor (qemu/kvm or z/VM) can then use this indication as a cry for help: "stop what you're doing and flush your queues immediately so I can unpin as much as possible". After that, the guest(s) can continue using their aperture.
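
A hypothetical guest-side sketch of that "cry for help" handling; the names (rpcit_refresh(), flush_queue_drain_all(), RPCIT_OUT_OF_RESOURCE) are illustrative placeholders in the spirit of the sketch above, not the real zPCI or dma-iommu interface:

/* illustrative status meaning "hypervisor is out of resources" */
#define RPCIT_OUT_OF_RESOURCE 1

int rpcit_refresh(void *domain, unsigned long iova, unsigned long pages);
void flush_queue_drain_all(void *domain);  /* drain every queue as sketched above */

static int establish_mapping(void *domain, unsigned long iova,
                             unsigned long pages)
{
        int rc = rpcit_refresh(domain, iova, pages);

        if (rc == RPCIT_OUT_OF_RESOURCE) {
                /*
                 * The hypervisor asked for help: free the queued IOVAs right
                 * away so it can unpin pages.  Per the patch description
                 * quoted above, the global flush done by the drain also
                 * establishes the new mapping, so no explicit retry is
                 * needed here.
                 */
                flush_queue_drain_all(domain);
                rc = 0;
        }
        return rc;
}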

This is unnecessary for the no-paging bare metal hypervisor, because there you aren't overcommitting memory.