Re: [PATCH v11 11/22] vfio iommu: Add blocking notifier to notify DMA_UNMAP

From: Alex Williamson
Date: Tue Nov 08 2016 - 16:28:25 EST


On Wed, 9 Nov 2016 01:29:19 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:

> On 11/8/2016 11:16 PM, Alex Williamson wrote:
> > On Tue, 8 Nov 2016 21:56:29 +0530
> > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >
> >> On 11/8/2016 5:15 AM, Alex Williamson wrote:
> >>> On Sat, 5 Nov 2016 02:40:45 +0530
> >>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >>>
> >> ...
> >>>>
> >>>> +int vfio_register_notifier(struct device *dev, struct notifier_block *nb)
> >>>
> >>> Is the expectation here that this is a generic notifier for all
> >>> vfio->mdev signaling? That should probably be made clear in the mdev
> >>> API to avoid vendor drivers assuming their notifier callback only
> >>> occurs for unmaps, even if that's currently the case.
> >>>
> >>
> >> Ok. Adding comment about notifier callback in mdev_device which is part
> >> of next patch.
> >>
> >> ...
> >>
> >>>> mutex_lock(&iommu->lock);
> >>>>
> >>>> - if (!iommu->external_domain) {
> >>>> + /* Fail if notifier list is empty */
> >>>> + if ((!iommu->external_domain) || (!iommu->notifier.head)) {
> >>>> ret = -EINVAL;
> >>>> goto pin_done;
> >>>> }
> >>>> @@ -867,6 +870,11 @@ unlock:
> >>>> /* Report how much was unmapped */
> >>>> unmap->size = unmapped;
> >>>>
> >>>> + if (unmapped && iommu->external_domain)
> >>>> + blocking_notifier_call_chain(&iommu->notifier,
> >>>> + VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >>>> + unmap);
> >>>
> >>> This is after the fact, there's already a gap here where pages are
> >>> unpinned and the mdev device is still running.
> >>
> >> Oh, there is a bug here, now unpin_pages() take user_pfn as argument and
> >> find vfio_dma. If its not found, it doesn't unpin pages. We have to call
> >> this notifier before vfio_remove_dma(). But if we call this before
> >> vfio_remove_dma() there will be deadlock since iommu->lock is already
> >> held here and vfio_iommu_type1_unpin_pages() will also try to hold
> >> iommu->lock.
> >> If we want to call blocking_notifier_call_chain() before
> >> vfio_remove_dma(), sequence should be:
> >>
> >> unmapped += dma->size;
> >> mutex_unlock(&iommu->lock);
> >> if (iommu->external_domain)) {
> >> struct vfio_iommu_type1_dma_unmap nb_unmap;
> >>
> >> nb_unmap.iova = dma->iova;
> >> nb_unmap.size = dma->size;
> >> blocking_notifier_call_chain(&iommu->notifier,
> >> VFIO_IOMMU_NOTIFY_DMA_UNMAP,
> >> &nb_unmap);
> >> }
> >> mutex_lock(&iommu->lock);
> >> vfio_remove_dma(iommu, dma);
> >
> > It seems like it would be worthwhile to have the rb-tree rooted in the
> > vfio-dma, then we only need to call the notifier if there are pages
> > pinned within that vfio-dma (ie. the rb-tree is not empty). We can
> > then release the lock call the notifier, re-acquire the lock, and
> > BUG_ON if the rb-tree still is not empty. We might get duplicate pfns
> > between separate vfio_dma structs, but as I mentioned in other replies,
> > that seems like an exception that we don't need to optimize for.
> >
>
> If we don't optimize for the case where iova from different vfio_dma are
> mapped to same pfn and we would not consider this case for page
> accounting then:

Just to clarify, the current code (not handling mdevs) will pin and do
page accounting per iova, regardless of whether the iova translates to a
unique pfn. As long as we do no worse than that, I'm ok.

> - have rb tree of pinned iova, where key would be iova, in each vfio_dma
> structure.
> - iova tracking structure would have iova and ref_count only.
> - page accounting would only count number of iova's in rb_tree, case
> where different iova could map to same pfn would not be considered in
> this implementation for now.
> - vfio_unpin_pages() would have user_pfn and pfn as input, we would
> validate that iova exist in rb tree and trust vendor driver that
> corresponding pfn is correct, there is no validation of pfn. If want
> validate pfn, call GUP, verify pfn and call put_pfn().
> - In .release() or .detach_group() path, if there are entries in this rb
> tree, call GUP again using that iova, get pfn and then call
> put_pfn(pfn) for ref_count+1 times. This is because we are not keeping
> pfn in our tracking logic.

Wait a sec, if we detach a group from the container and it's not the
last group in the container (which would trigger a release), we can't
assume anything about which vfio_dma entries were associated with that
device. The vendor driver, through the release of the device(s) within
that group, needs to unpin. In a container release, we need to send a
notifier to the vendor driver(s) to cause an unpin. This is the only
mechanism we have to ensure that vendor drivers are not leaking
references. If during the release, after the notifier, if any
vfio_pfns remain, we need to BUG_ON, just like we need to do for any
other DMA_UNMAP.

Also, I'll say it again, I also don't like this API of passing around
potentially giant arrays, and especially the API of relying on the
vendor driver to tell us an arbitrary pfn to unpin. If we make the
assumption that vendor drivers do not pin lots and lots of memory,
perhaps we could use a struct vfio_pfn as:

struct vfio_pfn {
struct rb_node node;
dma_addr_t iova; /* key */
unsigned long pfn;
atomic_t ref_count;
};

This puts us at 44-bytes per pfn, which isn't great, but I think it
puts us in a better position with the API that we could make use of a
page-table or sparse array in the future that would eliminate the
rb_node and make the iova implicit in the location of the data
structure. That would leave only the pfn and ref_count, which could
potentially be combined into a single 8-byte field if we had per
vfio_dma (or higher) locking to avoid the atomic_t (and we're happy that
the reference count is always less than PAGE_SIZE, ie. we could fail
pinning if we get to that point).

That would allow for the unpin call to not provide the pfn, so could we
then look at whether we need the batching provided by the iova array at
all? I don't have a feel for the size of memory that gets pinned by
the vendor driver, the frequency of pinning, or whether usage of
hugepages for the guest is likely to translate into contiguous memory
requests through this API. What's your feeling? Thanks,

Alex