Re: [PATCH v12 10/22] vfio iommu type1: Add support for mediated devices

From: Kirti Wankhede
Date: Tue Nov 15 2016 - 10:11:54 EST




On 11/15/2016 4:55 AM, Alex Williamson wrote:
> On Mon, 14 Nov 2016 21:12:24 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>> Mediated devices only use IOMMU APIs; the underlying hardware can be
>> managed by an IOMMU domain.
>>
>> Aim of this change is:
>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>> - To support direct assigned device and mediated device in single module
>>
>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>> backend module. More details:
>> - Domain for external user is tracked separately in vfio_iommu structure.
>> It is allocated when group for first mdev device is attached.
>> - Pages pinned for external domain are tracked in each vfio_dma structure
>> for that iova range.
>> - Page tracking rb-tree in vfio_dma keeps <iova, pfn, ref_count>. Key of
>> rb-tree is iova, but it actually aims to track pfns.
>> - On external pin request for an iova, page is pinned only once, if iova
>> is already pinned and tracked, ref_count is incremented.
>
> This is referring only to the external (ie. pfn_list) tracking,
> correct? In general, a page can be pinned up to twice per iova
> referencing it, once for an iommu mapped domain and again in the
> pfn_list, right?
>

That's right.
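To make the ref_count rule concrete, here's a toy, standalone model (illustrative names, a single-entry stand-in for the pfn_list rb-tree, not the kernel code) of the "pin once, then just bump ref_count" behaviour for external pins:

```c
#include <assert.h>

/*
 * Toy model of the external-pin rule: the first external pin of an
 * iova really pins the page; later pins of the same iova only bump
 * ref_count. Names are illustrative, not the kernel rb-tree code.
 */
struct vpfn {
	unsigned long iova;
	int ref_count;
};

static int page_pins;	/* models actual calls that pin the page */

static void ext_pin(struct vpfn *v, unsigned long iova)
{
	if (v->iova == iova && v->ref_count) {
		v->ref_count++;		/* already tracked: take a reference */
		return;
	}
	page_pins++;			/* first external pin of this iova */
	v->iova = iova;
	v->ref_count = 1;
}
```

An iommu-mapped domain may hold its own pin of the same page independently, which is the "up to twice per iova" case above.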

...
>>
>> - if (!rsvd && !lock_cap && mm->locked_vm + i + 1 > limit) {
>> - put_pfn(pfn, prot);
>> + iova = vaddr - dma->vaddr + dma->iova;
>
>
> nit, this could be incremented in the for loop just like vaddr, we
> don't need to create it from scratch (iova += PAGE_SIZE).
>

Ok.
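A minimal standalone sketch of that loop shape (illustrative names and fields, not the actual kernel loop): iova is derived once from dma->iova + (vaddr - dma->vaddr) and then stepped in lock-step with vaddr.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Illustrative stand-ins for the vfio_dma fields involved. */
struct dma_range {
	uint64_t vaddr;	/* process virtual address of the mapping */
	uint64_t iova;	/* I/O virtual address of the mapping */
};

/*
 * Walk npage pages, advancing iova alongside vaddr instead of
 * recomputing it on every iteration. Returns the iova of the
 * last page visited.
 */
static uint64_t walk_pages(struct dma_range *dma, uint64_t vaddr, long npage)
{
	uint64_t iova = vaddr - dma->vaddr + dma->iova; /* computed once */
	long i;

	for (i = 1; i < npage; i++) {
		vaddr += PAGE_SIZE;
		iova += PAGE_SIZE;	/* stepped like vaddr, per review */
	}
	(void)vaddr;	/* the real loop pins the page at vaddr here */
	return iova;
}
```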

>
>> + vpfn = vfio_find_vpfn(dma, iova);
>> + if (!vpfn)
>> + lock_acct++;
>> +
>> + if (!rsvd && !lock_cap &&
>> + mm->locked_vm + lock_acct + 1 > limit) {
>
>
> lock_acct was incremented just above, why is this lock_acct + 1? I
> think we're off by one here (ie. remove the +1)?
>

Moved the vfio_find_vpfn() check to after this limit check.
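A toy standalone model of the two orderings (not the kernel code; "already_tracked" stands in for vfio_find_vpfn() finding an entry) shows the off-by-one: bumping lock_acct before the check makes the "+ 1" count the current page twice.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Ordering under review: lock_acct is bumped first, so the "+ 1"
 * in the limit check double-counts the current page.
 */
static bool may_pin_buggy(long locked_vm, long *lock_acct,
			  bool already_tracked, long limit)
{
	if (!already_tracked)
		(*lock_acct)++;
	return locked_vm + *lock_acct + 1 <= limit;
}

/*
 * Fixed ordering: an already-tracked page was accounted on a prior
 * pin, and a fresh page is the "+ 1" in the check, counted only once.
 */
static bool may_pin_fixed(long locked_vm, long *lock_acct,
			  bool already_tracked, long limit)
{
	if (already_tracked)
		return true;	/* already accounted */
	if (locked_vm + *lock_acct + 1 > limit)
		return false;
	(*lock_acct)++;
	return true;
}
```

With limit = 2 and two fresh pages, the buggy ordering rejects the second page even though the task is exactly at its limit.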


...


>> -static long vfio_unpin_pages_remote(struct vfio_dma *dma, unsigned long pfn,
>> - long npage, int prot, bool do_accounting)
>> +static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>> + unsigned long pfn, long npage,
>> + bool do_accounting)
>> {
>> - unsigned long unlocked = 0;
>> + long unlocked = 0, locked = 0;
>> long i;
>>
>> - for (i = 0; i < npage; i++)
>> - unlocked += put_pfn(pfn++, prot);
>> + for (i = 0; i < npage; i++) {
>> + struct vfio_pfn *vpfn;
>> +
>> + unlocked += put_pfn(pfn++, dma->prot);
>> +
>> + vpfn = vfio_find_vpfn(dma, iova + (i << PAGE_SHIFT));
>> + if (vpfn)
>> + locked++;
>
> This isn't taking into account reserved pages (ex. device mmaps). In
> the pinning path above we skip accounting of reserved pages, put_pfn
> also only increments for non-reserved pages, but vfio_find_vpfn()
> doesn't care, so it's possible that (locked - unlocked) could result in
> a positive value. Maybe something like:
>
> if (put_pfn(...)) {
> unlocked++;
> if (vfio_find_vpfn(...))
> locked++;
> }
>

Updating.
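Expanding that suggestion into a toy standalone model (illustrative names, not the kernel code) shows why gating the vfio_find_vpfn() lookup on put_pfn()'s return keeps the accounting delta from going positive for reserved pages:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the unpin accounting fix. is_rsvd[i] marks a reserved
 * pfn (e.g. a device mmap); put_pfn() returns 0 for those, so they
 * were never accounted. in_pfn_list[i] marks pages that also have an
 * external-domain vpfn entry. Returns locked - unlocked, the
 * accounting delta to apply.
 */
static long unpin_delta(const bool *is_rsvd, const bool *in_pfn_list,
			long npage)
{
	long unlocked = 0, locked = 0, i;

	for (i = 0; i < npage; i++) {
		if (!is_rsvd[i]) {	/* models put_pfn() returning 1 */
			unlocked++;
			/* only consult the pfn_list for pages that
			 * were actually accounted when pinned */
			if (in_pfn_list[i])
				locked++;
		}
	}
	return locked - unlocked;
}
```

Without the gating, a reserved page sitting in the pfn_list would bump locked but not unlocked, skewing the delta upward.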

...
>>
>> -static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> +static int vfio_iommu_type1_pin_pages(void *iommu_data,
>> + unsigned long *user_pfn,
>> + int npage, int prot,
>> + unsigned long *phys_pfn)
>> +{
>> + struct vfio_iommu *iommu = iommu_data;
>> + int i, j, ret;
>> + unsigned long remote_vaddr;
>> + unsigned long *pfn = phys_pfn;
>
> nit, why do we have this variable rather than using phys_pfn directly?
> Maybe before we had unwind support we incremented this rather than
> indexing it?
>

Updating.

...

>> + iova = user_pfn[i] << PAGE_SHIFT;
>> + dma = vfio_find_dma(iommu, iova, 0);
>> + if (!dma)
>> + goto unpin_exit;
>> + unlocked += vfio_unpin_page_external(dma, iova, do_accounting);
>> + }
>> +
>> +unpin_exit:
>> + mutex_unlock(&iommu->lock);
>> + return unlocked;
>
> This is not returning what it's supposed to. unlocked is going to be
> less than or equal to the number of pages unpinned. We don't need to
> track unlocked, I think we're just tracking where we are in the unpin
> array, whether it was partial or complete. I think we want:
>
> return i > npage ? npage : i;
>
> Or maybe we can make it more obvious if there's an error on the first
> unpin entry:
>
> return i > npage ? npage : (i > 0 ? i : -EINVAL);
>

Thanks for pointing that out. Updating.
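A tiny standalone model of the suggested return value, just to pin down the three cases (not the kernel function):

```c
#include <assert.h>
#include <errno.h>

/*
 * i is the loop index where unpinning stopped (== npage on full
 * success). Return the number of entries actually unpinned, clamped
 * to npage, or -EINVAL if the very first lookup failed.
 */
static int unpin_retval(int i, int npage)
{
	return i > npage ? npage : (i > 0 ? i : -EINVAL);
}
```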

...
>> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
>> +{
>> + struct rb_node *n, *p;
>> +
>> + n = rb_first(&iommu->dma_list);
>> + for (; n; n = rb_next(n)) {
>> + struct vfio_dma *dma;
>> + long locked = 0, unlocked = 0;
>> +
>> + dma = rb_entry(n, struct vfio_dma, node);
>> + unlocked += vfio_unmap_unpin(iommu, dma, false);
>> + p = rb_first(&dma->pfn_list);
>> + for (; p; p = rb_next(p))
>> + locked++;
>
> We don't know that these weren't reserved pages. If the vendor driver
> was pinning peer-to-peer ranges vfio_unmap_unpin() might have returned
> 0 yet we're assuming each is locked RAM, so our accounting can go
> upside down.
>

Adding an rsvd check here.

>> + vfio_lock_acct(dma->task, locked - unlocked);
>> + }
>> +}
>> +
>> +static void vfio_external_unpin_all(struct vfio_iommu *iommu,
>> + bool do_accounting)
>> +{
>> + struct rb_node *n, *p;
>> +
>> + n = rb_first(&iommu->dma_list);
>> + for (; n; n = rb_next(n)) {
>> + struct vfio_dma *dma;
>> +
>> + dma = rb_entry(n, struct vfio_dma, node);
>> + while ((p = rb_first(&dma->pfn_list))) {
>> + int unlocked;
>> + struct vfio_pfn *vpfn = rb_entry(p, struct vfio_pfn,
>> + node);
>> +
>> + unlocked = vfio_iova_put_vfio_pfn(dma, vpfn);
>> + if (do_accounting)
>> + vfio_lock_acct(dma->task, -unlocked);
>
> nit, we could batch these further, only updating accounting once per
> vfio_dma, or once entirely.
>

Ok.
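For the record, a toy standalone model of the batching idea, one accounting call per vfio_dma instead of one per vpfn (illustrative names, not the kernel code):

```c
#include <assert.h>

static long lock_acct_calls;	/* counts the expensive accounting updates */

/* Stand-in for vfio_lock_acct(): adjusts the task's locked_vm. */
static void lock_acct(long *locked_vm, long delta)
{
	lock_acct_calls++;
	*locked_vm += delta;
}

/* Per-vpfn accounting: one call per page unpinned. */
static void unpin_unbatched(long *locked_vm, const long *unlocked, long n)
{
	long i;

	for (i = 0; i < n; i++)
		lock_acct(locked_vm, -unlocked[i]);
}

/* Batched: accumulate the total and apply it once per vfio_dma. */
static void unpin_batched(long *locked_vm, const long *unlocked, long n)
{
	long i, total = 0;

	for (i = 0; i < n; i++)
		total += unlocked[i];
	lock_acct(locked_vm, -total);
}
```

Both end at the same locked_vm; the batched version just touches the accounting once per vfio_dma.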

Thanks,
Kirti