Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
From: Alex Williamson
Date: Sun Oct 23 2016 - 22:32:39 EST
On Fri, 21 Oct 2016 01:47:25 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> Alex,
>
> Addressing your comments other than invalidation part.
>
> On 10/20/2016 2:32 AM, Alex Williamson wrote:
> > On Tue, 18 Oct 2016 02:52:04 +0530
> > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >
> ...
> >> Tested by assigning below combinations of devices to a single VM:
> >> - GPU pass through only
> >> - vGPU device only
> >> - One GPU pass through and one vGPU device
> >> - Linux VM hot plug and unplug vGPU device while GPU pass through device
> >> exists
> >> - Linux VM hot plug and unplug GPU pass through device while vGPU device
> >> exists
> >
> > Were you able to do these with the locked memory limit of the user set
> > to the minimum required for existing GPU assignment?
> >
>
> No. Is there a way to set the locked memory limit through libvirt so
> that it would set the limit to the system memory assigned to the VM?
Not that I know of, but I also don't know how you're making use of an
mdev device through libvirt yet, since it doesn't support the vfio-pci
sysfsdev option. I would recommend testing with QEMU manually.
> ...
> >> + container = group->container;
> >> + if (IS_ERR(container)) {
> >
> > I don't see that we ever use an ERR_PTR to set group->container; it
> > should either be NULL or valid, and the fact that we added ourselves to
> > container_users should mean that it's valid. The paranoia test here
> > would be if container is NULL, but IS_ERR() doesn't check NULL. If we
> > need that paranoia test, maybe we should just:
> >
> > if (WARN_ON(!container)) {
> >
> > I'm not fully convinced it's needed though.
> >
>
> Ok removing this check.
>
> >> + ret = PTR_ERR(container);
> >> + goto err_pin_pages;
> >> + }
> >> +
> >> + down_read(&container->group_lock);
> >> +
> >> + driver = container->iommu_driver;
> >> + if (likely(driver && driver->ops->pin_pages))
> >> + ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> >> + npage, prot, phys_pfn);
> >
> > The caller is going to need to provide some means for us to call back
> > to invalidate pinned pages.
> >
> > ret has already been used, so it's zero at this point. I expect the
> > original intention was to let the initialization above fall through
> > here so that the caller gets an errno if the driver doesn't support
> > pin_pages. Returning zero without actually doing anything seems like
> > an unexpected return value.
> >
>
> Yes, changing it to:
>
> driver = container->iommu_driver;
> if (likely(driver && driver->ops->pin_pages))
> ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> npage, prot, phys_pfn);
> else
> ret = -EINVAL;
>
>
>
>
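Looks good. Equivalently, you could keep the fall-through style the
original initialization seemed to intend, e.g. (just a sketch):

	ret = -EINVAL;

	driver = container->iommu_driver;
	if (likely(driver && driver->ops->pin_pages))
		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
					     npage, prot, phys_pfn);

Either way the caller gets an errno rather than a silent zero when the
backend doesn't implement pin_pages.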
> >> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn)
> >> +{
> >> + struct vfio_pfn *p;
> >> + struct vfio_domain *domain = iommu->local_domain;
> >> + int ret = 1;
> >> +
> >> + if (!domain)
> >> + return 1;
> >> +
> >> + mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> + p = vfio_find_pfn(domain, pfn);
> >> + if (p)
> >> + ret = 0;
> >> +
> >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> + return ret;
> >> +}
> >
> > So if the vfio_pfn for a given pfn exists, return 0, else return 1.
> > But do we know that the vfio_pfn exists at the point where we actually
> > do that accounting?
> >
>
> Only the functions below call vfio_pfn_account():
> __vfio_pin_pages_remote() -> vfio_pfn_account()
> __vfio_unpin_pages_remote() -> vfio_pfn_account()
>
> Consider the case where an mdev device is already assigned to a VM,
> some app in the VM has pinned pages, and then a pass through device is
> hotplugged. __vfio_pin_pages_remote() is then called from
> vfio_iommu_replay() to pin all pages when the iommu capable domain is
> attached to the container. So if a vfio_pfn exists at this point, it
> means the page was pinned through the local_domain while no iommu
> capable domain was present, and accounting was already done for that
> page. Hence we return 0 here, which means: don't account this page
> again.
Right, I see that's the intention. I can't pick any holes in the
concept, but I'll keep looking for bugs.
> >> +
> >> struct vwork {
> >> struct mm_struct *mm;
> >> long npage;
> >> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
> >> kfree(vwork);
> >> }
> >>
> >> -static void vfio_lock_acct(long npage)
> >> +static void vfio_lock_acct(struct task_struct *task, long npage)
> >> {
> >> struct vwork *vwork;
> >> struct mm_struct *mm;
> >>
> >> - if (!current->mm || !npage)
> >> + if (!task->mm || !npage)
> >> return; /* process exited or nothing to do */
> >>
> >> - if (down_write_trylock(&current->mm->mmap_sem)) {
> >> - current->mm->locked_vm += npage;
> >> - up_write(&current->mm->mmap_sem);
> >> + if (down_write_trylock(&task->mm->mmap_sem)) {
> >> + task->mm->locked_vm += npage;
> >> + up_write(&task->mm->mmap_sem);
> >> return;
> >> }
> >>
> >> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage)
> >> vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> >> if (!vwork)
> >> return;
> >> - mm = get_task_mm(current);
> >> + mm = get_task_mm(task);
> >> if (!mm) {
> >> kfree(vwork);
> >> return;
> >> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot)
> >> return 0;
> >> }
> >
> > This conversion of vfio_lock_acct() to pass a task_struct and updating
> > existing callers to pass current would be a great separate, easily
> > review-able patch.
> >
>
> Ok. I'll split this in separate commit.
>
>
> >>
> >> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >> + int prot, unsigned long *pfn)
> >> {
> >> struct page *page[1];
> >> struct vm_area_struct *vma;
> >> + struct mm_struct *local_mm = (mm ? mm : current->mm);
> >> int ret = -EFAULT;
> >>
> >> - if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> >> + if (mm) {
> >> + down_read(&local_mm->mmap_sem);
> >> + ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> >> + !!(prot & IOMMU_WRITE), 0, page, NULL);
> >> + up_read(&local_mm->mmap_sem);
> >> + } else
> >> + ret = get_user_pages_fast(vaddr, 1,
> >> + !!(prot & IOMMU_WRITE), page);
> >> +
> >> + if (ret == 1) {
> >> *pfn = page_to_pfn(page[0]);
> >> return 0;
> >> }
> >>
> >> - down_read(&current->mm->mmap_sem);
> >> + down_read(&local_mm->mmap_sem);
> >>
> >> - vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> >> + vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
> >>
> >> if (vma && vma->vm_flags & VM_PFNMAP) {
> >> *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> >> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >> ret = 0;
> >> }
> >>
> >> - up_read(&current->mm->mmap_sem);
> >> + up_read(&local_mm->mmap_sem);
> >>
> >> return ret;
> >> }
> >
> > This would also be a great separate patch.
>
> Ok.
>
> > Have you considered
> > renaming the mm_struct function arg to "remote_mm" and making the local
> > variable simply "mm"? It seems like it would tie nicely with the
> > remote_mm path using get_user_pages_remote() while passing NULL for
> > remote_mm uses current->mm and the existing path (and avoid the general
> > oddness of passing local_mm to a "remote" function).
> >
>
> Yes, your suggestion looks good. Updating.
>
>
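Thanks. For reference, I'd expect it to end up looking roughly like
this (just a sketch of the rename, logic as in the quoted diff):

static int vaddr_get_pfn(struct mm_struct *remote_mm, unsigned long vaddr,
			 int prot, unsigned long *pfn)
{
	struct page *page[1];
	struct vm_area_struct *vma;
	/* NULL remote_mm means we operate on current, as before */
	struct mm_struct *mm = remote_mm ? remote_mm : current->mm;
	int ret = -EFAULT;

	if (remote_mm) {
		down_read(&mm->mmap_sem);
		ret = get_user_pages_remote(NULL, mm, vaddr, 1,
					    !!(prot & IOMMU_WRITE), 0,
					    page, NULL);
		up_read(&mm->mmap_sem);
	} else
		ret = get_user_pages_fast(vaddr, 1,
					  !!(prot & IOMMU_WRITE), page);

	if (ret == 1) {
		*pfn = page_to_pfn(page[0]);
		return 0;
	}

	/* Fall back to looking for a PFNMAP vma, as in the existing code */
	down_read(&mm->mmap_sem);
	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
	if (vma && vma->vm_flags & VM_PFNMAP) {
		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
		if (is_invalid_reserved_pfn(*pfn))
			ret = 0;
	}
	up_read(&mm->mmap_sem);

	return ret;
}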
> >> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >> * the iommu can only map chunks of consecutive pfns anyway, so get the
> >> * first page and all consecutive pages with the same locking.
> >> */
> >> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> >> - int prot, unsigned long *pfn_base)
> >> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu,
> >> + unsigned long vaddr, long npage,
> >> + int prot, unsigned long *pfn_base)
>
> ...
>
>
> >> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
> >> break;
> >> }
> >>
> >> + lock_acct += vfio_pfn_account(iommu, pfn);
> >> +
> >
> > I take it that this is the new technique for keeping the accounting
> > accurate: we only increment the locked accounting by the amount not
> > already pinned in a vfio_pfn.
> >
>
> That's correct.
>
>
> >> if (!rsvd && !lock_cap &&
> >> - current->mm->locked_vm + i + 1 > limit) {
> >> + current->mm->locked_vm + lock_acct > limit) {
> >> put_pfn(pfn, prot);
> >> pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> __func__, limit << PAGE_SHIFT);
> >> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
> >> }
> >>
> >> if (!rsvd)
> >> - vfio_lock_acct(i);
> >> + vfio_lock_acct(current, lock_acct);
> >>
> >> return i;
> >> }
> >>
> >> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> >> - int prot, bool do_accounting)
> >> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
> >> + unsigned long pfn, long npage, int prot,
> >> + bool do_accounting)
> >
> > Have you noticed that it's kind of confusing that
> > __vfio_{un}pin_pages_remote() uses current, which does a
> > get_user_pages_fast() while "local" uses a provided task_struct and
> > uses get_user_pages_*remote*()? And also what was effectively local
> > (i.e. we're pinning for our own use here) is now "remote", and pinning
> > for a remote, vendor driver consumer, is now "local". It's not very
> > intuitive.
> >
>
> 'local' in local_domain was suggested to describe the domain used for
> local page tracking. Earlier suggestions to have 'mdev' or 'noiommu' in
> this name were discarded. Maybe we should revisit what the name should
> be. Any suggestions?
>
> For the local_domain, the flow to pin pages is:
>
> for local_domain
> |- vfio_pin_pages()
> |- vfio_iommu_type1_pin_pages()
> |- __vfio_pin_page_local()
> |- vaddr_get_pfn(task->mm)
> |- get_user_pages_remote()
>
> __vfio_pin_page_local() --> get_user_pages_remote()
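Right, versus the flow for an iommu-backed domain, which if I'm reading
the patch correctly is:

for iommu capable domain
 |- vfio_dma_do_map()
    |- __vfio_pin_pages_remote()
       |- vaddr_get_pfn(NULL)
          |- get_user_pages_fast()

__vfio_pin_pages_remote() --> get_user_pages_fast()

So "remote" pinning uses the fast/current variant while "local" pinning
uses get_user_pages_remote(), which is exactly the inversion I find
unintuitive.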
In vfio.c we have the concept of an external user, perhaps that could
be continued here. An mdev driver's pinning would be an external, or
remote, pinning.
> >> static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> struct iommu_group *iommu_group)
> >> {
> >> struct vfio_iommu *iommu = iommu_data;
> >> - struct vfio_group *group, *g;
> >> + struct vfio_group *group;
> >> struct vfio_domain *domain, *d;
> >> struct bus_type *bus = NULL;
> >> int ret;
> >> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> mutex_lock(&iommu->lock);
> >>
> >> list_for_each_entry(d, &iommu->domain_list, next) {
> >> - list_for_each_entry(g, &d->group_list, next) {
> >> - if (g->iommu_group != iommu_group)
> >> - continue;
> >> + if (find_iommu_group(d, iommu_group)) {
> >> + mutex_unlock(&iommu->lock);
> >> + return -EINVAL;
> >> + }
> >> + }
> >
> > The find_iommu_group() conversion would also be an easy separate patch.
> >
>
> Ok.
>
> >>
> >> + if (iommu->local_domain) {
> >> + if (find_iommu_group(iommu->local_domain, iommu_group)) {
> >> mutex_unlock(&iommu->lock);
> >> return -EINVAL;
> >> }
> >> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >> if (ret)
> >> goto out_free;
> >>
> >> + if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> >> + (bus == &mdev_bus_type)) {
> >> + if (!iommu->local_domain) {
> >> + domain->local_addr_space =
> >> + kzalloc(sizeof(*domain->local_addr_space),
> >> + GFP_KERNEL);
> >> + if (!domain->local_addr_space) {
> >> + ret = -ENOMEM;
> >> + goto out_free;
> >> + }
> >> +
> >> + domain->local_addr_space->task = current;
> >> + INIT_LIST_HEAD(&domain->group_list);
> >> + domain->local_addr_space->pfn_list = RB_ROOT;
> >> + mutex_init(&domain->local_addr_space->pfn_list_lock);
> >> + iommu->local_domain = domain;
> >> + } else
> >> + kfree(domain);
> >> +
> >> + list_add(&group->next, &domain->group_list);
> >
> > I think you mean s/domain/iommu->local_domain/ here, we just freed
> > domain in the else path.
> >
>
> Yes, corrected.
>
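Good, i.e. presumably:

	list_add(&group->next, &iommu->local_domain->group_list);

so the group always lands on the surviving local domain's list.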
> >> + mutex_unlock(&iommu->lock);
> >> + return 0;
> >> + }
> >> +
> >> domain->domain = iommu_domain_alloc(bus);
> >> if (!domain->domain) {
> >> ret = -EIO;
> >> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> >> vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> >> }
> >>
> >> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
> >> +{
> >> + struct vfio_domain *domain = iommu->local_domain;
> >> + struct vfio_dma *dma, *tdma;
> >> + struct rb_node *n;
> >> + long locked = 0;
> >> +
> >> + rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
> >> + node) {
> >> + vfio_unmap_unpin(iommu, dma);
> >> + }
> >> +
> >> + mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> + n = rb_first(&domain->local_addr_space->pfn_list);
> >> +
> >> + for (; n; n = rb_next(n))
> >> + locked++;
> >> +
> >> + vfio_lock_acct(domain->local_addr_space->task, locked);
> >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +}
> >
> > Couldn't a properly timed mlock by the user allow them to lock more
> > memory than they're allowed here? For instance imagine the vendor
> > driver has pinned the entire VM memory and the user has exactly the
> > locked memory limit for that VM. During the gap here between unpinning
> > the entire vfio_dma list and re-accounting for the pfn_list, the user
> > can mlock up to their limit again and now they've doubled the locked
> > memory they're allowed.
> >
>
> As per the original code, vfio_unmap_unpin() calls
> __vfio_unpin_pages_remote(.., false) with do_accounting set to false.
> Why is that so?
Because vfio_dma tracks the user granularity of calling MAP_DMA, not
the granularity with which the iommu mapping was actually done. There
might be multiple non-contiguous chunks to make that mapping and we
don't know how the iommu chose to map a given chunk to support large
page sizes. If we chose to do accounting on the iommu_unmap()
granularity, we might account for every 4k page separately. We choose
not to do accounting there so that we can batch the accounting into one
update per range.
> If do_accounting were set to true there, then we wouldn't have to do
> the re-accounting here.
If vfio_unmap_unpin() did not do accounting, you could update
accounting once with the difference between what was pinned and what
remains pinned via the mdev and avoid the gap caused by de-accounting
everything and then re-accounting only for the mdev pinnings.
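Roughly something like this (just a sketch, assuming vfio_unmap_unpin()
were changed to skip accounting and return the number of pages it
unpinned):

static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
{
	struct vfio_domain *domain = iommu->local_domain;
	struct vfio_dma *dma, *tdma;
	struct rb_node *n;
	long locked = 0, unlocked = 0;

	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
					     node)
		unlocked += vfio_unmap_unpin(iommu, dma);

	mutex_lock(&domain->local_addr_space->pfn_list_lock);
	for (n = rb_first(&domain->local_addr_space->pfn_list); n;
	     n = rb_next(n))
		locked++;	/* pages that remain pinned via the mdev */
	mutex_unlock(&domain->local_addr_space->pfn_list_lock);

	/* single adjustment, no window where the memory is de-accounted */
	vfio_lock_acct(domain->local_addr_space->task, locked - unlocked);
}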
> >> +
> >> +static void vfio_local_unpin_all(struct vfio_domain *domain)
> >> +{
> >> + struct rb_node *node;
> >> +
> >> + mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> + while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> >> + vfio_unpin_pfn(domain,
> >> + rb_entry(node, struct vfio_pfn, node), false);
> >> +
> >> + mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +}
> >> +
> >> static void vfio_iommu_type1_detach_group(void *iommu_data,
> >> struct iommu_group *iommu_group)
> >> {
> >> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >>
> >> mutex_lock(&iommu->lock);
> >>
> >> - list_for_each_entry(domain, &iommu->domain_list, next) {
> >> - list_for_each_entry(group, &domain->group_list, next) {
> >> - if (group->iommu_group != iommu_group)
> >> - continue;
> >> -
> >> - iommu_detach_group(domain->domain, iommu_group);
> >> + if (iommu->local_domain) {
> >> + domain = iommu->local_domain;
> >> + group = find_iommu_group(domain, iommu_group);
> >> + if (group) {
> >> list_del(&group->next);
> >> kfree(group);
> >> - /*
> >> - * Group ownership provides privilege, if the group
> >> - * list is empty, the domain goes away. If it's the
> >> - * last domain, then all the mappings go away too.
> >> - */
> >> +
> >> if (list_empty(&domain->group_list)) {
> >> - if (list_is_singular(&iommu->domain_list))
> >> + vfio_local_unpin_all(domain);
> >> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> >> vfio_iommu_unmap_unpin_all(iommu);
> >> - iommu_domain_free(domain->domain);
> >> - list_del(&domain->next);
> >> kfree(domain);
> >> + iommu->local_domain = NULL;
> >> + }
> >
> >
> > I can't quite wrap my head around this: if we have mdev groups attached
> > and this iommu group matches an mdev group, remove it from the list and
> > free the group. If there are now no more groups in the mdev group list,
> > then for each vfio_pfn, unpin the pfn, /without/ doing accounting
> > updates
>
> Corrected the code to do accounting here.
>
> > and remove the vfio_pfn, but only if the ref_count is now
> > zero.
>
> Yes. If you look at the loop in vfio_local_unpin_all(), it iterates as
> long as nodes exist in the rb tree:
>
> >> + while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> >> + vfio_unpin_pfn(domain,
> >> + rb_entry(node, struct vfio_pfn, node), false);
> >> +
>
>
> and vfio_unpin_pfn() only removes the node from the rb tree when the
> ref count drops to zero:
>
> static int vfio_unpin_pfn(struct vfio_domain *domain,
> struct vfio_pfn *vpfn, bool do_accounting)
> {
> __vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot,
> do_accounting);
>
> if (atomic_dec_and_test(&vpfn->ref_count))
> vfio_remove_from_pfn_list(domain, vpfn);
>
> return 1;
> }
>
> So, for example, for a vfio_pfn whose ref_count is 2, the first
> iteration would:
> - call __vfio_unpin_page_local()
> - atomic_dec(ref_count), so now ref_count is 1, but the node is not
> removed from the rb tree.
>
> In the next iteration:
> - call __vfio_unpin_page_local()
> - atomic_dec(ref_count), so now ref_count is 0, and the node is removed
> from the rb tree.
Ok, I missed that, thanks.
> > We free the domain, so if the ref_count was non-zero we've now
> > just leaked memory. I think that means that if a vendor driver pins a
> > given page twice, that leak occurs. Furthermore, if there is not an
> > iommu capable domain in the container, we remove all the vfio_dma
> > entries as well, ok. Maybe the only issue is those leaked vfio_pfns.
> >
>
> So if the vendor driver pins a page twice, vfio_unpin_pfn() gets called
> twice, and the node is removed from the rb tree only when the ref count
> reaches zero. So there is no memory leak.
Ok