Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
From: Kirti Wankhede
Date: Thu Oct 20 2016 - 16:17:55 EST
Alex,
Addressing your comments other than invalidation part.
On 10/20/2016 2:32 AM, Alex Williamson wrote:
> On Tue, 18 Oct 2016 02:52:04 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>
...
>> Tested by assigning below combinations of devices to a single VM:
>> - GPU pass through only
>> - vGPU device only
>> - One GPU pass through and one vGPU device
>> - Linux VM hot plug and unplug vGPU device while GPU pass through device
>> exist
>> - Linux VM hot plug and unplug GPU pass through device while vGPU device
>> exist
>
> Were you able to do these with the locked memory limit of the user set
> to the minimum required for existing GPU assignment?
>
No, is there a way to set memory limit through livbirt so that it would
set memory limit to system memory assigned to VM?
>>
...
>> + container = group->container;
>> + if (IS_ERR(container)) {
>
> I don't see that we ever use an ERR_PTR to set group->container, it
> should either be NULL or valid and the fact that we added ourselves to
> container_users should mean that it's valid. The paranoia test here
> would be if container is NULL, but IS_ERR() doesn't check NULL. If we
> need that paranoia test, maybe we should just:
>
> if (WARN_ON(!container)) {
>
> I'm not fully convinced it's needed though.
>
Ok removing this check.
>> + ret = PTR_ERR(container);
>> + goto err_pin_pages;
>> + }
>> +
>> + down_read(&container->group_lock);
>> +
>> + driver = container->iommu_driver;
>> + if (likely(driver && driver->ops->pin_pages))
>> + ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> + npage, prot, phys_pfn);
>
> The caller is going to need to provide some means for us to callback to
> invalidate pinned pages.
>
> ret has already been used, so it's zero at this point. I expect the
> original intention was to let the initialization above fall through
> here so that the caller gets an errno if the driver doesn't support
> pin_pages. Returning zero without actually doing anything seems like
> an unexpected return value.
>
yes, changing it to:
driver = container->iommu_driver;
if (likely(driver && driver->ops->pin_pages))
ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
npage, prot, phys_pfn);
else
ret = -EINVAL;
>> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn)
>> +{
>> + struct vfio_pfn *p;
>> + struct vfio_domain *domain = iommu->local_domain;
>> + int ret = 1;
>> +
>> + if (!domain)
>> + return 1;
>> +
>> + mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> + p = vfio_find_pfn(domain, pfn);
>> + if (p)
>> + ret = 0;
>> +
>> + mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> + return ret;
>> +}
>
> So if the vfio_pfn for a given pfn exists, return 0, else return 1.
> But do we know that the vfio_pfn exists at the point where we actually
> do that accounting?
>
Only below functions call vfio_pfn_account()
__vfio_pin_pages_remote() -> vfio_pfn_account()
__vfio_unpin_pages_remote() -> vfio_pfn_account()
Consider the case when mdev device is already assigned to VM, run some
app in VM that pins some pages, then hotplug pass through device.
Then __vfio_pin_pages_remote() is called when iommu capable domain is
attached to container to pin all pages from vfio_iommu_replay(). So if
at this time vfio_pfn exist means that the page is pinned through
local_domain when iommu capable domain was not present, so accounting
was already done for that pages. Hence returned 0 here which mean don't
add this page in accounting.
>> +
>> struct vwork {
>> struct mm_struct *mm;
>> long npage;
>> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>> kfree(vwork);
>> }
>>
>> -static void vfio_lock_acct(long npage)
>> +static void vfio_lock_acct(struct task_struct *task, long npage)
>> {
>> struct vwork *vwork;
>> struct mm_struct *mm;
>>
>> - if (!current->mm || !npage)
>> + if (!task->mm || !npage)
>> return; /* process exited or nothing to do */
>>
>> - if (down_write_trylock(¤t->mm->mmap_sem)) {
>> - current->mm->locked_vm += npage;
>> - up_write(¤t->mm->mmap_sem);
>> + if (down_write_trylock(&task->mm->mmap_sem)) {
>> + task->mm->locked_vm += npage;
>> + up_write(&task->mm->mmap_sem);
>> return;
>> }
>>
>> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage)
>> vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>> if (!vwork)
>> return;
>> - mm = get_task_mm(current);
>> + mm = get_task_mm(task);
>> if (!mm) {
>> kfree(vwork);
>> return;
>> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot)
>> return 0;
>> }
>
> This coversion of vfio_lock_acct() to pass a task_struct and updating
> existing callers to pass current would be a great separate, easily
> review-able patch.
>
Ok. I'll split this in separate commit.
>>
>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> + int prot, unsigned long *pfn)
>> {
>> struct page *page[1];
>> struct vm_area_struct *vma;
>> + struct mm_struct *local_mm = (mm ? mm : current->mm);
>> int ret = -EFAULT;
>>
>> - if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> + if (mm) {
>> + down_read(&local_mm->mmap_sem);
>> + ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> + !!(prot & IOMMU_WRITE), 0, page, NULL);
>> + up_read(&local_mm->mmap_sem);
>> + } else
>> + ret = get_user_pages_fast(vaddr, 1,
>> + !!(prot & IOMMU_WRITE), page);
>> +
>> + if (ret == 1) {
>> *pfn = page_to_pfn(page[0]);
>> return 0;
>> }
>>
>> - down_read(¤t->mm->mmap_sem);
>> + down_read(&local_mm->mmap_sem);
>>
>> - vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
>> + vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>>
>> if (vma && vma->vm_flags & VM_PFNMAP) {
>> *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> ret = 0;
>> }
>>
>> - up_read(¤t->mm->mmap_sem);
>> + up_read(&local_mm->mmap_sem);
>>
>> return ret;
>> }
>
> This would also be a great separate patch.
Ok.
> Have you considered
> renaming the mm_struct function arg to "remote_mm" and making the local
> variable simply "mm"? It seems like it would tie nicely with the
> remote_mm path using get_user_pages_remote() while passing NULL for
> remote_mm uses current->mm and the existing path (and avoid the general
> oddness of passing local_mm to a "remote" function).
>
Yes, your suggestion looks good. Updating.
>> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> * the iommu can only map chunks of consecutive pfns anyway, so get the
>> * first page and all consecutive pages with the same locking.
>> */
>> -static long vfio_pin_pages(unsigned long vaddr, long npage,
>> - int prot, unsigned long *pfn_base)
>> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu,
>> + unsigned long vaddr, long npage,
>> + int prot, unsigned long *pfn_base)
...
>> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>> break;
>> }
>>
>> + lock_acct += vfio_pfn_account(iommu, pfn);
>> +
>
> I take it that this is the new technique for keeping the accounting
> accurate, we only increment the locked accounting by the amount not
> already pinned in a vfio_pfn.
>
That's correct.
>> if (!rsvd && !lock_cap &&
>> - current->mm->locked_vm + i + 1 > limit) {
>> + current->mm->locked_vm + lock_acct > limit) {
>> put_pfn(pfn, prot);
>> pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>> __func__, limit << PAGE_SHIFT);
>> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>> }
>>
>> if (!rsvd)
>> - vfio_lock_acct(i);
>> + vfio_lock_acct(current, lock_acct);
>>
>> return i;
>> }
>>
>> -static long vfio_unpin_pages(unsigned long pfn, long npage,
>> - int prot, bool do_accounting)
>> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
>> + unsigned long pfn, long npage, int prot,
>> + bool do_accounting)
>
> Have you noticed that it's kind of confusing that
> __vfio_{un}pin_pages_remote() uses current, which does a
> get_user_pages_fast() while "local" uses a provided task_struct and
> uses get_user_pages_*remote*()? And also what was effectively local
> (ie. we're pinning for our own use here) is now "remote" and pinning
> for a remote, vendor driver consumer, is now "local". It's not very
> intuitive.
>
'local' in local_domain was suggested to describe the domain for local
page tracking. Earlier suggestions to have 'mdev' or 'noimmu' in this
name were discarded. May be we should revisit what the name should be.
Any suggestion?
For local_domain, to pin pages, flow is:
for local_domain
|- vfio_pin_pages()
|- vfio_iommu_type1_pin_pages()
|- __vfio_pin_page_local()
|- vaddr_get_pfn(task->mm)
|- get_user_pages_remote()
__vfio_pin_page_local() --> get_user_pages_remote()
>> static int vfio_iommu_type1_attach_group(void *iommu_data,
>> struct iommu_group *iommu_group)
>> {
>> struct vfio_iommu *iommu = iommu_data;
>> - struct vfio_group *group, *g;
>> + struct vfio_group *group;
>> struct vfio_domain *domain, *d;
>> struct bus_type *bus = NULL;
>> int ret;
>> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>> mutex_lock(&iommu->lock);
>>
>> list_for_each_entry(d, &iommu->domain_list, next) {
>> - list_for_each_entry(g, &d->group_list, next) {
>> - if (g->iommu_group != iommu_group)
>> - continue;
>> + if (find_iommu_group(d, iommu_group)) {
>> + mutex_unlock(&iommu->lock);
>> + return -EINVAL;
>> + }
>> + }
>
> The find_iommu_group() conversion would also be an easy separate patch.
>
Ok.
>>
>> + if (iommu->local_domain) {
>> + if (find_iommu_group(iommu->local_domain, iommu_group)) {
>> mutex_unlock(&iommu->lock);
>> return -EINVAL;
>> }
>> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>> if (ret)
>> goto out_free;
>>
>> + if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
>> + (bus == &mdev_bus_type)) {
>> + if (!iommu->local_domain) {
>> + domain->local_addr_space =
>> + kzalloc(sizeof(*domain->local_addr_space),
>> + GFP_KERNEL);
>> + if (!domain->local_addr_space) {
>> + ret = -ENOMEM;
>> + goto out_free;
>> + }
>> +
>> + domain->local_addr_space->task = current;
>> + INIT_LIST_HEAD(&domain->group_list);
>> + domain->local_addr_space->pfn_list = RB_ROOT;
>> + mutex_init(&domain->local_addr_space->pfn_list_lock);
>> + iommu->local_domain = domain;
>> + } else
>> + kfree(domain);
>> +
>> + list_add(&group->next, &domain->group_list);
>
> I think you mean s/domain/iommu->local_domain/ here, we just freed
> domain in the else path.
>
Yes, corrected.
>> + mutex_unlock(&iommu->lock);
>> + return 0;
>> + }
>> +
>> domain->domain = iommu_domain_alloc(bus);
>> if (!domain->domain) {
>> ret = -EIO;
>> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>> vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>> }
>>
>> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
>> +{
>> + struct vfio_domain *domain = iommu->local_domain;
>> + struct vfio_dma *dma, *tdma;
>> + struct rb_node *n;
>> + long locked = 0;
>> +
>> + rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
>> + node) {
>> + vfio_unmap_unpin(iommu, dma);
>> + }
>> +
>> + mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> + n = rb_first(&domain->local_addr_space->pfn_list);
>> +
>> + for (; n; n = rb_next(n))
>> + locked++;
>> +
>> + vfio_lock_acct(domain->local_addr_space->task, locked);
>> + mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +}
>
> Couldn't a properly timed mlock by the user allow them to lock more
> memory than they're allowed here? For instance imagine the vendor
> driver has pinned the entire VM memory and the user has exactly the
> locked memory limit for that VM. During the gap here between unpinning
> the entire vfio_dma list and re-accounting for the pfn_list, the user
> can mlock up to their limit again an now they've doubled the locked
> memory they're allowed.
>
As per original code, vfio_unmap_unpin() calls
__vfio_unpin_pages_remote(.., false) with do_accounting set to false,
why is that so?
Here if accounting is set to true then we don't have to do re-accounting
here.
>> +
>> +static void vfio_local_unpin_all(struct vfio_domain *domain)
>> +{
>> + struct rb_node *node;
>> +
>> + mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> + while ((node = rb_first(&domain->local_addr_space->pfn_list)))
>> + vfio_unpin_pfn(domain,
>> + rb_entry(node, struct vfio_pfn, node), false);
>> +
>> + mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +}
>> +
>> static void vfio_iommu_type1_detach_group(void *iommu_data,
>> struct iommu_group *iommu_group)
>> {
>> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>>
>> mutex_lock(&iommu->lock);
>>
>> - list_for_each_entry(domain, &iommu->domain_list, next) {
>> - list_for_each_entry(group, &domain->group_list, next) {
>> - if (group->iommu_group != iommu_group)
>> - continue;
>> -
>> - iommu_detach_group(domain->domain, iommu_group);
>> + if (iommu->local_domain) {
>> + domain = iommu->local_domain;
>> + group = find_iommu_group(domain, iommu_group);
>> + if (group) {
>> list_del(&group->next);
>> kfree(group);
>> - /*
>> - * Group ownership provides privilege, if the group
>> - * list is empty, the domain goes away. If it's the
>> - * last domain, then all the mappings go away too.
>> - */
>> +
>> if (list_empty(&domain->group_list)) {
>> - if (list_is_singular(&iommu->domain_list))
>> + vfio_local_unpin_all(domain);
>> + if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>> vfio_iommu_unmap_unpin_all(iommu);
>> - iommu_domain_free(domain->domain);
>> - list_del(&domain->next);
>> kfree(domain);
>> + iommu->local_domain = NULL;
>> + }
>
>
> I can't quite wrap my head around this, if we have mdev groups attached
> and this iommu group matches an mdev group, remove from list and free
> the group. If there are now no more groups in the mdev group list,
> then for each vfio_pfn, unpin the pfn, /without/ doing accounting
> udpates
corrected the code to do accounting here.
> and remove the vfio_pfn, but only if the ref_count is now
> zero.
Yes, If you see the loop vfio_local_unpin_all(), it iterates till the
node in rb tree exist
>> + while ((node = rb_first(&domain->local_addr_space->pfn_list)))
>> + vfio_unpin_pfn(domain,
>> + rb_entry(node, struct vfio_pfn, node), false);
>> +
and vfio_unpin_pfn() only remove the node from rb tree if ref count is
zero.
static int vfio_unpin_pfn(struct vfio_domain *domain,
struct vfio_pfn *vpfn, bool do_accounting)
{
__vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot,
do_accounting);
if (atomic_dec_and_test(&vpfn->ref_count))
vfio_remove_from_pfn_list(domain, vpfn);
return 1;
}
so for example for a vfio_pfn ref_count is 2, first iteration would be:
- call __vfio_unpin_page_local()
- atomic_dec(ref_count), so now ref_count is 1, but node is not removed
from rb tree.
In next iteration:
- call __vfio_unpin_page_local()
- atomic_dec(ref_count), so now ref_count is 0, remove node from rb tree.
> We free the domain, so if the ref_count was non-zero we've now
> just leaked memory. I think that means that if a vendor driver pins a
> given page twice, that leak occurs. Furthermore, if there is not an
> iommu capable domain in the container, we remove all the vfio_dma
> entries as well, ok. Maybe the only issue is those leaked vfio_pfns.
>
So if vendor driver pins a page twice, vfio_unpin_pfn() would get called
twice and only when ref count is zero that node is removed from rb tree.
So there is no memory leak.
Kirti