Re: [PATCH] vfio dma_map/unmap: optimized for hugetlbfs pages

From: Maoming (maoming, Cloud Infrastructure Service Product Dept.)
Date: Thu Aug 13 2020 - 02:11:38 EST


Hi, Alex
Thanks for taking a look. Also, my apologies for the slow reply.
Some replies below:
-----Original Message-----
From: Zhoujian (jay)
Sent: Tuesday, July 21, 2020 21:28
To: Alex Williamson <alex.williamson@xxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; cohuck@xxxxxxxxxx; Maoming (maoming, Cloud Infrastructure Service Product Dept.) <maoming.maoming@xxxxxxxxxx>; Huangweidong (C) <weidong.huang@xxxxxxxxxx>; Peter Xu <peterx@xxxxxxxxxx>; Andrea Arcangeli <aarcange@xxxxxxxxxx>
Subject: RE: [PATCH] vfio dma_map/unmap: optimized for hugetlbfs pages

Thanks for taking a close look at the code, Alex.
We'll address your comments one by one ASAP.

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
> Sent: Tuesday, July 21, 2020 6:46 AM
> To: Zhoujian (jay) <jianjay.zhou@xxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx;
> cohuck@xxxxxxxxxx; Maoming (maoming, Cloud Infrastructure Service
> Product Dept.) <maoming.maoming@xxxxxxxxxx>; Huangweidong (C)
> <weidong.huang@xxxxxxxxxx>; Peter Xu <peterx@xxxxxxxxxx>; Andrea
> Arcangeli <aarcange@xxxxxxxxxx>
> Subject: Re: [PATCH] vfio dma_map/unmap: optimized for hugetlbfs pages
>
> On Mon, 20 Jul 2020 16:29:47 +0800
> Jay Zhou <jianjay.zhou@xxxxxxxxxx> wrote:
>
> > From: Ming Mao <maoming.maoming@xxxxxxxxxx>
> >
> > Hi all,
> > I'm working on starting lots of big size Virtual Machines(memory:
> > >128GB) with VFIO-devices. And I encounter a problem that is the
> > waiting time of starting all Virtual Machines is too long. I analyze
> > the startup log and find that the time of pinning/unpinning pages
> > could be reduced.
> >
> > In the original process, to make sure the pages are contiguous, we
> > have to check all pages one by one. I think maybe we can use
> > hugetlbfs pages which can skip this step.
> > So I create a patch to do this.
> > According to my test, the result of this patch is pretty well.
> >
> > Virtual Machine: 50G memory, 32 CPU, 1 VFIO-device, 1G hugetlbfs page
> >             original    after optimization
> > pin time    700ms       0.1ms
> >
> > I Suppose that:
> > 1)the hugetlbfs page should not be split 2)PG_reserved is not
> > relevant for hugetlbfs pages 3)we can delete the for loops and use
> > some operations (such as atomic_add,page_ref_add) instead
> >
> > please correct me if I am wrong.
> >
> > Thanks.
> >
> > Signed-off-by: Ming Mao <maoming.maoming@xxxxxxxxxx>
> > ---
> > drivers/vfio/vfio_iommu_type1.c | 236 ++++++++++++++++++++++++++++++--
> > include/linux/vfio.h | 20 +++
> > 2 files changed, 246 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 5e556ac91..42e25752e 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -415,6 +415,46 @@ static int put_pfn(unsigned long pfn, int prot)
> > return 0;
> > }
> >
> > +/*
> > + * put pfns for a hugetlbfs page
> > + * @start:the 4KB-page we start to put,can be any page in this
> > +hugetlbfs page
> > + * @npage:the number of 4KB-pages need to put
>
> This code supports systems where PAGE_SIZE is not 4KB.
>

Yes, I will fix this.


> > + * @prot:IOMMU_READ/WRITE
> > + */
> > +static int hugetlb_put_pfn(unsigned long start, unsigned int npage,
> > +int prot) {
> > + struct page *page = NULL;
> > + struct page *head = NULL;
>
> Unnecessary initialization.
>
Yes, you are right.



> > +
> > + if (!npage || !pfn_valid(start))
> > + return 0;
> > +
> > + page = pfn_to_page(start);
> > + if (!page || !PageHuge(page))
> > + return 0;
> > + head = compound_head(page);
> > + /*
> > + * The last page should be in this hugetlbfs page.
> > + * The number of putting pages should be equal to the number
> > + * of getting pages.So the hugepage pinned refcount and the normal
> > + * page refcount can not be smaller than npage.
> > + */
> > + if ((head != compound_head(pfn_to_page(start + npage - 1)))
> > + || (page_ref_count(head) < npage)
> > + || (compound_pincount(page) < npage))
> > + return 0;
> > +
> > + if ((prot & IOMMU_WRITE) && !PageDirty(page))
> > + set_page_dirty_lock(page);
> > +
> > + atomic_sub(npage, compound_pincount_ptr(head));
> > + if (page_ref_sub_and_test(head, npage))
> > + __put_page(head);
> > +
> > + mod_node_page_state(page_pgdat(head), NR_FOLL_PIN_RELEASED,
> npage);
> > + return 1;
> > +}
> > +
> > static int follow_fault_pfn(struct vm_area_struct *vma, struct
> > mm_struct
> *mm,
> > unsigned long vaddr, unsigned long *pfn,
> > bool write_fault)
> > @@ -479,6 +519,90 @@ static int vaddr_get_pfn(struct mm_struct *mm,
> unsigned long vaddr,
> > return ret;
> > }
> >
> > +struct vfio_hupetlbpage_info vfio_hugetlbpage_info[HUGE_MAX_HSTATE] = {
> > + {vfio_hugetlbpage_2M, PMD_SIZE, ~((1ULL << HPAGE_PMD_SHIFT) - 1)},
> > + {vfio_hugetlbpage_1G, PUD_SIZE, ~((1ULL << HPAGE_PUD_SHIFT) - 1)},
>
> Other architectures support more huge page sizes, also 0-day
> identified these #defines don't exist when THP is not configured. But
> why couldn't we figure out all of these form the compound_order()?
> order == shift, size = PAGE_SIZE << order.
>

Yes, compound_order() is better.
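
For the next version I am thinking of something along these lines (a rough
sketch; hugetlb_page_geometry() is just a placeholder name):

/*
 * Derive the hugetlbfs page size and mask from the compound page
 * itself, so that any architecture and huge page size work without
 * a fixed per-size table.
 */
static void hugetlb_page_geometry(struct page *page, unsigned long *size,
                                  unsigned long *mask)
{
        unsigned int order = compound_order(compound_head(page));

        *size = PAGE_SIZE << order;     /* size = PAGE_SIZE << order */
        *mask = ~(*size - 1);
}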


> > +};
> > +
> > +static bool is_hugetlbpage(unsigned long pfn, enum
> > +vfio_hugetlbpage_type *type) {
> > + struct page *page = NULL;
>
> Unnecessary initialization.
>
Yes, you are right.



> > +
> > + if (!pfn_valid(pfn) || !type)
> > + return false;
> > +
> > + page = pfn_to_page(pfn);
> > + /* only check for hugetlbfs pages */
> > + if (!page || !PageHuge(page))
> > + return false;
> > +
> > + switch (compound_order(compound_head(page))) {
> > + case PMD_ORDER:
> > + *type = vfio_hugetlbpage_2M;
> > + break;
> > + case PUD_ORDER:
> > + *type = vfio_hugetlbpage_1G;
> > + break;
> > + default:
> > + return false;
> > + }
> > +
> > + return true;
> > +}
> > +
> > +/* Is the addr in the last page in hugetlbfs pages? */ static bool
> > +hugetlb_is_last_page(unsigned long addr, enum vfio_hugetlbpage_type
> > +type) {
> > + unsigned int num = 0;
>
> Unnecessary initialization, and in fact unnecessary variable altogether.
> ie.
>
> return hugetlb_get_resdual_pages(addr & ~(PAGE_SIZE - 1), type) == 1;
>
>
> > +
> > + num = hugetlb_get_resdual_pages(addr & ~(PAGE_SIZE - 1), type);
>
> residual?
>

Yes, I will fix this.


> > +
> > + if (num == 1)
> > + return true;
> > + else
> > + return false;
> > +}
> > +
> > +static bool hugetlb_page_is_pinned(struct vfio_dma *dma,
> > + unsigned long start,
> > + unsigned long npages)
> > +{
> > + struct vfio_pfn *vpfn = NULL;
>
> Unnecessary initialization.
>
Yes, you are right.


> > + struct rb_node *node = rb_first(&dma->pfn_list);
> > + unsigned long end = start + npages - 1;
> > +
> > + for (; node; node = rb_next(node)) {
> > + vpfn = rb_entry(node, struct vfio_pfn, node);
> > +
> > + if ((vpfn->pfn >= start) && (vpfn->pfn <= end))
> > + return true;
>
> This function could be named better, it suggests the hugetlbfs page is
> pinned, but really we're only looking for any pfn_list pinnings overlapping the pfn range.
>

Yes, you are right.




> > + }
> > +
> > + return false;
> > +}
> > +
> > +static unsigned int hugetlb_get_contiguous_pages_num(struct
> > +vfio_dma
> *dma,
> > + unsigned long pfn,
> > + unsigned long resdual_npage,
> > + unsigned long max_npage)
> > +{
> > + unsigned int num = 0;
>
> Unnecessary initialization
>


Yes, you are right.



> > +
> > + if (!dma)
> > + return 0;
> > +
> > + num = resdual_npage < max_npage ? resdual_npage : max_npage;
>
> min(resdual_npage, max_npage)
>
>

Yes, min() is better.



> > + /*
> > + * If there is only one page, it is no need to optimize them.
>
> s/no need/not necessary/
>

I will fix this.

> > + * Maybe some pages have been pinned and inserted into
> > +dma->pfn_list by
> others.
> > + * In this case, we just goto the slow path simply.
> > + */
> > + if ((num < 2) || hugetlb_page_is_pinned(dma, pfn, num))
> > + return 0;
>
> Why does having pinnings in the pfn_list disqualify it from hugetlbfs pinning?
>
> Testing for the last page here is redundant to the pinning path,
> should it only be done here? Can num be zero?
>
> Maybe better to return -errno for this and above zero conditions
> rather than return a value that doesn't reflect the purpose of the function?
>


All pages found in dma->pfn_list have already been pinned and accounted
externally, so lock_acct should not be incremented for them again. I assumed
it would take a lot of time to figure out how many pages of a hugetlb page
have been pinned that way, but since dma->pfn_list is empty in most cases
the cost does not matter. So I will modify the code to count the number of
externally pinned pages instead of bailing out to the slow path.

The last page is a special case: the next page belongs to another hugetlb
page, and hugetlb pages may not be physically contiguous. But, as you said,
this check is redundant with the pinning path, so I will rewrite this
function.
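
Roughly along these lines (just a sketch; the helper name is a placeholder):

/*
 * Count how many pfns in [pfn, pfn + npages) were already pinned
 * externally via dma->pfn_list, so that they can be excluded from
 * lock_acct instead of forcing the whole range onto the slow path.
 */
static unsigned long hugetlb_external_pins(struct vfio_dma *dma,
                                           unsigned long pfn,
                                           unsigned long npages)
{
        struct rb_node *node;
        unsigned long pinned = 0;

        for (node = rb_first(&dma->pfn_list); node; node = rb_next(node)) {
                struct vfio_pfn *vpfn = rb_entry(node, struct vfio_pfn, node);

                if (vpfn->pfn >= pfn && vpfn->pfn - pfn < npages)
                        pinned++;
        }

        return pinned;
}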


> > +
> > + return num;
> > +}
> > +
> > /*
> > * Attempt to pin pages. We really don't want to track all the pfns and
> > * the iommu can only map chunks of consecutive pfns anyway, so get
> > the @@ -492,6 +616,7 @@ static long vfio_pin_pages_remote(struct
> > vfio_dma
> *dma, unsigned long vaddr,
> > long ret, pinned = 0, lock_acct = 0;
> > bool rsvd;
> > dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
> > + enum vfio_hugetlbpage_type type;
> >
> > /* This code path is only user initiated */
> > if (!current->mm)
> > @@ -521,6 +646,55 @@ static long vfio_pin_pages_remote(struct
> > vfio_dma
> *dma, unsigned long vaddr,
> > if (unlikely(disable_hugepages))
> > goto out;
> >
> > + /*
> > + * It is no need to get pages one by one for hugetlbfs pages.
>
> s/no need/not necessary/
>
I will fix this.


> > + * 4KB-pages in hugetlbfs pages are contiguous.
> > + * But if the vaddr is in the last 4KB-page, we just goto the slow path.
>
>
> s/4KB-/PAGE_SIZE /
>
I will fix this.


> Please explain the significance of vaddr being in the last PAGE_SIZE
> page of a hugetlbfs page. Is it simply that we should take the slow
> path for mapping a single page before the end of the hugetlbfs page?
> Is this optimization worthwhile? Isn't it more consistent to handle
> all mappings over hugetlbfs pages the same? Should we be operating on
> vaddr here for the hugetlbfs page alignment or base_pfn?
>


If vaddr is in the last page, the next page belongs to another hugetlb page.
In the new patch I will check the addresses of the hugetlb pages involved:
if vaddr is in the last page of a hugetlb page and the next hugetlb page is
physically contiguous, I will still take the fast path.

Yes, the vaddr should be aligned.
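
The boundary check could be a sketch like this (using compound_order() as
discussed above):

/*
 * The fast path is still safe across a hugetlbfs page boundary if the
 * next hugetlbfs page starts right where the current one ends.
 */
static bool hugetlb_next_page_contiguous(struct page *page)
{
        struct page *head = compound_head(page);
        unsigned long next_pfn = page_to_pfn(head) +
                                 (1UL << compound_order(head));
        struct page *next;

        if (!pfn_valid(next_pfn))
                return false;

        next = pfn_to_page(next_pfn);
        /* must be the head page of the neighbouring hugetlbfs page */
        return PageHuge(next) && PageHead(next);
}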


> > + */
> > + if (is_hugetlbpage(*pfn_base, &type) &&
> > +!hugetlb_is_last_page(vaddr,
> type)) {
> > + unsigned long hugetlb_resdual_npage = 0;
> > + unsigned long contiguous_npage = 0;
> > + struct page *head = NULL;
> > +
> > + hugetlb_resdual_npage =
> > + hugetlb_get_resdual_pages((vaddr + PAGE_SIZE) & ~(PAGE_SIZE -
> 1),
> > +type);
>
> ~(PAGE_SIZE - 1) is PAGE_MASK, but that whole operation looks like a
> PAGE_ALIGN(vaddr)
>
> This is trying to get the number of pages after this page to the end
> of the hugetlbfs page, right? Rather than the hugetlb_is_last_page()
> above, shouldn't we have recorded the number of pages there and used math to get this value?
>

Yes, you are right. I will fix this.


> > + /*
> > + * Maybe the hugetlb_resdual_npage is invalid.
> > + * For example, hugetlb_resdual_npage > (npage - 1) or
> > + * some pages of this hugetlbfs page have been pinned.
> > + */
> > + contiguous_npage = hugetlb_get_contiguous_pages_num(dma,
> *pfn_base + 1,
> > + hugetlb_resdual_npage, npage - 1);
> > + if (!contiguous_npage)
> > + goto slow_path;
> > +
> > + /*
> > + * Unlike THP, the splitting should not happen for hugetlbfs pages.
> > + * Since PG_reserved is not relevant for compound pages, and the
> > +pfn
> of
> > + * 4KB-page which in hugetlbfs pages is valid,
>
> s/4KB/PAGE_SIZE/
>

I will fix this.

> > + * it is no need to check rsvd for hugetlbfs pages.
>
> s/no need/not necessary/
>

I will fix this.

> > + */
> > + if (!dma->lock_cap &&
> > + current->mm->locked_vm + lock_acct + contiguous_npage >
> > +limit)
> {
> > + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > + __func__, limit << PAGE_SHIFT);
> > + ret = -ENOMEM;
> > + goto unpin_out;
> > + }
> > + /*
> > + * We got a hugetlbfs page using vaddr_get_pfn alreadly.
> > + * In this case,we do not need to alloc pages and we can finish all
> > + * work by a single operation to the head page.
> > + */
> > + lock_acct += contiguous_npage;
> > + head = compound_head(pfn_to_page(*pfn_base));
> > + atomic_add(contiguous_npage, compound_pincount_ptr(head));
> > + page_ref_add(head, contiguous_npage);
> > + mod_node_page_state(page_pgdat(head), NR_FOLL_PIN_ACQUIRED,
> contiguous_npage);
> > + pinned += contiguous_npage;
> > + goto out;
>
> I'm hoping Peter or Andrea understand this, but I think we still have
> pfn_base pinned separately and I don't see that we've done an unpin
> anywhere, so are we leaking the pin of the first page??
>

Do you mean a sequence like this?
(1) ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base)
(2) rsvd is true
(3) something goes wrong and we goto unpin_out
(4) put_pfn() is skipped for pfn_base because of rsvd



> > + }
> > +slow_path:
> > /* Lock all the consecutive pages from pfn_base */
> > for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
> > pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) { @@ -569,7
> > +743,30 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma,
> > dma_addr_t iova, {
> > long unlocked = 0, locked = 0;
> > long i;
> > + enum vfio_hugetlbpage_type type;
> > +
> > + if (is_hugetlbpage(pfn, &type)) {
> > + unsigned long hugetlb_resdual_npage = 0;
> > + unsigned long contiguous_npage = 0;
>
> Unnecessary initialization...
>
> >

Yes, you are right.

> > + hugetlb_resdual_npage = hugetlb_get_resdual_pages(iova &
> > +~(PAGE_SIZE - 1), type);
>
> PAGE_MASK
>

I will fix this.


> Like above, is it pfn or iova that we should be using when looking at
> hugetlbfs alignment?
>

Yes, I will fix this.
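
For example, the residual pages could be computed from the pfn side directly
(a sketch):

/* residual PAGE_SIZE pages from pfn to the end of its hugetlbfs page */
static unsigned long hugetlb_residual_pages(unsigned long pfn)
{
        struct page *head = compound_head(pfn_to_page(pfn));
        unsigned long nr = 1UL << compound_order(head);

        return nr - (pfn - page_to_pfn(head));
}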



> > + contiguous_npage = hugetlb_get_contiguous_pages_num(dma, pfn,
> > + hugetlb_resdual_npage, npage);
> > + /*
> > + * There is not enough contiguous pages or this hugetlbfs page
> > + * has been pinned.
> > + * Let's try the slow path.
> > + */
> > + if (!contiguous_npage)
> > + goto slow_path;
> > +
> > + /* try the slow path if failed */
> > + if (hugetlb_put_pfn(pfn, contiguous_npage, dma->prot)) {
> > + unlocked = contiguous_npage;
> > + goto out;
> > + }
>
> Should probably break the pin path into a separate get_pfn function
> for symmetry.
>
>

Yes, I will fix this.
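
Something like this, mirroring hugetlb_put_pfn() (a sketch only):

/* take npage references on the hugetlbfs page containing pfn */
static int hugetlb_get_pfn(unsigned long pfn, unsigned int npage)
{
        struct page *head;

        if (!npage || !pfn_valid(pfn))
                return 0;

        head = compound_head(pfn_to_page(pfn));
        if (!PageHuge(head))
                return 0;

        atomic_add(npage, compound_pincount_ptr(head));
        page_ref_add(head, npage);
        mod_node_page_state(page_pgdat(head), NR_FOLL_PIN_ACQUIRED, npage);

        return 1;
}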


> > + }
> > +slow_path:
> > for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > if (put_pfn(pfn++, dma->prot)) {
> > unlocked++;
> > @@ -578,6 +775,7 @@ static long vfio_unpin_pages_remote(struct
> > vfio_dma
> *dma, dma_addr_t iova,
> > }
> > }
> >
> > +out:
> > if (do_accounting)
> > vfio_lock_acct(dma, locked - unlocked, true);
> >
> > @@ -867,6 +1065,7 @@ static long vfio_unmap_unpin(struct vfio_iommu
> *iommu, struct vfio_dma *dma,
> > struct iommu_iotlb_gather iotlb_gather;
> > int unmapped_region_cnt = 0;
> > long unlocked = 0;
> > + enum vfio_hugetlbpage_type type;
> >
> > if (!dma->size)
> > return 0;
> > @@ -900,16 +1099,33 @@ static long vfio_unmap_unpin(struct
> > vfio_iommu
> *iommu, struct vfio_dma *dma,
> > continue;
> > }
> >
> > - /*
> > - * To optimize for fewer iommu_unmap() calls, each of which
> > - * may require hardware cache flushing, try to find the
> > - * largest contiguous physical memory chunk to unmap.
> > - */
> > - for (len = PAGE_SIZE;
> > - !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
> > - next = iommu_iova_to_phys(domain->domain, iova + len);
> > - if (next != phys + len)
> > - break;
> > + if (is_hugetlbpage((phys >> PAGE_SHIFT), &type)
> > + && (!domain->fgsp)) {
>
> Reverse the order of these tests.
>
OK, I will fix this.


> > + unsigned long hugetlb_resdual_npage = 0;
> > + unsigned long contiguous_npage = 0;
>
>
> Unnecessary...
>

Yes, I will fix this.

> > + hugetlb_resdual_npage =
> > + hugetlb_get_resdual_pages(iova & ~(PAGE_SIZE - 1), type);
>
> PAGE_MASK
>

Yes, I will fix this.


> > + /*
> > + * The number of contiguous page can not be larger than
> dma->size
> > + * which is the number of pages pinned.
> > + */
> > + contiguous_npage = ((dma->size >> PAGE_SHIFT) >
> hugetlb_resdual_npage) ?
> > + hugetlb_resdual_npage : (dma->size >> PAGE_SHIFT);
>
> min()
>


Yes, min() is better.
> > +
> > + len = contiguous_npage * PAGE_SIZE;
> > + } else {
> > + /*
> > + * To optimize for fewer iommu_unmap() calls, each of which
> > + * may require hardware cache flushing, try to find the
> > + * largest contiguous physical memory chunk to unmap.
> > + */
> > + for (len = PAGE_SIZE;
> > + !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
> > + next = iommu_iova_to_phys(domain->domain, iova + len);
> > + if (next != phys + len)
> > + break;
> > + }
> > }
> >
> > /*
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h index
> > 38d3c6a8d..91ef2058f 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -214,4 +214,24 @@ extern int vfio_virqfd_enable(void *opaque,
> > void *data, struct virqfd **pvirqfd, int fd); extern void
> > vfio_virqfd_disable(struct virqfd **pvirqfd);
> >
> > +enum vfio_hugetlbpage_type {
> > + vfio_hugetlbpage_2M,
> > + vfio_hugetlbpage_1G,
> > +};
> > +
> > +struct vfio_hupetlbpage_info {
> > + enum vfio_hugetlbpage_type type;
>
> The enum within the structure serves no purpose.
>

Yes, I will delete it.


> > + unsigned long size;
> > + unsigned long mask;
> > +};
> > +
> > +#define PMD_ORDER 9
> > +#define PUD_ORDER 18
>
> Architecture specific.
>

I will use compound_order() instead.

> > +/*
> > + * get the number of resdual 4KB-pages in a hugetlbfs page
> > + * (including the page which pointed by this address)
>
> s/4KB/PAGE_SIZE/
>

Yes, I will fix this.


> > + */
> > +#define hugetlb_get_resdual_pages(address, type) \
> > + ((vfio_hugetlbpage_info[type].size \
> > + - (address & ~vfio_hugetlbpage_info[type].mask)) >> PAGE_SHIFT)
> > #endif /* VFIO_H */