Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist

From: Robin Murphy
Date: Tue Nov 06 2018 - 13:27:44 EST

Next message: Nicolas Saenz Julienne: "Re: [PATCH RFC 09/18] staging: vchiq_core: do not initialize semaphores twice"
Previous message: Mark Brown: "Re: [PATCH] sh: Provide prototypes for PCI I/O mapping in asm/io.h"
In reply to: Robin Murphy: "Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist"
Next in thread: Nicolin Chen: "Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 02/11/2018 23:36, Nicolin Chen wrote:

On Fri, Nov 02, 2018 at 04:54:07PM +0000, Robin Murphy wrote:

On 01/11/2018 21:35, Nicolin Chen wrote:

The __GFP_ZERO flag is passed down to the generic page allocation
routine, which zeroes everything page by page. That is safe as a
generic approach, but inefficient for an IOMMU allocation that
organizes contiguous pages using a scatterlist.

So this change drops __GFP_ZERO from the flags and adds a manual
memset after the page/sg allocations, using the length of each
scatterlist entry.

My test of a 2.5MB allocation shows iommu_dma_alloc() taking 46%
less time, reduced from an average of 925 usec to 500 usec.
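
A rough userspace sketch of the two zeroing strategies being compared, not the kernel code itself. SKETCH_PAGE_SIZE, struct chunk, zero_by_page(), zero_by_chunk() and all_zero() are hypothetical names: they stand in for the page size, a scatterlist entry, the page-at-a-time clear and the per-entry memset. Both paths produce the same zeroed memory; the sketch says nothing about where the measured time actually goes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one scatterlist entry: a contiguous run. */
struct chunk {
	unsigned char *addr;
	size_t length; /* bytes, a multiple of SKETCH_PAGE_SIZE */
};

/* Page-at-a-time zeroing: one clear per page. */
static void zero_by_page(unsigned char *buf, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		memset(buf + i * SKETCH_PAGE_SIZE, 0, SKETCH_PAGE_SIZE);
}

/* One memset per contiguous run, sized by the run's length. */
static void zero_by_chunk(const struct chunk *c, size_t nchunks)
{
	assert(c != NULL || nchunks == 0);
	for (size_t i = 0; i < nchunks; i++)
		memset(c[i].addr, 0, c[i].length);
}

/* Helper for checking the result: returns 1 if every byte is zero. */
static int all_zero(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}
```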

Assuming this is for arm64, I'm somewhat surprised that memset() could be
that much faster than clear_page(), since they should effectively amount to
the same thing (a DC ZVA loop). What hardware is this on? Profiling to try

I am running with tegra186-p2771-0000.dtb so it's arm64 yes.

and see exactly where the extra time goes would be interesting too.

I re-ran the test with finer-grained timing inside the function and got:
1) pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
   // reduced from 422 usec to 56 usec == 366 usec less
2) if (!(prot & IOMMU_CACHE)) {...} // flush routine
   // reduced from 439 usec to 236 usec == 203 usec less
Note: the new memset takes about 164 usec, resulting in a diff of
about 400 usec for the entire iommu_dma_alloc() function call.
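
The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in the message; the enum and function names are made up for illustration.

```c
#include <assert.h>

/* Timings quoted in the message, in microseconds. */
enum {
	ALLOC_BEFORE = 422, ALLOC_AFTER = 56,  /* __iommu_dma_alloc_pages() */
	FLUSH_BEFORE = 439, FLUSH_AFTER = 236, /* flush routine */
	MEMSET_COST  = 164                     /* newly added memset */
};

/* Net saving: what the two timed steps gain, minus the new memset. */
static int net_saving_usec(void)
{
	int alloc_saved = ALLOC_BEFORE - ALLOC_AFTER; /* 366 */
	int flush_saved = FLUSH_BEFORE - FLUSH_AFTER; /* 203 */

	assert(alloc_saved > 0 && flush_saved > 0);
	return alloc_saved + flush_saved - MEMSET_COST; /* 405 */
}
```

The result, 405 usec, is consistent with the roughly 400 usec difference reported for the whole call.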

This looks like more than the difference between clear_page() and
memset(), so it might be related to mapping and caching. Any idea?

Hmm, I guess it might not be so much clear_page() itself as all the gubbins involved in getting there from prep_new_page(). I could perhaps make some vague guesses about how the A57 cores might get tickled by the different code patterns, but the Denver cores are well beyond my ability to reason about. Out of even further curiosity, how does the quick hack below compare?

@@ -568,6 +571,15 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 	if (attrs & DMA_ATTR_ALLOC_SINGLE_PAGES)
 		alloc_sizes = min_size;
+	/*
+	 * The generic zeroing in a length of one page size is slow,
+	 * so do it manually in a length of scatterlist size instead
+	 */
+	if (gfp & __GFP_ZERO) {
+		gfp &= ~__GFP_ZERO;
+		gfp_zero = true;
+	}

Or just mask it out in __iommu_dma_alloc_pages()?

Yeah, the change here would be neater then.

@@ -581,6 +593,12 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
 		goto out_free_iova;
+	if (gfp_zero) {
+		/* Now zero all the pages in the scatterlist */
+		for_each_sg(sgt.sgl, s, sgt.orig_nents, i)
+			memset(sg_virt(s), 0, s->length);

What if the pages came from highmem? I know that doesn't happen on arm64
today, but the point of this code *is* to be generic, and other users will
arrive eventually.

Hmm, so it should probably use sg_miter_start/stop() too? Given
that the flush routine works in PAGE_SIZE steps per iteration,
would it be possible to map and memset contiguous pages together?
Actually, the flush routine might also be optimized if we can map
contiguous pages.
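
To put a number on the difference between a page-at-a-time walk and a per-entry walk, here is a small counting sketch. It assumes 4 KiB pages and a made-up split of the 2.5MB buffer into two contiguous runs; struct chunk, steps_per_page() and steps_per_chunk() are hypothetical names, and the real scatterlist layout on the test system is not known from this thread.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages */

/* Hypothetical stand-in for a scatterlist entry; only its length matters here. */
struct chunk {
	size_t length; /* bytes */
};

/* Iterations for a walk that advances at most one page per step. */
static size_t steps_per_page(const struct chunk *c, size_t n)
{
	size_t steps = 0;

	assert(c != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		steps += (c[i].length + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
	return steps;
}

/* Iterations for a walk that handles each contiguous run in one go. */
static size_t steps_per_chunk(const struct chunk *c, size_t n)
{
	(void)c; /* one step per entry, regardless of its length */
	return n;
}
```

With a 2 MiB run plus a 512 KiB run, the page-at-a-time walk takes 640 steps and the per-entry walk takes 2.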

I suppose the ideal point at which to do it would be after the remapping when we have the entire buffer contiguous in vmalloc space and can make best use of prefetchers etc. - DMA_ATTR_NO_KERNEL_MAPPING is a bit of a spanner in the works, but we could probably accommodate a special case for that. As Christoph points out, this isn't really the place to be looking for performance anyway (unless it's pathologically bad as per the DMA_ATTR_ALLOC_SINGLE_PAGES fun), but if we're looking at pulling the remapping out of the arch code, maybe we could aim to rework the zeroing completely as part of that.

Robin.

----->8-----
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b04753b204..7d28db3bf4bf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -569,7 +569,7 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 		alloc_sizes = min_size;

 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp & ~__GFP_ZERO);
 	if (!pages)
 		return NULL;

@@ -581,15 +581,18 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
 		goto out_free_iova;

-	if (!(prot & IOMMU_CACHE)) {
+	{
 		struct sg_mapping_iter miter;
 		/*
 		 * The CPU-centric flushing implied by SG_MITER_TO_SG isn't
 		 * sufficient here, so skip it by using the "wrong" direction.
 		 */
 		sg_miter_start(&miter, sgt.sgl, sgt.orig_nents, SG_MITER_FROM_SG);
-		while (sg_miter_next(&miter))
+		while (sg_miter_next(&miter)) {
+			clear_page(miter.addr);
+			if (!(prot & IOMMU_CACHE))
 			flush_page(dev, miter.addr, page_to_phys(miter.page));
+		}
 		sg_miter_stop(&miter);
 	}
