Re: Intel IOMMU (and IOMMU for Virtualization) performances

From: Muli Ben-Yehuda
Date: Fri Jun 06 2008 - 16:22:23 EST


On Wed, Jun 04, 2008 at 11:06:15AM -0700, Grant Grundler wrote:
> On Wed, Jun 4, 2008 at 7:47 AM, FUJITA Tomonori
> <fujita.tomonori@xxxxxxxxxxxxx> wrote:
> ...
> > Now I'm trying to fix the Intel IOMMU code's free space management
> > algorithm.
> >
> > The major difference between the Intel IOMMU code and the others is
> > that the Intel IOMMU code uses a Red-Black tree to manage free space
> > while the others use a bitmap (swiotlb is the only exception).
> >
> > The Red-Black tree method consumes less memory than the bitmap method,
> > but it incurs more overhead (the RB tree method needs to walk the
> > tree, allocate a new item, and insert it every time it maps an I/O
> > address). The Intel IOMMU (and IOMMUs for virtualization) needs
> > multiple IOMMU address spaces. That's why the Red-Black tree method
> > was chosen, I guess.
>
> It's possible to split up one flat address space and share the IOMMU
> among several users. Each user gets her own segment of the bitmap and
> a corresponding IO Pdir. So I don't see allocation policy as a strong
> reason to use a Red/Black tree.

Do you mean multiple users sharing the same I/O address space (with
each user using a different segment), or multiple users, each with its
own I/O address space but confined to a specific segment of that
space, with a single bitmap representing free space across all
segments? If the former, you lose some of the benefit of the IOMMU,
since all users can DMA to other users' areas (same I/O address
space). If the latter, having a bitmap per I/O address space seems
simpler and would have the same memory consumption.
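
To make the per-address-space option concrete, a bitmap allocator per
domain could look roughly like the sketch below. This is user-space C,
and every name in it (iova_space, iova_alloc, iova_free) is made up
for illustration; it is not the actual Intel IOMMU driver interface,
and real code would also need locking.

/*
 * Rough sketch of a per-I/O-address-space bitmap allocator, one bitmap
 * per domain.  Illustrative names only, not the real Intel IOMMU code.
 */
#include <stdlib.h>

#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct iova_space {
	unsigned long *bitmap;	/* one bit per IOVA page, 1 = in use */
	unsigned long nr_pages;	/* pages covered by this address space */
};

static int iova_space_init(struct iova_space *s, unsigned long nr_pages)
{
	unsigned long words = (nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;

	s->bitmap = calloc(words, sizeof(unsigned long));
	if (!s->bitmap)
		return -1;
	s->nr_pages = nr_pages;
	return 0;
}

static int page_busy(const struct iova_space *s, unsigned long i)
{
	return (s->bitmap[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL;
}

static void page_set(struct iova_space *s, unsigned long i, int busy)
{
	if (busy)
		s->bitmap[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
	else
		s->bitmap[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}

/* First-fit allocation of nr contiguous IOVA pages; returns -1 if full. */
static long iova_alloc(struct iova_space *s, unsigned long nr)
{
	unsigned long start, i;

	for (start = 0; start + nr <= s->nr_pages; start++) {
		for (i = 0; i < nr; i++)
			if (page_busy(s, start + i))
				break;
		if (i == nr) {			/* found a free run */
			for (i = 0; i < nr; i++)
				page_set(s, start + i, 1);
			return (long)start;
		}
		start += i;			/* skip past the busy page */
	}
	return -1;
}

static void iova_free(struct iova_space *s, unsigned long start,
		      unsigned long nr)
{
	while (nr--)
		page_set(s, start++, 0);
}

At one bit per 4KB page that is about 32KB of bitmap per 1GB of IOVA
space per domain, which is the memory cost the RB tree avoids by only
tracking allocated ranges.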

> > I got the following results with one thread issuing 1KB I/Os:
> >
> > IOPS (I/O per second)
> > IOMMU disabled       145253.1 (1.000)
> > RB tree (mainline)   118313.0 (0.814)
> > Bitmap               128954.1 (0.887)
>
> Just to make this clear, this is a 10% performance difference.
>
> But a second metric is more telling: CPU utilization. How much time
> was spent in the IOMMU code for each implementation with the same
> workload?
>
> This isn't a demand for that information but just a request to
> measure that in any future benchmarking. oprofile or perfmon2 are
> the best tools to determine that.

Agreed, CPU utilization would be very interesting here.
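
As a coarse complement to oprofile/perfmon2, even wrapping the
benchmark in getrusage() shows how much user/system time each
configuration burns for the same number of I/Os. A minimal sketch;
run_benchmark() here is only a placeholder for the 1KB-I/O workload,
and any unmap work done in interrupt context will not be attributed
this way:

/*
 * Compare user/system CPU time across the IOMMU-disabled, RB-tree and
 * bitmap runs.  Only a rough total; oprofile/perfmon2 give the
 * per-function breakdown.
 */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

static void run_benchmark(void)
{
	/* placeholder: issue the 1KB I/Os here */
}

static double secs(struct timeval tv)
{
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	struct rusage before, after;

	getrusage(RUSAGE_SELF, &before);
	run_benchmark();
	getrusage(RUSAGE_SELF, &after);

	printf("user %.3fs  system %.3fs\n",
	       secs(after.ru_utime) - secs(before.ru_utime),
	       secs(after.ru_stime) - secs(before.ru_stime));
	return 0;
}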

> Just as important as the allocation data structure is the allocation
> policy. The allocation policy will perform best if it matches the
> IO TLB replacement policy implemented in the IOMMU HW. Thrashing the
> IO TLB by allocating aliases to competing streams will hurt
> performance as well. Obviously a single benchmark is unlikely to
> detect this.

Is there a public description of the caching policies of currently
available VT-d hardware?

> I've never been able to come up with a good heuristic for
> determining the size of the IOVA space. It generally does NOT need
> to map all of Host Physical RAM. The actual requirement depends
> entirely on the workload and on the type and number of IO devices
> installed. The problem is that we don't know any of those things
> until well after the IOMMU is already needed.

Why not do what hash-table implementations do: start small and resize
when we approach half-full?
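
A minimal sketch of that resize-on-half-full idea, reusing the
illustrative iova_space structure and BITS_PER_WORD from the sketch
above (again not real driver code; a real implementation would need
locking and is bounded by the address width the IOMMU hardware
supports):

#include <stdlib.h>
#include <string.h>

/* 'used' is the number of pages currently allocated, which the
 * allocator would have to track. */
static int iova_maybe_grow(struct iova_space *s, unsigned long used)
{
	unsigned long new_pages, old_words, new_words;
	unsigned long *new_bitmap;

	if (2 * used < s->nr_pages)
		return 0;			/* still under half full */

	new_pages = 2 * s->nr_pages;		/* double the IOVA space */
	old_words = (s->nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;
	new_words = (new_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;

	new_bitmap = calloc(new_words, sizeof(unsigned long));
	if (!new_bitmap)
		return -1;

	/* Existing mappings keep their IOVAs; only the free space grows. */
	memcpy(new_bitmap, s->bitmap, old_words * sizeof(unsigned long));
	free(s->bitmap);
	s->bitmap = new_bitmap;
	s->nr_pages = new_pages;
	return 0;
}

Unlike rehashing, nothing has to be remapped when the space grows:
live mappings keep their IOVAs and only the allocator's view of free
space changes, so the cost is a single bitmap copy.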

Cheers,
Muli