Re: [PATCH v1 4/4] iommu/hyperv: Add page-selective IOTLB flush support

From: Jacob Pan

Date: Wed May 20 2026 - 16:40:57 EST

Hi Michael,

On Wed, 20 May 2026 19:26:24 +0000
Michael Kelley <mhklinux@xxxxxxxxxxx> wrote:

> From: Yu Zhang <zhangyu1@xxxxxxxxxxxxxxxxxxx> Sent: Wednesday, May
> 20, 2026 10:15 AM
> >
> > On Fri, May 15, 2026 at 07:35:45PM -0300, Jason Gunthorpe wrote:
> > > On Tue, May 12, 2026 at 12:24:08AM +0800, Yu Zhang wrote:
> > > > +static inline u16 hv_iommu_fill_iova_list(union hv_iommu_flush_va *iova_list,
> > > > +					  unsigned long start,
> > > > +					  unsigned long end)
> > > > +{
> > > > +	unsigned long start_pfn = start >> PAGE_SHIFT;
> > > > +	unsigned long end_pfn = PAGE_ALIGN(end) >> PAGE_SHIFT;
> > > > +	unsigned long nr_pages = end_pfn - start_pfn;
> > > > +	u16 count = 0;
> > > > +
> > > > +	while (nr_pages > 0) {
> > > > +		unsigned long flush_pages;
> > > > +		int order;
> > > > +		unsigned long pfn_align;
> > > > +		unsigned long size_align;
> > > > +
> > > > +		if (count >= HV_IOMMU_MAX_FLUSH_VA_COUNT) {
> > > > +			count = HV_IOMMU_FLUSH_VA_OVERFLOW;
> > > > +			break;
> > > > +		}
> > > > +
> > > > +		if (start_pfn)
> > > > +			pfn_align = __ffs(start_pfn);
> > > > +		else
> > > > +			pfn_align = BITS_PER_LONG - 1;
> > > > +
> > > > +		size_align = __fls(nr_pages);
> > > > +		order = min(pfn_align, size_align);
> > > > +		iova_list[count].page_mask_shift = order;
> > > > +		iova_list[count].page_number = start_pfn;
> > > > +
> > > > +		flush_pages = 1UL << order;
> > > > +		start_pfn += flush_pages;
> > > > +		nr_pages -= flush_pages;
> > > > +		count++;
> > > > +	}
> > >
> > > This seems like a really silly hypervisor interface. Why doesn't
> > > it just accept a normal range? Splitting it into power of two
> > > aligned ranges is very inefficient.
> >
> > Fair point. I'm not sure how much flexibility we have to change
> > this hypercall interface at the moment - it predates the pvIOMMU
> > > work and may have other consumers beyond Linux guests. On the other
> > hand, having the guest specify 2^N-aligned blocks does save the
> > hypervisor from having to decompose ranges itself before issuing
> > hardware invalidation commands - the guest-provided entries can be
> > fed to the HW more or less directly.
> >
> > That said, the way I'm currently using this interface may be
> > more precise than necessary. Maybe we have 2 options:
> >
> > 1) Current approach: decompose the range into multiple exact
> > 2^N-aligned blocks with no over-flush, but at the cost of
> > more complex calculations and more entries.
> >
> > 2) Follow what Intel/AMD drivers do: find a single minimal
> > 2^N-aligned block that covers the entire range, but may
> > over-flush.
> >
> > Any preference?
> >
> > @Michael, since you've also been reviewing this patch, I'd
> > appreciate your thoughts on the above as well. :)
> >
>
> I'm just guessing, but perhaps flushing an aligned power-of-2
> range can be processed by the hypervisor at a relatively fixed
> cost, regardless of the size. Having the guest do the decomposing
> of an arbitrary range allows the hypervisor to make use of the
> existing "rep" hypercall mechanism if the hypercall is taking
> "too long". The hypervisor can pause its processing, return to
> the guest temporarily, and then continue the hypercall. If the
> arbitrary range were passed into the hypercall for the hypervisor
> to do the decomposing, that pause-and-restart mechanism
> wouldn't be available.
>
> Of course, Linux doesn't really take advantage of the pause to
> reduce guest interrupt latency because the Hyper-V code in
> Linux typically disables interrupts around a hypercall due to the
> way the hypercall input page is allocated. But other guest
> operating systems might benefit from such a pause. And we could
> probably fix the Hyper-V code in Linux to allow interrupts during a
> hypercall pause/restart if long-running hypercalls turn out to be
> a problem.

I am not sure this pause feature is suitable for IOTLB flush at all,
since the flush is inherently synchronous: the caller must block until
all invalidations complete. Pausing mid-flush to return to the guest
doesn't help if the guest can't make forward progress anyway.