Re: [PATCH] vfio: fix VFIO_IOMMU_UNMAP_DMA when end of range would overflow u64

From: Alejandro Jimenez

Date: Tue Oct 07 2025 - 10:41:15 EST




On 10/7/25 12:24 AM, Alex Mastro wrote:
On Mon, Oct 06, 2025 at 09:23:56PM -0400, Alejandro Jimenez wrote:
If going this way, we'd also have to deny the MAP requests. Right now we

Agree.

I have doubts that anyone actually relies on MAP_DMA-ing such
end-of-u64-mappings in practice, so perhaps it's OK?


The AMD IOMMU supports a 64-bit IOVA, so when using the AMD vIOMMU with DMA
remapping enabled + VF passthrough + a Linux guest with iommu.forcedac=1 we
hit this issue since the driver (mlx5) starts requesting mappings for IOVAs
right at the top of the address space.

Interesting!

The reason I hadn't sent it to the list yet is that I noticed the
additional locations Jason mentioned earlier in the thread (e.g.
vfio_find_dma(), vfio_iommu_replay()) and wanted to put together a
reproducer that also triggered those paths.

I am not well equipped to test some of these paths, so if you have a recipe, I'd
be interested.

Unfortunately I haven't found a good recipe yet, which is why I delayed
sending my patch with the fix.
I tried a few things to trigger the code paths, but the "easy way"
methods are not enough. For example, vfio_iommu_replay() is called when
enabling the IOMMU model on the container, i.e. when issuing the
VFIO_SET_IOMMU ioctl, but at that point there are no mappings yet for
the domain, so there is no dma_list to traverse.
I also tried attaching a second group to a container that already had a
group with mappings, but in vfio_iommu_type1_attach_group() the new
group is attached to the "existing compatible domain" that was already
created for the first group, and since mappings are per-domain there is
no need for replay.
Just mentioning this to give an idea; I didn't have time to try
triggering the paths involved with dirty tracking...
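For reference, the ordering I was describing looks roughly like this
(untested sketch, error handling omitted; the /dev/vfio group paths and
the buffer/IOVA values are placeholders):

    int container = open("/dev/vfio/vfio", O_RDWR);
    int group  = open("/dev/vfio/26", O_RDWR);     /* placeholder group */
    int group2 = open("/dev/vfio/27", O_RDWR);     /* placeholder group */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* replay runs when the IOMMU model is set, but dma_list is still empty */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

    /* mappings can only be created after VFIO_SET_IOMMU */
    void *buf = mmap(NULL, 0x100000, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (__u64)buf,
        .iova  = 0x100000,
        .size  = 0x100000,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* attaching a second group at this point just reuses the existing
     * compatible domain in vfio_iommu_type1_attach_group(), so again no
     * replay of the dma_list */
    ioctl(group2, VFIO_GROUP_SET_CONTAINER, &container);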

The basic reproducer I linked in my original patch:

https://gist.github.com/aljimenezb/f3338c9c2eda9b0a7bf5f76b40354db8

is only exercising vfio_find_dma_first_node() and vfio_dma_do_unmap(),
which is enough to verify the fix for the issue that was causing the AMD
vIOMMU failure.
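The core of it is roughly the sequence below (from memory, not the exact
gist contents; container/group setup as above, buf is a page-aligned
mmap() buffer, and whether a map at that IOVA is accepted depends on the
container's allowed IOVA ranges):

    /* map the last 4K page of the 64-bit IOVA space */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (__u64)buf,
        .iova  = 0xfffffffffffff000ULL,
        .size  = 0x1000,
    };
    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

    /* iova + size wraps to 0 here; without the fix the range is never
     * found in the dma_list by vfio_find_dma_first_node() */
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .iova  = 0xfffffffffffff000ULL,
        .size  = 0x1000,
    };
    ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);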


I mentioned in the notes for the patch above why I chose a slightly more
complex method than the '- 1' approach, since there is a chance that
iova+size could also go beyond the end of the address space and actually
wrap around.

I think certain invariants have broken if that is possible. The current checks
in the unmap path should prevent that (iova + size - 1 < iova).

1330 	} else if (!size || size & (pgsize - 1) ||
1331 		   iova + size - 1 < iova || size > SIZE_MAX) {
1332 		goto unlock;


Yes, that check properly accounts for overflow, but
vfio_find_dma_first_node() later doesn't, so it won't find the node in
the dma_list. I agree with the approach of sanitizing the inputs at the
uapi boundary as Jason mentions, but I'd think the code should be robust
enough to handle overflow anyway.

My goal was not to trust the inputs at all if possible. We could also use
check_add_overflow() if we want to add explicit error reporting in
vfio_iommu_type1 when an overflow is detected. i.e. catch bad callers that

I do like the explicitness of the check_* functions over the older style wrap
checks.

Below is my partially validated attempt at a more comprehensive fix if we were
to try and make end of address space mappings work, rather than blanking out
the last page.


In my opinion, making end-of-address-space mappings work is not
optional. Ranges ending at the 64-bit boundary are valid, and not
accepting them will break drivers that request them.
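To make the arithmetic concrete, here is an illustration of the style I
had in mind (a hypothetical helper assuming linux/overflow.h, not code
from the patch below):

    /* A mapping of the last page: iova = 0xfffffffffffff000, size = 0x1000.
     * Exclusive-end arithmetic wraps (iova + size == 0), so comparisons
     * like "start >= dma->iova + dma->size" misfire for a valid range.
     * Inclusive-end arithmetic via check_add_overflow() only rejects
     * ranges that really do run past the end of u64. */
    static int check_range(u64 iova, u64 size)
    {
            u64 end;

            if (!size)
                    return -EINVAL;

            if (check_add_overflow(iova, size - 1, &end))
                    return -EINVAL;

            /* for the example above, end == 0xffffffffffffffff */
            return 0;
    }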

I'll test this change when I have a chance, though I'd really like to
have solid test cases, so I'm hoping for some advice on how to trigger
all the paths.

Thank you,
Alejandro

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 08242d8ce2ca..66a25de35446 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -28,6 +28,7 @@
#include <linux/iommu.h>
#include <linux/module.h>
#include <linux/mm.h>
+#include <linux/overflow.h>
#include <linux/kthread.h>
#include <linux/rbtree.h>
#include <linux/sched/signal.h>
@@ -165,12 +166,14 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
{
struct rb_node *node = iommu->dma_list.rb_node;
+ BUG_ON(size == 0);
+
while (node) {
struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
- if (start + size <= dma->iova)
+ if (start + size - 1 < dma->iova)
node = node->rb_left;
- else if (start >= dma->iova + dma->size)
+ else if (start > dma->iova + dma->size - 1)
node = node->rb_right;
else
return dma;
@@ -186,10 +189,12 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
struct rb_node *node = iommu->dma_list.rb_node;
struct vfio_dma *dma_res = NULL;
+ BUG_ON(size == 0);
+
while (node) {
struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
- if (start < dma->iova + dma->size) {
+ if (start <= dma->iova + dma->size - 1) {
res = node;
dma_res = dma;
if (start >= dma->iova)
@@ -199,7 +204,7 @@ static struct rb_node *vfio_find_dma_first_node(struct vfio_iommu *iommu,
node = node->rb_right;
}
}
- if (res && size && dma_res->iova > start + size - 1)
+ if (res && dma_res->iova > start + size - 1)
res = NULL;
return res;
}
@@ -213,7 +218,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
parent = *link;
dma = rb_entry(parent, struct vfio_dma, node);
- if (new->iova + new->size <= dma->iova)
+ if (new->iova + new->size - 1 < dma->iova)
link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
@@ -825,14 +830,24 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
unsigned long remote_vaddr;
struct vfio_dma *dma;
bool do_accounting;
+ u64 end, to_pin;
- if (!iommu || !pages)
+ if (!iommu || !pages || npage < 0)
return -EINVAL;
/* Supported for v2 version only */
if (!iommu->v2)
return -EACCES;
+ if (npage == 0)
+ return 0;
+
+ if (check_mul_overflow(npage, PAGE_SIZE, &to_pin))
+ return -EINVAL;
+
+ if (check_add_overflow(user_iova, to_pin - 1, &end))
+ return -EINVAL;
+
mutex_lock(&iommu->lock);
if (WARN_ONCE(iommu->vaddr_invalid_count,
@@ -997,7 +1012,7 @@ static long vfio_sync_unpin(struct vfio_dma *dma, struct vfio_domain *domain,
#define VFIO_IOMMU_TLB_SYNC_MAX 512
static size_t unmap_unpin_fast(struct vfio_domain *domain,
- struct vfio_dma *dma, dma_addr_t *iova,
+ struct vfio_dma *dma, dma_addr_t iova,
size_t len, phys_addr_t phys, long *unlocked,
struct list_head *unmapped_list,
int *unmapped_cnt,
@@ -1007,18 +1022,16 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
struct vfio_regions *entry = kzalloc(sizeof(*entry), GFP_KERNEL);
if (entry) {
- unmapped = iommu_unmap_fast(domain->domain, *iova, len,
+ unmapped = iommu_unmap_fast(domain->domain, iova, len,
iotlb_gather);
if (!unmapped) {
kfree(entry);
} else {
- entry->iova = *iova;
+ entry->iova = iova;
entry->phys = phys;
entry->len = unmapped;
list_add_tail(&entry->list, unmapped_list);
-
- *iova += unmapped;
(*unmapped_cnt)++;
}
}
@@ -1037,18 +1050,17 @@ static size_t unmap_unpin_fast(struct vfio_domain *domain,
}
static size_t unmap_unpin_slow(struct vfio_domain *domain,
- struct vfio_dma *dma, dma_addr_t *iova,
+ struct vfio_dma *dma, dma_addr_t iova,
size_t len, phys_addr_t phys,
long *unlocked)
{
- size_t unmapped = iommu_unmap(domain->domain, *iova, len);
+ size_t unmapped = iommu_unmap(domain->domain, iova, len);
if (unmapped) {
- *unlocked += vfio_unpin_pages_remote(dma, *iova,
+ *unlocked += vfio_unpin_pages_remote(dma, iova,
phys >> PAGE_SHIFT,
unmapped >> PAGE_SHIFT,
false);
- *iova += unmapped;
cond_resched();
}
return unmapped;
@@ -1057,12 +1069,12 @@ static size_t unmap_unpin_slow(struct vfio_domain *domain,
static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
bool do_accounting)
{
- dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
struct vfio_domain *domain, *d;
LIST_HEAD(unmapped_region_list);
struct iommu_iotlb_gather iotlb_gather;
int unmapped_region_cnt = 0;
long unlocked = 0;
+ size_t pos = 0;
if (!dma->size)
return 0;
@@ -1086,13 +1098,14 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
}
iommu_iotlb_gather_init(&iotlb_gather);
- while (iova < end) {
+ while (pos < dma->size) {
size_t unmapped, len;
phys_addr_t phys, next;
+ dma_addr_t iova = dma->iova + pos;
phys = iommu_iova_to_phys(domain->domain, iova);
if (WARN_ON(!phys)) {
- iova += PAGE_SIZE;
+ pos += PAGE_SIZE;
continue;
}
@@ -1101,7 +1114,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
* may require hardware cache flushing, try to find the
* largest contiguous physical memory chunk to unmap.
*/
- for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
+ for (len = PAGE_SIZE; len + pos < dma->size; len += PAGE_SIZE) {
next = iommu_iova_to_phys(domain->domain, iova + len);
if (next != phys + len)
break;
@@ -1111,16 +1124,18 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
* First, try to use fast unmap/unpin. In case of failure,
* switch to slow unmap/unpin path.
*/
- unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
+ unmapped = unmap_unpin_fast(domain, dma, iova, len, phys,
&unlocked, &unmapped_region_list,
&unmapped_region_cnt,
&iotlb_gather);
if (!unmapped) {
- unmapped = unmap_unpin_slow(domain, dma, &iova, len,
+ unmapped = unmap_unpin_slow(domain, dma, iova, len,
phys, &unlocked);
if (WARN_ON(!unmapped))
break;
}
+
+ pos += unmapped;
}
dma->iommu_mapped = false;
@@ -1212,7 +1227,7 @@ static int update_user_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
}
static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
- dma_addr_t iova, size_t size, size_t pgsize)
+ dma_addr_t iova, u64 end, size_t pgsize)
{
struct vfio_dma *dma;
struct rb_node *n;
@@ -1229,8 +1244,8 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
if (dma && dma->iova != iova)
return -EINVAL;
- dma = vfio_find_dma(iommu, iova + size - 1, 0);
- if (dma && dma->iova + dma->size != iova + size)
+ dma = vfio_find_dma(iommu, end, 1);
+ if (dma && dma->iova + dma->size - 1 != end)
return -EINVAL;
for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
@@ -1239,7 +1254,7 @@ static int vfio_iova_dirty_bitmap(u64 __user *bitmap, struct vfio_iommu *iommu,
if (dma->iova < iova)
continue;
- if (dma->iova > iova + size - 1)
+ if (dma->iova > end)
break;
ret = update_user_bitmap(bitmap, iommu, dma, iova, pgsize);
@@ -1305,6 +1320,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
unsigned long pgshift;
dma_addr_t iova = unmap->iova;
u64 size = unmap->size;
+ u64 unmap_end;
bool unmap_all = unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL;
bool invalidate_vaddr = unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR;
struct rb_node *n, *first_n;
@@ -1327,11 +1343,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
if (iova || size)
goto unlock;
size = U64_MAX;
- } else if (!size || size & (pgsize - 1) ||
- iova + size - 1 < iova || size > SIZE_MAX) {
+ } else if (!size || size & (pgsize - 1) || size > SIZE_MAX) {
goto unlock;
}
+ if (check_add_overflow(iova, size - 1, &unmap_end))
+ goto unlock;
+
/* When dirty tracking is enabled, allow only min supported pgsize */
if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
(!iommu->dirty_page_tracking || (bitmap->pgsize != pgsize))) {
@@ -1376,8 +1394,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
if (dma && dma->iova != iova)
goto unlock;
- dma = vfio_find_dma(iommu, iova + size - 1, 0);
- if (dma && dma->iova + dma->size != iova + size)
+ dma = vfio_find_dma(iommu, unmap_end, 1);
+ if (dma && dma->iova + dma->size - 1 != unmap_end)
goto unlock;
}
@@ -1386,7 +1404,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
while (n) {
dma = rb_entry(n, struct vfio_dma, node);
- if (dma->iova > iova + size - 1)
+ if (dma->iova > unmap_end)
break;
if (!iommu->v2 && iova > dma->iova)
@@ -1713,12 +1731,12 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
for (; n; n = rb_next(n)) {
struct vfio_dma *dma;
- dma_addr_t iova;
+ size_t pos = 0;
dma = rb_entry(n, struct vfio_dma, node);
- iova = dma->iova;
- while (iova < dma->iova + dma->size) {
+ while (pos < dma->size) {
+ dma_addr_t iova = dma->iova + pos;
phys_addr_t phys;
size_t size;
@@ -1734,14 +1752,15 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
phys = iommu_iova_to_phys(d->domain, iova);
if (WARN_ON(!phys)) {
- iova += PAGE_SIZE;
+ pos += PAGE_SIZE;
continue;
}
+
size = PAGE_SIZE;
p = phys + size;
i = iova + size;
- while (i < dma->iova + dma->size &&
+ while (size + pos < dma->size &&
p == iommu_iova_to_phys(d->domain, i)) {
size += PAGE_SIZE;
p += PAGE_SIZE;
@@ -1782,7 +1801,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
goto unwind;
}
- iova += size;
+ pos += size;
}
}
@@ -1799,29 +1818,29 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
unwind:
for (; n; n = rb_prev(n)) {
struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
- dma_addr_t iova;
+ size_t pos = 0;
if (dma->iommu_mapped) {
iommu_unmap(domain->domain, dma->iova, dma->size);
continue;
}
- iova = dma->iova;
- while (iova < dma->iova + dma->size) {
+ while (pos < dma->size) {
+ dma_addr_t iova = dma->iova + pos;
phys_addr_t phys, p;
size_t size;
dma_addr_t i;
phys = iommu_iova_to_phys(domain->domain, iova);
if (!phys) {
- iova += PAGE_SIZE;
+ pos += PAGE_SIZE;
continue;
}
size = PAGE_SIZE;
p = phys + size;
i = iova + size;
- while (i < dma->iova + dma->size &&
+ while (pos + size < dma->size &&
p == iommu_iova_to_phys(domain->domain, i)) {
size += PAGE_SIZE;
p += PAGE_SIZE;
@@ -2908,6 +2927,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
unsigned long pgshift;
size_t data_size = dirty.argsz - minsz;
size_t iommu_pgsize;
+ u64 end;
if (!data_size || data_size < sizeof(range))
return -EINVAL;
@@ -2916,8 +2936,12 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
sizeof(range)))
return -EFAULT;
- if (range.iova + range.size < range.iova)
+ if (range.size == 0)
+ return 0;
+
+ if (check_add_overflow(range.iova, range.size - 1, &end))
return -EINVAL;
+
if (!access_ok((void __user *)range.bitmap.data,
range.bitmap.size))
return -EINVAL;
@@ -2949,7 +2973,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
if (iommu->dirty_page_tracking)
ret = vfio_iova_dirty_bitmap(range.bitmap.data,
iommu, range.iova,
- range.size,
+ end,
range.bitmap.pgsize);
else
ret = -EINVAL;