Re: [PATCH v2 3/5] KVM: guest_memfd: Handle errors from xa_store_range() when binding

From: Ackerley Tng

Date: Wed May 27 2026 - 15:12:45 EST

Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Fri, May 22, 2026, Ackerley Tng wrote:
>> Unhandled errors from xa_store_range() means kvm_gmem_bind() might falsely
>> reporting success, leading to false assumptions in guest_memfd's lifecycle
>> later.
>>
>> On error, restore the unbound state and return the error to userspace.
>>
>> Fixes: a7800aa80ea4d ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
>> Signed-off-by: Ackerley Tng <ackerleytng@xxxxxxxxxx>
>> ---
>> virt/kvm/guest_memfd.c | 11 +++++++++--
>> 1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index d203135969d13..5b4911ffa208a 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -648,6 +648,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>> struct inode *inode;
>> struct file *file;
>> int r = -EINVAL;
>> + void *result;
>
> I would rather go with "xr". "result" is too generic, e.g. begs the question of
> "result of what?"
>

Good to go with xr too.

> Actually, I don't think we even need an intermediate variable.
>
>> BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
>>
>> @@ -688,7 +689,14 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>> if (kvm_gmem_supports_mmap(inode))
>> slot->flags |= KVM_MEMSLOT_GMEM_ONLY;
>>
>> - xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
>> + result = xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
>> + if (xa_is_err(result)) {
>> + r = xa_err(result);
>> + xa_store_range(&f->bindings, start, end - 1, NULL, GFP_KERNEL);
>
> I'm not convinced this is necessary. Sashiko "asked" the question:
>
> : If xa_store_range() fails midway through storing a large range (for example,
> : returning -ENOMEM), does it leave the already-processed entries in the
> : f->bindings XArray?
> :
> : When this error is propagated back, the caller __kvm_set_memory_region()
> : will abort the operation and free the memslot without calling
> : kvm_gmem_unbind().
> :
> : Since the partial XArray updates aren't rolled back here, could this leave
> : dangling pointers to the freed memslot in f->bindings? If so, when the file
> : is eventually closed, kvm_gmem_release() might iterate over these dangling
> : pointers and write to slot->gmem.file, resulting in a use-after-free.
>
> but I think Sashiko is hallicunating.
>

When I updated this I kind of just assumed xa_store_range() always
iterates indices (so for a range [0, 10], it would store 11 times), and
an earlier index could be set, and a later store could result in
-ENOMEM.

Since you called this out, I dug into it more.

> If @entry is non-NULL, xa_store_range() pre-creates the entire range, before
> storing anything into the range:
>
> if (entry) {
> unsigned int order = BITS_PER_LONG;
> if (last + 1)
> order = __ffs(last + 1);
> xas_set_order(&xas, last, order);
> xas_create(&xas, true);
> if (xas_error(&xas))
> goto unlock;
> }
>

xa_store_range() doesn't actually always iterate: if last + 1 is some
clean power of 2, it'll create a higher order xarray node.

Otherwise, it falls back to creating and storing 1 index/node at a time:
if the above did manage to create an xarray node, xas_error() returns
false, it goes on to the store below.

> Yes, the API handles failure on the subsequent xas_store(), but I can't imagine
> that failure is actually, barring garbage input from KVM:
>
> do {
> xas_set_range(&xas, first, last);
> xas_store(&xas, entry);
> if (xas_error(&xas))
> goto unlock;
> first += xas_size(&xas);
> } while (first <= last);
>

So if a later xas_create() fails because it runs out of memory, the
earlier stores would have already been committed.

This ignores -EEXIST being returned since earlier in kvm_gmem_bind()
conflicts were already checked.

> Purely from a design perspective, providing an API that can fail partway through
> under normal operation, with no indication of where failure occured (AFAICT),
> would be awful.
>

Do you mean the API of xas_store_range()? xas is updated by
xas_set_range() so that should track the last store. Since the cleanup
is storing NULLs and won't allocate, I thought it would be fine to just
store NULL on the entire range on error.

>> + } else {
>> + r = 0;
>> + }
>> +
>> filemap_invalidate_unlock(inode->i_mapping);
>>
>> /*
>>
>> [...snip...]
>>