Re: [PATCH 0/2] KVM: arm64: Fix host/hyp tracking on share/unshare hypercall failure
From: Vincent Donnefort
Date: Fri May 29 2026 - 06:11:44 EST
On Fri, May 29, 2026 at 10:29:40AM +0100, Marc Zyngier wrote:
> On Fri, 29 May 2026 09:20:50 +0100,
> Fuad Tabba <tabba@xxxxxxxxxx> wrote:
> >
> > On Fri, 29 May 2026 at 09:15, Marc Zyngier <maz@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 29 May 2026 09:05:35 +0100,
> > > Fuad Tabba <tabba@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, 29 May 2026 at 09:02, Vincent Donnefort <vdonnefort@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, May 29, 2026 at 08:43:39AM +0100, tabba@xxxxxxxxxx wrote:
> > > > > > Hi folks,
> > > > > >
> > > > > > Yet another bug I found while testing Sashiko locally with fixes to
> > > > > > review-prompts.
> > > > > >
> > > > > > share_pfn_hyp() and unshare_pfn_hyp() in arch/arm64/kvm/mmu.c
> > > > > > maintain a host-side RB-tree mirroring the set of pages shared with
> > > > > > EL2. Both invoke a hypercall that can fail (page-state mismatch,
> > > > > > EL2 refcount still held), but neither cleans up on failure:
> > > > > >
> > > > > > - share_pfn_hyp() inserts the tracking node before the hypercall
> > > > > > and leaves it in the tree on failure, leaking the allocation and
> > > > > > presenting a phantom share to a later unshare.
> > > > > >
> > > > > > - unshare_pfn_hyp() erases the tracking node before the hypercall;
> > > > > > on failure the host loses its record while EL2 still owns the
> > > > > > share, breaking later operations on the same pfn.
> > > > > >
> > > > > > Severity is low (no isolation impact) and the failure paths are rare
> > > > > > in practice, but the desync is real. Both patches are independent and
> > > > > > apply cleanly to current mainline. In other words, this can wait for
> > > > > > 7.2.
> > > > >
> > > > >
> > > > > > I believe I fixed that here: https://lore.kernel.org/all/acyKhZL2di_QQ9xm@xxxxxxxxxx/
> > > > > > but, as Quentin pointed out, there's absolutely no reason for the hypercall
> > > > > > to fail. So I haven't sent a v2.
> > > >
> > > > At the very least we need to add a comment, otherwise, people like me
> > > > and LLMs like Sashiko would stumble upon it.
> > > >
> > > > That said, this fix adds no real overhead, makes the code clearer, and
> > > > guards us against a future where that call might fail.
> > > > Self-documenting in essence.
> > > >
> > > > WDYT?
> > >
> > > If a hypercall really cannot fail, why does it have a return value?
> >
> > Good point. If we know it cannot fail, how about just `void`?
> >
> > That said, Vincent's exact words are: `very much unlikely`, not the
> > same as cannot fail :)
> >
> > https://lore.kernel.org/all/acyKhZL2di_QQ9xm@xxxxxxxxxx/
>
> I think the rules are simple:
>
> - if something can fail, we need to handle the failure
Looking at kvm_share_hyp(), it should then roll back the pages it has already
shared. I think that is fine.
>
> - if something should not fail and has the potential of compromising
> the system, we should panic
Then kvm_unshare_hyp(), being void, should BUG_ON(unshare_pfn_hyp(pfn)).
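
To make the two rules concrete, here is a rough user-space sketch, not the
actual patch. Every model_* name, the fixed-size arrays standing in for the
RB-tree and for EL2's state, and the fail_share_at knob are hypothetical and
exist only for illustration. It also ignores the refcounting the real tracking
does. The share path records a pfn only after the (modelled) hypercall
succeeds and rolls back on failure; the unshare path returns nothing and
treats a failure as fatal.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MODEL_MAX_PFN 16
#define MODEL_BUG_ON(cond) do { if (cond) abort(); } while (0)

static bool host_tracked[MODEL_MAX_PFN]; /* stands in for the host RB-tree */
static bool hyp_shared[MODEL_MAX_PFN];   /* stands in for EL2's view */
static int fail_share_at = -1;           /* inject a share failure at this pfn */

static int model_hyp_share(unsigned int pfn)
{
	if ((int)pfn == fail_share_at)
		return -1;
	hyp_shared[pfn] = true;
	return 0;
}

static int model_hyp_unshare(unsigned int pfn)
{
	if (!hyp_shared[pfn])
		return -1;
	hyp_shared[pfn] = false;
	return 0;
}

/* Record the share on the host side only once the hypercall succeeded. */
static int model_share_pfn(unsigned int pfn)
{
	int ret = model_hyp_share(pfn);

	if (ret)
		return ret;
	host_tracked[pfn] = true;
	return 0;
}

/* Drop the host record only once the hypercall succeeded. */
static int model_unshare_pfn(unsigned int pfn)
{
	int ret = model_hyp_unshare(pfn);

	if (ret)
		return ret;
	host_tracked[pfn] = false;
	return 0;
}

/* "Can fail": report the error after undoing the pages already shared. */
static int model_share_range(unsigned int start, unsigned int end)
{
	for (unsigned int pfn = start; pfn < end; pfn++) {
		int ret = model_share_pfn(pfn);

		if (ret) {
			while (pfn-- > start)
				MODEL_BUG_ON(model_unshare_pfn(pfn));
			return ret;
		}
	}
	return 0;
}

/* "Should not fail": void return, so a failure can only be fatal. */
static void model_unshare_range(unsigned int start, unsigned int end)
{
	for (unsigned int pfn = start; pfn < end; pfn++)
		MODEL_BUG_ON(model_unshare_pfn(pfn));
}
```

With fail_share_at set to a pfn inside the range, model_share_range() returns
an error and leaves both arrays untouched, so the host and hyp views stay in
sync.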
>
> - if something absolutely cannot fail, then there is nothing to handle
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.