Re: [PATCH] KVM: arm64: Adjust range correctly during host stage-2 faults

From: Marc Zyngier

Date: Thu Mar 05 2026 - 08:22:40 EST

On Thu, 05 Mar 2026 13:13:40 +0000,
Quentin Perret <qperret@xxxxxxxxxx> wrote:
>
> On Thursday 05 Mar 2026 at 10:55:42 (+0000), Marc Zyngier wrote:
> > On Wed, 04 Mar 2026 18:55:04 +0000,
> > Marc Zyngier <maz@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 25 Jun 2025 11:55:48 +0100,
> > > Quentin Perret <qperret@xxxxxxxxxx> wrote:
> > > >
> > > > host_stage2_adjust_range() tries to find the largest block mapping that
> > > > fits within a memory or mmio region (represented by a kvm_mem_range in
> > > > this function) during host stage-2 faults under pKVM. To do so, it walks
> > > > the host stage-2 page-table, finds the faulting PTE and its level, and
> > > > then progressively increments the level until it finds a granule of the
> > > > appropriate size. However, the condition in the loop implementing the
> > > > above is broken as it checks kvm_level_supports_block_mapping() for the
> > > > next level instead of the current, so pKVM may attempt to map a region
> > > > larger than can be covered with a single block.
> > > >
> > > > This is not a security problem and is quite rare in practice (the
> > > > kvm_mem_range check usually forces host_stage2_adjust_range() to choose a
> > > > smaller granule), but this is clearly not the expected behaviour.
> > > >
> > > > Refactor the loop to fix the bug and improve readability.
> > > >
> > > > Fixes: c4f0935e4d95 ("KVM: arm64: Optimize host memory aborts")
> > > > Signed-off-by: Quentin Perret <qperret@xxxxxxxxxx>
> > >
> > > This patch prevents my O6 board from booting in protected mode as of
> > > e728e705802fe. Reverting it on top of 7.0-rc2 makes the box work again.
> > >
> > > I haven't quite worked out why though. The hack below makes it work,
> > > but implies that we can get ranges that are smaller than a page. That
> > > feels unlikely, but I'm not sure we can rule it out (the kernel page
> > > size could be pretty large anyway).
> >
> > Having spent a bit of time on this, I'm pretty sure this is the cause
> > of the issue. The memblock tables are as such:
> >
> > maz@cosmic-debris:~/vminstall$ sudo cat /sys/kernel/debug/memblock/memory
> > 0: 0x0000000080000000..0x00000000843fffff 0 NOMAP
> > 1: 0x0000000084400000..0x00000000845fffff 0 NONE
> > 2: 0x0000000085000000..0x000000009fffffff 0 NONE
> > 3: 0x00000000a0000000..0x00000000a7ffffff 0 NOMAP
> > 4: 0x00000000a8000000..0x00000000fffbffff 0 NONE
> > 5: 0x00000000fffc0000..0x00000000fffeffff 0 NOMAP
> > 6: 0x00000000ffff0000..0x00000000ffffdfff 0 NONE
> > 7: 0x00000000ffffe000..0x00000000ffffffff 0 NOMAP
> > 8: 0x0000000100000000..0x00000007fe4effff 0 NONE
> > 9: 0x00000007fe4f0000..0x00000007fedeffff 0 NOMAP
> > 10: 0x00000007fedf0000..0x00000007ffffffff 0 NONE
> > 11: 0x0000008000000000..0x000000807a290fff 0 NONE
> > 12: 0x000000807a291000..0x000000807a2927b2 0 NOMAP
> > 13: 0x000000807a2927b3..0x000000807fffffff 0 NONE
>
> Ouch, these last few are 'interesting', oh well :-)
>
> > Any access to page 0x000000807a292000 is going to blow up in your
> > face, because there is no way you can map this and still respect the
> > memblock boundary. Same thing for any region that is smaller than
> > PAGE_SIZE, or not aligned on PAGE_SIZE. Which is even more annoying.
> >
> > I'm starting to think that my hack is not that idiotic in the end...
>
> Yes, I can't think of anything better TBH. We've already asserted that
> we don't have an annotated PTE here, and at the last level we're
> guaranteed not to accidentally map a neighbouring private region, so yes
> we should just proceed with a page-aligned mapping there.
>
> Want me to post a proper patch or do you already have one in stock?

I have that ready, but I wanted your feedback on it before posting it.

I'll send that now.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.
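
For illustration only: the alignment problem discussed above for memblock
entries 12 and 13 can be reproduced with a small user-space sketch. This is
not the kernel code and not the patch referred to in the thread. The names
mem_range, align_down(), range_included() and page_fits() are defined here
for the example, 4KiB pages are an assumption, and ranges are half-open
[start, end), so the inclusive end addresses from the memblock dump are
incremented by one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4KiB pages. The thread notes the page size could be larger. */
#define EX_PAGE_SIZE	(1ULL << 12)

/* Half-open range [start, end), hypothetical stand-in for a memory region. */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/* Memblock entries 12 (NOMAP) and 13 (NONE) from the dump above. */
static const struct mem_range region12 = { 0x807a291000ULL, 0x807a2927b3ULL };
static const struct mem_range region13 = { 0x807a2927b3ULL, 0x8080000000ULL };

/* Round x down to a power-of-two alignment. */
static uint64_t align_down(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* True if child lies entirely within parent. */
static bool range_included(const struct mem_range *child,
			   const struct mem_range *parent)
{
	return parent->start <= child->start && child->end <= parent->end;
}

/* True if the whole page containing addr fits inside region. */
static bool page_fits(uint64_t addr, const struct mem_range *region)
{
	struct mem_range page;

	page.start = align_down(addr, EX_PAGE_SIZE);
	page.end = page.start + EX_PAGE_SIZE;
	return range_included(&page, region);
}
```

With these definitions, page_fits(0x807a292800ULL, &region13) is false: the
page starting at 0x807a292000 begins before region 13 does, and the same page
also extends past the end of region 12. A search that insists on a granule
fully contained in the region therefore finds nothing for that page, even at
the last level, which is the case the page-aligned fallback discussed in the
thread is meant to cover.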