Re: [PATCH] kvm/x86/mmu: use the correct inherited permissions to get shadow page

From: Sean Christopherson
Date: Mon Nov 30 2020 - 12:42:17 EST


On Sat, Nov 28, 2020, Lai Jiangshan wrote:
> On Sat, Nov 28, 2020 at 12:48 AM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> >
> > On 26/11/20 01:05, Sean Christopherson wrote:
> > > On Fri, Nov 20, 2020, Lai Jiangshan wrote:
> > >> From: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
> > >>
> > >> Commit 41074d07c78b ("KVM: MMU: Fix inherited permissions for emulated
> > >> guest pte updates") said role.access is common access permissions for
> > >> all ptes in this shadow page, which is the inherited permissions from
> > >> the parent ptes.
> > >>
> > >> But the commit did not enforce this definition when kvm_mmu_get_page()
> > >> is called in FNAME(fetch). Rather, it uses a random (last level pte's
> > >> combined) access permissions.
> > >
> > > I wouldn't say it's random, the issue is specifically that all shadow pages end
> > > up using the combined set of permissions of the entire walk, as opposed to
> > > only the combined permissions of its parents.
> > >
> > >> And the permissions won't be checked again in next FNAME(fetch) since the
> > >> spte is present. It might fail to meet guest's expectation when guest sets up
> > >> spaghetti pagetables.
> > >
> > > Can you provide details on the exact failure scenario? It would be very helpful
> > > for documentation and understanding. I can see how using the full combined
> > > permissions will cause weirdness for upper level SPs in kvm_mmu_get_page(), but
> > > I'm struggling to connect the dots to understand how that will cause incorrect
> > > behavior for the guest. AFAICT, outside of the SP cache, KVM only consumes
> > > role.access for the final/last SP.
> > >
> >
> > Agreed, a unit test would be even better, but just a description in the
> > commit message would be enough.
> >
> > Paolo
> >
>
> Something in my mind, but I haven't tested it:
>
> pgd[]pud[]   pmd[]        pte[]               virtual address pointers
>    (same hpa as pmd2\) /->pte1(u--)->page1 <- ptr1 (u--)
>           /->pmd1(uw-)--->pte2(uw-)->page2 <- ptr2 (uw-)
> pgd->pud-|           (shared pte[] as above)
>           \->pmd2(u--)--->pte1(u--)->page1 <- ptr3 (u--)
>    (same hpa as pmd1/) \->pte2(uw-)->page2 <- ptr4 (u--)
>
>
> pmd1 and pmd2 point to the same pte table, so:
> ptr1 and ptr3 point to the same page.
> ptr2 and ptr4 point to the same page.
>
> The guest read-accesses ptr1 first. So the hypervisor gets the
> shadow pte page table with role.access=u-- among other things.
> (Note the shadowed pmd1's access is uwx)
>
> And then the guest write-accesses ptr2, and the hypervisor
> sets up the shadow page for ptr2.
> (Note the hypervisor silently accepts the role.access=u--
> shadow pte page table in FNAME(fetch))
>
> After that, the guest read-accesses ptr3, and the hypervisor
> reuses the same shadow pte page table as above.
>
> At last, the guest writes to ptr4 without a vmexit or a pagefault,
> although the guest expects the write to cause a vmexit.

Hmm, yes, KVM would incorrectly handle this scenario. But, the proposed patch
would not address the issue as KVM always maps non-leaf shadow pages with full
access permissions.
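
To spell out the sequence above, here is a toy Python model of it. It is only
a sketch and not KVM code: the class ToyShadowMMU, the bit encoding (U, W), the
gfn value and the helper names are all made up for illustration, and the model
assumes exactly the behavior described in the quoted example (the shadow pte
page is looked up by gfn plus the combined access of the whole walk, and the
non-leaf link adds no restriction of its own).

```python
# Toy model of the quoted ptr1..ptr4 sequence. NOT KVM code; every name
# here is made up for illustration.
U, W = 0x1, 0x2        # user / write bits (hypothetical encoding)
PT_GFN = 0x100         # made-up gfn of the pte[] table shared by pmd1/pmd2

# Guest page tables: entry -> (access bits, target)
guest_pmd = {"pmd1": (U | W, PT_GFN), "pmd2": (U, PT_GFN)}
guest_pte = {"pte1": (U, "page1"), "pte2": (U | W, "page2")}

# Which pmd/pte entries each guest virtual address walks through.
pointers = {
    "ptr1": ("pmd1", "pte1"), "ptr2": ("pmd1", "pte2"),
    "ptr3": ("pmd2", "pte1"), "ptr4": ("pmd2", "pte2"),
}


def guest_access(ptr):
    """Permissions the guest expects: the AND of every level in the walk."""
    pmd, pte = pointers[ptr]
    return guest_pmd[pmd][0] & guest_pte[pte][0]


class ToyShadowMMU:
    def __init__(self):
        self.sp_cache = {}    # (gfn, role_access) -> shadow pte page
        self.shadow_pmd = {}  # pmd entry -> linked shadow pte page

    def fetch(self, ptr):
        """Fault path, loosely modeled on the behavior described above."""
        pmd, pte = pointers[ptr]
        access = guest_access(ptr)
        if pmd not in self.shadow_pmd:
            # Modeled bug: the SP is keyed by the access of the whole walk,
            # and the non-leaf link carries no restriction of its own.
            key = (guest_pmd[pmd][1], access)
            self.shadow_pmd[pmd] = self.sp_cache.setdefault(key, {})
        # If the non-leaf SPTE was already present, the SP is reused as-is.
        self.shadow_pmd[pmd][pte] = access

    def access(self, ptr, write):
        """Return True if the access faults into the hypervisor."""
        pmd, pte = pointers[ptr]
        sp = self.shadow_pmd.get(pmd)
        if sp is not None and pte in sp and (not write or sp[pte] & W):
            return False  # the shadow walk succeeds without a fault
        self.fetch(ptr)
        return True


def run_scenario():
    """Replay the four accesses from the example; True means 'faulted'."""
    mmu = ToyShadowMMU()
    steps = [("ptr1", False), ("ptr2", True), ("ptr3", False), ("ptr4", True)]
    return [mmu.access(ptr, write) for ptr, write in steps]
```

In this model run_scenario() returns [True, True, True, False]: the first three
accesses fault and populate the shadow tables, while the write through ptr4
hits the writable leaf installed for ptr2 and never faults, even though
guest_access("ptr4") has no W bit.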

> In theory, guest userspace can trick the guest kernel if the guest
> kernel sets up page table like this.

I doubt any kernel is affected. Providing a RO or NX view by splitting the VA
space at the PMD level is doable, but it would be much more awkward to deal with
than splitting the VAs at the PGD level (kernel vs. userspace).

E.g. Linux uses constant[*] protections for page tables, with different constant
protections for kernel vs. userspace.

[*] Ignoring encryption, which is technically an address bit anyways.

> Such spaghetti pagetables are unlikely to be seen in the guest.
>
> But consider a guest that uses KPTI and does not use SMEP. KPTI
> means all pgd entries are marked NX on the lower/userspace part of
> the kernel pagetable, which means SMEP is not needed.
> (see arch/x86/mm/pti.c)
>
> Assume the guest does disable SMEP and has a flaw that lets
> the guest user trick the guest kernel into executing from the
> lower part of the address space.
>
> Normally, the NX bits set on the kernel pagetable's lower pgd
> entries can help in this case. But in a guest whose hypervisor
> uses shadow paging, the guest user can make those NX bits useless.

This NX use case won't be affected. The example above requires ptr2 and ptr4 to
use the same PGD and PUD. If ptr2 and ptr4 use different PGDs, i.e. kernel vs.
userspace, KVM will use different shadow pages for the two PGDs, and the kernel
variant will have role.NX=1 in the leaf SPTEs.

> Again, I haven't tested it either. I will try it later and
> update the patch, including adding some more checks in mmu.c.
>
> Thanks,
> Lai