Re: [PATCH] KVM: x86/mmu: fix potential races when walking host page table

From: Sean Christopherson
Date: Fri Apr 29 2022 - 11:33:26 EST


On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> On 4/29/22 05:17, Mingwei Zhang wrote:
> > + ptep = pte_offset_map(&pmd, address);
> > + pte = ptep_get(ptep);
> > + if (pte_present(pte)) {
> > + pte_unmap(ptep);
> > + level = PG_LEVEL_4K;
> > + goto out;
> > + }
> > + pte_unmap(ptep);
>
> Not needed as long as PG_LEVEL_4K is returned for a non-present PTE.
>
> > +out:
> > + local_irq_restore(flags);
> > + return level;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_lookup_address_level_in_mm);
>
> Exporting is not needed.
>
> Thanks for writing the walk code though. I'll adapt it and integrate the
> patch.

Can you also remove the intermediate pointers? They are at best a wash in terms
of readability (net negative IMO), and introduce unnecessary risk by opening up
the possibility of using the wrong variable and/or re-reading an entry.

Case in point, this code subtly re-reads the upper level entries when getting
the next level down, which can run afould of huge page promotion and use the
wrong address (huge pfn instead of page table address).

The alternative is to use the p*d_offset_lockless() helper, but IMO that's
unnecessary and pointless because it just does the same thing under the hood.

E.g. (compile tested only at this point)

static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
const struct kvm_memory_slot *slot)
{
unsigned long flags;
unsigned long hva;
int level;
pgd_t pgd;
p4d_t p4d;
pud_t pud;
pmd_t pmd;

if (!PageCompound(pfn_to_page(pfn)) && !kvm_is_zone_device_pfn(pfn))
return PG_LEVEL_4K;

/*
* Note, using the already-retrieved memslot and __gfn_to_hva_memslot()
* is not solely for performance, it's also necessary to avoid the
* "writable" check in __gfn_to_hva_many(), which will always fail on
* read-only memslots due to gfn_to_hva() assuming writes. Earlier
* page fault steps have already verified the guest isn't writing a
* read-only memslot.
*/
hva = __gfn_to_hva_memslot(slot, gfn);

level = PG_LEVEL_4K;

/*
* Disable IRQs to block TLB shootdown and thus prevent user page tables
* from being freed.
*/
local_irq_save(flags);

pgd = READ_ONCE(*pgd_offset(kvm->mm, hva));
if (pgd_none(pgd))
goto out;

p4d = READ_ONCE(*p4d_offset(&pgd, hva));
if (p4d_none(p4d) || !p4d_present(p4d))
goto out;

if (p4d_large(p4d)) {
level = PG_LEVEL_512G;
goto out;
}

pud = READ_ONCE(*pud_offset(&p4d, hva));
if (pud_none(pud) || !pud_present(pud))
goto out;

if (pud_large(pud)) {
level = PG_LEVEL_1G;
goto out;
}

pmd = READ_ONCE(*pmd_offset(&pud, hva));
if (pmd_none(pmd) || !pmd_present(pmd))
goto out;

if (pmd_large(pmd))
level = PG_LEVEL_2M;
out:
local_irq_restore(flags);
return level;
}