Re: [PATCH resend RFC 4/9] s390/mm: fix VMA and page table handling code in storage key handling functions
From: Claudio Imbrenda
Date: Mon Sep 27 2021 - 12:42:50 EST
On Thu, 9 Sep 2021 18:22:43 +0200
David Hildenbrand <david@xxxxxxxxxx> wrote:
> There are multiple things broken about our storage key handling
> functions:
>
> 1. We should not walk/touch page tables outside of VMA boundaries when
> holding only the mmap sem in read mode. Evil user space can modify the
> VMA layout just before this function runs and e.g., trigger races with
> page table removal code since commit dd2283f2605e ("mm: mmap: zap pages
> with read mmap_sem in munmap"). gfn_to_hva() will only translate using
> KVM memory regions, but won't validate the VMA.
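
Just to spell out the VMA part for readers following along: the fix below
relies on vma_lookup(), which only returns a VMA that actually contains the
address, whereas find_vma() may return a VMA that starts above it. A rough,
illustrative sketch (not taken from this patch) of the check that rules out
walking into unmapped holes:

        /* mmap_lock must be held, read mode is sufficient */
        struct vm_area_struct *vma = find_vma(mm, addr);

        /* a VMA above addr is not good enough: addr sits in a hole */
        if (!vma || addr < vma->vm_start)
                return -EFAULT; /* same condition as !vma_lookup(mm, addr) */
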
>
> 2. We should not allocate page tables outside of VMA boundaries: if
> evil user space decides to map hugetlbfs to these ranges, bad things
> will happen because we suddenly have PTE or PMD page tables where we
> shouldn't have them.
>
> 3. We don't handle large PUDs that might suddenly appear inside our page
> table hierarchy.
>
> Don't manually allocate page tables, properly validate that we have a VMA and
> bail out on pud_large().
>
> All callers of page table handling functions, except
> get_guest_storage_key(), call fixup_user_fault() and retry in case they
> receive -EFAULT; this will allocate the necessary page tables if
> required.
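
For context, the caller side then looks roughly like this -- a sketch of the
fault-and-retry pattern, with the locking simplified (the real callers, e.g.
handle_sske(), loop and handle more cases):

        bool unlocked = false;
        int rc;

        mmap_read_lock(current->mm);
        rc = set_guest_storage_key(current->mm, addr, key, nq);
        if (rc == -EFAULT) {
                /* fault in the page and its page tables, then retry */
                rc = fixup_user_fault(current->mm, addr,
                                      FAULT_FLAG_WRITE, &unlocked);
                if (!rc)
                        rc = set_guest_storage_key(current->mm, addr, key, nq);
        }
        mmap_read_unlock(current->mm);
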
>
> To keep get_guest_storage_key() working as expected without requiring
> kvm_s390_get_skeys() to call fixup_user_fault(), distinguish between
> "there is simply no page table or huge page yet and the key is assumed
> to be 0" and "this is a fault to be reported".
>
> Although commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key handling")
> introduced most of the affected code, it was actually already broken
> before when using get_locked_pte() without any VMA checks.
>
> Note: Ever since commit 637ff9efe5ea ("s390/mm: Add huge pmd storage key
> handling") we can no longer set a guest storage key (for example from
> QEMU during VM live migration) without actually resolving a fault.
> Although we would have created most page tables, we would choke on the
> !pmd_present(), requiring a call to fixup_user_fault(). I would
> have thought that this is problematic in combination with postcopy live
> migration ... but nobody noticed and this patch doesn't change the
> situation. So maybe it's just fine.
>
> Fixes: 9fcf93b5de06 ("KVM: S390: Create helper function get_guest_storage_key")
> Fixes: 24d5dd0208ed ("s390/kvm: Provide function for setting the guest storage key")
> Fixes: a7e19ab55ffd ("KVM: s390: handle missing storage-key facility")
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
Reviewed-by: Claudio Imbrenda <imbrenda@xxxxxxxxxxxxx>
> ---
> arch/s390/mm/pgtable.c | 57 +++++++++++++++++++++++++++++-------------
> 1 file changed, 39 insertions(+), 18 deletions(-)
>
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index 54969e0f3a94..5fb409ff7842 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -429,22 +429,36 @@ static inline pmd_t pmdp_flush_lazy(struct mm_struct *mm,
> }
>
> #ifdef CONFIG_PGSTE
> -static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
> +static int pmd_lookup(struct mm_struct *mm, unsigned long addr, pmd_t **pmdp)
> {
> + struct vm_area_struct *vma;
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> - pmd_t *pmd;
> +
> + /* We need a valid VMA, otherwise this is clearly a fault. */
> + vma = vma_lookup(mm, addr);
> + if (!vma)
> + return -EFAULT;
>
> pgd = pgd_offset(mm, addr);
> - p4d = p4d_alloc(mm, pgd, addr);
> - if (!p4d)
> - return NULL;
> - pud = pud_alloc(mm, p4d, addr);
> - if (!pud)
> - return NULL;
> - pmd = pmd_alloc(mm, pud, addr);
> - return pmd;
> + if (!pgd_present(*pgd))
> + return -ENOENT;
> +
> + p4d = p4d_offset(pgd, addr);
> + if (!p4d_present(*p4d))
> + return -ENOENT;
> +
> + pud = pud_offset(p4d, addr);
> + if (!pud_present(*pud))
> + return -ENOENT;
> +
> + /* Large PUDs are not supported yet. */
> + if (pud_large(*pud))
> + return -EFAULT;
> +
> + *pmdp = pmd_offset(pud, addr);
> + return 0;
> }
> #endif
>
> @@ -778,8 +792,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
> pmd_t *pmdp;
> pte_t *ptep;
>
> - pmdp = pmd_alloc_map(mm, addr);
> - if (unlikely(!pmdp))
> + if (pmd_lookup(mm, addr, &pmdp))
> return -EFAULT;
>
> ptl = pmd_lock(mm, pmdp);
> @@ -881,8 +894,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
> pte_t *ptep;
> int cc = 0;
>
> - pmdp = pmd_alloc_map(mm, addr);
> - if (unlikely(!pmdp))
> + if (pmd_lookup(mm, addr, &pmdp))
> return -EFAULT;
>
> ptl = pmd_lock(mm, pmdp);
> @@ -935,15 +947,24 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
> pmd_t *pmdp;
> pte_t *ptep;
>
> - pmdp = pmd_alloc_map(mm, addr);
> - if (unlikely(!pmdp))
> + /*
> + * If we don't have a PTE table and if there is no huge page mapped,
> + * the storage key is 0.
> + */
> + *key = 0;
> +
> + switch (pmd_lookup(mm, addr, &pmdp)) {
> + case -ENOENT:
> + return 0;
> + case 0:
> + break;
> + default:
> return -EFAULT;
> + }
>
> ptl = pmd_lock(mm, pmdp);
> if (!pmd_present(*pmdp)) {
> - /* Not yet mapped memory has a zero key */
> spin_unlock(ptl);
> - *key = 0;
> return 0;
> }
>