Re: [PATCH 0/2] arm64: hugetlb: Fix page fault loop for sw-dirty/hw-clean contiguous PTEs

From: James Houghton
Date: Tue Dec 05 2023 - 12:55:27 EST


On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 04/12/2023 17:26, James Houghton wrote:
> > It is currently possible for a userspace application to enter a page
> > fault loop when using HugeTLB pages implemented with contiguous PTEs
> > when HAFDBS is not available. This happens because:
> > 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean
> > (PTE_DIRTY | PTE_RDONLY | PTE_WRITE).
>
> Hi James,
>
> Do you know how this happens?

Hi Ryan,

Thanks for taking a look! I do understand why this is happening. There
is an explanation in the reproducer[1] and also in this cover letter
(though I realize I could have been a little clearer). See below.

> AFAIK, this is the set of valid bit combinations, and
> PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is
> to understand how this is happening and prevent it?
>
> /*
> * PTE bits configuration in the presence of hardware Dirty Bit Management
> * (PTE_WRITE == PTE_DBM):
> *
> * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw)
> * 0 0 | 1 0 0
> * 0 1 | 1 1 0
> * 1 0 | 1 0 1
> * 1 1 | 0 1 x
> *
> * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via
> * the page fault mechanism. Checking the dirty status of a pte becomes:
> *
> * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
> */

Thanks for pointing this out. So (1) is definitely a bug. The second
patch in this series makes it impossible to create such a PTE via
pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well).

> > The second patch in this series makes step (1) less likely to occur.

It makes it impossible to create this invalid set of bits via
pte_modify(). Assuming all PTE pgprot updates are done via the proper
interfaces, patch #2 might actually make this invalid bit combination
impossible to produce (that's certainly the goal). So perhaps language
stronger than "less likely" is appropriate.

Here's the sequence of events to trigger this bug, via mprotect():

> > Without this patch, we can get the kernel to write a sw-dirty, hw-clean
> > PTE with the following steps (showing the relevant VMA flags and pgprot
> > bits):
> > i. Create a valid, writable contiguous PTE.
> > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
> > VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> > PTE pgprot bits: PTE_DIRTY | PTE_WRITE
> > ii. mprotect the VMA to PROT_NONE.
> > VMA vmflags: VM_SHARED
> > VMA pgprot bits: PTE_RDONLY
> > PTE pgprot bits: PTE_DIRTY | PTE_RDONLY
> > iii. mprotect the VMA back to PROT_READ | PROT_WRITE.
> > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE
> > VMA pgprot bits: PTE_RDONLY | PTE_WRITE
> > PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY

With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY |
PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty).

Thanks!

> > [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0