Re: [PATCH v1 1/2] arm64/mm: Move PTE_PROT_NONE and PMD_PRESENT_INVALID

From: Ryan Roberts
Date: Thu Apr 25 2024 - 06:37:53 EST


On 25/04/2024 11:29, Ryan Roberts wrote:
> On 25/04/2024 10:16, David Hildenbrand wrote:
>> On 24.04.24 13:10, Ryan Roberts wrote:
>>> Previously PTE_PROT_NONE was occupying bit 58, one of the bits reserved
>>> for SW use when the PTE is valid. This is a waste of those precious SW
>>> bits since PTE_PROT_NONE can only ever be set when valid is clear.
>>> Instead let's overlay it on what would be a HW bit if valid was set.
>>>
>>> We need to be careful about which HW bit to choose since some of them
>>> must be preserved; when pte_present() is true (as it is for a
>>> PTE_PROT_NONE pte), it is legitimate for the core to call various
>>> accessors, e.g. pte_dirty(), pte_write() etc. There are also some
>>> accessors that are private to the arch which must continue to be
>>> honoured, e.g. pte_user(), pte_user_exec() etc.
>>>
>>> So we choose to overlay PTE_UXN; This effectively means that whenever a
>>> pte has PTE_PROT_NONE set, it will always report pte_user_exec() ==
>>> false, which is obviously always correct.
>>>
>>> As a result of this change, we must shuffle the layout of the
>>> arch-specific swap pte so that PTE_PROT_NONE is always zero and not
>>> overlapping with any other field. As a result of this, there is no way
>>> to keep the `type` field contiguous without conflicting with
>>> PMD_PRESENT_INVALID (bit 59), which must also be 0 for a swap pte. So
>>> let's move PMD_PRESENT_INVALID to bit 60.
>>
>> A note that some archs split/re-combine type and/or offset, to make use of every
>> bit possible :) But that's mostly relevant for 32bit.
>>
>> (and as long as PFNs can still fit into the swp offset for migration entries etc.)
>
> Yeah, I considered splitting the type or offset field to avoid moving
> PMD_PRESENT_INVALID, but thought it was better to avoid the extra mask and shift.

Also, IMHO we shouldn't really need to reserve PMD_PRESENT_INVALID for swap
ptes; it would be cleaner to have one bit that defines "present" when valid is
clear (similar to PTE_PROT_NONE today) then another bit which is only defined
when "present && !valid" which tells us if this is PTE_PROT_NONE or
PMD_PRESENT_INVALID (I don't think you can ever have both at the same time?).

But there is a problem with this: __split_huge_pmd_locked() calls
pmdp_invalidate() for a pmd before it determines that it is pmd_present(). So
the PMD_PRESENT_INVALID can be set in a swap pte today. That feels wrong to me,
but was trying to avoid the whole thing unravelling so didn't persue.

>
>>
>>>
>>> In the end, this frees up bit 58 for future use as a proper SW bit (e.g.
>>> soft-dirty or uffd-wp).
>>
>> I was briefly confused about how you would use these bits as SW bits for swap
>> PTEs (which you can't as they overlay the type). See below regarding bit 3.
>>
>> I would have said here "proper SW bit for present PTEs".
>
> Yes; I'll clarify in the next version.
>
>>
>>>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>> ---
>>>   arch/arm64/include/asm/pgtable-prot.h |  4 ++--
>>>   arch/arm64/include/asm/pgtable.h      | 16 +++++++++-------
>>>   2 files changed, 11 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable-prot.h
>>> b/arch/arm64/include/asm/pgtable-prot.h
>>> index dd9ee67d1d87..ef952d69fd04 100644
>>> --- a/arch/arm64/include/asm/pgtable-prot.h
>>> +++ b/arch/arm64/include/asm/pgtable-prot.h
>>> @@ -18,14 +18,14 @@
>>>   #define PTE_DIRTY        (_AT(pteval_t, 1) << 55)
>>>   #define PTE_SPECIAL        (_AT(pteval_t, 1) << 56)
>>>   #define PTE_DEVMAP        (_AT(pteval_t, 1) << 57)
>>> -#define PTE_PROT_NONE        (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
>>> +#define PTE_PROT_NONE        (PTE_UXN)         /* Reuse PTE_UXN; only when
>>> !PTE_VALID */
>>>     /*
>>>    * This bit indicates that the entry is present i.e. pmd_page()
>>>    * still points to a valid huge page in memory even if the pmd
>>>    * has been invalidated.
>>>    */
>>> -#define PMD_PRESENT_INVALID    (_AT(pteval_t, 1) << 59) /* only when
>>> !PMD_SECT_VALID */
>>> +#define PMD_PRESENT_INVALID    (_AT(pteval_t, 1) << 60) /* only when
>>> !PMD_SECT_VALID */
>>>     #define _PROT_DEFAULT        (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
>>>   #define _PROT_SECT_DEFAULT    (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index afdd56d26ad7..23aabff4fa6f 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1248,20 +1248,22 @@ static inline pmd_t pmdp_establish(struct
>>> vm_area_struct *vma,
>>>    * Encode and decode a swap entry:
>>>    *    bits 0-1:    present (must be zero)
>>>    *    bits 2:        remember PG_anon_exclusive
>>> - *    bits 3-7:    swap type
>>> - *    bits 8-57:    swap offset
>>> - *    bit  58:    PTE_PROT_NONE (must be zero)
>>
>> Reading this patch alone: what happened to bit 3? Please mention that that it
>> will be used as a swap pte metadata bit (uffd-wp).
>
> Will do. It's all a bit arbitrary though. I could have put offset in 3-52, and
> then 53 would have been spare for uffd-wp. I'm not sure there is any advantage
> to either option.
>
>>
>>> + *    bits 4-53:    swap offset
>>
>> So we'll still have 50bit for the offset, good. We could even use 61-63 if ever
>> required to store bigger PFNs.
>
> yep, or more sw bits.
>
>>
>> LGTM
>>
>> Reviewed-by: David Hildenbrand <david@xxxxxxxxxx>
>
> Thanks!
>
>