Re: [RFC 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics
From: Anshuman Khandual
Date: Mon Jul 01 2019 - 23:37:10 EST
On 06/28/2019 03:50 PM, Catalin Marinas wrote:
> Hi Anshuman,
Hello Catalin,
>
> On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:
>> pmd_present() and pmd_trans_huge() are expected to behave in the following
>> manner during various phases of a given PMD. It is derived from a previous
>> detailed discussion on this topic [1] and present THP documentation [2].
>>
>> pmd_present(pmd):
>>
>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>> - Returns false if pmd does not refer to system RAM - Invalid pmd_page(pmd)
>>
>> pmd_trans_huge(pmd):
>>
>> - Returns true if pmd refers to system RAM and is a trans huge mapping
>>
>> -------------------------------------------------------------------------
>> | PMD states | pmd_present | pmd_trans_huge |
>> -------------------------------------------------------------------------
>> | Mapped | Yes | Yes |
>> -------------------------------------------------------------------------
>> | Splitting | Yes | Yes |
>> -------------------------------------------------------------------------
>> | Migration/Swap | No | No |
>> -------------------------------------------------------------------------
>
> Before we actually start fixing this, I would strongly suggest that you
> add a boot selftest (see lib/Kconfig.debug for other similar cases)
> which checks the consistency of the page table macros w.r.t. the
> expected mm semantics. Once the mm maintainers agreed with the
> semantics, it will really help architecture maintainers in implementing
> them correctly.
Sure and it will help all architectures to be in sync wrt semantics.
>
> You wouldn't need actual page tables, just things like assertions on
> pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have
> checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the
> dummy_* variables on the stack.
Hmm. I guess macros which operate directly on a page table entry will be
okay but the ones which check on specific states for VMA or MM might be
bit tricky. Try to emulate VMA/MM states while on stack ?. But sure, will
explore adding such a test.
>
>> The problem:
>>
>> PMD is first invalidated with pmdp_invalidate() before it's splitting. This
>> invalidation clears PMD_SECT_VALID as below.
>>
>> PMD Split -> pmdp_invalidate() -> pmd_mknotpresent -> Clears PMD_SECT_VALID
>>
>> Once PMD_SECT_VALID gets cleared, it results in pmd_present() return false
>> on the PMD entry.
>
> I think that's an inconsistency in the expected semantics here. Do you
> mean that pmd_present(pmd_mknotpresent(pmd)) should be true? If not, do
Actually that is true (more so if we are using generic pmdp_invalidate). Else
in general pmd_present(pmdp_invalidate(pmd)) needs to be true to successfully
represent a splitting THP. That is what Andrea explained back on this thread
(https://lkml.org/lkml/2018/10/17/231).
Extracting relevant sections from that thread -
"pmd_present never meant the real present bit in the pte was set, it just means
the pmd points to RAM. It means it doesn't point to swap or migration entry and
you can do pmd_to_page and it works fine."
"The clear of the real present bit during pmd (virtual) splitting is done with
pmdp_invalidate, that is created specifically to keeps pmd_trans_huge=true,
pmd_present=true despite the present bit is not set. So you could imagine
_PAGE_PSE as the real present bit."
pmd_present() and pmd_mknotpresent() are not exact inverse.
Problem is all platforms using generic pmdp_invalidate() calls pmd_mknotpresent()
which invariably across platforms remove the valid or present bit from the entry.
The point to note here is that pmd_mknotpresent() invalidates the entry from MMU
point of view but pmd_present() does not check for a MMU valid PMD entry. Hence
pmd_present(pmd_mknotpresent(pmd)) can still be true.
In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set
temporarily to remember that it was a mapped PMD which got invalidated recently
but which still points to memory. Hence pmd_present() must evaluate true.
pmd_mknotpresent() does not make !pmd_present() it just invalidates the entry.
> we need to implement our own pmdp_invalidate() or change the generic one
> to set a "special" bit instead of just a pmd_mknotpresent?
Though arm64 can subscribe __HAVE_ARCH_PMDP_INVALIDATE and implement it's own
pmdp_invalidate() in order to not call pmd_mknotpresent() and instead operate
on the invalid and special bits directly. But its not going to alter relevant
semantics here. AFAICS it might be bit better as it saves pmd_mknotpresent()
from putting in that special bit in there which it is not supposed do.
IFAICS there is no compelling reason for generic pmdp_invalidate() to change
either. It calls pmd_mknotpresent() which invalidates the entry through valid
or present bit and platforms which have dedicated huge page bit can still test
positive for pmd_present() after it's invalidation. It works for such platforms.
Platform specific override is required when invalidation via pmd_mknotpresent()
is not enough.
>
>> +static inline int pmd_present(pmd_t pmd)
>> +{
>> + if (pte_present(pmd_pte(pmd)))
>> + return 1;
>> +
>> + return pte_special(pmd_pte(pmd));
>> +}
> [...]
>> +static inline pmd_t pmd_mknotpresent(pmd_t pmd)
>> +{
>> + pmd = pte_pmd(pte_mkspecial(pmd_pte(pmd)));
>> + return __pmd(pmd_val(pmd) & ~PMD_SECT_VALID);
>> +}
>
> I'm not sure I agree with the semantics here where pmd_mknotpresent()
> does not actually make pmd_present() == false.
As Andrea explained, pmd_present() does not check validity of the PMD entry
from MMU perspective but the presence of a valid pmd_page() which still refers
to a valid struct page in the memory. It is irrespective of whether the entry
in itself is valid for MMU walk or not.
+ Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
I have added Andrea on this thread if he would like to add something.
- Anshuman