Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()
From: Nadav Amit
Date: Tue Oct 26 2021 - 13:44:22 EST
> On Oct 26, 2021, at 9:47 AM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>
>
>
>> On Oct 26, 2021, at 9:06 AM, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>>
>> On 10/21/21 5:21 AM, Nadav Amit wrote:
>>> The first TLB flush is only necessary to prevent the dirty bit (and with
>>> a lesser importance the access bit) from changing while the PTE is
>>> modified. However, this is not necessary as the x86 CPUs set the
>>> dirty-bit atomically with an additional check that the PTE is (still)
>>> present. One caveat is Intel's Knights Landing that has a bug and does
>>> not do so.
>>
>> First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
>> see it anywhere.
>
> No, it is me who missed it. It should have been in pmdp_invalidate_ad():
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..f14f64cc17b5 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
> return 0;
> }
>
> +/*
> + * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
> + * updated by the CPU.
> + *
> + * Returns the original PTE.
> + *
> + * During an access to a page, x86 CPUs set the dirty and access bit atomically
> + * with an additional check of the present-bit. Therefore, it is possible to
> + * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
> + *
> + * We do not make this optimization on certain CPUs that has a bug that violates
> + * this behavior (specifically Knights Landing).
> + */
> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmdp)
> +{
> + pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> +
> + if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + return old;
> +}
>
>>
>>> - * pmdp_invalidate() is required to make sure we don't miss
>>> - * dirty/young flags set by hardware.
>>
>> This got me thinking... In here:
>>
>>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flkml%2F20160708001909.FB2443E2%40viggo.jf.intel.com%2F&data=04%7C01%7Cnamit%40vmware.com%7Cf6a2a69eec094b12638108d9989afb60%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637708613735772213%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=o8fYbm8BKHvWxYC9aO5e3MFLkaOnQxvDMy%2BEnYxz56I%3D&reserved=0
>>
>> I wrote:
>>
>>> These bits are truly "stray". In the case of the Dirty bit, the
>>> thread associated with the stray set was *not* allowed to write to
>>> the page. This means that we do not have to launder the bit(s); we
>>> can simply ignore them.
>>
>> Is the goal of your proposed patch here to ensure that the dirty bit is
>> not set at *all*? Or, is it to ensure that a dirty bit which we need to
>> *launder* is never set?
>
> At *all*.
>
> Err… I remembered from our previous discussions that the dirty bit cannot
> be set once the R/W bit is cleared atomically. But going back to the SDM,
> I see the (relatively new?) note:
>
> "If software on one logical processor writes to a page while software on
> another logical processor concurrently clears the R/W flag in the
> paging-structure entry that maps the page, execution on some processors may
> result in the entry’s dirty flag being set (due to the write on the first
> logical processor) and the entry’s R/W flag being clear (due to the update
> to the entry on the second logical processor). This will never occur on a
> processor that supports control-flow enforcement technology (CET)”
>
> So I guess that this optimization can only be enabled when CET is enabled.
>
> :(
I still wonder whether the SDM comment applies to present bit vs dirty
bit atomicity as well.
On AMD’s APM I find:
"The processor never sets the Accessed bit or the Dirty bit for a not
present page (P = 0). The ordering of Accessed and Dirty bit updates
with respect to surrounding loads and stores is discussed below.”
( The later comment regards ordering to WC memory ).
I don’t know if I read it too creatively...