Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

From: David Hildenbrand
Date: Thu Sep 01 2022 - 02:58:59 EST


On 01.09.22 00:18, Yang Shi wrote:
> On Wed, Aug 31, 2022 at 12:43 PM Yang Shi <shy828301@xxxxxxxxx> wrote:
>>
>> On Wed, Aug 31, 2022 at 12:36 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>
>>> On 31.08.22 21:34, Yang Shi wrote:
>>>> On Wed, Aug 31, 2022 at 12:15 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 31.08.22 21:08, Yang Shi wrote:
>>>>>> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On 31.08.22 19:55, Yang Shi wrote:
>>>>>>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> The comment is stale, because a TLB flush is no longer sufficient, nor
>>>>>>>>> required, to synchronize against concurrent GUP-fast. This used to be true
>>>>>>>>> in the past, when a TLB flush implied an IPI on architectures that support
>>>>>>>>> GUP-fast, so GUP-fast -- which disables local interrupts -- had to complete
>>>>>>>>> before the flush could complete.
>>>>>>>>
>>>>>>>> Hmm... it seems there might be a problem for THP collapse IIUC. THP
>>>>>>>> collapse clears and flushes the pmd before doing anything with the ptes,
>>>>>>>> and relies on fast GUP disabling interrupts to serialize against fast GUP.
>>>>>>>> But if the TLB flush is no longer sufficient, then we may run into the
>>>>>>>> race below IIUC:
>>>>>>>>
>>>>>>>>              CPU A                                    CPU B
>>>>>>>>         THP collapse                                 fast GUP
>>>>>>>>
>>>>>>>>                                          gup_pmd_range() <-- see valid pmd
>>>>>>>>                                          gup_pte_range() <-- work on pte
>>>>>>>> clear pmd and flush TLB
>>>>>>>> __collapse_huge_page_isolate()
>>>>>>>>     isolate page <-- before GUP bump refcount
>>>>>>>>                                          pin the page
>>>>>>>> __collapse_huge_page_copy()
>>>>>>>>     copy data to huge page
>>>>>>>>     clear pte (don't flush TLB)
>>>>>>>> Install huge pmd for huge page
>>>>>>>>                                          return the obsolete page
>>>>>>>
>>>>>>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
>>>>>>> cleared yet. And we don't check if the PMD changed once we're in
>>>>>>> gup_pte_range().
>>>>>>
>>>>>> Yes
>>>>>>
>>>>>>>
>>>>>>> The comment most certainly should be stale as well -- unless there is
>>>>>>> some kind of an implicit IPI broadcast being done.
>>>>>>>
>>>>>>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
>>>>>>> IPI broadcast on THP split (which is a rare event), allows one to
>>>>>>> protect a page table walker by merely disabling the interrupts during
>>>>>>> the walk."
>>>>>>>
>>>>>>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
>>>>>>> being done here (in collapse) as well?
>>>>>>
>>>>>> The TLB flush may send an IPI. I suppose it is arch-dependent, right?
>>>>>> Some arches do use an IPI, some may not.
>>>>>
>>>>> Right, and the whole idea of the RCU GUP-fast was to support
>>>>> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
>>>>> it -- but maybe it does so for PMDs?
>>>>
>>>> It looks like powerpc does issue an IPI for the pmd flush, but arm64
>>>> doesn't IIRC.
>>>>
>>>> So maybe we should implement pmdp_collapse_flush() for those arches to
>>>> issue an IPI.
>>>
>>> ... or find another way to detect and handle this in GUP-fast?
>>>
>>> Not sure if, for handling PMDs, it could be sufficient to propagate the
>>> pmdp pointer + value and double check that the values didn't change.
>>
>> Should work too, right before pinning the page.
>
> I actually mean the same place where the pte is checked. So, something like:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..2b0703403902 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2392,7 +2392,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>                         goto pte_unmap;
>                 }
>
> -               if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> +               if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> +                   unlikely(pte_val(pte) != pte_val(*ptep))) {
>                         gup_put_folio(folio, 1, flags);
>                         goto pte_unmap;
>                 }
>
> It doesn't build, just shows the idea.

Exactly what I had in mind. We should add a comment spelling out that
this is for handling huge PMD collapse.
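
For illustration only, a minimal, compilable userspace sketch of that idea --
not an actual mm/gup.c patch (as noted, gup_pte_range() would still need the
pmdp passed down), and every *_model name below is a hypothetical stand-in:

/*
 * Userspace model of the re-validation discussed above (not kernel code):
 * pmd/pte are plain values, try_grab_page_model() stands in for the pinning
 * step. The point is the ordering: snapshot pmd and pte, pin, then re-check
 * BOTH levels and back out if either changed, so a concurrent THP collapse
 * that already cleared the pmd cannot hand back a stale page.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long pmd_t;
typedef unsigned long pte_t;

static bool try_grab_page_model(void) { return true; }	/* "pin the page" */
static void put_page_model(void) { }			/* "unpin it again" */

static bool gup_pte_model(volatile pmd_t *pmdp, pmd_t pmd,
			  volatile pte_t *ptep)
{
	pte_t pte = *ptep;			/* lockless pte snapshot */

	if (!try_grab_page_model())
		return false;

	/*
	 * Re-check the pmd as well as the pte: THP collapse clears the pmd
	 * (and flushes the TLB) before touching the ptes, so a changed pmd
	 * means this page table may already be disconnected. Drop the pin
	 * and fall back to the slow path in that case.
	 */
	if (*pmdp != pmd || *ptep != pte) {
		put_page_model();
		return false;
	}
	return true;
}

int main(void)
{
	pmd_t pmd = 0x1000;
	pte_t pte = 0x2000;

	printf("fast path: %s\n",
	       gup_pte_model(&pmd, pmd, &pte) ? "pinned" : "fall back");
	return 0;
}

In the diff above, the second condition is today's pte_val(pte) !=
pte_val(*ptep) check, and the first is the new pmd re-validation.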


--
Thanks,

David / dhildenb