Re: [PATCH v3 3/5] mm, thp: introduce FOLL_SPLIT_PMD

From: Song Liu
Date: Thu Jun 13 2019 - 13:48:32 EST




> On Jun 13, 2019, at 9:47 AM, Song Liu <songliubraving@xxxxxx> wrote:
>
> Hi Kirill,
>
>> On Jun 13, 2019, at 8:24 AM, Song Liu <songliubraving@xxxxxx> wrote:
>>
>>
>>
>>> On Jun 13, 2019, at 8:14 AM, Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On Thu, Jun 13, 2019 at 03:03:01PM +0000, Song Liu wrote:
>>>>
>>>>
>>>>> On Jun 13, 2019, at 7:16 AM, Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Thu, Jun 13, 2019 at 01:57:30PM +0000, Song Liu wrote:
>>>>>>> And I'm not convinced that it belongs here at all. User requested PMD
>>>>>>> split and it is done after split_huge_pmd(). The rest can be handled by
>>>>>>> the caller as needed.
>>>>>>
>>>>>> I put this part here because split_huge_pmd() for file-backed THP is
>>>>>> not really done after split_huge_pmd(). And I would like it done before
>>>>>> calling follow_page_pte() below. Maybe we can still do them here, just
>>>>>> for file-backed THPs?
>>>>>>
>>>>>> If we would move it, shall we move to callers of follow_page_mask()?
>>>>>> In that case, we will probably end up with similar code in two places:
>>>>>> __get_user_pages() and follow_page().
>>>>>>
>>>>>> Did I get this right?
>>>>>
>>>>> Would it be enough to replace pte_offset_map_lock() in follow_page_pte()
>>>>> with pte_alloc_map_lock()?
>>>>
>>>> This is similar to my previous version:
>>>>
>>>> + } else { /* flags & FOLL_SPLIT_PMD */
>>>> + pte_t *pte;
>>>> + spin_unlock(ptl);
>>>> + split_huge_pmd(vma, pmd, address);
>>>> + pte = get_locked_pte(mm, address, &ptl);
>>>> + if (!pte)
>>>> + return no_page_table(vma, flags);
>>>> + spin_unlock(ptl);
>>>> + ret = 0;
>>>> + }
>>>>
>>>> I think this is cleaner than use pte_alloc_map_lock() in follow_page_pte().
>>>> What's your thought on these two versions (^^^ vs. pte_alloc_map_lock)?
>>>
>>> It's additional lock-unlock cycle and few more lines of code...
>>>
>>>>> This will leave bunch not populated PTE entries, but it is fine: they will
>>>>> be populated on the next access to them.
>>>>
>>>> We need to handle page fault during next access, right? Since we already
>>>> allocated everything, we can just populate the PTE entries and saves a
>>>> lot of page faults (assuming we will access them later).
>>>
>>> Not a lot due to faultaround and they may never happen, but you need to
>>> tear down the mapping any way.
>>
>> I see. Let me try this way.
>>
>> Thanks,
>> Song
>
> To make sure I understand your suggestions. Here is what I got:
>
> diff --git c/mm/gup.c w/mm/gup.c
> index ddde097cf9e4..85e6f46fd925 100644
> --- c/mm/gup.c
> +++ w/mm/gup.c
> @@ -197,7 +197,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> if (unlikely(pmd_bad(*pmd)))
> return no_page_table(vma, flags);
>
> - ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> + ptep = pte_alloc_map_lock(mm, pmd, address, &ptl);
> + if (!ptep)
> + return ERR_PTR(-ENOMEM);
> +
> pte = *ptep;
> if (!pte_present(pte)) {
> swp_entry_t entry;
> @@ -398,7 +401,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> spin_unlock(ptl);
> return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> }
> - if (flags & FOLL_SPLIT) {
> + if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
> int ret;
> page = pmd_page(*pmd);
> if (is_huge_zero_page(page)) {
> @@ -407,7 +410,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> split_huge_pmd(vma, pmd, address);
> if (pmd_trans_unstable(pmd))
> ret = -EBUSY;
> - } else {
> + } else if (flags & FOLL_SPLIT) {
> if (unlikely(!try_get_page(page))) {
> spin_unlock(ptl);
> return ERR_PTR(-ENOMEM);
> @@ -419,6 +422,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> put_page(page);
> if (pmd_none(*pmd))
> return no_page_table(vma, flags);
> + } else { /* flags & FOLL_SPLIT_PMD */
> + spin_unlock(ptl);
> + split_huge_pmd(vma, pmd, address);
> + ret = 0;
> }
>
> return ret ? ERR_PTR(ret) :
> follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>
>
> This version doesn't work as-is, because it returns at the first check:
>
> if (unlikely(pmd_bad(*pmd)))
> return no_page_table(vma, flags);
>
> Did I misunderstand anything here?
>
> Thanks,
> Song

I guess this would be the best. It _is_ a lot simpler.

diff --git c/mm/gup.c w/mm/gup.c
index ddde097cf9e4..0cd3ce599f41 100644
--- c/mm/gup.c
+++ w/mm/gup.c
@@ -398,7 +398,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
spin_unlock(ptl);
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
- if (flags & FOLL_SPLIT) {
+ if (flags & (FOLL_SPLIT | FOLL_SPLIT_PMD)) {
int ret;
page = pmd_page(*pmd);
if (is_huge_zero_page(page)) {
@@ -407,7 +407,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
split_huge_pmd(vma, pmd, address);
if (pmd_trans_unstable(pmd))
ret = -EBUSY;
- } else {
+ } else if (flags & FOLL_SPLIT) {
if (unlikely(!try_get_page(page))) {
spin_unlock(ptl);
return ERR_PTR(-ENOMEM);
@@ -419,6 +419,11 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
put_page(page);
if (pmd_none(*pmd))
return no_page_table(vma, flags);
+ } else { /* flags & FOLL_SPLIT_PMD */
+ spin_unlock(ptl);
+ ret = 0;
+ split_huge_pmd(vma, pmd, address);
+ pte_alloc(mm, pmd);
}

Thanks again for the suggestions. I will send v4 soon.

Song


>
>
>>
>>>
>>> --
>>> Kirill A. Shutemov