Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()

From: Yin, Fengwei
Date: Wed Aug 02 2023 - 09:47:18 EST

On 8/2/2023 9:09 PM, Ryan Roberts wrote:
> On 02/08/2023 13:50, Yin, Fengwei wrote:
>>
>>
>> On 8/2/2023 7:14 PM, Ryan Roberts wrote:
>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>> It will be used to check whether a folio is mapped to a specific
>>>> VMA and whether the folio's mapping address is in the range.
>>>>
>>>> Also add a helper function folio_within_vma() to check whether a
>>>> folio is within the range of a VMA, based on folio_in_range().
>>>>
>>>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
>>>> ---
>>>> mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 69 insertions(+)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 5a03bc4782a2..63de32154a48 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>> bool write, int *locked);
>>>> extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>> unsigned long bytes);
>>>> +
>>>> +/*
>>>> + * Check whether the folio is in a specific range.
>>>> + *
>>>> + * First, check whether the folio is in the range of the vma.
>>>> + * Then, check whether the folio is mapped into the range [start, end).
>>>> + * Finally, check whether the folio is fully mapped to the range.
>>>> + *
>>>> + * @pte: page table pointer checked to verify that the large folio
>>>> + * is fully mapped. Currently, if mremap() is done in the middle of
>>>> + * a large folio, the folio could end up mapped into different
>>>> + * VMAs, and an address check alone can't identify this situation.
>>>> + */
>>>> +static inline bool
>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>> + unsigned long start, unsigned long end, pte_t *pte)
>>>
>>> This api seems a bit redundant to me. Wouldn't it be better to remove the vma
>>> parameter and instead fix up the start/end addresses in folio_within_vma()?
>> My understanding is that it's necessary. For madvise, we need to check whether
>> the folio is both in the range of the VMA and in the range of [start, end).
>
> But in folio_within_vma() you pass start as vma->vm_start and end as
> vma->vm_end. And in this function, you narrow start/end to be completely
> contained in vma. So surely there is only really one start/end you are
> interested in? Just seems a bit odd to me.
madvise() will call folio_in_range() with the VMA and the real range [start, end)
passed from user space.
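
For illustration, the call on the madvise side could look roughly like this
(a hedged sketch, not code from this series; the helper name and locals are
made up for the example):

	/*
	 * Hedged sketch: start/end are the user-supplied madvise() range
	 * and pte is the entry the folio was found through, with its PTL
	 * held by the caller. folio_in_range() clamps start/end to the
	 * VMA internally, so no manual intersection is needed here.
	 */
	static bool madvise_folio_in_range(struct folio *folio,
			struct vm_area_struct *vma,
			unsigned long start, unsigned long end,
			pte_t *pte)
	{
		return folio_in_range(folio, vma, start, end, pte);
	}

folio_within_vma() then just passes vma->vm_start/vma->vm_end, so both
callers go through the same clamping logic.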

>
>>
>>>
>>>> +{
>>>> + pte_t ptent;
>>>> + unsigned long i, nr = folio_nr_pages(folio);
>>>> + pgoff_t pgoff, addr;
>>>> + unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>> +
>>>> + VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>> +
>>>> + if (start < vma->vm_start)
>>>> + start = vma->vm_start;
>>>> + if (end > vma->vm_end)
>>>> + end = vma->vm_end;
>>>> +
>>>> + pgoff = folio_pgoff(folio);
>>>> + /* if the folio's start address is not in the vma range */
>>>> + if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>> + return false;
>>>> +
>>>> + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>> + if (addr < start || end - addr < folio_size(folio))
>>>> + return false;
>>>> +
>>>> + /* no need to check ptes for a non-large folio */
>>>> + if (!folio_test_large(folio))
>>>> + return true;
>>>> +
>>>> + if (!pte)
>>>> + return false;
>>>> +
>>>> + /* check whether the pte parameter is associated with the folio */
>>>> + ptent = ptep_get(pte);
>>>> + if (pte_none(ptent) || !pte_present(ptent) ||
>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>> + return false;
>>>> +
>>>> + pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>> + for (i = 0; i < nr; i++, pte++) {
>>>> + ptent = ptep_get(pte);
>>>> +
>>>> + if (pte_none(ptent) || !pte_present(ptent) ||
>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>> + return false;
>>>> + }
>>>
>>> I don't think I see anything to ensure you don't wander off the end (or start)
>>> of the pgtable? If the folio is mremapped so that it straddles multiple tables
>>> (or is bigger than a single table?) then I think pte can become invalid? Perhaps
>>> you intended start/end to always be within the same pgtable, but that is not
>>> guaranteed in the case that folio_within_vma() is making the call.
>> If the pte is invalid for any reason (wrong parameter passed, folio not fully
>> mapped, etc.), this function just returns false in the page table entry check phase.
>
> Sorry, I don't think this covers the issue I'm describing. If you have a
> pte-mapped THP that gets mremapped to straddle 2 pte tables, don't you have a
> problem?
>
> Example for a 4K base-page setup:
>
> folio_nr_pages = 512
> first page of folio mapped at vaddr = 2M - 4K = 0x1FF000
>
> If you then call this function with the pte pointer for the second page in the
> folio, which is mapped at address 0x200000, that pte is pointing to the first
> pte entry in the table pointed to by the second pmd entry. The pte pointer can
> be legitimately manipulated to point to any entry within that table,
> corresponding to vaddrs [0x200000, 0x400000). But you will end up subtracting 1
> from the pointer, intending that it now points to the pte entry that represents
> vaddr 0x1FF000. But actually it has fallen off the front of the table into some
> other arbitrary memory in the linear map. 0x1FF000 is represented in a different
> table, pointed to by the first pmd entry.
Yes. This can be an issue, as holding the second page table lock can't prevent
the first part from being unmapped. Let me add another check that the vaddr is
aligned to folio_size() in the next version, along the lines of the sketch below.
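
Something like this inside folio_in_range(), after addr is computed (a
rough sketch of the idea, not the final patch; it assumes a pte-mapped
folio, i.e. folio_size() <= PMD_SIZE):

	/*
	 * Sketch: if the mapping address is naturally aligned to the
	 * folio size, the folio cannot straddle a PMD boundary (both
	 * are powers of two and folio_size() divides PMD_SIZE), so all
	 * of the folio's ptes live in the same page table as @pte and
	 * walking backwards from @pte cannot fall off the table.
	 */
	if (!IS_ALIGNED(addr, folio_size(folio)))
		return false;

In your example, addr would be 0x1FF000, which is not 2M aligned, so the
check rejects the folio before any pte is dereferenced.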

Regards
Yin, Fengwei

>
>
>>
>>>
>>> Also I want to check that this function is definitely always called under the
>>> PTL for the table that pte belongs to?
>> Yes. I should spell it out. Thanks.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>> +
>>>> + return true;
>>>> +}
>>>> +
>>>> +static inline bool
>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>> +{
>>>> + return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>> +}
>>>> +
>>>> /*
>>>> * mlock_vma_folio() and munlock_vma_folio():
>>>> * should be called with vma's mmap_lock held for read or write,
>>>
>