Re: [PATCH 1/3] mm: add functions folio_in_range() and folio_within_vma()

From: Yin, Fengwei
Date: Fri Aug 04 2023 - 05:09:16 EST


Hi Ryan,

On 8/4/2023 4:46 PM, Ryan Roberts wrote:
> On 04/08/2023 00:15, Yin, Fengwei wrote:
>>
>>
>> On 8/3/2023 9:20 PM, Ryan Roberts wrote:
>>> On 03/08/2023 11:48, Yin Fengwei wrote:
>>>>
>>>>
>>>> On 8/3/23 17:58, Ryan Roberts wrote:
>>>>> On 28/07/2023 08:09, Yin Fengwei wrote:
>>>>>> It will be used to check whether the folio is mapped to specific
>>>>>> VMA and whether the mapping address of folio is in the range.
>>>>>>
>>>>>> Also a helper function folio_within_vma() to check whether folio
>>>>>> is in the range of vma based on folio_in_range().
>>>>>>
>>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
>>>>>> ---
>>>>>> mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 1 file changed, 69 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>>> index 5a03bc4782a2..63de32154a48 100644
>>>>>> --- a/mm/internal.h
>>>>>> +++ b/mm/internal.h
>>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
>>>>>> bool write, int *locked);
>>>>>> extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags,
>>>>>> unsigned long bytes);
>>>>>> +
>>>>>
>>>>> Hi Yin,
>>>>>
>>>>> I wanted to take a step back and consolidate my concerns about this function. I
>>>>> should say that my understanding of the pgoff and index stuff is shaky and I
>>>>> could therefore be wrong about some of this; if this is the case, then sorry for
>>>>> the noise! But something about this function doesn't smell right to me, so I
>>>>> figure its better to raise it...
>>>> No worries. And thank you for looking at the code ans share the comments.
>>>> That helps me a lot. Really appreciate it.
>>>>
>>>>>
>>>>>> +/*
>>>>>> + * Check whether the folio is in specific range
>>>>>
>>>>> What exactly is the function trying to do? I *think* the intention is to figure
>>>>> out if a folio is fully and contiguously mapped within a range of virtual
>>>>> addresses belonging to a specific virtual address space? And I assume you want
>>>>> the answer to be precise? I'm assuming 'yes' for the below.
>>>>>
>>>>> I think the only way to do this is to actually check each PTE. And that causes a
>>>>> problem, because a folio can straddle multiple PTE tables, which causes PTL
>>>>> locking issues, and means having just a *pte pointer is insufficient if we need
>>>>> to traverse multiple pte tables. So perhaps you need to allow for a false
>>>>> negative in the case that the folio is not contained within a single pte table.
>>>> Let's check the users of this function first:
>>>> mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA
>>>> madvise: Needs to check whether the folio is in the range of VMA and
>>>> range [start, end).
>>>
>>> Sure but those 2 ranges (the vma and the user-supplied range) are known to
>>> intersect, right? So really there is only one range of interest; the
>>> intersection. I would argue that should be done in a higher level wrapper, and
>>> is not part of the core primitive to work out if a folio is mapped to a
>>> particular range of virtual addresses.
>>>> This range is from user space. It could contain
>>>> VMA range (in this case, we only need to check VMA range) or is contained
>>>> by VMA range.
>>>>
>>>> So we check:
>>>> 1. Whether the folio is in the range.
>>>>
>>>> To do this, we need to first check whether the folio->index is in the
>>>> VMA range. If not, we know the folio is not in VMA range. Just return false
>>>> without further check.
>>>
>>> Right, so if the check fails, the folio is definitely not mapped by the vma, but
>>> if it passes, it *might* be. Further checks are required. So why is this useful?
>>> Why not just do the check that gives you the precise answer and skip this?
>> What check can give precise answer? I don't think checking PTE is right at this
>> stage. For following case (assume the folio is mapped in same page table and just
>> care about the VMA2 range):
>>
>> |-----------vma1-----------|------------vma2---------|
>> |-------folio-----|
>>
>> How can we tell whether the folio is in the range of VMA2 by just checking PTE?
>> And in this case, why not bail out if we see folio is out range of VMA2?
>>
>>
>> So only need to check PTE if we are sure the folio is in the range of VMA2:
>> |-----------vma1-----------|------------vma2---------|
>> |-------folio-----|
>>
>>>
>>>>
>>>> Then, we will check whether the folio is in the range which is defined as
>>>> [max(start, vma->vm_start), min(end, vma->vm_end)).
>>>
>>> As par above comment, I would personally factor this out of the primitive.
>>>
>>>>
>>>>
>>>> This is the version in RFC as I was not aware of mremap case and forgot the
>>>> page cache case. So if no mremap with an anonymous folio, this check is enough.
>>>> But we need to add PTE check for mremap and page cache cases.
>>>>
>>>> 2. Check PTE for mremap in middle of large folio and page cache
>>>> If the folio is normal 4K and if the folio is in the range, it's not possible
>>>> the folio is partially mapped to two different VMA. So we are sure this folio
>>>> is in the range.
>>>
>>> But you test for small folio and exit early without checking the PTE. Given the
>>> index check only told you that the folio *might be* mapped, I don't see how you
>>> can return true at this point for a small folio, without looking at the PTE?
>> Yes. We can do this. You can check the discussion on the RFC version. I did move
>> the normal 4K folio check out of this function. Yu and Huge suggested to just ignore
>> the normal 4K as this API can' handle it.
>>
>> The thing changed from RFC is that we need to check PTE here. I tried to avoid to
>> check PTE even for normal 4K folio.
>
> I'll go read the RFC. I've made my point; if you and others are convinced this
> is correct, then fair enough.
Thanks a lot for your comments. I will work on new version based on your and Andrew's
comments.

Sometimes, I blindly assume other people know the assumptions (like at least one page
mapped in the VMA) I made. Which did make the review harder. I will try to improve
on this in the future.

Regards
Yin, Fengwei

>
>>
>>>
>>> folio->index is just the page offset within the file it maps (or the (original)
>>> VA/PAGE_SIZE for anon memory - I think?). And vma->vm_pgoff is the page offset
>>> within the file that the vma starts at. So you could have a folio that contains
>>> the contents of 1 file and a vma that maps another file, and the folio's index
>>> might fall within the VMA, but it doesn't mean the folio is mapped by the vma;
>>> its a different file.
>>>
>>> Or perhaps there is an assumption in the function's spec that the folio is known
>>> to have at least one page mapped in the vma? That would change things... but you
>>> should make that very clear in the spec. And in that case, you can move the
>>> small folio check to the start and return true immediately.
>> Yes. "At least one page mapped in VMA" is assumption here. I will make it clear in
>> the comment.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> Currently, check PTE didn't handle the case you mentioned. But my plan is
>>>> checking whether the folio is mapped cross page table (by checking whether
>>>> the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE).
>>>> If it cross, I will treat it as out of VMA range. Otherwise, we can reuse
>>>> current check because we know the PTE pointer is always valid.
>>>>
>>>> Obviously, the PTE checking needs hold pte lock as you pointed out.
>>>>
>>>>
>>>> My understanding:
>>>> pgoff is important if we have folio and VMA and want to know the virtual address of
>>>> the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get
>>>> vaddr by:
>>>> addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>> rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk()
>>>> implementation.
>>>
>>> Yes but the rmap is only traversing vmas that are already known to map the same
>>> file that the folio contains pages for. (See rmap_walk_file(): it grabs the
>>> "mapping" from the folio, then traverses the list of vmas within the range of
>>> interest "vma_interval_tree_foreach"). Its a bit more complicated for anon
>>> memory but I think the princple is the same.
>>>
>>>>
>>>>>
>>>>>> + *
>>>>>> + * First, check whether the folio is in the range of vma.
>>>>>> + * Then, check whether the folio is mapped to the range of [start, end].
>>>>>> + * In the end, check whether the folio is fully mapped to the range.
>>>>>> + *
>>>>>> + * @pte page table pointer will be checked whether the large folio
>>>>>> + * is fully mapped to. Currently, if mremap in the middle of
>>>>>> + * large folio, the large folio could be mapped to to different
>>>>>> + * VMA and address check can't identify this situation.
>>>>>> + */
>>>>>> +static inline bool
>>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma,
>>>>>> + unsigned long start, unsigned long end, pte_t *pte)
>>>>>
>>>>> The prototype looks odd to me; Fundamentally it seems to me that you need the
>>>>> folio, start and end virtual addresses (the range you want to check that it is
>>>>> in), the pte pointer and the virtual address that the pte represents. That
>>>>> virtual address allows you to figure out the offset between the pa and va. Then
>>>>> you can look at all the ptes to figure out if they reference the folio's pfns,
>>>>> and use the va to pa mapping info to figure out if its within the specified range.
>>>>>
>>>>> I don't really understand why the vma is useful.
>>>>>
>>>>>> +{
>>>>>> + pte_t ptent;
>>>>>> + unsigned long i, nr = folio_nr_pages(folio);
>>>>>> + pgoff_t pgoff, addr;
>>>>>> + unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>>>>>> +
>>>>>> + VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio);
>>>>>> +
>>>>>> + if (start < vma->vm_start)
>>>>>> + start = vma->vm_start;
>>>>>> + if (end > vma->vm_end)
>>>>>> + end = vma->vm_end;
>>>>>> +
>>>>>> + pgoff = folio_pgoff(folio);
>>>>>> + /* if folio start address is not in vma range */
>>>>>> + if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen)
>>>>>> + return false;
>>>>>
>>>>> Why is this pgoff calculation helpful? Surely it's only useful if the folio
>>>>> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff
>>>>> might be referring to a completely different file than the vma's vm_pgoff. So
>>>>> you will get spurious results.
>>>>>
>>>>>> +
>>>>>> + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>>>>>> + if (addr < start || end - addr < folio_size(folio))
>>>>>> + return false;
>>>>>> +
>>>>>> + /* not necessary to check pte for none large folio */
>>>>>> + if (!folio_test_large(folio))
>>>>>> + return true;
>>>>>
>>>>> I don't understand why you don't need to check the pte for a small folio? As
>>>>> above, if the folio doesn't belong to the file that the vma is mapping the above
>>>>> checks seem wrong and you can't conclude that the folio is mapped in the range
>>>>> without looking at the pte?
>>>>>
>>>>>> +
>>>>>> + if (!pte)
>>>>>> + return false;
>>>>>> +
>>>>>> + /* check whether parameter pte is associated with folio */
>>>>>> + ptent = ptep_get(pte);
>>>>>> + if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>> + return false;
>>>>>> +
>>>>>> + pte -= pte_pfn(ptent) - folio_pfn(folio);
>>>>>
>>>>> I think this could wander off the front or back of the pte table into arbitrary
>>>>> memory if the folio is strddling multiple pte tables.
>>>>>
>>>>>> + for (i = 0; i < nr; i++, pte++) {
>>>>>> + ptent = ptep_get(pte);
>>>>>> +
>>>>>> + if (pte_none(ptent) || !pte_present(ptent) ||
>>>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr)
>>>>>
>>>>> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is
>>>>> redundant?
>>>> I think you are right. pte_none() can be removed here.
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>> + return false;
>>>>>> + }
>>>>>> +
>>>>>> + return true;
>>>>>> +}
>>>>>> +
>>>>>> +static inline bool
>>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte)
>>>>>> +{
>>>>>> + return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte);
>>>>>> +}
>>>>>> +
>>>>>> /*
>>>>>> * mlock_vma_folio() and munlock_vma_folio():
>>>>>> * should be called with vma's mmap_lock held for read or write,
>>>>>
>>>
>