Re: [PATCH v3 11/16] mm/mmap: Track start and end of munmap in vma_munmap_struct

From: Liam R. Howlett
Date: Mon Jul 08 2024 - 10:46:11 EST


* Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [240705 16:27]:
> On Thu, Jul 04, 2024 at 02:27:13PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
> >
> > Set the start and end address for munmap when the prev and next are
> > gathered. This is needed to avoid incorrect addresses being used during
> > the vms_complete_munmap_vmas() function if the prev/next vma are
> > expanded.
>
> When we spoke about this separately you mentioned that specific arches may
> be more likely to encounter this issue, perhaps worth mentioning something
> about that in the commit msg? Unless I misunderstood you.

What we spoke about was mappings outside of vmas, that is, on certain
archs there may be mappings in the gap between two vmas - I'm not
entirely sure about this or whether it's still something we have to
worry about. That is, we use prev->vm_end and next->vm_start as the
unmapping range instead of the actual vma start and end.

There are also the upper and lower limits for when prev or next does
not exist. See git ids 6ee8630e02be6 and e2cdef8c847b4 - probably from
an older git history than kernel.org: https://github.com/mpe/linux-fullhistory.git
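
So effectively what this patch ends up handing to free_pgtables() is
something like the following - a sketch of the end result, not the
literal code, with the locals named floor/ceiling purely for
illustration:

        unsigned long floor = vms->prev ? vms->prev->vm_end : FIRST_USER_ADDRESS;
        unsigned long ceiling = vms->next ? vms->next->vm_start : USER_PGTABLES_CEILING;

        free_pgtables(&tlb, mas_detach, vms->vma, floor, ceiling, mm_wr_locked);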

What I am trying to avoid here is using the prev->vm_end address for
munmap when we are changing the prev->vm_end to expand over the area we
are mapping. And the same for expanding next backwards.
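
As a rough sketch of the ordering this is guarding against - not the
literal call chain, just the point where prev->vm_end can change
between gather and complete:

        vms_gather_munmap_vmas(...);    /* snapshots prev->vm_end into vms->unmap_start */

        /* prev is then expanded over the range being replaced, so
         * prev->vm_end no longer marks the old boundary */

        vms_complete_munmap_vmas(...);  /* free_pgtables() uses the snapshotted
                                         * vms->unmap_start/unmap_end instead of
                                         * re-reading prev->vm_end/next->vm_start */

Gathering the addresses up front means the complete side only ever sees
the boundaries as they existed when the vmas were isolated.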

>
> >
> > > Add a new helper vms_complete_pte_clear(), which is needed later and
> > will avoid growing the argument list to unmap_region() beyond the 9 it
> > already has.
>
> My word.
>
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> > ---
> > mm/internal.h | 2 ++
> > mm/mmap.c | 34 +++++++++++++++++++++++++++-------
> > 2 files changed, 29 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 8cbbbe7d40f3..4c9f06669cc4 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1493,6 +1493,8 @@ struct vma_munmap_struct {
> > struct list_head *uf; /* Userfaultfd list_head */
> > unsigned long start; /* Aligned start addr */
> > unsigned long end; /* Aligned end addr */
> > + unsigned long unmap_start;
> > + unsigned long unmap_end;
> > int vma_count; /* Number of vmas that will be removed */
> > unsigned long nr_pages; /* Number of pages being removed */
> > unsigned long locked_vm; /* Number of locked pages */
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index ecf55d32e804..45443a53be76 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -525,6 +525,8 @@ static inline void init_vma_munmap(struct vma_munmap_struct *vms,
> > vms->vma_count = 0;
> > vms->nr_pages = vms->locked_vm = vms->nr_accounted = 0;
> > vms->exec_vm = vms->stack_vm = vms->data_vm = 0;
> > + vms->unmap_start = FIRST_USER_ADDRESS;
> > + vms->unmap_end = USER_PGTABLES_CEILING;
> > }
> >
> > /*
> > @@ -2610,6 +2612,26 @@ static inline void abort_munmap_vmas(struct ma_state *mas_detach)
> > __mt_destroy(mas_detach->tree);
> > }
> >
> > +
> > +static void vms_complete_pte_clear(struct vma_munmap_struct *vms,
> > + struct ma_state *mas_detach, bool mm_wr_locked)
> > +{
> > + struct mmu_gather tlb;
> > +
> > + /*
> > + * We can free page tables without write-locking mmap_lock because VMAs
> > + * were isolated before we downgraded mmap_lock.
> > + */
> > + mas_set(mas_detach, 1);
> > + lru_add_drain();
> > + tlb_gather_mmu(&tlb, vms->mm);
> > + update_hiwater_rss(vms->mm);
> > + unmap_vmas(&tlb, mas_detach, vms->vma, vms->start, vms->end, vms->vma_count, mm_wr_locked);
> > + mas_set(mas_detach, 1);
>
> I know it's necessary as unmap_vmas() will adjust mas_detach, but it kind
> of aesthetically sucks to set it to 1, do some stuff, then set it to 1
> again. But this is not a big deal :>)
>
> > + free_pgtables(&tlb, mas_detach, vms->vma, vms->unmap_start, vms->unmap_end, mm_wr_locked);
>
> Yeah this bit definitely needs a comment I think, this is very confusing
> indeed. Under what circumstances will these differ from [vms->start,
> vms->end), etc.?
>
> I'm guessing it's to do with !vms->prev and !vms->next needing to be set to
> [FIRST_USER_ADDRESS, USER_PGTABLES_CEILING)?

Yes, exactly. Since we are storing the range to unmap anyway, we can
just set it to the correct value during the vma gather stage.
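
Pulling the relevant hunks together, the values end up as:

        /* init_vma_munmap(): the defaults when there is no prev/next */
        vms->unmap_start = FIRST_USER_ADDRESS;
        vms->unmap_end = USER_PGTABLES_CEILING;

        /* vms_gather_munmap_vmas(): tighten to the neighbouring vmas */
        if (vms->prev)
                vms->unmap_start = vms->prev->vm_end;
        if (vms->next)
                vms->unmap_end = vms->next->vm_start;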

>
> > + tlb_finish_mmu(&tlb);
> > +}
> > +
> > /*
> > * vms_complete_munmap_vmas() - Finish the munmap() operation
> > * @vms: The vma munmap struct
> > @@ -2631,13 +2653,7 @@ static void vms_complete_munmap_vmas(struct vma_munmap_struct *vms,
> > if (vms->unlock)
> > mmap_write_downgrade(mm);
> >
> > - /*
> > - * We can free page tables without write-locking mmap_lock because VMAs
> > - * were isolated before we downgraded mmap_lock.
> > - */
> > - mas_set(mas_detach, 1);
> > - unmap_region(mm, mas_detach, vms->vma, vms->prev, vms->next,
> > - vms->start, vms->end, vms->vma_count, !vms->unlock);
> > + vms_complete_pte_clear(vms, mas_detach, !vms->unlock);
> > /* Update high watermark before we lower total_vm */
> > update_hiwater_vm(mm);
> > /* Stat accounting */
> > @@ -2699,6 +2715,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > goto start_split_failed;
> > }
> > vms->prev = vma_prev(vms->vmi);
> > + if (vms->prev)
> > + vms->unmap_start = vms->prev->vm_end;
> >
> > /*
> > * Detach a range of VMAs from the mm. Using next as a temp variable as
> > @@ -2757,6 +2775,8 @@ static int vms_gather_munmap_vmas(struct vma_munmap_struct *vms,
> > }
> >
> > vms->next = vma_next(vms->vmi);
> > + if (vms->next)
> > + vms->unmap_end = vms->next->vm_start;
> >
> > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > /* Make sure no VMAs are about to be lost. */
> > --
> > 2.43.0
> >
>
> Other than wanting some extra comments, this looks fine and I know how
> hard-won the unmap range bit of this change was so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>