Re: [PATCH 3/7] mm/mremap: introduce and use vma_remap_struct threaded state
From: Liam R. Howlett
Date: Wed Mar 05 2025 - 14:56:37 EST
* Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [250305 14:43]:
> On Wed, Mar 05, 2025 at 01:52:10PM -0500, Liam R. Howlett wrote:
> > * Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [250303 06:08]:
> > > A number of mremap() calls both pass around and modify a large number of
> > > parameters, making the code less readable and often repeatedly having to
> > > determine things such as VMA, size delta, and more.
> > >
> > > Avoid this by using the common pattern of passing a state object through
> > > the operation, updating it as we go. We introduce the vma_remap_struct or
> > > 'VRM' for this purpose.
> > >
> > > This also gives us the ability to accumulate further state through the
> > > operation that would otherwise require awkward and error-prone pointer
> > > passing.
> > >
> > > We can also now trivially define helper functions that operate on a VRM
> > > object.
> > >
> > > This pattern has proven itself to be very powerful when implemented for VMA
> > > merge, VMA unmapping and memory mapping operations, so it is battle-tested
> > > and functional.
> > >
> > > We both introduce the data structure and use it, introducing helper
> > > functions as needed to make things readable. We move some state such as
> > > the mmap lock and mlock() status to the VRM, introduce a means of
> > > classifying the type of mremap() operation, and de-duplicate the
> > > get_unmapped_area() lookup.
> > >
> > > We also neatly thread userfaultfd state throughout the operation.
> > >
> > > Note that there is further refactoring to be done, chiefly adjusting
> > > move_vma() to accept a VRM parameter. We defer this, as there is
> > > prerequisite work required to be able to do so, which we will do in a
> > > subsequent patch.
> > >
> > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > > ---
> > > mm/mremap.c | 559 +++++++++++++++++++++++++++++++++-------------------
> > > 1 file changed, 354 insertions(+), 205 deletions(-)
> > >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index c4abda8dfc57..7f0c71aa9bb9 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -32,6 +32,43 @@
> > >
> > > #include "internal.h"
> > >
> > > +/* Classify the kind of remap operation being performed. */
> > > +enum mremap_operation {
> > > + MREMAP_NO_RESIZE, /* old_len == new_len, if not moved, do nothing. */
> > > + MREMAP_SHRINK, /* old_len > new_len. */
> > > + MREMAP_EXPAND, /* old_len < new_len. */
> >
> > Can we fix the spacing so the comments line up, please?
>
> I've had reviews go the other way before :) But sure, I mean I actually
> agree. Perhaps I was following stuff I already saw in the kernel on this,
> but yeah let's align this!
>
> >
> > It might be worth having a MREMAP_INVALID here and init to that, then a
> > VM_BUG_ON(), but maybe I'm just paranoid.
>
> Sure can do.
>
> >
> > > +};
> > > +
> > > +/*
> > > + * Describes a VMA mremap() operation and is threaded throughout it.
> > > + *
> > > + * Any of the fields may be mutated by the operation, however these values will
> > > + * always accurately reflect the remap (for instance, we may adjust lengths and
> > > + * delta to account for hugetlb alignment).
> > > + */
> > > +struct vma_remap_struct {
> > > + /* User-provided state. */
> > > + unsigned long addr; /* User-specified address from which we remap. */
> > > + unsigned long old_len; /* Length of range being remapped. */
> > > + unsigned long new_len; /* Desired new length of mapping. */
> > > + unsigned long flags; /* user-specified MREMAP_* flags. */
> > > + unsigned long new_addr; /* Optionally, desired new address. */
> >
> > Same comment about the comment spacing here. Might be better to have
> > user_flags? Since we have the passed in flags, the map flags and the
> > vma flags.
>
> Sure can align.
>
> We don't pass in map flags? We set map flags to pass to get_unmapped_area()
> in one place. Or rather, in the final version, it's in one place.
>
> We don't set map_flags in vma_remap_struct either. In fact here we only
> have one flags field, which is commented as such here.
>
> I'm not sure about user_flags, I feel like that's confusing; would
> mremap_flags be ok? But then that feels somewhat redundant in a struct
> literally all about an mremap?
>
> It's fine, this isn't a big deal. mremap_flags?
Good point. I have no strong opinion on this either, feel free to keep
it as flags.
>
> >
> > > +
> > > + /* uffd state. */
> > > + struct vm_userfaultfd_ctx *uf;
> > > + struct list_head *uf_unmap_early;
> > > + struct list_head *uf_unmap;
> >
> > ... sigh, yeah.
>
> Well, better put here, right?
>
> >
> > > +
> > > + /* VMA state, determined in do_mremap(). */
> > > + struct vm_area_struct *vma;
> > > +
> > > + /* Internal state, determined in do_mremap(). */
> > > + unsigned long delta; /* Absolute delta of old_len, new_len. */
> > > + bool locked; /* Was the VMA mlock()'d (has the VM_LOCKED flag set). */
> >
> > bool mlocked ?
>
> Sure will rename.
Thanks.
>
> >
> > > + enum mremap_operation remap_type; /* expand, shrink, etc. */
> > > + bool mmap_locked; /* Is current->mm currently write-locked? */
> > > +};
> > > +
> > > static pud_t *get_old_pud(struct mm_struct *mm, unsigned long addr)
> > > {
> > > pgd_t *pgd;
> > > @@ -693,6 +730,97 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > > return len + old_addr - old_end; /* how much done */
> > > }
> > >
> > > +/* Set vrm->delta to the difference in VMA size specified by user. */
> > > +static void vrm_set_delta(struct vma_remap_struct *vrm)
> > > +{
> > > + vrm->delta = abs_diff(vrm->old_len, vrm->new_len);
> > > +}
> > > +
> > > +/* Determine what kind of remap this is - shrink, expand or no resize at all. */
> > > +static enum mremap_operation vrm_remap_type(struct vma_remap_struct *vrm)
> > > +{
> > > + if (vrm->delta == 0)
> > > + return MREMAP_NO_RESIZE;
> > > +
> > > + if (vrm->old_len > vrm->new_len)
> > > + return MREMAP_SHRINK;
> > > +
> > > + return MREMAP_EXPAND;
> > > +}
> > > +
> > > +/* Set the vrm->remap_type; assumes state is sufficiently set up for this. */
> > > +static void vrm_set_remap_type(struct vma_remap_struct *vrm)
> > > +{
> > > + vrm->remap_type = vrm_remap_type(vrm);
> >
> > The vrm_remap_type() function is only used once, maybe we don't need
> > both set and get?
>
> I sort of felt it was neater to have a separate function but you're right
> this is a bit silly, will inline!
Yeah, we can move it externally if we need it. Thanks.
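FWIW the folded version I had in mind is just the branches moved into the
setter, something like this (untested userspace sketch with a mocked-up
struct, not the kernel types; I've also thrown in the MREMAP_INVALID zero
value mentioned above):

```c
#include <assert.h>

/* Userspace mock of the relevant vma_remap_struct fields - hypothetical. */
enum mremap_operation {
	MREMAP_INVALID,		/* Zero-init catches a missed classification. */
	MREMAP_NO_RESIZE,	/* old_len == new_len. */
	MREMAP_SHRINK,		/* old_len > new_len. */
	MREMAP_EXPAND,		/* old_len < new_len. */
};

struct vma_remap_struct {
	unsigned long old_len;
	unsigned long new_len;
	unsigned long delta;
	enum mremap_operation remap_type;
};

/* vrm_remap_type() folded directly into the setter. */
static void vrm_set_remap_type(struct vma_remap_struct *vrm)
{
	if (vrm->delta == 0)
		vrm->remap_type = MREMAP_NO_RESIZE;
	else if (vrm->old_len > vrm->new_len)
		vrm->remap_type = MREMAP_SHRINK;
	else
		vrm->remap_type = MREMAP_EXPAND;
}
```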
>
> >
> > > +}
> > > +
> > > +/*
> > > + * When moving a VMA to vrm->new_addr, does this result in the new and old VMAs
> > > + * overlapping?
> > > + */
> > > +static bool vrm_overlaps(struct vma_remap_struct *vrm)
> > > +{
> > > + unsigned long start_old = vrm->addr;
> > > + unsigned long start_new = vrm->new_addr;
> > > + unsigned long end_old = vrm->addr + vrm->old_len;
> > > + unsigned long end_new = vrm->new_addr + vrm->new_len;
> > > +
> > > + /*
> > > + * start_old end_old
> > > + * |-----------|
> > > + * | |
> > > + * |-----------|
> > > + * |-------------|
> > > + * | |
> > > + * |-------------|
> > > + * start_new end_new
> > > + */
> > > + if (end_old > start_new && end_new > start_old)
> > > + return true;
> > > +
> > > + return false;
> > > +}
> > > +
> > > +/* Do the mremap() flags require that the new_addr parameter be specified? */
> > > +static bool vrm_implies_new_addr(struct vma_remap_struct *vrm)
> > > +{
> > > + return vrm->flags & (MREMAP_FIXED | MREMAP_DONTUNMAP);
> > > +}
> >
> > These five might benefit from being inlined (although I hope our
> > compiler is good enough for this).
>
> Hm, I thought there was no difference when placed in a compilation unit
> like this?
I think it's up to the compiler, and I'm not sure it changes unless we
force it anyway. Feel free to leave it.
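(As an aside, these little predicates are at least trivial to sanity-check
outside the kernel. The overlap one reduces to the usual half-open interval
test; userspace sketch, same arithmetic as vrm_overlaps() but with the
ranges passed in directly rather than via the struct:)

```c
#include <assert.h>
#include <stdbool.h>

/* Same check as vrm_overlaps(): do [start_old, start_old + len_old) and
 * [start_new, start_new + len_new) intersect? Half-open intervals, so
 * merely touching ranges do NOT overlap. */
static bool ranges_overlap(unsigned long start_old, unsigned long len_old,
			   unsigned long start_new, unsigned long len_new)
{
	unsigned long end_old = start_old + len_old;
	unsigned long end_new = start_new + len_new;

	return end_old > start_new && end_new > start_old;
}
```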
>
> >
> > > +
> > > +/*
> > > + * Find an unmapped area for the requested vrm->new_addr.
> > > + *
> > > + * If MREMAP_FIXED then this is equivalent to a MAP_FIXED mmap() call. If only
> > > + * MREMAP_DONTUNMAP is set, then this is equivalent to providing a hint to
> > > + * mmap(), otherwise this is equivalent to mmap() specifying a NULL address.
> > > + *
> > > + * Returns 0 on success (with vrm->new_addr updated), or an error code upon
> > > + * failure.
> > > + */
> > > +static unsigned long vrm_set_new_addr(struct vma_remap_struct *vrm)
> > > +{
> > > + struct vm_area_struct *vma = vrm->vma;
> > > + unsigned long map_flags = 0;
> > > + /* Page Offset _into_ the VMA. */
> > > + pgoff_t internal_pgoff = (vrm->addr - vma->vm_start) >> PAGE_SHIFT;
> > > + pgoff_t pgoff = vma->vm_pgoff + internal_pgoff;
> > > + unsigned long new_addr = vrm_implies_new_addr(vrm) ? vrm->new_addr : 0;
> > > + unsigned long res;
> > > +
> > > + if (vrm->flags & MREMAP_FIXED)
> > > + map_flags |= MAP_FIXED;
> > > + if (vma->vm_flags & VM_MAYSHARE)
> > > + map_flags |= MAP_SHARED;
> > > +
> > > + res = get_unmapped_area(vma->vm_file, new_addr, vrm->new_len, pgoff,
> > > + map_flags);
> > > + if (IS_ERR_VALUE(res))
> > > + return res;
> > > +
> > > + vrm->new_addr = res;
> > > + return 0;
> > > +}
> > > +
> > > static unsigned long move_vma(struct vm_area_struct *vma,
> > > unsigned long old_addr, unsigned long old_len,
> > > unsigned long new_len, unsigned long new_addr,
> > > @@ -860,18 +988,15 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> > > * resize_is_valid() - Ensure the vma can be resized to the new length at the given
> > > * address.
> > > *
> > > - * @vma: The vma to resize
> > > - * @addr: The old address
> > > - * @old_len: The current size
> > > - * @new_len: The desired size
> > > - * @flags: The vma flags
> > > - *
> > > * Return 0 on success, error otherwise.
> > > */
> > > -static int resize_is_valid(struct vm_area_struct *vma, unsigned long addr,
> > > - unsigned long old_len, unsigned long new_len, unsigned long flags)
> > > +static int resize_is_valid(struct vma_remap_struct *vrm)
> > > {
> > > struct mm_struct *mm = current->mm;
> > > + struct vm_area_struct *vma = vrm->vma;
> > > + unsigned long addr = vrm->addr;
> > > + unsigned long old_len = vrm->old_len;
> > > + unsigned long new_len = vrm->new_len;
> > > unsigned long pgoff;
> > >
> > > /*
> > > @@ -883,11 +1008,12 @@ static int resize_is_valid(struct vm_area_struct *vma, unsigned long addr,
> > > * behavior. As a result, fail such attempts.
> > > */
> > > if (!old_len && !(vma->vm_flags & (VM_SHARED | VM_MAYSHARE))) {
> > > - pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n", current->comm, current->pid);
> > > + pr_warn_once("%s (%d): attempted to duplicate a private mapping with mremap. This is not supported.\n",
> > > + current->comm, current->pid);
> > > return -EINVAL;
> > > }
> > >
> > > - if ((flags & MREMAP_DONTUNMAP) &&
> > > + if ((vrm->flags & MREMAP_DONTUNMAP) &&
> > > (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> > > return -EINVAL;
> > >
> > > @@ -907,99 +1033,114 @@ static int resize_is_valid(struct vm_area_struct *vma, unsigned long addr,
> > > if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
> > > return -EFAULT;
> > >
> > > - if (!mlock_future_ok(mm, vma->vm_flags, new_len - old_len))
> > > + if (!mlock_future_ok(mm, vma->vm_flags, vrm->delta))
> > > return -EAGAIN;
> > >
> > > - if (!may_expand_vm(mm, vma->vm_flags,
> > > - (new_len - old_len) >> PAGE_SHIFT))
> > > + if (!may_expand_vm(mm, vma->vm_flags, vrm->delta >> PAGE_SHIFT))
> > > return -ENOMEM;
> > >
> > > return 0;
> > > }
> > >
> > > /*
> > > - * mremap_to() - remap a vma to a new location
> > > - * @addr: The old address
> > > - * @old_len: The old size
> > > - * @new_addr: The target address
> > > - * @new_len: The new size
> > > - * @locked: If the returned vma is locked (VM_LOCKED)
> > > - * @flags: the mremap flags
> > > - * @uf: The mremap userfaultfd context
> > > - * @uf_unmap_early: The userfaultfd unmap early context
> > > - * @uf_unmap: The userfaultfd unmap context
> > > + * The user has requested that the VMA be shrunk (i.e., old_len > new_len), so
> > > + * execute this, optionally dropping the mmap lock when we do so.
> > > *
> > > + * In both cases this invalidates the VMA; however, if we don't drop the lock,
> > > + * we load the correct VMA into vrm->vma afterwards.
> > > + */
> > > +static unsigned long shrink_vma(struct vma_remap_struct *vrm,
> > > + bool drop_lock)
> > > +{
> > > + struct mm_struct *mm = current->mm;
> > > + unsigned long unmap_start = vrm->addr + vrm->new_len;
> > > + unsigned long unmap_bytes = vrm->delta;
> > > + unsigned long res;
> > > + VMA_ITERATOR(vmi, mm, unmap_start);
> > > +
> > > + VM_BUG_ON(vrm->remap_type != MREMAP_SHRINK);
> > > +
> > > + res = do_vmi_munmap(&vmi, mm, unmap_start, unmap_bytes,
> > > + vrm->uf_unmap, drop_lock);
> > > + vrm->vma = NULL; /* Invalidated. */
> > > + if (res)
> > > + return res;
> > > +
> > > + /*
> > > + * If we've not dropped the lock, then we should reload the VMA to
> > > + * replace the invalidated VMA with the one that may have now been
> > > + * split.
> > > + */
> > > + if (drop_lock)
> > > + vrm->mmap_locked = false;
> > > + else
> > > + vrm->vma = vma_lookup(mm, vrm->addr);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * mremap_to() - remap a vma to a new location.
> > > * Returns: The new address of the vma or an error.
> > > */
> > > -static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
> > > - unsigned long new_addr, unsigned long new_len, bool *locked,
> > > - unsigned long flags, struct vm_userfaultfd_ctx *uf,
> > > - struct list_head *uf_unmap_early,
> > > - struct list_head *uf_unmap)
> > > +static unsigned long mremap_to(struct vma_remap_struct *vrm)
> > > {
> > > struct mm_struct *mm = current->mm;
> > > - struct vm_area_struct *vma;
> > > - unsigned long ret;
> > > - unsigned long map_flags = 0;
> > > + unsigned long err;
> > >
> > > /* Is the new length or address silly? */
> > > - if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
> > > + if (vrm->new_len > TASK_SIZE ||
> > > + vrm->new_addr > TASK_SIZE - vrm->new_len)
> > > return -EINVAL;
> > >
> > > - /* Ensure the old/new locations do not overlap. */
> > > - if (addr + old_len > new_addr && new_addr + new_len > addr)
> > > + if (vrm_overlaps(vrm))
> > > return -EINVAL;
> > >
> > > - if (flags & MREMAP_FIXED) {
> > > + if (vrm->flags & MREMAP_FIXED) {
> > > /*
> > > * In mremap_to().
> > > * VMA is moved to dst address, and munmap dst first.
> > > * do_munmap will check if dst is sealed.
> > > */
> > > - ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
> > > - if (ret)
> > > - return ret;
> > > - }
> > > + err = do_munmap(mm, vrm->new_addr, vrm->new_len,
> > > + vrm->uf_unmap_early);
> > > + vrm->vma = NULL; /* Invalidated. */
> > > + if (err)
> > > + return err;
> > >
> > > - if (old_len > new_len) {
> > > - ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
> > > - if (ret)
> > > - return ret;
> > > - old_len = new_len;
> > > + /*
> > > + * If we remap a portion of a VMA elsewhere in the same VMA,
> > > + * this can invalidate the old VMA and iterator. Reset.
> > > + */
> > > + vrm->vma = vma_lookup(mm, vrm->addr);
> >
> > You say it invalidates the iterator, but this doesn't change an
> > iterator?
>
> Yeah there used to be one, didn't update comment, will fix.
Thanks.
>
> >
> > > }
> > >
> > > - vma = vma_lookup(mm, addr);
> > > - if (!vma)
> > > - return -EFAULT;
> > > + if (vrm->remap_type == MREMAP_SHRINK) {
> > > + err = shrink_vma(vrm, /* drop_lock= */false);
> >
> > It is not immediately clear if we could have a MREMAP_FIXED also
> > MREMAP_SHRINK. In that case, we would try to do_munmap() twice. This
> > shouldn't be an issue as do_vmi_munmap() would catch it, but I am not
> > sure if you noticed this.
>
> I think this case is pretty subtle actually. The logic already looked like
> this, only more hidden.
>
> So we:
>
> 1. Unmap the target area at the new, shorter, length if MREMAP_FIXED.
> 2. Then remove the 'extra' portion in the source that we are shrinking 'off'.
>
> E.g.:
>
>
> |--.------.---| |--------.---.---|
> | . . | -> | . . |
> |--.------.---| |--------.---.---|
>
> Implies:
>
> 1. Remove the delta from the original:
>
> |--.---| |---| |--------.---.---|
> | . | | | -> | . . |
> |--.---| |---| |--------.---.---|
>
> 2. Unmap target area:
>
> |--.---| |---| |--------| |---|
> | . | | | -> | | | |
> |--.---| |---| |--------| |---|
Nice pictures.
>
> The reason we do this is that MREMAP_DONTUNMAP might be set, and we
> start by doing a _copy_ of the VMA, then move page tables, and then, if
> MREMAP_DONTUNMAP is _not_ set, we finally unmap the original, shrunk VMA.
>
> I mean this is all very very insane and maybe we can just like... not do
> this haha.
>
> But a future thing, perhaps.
Right, thanks.
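To spell out the arithmetic for my own benefit: the fixed+shrink case ends
up doing two distinct unmaps, and the ranges never collide because the
destination was already vetted against overlap. Userspace sketch
(hypothetical numbers and helper names, not kernel code):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/*
 * The MREMAP_FIXED + shrink case performs two distinct unmaps:
 *  1. do_munmap() of the destination range at the (shorter) new length.
 *  2. shrink_vma(): unmap only the delta tail of the source.
 * These helpers just compute the ranges involved.
 */

/* Range unmapped at the destination by the MREMAP_FIXED step. */
static void fixed_unmap_range(unsigned long new_addr, unsigned long new_len,
			      unsigned long *start, unsigned long *bytes)
{
	*start = new_addr;
	*bytes = new_len;
}

/* Range unmapped from the source by the shrink step (vrm->delta bytes). */
static void shrink_unmap_range(unsigned long addr, unsigned long old_len,
			       unsigned long new_len,
			       unsigned long *start, unsigned long *bytes)
{
	*start = addr + new_len;
	*bytes = old_len - new_len;
}
```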
>
>
> >
> > > + if (err)
> > > + return err;
> > >
> > > - ret = resize_is_valid(vma, addr, old_len, new_len, flags);
> > > - if (ret)
> > > - return ret;
> > > + /* Set up for the move now shrink has been executed. */
> > > + vrm->old_len = vrm->new_len;
> > > + }
> > > +
> > > + err = resize_is_valid(vrm);
> > > + if (err)
> > > + return err;
> > >
> > > /* MREMAP_DONTUNMAP expands by old_len since old_len == new_len */
> > > - if (flags & MREMAP_DONTUNMAP &&
> > > - !may_expand_vm(mm, vma->vm_flags, old_len >> PAGE_SHIFT)) {
> > > + if (vrm->flags & MREMAP_DONTUNMAP &&
> > > + !may_expand_vm(mm, vrm->vma->vm_flags, vrm->old_len >> PAGE_SHIFT)) {
> > > return -ENOMEM;
> >
> > nit: whitespace here is a bit odd to read.
>
> Will find a way to fix, I agree this is a bit horrid.
>
> >
> > > }
> > >
> > > - if (flags & MREMAP_FIXED)
> > > - map_flags |= MAP_FIXED;
> > > -
> > > - if (vma->vm_flags & VM_MAYSHARE)
> > > - map_flags |= MAP_SHARED;
> > > -
> > > - ret = get_unmapped_area(vma->vm_file, new_addr, new_len, vma->vm_pgoff +
> > > - ((addr - vma->vm_start) >> PAGE_SHIFT),
> > > - map_flags);
> > > - if (IS_ERR_VALUE(ret))
> > > - return ret;
> > > -
> > > - /* We got a new mapping */
> > > - if (!(flags & MREMAP_FIXED))
> > > - new_addr = ret;
> > > + err = vrm_set_new_addr(vrm);
> > > + if (err)
> > > + return err;
> > >
> > > - return move_vma(vma, addr, old_len, new_len, new_addr, locked, flags,
> > > - uf, uf_unmap);
> > > + return move_vma(vrm->vma, vrm->addr, vrm->old_len, vrm->new_len,
> > > + vrm->new_addr, &vrm->locked, vrm->flags,
> > > + vrm->uf, vrm->uf_unmap);
> >
> > I see where this is going..
>
> ;)
>
> >
> > > }
> > >
> > > static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
> > > @@ -1016,22 +1157,33 @@ static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
> > > return 1;
> > > }
> > >
> > > -/* Do the mremap() flags require that the new_addr parameter be specified? */
> > > -static bool implies_new_addr(unsigned long flags)
> > > +/* Determine whether we are actually able to execute an in-place expansion. */
> > > +static bool vrm_can_expand_in_place(struct vma_remap_struct *vrm)
> > > {
> > > - return flags & (MREMAP_FIXED | MREMAP_DONTUNMAP);
> > > + /* Number of bytes from vrm->addr to end of VMA. */
> > > + unsigned long suffix_bytes = vrm->vma->vm_end - vrm->addr;
> > > +
> > > + /* If end of range aligns to end of VMA, we can just expand in-place. */
> > > + if (suffix_bytes != vrm->old_len)
> > > + return false;
> > > +
> > > + /* Check whether this is feasible. */
> > > + if (!vma_expandable(vrm->vma, vrm->delta))
> > > + return false;
> > > +
> > > + return true;
> > > }
> > >
> > > /*
> > > * Are the parameters passed to mremap() valid? If so return 0, otherwise return
> > > * error.
> > > */
> > > -static unsigned long check_mremap_params(unsigned long addr,
> > > - unsigned long flags,
> > > - unsigned long old_len,
> > > - unsigned long new_len,
> > > - unsigned long new_addr)
> > > +static unsigned long check_mremap_params(struct vma_remap_struct *vrm)
> > > +
> > > {
> > > + unsigned long addr = vrm->addr;
> > > + unsigned long flags = vrm->flags;
> > > +
> > > /* Ensure no unexpected flag values. */
> > > if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE | MREMAP_DONTUNMAP))
> > > return -EINVAL;
> > > @@ -1045,15 +1197,15 @@ static unsigned long check_mremap_params(unsigned long addr,
> > > * for DOS-emu "duplicate shm area" thing. But
> > > * a zero new-len is nonsensical.
> > > */
> > > - if (!PAGE_ALIGN(new_len))
> > > + if (!PAGE_ALIGN(vrm->new_len))
> > > return -EINVAL;
> > >
> > > /* Remainder of checks are for cases with specific new_addr. */
> > > - if (!implies_new_addr(flags))
> > > + if (!vrm_implies_new_addr(vrm))
> > > return 0;
> > >
> > > /* The new address must be page-aligned. */
> > > - if (offset_in_page(new_addr))
> > > + if (offset_in_page(vrm->new_addr))
> > > return -EINVAL;
> > >
> > > /* A fixed address implies a move. */
> > > @@ -1061,7 +1213,7 @@ static unsigned long check_mremap_params(unsigned long addr,
> > > return -EINVAL;
> > >
> > > /* MREMAP_DONTUNMAP does not allow resizing in the process. */
> > > - if (flags & MREMAP_DONTUNMAP && old_len != new_len)
> > > + if (flags & MREMAP_DONTUNMAP && vrm->old_len != vrm->new_len)
> > > return -EINVAL;
> > >
> > > /*
> > > @@ -1090,11 +1242,11 @@ static unsigned long check_mremap_params(unsigned long addr,
> > > * If we discover the VMA is locked, update mm_struct statistics accordingly and
> > > * indicate so to the caller.
> > > */
> > > -static unsigned long expand_vma_inplace(struct vm_area_struct *vma,
> > > - unsigned long delta, bool *locked)
> > > +static unsigned long expand_vma_in_place(struct vma_remap_struct *vrm)
> > > {
> > > struct mm_struct *mm = current->mm;
> > > - long pages = delta >> PAGE_SHIFT;
> > > + long pages = vrm->delta >> PAGE_SHIFT;
> > > + struct vm_area_struct *vma = vrm->vma;
> > > VMA_ITERATOR(vmi, mm, vma->vm_end);
> > > long charged = 0;
> > >
> > > @@ -1114,7 +1266,7 @@ static unsigned long expand_vma_inplace(struct vm_area_struct *vma,
> > > * adjacent to the expanded vma and otherwise
> > > * compatible.
> > > */
> > > - vma = vma_merge_extend(&vmi, vma, delta);
> > > + vma = vrm->vma = vma_merge_extend(&vmi, vma, vrm->delta);
> > > if (!vma) {
> > > vm_unacct_memory(charged);
> > > return -ENOMEM;
> > > @@ -1123,42 +1275,34 @@ static unsigned long expand_vma_inplace(struct vm_area_struct *vma,
> > > vm_stat_account(mm, vma->vm_flags, pages);
> > > if (vma->vm_flags & VM_LOCKED) {
> > > mm->locked_vm += pages;
> > > - *locked = true;
> > > + vrm->locked = true;
> > > }
> > >
> > > return 0;
> > > }
> > >
> > > -static bool align_hugetlb(struct vm_area_struct *vma,
> > > - unsigned long addr,
> > > - unsigned long new_addr,
> > > - unsigned long *old_len_ptr,
> > > - unsigned long *new_len_ptr,
> > > - unsigned long *delta_ptr)
> > > +static bool align_hugetlb(struct vma_remap_struct *vrm)
> > > {
> > > - unsigned long old_len = *old_len_ptr;
> > > - unsigned long new_len = *new_len_ptr;
> > > - struct hstate *h __maybe_unused = hstate_vma(vma);
> > > + struct hstate *h __maybe_unused = hstate_vma(vrm->vma);
> > >
> > > - old_len = ALIGN(old_len, huge_page_size(h));
> > > - new_len = ALIGN(new_len, huge_page_size(h));
> > > + vrm->old_len = ALIGN(vrm->old_len, huge_page_size(h));
> > > + vrm->new_len = ALIGN(vrm->new_len, huge_page_size(h));
> > >
> > > /* addrs must be huge page aligned */
> > > - if (addr & ~huge_page_mask(h))
> > > + if (vrm->addr & ~huge_page_mask(h))
> > > return false;
> > > - if (new_addr & ~huge_page_mask(h))
> > > + if (vrm->new_addr & ~huge_page_mask(h))
> > > return false;
> > >
> > > /*
> > > * Don't allow remap expansion, because the underlying hugetlb
> > > * reservation is not yet capable to handle split reservation.
> > > */
> > > - if (new_len > old_len)
> > > + if (vrm->new_len > vrm->old_len)
> > > return false;
> > >
> > > - *old_len_ptr = old_len;
> > > - *new_len_ptr = new_len;
> > > - *delta_ptr = abs_diff(old_len, new_len);
> > > + vrm_set_delta(vrm);
> > > +
> > > return true;
> > > }
> > >
> > > @@ -1169,19 +1313,16 @@ static bool align_hugetlb(struct vm_area_struct *vma,
> > > * Try to do so in-place, if this fails, then move the VMA to a new location to
> > > * action the change.
> > > */
> > > -static unsigned long expand_vma(struct vm_area_struct *vma,
> > > - unsigned long addr, unsigned long old_len,
> > > - unsigned long new_len, unsigned long flags,
> > > - bool *locked_ptr, unsigned long *new_addr_ptr,
> > > - struct vm_userfaultfd_ctx *uf_ptr,
> > > - struct list_head *uf_unmap_ptr)
> > > +static unsigned long expand_vma(struct vma_remap_struct *vrm)
> > > {
> > > unsigned long err;
> > > - unsigned long map_flags;
> > > - unsigned long new_addr; /* We ignore any user-supplied one. */
> > > - pgoff_t pgoff;
> > > + struct vm_area_struct *vma = vrm->vma;
> > > + unsigned long addr = vrm->addr;
> > > + unsigned long old_len = vrm->old_len;
> > > + unsigned long new_len = vrm->new_len;
> > > + unsigned long flags = vrm->flags;
> > >
> > > - err = resize_is_valid(vma, addr, old_len, new_len, flags);
> > > + err = resize_is_valid(vrm);
> > > if (err)
> > > return err;
> > >
> > > @@ -1189,10 +1330,9 @@ static unsigned long expand_vma(struct vm_area_struct *vma,
> > > * [addr, old_len) spans precisely to the end of the VMA, so try to
> > > * expand it in-place.
> > > */
> > > - if (old_len == vma->vm_end - addr &&
> > > - vma_expandable(vma, new_len - old_len)) {
> > > - err = expand_vma_inplace(vma, new_len - old_len, locked_ptr);
> > > - if (IS_ERR_VALUE(err))
> > > + if (vrm_can_expand_in_place(vrm)) {
> > > + err = expand_vma_in_place(vrm);
> > > + if (err)
> > > return err;
> > >
> > > /*
> > > @@ -1200,8 +1340,8 @@ static unsigned long expand_vma(struct vm_area_struct *vma,
> > > * satisfy the expectation that mlock()'ing a VMA maintains all
> > > * of its pages in memory.
> > > */
> > > - if (*locked_ptr)
> > > - *new_addr_ptr = addr;
> > > + if (vrm->locked)
> > > + vrm->new_addr = addr;
> > >
> > > /* OK we're done! */
> > > return addr;
> > > @@ -1217,62 +1357,65 @@ static unsigned long expand_vma(struct vm_area_struct *vma,
> > > return -ENOMEM;
> > >
> > > /* Find a new location to move the VMA to. */
> > > - map_flags = (vma->vm_flags & VM_MAYSHARE) ? MAP_SHARED : 0;
> > > - pgoff = vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
> > > - new_addr = get_unmapped_area(vma->vm_file, 0, new_len, pgoff, map_flags);
> > > - if (IS_ERR_VALUE(new_addr))
> > > - return new_addr;
> > > - *new_addr_ptr = new_addr;
> > > + err = vrm_set_new_addr(vrm);
> > > + if (err)
> > > + return err;
> > >
> > > - return move_vma(vma, addr, old_len, new_len, new_addr,
> > > - locked_ptr, flags, uf_ptr, uf_unmap_ptr);
> > > + return move_vma(vma, addr, old_len, new_len, vrm->new_addr,
> > > + &vrm->locked, flags, vrm->uf, vrm->uf_unmap);
> > > }
> > >
> > > /*
> > > - * Expand (or shrink) an existing mapping, potentially moving it at the
> > > - * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
> > > - *
> > > - * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
> > > - * This option implies MREMAP_MAYMOVE.
> > > + * Attempt to resize the VMA in-place, if we cannot, then move the VMA to the
> > > + * first available address to perform the operation.
> > > */
> > > -SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> > > - unsigned long, new_len, unsigned long, flags,
> > > - unsigned long, new_addr)
> > > +static unsigned long mremap_at(struct vma_remap_struct *vrm)
> >
> > I hate this and mremap_to() names. I don't have a proposed better name
> > for either and maybe it's just my experience with mremap_to() that has
> > tainted the view of the name itself, but I find them not very
> > descriptive and abruptly ended. I guess move was already taken.
> >
> > I also have the added baggage of parsing "at" to potentially mean
> > "doing".
> >
> > mremap_inplace() seems equally annoying. This is the worst bike shed.
>
> Haha well yeah it's a bikeshed but yeah I agree, it's horrid I agree.
>
> I was using the _at() only in line with the _to().
>
> The 'in place' can kind of conflict with other references to in place which
> is used to differentiate between the case of us having to
> get_unmapped_area() and move vs. not having to.
>
> So I feel like we're stuck with this...
Yes, I agree.
>
> >
> > > +{
> > > + unsigned long res;
> > > +
> > > + switch (vrm->remap_type) {
> > > + case MREMAP_NO_RESIZE:
> > > + /* NO-OP CASE - resizing to the same size. */
> > > + return vrm->addr;
> > > + case MREMAP_SHRINK:
> > > + /*
> > > + * SHRINK CASE. Can always be done in-place.
> > > + *
> > > + * Simply unmap the shrunken portion of the VMA. This does all
> > > + * the needed commit accounting, and we indicate that the mmap
> > > + * lock should be dropped.
> > > + */
> > > + res = shrink_vma(vrm, /* drop_lock= */true);
> > > + if (res)
> > > + return res;
> > > +
> > > + return vrm->addr;
> > > + case MREMAP_EXPAND:
> > > + return expand_vma(vrm);
> > > + }
> > > +
> > > + BUG();
> > > +}
> > > +
> > > +static unsigned long do_mremap(struct vma_remap_struct *vrm)
> > > {
> > > struct mm_struct *mm = current->mm;
> > > struct vm_area_struct *vma;
> > > unsigned long ret;
> > > - unsigned long delta;
> > > - bool locked = false;
> > > - struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
> > > - LIST_HEAD(uf_unmap_early);
> > > - LIST_HEAD(uf_unmap);
> > >
> > > - /*
> > > - * There is a deliberate asymmetry here: we strip the pointer tag
> > > - * from the old address but leave the new address alone. This is
> > > - * for consistency with mmap(), where we prevent the creation of
> > > - * aliasing mappings in userspace by leaving the tag bits of the
> > > - * mapping address intact. A non-zero tag will cause the subsequent
> > > - * range checks to reject the address as invalid.
> > > - *
> > > - * See Documentation/arch/arm64/tagged-address-abi.rst for more
> > > - * information.
> > > - */
> > > - addr = untagged_addr(addr);
> > > -
> > > - ret = check_mremap_params(addr, flags, old_len, new_len, new_addr);
> > > + ret = check_mremap_params(vrm);
> > > if (ret)
> > > return ret;
> > >
> > > - old_len = PAGE_ALIGN(old_len);
> > > - new_len = PAGE_ALIGN(new_len);
> > > - delta = abs_diff(old_len, new_len);
> > > + vrm->old_len = PAGE_ALIGN(vrm->old_len);
> > > + vrm->new_len = PAGE_ALIGN(vrm->new_len);
> > > + vrm_set_delta(vrm);
> > >
> > > if (mmap_write_lock_killable(mm))
> > > return -EINTR;
> > > + vrm->mmap_locked = true;
> > >
> > > - vma = vma_lookup(mm, addr);
> > > + vma = vrm->vma = vma_lookup(mm, vrm->addr);
> > > if (!vma) {
> > > ret = -EFAULT;
> > > goto out;
> > > @@ -1285,62 +1428,68 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> > > }
> > >
> > > /* Align to hugetlb page size, if required. */
> > > - if (is_vm_hugetlb_page(vma) &&
> > > - !align_hugetlb(vma, addr, new_addr, &old_len, &new_len, &delta)) {
> > > + if (is_vm_hugetlb_page(vma) && !align_hugetlb(vrm)) {
> > > ret = -EINVAL;
> > > goto out;
> > > }
> > >
> > > - /* Are we RELOCATING the VMA to a SPECIFIC address? */
> > > - if (implies_new_addr(flags)) {
> > > - ret = mremap_to(addr, old_len, new_addr, new_len,
> > > - &locked, flags, &uf, &uf_unmap_early,
> > > - &uf_unmap);
> > > - goto out;
> > > - }
> > > + vrm_set_remap_type(vrm);
> > >
> > > - /*
> > > - * From here on in we are only RESIZING the VMA, attempting to do so
> > > - * in-place, moving the VMA if we cannot.
> > > - */
> > > + /* Actually execute mremap. */
> > > + ret = vrm_implies_new_addr(vrm) ? mremap_to(vrm) : mremap_at(vrm);
> > >
> > > - /* NO-OP CASE - resizing to the same size. */
> > > - if (new_len == old_len) {
> > > - ret = addr;
> > > - goto out;
> > > - }
> > > -
> > > - /* SHRINK CASE. Can always be done in-place. */
> > > - if (new_len < old_len) {
> > > - VMA_ITERATOR(vmi, mm, addr + new_len);
> > > +out:
> > > + if (vrm->mmap_locked) {
> > > + mmap_write_unlock(mm);
> > > + vrm->mmap_locked = false;
> > >
> > > - /*
> > > - * Simply unmap the shrunken portion of the VMA. This does all
> > > - * the needed commit accounting, unlocking the mmap lock.
> > > - */
> > > - ret = do_vmi_munmap(&vmi, mm, addr + new_len, delta,
> > > - &uf_unmap, true);
> > > - if (ret)
> > > - goto out;
> > > -
> > > - /* We succeeded, mmap lock released for us. */
> > > - ret = addr;
> > > - goto out_unlocked;
> > > + if (!offset_in_page(ret) && vrm->locked && vrm->new_len > vrm->old_len)
> > > + mm_populate(vrm->new_addr + vrm->old_len, vrm->delta);
> >
> > It isn't clear to me why we only populate if we are locked here.
> > Actually, I'm not sure why we keep holding the lock until here or why it
> > matters to drop it early. The main reason we want to drop the lock is
> > to reduce the mmap lock time held for the populate operation.
> >
> > So we can either drop the lock before we get here once the vma tree is
> > updated, or we can unconditionally unlock.
> >
> > I think we can simplify this further if we just keep the lock held until
> > we downgrade here and populate if necessary after it's dropped. That
> > is, shrink just does nothing with the lock and we just unlock it
> > regardless.
> >
> > I'm pretty sure that we would struggle to measure the performance
> > impact of holding the lock for the return path once the populate is
> > removed from the critical section.
>
> I'm in agreement that this is a horrid and weird mess, and I stared at this
> a long time myself.
>
> But I feel we should defer changing this until after the refactor
> series, as it is an existing eldritch horror, the changing of which
> would sit outside the 'obvious small fixups/safe moving around/etc.'
> scope I limited myself to here.
>
> Let's definitely come back and address this.
Ack, yeah. Shouldn't change locking from a known mess to an unknown
but-maybe-cleaner mess in a refactor.
>
> >
> > > }
> > >
> > > - /* EXPAND case. We try to do in-place, if we can't, then we move it. */
> > > - ret = expand_vma(vma, addr, old_len, new_len, flags, &locked, &new_addr,
> > > - &uf, &uf_unmap);
> > > + userfaultfd_unmap_complete(mm, vrm->uf_unmap_early);
> > > + mremap_userfaultfd_complete(vrm->uf, vrm->addr, ret, vrm->old_len);
> > > + userfaultfd_unmap_complete(mm, vrm->uf_unmap);
> > >
> > > -out:
> > > - if (offset_in_page(ret))
> > > - locked = false;
> > > - mmap_write_unlock(mm);
> > > - if (locked && new_len > old_len)
> > > - mm_populate(new_addr + old_len, delta);
> > > -out_unlocked:
> > > - userfaultfd_unmap_complete(mm, &uf_unmap_early);
> > > - mremap_userfaultfd_complete(&uf, addr, ret, old_len);
> > > - userfaultfd_unmap_complete(mm, &uf_unmap);
> > > return ret;
> > > }
> > > +
> > > +/*
> > > + * Expand (or shrink) an existing mapping, potentially moving it at the
> > > + * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
> > > + *
> > > + * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
> > > + * This option implies MREMAP_MAYMOVE.
> > > + */
> > > +SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
> > > + unsigned long, new_len, unsigned long, flags,
> > > + unsigned long, new_addr)
> > > +{
> > > + struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
> > > + LIST_HEAD(uf_unmap_early);
> > > + LIST_HEAD(uf_unmap);
> > > + /*
> > > + * There is a deliberate asymmetry here: we strip the pointer tag
> > > + * from the old address but leave the new address alone. This is
> > > + * for consistency with mmap(), where we prevent the creation of
> > > + * aliasing mappings in userspace by leaving the tag bits of the
> > > + * mapping address intact. A non-zero tag will cause the subsequent
> > > + * range checks to reject the address as invalid.
> > > + *
> > > + * See Documentation/arch/arm64/tagged-address-abi.rst for more
> > > + * information.
> > > + */
> > > + struct vma_remap_struct vrm = {
> > > + .addr = untagged_addr(addr),
> > > + .old_len = old_len,
> > > + .new_len = new_len,
> > > + .flags = flags,
> > > + .new_addr = new_addr,
> > > +
> > > + .uf = &uf,
> > > + .uf_unmap_early = &uf_unmap_early,
> > > + .uf_unmap = &uf_unmap,
> > > + };
> > > +
> > > + return do_mremap(&vrm);
> > > +}
> > > --
> > > 2.48.1
> > >