Re: [PATCH] maple_tree: Fix use of node for global range in mas_wr_spanning_store()

From: Hugh Dickins
Date: Fri Jul 08 2022 - 00:37:51 EST


On Wed, 6 Jul 2022, Liam Howlett wrote:
> * Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> [220705 22:55]:
> > On Wed, 6 Jul 2022 02:05:37 +0000 Liam Howlett <liam.howlett@xxxxxxxxxx> wrote:
> >
> > > When writing a range with a NULL which expands to 0 - ULONG_MAX, don't
> > > use a node to store this value. Instead, call mas_new_root() which will
> > > set the tree pointer to NULL and free all the nodes.
> > >
> > > Fix a comment for the allocations in mas_wr_spanning_store().
> > >
> > > Add mas_node_count_gfp() and use it to clean up mas_preallocate().
> > >
> > > Clean up mas_preallocate() and ensure the ma_state is safe on return.
> > >
> > > Update maple_tree.h to set alloc = NULL.
> >
> > Cool.
> >
> > How are we looking now? Any known issues still being worked on?
>
> Did you pick up "Subject: [PATCH] mm/mmap: Fix copy_vma() new_vma
> check"? I sent that yesterday as well.
>
> I think we are in good shape. There were two outstanding issues I had
> and this patch plus the copy_vma() patch fixes both.

I'm not so sure.

I haven't gone back to check, but I do think there was a very recent
lib/maple_tree.c change or set of changes which greatly improved it
(say an hour to reach a problem instead of a minute).

But errors I do see still (I've only had time for it this month, and
there have been three other non-maple-tree bugs in linux-next which
got in the way of my huge tmpfs kernel build swapping load testing).

I see one kind of problem on one machine, and another on another,
and have no idea why the difference, so cannot advise you on how
to reproduce either. I do have CONFIG_DEBUG_VM_MAPLE_TREE=y and
CONFIG_DEBUG_MAPLE_TREE=y on both; but might try removing those,
since they tell me nothing without a long education, and more
important for me to write this mail now than get into capturing
those numbers for you.

One machine does show such output, with BUG at mas_validate_gaps,
doing __vma_adjust from __split_vma from do_mas_align_munmap.
That's fairly typical of all reports from that machine, I think.

Other machine (laptop I'm writing from) has never shown any such MT
debug output, but hits the kernel BUG at include/linux/swapops.h:378!
pfn_swap_entry_to_page() BUG_ON(is_migration_entry() && !PageLocked().

I've often called that the best BUG in the kernel: it tells whether
rmap is holding together: migration entries were inserted on one
rmap walk, some may have been zapped by unmapping meanwhile, but
before midnight strikes and the page is unlocked, all the remaining
ones have to be found by the second rmap walk. (And if we removed the
BUG, then it would be mysterious rare segfaults and/or data leaks:
which I cannot call "stable").

Hitting that BUG suggests that the rmap locking is deficient somewhere
(might be something else, but that's a best guess), and I wouldn't be
surprised if that somewhere is __vma_adjust() - which is not quite
the same as it was before (I wonder if the changes were essential for
maple tree, or "simplifications" which seemed a good idea at the time).
If there's a way to minimally maple-ize how it was before, known working,
that would be a useful comparison to make.

Here's a patch I've applied, that makes no difference to the problems
I see, but fixes or highlights a little that worried me in the source.
Earlier, you had that anon_vma locking in vma_expand() under "else",
it's good to see that fixed, but I believe there's a case it misses;
and I don't think you can change the lock ordering just because it looks
nicer (see comments at head of mm/rmap.c for lock ordering, or mm/mremap.c
for an example of what order we take file versus anon rmap locks).

Hopefully maple tree stable is not far off, but seems not quite yet.

Hugh

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -528,7 +528,8 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
if (next->anon_vma && !vma->anon_vma) {
int error;

- vma->anon_vma = next->anon_vma;
+ anon_vma = next->anon_vma;
+ vma->anon_vma = anon_vma;
error = anon_vma_clone(vma, next);
if (error)
return error;
@@ -546,16 +547,19 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,

vma_adjust_trans_huge(vma, start, end, 0);

+ if (file) {
+ mapping = file->f_mapping;
+ root = &mapping->i_mmap;
+ uprobe_munmap(vma, vma->vm_start, vma->vm_end);
+ i_mmap_lock_write(mapping);
+ }
+
if (anon_vma) {
anon_vma_lock_write(anon_vma);
anon_vma_interval_tree_pre_update_vma(vma);
}

if (file) {
- mapping = file->f_mapping;
- root = &mapping->i_mmap;
- uprobe_munmap(vma, vma->vm_start, vma->vm_end);
- i_mmap_lock_write(mapping);
flush_dcache_mmap_lock(mapping);
vma_interval_tree_remove(vma, root);
}
@@ -3021,8 +3025,12 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
}
vma->vm_end = addr + len;
vma->vm_flags |= VM_SOFTDIRTY;
- if (mas_store_gfp(mas, vma, GFP_KERNEL))
- return -ENOMEM;
+ if (WARN_ON(mas_store_gfp(mas, vma, GFP_KERNEL))) {
+ /*
+ * Rewind not yet worked out: at the very least
+ * we must unlock if locked, so proceed to below.
+ */
+ }

if (vma->anon_vma) {
anon_vma_interval_tree_post_update_vma(vma);