Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()

From: Bert Karwatzki
Date: Fri Oct 04 2024 - 20:56:31 EST


On Friday, 2024-10-04 at 23:41 +0100, Lorenzo Stoakes wrote:
> On Fri, Oct 04, 2024 at 11:35:44AM +0200, Bert Karwatzki wrote:
> > Here's the log produced by this kernel:
> >
> > c9e7f76815d3 (HEAD -> maple_tree_debug_4) hack: set of info stuff v5
> > 7e3bb072761a mm: correct error handling in mmap_region()
> > 77df9e4bb222 (tag: next-20241001, origin/master, origin/HEAD, master) Add linux-next specific files for 20241001
> >
> > Again it took two attempts to trigger the bug.
> >
> > Bert Karwatzki
> >
>
> Sending an updated, cleaned-up version of the patch with a lot of
> explanation. This is functionally identical to the v3 fix I already sent,
> so you can try either that or this one to confirm it resolves your issue.
>
> If you are able to do so, I can submit this to upstream for a hotfix. If
> not, well then back to the drawing board and I'd be very shocked :)
>
> I have been able to reproduce the issue locally in our userland testing
> suite entirely consistently, and this patch resolves the issue and also
> continues to pass all maple tree unit tests.
>
> Again thank you so much for all your help - I hope you are able to find a
> spare moment to quickly give this one a try and confirm whether it does
> indeed address the problem you've reported.
>
> Thanks, Lorenzo
>
> ----8<----
> From 126d65bd9839cd3ec941007872b357e27fd56066 Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> Date: Fri, 4 Oct 2024 15:18:58 +0100
> Subject: [PATCH] maple_tree: correct tree corruption on spanning store
>
> Writing a data range into a maple tree may involve overwriting a number of
> existing entries that span across more than one node. Doing so invokes a
> 'spanning' store.
>
> Performing a spanning store across two leaf nodes in a maple tree in which
> entries are overwritten is achieved by first initialising a 'big' node,
> which will store the coalesced entries from the two nodes: the entries
> prior to the newly stored entry, the newly stored entry itself, and the
> subsequent entries.
>
> This 'big node' is then merged back into the tree and the tree is
> rebalanced, replacing the entries across the spanned nodes with those
> contained in the big node.
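>
> For illustration only (this snippet is not part of the patch, and the
> ranges and values are made up), a spanning store can be provoked through
> the public maple tree API by overwriting a range which straddles two leaf
> nodes:
>
> 	#include <linux/maple_tree.h>
> 	#include <linux/xarray.h>
>
> 	static DEFINE_MTREE(mt);
>
> 	static void spanning_store_example(void)
> 	{
> 		unsigned long i;
>
> 		/* Store enough adjacent ranges that the entries occupy
> 		 * more than one leaf node. */
> 		for (i = 0; i < 32; i++)
> 			mtree_store_range(&mt, i * 0x1000,
> 					  i * 0x1000 + 0xfff,
> 					  xa_mk_value(i), GFP_KERNEL);
>
> 		/* Overwrite a range spanning both leaf nodes; internally
> 		 * this takes the mas_wr_spanning_store() path. */
> 		mtree_store_range(&mt, 0x800, 0x1f7ff,
> 				  xa_mk_value(0xff), GFP_KERNEL);
> 	}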
>
> The operation is performed in mas_wr_spanning_store() which starts by
> establishing two maple tree state objects ('mas' objects) on the left of
> the range and on the right (l_mas and r_mas respectively).
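>
> For orientation, the two states are declared at the top of the function
> like so (the l_mas declaration is visible as context in the diff below):
>
> 	/* Left and Right side of spanning store */
> 	MA_STATE(l_mas, NULL, 0, 0);
> 	MA_STATE(r_mas, NULL, 0, 0);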
>
> l_mas traverses to the beginning of the range to be stored in order to copy
> the data BEFORE the requested store into the big node.
>
> We then insert our new entry immediately afterwards (both the left copy and
> the storing of the new entry are combined and performed by
> mas_store_b_node()).
>
> r_mas traverses to the populated slot immediately after, in order to copy
> the data AFTER the requested store into the big node.
>
> This copy of the right-hand node is performed by mas_mab_cp() as long as
> r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end.
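>
> In outline, the big node assembly described above therefore reduces to
> the following pre-fix code (visible in the diff below):
>
> 	/* Copy l_mas and store the value in b_node. */
> 	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
> 	/* Copy r_mas into b_node. */
> 	if (r_mas.offset <= r_mas.end)
> 		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
> 			   &b_node, b_node.b_end + 1);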
>
> We traverse r_mas to this position in mas_wr_node_walk() using a simple
> loop:
>
> 	while (offset < count && mas->index > wr_mas->pivots[offset])
> 		offset++;
>
> Note here that count is the (inclusive) offset of the last slot containing
> data in the node, as determined by ma_data_end().
>
> This means that, even when searching for mas->index, which will have been
> set to one plus the end of the target range in order to traverse to the
> next slot in mas_wr_spanning_store(), the iteration will terminate at the
> end of the node's data due to the offset < count check, even if no pivot
> in the node satisfies the search.
>
> The fact that this right-hand node contains the end of the range being
> stored is why we are traversing it, and this loop is why we appear to
> discover a viable range within the right node to copy into the big one.
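>
> To make this concrete, here is a standalone userspace simulation of that
> loop (the pivot values are hypothetical, and this is not kernel code).
> When the node's LAST pivot equals the end of the stored range, the walk
> terminates exactly on that slot:
>
> 	#include <stdio.h>
>
> 	int main(void)
> 	{
> 		/* Pivots of a hypothetical right-hand leaf node; the last
> 		 * slot containing data has pivot 0x3000, equal to the end
> 		 * of the range being stored. */
> 		unsigned long pivots[] = { 0x1000, 0x2000, 0x3000 };
> 		unsigned char count = 2;	/* ma_data_end() equivalent */
> 		unsigned long index = 0x3001;	/* one plus the range end */
> 		unsigned char offset = 0;
>
> 		/* The mas_wr_node_walk() loop quoted above. */
> 		while (offset < count && index > pivots[offset])
> 			offset++;
>
> 		/* Prints "offset=2 pivot=0x3000": the walk stops ON the
> 		 * entry whose pivot equals the range end, which is then
> 		 * duplicated into the big node. */
> 		printf("offset=%u pivot=0x%lx\n", (unsigned)offset,
> 		       pivots[offset]);
> 		return 0;
> 	}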
>
> However, if the node that r_mas traverses contains a pivot EQUAL to the end
> of the range being stored, and this is the LAST pivot contained within the
> node, something unexpected happens:
>
> 1. The l_mas traversal, copy, and insertion of the new entry into the big
> node are performed correctly via mas_store_b_node().
>
> 2. The traversal performed by mas_wr_node_walk() means our r_mas.offset is
> set to the offset of the entry whose pivot equals the end of the range we
> store.
>
> 3. We therefore copy this DUPLICATE of the final pivot into the big node,
> and insert this DUPLICATE entry, along with its invalid slot entry,
> immediately after the newly inserted entry.
>
> 4. The big node containing this duplicated entry is inserted into the
> tree, which is then rebalanced, and the maple tree therefore becomes
> corrupted.
>
> Note that if the right-hand node had one or more entries with pivots of
> greater value than the end of the stored range, this would not happen. If
> it contained only entries with pivots of lesser value, it would not be the
> right node of this spanning store.
>
> This appears to have been possible throughout the maple tree's history;
> however, it seems to have been significantly less likely to occur until
> recently.
>
> The balancing of the tree seems to have made it unlikely that you would
> happen to perform a store that both spans two nodes AND overwrites
> precisely the entry with the largest pivot in the right-hand node, i.e.
> one which is followed by no larger pivots.
>
> The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
> in mmap_region()") seems to have made this event much more likely to
> occur.
>
> Prior to this change, MAP_FIXED mappings which were overwritten would
> first be cleared before any subsequent store or, importantly, any merge of
> surrounding entries was performed.
>
> After this change this is no longer the case, which means that, in the
> worst case, a number of entries might be overwritten in combination with a
> merge (and subsequent overwriting expansion) with both the prior entry
> AND a subsequent entry.
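>
> A userspace sketch of the kind of operation which now exercises this path
> (the addresses and layout are purely illustrative):
>
> 	#include <sys/mman.h>
>
> 	int main(void)
> 	{
> 		size_t pg = 4096;
>
> 		/* Three adjacent anonymous pages at a fixed address. */
> 		char *base = mmap((void *)0x700000000000UL, 3 * pg,
> 				  PROT_READ | PROT_WRITE,
> 				  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
> 				  -1, 0);
>
> 		/* Split the region into three VMAs. */
> 		mprotect(base + pg, pg, PROT_READ);
>
> 		/* A MAP_FIXED mapping over the middle page with the
> 		 * original protections overwrites that VMA and may now
> 		 * merge with BOTH the prior AND the subsequent VMA in a
> 		 * single store. */
> 		mmap(base + pg, pg, PROT_READ | PROT_WRITE,
> 		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
> 		return 0;
> 	}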
>
> The motivation for this change arose from Bert Karwatzki's report of
> encountering mm instability after the release of kernel v6.12-rc1 which,
> after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
> options, was identified as maple tree corruption.
>
> After Bert very generously provided his time and ability to reproduce this
> event consistently, I was able to finally identify that the issue discussed
> in this commit message was occurring for him.
>
> The solution implemented in this patch is:
>
> 1. Adjust mas_wr_walk_index() to return a boolean value indicating whether
> the containing node is actually populated with entries possessing pivots
> equal to or greater than mas->index.
>
> 2. When traversing the right node in mas_wr_spanning_store(), use this
> value to determine whether to try to copy from the right node - if it is
> not populated, then do not do so.
>
> This passes all maple tree unit tests and resolves the reported bug.
> ---
>  lib/maple_tree.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 37abf0fe380b..e6f0da908ba7 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
>
>  /*
>   * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
> + *                      If @mas->index cannot be found within the containing
> + *                      node, we traverse to the last entry in the node.
>   * @wr_mas: The maple write state
>   *
>   * Uses mas_slot_locked() and does not need to worry about dead nodes.
> @@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
>  	return true;
>  }
>
> +/*
> + * Traverse the maple tree until the offset of mas->index is reached.
> + *
> + * Return: Is this node actually populated with entries possessing pivots equal
> + * to or greater than mas->index?
> + */
>  static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
>  {
>  	struct ma_state *mas = wr_mas->mas;
> @@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
>  		mas_wr_walk_descend(wr_mas);
>  		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
>  						  mas->offset);
> -		if (ma_is_leaf(wr_mas->type))
> -			return true;
> +		if (ma_is_leaf(wr_mas->type)) {
> +			unsigned long pivot = wr_mas->pivots[mas->offset];
> +
> +			return pivot == 0 || mas->index <= pivot;
> +		}
>  		mas_wr_walk_traverse(wr_mas);
>
>  	}
> @@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  	struct maple_big_node b_node;
>  	struct ma_state *mas;
>  	unsigned char height;
> +	bool r_populated;
>
>  	/* Left and Right side of spanning store */
>  	MA_STATE(l_mas, NULL, 0, 0);
> @@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  		r_mas.last++;
>
>  	r_mas.index = r_mas.last;
> -	mas_wr_walk_index(&r_wr_mas);
> +	r_populated = mas_wr_walk_index(&r_wr_mas);
>  	r_mas.last = r_mas.index = mas->last;
>
>  	/* Set up left side. */
> @@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  	/* Copy l_mas and store the value in b_node. */
>  	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
>  	/* Copy r_mas into b_node. */
> -	if (r_mas.offset <= r_mas.end)
> +	if (r_populated && r_mas.offset <= r_mas.end)
>  		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
>  			   &b_node, b_node.b_end + 1);
>  	else
> --
> 2.46.2

I just tested this and it passed ten tests (i.e. upgrading the Proton
version in Steam) in a row.

Bert Karwatzki