Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
From: Lorenzo Stoakes
Date: Sat Oct 05 2024 - 02:21:39 EST
On Sat, Oct 05, 2024 at 02:56:01AM +0200, Bert Karwatzki wrote:
> Am Freitag, dem 04.10.2024 um 23:41 +0100 schrieb Lorenzo Stoakes:
> > On Fri, Oct 04, 2024 at 11:35:44AM +0200, Bert Karwatzki wrote:
> > > Here's the log produced by this kernel:
> > >
> > > c9e7f76815d3 (HEAD -> maple_tree_debug_4) hack: set of info stuff v5
> > > 7e3bb072761a mm: correct error handling in mmap_region()
> > > 77df9e4bb222 (tag: next-20241001, origin/master, origin/HEAD, master) Add linux-next specific files for 20241001
> > >
> > > Again it took two attempts to trigger the bug.
> > >
> > > Bert Karwatzki
> > >
> >
> > Sending an updated, cleaned up version of the patch with a lot of
> > explanation. This is functionally identical to the v3 fix I already sent so
> > you can try that or this to confirm it resolves your issue.
> >
> > If you are able to confirm that it resolves your issue, I can submit this
> > upstream as a hotfix. If not, well, then it's back to the drawing board,
> > and I'd be very shocked :)
> >
> > I have been able to reproduce the issue locally in our userland testing
> > suite entirely consistently, and this patch resolves the issue and also
> > continues to pass all maple tree unit tests.
> >
> > Again thank you so much for all your help - I hope you are able to find a
> > spare moment to quickly give this one a try and confirm whether it does
> > indeed address the problem you've reported.
> >
> > Thanks, Lorenzo
> >
> > ----8<----
> > From 126d65bd9839cd3ec941007872b357e27fd56066 Mon Sep 17 00:00:00 2001
> > From: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
> > Date: Fri, 4 Oct 2024 15:18:58 +0100
> > Subject: [PATCH] maple_tree: correct tree corruption on spanning store
> >
> > Writing a data range into a maple tree may involve overwriting a number of
> > existing entries that span across more than one node. Doing so invokes a
> > 'spanning' store.
> >
> > Performing a spanning store across two leaf nodes in a maple tree in which
> > entries are overwritten is achieved by first initialising a 'big' node,
> > which will store the coalesced entries from the two nodes: the entries
> > prior to the newly stored entry, the newly stored entry itself, and the
> > subsequent entries.
> >
> > This 'big node' is then merged back into the tree and the tree is
> > rebalanced, replacing the entries across the spanned nodes with those
> > contained in the big node.
> >
> > The operation is performed in mas_wr_spanning_store() which starts by
> > establishing two maple tree state objects ('mas' objects) on the left of
> > the range and on the right (l_mas and r_mas respectively).
> >
> > l_mas traverses to the beginning of the range to be stored in order to copy
> > the data BEFORE the requested store into the big node.
> >
> > We then insert our new entry immediately afterwards (both the left copy and
> > the storing of the new entry are combined and performed by
> > mas_store_b_node()).
> >
> > r_mas traverses to the populated slot immediately after, in order to copy
> > the data AFTER the requested store into the big node.
> >
> > This copy of the right-hand node is performed by mas_mab_cp() as long as
> > r_mas indicates that there's data to copy, i.e. r_mas.offset <= r_mas.end.
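> >
> > Schematically (a simplified, hypothetical sketch; real nodes hold many
> > more entries, and the labels here are invented):
> >
> > 	 left node           right node
> > 	[ A | B | C ]  ...  [ D | E | F ]
> > 	         \__ new entry X __/
> > 	        (overwrites C and D)
> >
> > 	 big node
> > 	[ A | B | X | E | F ]
> > 	   l_mas  ^   r_mas
> > 	   copy  new  copy
> > 	        entry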
> >
> > We traverse r_mas to this position in mas_wr_node_walk() using a simple
> > loop:
> >
> > 	while (offset < count && mas->index > wr_mas->pivots[offset])
> > 		offset++;
> >
> > Note here that count is determined to be the (inclusive) index of the last
> > slot containing data in the node, as determined by ma_data_end().
> >
> > This means that, even when searching for mas->index (which will have been
> > set to one plus the end of the target range in order to traverse to the
> > next slot in mas_wr_spanning_store()), we terminate the iteration at the
> > end of the node's data due to the offset < count condition, even if no
> > pivot actually satisfies the search.
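> >
> > To make the termination behaviour concrete, here is a minimal standalone
> > sketch of the same loop using hypothetical pivot values (this is not the
> > kernel code itself):
> >
> > 	#include <stdio.h>
> >
> > 	int main(void)
> > 	{
> > 		/* Hypothetical leaf node: slots 0..2 hold data, so
> > 		 * count == 2, the (inclusive) index of the last slot
> > 		 * containing data (cf. ma_data_end()). The stored range
> > 		 * ends precisely at the last pivot, 0x4fff. */
> > 		unsigned long pivots[] = { 0x2fff, 0x3fff, 0x4fff };
> > 		unsigned char count = 2;
> > 		unsigned long index = 0x5000; /* one plus the range end */
> > 		unsigned char offset = 0;
> >
> > 		while (offset < count && index > pivots[offset])
> > 			offset++;
> >
> > 		/* Prints offset == 2 even though pivots[2] < index: the
> > 		 * walk stops at the end of the node's data rather than
> > 		 * at a pivot satisfying the search. */
> > 		printf("offset = %u, pivot = 0x%lx\n",
> > 		       (unsigned int)offset, pivots[offset]);
> > 		return 0;
> > 	}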
> >
> > The fact that this right-hand node contains the end of the range being
> > stored is why we traverse it, and this loop is why we appear to discover a
> > viable range within the right node to copy into the big one.
> >
> > However, if the node that r_mas traverses contains a pivot EQUAL to the end
> > of the range being stored, and this is the LAST pivot contained within the
> > node, something unexpected happens:
> >
> > 1. The copy from the l_mas traversal and the insertion of the new entry
> > into the big node are performed correctly via mas_store_b_node().
> >
> > 2. The traversal performed by mas_wr_node_walk() means that our
> > r_mas.offset is set to the offset of the entry whose pivot equals the end
> > of the range we are storing.
> >
> > 3. We therefore copy this DUPLICATE of the final pivot into the big node
> > and insert this DUPLICATE entry, along with its now-invalid slot entry,
> > immediately after the newly inserted entry.
> >
> > 4. The big node containing this duplicate is inserted into the tree, which
> > is then rebalanced, and the maple tree therefore becomes corrupted.
> >
> > Note that if the right hand node had one or more entries with pivots of
> > greater value than the end of the stored range, this would not happen. If
> > it contained entries with pivots of lesser value it would not be the right
> > node in this spanning store.
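> >
> > As a concrete (hypothetical) illustration: suppose the right-hand leaf's
> > last populated slot has pivot 0x4fff, the stored range ends precisely at
> > 0x4fff, and the node contains no larger pivot. r_mas.index is set to
> > 0x5000 to find the next slot, the walk loop above runs off the end of the
> > node's data and leaves r_mas.offset pointing at the 0x4fff entry, the
> > r_mas.offset <= r_mas.end check therefore passes, and mas_mab_cp() copies
> > the already-overwritten 0x4fff entry into the big node a second time.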
> >
> > This appears to have been at risk of happening throughout the maple tree's
> > history; however, until recently it seemed significantly less likely to
> > occur.
> >
> > The balancing of the tree seems to have made it unlikely that you would
> > happen to perform a store that both spans two nodes AND overwrites
> > precisely the entry possessing the largest pivot in the right-hand node,
> > with no larger pivots remaining in that node.
> >
> > The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
> > in mmap_region()") seems to have made the probability of this event much
> > more likely.
> >
> > Prior to this change, overwritten MAP_FIXED mappings would first be
> > cleared from the tree before any subsequent store (or, importantly, any
> > merge of surrounding entries) was performed.
> >
> > After this change this is no longer the case, which means that, in the
> > worst case, a number of entries might be overwritten in combination with a
> > merge (and subsequent overwriting expansion) with both the prior entry
> > AND a subsequent entry.
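> >
> > As an illustration of the kind of operation that can now hit this path,
> > here is a minimal userland sketch (this is not the actual reproducer, and
> > the layout is invented):
> >
> > 	#include <sys/mman.h>
> >
> > 	int main(void)
> > 	{
> > 		const long pg = 4096;
> >
> > 		/* One anonymous mapping, split into three VMAs by giving
> > 		 * the middle page different protection bits. */
> > 		char *base = mmap(NULL, 3 * pg, PROT_READ,
> > 				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > 		if (base == MAP_FAILED)
> > 			return 1;
> > 		if (mprotect(base + pg, pg, PROT_READ | PROT_WRITE))
> > 			return 1;
> >
> > 		/* A MAP_FIXED overwrite spanning all three VMAs: the old
> > 		 * entries are no longer zeroed out of the maple tree
> > 		 * first, so the overwrite (and any merge with neighbouring
> > 		 * VMAs) is handled in a single store. */
> > 		if (mmap(base, 3 * pg, PROT_READ | PROT_WRITE,
> > 			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
> > 			 -1, 0) == MAP_FAILED)
> > 			return 1;
> >
> > 		return 0;
> > 	}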
> >
> > The motivation for this fix arose from Bert Karwatzki's report of
> > encountering mm instability after the release of kernel v6.12-rc1 which,
> > after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
> > options, was identified as maple tree corruption.
> >
> > After Bert very generously provided his time and ability to reproduce this
> > event consistently, I was able to finally identify that the issue discussed
> > in this commit message was occurring for him.
> >
> > The solution implemented in this patch is:
> >
> > 1. Adjust mas_wr_walk_index() to return a boolean value indicating whether
> > the containing node is actually populated with entries possessing pivots
> > equal to or greater than mas->index.
> >
> > 2. When traversing the right node in mas_wr_spanning_store(), use this
> > value to determine whether to try to copy from the right node - if it is
> > not populated, then do not do so.
> >
> > This passes all maple tree unit tests and resolves the reported bug.
> > ---
> > lib/maple_tree.c | 20 ++++++++++++++++----
> > 1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > index 37abf0fe380b..e6f0da908ba7 100644
> > --- a/lib/maple_tree.c
> > +++ b/lib/maple_tree.c
> > @@ -2194,6 +2194,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
> >
> >  /*
> >   * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
> > + *                      If @mas->index cannot be found within the containing
> > + *                      node, we traverse to the last entry in the node.
> >   * @wr_mas: The maple write state
> >   *
> >   * Uses mas_slot_locked() and does not need to worry about dead nodes.
> > @@ -3527,6 +3529,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
> >  	return true;
> >  }
> >  
> > +/*
> > + * Traverse the maple tree until the offset of mas->index is reached.
> > + *
> > + * Return: Is this node actually populated with entries possessing pivots equal
> > + * to or greater than mas->index?
> > + */
> >  static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
> >  {
> >  	struct ma_state *mas = wr_mas->mas;
> > @@ -3535,8 +3543,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
> >  		mas_wr_walk_descend(wr_mas);
> >  		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
> >  						  mas->offset);
> > -		if (ma_is_leaf(wr_mas->type))
> > -			return true;
> > +		if (ma_is_leaf(wr_mas->type)) {
> > +			unsigned long pivot = wr_mas->pivots[mas->offset];
> > +
> > +			return pivot == 0 || mas->index <= pivot;
> > +		}
> >  		mas_wr_walk_traverse(wr_mas);
> >  	}
> >  }
> > @@ -3696,6 +3707,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> >  	struct maple_big_node b_node;
> >  	struct ma_state *mas;
> >  	unsigned char height;
> > +	bool r_populated;
> >  
> >  	/* Left and Right side of spanning store */
> >  	MA_STATE(l_mas, NULL, 0, 0);
> > @@ -3737,7 +3749,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> >  		r_mas.last++;
> >  
> >  	r_mas.index = r_mas.last;
> > -	mas_wr_walk_index(&r_wr_mas);
> > +	r_populated = mas_wr_walk_index(&r_wr_mas);
> >  	r_mas.last = r_mas.index = mas->last;
> >  
> >  	/* Set up left side. */
> > @@ -3761,7 +3773,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> >  	/* Copy l_mas and store the value in b_node. */
> >  	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
> >  	/* Copy r_mas into b_node. */
> > -	if (r_mas.offset <= r_mas.end)
> > +	if (r_populated && r_mas.offset <= r_mas.end)
> >  		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
> >  			   &b_node, b_node.b_end + 1);
> >  	else
> > --
> > 2.46.2
>
> I just tested this and it passed ten tests (i.e. upgrading the proton
> version in steam) in a row.
>
> Bert Karwatzki
Perfect :) will send the fix upstream then as a hotfix for 6.12! Thanks
very much for helping out with this, your help has been absolutely
invaluable and HUGELY appreciated.
Cheers, Lorenzo