Re: [PATCH 5/5] radix tree: fix multi-order iteration race

From: Jan Kara
Date: Wed May 09 2018 - 08:46:18 EST


On Thu 03-05-18 13:24:30, Ross Zwisler wrote:
> Fix a race in the multi-order iteration code which causes the kernel to hit
> a GP fault. This was first seen with a production v4.15 based kernel
> (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD
> DAX entries.
>
> The race has to do with how we tear down multi-order sibling entries when
> we are removing an item from the tree. Remember for example that an order
> 2 entry looks like this:
>
> struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
>
> where 'entry' is in some slot in the struct radix_tree_node, and the three
> slots following 'entry' contain sibling pointers which point back to
> 'entry.'
>
> When we delete 'entry' from the tree, we call :
> radix_tree_delete()
> radix_tree_delete_item()
> __radix_tree_delete()
> replace_slot()
>
> replace_slot() first removes the siblings in order from the first to the
> last, then at then replaces 'entry' with NULL. This means that for a brief
> period of time we end up with one or more of the siblings removed, so:
>
> struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
>
> This causes an issue if you have a reader iterating over the slots in the
> tree via radix_tree_for_each_slot() while only under
> rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
> mm/filemap.c.
>
> The issue is that when __radix_tree_next_slot() => skip_siblings() tries to
> skip over the sibling entries in the slots, it currently does so with an
> exact match on the slot directly preceding our current slot. Normally this
> works:
> V preceding slot
> struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
> ^ current slot
>
> This lets you find the first sibling, and you skip them all in order.
>
> But in the case where one of the siblings is NULL, that slot is skipped and
> then our sibling detection is interrupted:
>
> V preceding slot
> struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
> ^ current slot
>
> This means that the sibling pointers aren't recognized since they point all
> the way back to 'entry', so we think that they are normal internal radix
> tree pointers. This causes us to think we need to walk down to a struct
> radix_tree_node starting at the address of 'entry'.
>
> In a real running kernel this will crash the thread with a GP fault when
> you try and dereference the slots in your broken node starting at 'entry'.
>
> We fix this race by fixing the way that skip_siblings() detects sibling
> nodes. Instead of testing against the preceding slot we instead look for
> siblings via is_sibling_entry() which compares against the position of the
> struct radix_tree_node.slots[] array. This ensures that sibling entries
> are properly identified, even if they are no longer contiguous with the
> 'entry' they point to.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> Reported-by: CR, Sapthagirish <sapthagirish.cr@xxxxxxxxx>
> Fixes: commit 148deab223b2 ("radix-tree: improve multiorder iterators")
> Cc: <stable@xxxxxxxxxxxxxxx>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

Honza

> ---
> lib/radix-tree.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index da9e10c827df..43e0cbedc3a0 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1612,11 +1612,9 @@ static void set_iter_tags(struct radix_tree_iter *iter,
> static void __rcu **skip_siblings(struct radix_tree_node **nodep,
> void __rcu **slot, struct radix_tree_iter *iter)
> {
> - void *sib = node_to_entry(slot - 1);
> -
> while (iter->index < iter->next_index) {
> *nodep = rcu_dereference_raw(*slot);
> - if (*nodep && *nodep != sib)
> + if (*nodep && !is_sibling_entry(iter->node, *nodep))
> return slot;
> slot++;
> iter->index = __radix_tree_iter_add(iter, 1);
> @@ -1631,7 +1629,7 @@ void __rcu **__radix_tree_next_slot(void __rcu **slot,
> struct radix_tree_iter *iter, unsigned flags)
> {
> unsigned tag = flags & RADIX_TREE_ITER_TAG_MASK;
> - struct radix_tree_node *node = rcu_dereference_raw(*slot);
> + struct radix_tree_node *node;
>
> slot = skip_siblings(&node, slot, iter);
>
> --
> 2.14.3
>
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR