Re: [PATCH 5/5] radix tree: fix multi-order iteration race

From: Ross Zwisler
Date: Wed May 09 2018 - 11:09:50 EST


On Wed, May 09, 2018 at 02:46:11PM +0200, Jan Kara wrote:
> On Thu 03-05-18 13:24:30, Ross Zwisler wrote:
> > Fix a race in the multi-order iteration code which causes the kernel to hit
> > a GP fault. This was first seen with a production v4.15 based kernel
> > (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used order 9 PMD
> > DAX entries.
> >
> > The race has to do with how we tear down multi-order sibling entries when
> > we are removing an item from the tree. Remember for example that an order
> > 2 entry looks like this:
> >
> > struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
> >
> > where 'entry' is in some slot in the struct radix_tree_node, and the three
> > slots following 'entry' contain sibling pointers which point back to
> > 'entry.'
> >
> > When we delete 'entry' from the tree, we call :
> > radix_tree_delete()
> > radix_tree_delete_item()
> > __radix_tree_delete()
> > replace_slot()
> >
> > replace_slot() first removes the siblings in order from the first to the
> > last, then at then replaces 'entry' with NULL. This means that for a brief
> > period of time we end up with one or more of the siblings removed, so:
> >
> > struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
> >
> > This causes an issue if you have a reader iterating over the slots in the
> > tree via radix_tree_for_each_slot() while only under
> > rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
> > mm/filemap.c.
> >
> > The issue is that when __radix_tree_next_slot() => skip_siblings() tries to
> > skip over the sibling entries in the slots, it currently does so with an
> > exact match on the slot directly preceding our current slot. Normally this
> > works:
> > V preceding slot
> > struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
> > ^ current slot
> >
> > This lets you find the first sibling, and you skip them all in order.
> >
> > But in the case where one of the siblings is NULL, that slot is skipped and
> > then our sibling detection is interrupted:
> >
> > V preceding slot
> > struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
> > ^ current slot
> >
> > This means that the sibling pointers aren't recognized since they point all
> > the way back to 'entry', so we think that they are normal internal radix
> > tree pointers. This causes us to think we need to walk down to a struct
> > radix_tree_node starting at the address of 'entry'.
> >
> > In a real running kernel this will crash the thread with a GP fault when
> > you try and dereference the slots in your broken node starting at 'entry'.
> >
> > We fix this race by fixing the way that skip_siblings() detects sibling
> > nodes. Instead of testing against the preceding slot we instead look for
> > siblings via is_sibling_entry() which compares against the position of the
> > struct radix_tree_node.slots[] array. This ensures that sibling entries
> > are properly identified, even if they are no longer contiguous with the
> > 'entry' they point to.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> > Reported-by: CR, Sapthagirish <sapthagirish.cr@xxxxxxxxx>
> > Fixes: commit 148deab223b2 ("radix-tree: improve multiorder iterators")
> > Cc: <stable@xxxxxxxxxxxxxxx>
>
> Looks good to me. You can add:
>
> Reviewed-by: Jan Kara <jack@xxxxxxx>

Thank you for the review, Jan.