Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

From: Filipe Manana
Date: Wed Jun 17 2020 - 13:36:33 EST


On Wed, Jun 17, 2020 at 6:20 PM Filipe Manana <fdmanana@xxxxxxxxx> wrote:
>
> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@xxxxxx> wrote:
> >
> > Under somewhat convoluted conditions, it is possible to attempt to
> > release an extent_buffer that is under io, which triggers a BUG_ON in
> > btrfs_release_extent_buffer_pages.
> >
> > This relies on a few different factors. First, extent_buffer reads done
> > as readahead for searching use WAIT_NONE, so they free the local extent
> > buffer reference while the io is outstanding. However, they should still
> > be protected by TREE_REF. However, if the system is doing signficant
> > reclaim, and simultaneously heavily accessing the extent_buffers, it is
> > possible for releasepage to race with two concurrent readahead attempts
> > in a way that leaves TREE_REF unset when the readahead extent buffer is
> > released.
> >
> > Essentially, if two tasks race to allocate a new extent_buffer, but the
> > winner who attempts the first io is rebuffed by a page being locked
> > (likely by the reclaim itself) then the loser will still go ahead with
> > issuing the readahead. The loser's call to find_extent_buffer must also
> > race with the reclaim task reading the extent_buffer's refcount as 1 in
> > a way that allows the reclaim to re-clear the TREE_REF checked by
> > find_extent_buffer.
> >
> > The following represents an example execution demonstrating the race:
> >
> > CPU0 CPU1 CPU2
> > reada_for_search reada_for_search
> > readahead_tree_block readahead_tree_block
> > find_create_tree_block find_create_tree_block
> > alloc_extent_buffer alloc_extent_buffer
> > find_extent_buffer // not found
> > allocates eb
> > lock pages
> > associate pages to eb
> > insert eb into radix tree
> > set TREE_REF, refs == 2
> > unlock pages
> > read_extent_buffer_pages // WAIT_NONE
> > not uptodate (brand new eb)
> > lock_page
> > if !trylock_page
> > goto unlock_exit // not an error
> > free_extent_buffer
> > release_extent_buffer
> > atomic_dec_and_test refs to 1
> > find_extent_buffer // found
> > try_release_extent_buffer
> > take refs_lock
> > reads refs == 1; no io
> > atomic_inc_not_zero refs to 2
> > mark_buffer_accessed
> > check_buffer_tree_ref
> > // not STALE, won't take refs_lock
> > refs == 2; TREE_REF set // no action
> > read_extent_buffer_pages // WAIT_NONE
> > clear TREE_REF
> > release_extent_buffer
> > atomic_dec_and_test refs to 1
> > unlock_page
> > still not uptodate (CPU1 read failed on trylock_page)
> > locks pages
> > set io_pages > 0
> > submit io
> > return
> > release_extent_buffer
>
> Small detail, missing the call to free_extent_buffer() from
> readahead_tree_block(), which is what ends calling
> release_extent_buffer().
>
> > dec refs to 0
> > delete from radix tree
> > btrfs_release_extent_buffer_pages
> > BUG_ON(io_pages > 0)!!!
> >
> > We observe this at a very low rate in production and were also able to
> > reproduce it in a test environment by introducing some spurious delays
> > and by introducing probabilistic trylock_page failures.
> >
> > To fix it, we apply check_tree_ref at a point where it could not
> > possibly be unset by a competing task: after io_pages has been
> > incremented. There is no race in write_one_eb, that we know of, but for
> > consistency, apply it there too. All the codepaths that clear TREE_REF
> > check for io, so they would not be able to clear it after this point.
> >
> > Signed-off-by: Boris Burkov <boris@xxxxxx>
>
> Btw, if you have a stack trace, it would be good to include it in the
> change log for grep-ability in case someone runs into the same
> problem.
>
> > ---
> > fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
> > 1 file changed, 29 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index c59e07360083..f6758ebbb6a2 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> > clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> > num_pages = num_extent_pages(eb);
> > atomic_set(&eb->io_pages, num_pages);
> > + /*
> > + * It is possible for releasepage to clear the TREE_REF bit before we
> > + * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > + */
> > + check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the
> changelog, as this is in the write path.
> Why do we need this one?
>
> At this point the eb pages are marked with the dirty bit, and
> btree_releasepage() returns 0 if the page is dirty and we don't end up
> calling try_release_extent_buffer().

In addition to that, when write_one_eb() is called we have incremented
its ref count before and EXTENT_BUFFER_WRITEBACK is set on the eb,
try_release_extent_buffer() should return and never call
release_extent_buffer(), since the refcount is > 1 and
extent_buffer_under_io() returns true.

> So I don't understand why this hunk is needed, what variant of the
> problem it's trying to solve.
>
> >
> > /* set btree blocks beyond nritems with 0 to avoid stale content. */
> > nritems = btrfs_header_nritems(eb);
> > @@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> > static void check_buffer_tree_ref(struct extent_buffer *eb)
> > {
> > int refs;
> > - /* the ref bit is tricky. We have to make sure it is set
> > - * if we have the buffer dirty. Otherwise the
> > - * code to free a buffer can end up dropping a dirty
> > - * page
> > + /*
> > + * The TREE_REF bit is first set when the extent_buffer is added
> > + * to the radix tree. It is also reset, if unset, when a new reference
> > + * is created by find_extent_buffer.
> > *
> > - * Once the ref bit is set, it won't go away while the
> > - * buffer is dirty or in writeback, and it also won't
> > - * go away while we have the reference count on the
> > - * eb bumped.
> > + * It is only cleared in two cases: freeing the last non-tree
> > + * reference to the extent_buffer when its STALE bit is set or
> > + * calling releasepage when the tree reference is the only reference.
> > *
> > - * We can't just set the ref bit without bumping the
> > - * ref on the eb because free_extent_buffer might
> > - * see the ref bit and try to clear it. If this happens
> > - * free_extent_buffer might end up dropping our original
> > - * ref by mistake and freeing the page before we are able
> > - * to add one more ref.
> > + * In both cases, care is taken to ensure that the extent_buffer's
> > + * pages are not under io. However, releasepage can be concurrently
> > + * called with creating new references, which is prone to race
> > + * conditions between the calls to check_tree_ref in those codepaths
> > + * and clearing TREE_REF in try_release_extent_buffer.
> > *
> > - * So bump the ref count first, then set the bit. If someone
> > - * beat us to it, drop the ref we added.
> > + * The actual lifetime of the extent_buffer in the radix tree is
> > + * adequately protected by the refcount, but the TREE_REF bit and
> > + * its corresponding reference are not. To protect against this
> > + * class of races, we call check_buffer_tree_ref from the codepaths
> > + * which trigger io after they set eb->io_pages. Note that once io is
> > + * initiated, TREE_REF can no longer be cleared, so that is the
> > + * moment at which any such race is best fixed.
> > */
> > refs = atomic_read(&eb->refs);
> > if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> > @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
> > clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> > eb->read_mirror = 0;
> > atomic_set(&eb->io_pages, num_reads);
> > + /*
> > + * It is possible for releasepage to clear the TREE_REF bit before we
> > + * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > + */
> > + check_buffer_tree_ref(eb);
>
> This is the hunk that fixes the problem described in the change log,
> and it looks good to me.
>
> Thanks.
>
> > for (i = 0; i < num_pages; i++) {
> > page = eb->pages[i];
> >
> > --
> > 2.24.1
> >
>
>
> --
> Filipe David Manana,
>
> âWhether you think you can, or you think you can't â you're right.â



--
Filipe David Manana,

âWhether you think you can, or you think you can't â you're right.â