Re: [PATCH] btrfs: fix data race when accessing the block_group's used field

From: haoran zheng
Date: Wed Feb 19 2025 - 04:06:32 EST


On Wed, Feb 19, 2025 at 12:29 AM Filipe Manana <fdmanana@xxxxxxxxxx> wrote:
>
> On Tue, Feb 18, 2025 at 4:14 PM Daniel Vacek <neelx@xxxxxxxx> wrote:
> >
> > On Tue, 18 Feb 2025 at 11:19, Filipe Manana <fdmanana@xxxxxxxxxx> wrote:
> > >
> > > On Tue, Feb 18, 2025 at 8:08 AM Daniel Vacek <neelx@xxxxxxxx> wrote:
> > > >
> > > > On Mon, 10 Feb 2025 at 12:11, Filipe Manana <fdmanana@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Sat, Feb 8, 2025 at 7:38 AM Hao-ran Zheng <zhenghaoran154@xxxxxxxxx> wrote:
> > > > > >
> > > > > > A data race may occur when `btrfs_discard_queue_work()` and
> > > > > > `btrfs_update_block_group()` are executed concurrently. Specifically,
> > > > > > when `btrfs_update_block_group()` reaches the line
> > > > > > `cache->used = old_val;` while `btrfs_discard_queue_work()` is
> > > > > > evaluating `if (block_group->used == 0)`, a data race occurs, which
> > > > > > may cause the block group to be placed unexpectedly on discard_list
> > > > > > or discard_unused_list. The call stacks involved are as follows:
> > > > > >
> > > > > > ============DATA_RACE============
> > > > > > btrfs_discard_queue_work+0x245/0x500 [btrfs]
> > > > > > __btrfs_add_free_space+0x3066/0x32f0 [btrfs]
> > > > > > btrfs_add_free_space+0x19a/0x200 [btrfs]
> > > > > > unpin_extent_range+0x847/0x2120 [btrfs]
> > > > > > btrfs_finish_extent_commit+0x9a3/0x1840 [btrfs]
> > > > > > btrfs_commit_transaction+0x5f65/0xc0f0 [btrfs]
> > > > > > transaction_kthread+0x764/0xc20 [btrfs]
> > > > > > kthread+0x292/0x330
> > > > > > ret_from_fork+0x4d/0x80
> > > > > > ret_from_fork_asm+0x1a/0x30
> > > > > > ============OTHER_INFO============
> > > > > > btrfs_update_block_group+0xa9d/0x2430 [btrfs]
> > > > > > __btrfs_free_extent+0x4f69/0x9920 [btrfs]
> > > > > > __btrfs_run_delayed_refs+0x290e/0xd7d0 [btrfs]
> > > > > > btrfs_run_delayed_refs+0x317/0x770 [btrfs]
> > > > > > flush_space+0x388/0x1440 [btrfs]
> > > > > > btrfs_preempt_reclaim_metadata_space+0xd65/0x14c0 [btrfs]
> > > > > > process_scheduled_works+0x716/0xf10
> > > > > > worker_thread+0xb6a/0x1190
> > > > > > kthread+0x292/0x330
> > > > > > ret_from_fork+0x4d/0x80
> > > > > > ret_from_fork_asm+0x1a/0x30
> > > > > > =================================
> > > > > >
> > > > > > Although `block_group->used` is checked again in
> > > > > > `peek_discard_list()`, `block_group->used` is a 64-bit variable, so
> > > > > > we still consider the data race here to be unexpected behavior. It
> > > > > > is recommended to use `READ_ONCE` and `WRITE_ONCE` for the reads
> > > > > > and writes.
> > > > > >
> > > > > > Signed-off-by: Hao-ran Zheng <zhenghaoran154@xxxxxxxxx>
> > > > > > ---
> > > > > >  fs/btrfs/block-group.c | 4 ++--
> > > > > >  fs/btrfs/discard.c     | 2 +-
> > > > > >  2 files changed, 3 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > > > > > index c0a8f7d92acc..c681b97f6835 100644
> > > > > > --- a/fs/btrfs/block-group.c
> > > > > > +++ b/fs/btrfs/block-group.c
> > > > > > @@ -3678,7 +3678,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > > > > >          old_val = cache->used;
> > > > > >          if (alloc) {
> > > > > >                  old_val += num_bytes;
> > > > > > -                cache->used = old_val;
> > > > > > +                WRITE_ONCE(cache->used, old_val);
> > > > > >                  cache->reserved -= num_bytes;
> > > > > >                  cache->reclaim_mark = 0;
> > > > > >                  space_info->bytes_reserved -= num_bytes;
> > > > > > @@ -3690,7 +3690,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> > > > > >                  spin_unlock(&space_info->lock);
> > > > > >          } else {
> > > > > >                  old_val -= num_bytes;
> > > > > > -                cache->used = old_val;
> > > > > > +                WRITE_ONCE(cache->used, old_val);
> > > > > >                  cache->pinned += num_bytes;
> > > > > >                  btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
> > > > > >                  space_info->bytes_used -= num_bytes;
> > > > > > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c
> > > > > > index e815d165cccc..71c57b571d50 100644
> > > > > > --- a/fs/btrfs/discard.c
> > > > > > +++ b/fs/btrfs/discard.c
> > > > > > @@ -363,7 +363,7 @@ void btrfs_discard_queue_work(struct btrfs_discard_ctl *discard_ctl,
> > > > > >          if (!block_group || !btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC))
> > > > > >                  return;
> > > > > >
> > > > > > -        if (block_group->used == 0)
> > > > > > +        if (READ_ONCE(block_group->used) == 0)
> > > > >
> > > > > There are at least 3 more places in discard.c where we access ->used
> > > > > without being under the protection of the block group's spinlock.
> > > > > So let's fix this for all places and not just a single one...
> > > > >
> > > > > Also, it's quite ugly to spread READ_ONCE/WRITE_ONCE all over the place.
> > > > > What we typically do in btrfs is to add helpers that hide them, see
> > > > > block-rsv.h for example.
> > > > >
> > > > > Also, I don't think we need READ_ONCE/WRITE_ONCE.
> > > > > We could use data_race(), though I think that could be subject to
> > > > > load/store tearing; or we could just take the lock.
> > > > > So adding a helper like this to block-group.h:
> > > > >
> > > > > static inline u64 btrfs_block_group_used(struct btrfs_block_group *bg)
> > > > > {
> > > > >         u64 ret;
> > > > >
> > > > >         spin_lock(&bg->lock);
> > > > >         ret = bg->used;
> > > > >         spin_unlock(&bg->lock);
> > > > >
> > > > >         return ret;
> > > > > }

I understand that using a lock to protect block_group->used in
discard.c is feasible. In addition, I looked at the code in
block-group.c and found that some of the places that use
block_group->used already hold the block group's lock, so it
seems inappropriate to call btrfs_block_group_used() there
(it would deadlock). May I ask whether there are other places
that should call btrfs_block_group_used(), besides the four
places in discard.c?
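
For example, this is how I read the suggestion (just a sketch based
on the helper quoted above; the call sites are simplified):

/*
 * In discard.c the block group's lock is not held, so the helper
 * is safe to use:
 */
if (btrfs_block_group_used(block_group) == 0)
        add_to_discard_unused_list(discard_ctl, block_group);
else
        add_to_discard_list(discard_ctl, block_group);

/*
 * But in btrfs_update_block_group() cache->lock is already held, and
 * spinlocks are not recursive, so calling the helper there would
 * deadlock; those sites have to keep accessing the field directly:
 */
spin_lock(&cache->lock);
old_val = cache->used;
old_val += num_bytes;
cache->used = old_val;
spin_unlock(&cache->lock);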

> > > >
> > > > Would memory barriers be sufficient here? Taking a lock just for
> > > > reading one member seems excessive...

Do I need to run performance tests to assess the impact of taking the lock?
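
And if barriers (or just the annotations) end up being preferred over
the lock, I suppose the helpers would look roughly like this. This is
only an untested sketch: btrfs_set_block_group_used() is a name I made
up, READ_ONCE/WRITE_ONCE by themselves give no ordering, and on 32-bit
a u64 access may still tear, which I understand is exactly the
per-case question:

static inline u64 btrfs_block_group_used(struct btrfs_block_group *bg)
{
        /* Pairs with the WRITE_ONCE() in btrfs_set_block_group_used(). */
        return READ_ONCE(bg->used);
}

static inline void btrfs_set_block_group_used(struct btrfs_block_group *bg,
                                              u64 used)
{
        /* Writers are still serialized by bg->lock. */
        lockdep_assert_held(&bg->lock);
        WRITE_ONCE(bg->used, used);
}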

> > >
> > > Do you think there's heavy contention on this lock?
> >
> > This is not about contention. A spin lock should just never be used
> > this way. Nor any lock, actually. The critical section contains only
> > a single fetch operation, which does not justify using a lock.
> > Hence the only guarantees such lock usage provides are the implicit
> > memory barriers (of which maybe only one is really needed, depending
> > on the context where this helper is going to be used).
> >
> > Simply put, the lock is degraded into a memory barrier this way. So
> > why not just use the barriers in the first place if only ordering
> > guarantees are required? It just makes sense.
>
> As said earlier, it's a lot easier to reason about lock() and unlock()
> calls than about memory barriers spread across the write and read
> sides.
> Historically we had several mistakes with barriers; they're simply not
> as straightforward to reason about as locks.
>
> Plus, spreading the barriers across the read and write sides makes the
> code less easy to read, not to mention that for any new site updating
> the member, we must not forget to add a barrier.
>
> It's just easier to reason about and maintain.
>
>
> >
> > > Usually I prefer to go the simplest way, and using locks is a lot more
> > > straightforward and easier to understand than memory barriers.
> >
> > How so? Locks provide the same memory barriers implicitly.
>
> I'm not saying they don't.
> I'm talking about easier code to read and maintain.
>
> >
> > > Unless it's clear there's an observable performance penalty, keeping
> > > it simple is better.
> >
> > Exactly. Here is where our opinions differ. For me 'simple' means
> > without useless locking.
> > I mean locks especially should only be used when absolutely
> > necessary, in the sense of 'use the right tool for the job'. There
> > are more lightweight tools possible in this context. Locks provide
> > additional guarantees which are not needed here.
> >
> > On the other hand, I understand that using a lock is a 'better safe
> > than sorry' approach which should also work. Just keep in mind that
> > spinlocks may sleep on RT.
>
> It's not about better safe than sorry, but about code that is easier
> to read, reason about and maintain.
>
> >
> > > data_race() may be OK here; at least for one of the unprotected
> > > accesses it's fine, but we would have to analyse the other 3 cases.
> >
> > Those are exactly my thoughts. Maybe not even the barriers are needed
> > after all. This needs to be checked on a per-case basis.
> >
> > > >
> > > > > And then use btrfs_block_group_used() everywhere in discard.c where we
> > > > > aren't holding a block group's lock.
> > > > >
> > > > > Thanks.
> > > > >
> > > > >
> > > > > >                  add_to_discard_unused_list(discard_ctl, block_group);
> > > > > >          else
> > > > > >                  add_to_discard_list(discard_ctl, block_group);
> > > > > > --
> > > > > > 2.34.1
> > > > > >
> > > > > >
> > > > >
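
By the way, if data_race() turns out to be enough for some of the
readers, I guess the block-rsv.h style mentioned above would reduce to
something like this (again just a sketch; it only tells KCSAN that the
race is intentional, and the u64 load could still tear on 32-bit):

static inline u64 btrfs_block_group_used(const struct btrfs_block_group *bg)
{
        /* Annotated unlocked read: the race is known and tolerated. */
        return data_race(bg->used);
}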