Re: [PATCH v3] btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()

From: Filipe Manana
Date: Tue Dec 03 2024 - 07:29:47 EST


On Tue, Dec 3, 2024 at 7:59 AM Hao-ran Zheng <zhenghaoran154@xxxxxxxxx> wrote:
>
> A data race occurs when the function `insert_ordered_extent_file_extent()`
> and the function `btrfs_inode_safe_disk_i_size_write()` are executed
> concurrently. The function `insert_ordered_extent_file_extent()` is not
> locked when reading inode->disk_i_size, causing
> `btrfs_inode_safe_disk_i_size_write()` to cause data competition when
> writing inode->disk_i_size, thus affecting the value of `modify_tree`.
>
> The specific call stack that appears during testing is as follows:
> ============DATA_RACE============
> btrfs_drop_extents+0x89a/0xa060 [btrfs]
> insert_reserved_file_extent+0xb54/0x2960 [btrfs]
> insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
> btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
> btrfs_finish_ordered_io+0x37/0x60 [btrfs]
> finish_ordered_fn+0x3e/0x50 [btrfs]
> btrfs_work_helper+0x9c9/0x27a0 [btrfs]
> process_scheduled_works+0x716/0xf10
> worker_thread+0xb6a/0x1190
> kthread+0x292/0x330
> ret_from_fork+0x4d/0x80
> ret_from_fork_asm+0x1a/0x30
> ============OTHER_INFO============
> btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
> btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
> btrfs_finish_ordered_io+0x37/0x60 [btrfs]
> finish_ordered_fn+0x3e/0x50 [btrfs]
> btrfs_work_helper+0x9c9/0x27a0 [btrfs]
> process_scheduled_works+0x716/0xf10
> worker_thread+0xb6a/0x1190
> kthread+0x292/0x330
> ret_from_fork+0x4d/0x80
> ret_from_fork_asm+0x1a/0x30
> =================================
>
> The main purpose of the modify_tree variable is to optimize the
> acquisition of read-write locks at or after the end of file (EOF).
> When the value of modify_tree is modified due to concurrency, resulting
> in unnecessary acquisition of write locks, the correctness of the
> system will not be affected. When the system requires a write lock but
> does not acquire the write lock because the value of modify_tree is
> incorrect, the path will be subsequently released and the lock will
> be reacquired to ensure the correctness of the system.

Looks good, I'll slightly rephrase this on things like "modify_tree is
modified due to concurrency" because what gets modified concurrently
is inode->disk_i_size.
I'll push it to the github btrfs for-next branch soon with this
paragraph instead:

"The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
eof, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks."

Thanks.

Reviewed-by: Filipe Manana <fdmanana@xxxxxxxx>

>
> Since this data race does not affect the correctness of the function,
> it is a harmless data race, and it is recommended to use `data_race`
> to mark it.
>
> Signed-off-by: Hao-ran Zheng <zhenghaoran154@xxxxxxxxx>
> ---
> V2->V3: Added details on why this is a harmless data race
> V1->V2: Modified patch description and format
> ---
> fs/btrfs/file.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 4fb521d91b06..559c177456e6 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -242,7 +242,7 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans,
> if (args->drop_cache)
> btrfs_drop_extent_map_range(inode, args->start, args->end - 1, false);
>
> - if (args->start >= inode->disk_i_size && !args->replace_extent)
> + if (data_race(args->start >= inode->disk_i_size) && !args->replace_extent)
> modify_tree = 0;
>
> update_refs = (btrfs_root_id(root) != BTRFS_TREE_LOG_OBJECTID);
> --
> 2.34.1
>
>