[PATCH AUTOSEL 5.4 100/130] btrfs: fix race that results in logging old extents during a fast fsync

From: Sasha Levin
Date: Tue Dec 22 2020 - 21:58:32 EST


From: Filipe Manana <fdmanana@xxxxxxxx>

[ Upstream commit 5f96bfb7633c55b578c6b32f32624061f25010db ]

When logging the extents of an inode during a fast fsync, we have a time
window where we can log extents that are from the previous transaction and
already persisted. This only makes us waste time unnecessarily.

The following sequence of steps shows how this can happen:

1) We are at transaction 1000;

2) An ordered extent E from inode I completes, that is it has gone through
btrfs_finish_ordered_io(), and it set the extent maps' generation to
1000 when we unpin the extent, which is the generation of the current
transaction;

3) The commit for transaction 1000 starts by task A;

4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;

5) Some change is made to inode I, resulting in creation of a new
transaction with a generation of 1001;

6) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;

7) Task B starts an fsync on inode I, and when it gets to
btrfs_log_changed_extents() sees the extent map for extent E in the
list of modified extents. It sees the extent map has a generation of
1000 and fs_info->last_trans_committed has a value of 999, so it
proceeds to logging the respective file extent item and all the
checksums covering its range.

So we end up wasting time since the extent was already persisted and
is reachable through the trees pointed to by the super block committed
by transaction 1000.

So just fix this by comparing the extent maps generation against the
generation of the transaction handle - if it is smaller then the id in the
handle, we know the extent was already persisted and we do not need to log
it.

This patch belongs to a patch set that is comprised of the following
patches:

btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit

Performance results are mentioned in the change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
Signed-off-by: David Sterba <dsterba@xxxxxxxx>
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---
fs/btrfs/tree-log.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index de53e51669976..12182db88222b 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4372,14 +4372,12 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
struct extent_map *em, *n;
struct list_head extents;
struct extent_map_tree *tree = &inode->extent_tree;
- u64 test_gen;
int ret = 0;
int num = 0;

INIT_LIST_HEAD(&extents);

write_lock(&tree->lock);
- test_gen = root->fs_info->last_trans_committed;

list_for_each_entry_safe(em, n, &tree->modified_extents, list) {
/*
@@ -4412,7 +4410,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
goto process;
}

- if (em->generation <= test_gen)
+ if (em->generation < trans->transid)
continue;

/* We log prealloc extents beyond eof later. */
--
2.27.0