[PATCH 5.3 086/112] Btrfs: fix memory leak due to concurrent append writes with fiemap

From: Greg Kroah-Hartman
Date: Wed Oct 16 2019 - 18:01:45 EST

From: Filipe Manana <fdmanana@xxxxxxxx>

commit c67d970f0ea8dcc423e112137d34334fa0abb8ec upstream.

When we have a buffered write that starts at an offset greater than or
equals to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.

Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.

The following sequence diagram shows how the memory leak happens if a
fiemap a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.


--> it's a full ranged fiemap
range from 0 to LLONG_MAX - 1

--> locks range in the inode's
io tree
--> after this we have 2 extent
states in the io tree:
--> 1 for range [0, 1Mb[ with
the bits EXTENT_LOCKED and
--> 1 for the range
[1Mb, LLONG_MAX[ with
the EXTENT_LOCKED bit set

--> start buffered write at offset
1Mb with a length of 4Kb


--> cached_state is NULL

--> returns 0 and does not lock
range because it starts
at current i_size / eof

--> cached_state remains NULL


--> splits extent state for range
[1Mb, LLONG_MAX[ and now we
have 2 extent states:

--> one for the range
[1Mb, 1Mb + 4Kb[ with
--> another one for the range
[1Mb + 4Kb, LLONG_MAX[ with
EXTENT_LOCKED set as well

--> sets EXTENT_DELALLOC on the
extent state for the range
[1Mb, 1Mb + 4Kb[
--> caches extent state
[1Mb, 1Mb + 4Kb[ into
@cached_state because it has
the bit EXTENT_LOCKED set

--> btrfs_buffered_write() ends up
with a non-NULL cached_state and
never calls anything to release its
reference on it, resulting in a
memory leak

Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().

The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.

This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:

[77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1

Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.

CC: stable@xxxxxxxxxxxxxxx # 4.16+
Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
Signed-off-by: Filipe Manana <fdmanana@xxxxxxxx>
Reviewed-by: David Sterba <dsterba@xxxxxxxx>
Signed-off-by: David Sterba <dsterba@xxxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

fs/btrfs/file.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1591,7 +1591,6 @@ static noinline ssize_t btrfs_buffered_w
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_root *root = BTRFS_I(inode)->root;
struct page **pages = NULL;
- struct extent_state *cached_state = NULL;
struct extent_changeset *data_reserved = NULL;
u64 release_bytes = 0;
u64 lockstart;
@@ -1611,6 +1610,7 @@ static noinline ssize_t btrfs_buffered_w
return -ENOMEM;

while (iov_iter_count(i) > 0) {
+ struct extent_state *cached_state = NULL;
size_t offset = offset_in_page(pos);
size_t sector_offset;
size_t write_bytes = min(iov_iter_count(i),
@@ -1758,9 +1758,20 @@ again:
if (copied > 0)
ret = btrfs_dirty_pages(inode, pages, dirty_pages,
pos, copied, &cached_state);
+ /*
+ * If we have not locked the extent range, because the range's
+ * start offset is >= i_size, we might still have a non-NULL
+ * cached extent state, acquired while marking the extent range
+ * as delalloc through btrfs_dirty_pages(). Therefore free any
+ * possible cached extent state to avoid a memory leak.
+ */
if (extents_locked)
lockstart, lockend, &cached_state);
+ else
+ free_extent_state(cached_state);
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes,
if (ret) {