Re: [PATCH 3/5] Btrfs: only associate the locked page with one async_cow struct

From: Chris Mason
Date: Thu Jul 11 2019 - 15:52:41 EST


On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:

> On 10.07.19 Ð. 22:28 Ñ., Tejun Heo wrote:
>> From: Chris Mason <clm@xxxxxx>
>>
>> The btrfs writepages function collects a large range of pages flagged
>> for delayed allocation, and then sends them down through the COW code
>> for processing. When compression is on, we allocate one async_cow
>
> nit: The code no longer uses async_cow to represent in-flight chunks
> but
> the more aptly named async_chunk. Presumably this patchset predates
> those changes.

Not by much, but yes.

>
>>
>> The end result is that cowA is completed and cleaned up before cowB
>> even
>> starts processing. This means we can free locked_page() and reuse it
>> elsewhere. If we get really lucky, it'll have the same page->index
>> in
>> its new home as it did before.
>>
>> While we're processing cowB, we might decide we need to fall back to
>> uncompressed IO, and so compress_file_range() will call
>> __set_page_dirty_nobufers() on cowB->locked_page.
>>
>> Without cgroups in use, this creates as a phantom dirty page, which>
>> isn't great but isn't the end of the world. With cgroups in use, we
>
> Having a phantom dirty page is not great but not terrible without
> cgroups but apart from that, does it have any other implications?

Best case, it'll go through the writepage fixup worker and go through
the whole cow machinery again. Worst case we go to this code more than
once:

/*
* if page_started, cow_file_range inserted an
* inline extent and took care of all the
unlocking
* and IO for us. Otherwise, we need to submit
* all those pages down to the drive.
*/
if (!page_started && !ret)
extent_write_locked_range(inode,
async_extent->start,
async_extent->start +
async_extent->ram_size
- 1,
WB_SYNC_ALL);
else if (ret)
unlock_page(async_chunk->locked_page);


That never happened in production as far as I can tell, but it seems
possible.

>
>
>> might crash in the accounting code because page->mapping->i_wb isn't
>> set.
>>
>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference
>> at 00000000000000d0
>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX:
>> 0000000000026115
>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI:
>> 0000000000000090
>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09:
>> 0000000000000000
>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12:
>> ffffffffffffffff
>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15:
>> 0000000000000001
>> [ 8308.582520] FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000)
>> knlGS:0000000000000000
>> [ 8308.585440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4:
>> 0000000000360ee0
>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>> 0000000000000400
>> [ 8308.594469] Call Trace:
>> [ 8308.595149] account_page_cleaned+0x15b/0x1f0
>> [ 8308.596340] __cancel_dirty_page+0x146/0x200
>> [ 8308.599395] truncate_cleanup_page+0x92/0xb0
>> [ 8308.600480] truncate_inode_pages_range+0x202/0x7d0
>> [ 8308.617392] btrfs_evict_inode+0x92/0x5a0
>> [ 8308.619108] evict+0xc1/0x190
>> [ 8308.620023] do_unlinkat+0x176/0x280
>> [ 8308.621202] do_syscall_64+0x63/0x1a0
>> [ 8308.623451] entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>
>> The fix here is to make asyc_cow->locked_page NULL everywhere but the
>> one async_cow struct that's allowed to do things to the locked page.
>>
>> Signed-off-by: Chris Mason <clm@xxxxxx>
>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and
>> reads")
>> Reviewed-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
>> ---
>> fs/btrfs/extent_io.c | 2 +-
>> fs/btrfs/inode.c | 25 +++++++++++++++++++++----
>> 2 files changed, 22 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 5106008f5e28..a31574df06aa 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct
>> address_space *mapping,
>> if (page_ops & PAGE_SET_PRIVATE2)
>> SetPagePrivate2(pages[i]);
>>
>> - if (pages[i] == locked_page) {
>> + if (locked_page && pages[i] == locked_page) {
>
> Why not make the check just if (locked_page) then clean it up, since
> if
> __process_pages_contig is called from the owner of the page then it's
> guaranteed that the page will fall within it's range.

I'm not convinced that every single caller of __process_pages_contig is
making sure to only send locked_page for ranges that correspond to the
locked_page. I'm not sure exactly what you're asking for though, it
looks like it would require some larger changes to the flow of that
loop.

>
>> put_page(pages[i]);
>> pages_locked++;
>> continue;
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 6e6df0eab324..a81e9860ee1f 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -666,10 +666,12 @@ static noinline void compress_file_range(struct
>> async_chunk *async_chunk,
>> * to our extent and set things up for the async work queue to run
>> * cow_file_range to do the normal delalloc dance.
>> */
>> - if (page_offset(async_chunk->locked_page) >= start &&
>> - page_offset(async_chunk->locked_page) <= end)
>> + if (async_chunk->locked_page &&
>> + (page_offset(async_chunk->locked_page) >= start &&
>> + page_offset(async_chunk->locked_page)) <= end) {
>
> DITTO since locked_page is now only set to the chunk that has the
> right
> to it then there is no need to check the offsets and this will
> simplify
> the code.
>

start is adjusted higher up in the loop:

if (start + total_in < end) {
start += total_in;
pages = NULL;
cond_resched();
goto again;
}

So we might get to the __set_page_dirty_nobuffers() test with a range
that no longer corresponds to the locked page.

-chris