Re: [PATCH] ext4: get discard out of jbd2 commit kthread

From: Wang Jianchao
Date: Tue May 18 2021 - 21:29:19 EST




On 2021/5/18 10:57 PM, Theodore Y. Ts'o wrote:
> On Tue, May 18, 2021 at 09:19:13AM +0800, Wang Jianchao wrote:
>>> That way we don't need to move all of this to a kworker context.
>>
>> The submit_bio also needs to be out of the jbd2 commit kthread, as it may
>> be blocked by blk-wbt or by a lack of free request tags. ;)
>
> Actually, there's a bigger deal that I hadn't realized, which is why we
> are currently using submit_bio_wait(). We *must* wait until
> discard has completed before we call ext4_free_data_in_buddy(), which
> is what allows those blocks to be reused by the block allocator.
>
> If the discard happens after we reallocate the block, there is a good
> chance that we will end up corrupting a data or metadata block,
> leading to user data loss.

Yes

>
> There's another corollary to this; if you use blk-wbt, and you are
> doing lots of deletes, and we move this all to a writeback thread,
> this *significantly* increases the chance that the user will see
> ENOSPC errors in the case where they are running with a very full
> (close to 100% used) file system.

In this patch we flush the kwork that is doing the discard.
That is done in ext4_should_retry_alloc().
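
To make the ordering concrete, here is a toy user-space model of the idea. It is only a sketch and not the patch code: every toy_* name and the fixed-size arrays are made up for illustration. Blocks queued for discard stay unavailable until the discard work has run, and the allocation path flushes that work before giving up.

```c
#include <assert.h>
#include <stddef.h>

#define NBLOCKS     8
#define MAX_PENDING 8

/* Hypothetical block states for this toy model only. */
enum blk_state { BLK_USED, BLK_PENDING_DISCARD, BLK_FREE };

struct toy_fs {
	enum blk_state state[NBLOCKS];
	int pending[MAX_PENDING];	/* blocks queued for deferred discard */
	size_t npending;
	unsigned int discards_issued;
};

/* Delete path: the block is only queued; it is not reusable yet. */
static void toy_queue_discard(struct toy_fs *fs, int blk)
{
	assert(blk >= 0 && blk < NBLOCKS);
	assert(fs->state[blk] == BLK_USED);
	assert(fs->npending < MAX_PENDING);
	fs->state[blk] = BLK_PENDING_DISCARD;
	fs->pending[fs->npending++] = blk;
}

/* Worker body: finish the discard first, then release the block. */
static void toy_flush_discard_work(struct toy_fs *fs)
{
	for (size_t i = 0; i < fs->npending; i++) {
		int blk = fs->pending[i];

		fs->discards_issued++;		/* stands in for a completed discard */
		fs->state[blk] = BLK_FREE;	/* stands in for freeing into the buddy */
	}
	fs->npending = 0;
}

/* Plain allocation attempt: only fully released blocks are eligible. */
static int toy_try_alloc(struct toy_fs *fs)
{
	for (int i = 0; i < NBLOCKS; i++) {
		if (fs->state[i] == BLK_FREE) {
			fs->state[i] = BLK_USED;
			return i;
		}
	}
	return -1;
}

/* Allocation with one retry: flush pending discards before failing. */
static int toy_alloc(struct toy_fs *fs)
{
	int blk = toy_try_alloc(fs);

	if (blk < 0 && fs->npending > 0) {
		toy_flush_discard_work(fs);
		blk = toy_try_alloc(fs);
	}
	return blk;
}
```

In this model a block that is still waiting for its discard can never be handed out, and a nearly full filesystem pays the discard latency in the allocation path instead of returning ENOSPC right away.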

>
> I'd argue that this is a *really* good reason why using mount -o
> discard is Just A Bad Idea if you are running with blk-wbt. If
> discards are slow, using fstrim is a much better choice. It's also
> the case that for most SSDs and workloads, doing frequent discards
> doesn't actually help that much. The write endurance of the device is
> not compromised that much if you only run fstrim and discard unused
> blocks once a day, or even once a week --- I only recommend use of
> mount -o discard in cases where the discard operation is effectively
> free. (e.g., in cases where the FTL is implemented on the Host OS, or
> you are running with super-fast flash which is PCIe or NVMe attached.)

We're running ext4 with discard on an nbd device whose backend is a storage
cluster. Discard helps to return the unused space to the storage pool.

Sometimes an application deletes a lot of data and the discards flood the
device. Then we see the jbd2 commit kthread blocked for a long time. Even
after moving the discard out of jbd2, we still see that the write IO of the
jbd2 log can be blocked. blk-wbt helps to relieve this. In the end the delay
is shifted to the allocation path, but this is better than blocking the page
fault path, which holds mm->mmap_sem for read.

Best regards
Jianchao

>
> Cheers,
>
> - Ted
>