Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone

From: Wu Bo
Date: Thu May 16 2024 - 21:31:51 EST

Next message: Stephen Boyd: "[GIT PULL] clk changes for the merge window"
Previous message: Stephen Rothwell: "linux-next: manual merge of the tip tree with the kbuild tree"
In reply to: Bart Van Assche: "Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone"
Next in thread: Bart Van Assche: "Re: [PATCH stable] block/mq-deadline: fix different priority request on the same zone"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, May 16, 2024 at 07:45:21AM -0600, Bart Van Assche wrote:
> On 5/16/24 03:28, Wu Bo wrote:
> > Zoned devices require sequential writes within a zone. That means
> > if 2 requests are on the same zone, the lower pos request needs to be
> > dispatched to the device first.
> > While each priority has its own tree & list, requests with higher
> > priority will be dispatched first.
> > So if requestA & requestB are on the same zone, requestA is BE and pos
> > is X+0, and requestB is RT and pos is X+1, then requestB will be
> > dispatched before requestA, which gets an ERROR from the zoned device.
> >
> > This was found in a real-world scenario when using F2FS on a zoned
> > device, and it is very easy to reproduce:
> > 1. Use fsstress to run 8 test processes
> > 2. Use ionice to change 4/8 processes to RT priority
>
> Hi Wu,
>
> I agree that there is a problem related to the interaction of I/O
> priority and zoned storage. A solution with a lower runtime overhead
> is available here:
> https://lore.kernel.org/linux-block/20231218211342.2179689-1-bvanassche@xxxxxxx/T/#me97b088c535278fe3d1dc5846b388ed58aa53f46
Hi Bart,

I have tried setting all sequential write requests to the same priority:

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6a05dd86e8ca..b560846c63cb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -841,7 +841,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);

-	prio = ioprio_class_to_prio[ioprio_class];
+	if (blk_rq_is_seq_zoned_write(rq))
+		prio = DD_BE_PRIO;
+	else
+		prio = ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
 	if (!rq->elv.priv[0]) {
 		per_prio->stats.inserted++;
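
A minimal user-space sketch of what the change above is meant to do, under
simplifying assumptions: it is not kernel code, it ignores deadlines, aging
and read/write batching, and every name in it (struct req, queue_prio(),
dispatch(), PRIO_*) is hypothetical. Requests are dispatched highest priority
first and by ascending pos within a priority, and a sequential write that
does not land on the zone write pointer is counted as an error.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_REQS 16

/* Hypothetical stand-ins for the scheduler's priority levels. */
enum prio { PRIO_RT = 0, PRIO_BE = 1, PRIO_IDLE = 2 };

struct req {
	unsigned long pos;	/* position inside the zone */
	enum prio ioprio;	/* priority requested by the submitter */
	bool seq_zoned_write;	/* sequential write to a zoned device */
};

/* Priority used for queueing; mirrors the if/else in the diff above. */
static enum prio queue_prio(const struct req *rq, bool override)
{
	if (override && rq->seq_zoned_write)
		return PRIO_BE;
	return rq->ioprio;
}

/*
 * Dispatch all requests: highest priority first, ascending pos within
 * one priority. order[] receives the indices in dispatch order. wp is
 * the zone write pointer before dispatch; each write advances it by one.
 * Returns the number of writes that missed the write pointer, or -1 if
 * n is larger than MAX_REQS.
 */
static int dispatch(const struct req *rqs, size_t n, bool override,
		    unsigned long wp, size_t *order)
{
	bool done[MAX_REQS] = { false };
	int errors = 0;

	if (n > MAX_REQS)
		return -1;

	for (size_t i = 0; i < n; i++) {
		size_t best = n;

		for (size_t j = 0; j < n; j++) {
			enum prio pj, pb;

			if (done[j])
				continue;
			if (best == n) {
				best = j;
				continue;
			}
			pj = queue_prio(&rqs[j], override);
			pb = queue_prio(&rqs[best], override);
			if (pj < pb || (pj == pb && rqs[j].pos < rqs[best].pos))
				best = j;
		}
		done[best] = true;
		order[i] = best;

		if (rqs[best].seq_zoned_write) {
			if (rqs[best].pos == wp)
				wp++;		/* write accepted */
			else
				errors++;	/* write not at the write pointer */
		}
	}
	return errors;
}
```

With requestA = { X+0, BE } and requestB = { X+1, RT } and wp = X, calling
dispatch() with override == false dispatches requestB first and returns 1,
while override == true dispatches requestA first and returns 0.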

I think this has the same effect as the patch you mentioned here. Unfortunately,
this fix causes another issue.
All write requests are now set to the same priority, while read requests still
have different priorities. This makes f2fs prone to hanging under stress test:

[129412.105440][T1100129] vkhungtaskd: INFO: task "f2fs_ckpt-254:5":769 blocked for more than 193 seconds.
[129412.106629][T1100129] vkhungtaskd: 6.1.25-android14-11-maybe-dirty #1
[129412.107624][T1100129] vkhungtaskd: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129412.108873][T1100129] vkhungtaskd: task:f2fs_ckpt-254:5 state:D stack:10496 pid:769 ppid:2 flags:0x00000408
[129412.110194][T1100129] vkhungtaskd: Call trace:
[129412.110769][T1100129] vkhungtaskd: __switch_to+0x174/0x338
[129412.111566][T1100129] vkhungtaskd: __schedule+0x604/0x9e4
[129412.112275][T1100129] vkhungtaskd: schedule+0x7c/0xe8
[129412.112938][T1100129] vkhungtaskd: rwsem_down_write_slowpath+0x4cc/0xf98
[129412.113813][T1100129] vkhungtaskd: down_write+0x38/0x40
[129412.114500][T1100129] vkhungtaskd: __write_checkpoint_sync+0x8c/0x11c
[129412.115409][T1100129] vkhungtaskd: __checkpoint_and_complete_reqs+0x54/0x1dc
[129412.116323][T1100129] vkhungtaskd: issue_checkpoint_thread+0x8c/0xec
[129412.117148][T1100129] vkhungtaskd: kthread+0x110/0x224
[129412.117826][T1100129] vkhungtaskd: ret_from_fork+0x10/0x20
[129412.484027][T1700129] vkhungtaskd: task:f2fs_gc-254:55 state:D stack:10832 pid:771 ppid:2 flags:0x00000408
[129412.485337][T1700129] vkhungtaskd: Call trace:
[129412.485906][T1700129] vkhungtaskd: __switch_to+0x174/0x338
[129412.486618][T1700129] vkhungtaskd: __schedule+0x604/0x9e4
[129412.487327][T1700129] vkhungtaskd: schedule+0x7c/0xe8
[129412.487985][T1700129] vkhungtaskd: io_schedule+0x38/0xc4
[129412.488675][T1700129] vkhungtaskd: folio_wait_bit_common+0x3d8/0x4f8
[129412.489496][T1700129] vkhungtaskd: __folio_lock+0x1c/0x2c
[129412.490196][T1700129] vkhungtaskd: __folio_lock_io+0x24/0x44
[129412.490936][T1700129] vkhungtaskd: __filemap_get_folio+0x190/0x400
[129412.491736][T1700129] vkhungtaskd: pagecache_get_page+0x1c/0x5c
[129412.492501][T1700129] vkhungtaskd: f2fs_wait_on_block_writeback+0x60/0xf8
[129412.493376][T1700129] vkhungtaskd: do_garbage_collect+0x1100/0x223c
[129412.494185][T1700129] vkhungtaskd: f2fs_gc+0x284/0x778
[129412.494858][T1700129] vkhungtaskd: gc_thread_func+0x304/0x838
[129412.495603][T1700129] vkhungtaskd: kthread+0x110/0x224
[129412.496271][T1700129] vkhungtaskd: ret_from_fork+0x10/0x20

I think this is because f2fs is a CoW filesystem. Some threads need to do a lot
of reading & writing while holding a lock. With the reads and writes of such a
thread queued at different priorities, this work takes very long, and other FS
operations are blocked in the meantime.

So I came up with my solution to fix this priority issue on zoned devices. It
does raise the overhead, but it does fix the problem.

Thanks,
Wu Bo
>
> Are you OK with that alternative solution?
>
> Thanks,
>
> Bart.
