Re: [PATCH v4 4/4] blk-flush: reuse rq queuelist in flush state machine

From: Chengming Zhou
Date: Wed May 29 2024 - 04:53:38 EST


On 2024/5/28 22:40, Friedrich Weber wrote:
> On 28/05/2024 11:09, Chengming Zhou wrote:
>> On 2024/5/28 16:42, Friedrich Weber wrote:
>>> Hope I did this correctly. With this, the reproducer triggered a BUG
>>> pretty quickly, see [0]. If I can provide anything else, just let me know.
>>
>> Thanks for your patience, it's correct. Then how about this fix?
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index d98654869615..b2ec5c4c738f 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -485,6 +485,7 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
>>  	if (data->nr_tags > 1) {
>>  		rq = __blk_mq_alloc_requests_batch(data);
>>  		if (rq) {
>> +			INIT_LIST_HEAD(&rq->queuelist);
>>  			blk_mq_rq_time_init(rq, alloc_time_ns);
>>  			return rq;
>>  		}
>>
>
> Nice, seems like with this patch applied on top of 6.9, the reproducer
> does not trigger crashes anymore for me! Thanks!

Good news. Thanks.

>
> To verify that the reproducer hits the new INIT_LIST_HEAD, I added debug
> prints before/after:
[...]
> And indeed, I see quite some printouts where rq->queuelist.next differs
> before/after the request, e.g.
>
> May 28 16:31:25 reproflushfull kernel: before init: list:
> 000000001e0a144f next: 00000000aaa2e372 prev: 000000001e0a144f
> May 28 16:31:25 reproflushfull kernel: after init: list:
> 000000001e0a144f next: 000000001e0a144f prev: 000000001e0a144f
> May 28 16:31:26 reproflushfull kernel: before init: list:
> 000000001e0a144f next: 00000000aaa2e372 prev: 000000001e0a144f
> May 28 16:31:26 reproflushfull kernel: after init: list:
> 000000001e0a144f next: 000000001e0a144f prev: 000000001e0a144f
>
> I know very little about the block layer, but if I understand the commit
> message of the original 81ada09cc25e correctly, it's expected that
> queuelist needs to be reinitialized at some places. I'm just a little

Yes, because we use list_move_tail() in the flush sequences. Maybe we can
just use list_add_tail() instead, so that we don't need queuelist to be
initialized. That should be OK, since rq can't be on any list in the
PREFLUSH or POSTFLUSH step, so nothing is actually moved.
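
To illustrate the difference, here is a rough userspace sketch (not kernel
code). The helpers are simplified stand-ins for the kernel's list API, and
victim_corrupted() and its scenario are hypothetical: "rq" carries a stale
next pointer to an unrelated node, as in the "before init" printouts.

```c
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel's list helpers (sketch only). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Only overwrites new->next/new->prev; never reads their old values. */
static void list_add_tail(struct list_head *new, struct list_head *head)
{
	struct list_head *prev = head->prev;

	new->next = head;
	new->prev = prev;
	prev->next = new;
	head->prev = new;
}

/* Trusts e->next and e->prev: it writes through both of them. */
static void list_del_entry(struct list_head *e)
{
	e->next->prev = e->prev;
	e->prev->next = e->next;
}

static void list_move_tail(struct list_head *e, struct list_head *head)
{
	list_del_entry(e);
	list_add_tail(e, head);
}

/*
 * Hypothetical scenario: queue "rq" on "pending" while rq.next is stale.
 * Returns 1 if the unrelated "victim" node was modified, 0 otherwise.
 */
int victim_corrupted(int use_move)
{
	struct list_head pending, victim, rq;

	init_list_head(&pending);
	init_list_head(&victim);
	rq.next = &victim;	/* stale pointer left over from earlier use */
	rq.prev = &rq;		/* still self-pointing */

	if (use_move)
		list_move_tail(&rq, &pending);
	else
		list_add_tail(&rq, &pending);

	return victim.next != &victim || victim.prev != &victim;
}
```

In this sketch, victim_corrupted(1) reports that the unrelated node was
written to, while victim_corrupted(0) leaves it untouched, because the
add-only path never dereferences the stale pointers.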

But now I'm concerned that rq->queuelist may be changed by the driver after
the request ends. I don't know whether this is reasonable behavior. If we
can't rule it out, I'm afraid the safest way is to revert the last patch.
I'd like to know what Jens, Ming and Christoph think.

> confused to see the same pointer 00000000aaa2e372 in two subsequent
> "before init" printouts for the same queuelist 000000001e0a144f. Is this
> expected too?

Not sure, but it seems possible. This is an rq_list in the plug cache.
000000001e0a144f is a PREFLUSH request, which is freed after the request
ends. The next time, we allocate it again and put it in the plug cache,
again just before 00000000aaa2e372. The reason the block layer doesn't use
00000000aaa2e372 may be that it belongs to a different queue or hardware
queue. But these are just my guesses.
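
A rough userspace sketch of how such a stale next pointer could arise,
assuming (as I understand struct request) that rq_next shares storage with
queuelist and that popping from the cached list does not clear it. The
struct is heavily simplified, and cache_push(), cache_pop() and
popped_request_has_stale_next() are hypothetical names, not kernel
functions. It needs C11 for the anonymous union.

```c
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

/* Simplified request: the two link fields share storage. */
struct request {
	int tag;
	union {
		struct list_head queuelist;
		struct request *rq_next;
	};
};

/* Push onto a singly linked cache; overwrites only the first pointer. */
static void cache_push(struct request **cache, struct request *rq)
{
	rq->rq_next = *cache;
	*cache = rq;
}

/* Pop from the cache without clearing rq_next (assumption, see above). */
static struct request *cache_pop(struct request **cache)
{
	struct request *rq = *cache;

	if (rq)
		*cache = rq->rq_next;
	return rq;
}

/*
 * Returns 1 if the popped request still has a stale queuelist.next
 * (pointing at its former cache neighbour) and a self-pointing prev.
 */
int popped_request_has_stale_next(void)
{
	struct request a = { .tag = 1 }, b = { .tag = 2 };
	struct request *cache = NULL, *rq;

	/* Both requests start with an initialized (self-pointing) queuelist. */
	a.queuelist.next = a.queuelist.prev = &a.queuelist;
	b.queuelist.next = b.queuelist.prev = &b.queuelist;

	cache_push(&cache, &b);
	cache_push(&cache, &a);		/* now a.rq_next == &b */

	rq = cache_pop(&cache);		/* hands out a, a.rq_next is kept */

	return rq == &a &&
	       (void *)rq->queuelist.next == (void *)&b &&
	       rq->queuelist.prev == &rq->queuelist;
}
```

If that assumption holds, it would match the printouts: prev still equals
the list address, while next points at whatever request sat behind it in
the cache, and the same neighbour can show up again on the next batch
allocation.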

>
> Also, just out of interest: Can you estimate whether this issue is
> specific to software RAID setups, or could similar NULL pointer
> dereferences also happen in setups without software RAID?

I think it can also happen without software RAID.

Thanks.