Re: [PATCH v4 4/4] blk-flush: reuse rq queuelist in flush state machine

From: Chengming Zhou
Date: Mon May 27 2024 - 19:51:23 EST


On 2024/5/28 07:34, Chengming Zhou wrote:
> On 2024/5/28 00:04, Friedrich Weber wrote:
>> Hi Chengming,
>>
>> Thank you for taking a look at this!
>>
>> On 27/05/2024 07:09, Chengming Zhou wrote:
>>>> I've used this reproducer for a bisect, which produced
>>>>
>>>> 81ada09cc25e (blk-flush: reuse rq queuelist in flush state machine)
>>>>
>>>> as the first commit with which I can reproduce the crashes. I'm not 100%
>>>> sure it is this one because the reproducer is a bit flaky. But it does
>>>> sound plausible, as the commit is included in our 6.8 kernel, and
>>>> touches `queuelist` which is AFAICT where blk_flush_complete_seq
>>>> dereferences the NULL pointer.
>>>
>>> Ok, it will be better that I can reproduce it locally, will try later.
>>
>> Interestingly, so far I haven't been able to reproduce the crash when
>> generating IO on the host itself, I only got crashes when generating IO
>> in a QEMU VM.
>>
>> The reproducer in more detail:
>
> Thanks for these details, I will try to setup and reproduce when I back to work.
>
>>
>> - Compile Linux 6.9 with CONFIG_FAULT_INJECTION,
> [...]
>>>
>>> BUG shows it panic on 0000000000000008, not sure what it's accessing then,
>>> does it means rq->queuelist.next == 0 or something? Could you use add2line
>>> to show the exact source code line that panic? I use blk_flush_complete_seq+0x296/0x2e0
>>> and get block/blk-flush.c:190, which is "fq->flush_data_in_flight++;",
>>> obviously fq can't be NULL. (I'm using the v6.9 kernel)
>>
>> Sorry for the confusion, the crash dump was from a kernel compiled at
>> 81ada09cc25e -- with 6.9, the offset seems to be different. See [2] for
>> a kernel 6.9 crash dump.
>>
>> I don't know too much about kernel debugging, but I tried to get
>> something useful out of addr2line:
>>
>> # addr2line -f -e /usr/lib/debug/vmlinux-6.9.0-debug2
>> blk_flush_complete_seq+0x291/0x2d0
>> __list_del
>> /[...]./include/linux/list.h:195
>>
>> I tried to find the relevant portions in `objdump -SD blk-flush.o`, see
>> [3]. If I'm not mistaken, blk_flush_complete_seq+0x291 should point to
>>
>> 351: 48 89 4f 08 mov %rcx,0x8(%rdi)
>>
>> To me this looks like part of
>>
>> list_move_tail(&rq->queuelist, pending);
>>
>> What do you think?
>
> Yeah, it seems correct, so the rq->queuelist.next == NULL. It can't be NULL
> if went through REQ_FSEQ_POSTFLUSH, so it must be REQ_FSEQ_PREFLUSH. It means
> we allocated a request but its queuelist is not initialized or corrupted?
>
> Anyway, I will use below changes for debugging when reproduce, and you could
> also try this to see if we could get something useful. :)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3b4df8e5ac9e..6e3a6cd7739d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2989,6 +2989,8 @@ void blk_mq_submit_bio(struct bio *bio)
> blk_mq_use_cached_rq(rq, plug, bio);
> }
>
> + BUG_ON(rq->queuelist.next == NULL);
> +
> trace_block_getrq(bio);
>
> rq_qos_track(q, rq, bio);
> @@ -3006,6 +3008,8 @@ void blk_mq_submit_bio(struct bio *bio)
> if (bio_zone_write_plugging(bio))
> blk_zone_write_plug_init_request(rq);
>
> + BUG_ON(rq->queuelist.next == NULL);
> +
> if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
> return;
>

Ah, I forgot to change to your kernel version, then should be:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d98654869615..908fdfb62132 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2963,6 +2963,8 @@ void blk_mq_submit_bio(struct bio *bio)
return;
}

+ BUG_ON(rq->queuelist.next == NULL);
+
trace_block_getrq(bio);

rq_qos_track(q, rq, bio);
@@ -2977,6 +2979,8 @@ void blk_mq_submit_bio(struct bio *bio)
return;
}

+ BUG_ON(rq->queuelist.next == NULL);
+
if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
return;