Re: 6.13/regression/bisected - new nvme timeout errors

From: Keith Busch
Date: Thu Mar 06 2025 - 10:19:37 EST


On Wed, Jan 15, 2025 at 02:58:04AM +0500, Mikhail Gavrilov wrote:
> Hi,
> During the 6.13 development cycle I spotted strange new nvme errors in the
> log which I had never seen before.
>
> [87774.010474] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:131072

...

> I still haven't found a stable way to reproduce this.
> But I'm pretty sure that if this error doesn't appear within two
> days, then we can assume that the kernel isn't affected by the
> problem.
> So I bisected with the above assumption and found this commit:
>
> beadf0088501d9dcf2454b05d90d5d31ea3ba55f is the first bad commit
> commit beadf0088501d9dcf2454b05d90d5d31ea3ba55f
> Author: Christoph Hellwig <hch@xxxxxx>
> Date: Wed Nov 13 16:20:41 2024 +0100
>
> nvme-pci: reverse request order in nvme_queue_rqs

The patch here uses the order received to dispatch commands in
consecutive submission queue entries, which is supposed to be the
desired behavior for any device. I did some testing on mainline, and it
sure looks like the order in which the driver dispatches is optimal, so
I'm not sure what's going on with your observation.

Do you have a scheduler enabled on your device?

How are you generating IO? Is it a pattern I should be able to replicate
with 'fio'?