Re: 6.13/regression/bisected - new nvme timeout errors
From: Mikhail Gavrilov
Date: Wed Mar 05 2025 - 17:45:34 EST
On Mon, Feb 10, 2025 at 1:47 PM Mikhail Gavrilov
<mikhail.v.gavrilov@xxxxxxxxx> wrote:
>
> On Wed, Jan 15, 2025 at 2:58 AM Mikhail Gavrilov
> <mikhail.v.gavrilov@xxxxxxxxx> wrote:
> >
> > Hi,
> > During the 6.13 development cycle I spotted strange new nvme errors in the
> > log which I had never seen before.
> >
> > [87774.010474] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1
> > timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010502] nvme nvme1: I/O tag 1 (d001) opcode 0x1 (I/O Cmd) QID 1
> > timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010507] nvme nvme1: I/O tag 2 (4002) opcode 0x1 (I/O Cmd) QID 1
> > timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010513] nvme nvme1: I/O tag 3 (e003) opcode 0x1 (I/O Cmd) QID 1
> > timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010525] nvme nvme1: Abort status: 0x0
> > [87774.010535] nvme nvme1: I/O tag 21 (a015) opcode 0x1 (I/O Cmd) QID
> > 1 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010535] nvme nvme1: Abort status: 0x0
> > [87774.010538] nvme nvme1: I/O tag 22 (3016) opcode 0x1 (I/O Cmd) QID
> > 1 timeout, aborting req_op:WRITE(1) size:32768
> > [87774.010539] nvme nvme1: Abort status: 0x0
> > [87774.010542] nvme nvme1: Abort status: 0x0
> > [87774.010545] nvme nvme1: Abort status: 0x0
> > [87774.010548] nvme nvme1: I/O tag 384 (7180) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010553] nvme nvme1: I/O tag 385 (b181) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010554] nvme nvme1: Abort status: 0x0
> > [87774.010557] nvme nvme1: I/O tag 386 (8182) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010561] nvme nvme1: I/O tag 387 (5183) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010562] nvme nvme1: Abort status: 0x0
> > [87774.010564] nvme nvme1: I/O tag 388 (6184) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010570] nvme nvme1: Abort status: 0x0
> > [87774.010573] nvme nvme1: I/O tag 393 (5189) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010578] nvme nvme1: Abort status: 0x0
> > [87774.010581] nvme nvme1: I/O tag 398 (718e) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010586] nvme nvme1: Abort status: 0x0
> > [87774.010589] nvme nvme1: I/O tag 403 (a193) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:61440
> > [87774.010593] nvme nvme1: Abort status: 0x0
> > [87774.010595] nvme nvme1: I/O tag 406 (0196) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010601] nvme nvme1: Abort status: 0x0
> > [87774.010604] nvme nvme1: I/O tag 411 (419b) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:40960
> > [87774.010608] nvme nvme1: Abort status: 0x0
> > [87774.010612] nvme nvme1: I/O tag 416 (91a0) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010616] nvme nvme1: Abort status: 0x0
> > [87774.010619] nvme nvme1: I/O tag 420 (f1a4) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010624] nvme nvme1: Abort status: 0x0
> > [87774.010627] nvme nvme1: I/O tag 425 (51a9) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010632] nvme nvme1: Abort status: 0x0
> > [87774.010634] nvme nvme1: I/O tag 429 (11ad) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:98304
> > [87774.010639] nvme nvme1: Abort status: 0x0
> > [87774.010641] nvme nvme1: I/O tag 433 (11b1) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010647] nvme nvme1: Abort status: 0x0
> > [87774.010650] nvme nvme1: I/O tag 438 (c1b6) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010654] nvme nvme1: Abort status: 0x0
> > [87774.010657] nvme nvme1: I/O tag 442 (91ba) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010662] nvme nvme1: Abort status: 0x0
> > [87774.010664] nvme nvme1: I/O tag 446 (d1be) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:32768
> > [87774.010670] nvme nvme1: Abort status: 0x0
> > [87774.010674] nvme nvme1: I/O tag 452 (f1c4) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010678] nvme nvme1: Abort status: 0x0
> > [87774.010681] nvme nvme1: I/O tag 456 (a1c8) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010686] nvme nvme1: Abort status: 0x0
> > [87774.010689] nvme nvme1: I/O tag 461 (31cd) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:90112
> > [87774.010694] nvme nvme1: Abort status: 0x0
> > [87774.010697] nvme nvme1: I/O tag 466 (51d2) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:8192
> > [87774.010702] nvme nvme1: Abort status: 0x0
> > [87774.010704] nvme nvme1: I/O tag 470 (e1d6) opcode 0x1 (I/O Cmd) QID
> > 17 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.010709] nvme nvme1: Abort status: 0x0
> > [87774.010717] nvme nvme1: Abort status: 0x0
> > [87774.010725] nvme nvme1: Abort status: 0x0
> > [87774.010732] nvme nvme1: Abort status: 0x0
> > [87774.034477] nvme nvme1: I/O tag 1 (a001) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.034487] nvme nvme1: I/O tag 2 (0002) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.034492] nvme nvme1: I/O tag 3 (d003) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.034496] nvme nvme1: I/O tag 4 (b004) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:90112
> > [87774.034512] nvme nvme1: Abort status: 0x0
> > [87774.034524] nvme nvme1: Abort status: 0x0
> > [87774.034533] nvme nvme1: Abort status: 0x0
> > [87774.034537] nvme nvme1: Abort status: 0x0
> > [87774.051474] nvme nvme1: I/O tag 0 (1000) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:81920
> > [87774.051484] nvme nvme1: I/O tag 5 (e005) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051489] nvme nvme1: I/O tag 6 (a006) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051493] nvme nvme1: I/O tag 7 (7007) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051496] nvme nvme1: Abort status: 0x0
> > [87774.051514] nvme nvme1: I/O tag 25 (a019) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051518] nvme nvme1: Abort status: 0x0
> > [87774.051523] nvme nvme1: Abort status: 0x0
> > [87774.051523] nvme nvme1: I/O tag 31 (601f) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051527] nvme nvme1: Abort status: 0x0
> > [87774.051528] nvme nvme1: I/O tag 32 (c020) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051532] nvme nvme1: I/O tag 33 (4021) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051535] nvme nvme1: Abort status: 0x0
> > [87774.051539] nvme nvme1: I/O tag 37 (4025) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051540] nvme nvme1: Abort status: 0x0
> > [87774.051544] nvme nvme1: I/O tag 38 (0026) opcode 0x1 (I/O Cmd) QID
> > 29 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051550] nvme nvme1: Abort status: 0x0
> > [87774.051559] nvme nvme1: Abort status: 0x0
> > [87774.051559] nvme nvme1: I/O tag 512 (f200) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051562] nvme nvme1: Abort status: 0x0
> > [87774.051564] nvme nvme1: I/O tag 513 (a201) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051568] nvme nvme1: I/O tag 514 (4202) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051570] nvme nvme1: Abort status: 0x0
> > [87774.051573] nvme nvme1: I/O tag 516 (d204) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051576] nvme nvme1: Abort status: 0x0
> > [87774.051579] nvme nvme1: I/O tag 520 (b208) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051583] nvme nvme1: Abort status: 0x0
> > [87774.051586] nvme nvme1: I/O tag 524 (f20c) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051592] nvme nvme1: Abort status: 0x0
> > [87774.051594] nvme nvme1: I/O tag 529 (f211) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051599] nvme nvme1: Abort status: 0x0
> > [87774.051601] nvme nvme1: I/O tag 533 (6215) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051607] nvme nvme1: Abort status: 0x0
> > [87774.051609] nvme nvme1: I/O tag 538 (121a) opcode 0x1 (I/O Cmd) QID
> > 32 timeout, aborting req_op:WRITE(1) size:131072
> > [87774.051615] nvme nvme1: Abort status: 0x0
> > [87774.051628] nvme nvme1: Abort status: 0x0
> > [87774.051632] nvme nvme1: Abort status: 0x0
> > [87774.051642] nvme nvme1: Abort status: 0x0
> > [87775.490533] nvme nvme1: I/O tag 542 (021e) opcode 0x2 (I/O Cmd) QID
> > 32 timeout, aborting req_op:READ(0) size:16384
> > [87775.490568] nvme nvme1: Abort status: 0x0
> > [87778.435607] nvme nvme1: I/O tag 543 (e21f) opcode 0x9 (I/O Cmd) QID
> > 32 timeout, aborting req_op:DISCARD(3) size:45056
> > [87778.435643] nvme nvme1: Abort status: 0x0
> > [87801.658089] nvme nvme1: I/O tag 23 (7017) opcode 0x2 (I/O Cmd) QID
> > 1 timeout, aborting req_op:READ(0) size:16384
> > [87801.658159] nvme nvme1: Abort status: 0x0
> > [87804.019139] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1
> > timeout, reset controller
> > [87804.064993] nvme nvme1: 32/0/0 default/read/poll queues
> >
> > I still haven't found a stable way to reproduce this.
> > But I'm pretty sure that if this error doesn't appear within two
> > days, then we can assume that the kernel isn't affected by the
> > problem.
> > So I bisected with that assumption and found this commit:
> >
> > beadf0088501d9dcf2454b05d90d5d31ea3ba55f is the first bad commit
> > commit beadf0088501d9dcf2454b05d90d5d31ea3ba55f
> > Author: Christoph Hellwig <hch@xxxxxx>
> > Date: Wed Nov 13 16:20:41 2024 +0100
> >
> > nvme-pci: reverse request order in nvme_queue_rqs
> >
> > blk_mq_flush_plug_list submits requests in the reverse order that they
> > were submitted, which leads to a rather suboptimal I/O pattern especially
> > in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
> > always pops the requests from the passed in request list, and then adds
> > them to the head of a local submit list. This actually simplifies the
> > code a bit as it removes the complicated list splicing, at the cost of
> > extra updates of the rq_next pointer. As that should be cache hot
> > anyway it should be an easy price to pay.
> >
> > Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
> > Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> > Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@xxxxxx
> > Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> >
> > drivers/nvme/host/pci.c | 39 +++++++++++++++++----------------------
> > 1 file changed, 17 insertions(+), 22 deletions(-)
> >
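(Side note for anyone skimming the commit message above: the pattern it
describes is just popping requests off the plug list and pushing each one
onto the head of a local submit list, so the second reversal restores the
original submission order. Below is only a minimal stand-alone sketch of
that idea, not the actual drivers/nvme/host/pci.c code; the helpers here
merely mimic what the driver's rq_list helpers such as rq_list_pop() and
rq_list_add() do.)

#include <stdio.h>

struct req {
	int tag;          /* stand-in for a block layer request */
	struct req *next; /* singly linked, like the driver's rq_next chain */
};

/* Pop the first entry off a list (roughly what rq_list_pop() does). */
static struct req *list_pop(struct req **list)
{
	struct req *r = *list;

	if (r)
		*list = r->next;
	return r;
}

/* Push an entry onto the head of a list (roughly what rq_list_add() does). */
static void list_push_head(struct req **list, struct req *r)
{
	r->next = *list;
	*list = r;
}

int main(void)
{
	struct req reqs[4] = { { .tag = 0 }, { .tag = 1 }, { .tag = 2 }, { .tag = 3 } };
	struct req *plug_list = NULL, *submit_list = NULL, *r;
	int i;

	/*
	 * Model the plug list: per the commit message, blk_mq_flush_plug_list
	 * hands the driver the requests in reverse submission order
	 * (tag 3 first, tag 0 last).
	 */
	for (i = 0; i < 4; i++)
		list_push_head(&plug_list, &reqs[i]);

	/*
	 * The pattern from the commit: pop from the plug list and push onto
	 * the head of a local submit list.  The two reversals cancel out,
	 * so the device sees tags 0, 1, 2, 3 in submission order again.
	 */
	while ((r = list_pop(&plug_list)))
		list_push_head(&submit_list, r);

	for (r = submit_list; r; r = r->next)
		printf("submit tag %d\n", r->tag);

	return 0;
}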
> > Christoph, can you look into it, please?
> >
> > My machine specs: https://linux-hardware.org/?probe=a4003b140c
> > Hiding under nvme1 is INTEL SSDPF2NV307TZ (ACV10340)
> >
> > # nvme smart-log /dev/nvme1
> > Smart Log for NVME device:nvme1 namespace-id:ffffffff
> > critical_warning : 0
> > temperature : 98 °F (310 K)
> > available_spare : 100%
> > available_spare_threshold : 10%
> > percentage_used : 1%
> > endurance group critical warning summary: 0
> > Data Units Read : 694928083 (355.80 TB)
> > Data Units Written : 861761625 (441.22 TB)
> > host_read_commands : 6389380673
> > host_write_commands : 11970151442
> > controller_busy_time : 147
> > power_cycles : 474
> > power_on_hours : 17268
> > unsafe_shutdowns : 307
> > media_errors : 0
> > num_err_log_entries : 111
> > Warning Temperature Time : 566
> > Critical Composite Temperature Time : 0
> > Thermal Management T1 Trans Count : 0
> > Thermal Management T2 Trans Count : 0
> > Thermal Management T1 Total Time : 0
> > Thermal Management T2 Total Time : 0
> >
> > I attached below build config and full kernel log.
> >
> > --
> > Best Regards,
> > Mike Gavrilov.
>
> Hi again.
> So no one looked at this?
> Unfortunately this issue is still happening in 6.14-rc1 :(
> I attached a fresh kernel log here.
>
> --
> Best Regards,
> Mike Gavrilov.
Hi,
Has anyone tried to dig into it yet?
I have again tested two commits, (beadf0088501) and (e559ee022658),
and confirmed that the nvme issue:
I/O tag 448 (d1c0) opcode 0x1 (I/O Cmd) QID 22 timeout, aborting
req_op:WRITE(1) size:131072
definitely happens after a short period of time (~3990 seconds) if the
kernel is built at commit (beadf0088501).
The problem is usually triggered by updating containers (podman pull ...).
Apparently this puts a lot of stress on my nvme SSD.
And if the kernel is built at commit (e559ee022658) it does not happen
even after 4 days of uptime (I checked it several times).
Bad news: on kernel 6.14-rc5 (7eb172143d55) it still happens, and
quite often.
Below I have attached three kernel logs:
one for commit (beadf0088501), the second for commit
(e559ee022658), and the third for 6.14-rc5 (7eb172143d55).
Please help fix it.
I'm ready to test any patches.
--
Best Regards,
Mike Gavrilov.
Attachment:
dmesg-6.12.0-rc4-10-beadf0088501d9dcf2454b05d90d5d31ea3ba55f-3.zip
Description: Zip archive
Attachment:
dmesg-6.12.0-rc4-11-e559ee022658c70bdc07c4846bf279f5a5abc494-4.zip
Description: Zip archive
Attachment:
dmesg-6.14.0-rc5-7eb172143d55.zip
Description: Zip archive