Re: Patch "nvme: re-read ANA log page after ns scan completes" causing regression

From: Hannes Reinecke
Date: Mon Apr 14 2025 - 07:10:12 EST


On 4/14/25 12:53, Aithal, Srikanth wrote:
Hello,

With the patch below in today's linux-next (next-20250414) and in v6.15-rc2, we are seeing host boot issues: a host with an NVMe disk just hangs on boot.

If we revert this patch or disable CONFIG_NVME_MULTIPATH, the host boots fine.

commit 62baf70c327444338c34703c71aa8cc8e4189bd6
Author: Hannes Reinecke <hare@xxxxxxxxxx>
Date:   Thu Apr 3 09:19:30 2025 +0200

    nvme: re-read ANA log page after ns scan completes

    When scanning for new namespaces we might have missed an ANA AEN.

    The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchronous
    Event Information - Notice': Asymmetric Namespace Access Change) states:

      A controller shall not send this event if an Attached Namespace
      Attribute Changed asynchronous event [...] is sent for the same event.

    so we need to re-read the ANA log page after we rescanned the namespace
    list to update the ANA states of the new namespaces.

    Signed-off-by: Hannes Reinecke <hare@xxxxxxxxxx>
    Reviewed-by: Keith Busch <kbusch@xxxxxxxxxx>
    Signed-off-by: Christoph Hellwig <hch@xxxxxx>


The host console starts dumping a lot of errors and the log grows to more than 100 MB, so I am not posting the full log. Here is an excerpt:
...
...
[   49.361223] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x1010
[   49.434564] nvme0n1: I/O Cmd(0x2) @ LBA 0, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
[   49.443123] I/O error, dev nvme0n1, sector 0 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[   49.457080] nvme nvme0: Failed to get ANA log: -4
[   49.506511] nvme nvme0: D3 entry latency set to 8 seconds
[   49.536300] nvme nvme0: 32/0/0 default/read/poll queues
[   49.605281] nvme 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0018 address=0x0 flags=0x0000]
[   80.081190] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x1010
[   80.154109] nvme0n1: I/O Cmd(0x2) @ LBA 128, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
[   80.162864] I/O error, dev nvme0n1, sector 128 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[   80.177032] nvme nvme0: Failed to get ANA log: -4
[   80.225460] nvme nvme0: D3 entry latency set to 8 seconds
[   80.255395] nvme nvme0: 32/0/0 default/read/poll queues
[   80.301278] nvme 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0018 address=0x0 flags=0x0000]
[  110.789207] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x1010
[  110.861990] nvme0n1: I/O Cmd(0x2) @ LBA 2048, 8 blocks, I/O Error (sct 0x3 / sc 0x71)
[  110.870842] I/O error, dev nvme0n1, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  110.885040] nvme nvme0: Failed to get ANA log: -4
[  110.933460] nvme nvme0: D3 entry latency set to 8 seconds
[  110.963447] nvme nvme0: 32/0/0 default/read/poll queues
[  111.009276] nvme 0000:41:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0018 address=0x0 flags=0x0000]
...
...


Can you try this?

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 78963cab1f74..425c00b02f3e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4455,7 +4455,7 @@ static void nvme_scan_work(struct work_struct *work)
 	if (test_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events))
 		nvme_queue_scan(ctrl);
 #if CONFIG_NVME_MULTIPATH
-	else
+	else if (ctrl->ana_log_buf)
 		/* Re-read the ANA log page to not miss updates */
 		queue_work(nvme_wq, &ctrl->ana_work);
 #endif
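
To illustrate the intent of the changed branch, here is a minimal user-space sketch. It is not kernel code: struct model_ctrl, enum next_step and after_scan() are hypothetical names made up for this example, and it assumes that ctrl->ana_log_buf stays NULL on controllers for which no ANA log buffer was allocated. Whether that is the actual cause of the hang is not confirmed here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pieces of controller state the branch looks at. */
struct model_ctrl {
	bool ns_changed_pending;	/* models NVME_AER_NOTICE_NS_CHANGED set in ctrl->events */
	void *ana_log_buf;		/* assumed NULL when no ANA log buffer was allocated */
};

/* What the end of the scan work would schedule next in this model. */
enum next_step {
	STEP_NONE,	/* nothing further to queue */
	STEP_RESCAN,	/* models nvme_queue_scan(ctrl) */
	STEP_ANA_WORK,	/* models queue_work(nvme_wq, &ctrl->ana_work) */
};

/* Mirrors the patched if / else if chain: only re-read the ANA log if a buffer exists. */
static enum next_step after_scan(const struct model_ctrl *ctrl)
{
	if (ctrl->ns_changed_pending)
		return STEP_RESCAN;
	else if (ctrl->ana_log_buf)
		return STEP_ANA_WORK;
	return STEP_NONE;
}
```

With the unpatched plain "else", the second case in this model would return STEP_ANA_WORK even when ana_log_buf is NULL; the added check turns that case into STEP_NONE.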

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich