Re: Fwd: Marvell 88SE6320 SAS controller (mvsas) cannot survive ACPI S3 or ACPI S4

From: Damien Le Moal
Date: Thu Oct 26 2023 - 04:56:21 EST


On 2023/10/26 17:25, Bagas Sanjaya wrote:
> Hi,
>
> I notice a bug report on Bugzilla [1]. Quoting from it:

[...]

>> [ 437.249448] PM: suspend entry (deep)
>> [ 437.255308] Filesystems sync: 0.005 seconds
>> [ 437.255570] Freezing user space processes
>> [ 437.257093] Freezing user space processes completed (elapsed 0.001 seconds)
>> [ 437.257097] OOM killer disabled.
>> [ 437.257098] Freezing remaining freezable tasks
>> [ 437.258226] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>> [ 437.258281] printk: Suspending console(s) (use no_console_suspend to debug)
>> [ 437.291778] sd 0:0:0:0: [sdb] Synchronizing SCSI cache
>> [ 437.291825] sd 0:0:1:0: [sdc] Synchronizing SCSI cache
>> [ 437.292083] sd 0:0:0:0: [sdb] Stopping disk
>> [ 437.292083] sd 0:0:1:0: [sdc] Stopping disk
>> [ 438.363660] sd 1:0:0:0: [sda] Synchronizing SCSI cache
>> [ 438.363760] sd 1:0:0:0: [sda] Stopping disk

Given this message, this does not look like the latest kernel.

>> [ 589.081341] drivers/scsi/mvsas/mv_sas.c 1304:mvs_I_T_nexus_reset for device[1]:rc= 0
>> [ 610.481270] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [ 610.481280] rcu: 11-...0: (0 ticks this GP) idle=4f84/1/0x4000000000000000 softirq=19873/19873 fqs=1159
>> [ 610.481292] (detected by 5, t=5252 jiffies, g=53581, q=31630 ncpus=12)
>> [ 610.481299] Sending NMI from CPU 5 to CPUs 11:
>> [ 610.481309] NMI backtrace for cpu 11
>> [ 610.481312] CPU: 11 PID: 3152 Comm: kworker/u32:59 Tainted: G I 6.1.57-vanilla #14
>> [ 610.481318] Hardware name: System manufacturer System Product Name/P6T WS PRO, BIOS 1205 09/24/2010
>> [ 610.481321] Workqueue: events_unbound async_run_entry_fn
>> [ 610.481329] RIP: 0010:mvs_int_rx+0x81/0x150 [mvsas]
>> [ 610.481346] Code: 00 00 44 39 75 70 74 47 48 8b 45 60 45 89 e6 41 81 e6 ff 03 00 00 41 8d 56 01 8b 1c 90 49 89 d4 41 89 df 41 81 e7 00 00 08 00 <f7> c3 00 00 01 00 74 58 31 d2 89 de 48 89 ef e8 0b f9 ff ff 45 85
>> [ 610.481350] RSP: 0018:ffffb61f06acbb60 EFLAGS: 00000046
>> [ 610.481354] RAX: ffff9a7cc2658000 RBX: 0000000000010000 RCX: 0000000000000000
>> [ 610.481358] RDX: 000000000000026e RSI: 0000000000010000 RDI: ffff9a7ce2660000
>> [ 610.481361] RBP: ffff9a7ce2660000 R08: ffff9a7ce2660f00 R09: ffff9a7ce2660000
>> [ 610.481364] R10: ffff9a7ce26600c8 R11: ffffffff884d4300 R12: 000000000000026e
>> [ 610.481367] R13: 0000000000000000 R14: 000000000000026d R15: 0000000000000000
>> [ 610.481371] FS: 0000000000000000(0000) GS:ffff9a7df7cc0000(0000) knlGS:0000000000000000
>> [ 610.481375] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 610.481378] CR2: 0000563633425300 CR3: 0000000077210006 CR4: 00000000000206e0
>> [ 610.481382] Call Trace:
>> [ 610.481385] <NMI>
>> [ 610.481389] ? nmi_cpu_backtrace.cold+0x1b/0x76
>> [ 610.481398] ? nmi_cpu_backtrace_handler+0xd/0x20
>> [ 610.481403] ? nmi_handle+0x5d/0x120
>> [ 610.481410] ? mvs_int_rx+0x81/0x150 [mvsas]
>> [ 610.481423] ? default_do_nmi+0x69/0x170
>> [ 610.481428] ? exc_nmi+0x13c/0x170
>> [ 610.481432] ? end_repeat_nmi+0x16/0x67
>> [ 610.481443] ? mvs_int_rx+0x81/0x150 [mvsas]
>> [ 610.481457] ? mvs_int_rx+0x81/0x150 [mvsas]
>> [ 610.481470] ? mvs_int_rx+0x81/0x150 [mvsas]
>> [ 610.481483] </NMI>
>> [ 610.481484] <TASK>
>> [ 610.481487] mvs_do_release_task+0x3f/0x90 [mvsas]
>> [ 610.481501] mvs_release_task+0x13e/0x1a0 [mvsas]
>> [ 610.481516] mvs_I_T_nexus_reset+0xb2/0xd0 [mvsas]
>> [ 610.481530] ? sas_ata_wait_after_reset+0x80/0x80 [libsas]
>> [ 610.481552] sas_ata_hard_reset+0x48/0x80 [libsas]
>> [ 610.481575] ata_eh_reset+0x2e5/0x1090 [libata]
>> [ 610.481631] ? sas_ata_wait_after_reset+0x80/0x80 [libsas]
>> [ 610.481652] ? sas_ata_wait_after_reset+0x80/0x80 [libsas]
>> [ 610.481676] ata_eh_recover+0x2e6/0xe00 [libata]
>> [ 610.481728] ? __wake_up_klogd.part.0+0x56/0x80
>> [ 610.481735] ? vprintk_emit+0x207/0x290
>> [ 610.481739] ? smp_ata_check_ready_type+0xb0/0xb0 [libsas]
>> [ 610.481760] ? sas_ata_wait_after_reset+0x80/0x80 [libsas]
>> [ 610.481783] ? smp_ata_check_ready_type+0xb0/0xb0 [libsas]
>> [ 610.481804] ? sas_ata_wait_after_reset+0x80/0x80 [libsas]
>> [ 610.481824] ata_do_eh+0x75/0xf0 [libata]
>> [ 610.481876] ? del_timer_sync+0x6f/0xb0
>> [ 610.481884] ata_scsi_port_error_handler+0x3a8/0x800 [libata]
>> [ 610.481938] async_sas_ata_eh+0x44/0x7f [libsas]
>> [ 610.481960] async_run_entry_fn+0x30/0x130
>> [ 610.481966] process_one_work+0x1c7/0x380
>> [ 610.481974] worker_thread+0x4d/0x380
>> [ 610.481981] ? rescuer_thread+0x3a0/0x3a0
>> [ 610.481987] kthread+0xe9/0x110
>> [ 610.481992] ? kthread_complete_and_exit+0x20/0x20
>> [ 610.481999] ret_from_fork+0x22/0x30
>> [ 610.482009] </TASK>
>> [ 665.286198] NMI watchdog: Watchdog detected hard LOCKUP on cpu 11
Could be due to the libata deadlock without the recent suspend/resume fixes. Or
this is yet another adapter that was not tested for suspend/resume. mpt3sas
crashes the machine 100% of the time as well. I had no time to dig into that issue.

>> Hibernate/resume with a sas disk and then running gnome disks for a couple of reads:
>>
>> [ 0.000000] Linux version 6.1.57-vanilla (turbo@dbn-003) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #14 SMP PREEMPT_DYNAMIC Sat Oct 14 11:48:05 EEST 2023
>> [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.1.57-vanilla root=/dev/mapper/test-root ro quiet
>>
>> 4.191667] scsi host0: mvsas
>> [ 4.202932] sas: phy-0:1 added to port-0:0, phy_mask:0x2 (2bad2bad2bad2bad)
>> [ 4.202935] drivers/scsi/mvsas/mv_sas.c 1099:set wide port phy map 2
>> [ 4.232802] sas: DOING DISCOVERY on port 0, pid:78
>> [ 4.234233] scsi 0:0:0:0: Direct-Access WD WD3001HKHG-02VUC SR07 PQ: 0 ANSI: 6
>> [ 4.234723] scsi 0:0:0:0: Power-on or device reset occurred
>> [ 4.243180] sas: DONE DISCOVERY on port 0, pid:78, result:0
>> [ 4.243958] sd 0:0:0:0: [sdb] 586072368 512-byte logical blocks: (300 GB/279 GiB)
>> [ 4.251366] sd 0:0:0:0: [sdb] Write Protect is off
>> [ 4.251372] sd 0:0:0:0: [sdb] Mode Sense: df 00 10 08
>> [ 4.256088] sd 0:0:0:0: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
>>
>> [ 331.841349] PM: hibernation: hibernation entry
>> [ 331.846068] Filesystems sync: 0.004 seconds
>> [ 331.846251] Freezing user space processes
>> [ 331.847662] Freezing user space processes completed (elapsed 0.001 seconds)
>> [ 331.847665] OOM killer disabled.
>> [ 331.847752] PM: hibernation: Marking nosave pages: [mem 0x00000000-0x00000fff]
>> [ 331.847754] PM: hibernation: Marking nosave pages: [mem 0x0009c000-0x000fffff]
>> [ 331.847757] PM: hibernation: Marking nosave pages: [mem 0xbf780000-0xffffffff]
>> [ 331.848818] PM: hibernation: Basic memory bitmaps created
>> [ 331.848868] PM: hibernation: Preallocating image memory
>> [ 332.320952] PM: hibernation: Allocated 468298 pages for snapshot
>> [ 332.320958] PM: hibernation: Allocated 1873192 kbytes in 0.47 seconds (3985.51 MB/s)
>> [ 332.320961] Freezing remaining freezable tasks
>> [ 332.322205] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>> [ 332.322378] printk: Suspending console(s) (use no_console_suspend to debug)
>> [ 334.826522] ACPI: PM: Preparing to enter system sleep state S4
>> [ 334.826773] ACPI: PM: Saving platform NVS memory
>> [ 334.826979] Disabling non-boot CPUs ...
>> [ 334.829452] smpboot: CPU 1 is now offline
>> [ 334.831537] smpboot: CPU 2 is now offline
>> [ 334.833750] smpboot: CPU 3 is now offline
>> [ 334.835897] smpboot: CPU 4 is now offline
>> [ 334.838296] smpboot: CPU 5 is now offline
>> [ 334.840868] smpboot: CPU 6 is now offline
>> [ 334.842966] smpboot: CPU 7 is now offline
>> [ 334.845060] smpboot: CPU 8 is now offline
>> [ 334.847030] smpboot: CPU 9 is now offline
>> [ 334.848955] smpboot: CPU 10 is now offline
>> [ 334.850837] smpboot: CPU 11 is now offline
>> [ 334.851340] PM: hibernation: Creating image:
>> [ 335.133820] PM: hibernation: Need to copy 480821 pages
>> [ 335.133826] PM: hibernation: Normal pages needed: 480821 + 1024, available pages: 1613904
>> [ 334.852388] ACPI: PM: Restoring platform NVS memory
>> [ 334.852777] Enabling non-boot CPUs ...
>> [ 334.852817] x86: Booting SMP configuration:
>> [ 334.852818] smpboot: Booting Node 0 Processor 1 APIC 0x2
>> [ 334.856338] CPU1 is up
>> [ 334.856369] smpboot: Booting Node 0 Processor 2 APIC 0x4
>> [ 334.859345] CPU2 is up
>> [ 334.859375] smpboot: Booting Node 0 Processor 3 APIC 0x10
>> [ 334.862383] CPU3 is up
>> [ 334.862412] smpboot: Booting Node 0 Processor 4 APIC 0x12
>> [ 334.865376] CPU4 is up
>> [ 334.865407] smpboot: Booting Node 0 Processor 5 APIC 0x14
>> [ 334.868395] CPU5 is up
>> [ 334.868423] smpboot: Booting Node 0 Processor 6 APIC 0x1
>> [ 334.871618] CPU6 is up
>> [ 334.871649] smpboot: Booting Node 0 Processor 7 APIC 0x3
>> [ 334.874844] CPU7 is up
>> [ 334.874878] smpboot: Booting Node 0 Processor 8 APIC 0x5
>> [ 334.878160] CPU8 is up
>> [ 334.878189] smpboot: Booting Node 0 Processor 9 APIC 0x11
>> [ 334.881557] CPU9 is up
>> [ 334.881586] smpboot: Booting Node 0 Processor 10 APIC 0x13
>> [ 334.884946] CPU10 is up
>> [ 334.884974] smpboot: Booting Node 0 Processor 11 APIC 0x15
>> [ 334.888415] CPU11 is up
>> [ 334.901008] ACPI: PM: Waking up from system sleep state S4
>> [ 335.001313] usb usb4: root hub lost power or was reset
>> [ 335.001314] usb usb2: root hub lost power or was reset
>> [ 335.001314] usb usb5: root hub lost power or was reset
>> [ 335.001365] usb usb1: root hub lost power or was reset
>> [ 335.001372] usb usb6: root hub lost power or was reset
>> [ 335.001375] usb usb7: root hub lost power or was reset
>> [ 335.001428] usb usb8: root hub lost power or was reset
>> [ 335.001449] usb usb3: root hub lost power or was reset
>> [ 335.297280] usb 3-4: reset high-speed USB device number 2 using ehci-pci
>> [ 335.537215] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
>> [ 335.549624] ata1.00: configured for UDMA/133
>> [ 335.549810] ata1.00: Enabling discard_zeroes_data
>> [ 335.769243] usb 3-4.3: reset full-speed USB device number 3 using ehci-pci
>> [ 335.957245] usb 3-4.4: reset full-speed USB device number 4 using ehci-pci
>> [ 336.385247] usb 3-4.4.1: reset full-speed USB device number 5 using ehci-pci
>> [ 336.539864] PM: hibernation: Basic memory bitmaps freed
>> [ 336.540367] OOM killer enabled.
>> [ 336.540368] Restarting tasks ...
>> [ 336.541593] done.
>> [ 336.541600] PM: hibernation: hibernation exit
>> [ 336.636953] RTL8211B Gigabit Ethernet r8169-0-800:00: attached PHY driver (mii_bus:phy_addr=r8169-0-800:00, irq=MAC)
>> [ 336.703336] r8169 0000:08:00.0 enp8s0: Link is Down
>> [ 336.704991] RTL8211B Gigabit Ethernet r8169-0-900:00: attached PHY driver (mii_bus:phy_addr=r8169-0-900:00, irq=MAC)
>> [ 336.772497] r8169 0000:09:00.0 enp9s0: Link is Down
>>
>> [ 463.977280] sas: Enter sas_scsi_recover_host busy: 3 failed: 3
>> [ 463.977288] sas: trying to find task 0x00000000f13c22cf
>> [ 463.977292] sas: sas_scsi_find_task: aborting task 0x00000000f13c22cf
>> [ 463.977416] sas: task done but aborted
>> [ 463.977422] sas: sas_scsi_find_task: task 0x00000000f13c22cf is done
>> [ 463.977425] sas: sas_eh_handle_sas_errors: task 0x00000000f13c22cf is done
>> [ 463.977429] sas: trying to find task 0x0000000089d39e25
>> [ 463.977431] sas: sas_scsi_find_task: aborting task 0x0000000089d39e25
>> [ 472.026033] sas: task done but aborted
>> [ 472.026040] sas: sas_scsi_find_task: task 0x0000000089d39e25 is done
>> [ 472.026044] sas: sas_eh_handle_sas_errors: task 0x0000000089d39e25 is done
>> [ 472.026048] sas: trying to find task 0x000000002d226601
>> [ 472.026050] sas: sas_scsi_find_task: aborting task 0x000000002d226601
>> [ 492.601193] sas: TMF task timeout for 2bad2bad2bad2bad and not done
>> [ 492.601204] drivers/scsi/mvsas/mv_sas.c 1410:mvs_abort_task:rc= 5
>> [ 492.601208] sas: sas_scsi_find_task: querying task 0x000000002d226601
>> [ 513.081228] sas: TMF task timeout for 2bad2bad2bad2bad and not done
>> [ 513.081241] drivers/scsi/mvsas/mv_sas.c 1340:mvs_query_task:rc= 5
>> [ 513.081247] sas: sas_scsi_find_task: task 0x000000002d226601 failed to abort
>> [ 513.081252] sas: task 0x000000002d226601 is not at LU: I_T recover
>> [ 513.081255] sas: I_T nexus reset for dev 2bad2bad2bad2bad
>> [ 513.577383] drivers/scsi/mvsas/mv_sas.c 1690:Release slot [5] tag[5], task [000000002d226601]:
>> [ 513.577396] sas: task done but aborted
>> [ 513.577398] drivers/scsi/mvsas/mv_sas.c 1690:Release slot [1] tag[1], task [00000000d0d28957]:
>> [ 515.321219] drivers/scsi/mvsas/mv_sas.c 1304:mvs_I_T_nexus_reset for device[0]:rc= 0
>> [ 515.321226] sas: I_T 2bad2bad2bad2bad recovered
>> [ 515.321243] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 3 tries: 1
>> [ 515.337273] sd 0:0:0:0: [sdb] tag#8 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=81s
>> [ 515.337281] sd 0:0:0:0: [sdb] tag#8 CDB: Read(10) 28 00 11 77 78 98 00 04 00 00
>> [ 515.337283] I/O error, dev sdb, sector 293042328 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
>> [ 515.337299] sd 0:0:0:0: [sdb] tag#9 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=81s
>> [ 515.337303] sd 0:0:0:0: [sdb] tag#9 CDB: Read(10) 28 00 11 77 7c 98 00 04 00 00
>> [ 515.337305] I/O error, dev sdb, sector 293043352 op 0x0:(READ) flags 0x4000 phys_seg 8 prio class 2
>> [ 515.337315] sd 0:0:0:0: [sdb] tag#10 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=81s
>> [ 515.337318] sd 0:0:0:0: [sdb] tag#10 CDB: Read(10) 28 00 11 77 80 98 00 04 00 00
>> [ 515.337320] I/O error, dev sdb, sector 293044376 op 0x0:(READ) flags 0x4000 phys_seg 9 prio class 2
>> [ 515.710666] drivers/scsi/mvsas/mv_sas.c 1099:set wide port phy map 2
>> [ 515.710714] sas: sas_form_port: phy1 belongs to port0 already(1)!
>> [ 519.721300] drivers/scsi/mvsas/mv_sas.c 1690:Release slot [1] tag[1], task [00000000d0d28957]:
>> [ 521.822694] drivers/scsi/mvsas/mv_sas.c 1099:set wide port phy map 2
>> [ 521.822702] sas: sas_form_port: phy1 belongs to port0 already(1)!

This one is not libata :) So the adapter driver itself may have an issue.

--
Damien Le Moal
Western Digital Research