[bisected] 5.12-rc1 hpsa regression: "scsi: hpsa: Correct dev cmds outstanding for retried cmds" breaks hpsa P600

From: Sergei Trofimovich
Date: Wed Mar 03 2021 - 06:02:28 EST


On Tue, 2 Mar 2021 23:31:32 +0100
John Paul Adrian Glaubitz <glaubitz@xxxxxxxxxxxxxxxxxxx> wrote:

> Hi Sergei!
>
> On 3/2/21 11:26 PM, Sergei Trofimovich wrote:
> > Gave v5.12-rc1 a try today and got a similar boot failure around
> > hpsa queue initialization, but my failure is later:
> > https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1
> > Maybe I get a different error because I flipped on most debugging
> > kernel options :)
> >
> > Looks like the 'ERROR: Invalid distance value range' messages, while
> > very scary, are harmless. It's just a new spammy way for the kernel
> > to report the lack of NUMA config on the machine (no SRAT and SLIT
> > ACPI tables).
> >
> > At least I get hpsa detected on the PCI bus. But I guess its discovered
> > configuration is very wrong, as I get unaligned accesses:
> > [ 19.811570] kernel unaligned access to 0xe000000105dd8295, ip=0xa000000100b874d1
> >
> > Bisecting now.
>
> Sounds good. I guess we should get Jens' fix for the signal regression
> merged as well as your two fixes for strace.

I "bisected" (cheating halfway through) and verified that reverting
f749d8b7a9896bc6e5ffe104cc64345037e0b152 makes the rx3600 boot again.

CCing authors who might be able to help us here.

commit f749d8b7a9896bc6e5ffe104cc64345037e0b152
Author: Don Brace <don.brace@xxxxxxxxxxxxx>
Date: Mon Feb 15 16:26:57 2021 -0600

scsi: hpsa: Correct dev cmds outstanding for retried cmds

Prevent incrementing device->commands_outstanding for ioaccel command
retries that are driver initiated. If the command goes through the retry
path, the device->commands_outstanding counter has already accounted for
the number of commands outstanding to the device. Only commands going
through function hpsa_cmd_resolve_events decrement this counter.

- ioaccel commands go to either HBA disks or to logical volumes comprised
of SSDs.

The extra increment is causing device resets to hang.

- Resets wait for all device outstanding commands to complete before
returning.

Replace unused field abort_pending with retry_pending. This is a
maintenance driver so these changes have the least impact/risk.

Link: https://lore.kernel.org/r/161342801747.29388.13045495968308188518.stgit@brunhilda
Tested-by: Joe Szczypek <jszczype@xxxxxxxxxx>
Reviewed-by: Scott Benesh <scott.benesh@xxxxxxxxxxxxx>
Reviewed-by: Scott Teel <scott.teel@xxxxxxxxxxxxx>
Reviewed-by: Tomas Henzl <thenzl@xxxxxxxxxx>
Signed-off-by: Don Brace <don.brace@xxxxxxxxxxxxx>
Signed-off-by: Martin K. Petersen <martin.petersen@xxxxxxxxxx>
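
To make the accounting described in the commit message concrete, here is
a minimal sketch of the idea as I read it. It is not the driver's code:
all names (sketch_dev, sketch_cmd, sketch_submit and friends) are
hypothetical stand-ins, and where the flag gets cleared is my assumption.
It only models "count a command once, skip the increment on a
driver-initiated retry, decrement on completion only".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration, NOT the real hpsa structures. */
struct sketch_dev {
    int commands_outstanding;   /* commands believed to be in flight */
};

struct sketch_cmd {
    struct sketch_dev *dev;
    bool retry_pending;         /* set when the driver itself resubmits */
};

/* Submit path: count the command on its first submission only. */
static void sketch_submit(struct sketch_cmd *c)
{
    if (!c->retry_pending)
        c->dev->commands_outstanding++;
    c->retry_pending = false;   /* assumption: flag is consumed here */
}

/* Driver-initiated retry: mark the command, then resubmit it. */
static void sketch_retry(struct sketch_cmd *c)
{
    c->retry_pending = true;
    sketch_submit(c);
}

/* Completion path: the only place the counter is decremented. */
static void sketch_complete(struct sketch_cmd *c)
{
    assert(c->dev->commands_outstanding > 0);
    c->dev->commands_outstanding--;
}

/* A reset can only finish once nothing is outstanding. */
static bool sketch_reset_can_finish(const struct sketch_dev *d)
{
    return d->commands_outstanding == 0;
}

/* One submit/retry/complete cycle; returns the resulting counter. */
static int sketch_run_cycle(void)
{
    struct sketch_dev dev = { 0 };
    struct sketch_cmd cmd = { .dev = &dev, .retry_pending = false };

    sketch_submit(&cmd);
    sketch_retry(&cmd);
    sketch_complete(&cmd);
    return dev.commands_outstanding;
}
```

Without the retry_pending check in the submit path, the same cycle would
leave the counter at 1, and a reset waiting for it to reach zero would
never finish. That matches the hang the commit message describes.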

Don, do you happen to know why this patch causes a controller init failure
on the following device?
14:01.0 RAID bus controller: Hewlett-Packard Company Smart Array P600

Boot failure: https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1
Boot success: https://dev.gentoo.org/~slyfox/configs/guppy-dmesg-5.12-rc1-good

The only difference between the two boots is that the -good case has
f749d8b7a9896bc6e5ffe104cc64345037e0b152 reverted on top of 5.12-rc1.

It looks like the hpsa controller fails to initialize in the bad case
(could it be a race?).

--

Sergei