Re: [PATCH 00/16] scsi: libsas and users: Factor out LLDD TMF code

From: John Garry
Date: Fri Feb 04 2022 - 05:36:58 EST


On 04/02/2022 03:02, Damien Le Moal wrote:
The inconsistency is this line says:
[ 137.193944] pm80xx0:: pm80xx_chip_sata_req 4581:no data
Which seems to be sensical for NCQ_NON_DATA command, but then, this line
seems wrong:
[ 137.228015] pm80xx0:: mpi_sata_completion 2515:FPDMA len = 8

I need to go and check the specs what the FIS reply format is for
NCQ_NON_DATA.


[ 137.187184] pm80xx0:: pm8001_queue_command 408:pm8001_task_exec device
[ 137.193944] pm80xx0:: pm80xx_chip_sata_req 4581:no data
[ 137.199339] pm80xx0:: pm80xx_chip_sata_req 4682:Sending Normal SATA
command 0x63 inb 4
[ 137.207577] pm80xx0:: pm8001_mpi_msg_consume 1446:: CI=46 PI=47
msgHeader=8104200d
[ 137.215399] pm80xx0:: mpi_sata_completion 2481:IO_SUCCESS
[ 137.220961] pm80xx0:: mpi_sata_completion 2503:SAS_PROTO_RESPONSE
len = 20
[ 137.228015] pm80xx0:: mpi_sata_completion 2515:FPDMA len = 8
[ 137.233878] pm80xx0:: pm8001_mpi_msg_free_set 1403: CI=47 PI=47
[ 137.236696] pm80xx0:: pm8001_queue_command 408:pm8001_task_exec device
[ 137.247102] pm80xx0:: pm80xx_chip_sata_req 4585:DMA
[ 137.252186] pm80xx0:: pm80xx_chip_sata_req 4593:FPDMA
[ 137.257400] pm80xx0:: pm80xx_chip_sata_req 4682:Sending Normal SATA
command 0x65 inb f
[ 167.506280] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 167.512363] sas: sas_scsi_find_task: aborting task 0x00000000aa372627
[ 167.519049] pm80xx0:: pm8001_chip_abort_task 4607:cmd_tag = 2, abort
task tag = 0x1
[ 187.969173] pm80xx0:: pm8001_exec_internal_task_abort 753:TMF task
timeout.

As I mentioned, having this fail is a red flag. If I was pushed to guess what has happened, I'd say the FW is faulting due to some erroneous driver behaviour.

[ 187.976450] sas: sas_scsi_find_task: task 0x00000000aa372627 is aborted
[ 187.983244] sas: sas_eh_handle_sas_errors: task 0x00000000aa372627 is
aborted




After these messages, the tests exit on failure (drive dropped) and
there are no more messages. Doing rmmod or anything else get stuck too.
I have to reset the machine to get back to a good state.

I am starting to think that NCQ NON DATA command is being very badly
handled... Switching the tests to run with all non-NCQ commands is
working fine, albeit horribly slow (much slower than with other HBAs,
e.g. Broadcom).

Digging...
I missed a KASAN splat during device scan on boot:

33.725184]
==================================================================
[ 33.746168] BUG: KASAN: use-after-free in __lock_acquire+0x41a5/0x5b50
[ 33.764181] Read of size 8 at addr ffff88818a318660 by task
kworker/u64:6/583

...

==================================================================

This is the submission path, not completion. The code is:

(gdb) list *(pm8001_queue_command+0x842)
0x3d42 is in pm8001_queue_command (drivers/scsi/pm8001/pm8001_sas.c:491).
486 atomic_dec(&pm8001_dev->running_req);
487 goto err_out_tag;
488 }
489 /* TODO: select normal or high priority */
490 spin_lock(&t->task_state_lock);
491 t->task_state_flags |= SAS_TASK_AT_INITIATOR;
492 spin_unlock(&t->task_state_lock);
493 } while (0);
494 rc = 0;
495 goto out_done;

So the task is already completed when the submission path tries to set
the state flag ? Debugging...

Yeah, that's how it looks.

I already mentioned this problem here:

https://lore.kernel.org/linux-scsi/0cc0c435-b4f2-9c76-258d-865ba50a29dd@xxxxxxxxxx/

Maybe we should just fix it now to rule it out of possibly causing other issues... I was reluctant to fix it as many places seems to need to be touched. Let me check it.

Thanks,
John