[ 385.102073] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 385.108026] sas: sas_scsi_find_task: aborting task 0x000000007068ed73
[ 405.561099] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
[ 405.568236] sas: sas_scsi_find_task: task 0x000000007068ed73 is aborted
[ 405.574930] sas: sas_eh_handle_sas_errors: task 0x000000007068ed73 is aborted
[ 411.192602] ata21.00: qc timeout (cmd 0xec)
[ 431.672122] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
[ 431.679282] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 431.685544] ata21.00: revalidation failed (errno=-5)
[ 441.911948] ata21.00: qc timeout (cmd 0xec)
[ 462.391545] pm80xx0:: pm8001_exec_internal_task_abort 757:TMF task timeout.
[ 462.398696] ata21.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 462.404992] ata21.00: revalidation failed (errno=-5)
[ 492.598769] ata21.00: qc timeout (cmd 0xec)
...
This did not help. Still seeing 100% reproducible hangs.
- res = -TMF_RESP_FUNC_FAILED;
+ res = TMF_RESP_FUNC_FAILED;
That's effectively the same as what I have in this series in
sas_execute_tmf().
However, your testing is with a SATA device, which I'll check further.
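
To make the sign issue in the hunk above concrete, here is a minimal
userspace sketch. It assumes TMF_RESP_FUNC_FAILED is 0x05 (my reading of
include/scsi/sas.h, please verify against your tree). The helpers
tmf_failed() and looks_like_errno() are hypothetical and are not copied
from libsas or pm80xx; they only show that a negated TMF response code no
longer matches the positive constant and instead reads as an errno-style
value.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, as I understand them from include/scsi/sas.h. */
#define TMF_RESP_FUNC_COMPLETE 0x00
#define TMF_RESP_FUNC_FAILED   0x05

/* Hypothetical caller-side check against the positive TMF code. */
bool tmf_failed(int res)
{
	return res == TMF_RESP_FUNC_FAILED;
}

/* Hypothetical check for "this looks like a negative errno". */
bool looks_like_errno(int res)
{
	return res < 0;
}

/* Self-check: the two spellings of the result are classified differently. */
void tmf_sign_selfcheck(void)
{
	int before = -TMF_RESP_FUNC_FAILED;	/* old code: res = -TMF_RESP_FUNC_FAILED */
	int after = TMF_RESP_FUNC_FAILED;	/* new code: res = TMF_RESP_FUNC_FAILED */

	assert(!tmf_failed(before) && looks_like_errno(before));
	assert(tmf_failed(after) && !looks_like_errno(after));
}
```

Calling tmf_sign_selfcheck() from a test main() should pass silently
under these assumptions. Whether a given caller in the real code is
affected depends on how it compares the result.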
I did a lot of testing/digging today. At random, a task times out
because its completion never comes. The subsequent attempts to abort the
task fail, revalidation fails, and the device is dropped (capacity goes
to 0). At that point, doing rmmod/modprobe to reset the device does not
work: the sync cache command issued at rmmod time never completes. I end
up needing to power cycle the machine every time...
No clue about the root cause yet, but it definitely seems to be related
to NCQ/high QD operation. If I force my tests to use non-NCQ commands,
everything is fine and the tests run to completion without any issue.
I wonder if there is a tag management bug somewhere...
I did stumble on something very ugly in libsas too: sas_ata_qc_issue()
drops and retakes the ata port lock. No other ATA driver does that, since
the ata completion path also takes that lock. The ata port lock is taken
with IRQs disabled (spin_lock_irqsave()) before ata_qc_issue() is called.
So doing a spin_unlock()/spin_lock() in sas_ata_qc_issue() (called from
ata_qc_issue()) seems like a very bad idea. I removed that and
everything works the same way (the LLD execute does not sleep). But that
did not solve the hang problem.
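
As an illustration of why dropping the port lock inside the issue path
worries me, here is a toy, single-threaded model. It is not the libsas
code: struct model_port, model_try_complete() and model_issue() are all
made-up names, and the "racing completion" is simulated by a direct call.
It only shows that state the caller expects to be stable under the lock
can change in the unlock/lock window.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an ATA port: one lock flag and one in-flight command. */
struct model_port {
	bool lock_held;
	bool qc_active;		/* state that the lock is supposed to protect */
};

/* Simulated completion path: it can only make progress if it gets the lock. */
bool model_try_complete(struct model_port *ap)
{
	assert(ap);
	if (ap->lock_held)
		return false;	/* a real completion would spin here */
	ap->lock_held = true;
	ap->qc_active = false;	/* command completed behind the issuer's back */
	ap->lock_held = false;
	return true;
}

/*
 * Simulated issue path, entered with the lock held by the caller.
 * Returns true if the command is still active when the issue path is done.
 */
bool model_issue(struct model_port *ap, bool drop_lock)
{
	bool still_active;

	assert(ap);
	ap->lock_held = true;		/* caller took the lock */
	ap->qc_active = true;

	if (drop_lock)
		ap->lock_held = false;	/* models spin_unlock() in the issue path */

	model_try_complete(ap);		/* completion arrives during issue */

	if (drop_lock)
		ap->lock_held = true;	/* models spin_lock() to retake it */

	still_active = ap->qc_active;
	ap->lock_held = false;		/* caller releases the lock */
	return still_active;
}
```

In this model, model_issue(&p, false) returns true (the completion is
held off), while model_issue(&p, true) returns false (the command was
completed inside the window). Real hardware timing decides whether that
window is ever hit, so this says nothing about the hang itself.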
Of note: this is all with your libsas patches applied. Without the
patches, I have KASAN screaming at me about use-after-free in completion
context. With your patches, KASAN is silent.
Another thing: this driver does not allow changing the max qd... Very
annoying.
echo 1 > /sys/block/sdX/device/queue_depth
has no effect. QD stays at 32 for an ATA drive. Need to look into that too.
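
For scripting the check above, a small write-then-read-back helper can
make the "no effect" case explicit. This is only a sketch:
set_and_read_depth() is a hypothetical helper name, and the path is
whatever /sys/block/sdX/device/queue_depth resolves to on the test
machine. It returns the value read back after the write, so the caller
can compare it with what was requested.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write `depth` to the attribute file at `path`, then read the value back.
 * Returns the value read back, or -1 if the file cannot be opened or parsed.
 * A write error is deliberately not fatal: the point is to see which value
 * is in effect afterwards.
 */
int set_and_read_depth(const char *path, int depth)
{
	FILE *f;
	int val = -1;

	assert(path);

	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", depth);
	fclose(f);		/* buffered write errors would surface here */

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);

	return val;
}
```

If set_and_read_depth(path, 1) returns something other than 1 (32 in my
case), the driver ignored or overrode the requested depth.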