On 6/15/21 9:00 PM, Can Guo wrote:
> I would like to stick to my way as of now because
>
> 1. Merely preventing task abort cannot prevent suspend/resume failures.
> Task abort (to PM requests), in real cases, is just one of many kinds
> of failure which can fail the suspend/resume callbacks. During
> suspend/resume, if AH8 errors and/or UIC errors happen, the IRQ handler
> may complete the SSU cmd with errors and schedule the error handler (I've
> seen such scenarios in real customer cases). My idea is to treat task
> abort (to PM requests) as a failure (let scsi_execute() return with
> whatever error) and let the error handler recover everything just like
> any other UFS errors which invoke the error handler. In case this, again,
> goes back to the question of why we don't just do error recovery in
> suspend/resume, let me paste my previous reply here -

Does this mean that the IRQ handler can complete an SSU command with an
error and that the error handler can later recover from that error?
That sounds completely wrong to me. The IRQ handler should never complete any
command with an error if that error could be recoverable. Instead, the
IRQ handler should add that command to a list and leave it to the error
handler to fail that command or to retry it.
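
To illustrate what I mean, here is a minimal, self-contained sketch. It is
not ufshcd code: struct cmd, struct host, irq_complete() and error_handler()
are hypothetical names made up for this example, and locking is left out.
The IRQ path only completes commands that succeeded; anything that hit an
error is parked on a list, and the error handler later decides whether to
retry or fail it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real ufshcd structures. */
enum cmd_state { CMD_IN_FLIGHT, CMD_DONE_OK, CMD_DONE_FAILED, CMD_DEFERRED };

struct cmd {
	enum cmd_state state;
	int retries_left;
	struct cmd *next;	/* link in the deferred list */
};

struct host {
	struct cmd *deferred;	/* commands waiting for the error handler */
};

/*
 * IRQ path: complete a command only if it succeeded. A command that hit
 * an error is not completed here; it is handed to the error handler.
 */
static void irq_complete(struct host *h, struct cmd *c, bool hw_error)
{
	if (!hw_error) {
		c->state = CMD_DONE_OK;
		return;
	}
	c->state = CMD_DEFERRED;
	c->next = h->deferred;
	h->deferred = c;
}

/*
 * Error handler: runs after recovery has been attempted and decides the
 * fate of every deferred command. Returns the number of commands that
 * were marked for resubmission.
 */
static int error_handler(struct host *h, bool recovered)
{
	struct cmd *c = h->deferred;
	int requeued = 0;

	h->deferred = NULL;
	while (c) {
		struct cmd *next = c->next;

		c->next = NULL;
		if (recovered && c->retries_left > 0) {
			c->retries_left--;
			c->state = CMD_IN_FLIGHT;	/* to be resubmitted */
			requeued++;
		} else {
			c->state = CMD_DONE_FAILED;	/* only now fail it */
		}
		c = next;
	}
	return requeued;
}
```

With this split, the submitter of an SSU command never sees an error that
the error handler could still have recovered from.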

> 2. And say we want the SCSI layer to resubmit PM requests to prevent
> suspend/resume failures, we should keep retrying the PM requests (so
> long as the error handler can recover everything successfully), meaning
> we should give them unlimited retries (which I think is a bad idea),
> otherwise (if they have zero retries or limited retries), in extreme
> conditions, what may happen is that the error handler can recover everything
> successfully every time, but all these retries (say 3) still time out,
> which blocks power management for too long (retries * 60 seconds) and,
> most importantly, when the last retry times out, the SCSI layer will anyway
> complete the PM request (even if we return DID_IMM_RETRY), then we end up
> in the same place - suspend/resume will run concurrently with the error
> handler and we cannot recover from the saved PM errors.

Hmm ... it is not clear to me why this behavior is considered a problem?
What is wrong with blocking RPM while a START STOP UNIT command is being
processed? If there are UFS devices for which it takes a long time to process
that command I think it is up to the vendors of these devices to fix
these UFS devices.
Additionally, if a UFS device needs more than (retries * 60 seconds) to
process a START STOP UNIT command, shouldn't it be marked as broken?
Thanks,
Bart.