Re: [PATCH v5 0/7] libsas and drivers: NCQ error handling

From: John Garry
Date: Thu Oct 06 2022 - 04:33:55 EST


On 05/10/2022 23:42, Damien Le Moal wrote:
Hello Damien,

John explained that he got a timeout from EH when reading the log:
[ 350.281581] ata1: failed to read log page 10h (errno=-5)
[ 350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen

ata_eh_read_log_10h() uses ata_read_log_page(), which will first try to read
the log using READ LOG DMA EXT. If that fails, it will retry using READ LOG EXT.
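
To make the fallback order described above concrete, here is a toy model in Python. It is not the libata code: read_log_page() and the issue() callback are hypothetical names, and the only behaviour modelled is "try READ LOG DMA EXT first, then READ LOG EXT".

```python
from typing import Callable, List, Tuple

READ_LOG_DMA_EXT = "READ LOG DMA EXT"
READ_LOG_EXT = "READ LOG EXT"


def read_log_page(issue: Callable[[str], bool]) -> Tuple[bool, List[str]]:
    """Toy model of the fallback: DMA variant first, then the PIO variant.

    `issue` is a hypothetical callback that sends one command and returns
    True on success. The return value is (success, commands attempted).
    """
    attempts: List[str] = []
    for command in (READ_LOG_DMA_EXT, READ_LOG_EXT):
        attempts.append(command)
        if issue(command):
            return True, attempts
    # Both variants failed; the caller sees a single failure.
    return False, attempts


# A device where only the PIO variant works: both commands get attempted.
ok, tried = read_log_page(lambda cmd: cmd == READ_LOG_EXT)
print(ok, tried)
```

In this model a "failed to read log page" result means both variants were attempted and both failed.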

Therefore, to see if this is a driver-specific bug, I suggested trying to read
the NCQ Command Error log using ATA16 passthrough commands:

$ sudo sg_sat_read_gplog -d --log=0x10 /dev/sdc
will read the log using READ LOG DMA EXT.

$ sudo sg_sat_read_gplog --log=0x10 /dev/sdc
will read the log using READ LOG EXT.

Note that I can't get a distro to boot on this system from the HDD because of the same timeout problem (so no tools are easily available).


Neither of these two suggested commands is an NCQ command.
(Neither command is encapsulated in a RECEIVE FPDMA QUEUED,
so I'm not sure what you mean.)


Garry, I now see that:
[ 350.577181] ata1.00: exception Emask 0x1 SAct 0xffffffff SErr 0x0 action 0x6 frozen
Your port is frozen.

ata_read_log_page() calls ata_exec_internal() which calls ata_exec_internal_sg(),
which will simply return an error without sending down the command to the drive,
if the port is frozen.
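
The frozen-port behaviour described above can be sketched the same way. This is a simplified Python model, not the kernel implementation: the Port class and exec_internal() are hypothetical stand-ins, and the only point illustrated is that a frozen port fails the request before anything reaches the drive.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Port:
    """Hypothetical stand-in for a port: a frozen flag plus a send log."""
    frozen: bool = False
    sent: List[str] = field(default_factory=list)


def exec_internal(port: Port, command: str) -> bool:
    """Return False immediately if the port is frozen, without sending."""
    if port.frozen:
        return False  # error path: the drive never sees the command
    port.sent.append(command)
    return True


frozen_port = Port(frozen=True)
result = exec_internal(frozen_port, "READ LOG EXT")
print(result, frozen_port.sent)
```

So on a frozen port, a failure to read the log says nothing about how the drive itself would have answered the command.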

Not sure why your port is frozen, mine is obviously not.

I think that it gets frozen when the internal command for read log ext times out. More below about that timeout.


ata_do_link_abort() calls ata_eh_set_pending() without activating fast drain:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-eh.c?h=v6.0#n989

So I'm not sure why your port is frozen.
(The fast drain timer does freeze the port, but it shouldn't be enabled.)
It might be worthwhile to see who freezes the port in your case.
Might come from the command timeout. John has had many problems with the
pm80xx HBA in his Arm machine from a while back. Likely not a driver issue
but a hw one... No-one seems to be able to recreate the same problem.

We need to try the HBA on our Arm board to see what happens.


Yeah, it just looks to be the longstanding issue of using this card on my arm64 machine, namely that I get IO timeouts quite regularly. I should have mentioned that yesterday. This just seems to be a driver issue.

Interestingly, this read log ext always seems to time out, so maybe I could see if there is anything specific about this command which could give a clue to the underlying issue. But I have spent much time trying to debug this issue, so I'm not too motivated any more if I'm completely honest ...

Thanks,
John