Re: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests

From: Niklas Cassel
Date: Thu Sep 07 2023 - 17:03:09 EST

Next message: Ian Rogers: "[PATCH v1] perf parse-events: Avoid erange from hex numbers"
Previous message: Fan Ni: "Re: [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD)"
In reply to: kernel test robot: "Re: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Sep 07, 2023 at 03:43:19PM +0800, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed "kernel_BUG_at_drivers/ata/libata-sff.c" on:
>
> commit: d3d099d5c2dd38db84abd96df39f9f0828c16b7b ("[PATCH v4] ata: libata-eh: Honor all EH scheduling requests")
> url: https://github.com/intel-lab-lkp/linux/commits/linan666-huaweicloud-com/ata-libata-eh-Honor-all-EH-scheduling-requests/20230906-164907
> base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 65d6e954e37872fd9afb5ef3fc0481bb3c2f20f4
> patch link: https://lore.kernel.org/all/20230906084212.1016634-1-linan666@xxxxxxxxxxxxxxx/
> patch subject: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests
>
> in testcase: boot
>
> compiler: gcc-12
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

Unfortunately the problem reported by the kernel test robot is very real.
I could reproduce without too much effort in QEMU.

The problem is basically that we cannot simply perform a host_eh_scheduled--;
in ata_std_end_eh().

ata_std_end_eh() is called at the end of ata_scsi_port_error_handler(),
so it is called once every time ata_scsi_port_error_handler() is called.

However, ata_scsi_port_error_handler() will be called by SCSI EH each
time SCSI wakes up.

SCSI EH will sleep as long as:
if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
shost->host_failed != scsi_host_busy(shost)) {
schedule();
continue;
}

The methods in libata which we use to trigger EH are:

1) ata_std_sched_eh(), which calls scsi_schedule_eh(), which does
host_eh_scheduled++;

2) ata_qc_schedule_eh(), which will end up in scsi_timeout,
which calls scsi_eh_scmd_add() which does:
host_failed++;

So before this patch, setting host_eh_scheduled = 0; in ata_std_end_eh()
makes us say that works because it only negates the EH scheduled by
ata_std_sched_eh().

However, if we do host_eh_scheduled--, then if the EH was triggered by
ata_qc_schedule_eh(), then host_eh_scheduled will decrease < 0,
which will trigger SCSI EH to wake up again :)

We could do something like only decreasing host_eh_scheduled if it is > 0.
The QCs added to EH using ata_qc_schedule_eh() will be handled by
ata_eh_finish(), which will iterate over all QCs owned by EH, and will
either fail or retry each QC. After that scsi_error_handler() has finished
the call to eh_strategy_handler() (ata_scsi_error()) it will unconditionally
set host_failed to 0:
https://github.com/torvalds/linux/blob/v6.5/drivers/scsi/scsi_error.c#L2331-L2337

So something like this on top of the patch in $subject:

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 2d5ecd68b7e0..9ab12d7f6d9f 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -952,7 +952,13 @@ EXPORT_SYMBOL_GPL(ata_std_sched_eh);
*/
void ata_std_end_eh(struct ata_port *ap)
{
- ap->scsi_host->host_eh_scheduled--;
+ struct Scsi_Host *host = ap->scsi_host;
+ unsigned long flags;
+
+ spin_lock_irqsave(host->host_lock, flags);
+ if (host->host_eh_scheduled > 0)
+ host->host_eh_scheduled--;
+ spin_unlock_irqrestore(host->host_lock, flags);
}
EXPORT_SYMBOL(ata_std_end_eh);

With that incremental patch, I can no longer reproduce the crash reported
by the kernel test robot in my QEMU setup.

It might be worth mentioning that the race window for the bug that the patch
in $subject is fixing, should be much smaller after this patch is in:
https://lore.kernel.org/linux-ide/20230907081710.4946-1-Chloe_Chen@xxxxxxxxxxxxxx/

Li Nan, perhaps you could see if you can still reproduce your original
problem with the patch from the ASMedia guys?

However, even with the ASMedia patch, it should still be theoretically
possible to get an error irq after ata_eh_reset() has called ahci_thaw(),
so I suppose that this patch still makes some sense...

Kind regards,
Niklas

Next message: Ian Rogers: "[PATCH v1] perf parse-events: Avoid erange from hex numbers"
Previous message: Fan Ni: "Re: [PATCH RFC v2 00/18] DCD: Add support for Dynamic Capacity Devices (DCD)"
In reply to: kernel test robot: "Re: [PATCH v4] ata: libata-eh: Honor all EH scheduling requests"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]