RE: questions regarding possible violation of AHCI spec in AHCIdriver
From: Jian Peng
Date: Wed Dec 08 2010 - 13:49:48 EST
So it is reasonable to add extra check in ahci_start_engine() without returning status of ST bit. If so, here is my patch
--- libahci.c.orig 2010-12-08 10:42:48.383976763 -0800
+++ libahci.c 2010-12-08 10:45:17.495156944 -0800
@@ -542,6 +542,13 @@
{
void __iomem *port_mmio = ahci_port_base(ap);
u32 tmp;
+ u8 status = readl(port_mmio + PORT_TFDATA) & 0xFF;
+
+ /* avoid race condition per spec (end of section 10.1.2) */
+ if (status & (ATA_BUSY | ATA_DRQ) ||
+ ahci_scr_read(&ap->link, SCR_STATUS, &tmp) ||
+ (tmp & 0x0f) != 0x03)
+ return;
/* start DMA */
tmp = readl(port_mmio + PORT_CMD);
Thanks,
Jian
________________________________________
From: Tejun Heo [tj@xxxxxxxxxx]
Sent: Wednesday, December 08, 2010 10:24 AM
To: Jian Peng
Cc: Robert Hancock; linux-kernel@xxxxxxxxxxxxxxx; jgarzik@xxxxxxxxx; ide
Subject: Re: questions regarding possible violation of AHCI spec in AHCI driver
Hello,
On 12/08/2010 07:14 PM, Jian Peng wrote:
> The problem happened as follow:
>
> After power up, inside ahci_init_one(), it will call ahci_power_up()
> to toggle PxCMD.SUD bit first, then HBA will send COMRESET to
> device, and device will send first D2H FIS back. Here it will call
> ahci_start_engine() to turn on PxCMD.ST to process command. In this
> case, it may run into race condition that transaction triggered by
> toggling PxCMD.SUD is not completed yet, and that is the reason why
> extra check is required by spec to guarantee that HBA already
> received FIS and in sane state.
>
> In most HBA, either staggered spin-up feature was not supported, or
> time required for transaction is less than that between two function
> calls, it may work. IMHO, this is a clear violation of spec, and not
> robust against all HBA design.
>
> The major concern is that ahci_start_engine() is used widely in EH
> and it does not return result to reflect whether ST bit was set or
> not, this may cause trouble in some cases. I am working on verifying
> those cases with different HBAs now.
I see, so it's not that the controller actually failed but you
observed the race condition. During EH, ahci_start_engine() is used
rather liberally when the driver wants the controller in working
condition. The assumption is that even if the driver tries to set ST
with the incorrect condition, the controller wouldn't go crazy where
it can't be restarted later, which _must_ be true as there's no atomic
way to check the condition and set ST.
The driver at the same time guarantees that if the whole
initialization protocol succeeds, the last ahci_start_engine() is
called after hardreset is sucessfully completed and thus all the
prerequisites are met.
So, yeap, the driver might set ST when the conditions are not met but
that can't be avoided completely anyway so the controller shouldn't go
completely bonkers for that (it's okay to fail in recoverable way) and
the driver will do the last ST setting after all the conditions are
met on the success path.
Thanks.
--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/