Re: SATA exceptions with 2.6.20-rc5
From: Robert Hancock
Date: Mon Jan 22 2007 - 20:25:09 EST
Björn Steinbrink wrote:
Running a kernel with the return statement replace by a line that prints
the irq_stat instead.
Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have get dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.
No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.
I've finally managed to reproduce this problem on my box, by doing:
watch --interval=0.1 /sbin/hdparm -I /dev/sda
on one drive and then running bonnie++ on /dev/sdb connected to the
other port on the same controller device. Usually within a few minutes
one of the IDENTIFY commands would time out in the same way you guys
have been seeing.
Through some various trials and tribulations, the only conclusion I can
come to is that this controller really doesn't like that
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried
adding some debug code to the qc_issue function that would check to see
if the BUSY flag in altstatus went high or that register showed an
interrupt within a certain time afterwards, however that really seemed
to hose things, the system wouldn't even boot.
Try out this patch, it just calls the ata_host_intr function where
appropriate without using nv_host_intr which looks at the
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for
that. With this patch I can get through a whole bonnie++ run with the
repeated IDENTIFY requests running without seeing the error.
As an aside, there seems to be some dubious code in nv_host_intr, if
ata_host_intr returns 0 for handled when a command is outstanding, it
goes and calls ata_check_status anyway. This is rather dangerous since
if an interrupt showed up right after ata_host_intr but before
ata_check_status, the ata_check_status would clear it and we would
forget about it. I tried fixing just that issue and still had this
problem however. I suspect that code is truly broken and needs further
thought, but this patch avoids calling it in the ADMA case, at any rate.
As a final aside, this is another case where the hardware docs for this
controller would really be useful, in order to know whether we are
actually supposed to be reading that register in ADMA mode or not. I
sent a query to Allen Martin at NVIDIA asking if there's a way I could
get access to the documents, but I haven't heard anything yet.
--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@xxxxxxxxxxxxx
Home Page: http://www.roberthancock.com/
--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.000000000 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.000000000 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
/* if in ATA register mode, use standard ata interrupt handler */
if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
- u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
- >> (NV_INT_PORT_SHIFT * i);
- handled += nv_host_intr(ap, irq_stat);
+ struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+ if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+ handled += ata_host_intr(ap, qc);
continue;
}