Re: SATA exceptions with 2.6.20-rc5

From: Robert Hancock
Date: Mon Jan 22 2007 - 20:25:09 EST


Björn Steinbrink wrote:
Running a kernel with the return statement replaced by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0; all others that
might have been dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.

I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb, which is connected to the other port of the same controller. Usually within a few minutes one of the IDENTIFY commands would time out in the same way you guys have been seeing.

After various trials and tribulations, the only conclusion I can come to is that this controller really doesn't like having the NV_INT_STATUS_CK804 register read in ADMA mode. I tried adding some debug code to the qc_issue function that would check whether the BUSY flag in altstatus went high, or whether that register showed an interrupt, within a certain time afterwards. That really seemed to hose things, though: the system wouldn't even boot.

Try out this patch. It just calls the ata_host_intr function where appropriate, without going through nv_host_intr, which looks at the NV_INT_STATUS_CK804 register. This is what the original ADMA patch from Mr. Mysterious NVIDIA Person did, and I'm guessing there may be a reason for that. With this patch I can get through a whole bonnie++ run, with the repeated IDENTIFY requests running, without seeing the error.
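For illustration, the condition the patch adds in the register-mode branch can be modelled outside the kernel roughly as follows. This is only a sketch: `struct mock_qc`, `MOCK_TFLAG_POLLING` and `should_dispatch` are hypothetical stand-ins invented here (they are not libata definitions, and the flag value is made up). It shows just the decision "there is an active command and it is not a polled one", not the real interrupt handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ATA_TFLAG_POLLING; the value is made up. */
#define MOCK_TFLAG_POLLING (1u << 0)

/* Hypothetical stand-in for struct ata_queued_cmd; only the flags matter. */
struct mock_qc {
    unsigned int tf_flags;  /* plays the role of qc->tf.flags */
};

/*
 * True when the patched branch would hand the command to the generic
 * interrupt routine: a command must be active (qc != NULL) and it must
 * not be one that the driver is completing by polling.
 */
static bool should_dispatch(const struct mock_qc *qc)
{
    return qc != NULL && !(qc->tf_flags & MOCK_TFLAG_POLLING);
}

/* Walks through the three cases the condition distinguishes. */
static void dispatch_examples(void)
{
    struct mock_qc irq_driven = { 0 };
    struct mock_qc polled = { MOCK_TFLAG_POLLING };

    assert(!should_dispatch(NULL));        /* no active command */
    assert(should_dispatch(&irq_driven));  /* normal interrupt-driven command */
    assert(!should_dispatch(&polled));     /* polled command: leave it alone */
}
```

The point of the sketch is that the decision depends only on driver-side state, so nothing in this path has to read NV_INT_STATUS_CK804.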

As an aside, there seems to be some dubious code in nv_host_intr: if ata_host_intr returns 0 for handled when a command is outstanding, it goes and calls ata_check_status anyway. This is rather dangerous, since if an interrupt showed up right after ata_host_intr but before ata_check_status, the ata_check_status call would clear it and we would forget about it. I tried fixing just that issue and still had this problem, however. I suspect that code is truly broken and needs further thought, but at any rate this patch avoids calling it in the ADMA case.
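The window described above can be replayed with a small user-space model. This is a sketch under stated assumptions, not the driver code: `struct mock_port`, `mock_host_intr`, `mock_check_status` and `replay` are all hypothetical names invented here, and the model assumes only what the paragraph says, namely that a bare status read clears a pending interrupt without processing the completion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one port; not a libata structure. */
struct mock_port {
    bool irq_pending;  /* device has raised an interrupt not yet serviced */
    int  completions;  /* completions the driver actually processed */
};

/* Models the handler: processes a completion only if one is pending. */
static int mock_host_intr(struct mock_port *p)
{
    if (!p->irq_pending)
        return 0;
    p->irq_pending = false;
    p->completions++;
    return 1;
}

/* Models a bare status read: acknowledges the interrupt, processes nothing. */
static void mock_check_status(struct mock_port *p)
{
    p->irq_pending = false;
}

/*
 * Replays the ordering in question and returns how many raised
 * interrupts were never processed.
 *   irq_in_window:      device interrupts between the two calls
 *   read_status_anyway: driver reads status after a "not handled" result
 */
static int replay(bool irq_in_window, bool read_status_anyway)
{
    struct mock_port p = { false, 0 };
    int raised = 0;

    int handled = mock_host_intr(&p);
    assert(handled == 0);          /* nothing was pending on the first call */

    if (irq_in_window) {
        p.irq_pending = true;      /* command completes right here */
        raised++;
    }
    if (handled == 0 && read_status_anyway)
        mock_check_status(&p);     /* swallows the interrupt if one arrived */

    mock_host_intr(&p);            /* a later pass through the handler */
    return raised - p.completions;
}
```

In this model `replay(true, true)` loses the interrupt, while `replay(true, false)` and `replay(false, true)` lose nothing, which is the ordering dependence the paragraph warns about.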

As a final aside, this is another case where the hardware docs for this controller would really be useful, in order to know whether we are actually supposed to be reading that register in ADMA mode or not. I sent a query to Allen Martin at NVIDIA asking if there's a way I could get access to the documents, but I haven't heard anything yet.

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@xxxxxxxxxxxxx
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.000000000 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.000000000 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int

 			/* if in ATA register mode, use standard ata interrupt handler */
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-					>> (NV_INT_PORT_SHIFT * i);
-				handled += nv_host_intr(ap, irq_stat);
+				struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+				if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+					handled += ata_host_intr(ap, qc);
 				continue;
 			}