Re: A question concerning time outs and possible lost interrupts

Gerard Roudier (groudier@club-internet.fr)
Mon, 14 Sep 1998 19:14:59 +0200 (MET DST)


On Mon, 14 Sep 1998, MOLNAR Ingo wrote:

> On Mon, 14 Sep 1998, Gerard Roudier wrote:
>
> > 1 - Read the Interrupt Status Register (ISTAT)
> > If completion interrupt (INTFLY)
> > 2 - Write the ISTAT to clear the interrupt condition.
> > 3 - Reread the ISTAT. This read will ensure that PCI posted writes
> > that may have occurred between (1) and (2) are flushed and that the
> > Interrupt condition is actually cleared.
> > (This seems overcommitting, but hopefully it is not)
> > 4 - Scan the completion queue.
> >
> > Between (1) and (2) the controller may have written to memory some
> > completion data and these transactions may be posted.
> > The write to the ISTAT (2) may also be posted.
> > (3) ensures that all this stuff will be actually visible by the
> > corresponding parts at the moment the completion queue is scanned
> > by the C code.
>
> just out of curiosity, does the problem remain if the NCR driver is
> booted in non-ioremapped mode? in/out is slower but much more conservative
> and this should exclude lots of cache invalidation/chipset posted write
> bug possibilities.

Normal IO vs MMIO is a compilation option and MMIO is the default.
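
For reference, the four-step interrupt sequence quoted above can be
sketched in C roughly as follows. This is only an illustration under
stated assumptions, not the actual ncr53c8xx code: the chip is simulated
as a plain structure in memory, and struct fake_chip, reg_read(),
reg_write_clear(), scan_completions(), handle_interrupt(), QUEUE_LEN, the
write-one-to-clear behaviour and the 0x04 value used for INTFLY are all
names and values made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values, chosen for this sketch only. */
#define INTFLY    0x04u  /* assumed "completion interrupt" bit in ISTAT */
#define QUEUE_LEN 8      /* assumed size of the completion queue */

/* Simulated controller: one status register plus a completion queue
 * that the "chip" fills in memory. A zero entry means "empty". */
struct fake_chip {
    volatile uint8_t  istat;
    volatile uint32_t done_queue[QUEUE_LEN];
};

/* On real hardware these would be MMIO or port IO accessors. */
static uint8_t reg_read(struct fake_chip *c)
{
    return c->istat;
}

/* Assumption: writing a 1 to a status bit clears it. */
static void reg_write_clear(struct fake_chip *c, uint8_t bits)
{
    c->istat &= (uint8_t)~bits;
}

/* (4) Collect every non-empty queue entry and mark it empty again. */
static size_t scan_completions(struct fake_chip *c, uint32_t *out, size_t max)
{
    size_t n = 0;

    for (size_t i = 0; i < QUEUE_LEN && n < max; i++) {
        if (c->done_queue[i] != 0) {
            out[n++] = c->done_queue[i];
            c->done_queue[i] = 0;
        }
    }
    return n;
}

/* Returns the number of completed commands copied to out[]. */
size_t handle_interrupt(struct fake_chip *c, uint32_t *out, size_t max)
{
    assert(c != NULL && out != NULL);

    uint8_t istat = reg_read(c);          /* (1) read ISTAT */
    if (!(istat & INTFLY))
        return 0;                         /* not a completion interrupt */

    reg_write_clear(c, INTFLY);           /* (2) clear the condition */

    /* (3) Reread ISTAT. In this simulation it is an ordinary load; on
     * real PCI hardware the read is what is expected to force earlier
     * posted writes to complete before the queue is examined. */
    (void)reg_read(c);

    return scan_completions(c, out, max); /* (4) scan the queue */
}
```

The point of the ordering is that the queue is scanned only after the
reread in (3), so a completion written by the controller between (1) and
(2) is picked up by the scan in (4) rather than left behind with the
interrupt condition already cleared.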

I have reread Edward's report, and my understanding is that when the
problem occurs the driver seems unable to recover using a full SCSI bus
and controller reset. The driver is perhaps not that good at fine-grained
recovery, but it has always recovered gracefully with a reset in my own
testing and, based on reports, I do think that it actually recovers on
reset most of the time.

This leads me to think that the breakage is likely more serious than a
lost interrupt and that the kernel is locked up for another reason. (If an
interrupt had been lost, then scsi.c should have asked the low-level
driver for a reset, since aborting the command probably does not work.)

Anyway, Edward's benchmark heavily stresses the whole kernel and the
hardware as well as the SCSI layers. A hang of any component that does not
also lock up the scsi.c timeout handling will result in pending SCSI
commands being timed out within the first 15 seconds that follow the
hang. So the observed symptom is a general symptom of a partial system
lockup, which can have a great many causes, including, obviously, the
SCSI layers. It would be interesting to add some network traffic to the
benchmark in order to see how the network layers behave when the problem
occurs.

Regards,
Gerard.
