Re: [PATCH/RFC] I/O-check interface for driver's error handling

From: Jeff Garzik
Date: Tue Mar 01 2005 - 11:39:01 EST


Hidetoshi Seto wrote:
> Hi, long time no see :-)
>
> Currently, I/O errors are not a leading cause of system failure.
> However, as Linux keeps improving its scalability and ever larger
> numbers of PCI devices are connected to a single high-performance
> server, the risk of I/O errors is increasing day by day.
>
> For example, the PCI parity error is one of the most common errors
> in the hardware world. However, the major cause of parity errors
> is usually not a permanent hardware fault but a transient (soft)
> one: low voltage, humidity, natural radiation, etc. Even so, some
> platforms are sensitive enough to parity errors that they shut down
> the system immediately on such an error. So if device drivers can
> retry a transaction once it results in an error, we can reduce the
> risk of I/O errors.
>
> So I'd like to suggest new interfaces that enable drivers to
> check for (detect) errors and retry their I/O transactions easily.

I have been thinking about PCI system and parity errors, and how to handle them. I do not think this is the correct approach.

A simple retry is... too simple. If you are having a massive problem on your PCI bus, more action should be taken than a retry.

In my opinion each driver needs to be aware of PCI system/parity errors, and handle them. For network drivers, this is rather simple -- check the hardware, then restart the DMA engine, possibly turning off TSO/checksum offload to guarantee that bad packets are not accepted. For SATA and SCSI drivers, this is more complex, as one must reset the hardware and then retry a number of queued disk commands.
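
To make the network-driver case concrete, here is a rough userspace sketch of that sequence: check the error bits, stop DMA, clear the error, drop the offloads, restart. It is only an illustration under stated assumptions. struct mock_nic and mock_nic_handle_pci_error() are hypothetical names, not kernel interfaces, and the "register" is a plain field rather than real config space. The three status bit values are the standard PCI status register bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Error bits of the PCI status register (standard PCI values). */
#define PCI_STATUS_PARITY            0x0100 /* master data parity error */
#define PCI_STATUS_SIG_SYSTEM_ERROR  0x4000 /* device signalled SERR# */
#define PCI_STATUS_DETECTED_PARITY   0x8000 /* device detected a parity error */
#define PCI_ERR_MASK (PCI_STATUS_PARITY | PCI_STATUS_SIG_SYSTEM_ERROR | \
                      PCI_STATUS_DETECTED_PARITY)

/* Hypothetical stand-in for a NIC's state; not a real kernel structure. */
struct mock_nic {
    uint16_t pci_status;   /* simulated PCI status register */
    bool dma_running;      /* is the DMA engine active? */
    bool tso_enabled;      /* TCP segmentation offload */
    bool csum_offload;     /* hardware checksum offload */
    unsigned int resets;   /* how many times the DMA engine was restarted */
};

/*
 * Check for PCI system/parity errors and recover if any are present.
 * Returns true if an error was found and the recovery path ran.
 */
static bool mock_nic_handle_pci_error(struct mock_nic *nic)
{
    assert(nic != NULL);

    uint16_t err = nic->pci_status & PCI_ERR_MASK;
    if (!err)
        return false;          /* nothing to do */

    /* Stop DMA first so no further bad data moves across the bus. */
    nic->dma_running = false;

    /*
     * On real hardware these bits are cleared by writing 1 to them;
     * the mock models the end result by masking them off.
     */
    nic->pci_status &= (uint16_t)~err;

    /* Be conservative: stop trusting the offload engines. */
    nic->tso_enabled = false;
    nic->csum_offload = false;

    /* Restart the DMA engine. */
    nic->resets++;
    nic->dma_running = true;
    return true;
}
```

A real driver would read the status word from PCI config space and reprogram its rings in the restart step. The point is only that the decision about what to do after the error lives in the driver, not in a generic retry helper.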

A new API handles none of this.

Jeff


