[RFC/PATCH, 4/4] readX_check() performance evaluation

From: Hironobu Ishii
Date: Tue Jan 27 2004 - 20:58:50 EST


This is our goal and plan.

Our goal:
- We'd like to recover from PCI-X intermittent errors
as possible as we can.
We'd like to recvoer from following errors
a) PIO read data parity errors
b) PIO write data parity erros
c) DMA read data parity errors
d) DMA write data parity erros
(PIO means that the initiator is the CPU,
DMA means that the initiator is the device.)
- If a fatal bus error like SERR occurs, we'd like to shutdown
the failed bus(, remove driver instances) and continue
system operation.
- If there are readX_check() I/F, we can safely perform
PHP(PCI Hotplug) operation.
Currently there is no reliable I/F to check the newly
installed card, so to speak, PHP is the same as the gambling.

Our Plan:
- Recover from Data parity error
Currently, we are focused on ia64 architecture.

Device setup assumption:
PCI command register/ Parity error response =1
PCI-X command register/ Data parity error recovery=1

a) PIO read data parity error recovery
(Memory mapped I/O read error recovery)

- readX_check() read the memory mapped hardware
and consume the data by additional instruction.
- If there were data parity error, machine check will
occurs. Local machine check handler check whether
the exception was caused by readX_check()
and increment a per-CPU MCA counter variable.
Then handler returns.
- readX_check check the per-CPU MCA counter variable.
When the counter is incremented, return an error.
- If readX_check() returns an error and if reading
the register has no side-effect, re-read the register
a few times to recover from the error.
If reading the register has a side-effect, driver may
reset the HW and restart operation.

b) PIO write data parity error recovery
(Memory mapped I/O write error recovery)

- The device will find a parity error.
When the device find a parity error,
it reports the error in the PCI status
register as "Detected Parity Error == 1".
After the driver did some PIO writes,
the driver can notice the error by reading
the PCI staus register.
- If writing to the register has no side-effect,
the driver can re-writeto the register.
If writing the register has a side-effect,
driver may reset the HW and restart operation.

c) DMA read data parity errors
(DMA from memory to device)

- The device will find a parity error.
When the device detected a data parity error,
it set the "Detected Parity Error == 1".

After a while, the driver check the PCI status
register, and knows that there was a data parity error.
- Because the driver can't decide which I/O failed
precisely, the recovery method will be reseting the HW
and restarting the I/O if possible.

d) DMA write data parity erros
(DMA from device to memory)

- Host bridge detect the data parity error,
and assert PERR signal on the bus.
Then the device will notice the error and
report it in PCI status register as
"Master Data Parity error == 1".
- Because the driver can't decide which I/O failed
precisely, the recovery method will be reseting the HW
and restarting the I/O if possible.

- Shutdown a bus(, remove driver instances) and continue operation.
We have no good idea currently.
Does someone have a idea?

Thanks,
Hironobu Ishii

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/