Re: [PATCH v3 1/2] x86, pci: Reset PCIe devices at boot time

From: Takao Indoh
Date: Fri Oct 12 2012 - 07:33:37 EST


(2012/10/12 2:28), Khalid Aziz wrote:
On Thu, 2012-10-11 at 15:16 +0900, Takao Indoh wrote:
(2012/10/11 5:08), Khalid Aziz wrote:
.....
+static void __init do_reset(u8 bus, u8 slot, u8 func)
+{
+ u16 ctrl;
+
+ printk(KERN_INFO "pci 0000:%02x:%02x.%d reset\n", bus, slot, func);
+
+ /* Assert Secondary Bus Reset */
+ ctrl = read_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL);
+ ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
+ write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
+
+ pci_udelay(5000);
+
+ /* De-assert Secondary Bus Reset */
+ ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+ write_pci_config_16(bus, slot, func, PCI_BRIDGE_CONTROL, ctrl);
+
+ pci_udelay(500000);

This is 0.5 second. This will add up quickly on larger servers with
multiple busses. Is 0.5 second required by the spec?
aer_do_secondary_bus_reset() holds PCI_BRIDGE_CTL_BUS_RESET for 2 ms and
then waits another 200 ms after de-asserting it. Still long, but less
than half of the delay in above code..

According to PCIe spec, they should be more than 1ms and 100ms. The reason
why they're 2ms and 200ms in aer_do_secondary_bus_reset() is that it's a
kind of safety margin, I think.

At first these delay were 2ms and 200ms as well as
aer_do_secondary_bus_reset(), but I found a problem that on certain
machine it was not enough. I think this problem occurs because
native_io_delay() is not precise. Therefore I extended them to have more
margin.

If this delay should be shortened, I have two solutions.

1)
Call early_reset_pcie_devices() after calibrate_delay() so that we can
use precise mdelay. Personally I don't want to do this because I think
DMA should be stopped asap.

2)
Make it tuneable. For exmple, improve "reset_devices"
reset_devices=reset_delay=500
or something like that. As default, 200ms is enough as far as I tested.
But if it is not enough, we can adjust delay time using this.

Any other idea?


I don't like the idea of asking end user to determine how many msec of
delay they would need on their system for a reset. If we have to go that
route, I would rather have a default of 200 msec and then add a kernel
option like "reset_devices=reset_delay=long" which changes the delay to
500 msec.

Here is another idea. The code you currently have does:

for (each device) {
save config registers
reset
wait for 500 ms
restore config registers
}

You need to wait for 500 ms because you can not access config registers
until reset is complete. So how about this - why not just save config
registers, reset each device and then wait only once for 500 ms before
restoring config registers on all devices. Here is what this will look
like:

for (each device) {
save config registers
reset
}
wait 500 ms
for (each device) {
restore config registers
}

This is obviously more complex and requires you to allocate more space
for saving state, but it has the benefits of minimizing boot up delay
and making the delay constant irrespective of number of switches/bridges
on the system.

Does this sound reasonable?

Yep, that looks good idea. I'll update my patch with this idea.

Thanks,
Takao Indoh




We have been seeing problems with kexec/kdump kernel for quite some time
that are related to I/O devices not being quiesced before kexec. I had
added code to clear Bus Master bit to help stop runaway DMAs which
helped many cases, but obviously not all. If resetting downstream ports
helps stop runaway I/O from PCIe devices, I am all for this approach.
This patch still doesn't do anything for old PCI devices though.

This patch does not reset PCI devices because of VGA problem. As I wrote
in patch description, the monitor blacks out when VGA controller is
reset. So when VGA device is found we need to skip it. In the case of
PCIe, we can decide whether we do bus reset or not on each device, but
we cannot in the case of PCI.

OK. My concern is there are plenty of older servers out there that
potentially have the problem with kexec/kdump failing often and we are
not solving the problem for them. If we can only solve it for PCIe for
now, I am ok with starting there.

--
Khalid





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/