Re: [PATCH] Reset PCIe devices to stop ongoing DMA

From: Don Dutile
Date: Fri Apr 26 2013 - 09:42:59 EST


On 04/25/2013 11:10 PM, Takao Indoh wrote:
> (2013/04/26 3:01), Don Dutile wrote:
>> On 04/25/2013 01:11 AM, Takao Indoh wrote:
>>> (2013/04/25 4:59), Don Dutile wrote:
>>>> On 04/24/2013 12:58 AM, Takao Indoh wrote:
>>>>> This patch resets PCIe devices on boot to stop ongoing DMA. When
>>>>> "pci=pcie_reset_devices" is specified, a hot reset is triggered on each
>>>>> PCIe root port and downstream port to reset its downstream endpoint.
>>>>>
>>>>> Problem:
>>>>> This patch solves the problem that kdump can fail when intel_iommu=on is
>>>>> specified. When intel_iommu=on is specified, many dma-remapping errors
>>>>> occur in second kernel and it causes problems like driver error or PCI
>>>>> SERR, at last kdump fails. This problem is caused as follows.
>>>>> 1) Devices are working on first kernel.
>>>>> 2) Switch to second kernel(kdump kernel). The devices are still working
>>>>> and its DMA continues during this switch.
>>>>> 3) iommu is initialized during second kernel boot and ongoing DMA causes
>>>>> dma-remapping errors.
>>>>>
>>>>> Solution:
>>>>> All DMA transactions have to be stopped before iommu is initialized. By
>>>>> this patch devices are reset and in-flight DMA is stopped before
>>>>> pci_iommu_init.
>>>>>
>>>>> To invoke hot reset on an endpoint, its upstream link need to be reset.
>>>>> reset_pcie_devices() is called from fs_initcall_sync, and it finds root
>>>>> port/downstream port whose child is PCIe endpoint, and then reset link
>>>>> between them. If the endpoint is VGA device, it is skipped because the
>>>>> monitor blacks out if VGA controller is reset.
>>>>>
>>>> Couple questions wrt VGA device:
>>>> (1) Many graphics devices are multi-function, one function being VGA;
>>>> is the VGA always function 0, so this scan sees it first& doesn't
>>>> do a reset on that PCIe link? if the VGA is not function 0, won't
>>>> this logic break (will reset b/c function 0 is non-VGA graphics) ?
>>>
>>> VGA is not reset irrespective of its function number. The logic of this
>>> patch is:
>>>
>>> for_each_pci_dev(dev) {
>>> if (dev is not PCIe)
>>> continue;
>>> if (dev is not root port/downstream port) ---(1)
>>> continue;
>>> list_for_each_entry(child,&dev->subordinate->devices, bus_list) {
>>> if (child is upstream port or bridge or VGA) ---(2)
>>> continue;
>>> }
>>> do_reset_its_child(dev);
>>> }
>>>
>>> Therefore VGA itself is skipped by (1), and upstream device(root port or
>>> downstream port) of VGA is also skipped by (2).
>>>
>>>
>>>> (2) I'm hearing VGA will soon not be the a required console; this logic
>>>> assumes it is, and why it isn't blanked.
>>>> Q: Should the filter be based on a device having a device-class of display ?
>>>
>>> I want to avoid the situation that user's monitor blacks out and user
>>> cannot know what's going on. That's reason why I introduced the logic to
>>> skip VGA. As far as I tested the logic based on device-class works well,
>> sorry, I read your description, which said VGA, but your are filtering on display class,
>> which includes non-VGA as well. So, all set ... but large, (x16) non-VGA display devices
>> are probably one of the most aggressive DMA engines on a system.... and will grow as
>> asymmetric processing using GPUs gets architected into a device-agnostic manner.
>> So, this may work well for servers, which is the primary consumer/user of this feature,
>> and they typically have built-in graphics that are generally used in simple VGA mode,
>> so this may be sufficient for now.
>
> Ok, understood.
>
>
>>
>>> but I would appreciate it if there are better ways.
>>>
>> You probably don't want to hear it but....
>> a) only turn off cmd-reg master enable bit
>> b) only do reset based on a list of devices known not to
>> obey their cmd-reg master enable bit, and only do reset to those devices.
>> But, given the testing you've done so far, this optional (need cmdline) feature,
>> let's start here.
>
> Ok. Either way I think we need more testing.
>
>
>>>>
>>>>> Actually this is v8 patch but quite different from v7 and it's been so
>>>>> long since previous post, so I start over again.
>>>> Thanks for this re-start. I need to continue reviewing the rest.
>>>
>>> Thank you for your review!
>>>
>>>>
>>>> Q: Why not force IOMMU off when re-booting a kexec kernel to perform a crash
>>>> dump? After the crash dump, the system is rebooting to previous (iommu=on) setting.
>>>> That logic, along w/your previous patch to disable the IOMMU if iommu=off
>>>> is set, would remove this (relatively slow) PCI init sequencing ?
>>>
>>> To force iommu off, all ongoing DMA have to be stopped before that since
>>> they are accessing the device address, not physical address. If we disable
>>> iommu without stopping in-flihgt DMA, devices access invalid memory area
>>> and it causes memory corruption or PCI-SERR due to DMA error.
>> Right, that's a 'duh' on my part.
>> I thought 'disable iommu' == 'block all dma' and it just turns it off&
>> let's the ongoing DMA run...
>> Please ignore this question... sigh.
>>
>>>
>>> So, whether we use iommu or not in second kernel, we have to stop DMA in
>>> second kernel if iommu is used in first kernel.
>>>
>>> Thanks,
>>> Takao Indoh
>>>
>>>
>>>>
>>>>> Previous post:
>>>>> [PATCH v7 0/5] Reset PCIe devices to address DMA problem on kdump
>>>>> https://lkml.org/lkml/2012/11/26/814
>>>>>
>>>>> Signed-off-by: Takao Indoh<indou.takao@xxxxxxxxxxxxxx>
>>>>> ---
>>>>> Documentation/kernel-parameters.txt | 2 +
>>>>> drivers/pci/pci.c | 103 +++++++++++++++++++++++++++++++++++
>>>>> 2 files changed, 105 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
>>>>> index 4609e81..2a31ade 100644
>>>>> --- a/Documentation/kernel-parameters.txt
>>>>> +++ b/Documentation/kernel-parameters.txt
>>>>> @@ -2250,6 +2250,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>>>>> any pair of devices, possibly at the cost of
>>>>> reduced performance. This also guarantees
>>>>> that hot-added devices will work.
>>>>> + pcie_reset_devices Reset PCIe endpoint on boot by hot
>>>>> + reset
>>>>> cbiosize=nn[KMG] The fixed amount of bus space which is
>>>>> reserved for the CardBus bridge's IO window.
>>>>> The default value is 256 bytes.
>>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>>> index b099e00..42385c9 100644
>>>>> --- a/drivers/pci/pci.c
>>>>> +++ b/drivers/pci/pci.c
>>>>> @@ -3878,6 +3878,107 @@ void __weak pci_fixup_cardbus(struct pci_bus *bus)
>>>>> }
>>>>> EXPORT_SYMBOL(pci_fixup_cardbus);
>>>>>
>>>>> +/*
>>>>> + * Return true if dev is PCIe root port or downstream port whose child is PCIe
>>>>> + * endpoint except VGA device.
>>>>> + */
>>>>> +static int __init need_reset(struct pci_dev *dev)
>>>>> +{
>>>>> + struct pci_bus *subordinate;
>>>>> + struct pci_dev *child;
>>>>> +
>>>>> + if (!pci_is_pcie(dev) || !dev->subordinate ||
>>>>> + list_empty(&dev->subordinate->devices) ||
>>>>> + ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)&&
>>>>> + (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM)))
>>>>> + return 0;
>>>>> +
>>>>> + subordinate = dev->subordinate;
>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>> + if ((pci_pcie_type(child) == PCI_EXP_TYPE_UPSTREAM) ||
>>>>> + (pci_pcie_type(child) == PCI_EXP_TYPE_PCI_BRIDGE) ||
>>>>> + ((child->class>> 16) == PCI_BASE_CLASS_DISPLAY))
>>>>> + /* Don't reset switch, bridge, VGA device */
>>>>> + return 0;
>>>>> + }
>>>>> +
>>>>> + return 1;
>>>>> +}
>>>>> +
>>>>> +static void __init save_config(struct pci_dev *dev)
>> This should be renamed. it implies it is saving the config of the pdev passed in,
>> when in reality, it is saving the config of all the devices attached to this pdev.
>> i.e., save_downstream_configs()
>>
>>>>> +{
>>>>> + struct pci_bus *subordinate;
>>>>> + struct pci_dev *child;
>>>>> +
>>>>> + if (!need_reset(dev))
>>>>> + return;
>>>>> +
>>>>> + subordinate = dev->subordinate;
>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>> + dev_info(&child->dev, "save state\n");
>>>>> + pci_save_state(child);
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +static void __init restore_config(struct pci_dev *dev)
>> inverse of above: restore_downstream_configs()
>>>>> +{
>>>>> + struct pci_bus *subordinate;
>>>>> + struct pci_dev *child;
>>>>> +
>>>>> + if (!need_reset(dev))
>>>>> + return;
>>>>> +
>>>>> + subordinate = dev->subordinate;
>>>>> + list_for_each_entry(child,&subordinate->devices, bus_list) {
>>>>> + dev_info(&child->dev, "restore state\n");
>>>>> + pci_restore_state(child);
>>>>> + }
>>>>> +}
>>>>> +
>>>>> +static void __init do_device_reset(struct pci_dev *dev)
>> do_downstream_device_reset() -- it's not resetting this pdev,
>> but the pdev's of the devices attached to it.
>>
> Right, I'll change them as you said.
>
>
>>>>> +{
>>>>> + u16 ctrl;
>>>>> +
>>>>> + if (!need_reset(dev))
>>>>> + return;
>>>>> +
>>>>> + dev_info(&dev->dev, "Reset Secondary bus\n");
>>>>> +
>>>>> + /* Assert Secondary Bus Reset */
>>>>> + pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&ctrl);
>>>>> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>> +
>>>>> + msleep(2);
>>>>> +
>> This works well for x86, which uses ioport registers to access
>> these<256-offset registers, b/c the write() function can't return
>> until the write is actually completed, but for a non-x86 system,
>> with fully mmconf'd PCI space, a write() may still be a write& run
>> (sitting in CPU write-merge) buffer, so if you need a full 2ms,
>> you ought to do another read_config() to the device, to flush the write,
>> before starting the msleep(2) clock.
>>
> I didn't know that... I'll insert something like this before msleep.
>
> pci_read_config_word(dev, PCI_BRIDGE_CONTROL,&dummy);
>
> BTW, I referred aer_do_secondary_bus_reset() when I wrote this code.
> Probably need the same fix, right?
>
sounds like it.
Interested to hear from someone in non-x86-land how
they ensure pci cfg writes aren't buffered in their cpus;
maybe handled via special uncached regions on other arches.

>
>>>>> + /* De-assert Secondary Bus Reset */
>>>>> + ctrl&= ~PCI_BRIDGE_CTL_BUS_RESET;
>>>>> + pci_write_config_word(dev, PCI_BRIDGE_CONTROL, ctrl);
>>>>> +}
>>>>> +
>>>>> +static int __initdata pcie_reset_devices;
>>>>> +static int __init reset_pcie_devices(void)
>> this should be reset_pcie_endpoints() ... it's not resetting all pcie devices
>>>>> +{
>>>>> + struct pci_dev *dev = NULL;
>>>>> +
>>>>> + if (!pcie_reset_devices)
>>>>> + return 0;
>>>>> +
>>>>> + for_each_pci_dev(dev)
>>>>> + save_config(dev);
>>>>> +
>>>>> + for_each_pci_dev(dev)
>>>>> + do_device_reset(dev);
>>>>> +
>>>>> + msleep(1000);
>>>>> +
>>>>> + for_each_pci_dev(dev)
>>>>> + restore_config(dev);
>>>>> +
>> My apologies if past thread covered this sequence...
>> Why three loops through all PCIe devices on the system?
>> why not have the first for-each-pci-dev() loop filter devices
>> to be reset, and save_config those pdev's,
>> return a list of saved_pdev's; feed that list into the do_device_reset();
>> then mpsleep(), then restore the list.
>> in fact, once you create a link list of pdev's to reset,
>> just loop that list doing save, then reset; rtn the list, do msleep(),
>> then restore the config of pdevs in the list.
>> Otherwise, doing much more traversing than what's needed.
>> Doing a great deal more config-saving then needed right now as well
>> (saving all non-endpt devices that aren't reset).
>>
>
> Creating list is nice idea, thanks. I'll do.
>
> I'll have vacation next week, so I'll post v2 patch the week after
> next.
>
> Thanks,
> Takao Indoh
>
Have a good vacation! Thanks for your willingness
to respin again & make this improvement.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/