Re: [PATCH 0/3] PCI VPD access fixes

From: Babu Moger
Date: Wed Jan 13 2016 - 13:56:08 EST



Thanks Hannes. Patches look good. Tested with my LSI Logic / Symbios Logic MegaRAID SAS 2108 controller

+cc Jordan.

Jordan, If your card allows some access to vpd area, these patches might
work for you as well. Worst case, you might have to add your vendor/device ids
in the last(4/4) patch.

On 1/13/2016 2:20 AM, Hannes Reinecke wrote:
> Hi Babu,
>
> On 01/12/2016 11:15 PM, Babu Moger wrote:
>>
>>
>> On 1/12/2016 11:23 AM, Babu Moger wrote:
>>> Hannes,
>>> These patches cause some other issues. I still have to figure
>>> out what is going on here. I have seen same problem before also with your
>>> older patches. That is the reason I thought setting the vpd length to 0
>>> is an easy solution. Here are is what panic stack looks like. I am
>>> looking at it now. Let me know if you see anything obvious here.
>>>
>>> megasas: 06.807.10.00-rc1
>>> megasas: Waiting for FW to come to ready state
>>> megasas: FW now in Ready state
>>> NON-RESUMABLE ERROR: Reporting on cpu 3
>>> NON-RESUMABLE ERROR: TPC [0x0000000000730990] <msix_capability_init+0x250/0x360>
>>> NON-RESUMABLE ERROR: RAW [0003100000000001:0006ccae1076de96:0000000202000080:ffffffffffffffff
>>> NON-RESUMABLE ERROR: 0000000800030000:0000000100000000:0000000000000000:0000000000000000]
>>> NON-RESUMABLE ERROR: handle [0x0003100000000001] stick [0x0006ccae1076de96]
>>> NON-RESUMABLE ERROR: type [precise nonresumable]
>>> NON-RESUMABLE ERROR: attrs [0x02000080] < ASI sp-faulted priv >
>>> NON-RESUMABLE ERROR: raddr [0xffffffffffffffff]
>>> NON-RESUMABLE ERROR: insn effective address [0x000008510079700c]
>>> NON-RESUMABLE ERROR: size [0x8]
>>> NON-RESUMABLE ERROR: asi [0x00]
>>> CPU: 3 PID: 325 Comm: modprobe Tainted: G E 4.1.12-uek4-bm #5
>>> task: fff8000ff37bda00 ti: fff8000ff2940000 task.ti: fff8000ff2940000
>>> TSTATE: 0000000011001607 TPC: 0000000000730990 TNPC: 0000000000730994 Y: 00000000 Tainted: G E
>>> TPC: <msix_capability_init+0x250/0x360>
>>> g0: 0000000000000000 g1: fff8000ff33109c0 g2: 0000000000000000 g3: 000008510079700c
>>> g4: fff8000ff37bda00 g5: fff8000ffe58a000 g6: fff8000ff2940000 g7: 0000085100797000
>>> o0: fff8000ffd60d000 o1: 000000000000ffff o2: 000000000000ffff o3: 000000000111e678
>>> o4: 000000000000001e o5: 0000000000000000 sp: fff8000ff2942711 ret_pc: 0000000000730940
>>> RPC: <msix_capability_init+0x200/0x360>
>>> l0: fff8000ffd60d888 l1: fff8000ff28123ec l2: fff8000ffd60d888 l3: 00000000000080d0
>>> l4: 00000000010eace0 l5: 0000000000000001 l6: 0080000000000000 l7: 8000000000000000
>>> i0: fff8000ffd60d000 i1: fff8000ff28123e4 i2: 0000000000000001 i3: 0000000000000000
>>> i4: 0000000000000000 i5: 0000085100797000 i6: fff8000ff29427d1 i7: 0000000000730c08
>>> I7: <pci_enable_msix+0x168/0x1c0>
>>> Call Trace:
>>> [0000000000730c08] pci_enable_msix+0x168/0x1c0
>>> [0000000000730c80] pci_enable_msix_range+0x20/0x60
>>> [00000000100d7554] megasas_init_fw+0x374/0xb40 [megaraid_sas]
>>> [00000000100d8c80] megasas_probe_one+0x4a0/0x9a0 [megaraid_sas]
>>> [0000000000721e74] local_pci_probe+0x34/0xa0
>>> [0000000000721f88] pci_call_probe+0xa8/0xe0
>>> [0000000000722270] pci_device_probe+0x50/0x80
>>> [000000000079b660] really_probe+0x140/0x420
>>> [000000000079b984] driver_probe_device+0x44/0xa0
>>> [000000000079ba68] __driver_attach+0x88/0xa0
>>> [000000000079986c] bus_for_each_dev+0x6c/0xa0
>>> [000000000079b1fc] driver_attach+0x1c/0x40
>>> [000000000079a2bc] bus_add_driver+0x17c/0x220
>>> [000000000079c234] driver_register+0x74/0x120
>>> [000000000072235c] __pci_register_driver+0x3c/0x60
>>> [00000000100f00ac] megasas_init+0xac/0x1dc [megaraid_sas]
>>> Kernel panic - not syncing: Non-resumable error.
>>> CPU: 3 PID: 325 Comm: modprobe Tainted: G E 4.1.12-uek4-bm #5
>>> Call Trace:
>>> [00000000009c8314] panic+0xb4/0x248
>>> [0000000000429638] sun4v_nonresum_error+0xb8/0xe0
>>> [000000000040744c] sun4v_nonres_mondo+0xc8/0xd8
>>> [0000000000730990] msix_capability_init+0x250/0x360
>>> [0000000000730c08] pci_enable_msix+0x168/0x1c0
>>> [0000000000730c80] pci_enable_msix_range+0x20/0x60
>>> [00000000100d7554] megasas_init_fw+0x374/0xb40 [megaraid_sas]
>>> [00000000100d8c80] megasas_probe_one+0x4a0/0x9a0 [megaraid_sas]
>>> [0000000000721e74] local_pci_probe+0x34/0xa0
>>> [0000000000721f88] pci_call_probe+0xa8/0xe0
>>> [0000000000722270] pci_device_probe+0x50/0x80
>>> [000000000079b660] really_probe+0x140/0x420
>>> [000000000079b984] driver_probe_device+0x44/0xa0
>>> [000000000079ba68] __driver_attach+0x88/0xa0
>>> [000000000079986c] bus_for_each_dev+0x6c/0xa0
>>> [000000000079b1fc] driver_attach+0x1c/0x40
>>> Press Stop-A (L1-A) to return to the boot prom
>>> ---[ end Kernel panic - not syncing: Non-resumable error.
>>>
>>>
>>>
>>> On 1/12/2016 8:42 AM, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> the current PCI VPD page access assumes that the entire possible VPD
>>>> data is readable. However, the spec only guarantees a VPD data up to
>>>> the 'end' marker, with everything beyond that being undefined.
>>>> This causes a system lockup on certain devices.
>>>>
>>>> With this patch we calculate the actual VPD size, or set it to '0'
>>>> if no valid VPD data could be read.
>>>>
>>>> Hannes Reinecke (3):
>>>> pci: Update VPD definitions
>>>> pci: Update VPD size with correct length
>>>> pci: set VPD size to '0' if PCI_VPD_FLAGS_VPD_REF_F0 is set
>>>>
>>>> drivers/pci/access.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++-
>>>> drivers/pci/pci-sysfs.c | 20 +++++++++------
>>>> include/linux/pci.h | 27 ++++++++++++++++++--
>>>> 3 files changed, 104 insertions(+), 11 deletions(-)
>>>>
>>
>> Hannes, I think the logic to to get the PCI vpd length seems fine. I see that
>> the function pci_vpd_pci22_size returns zero seeing the invalid tag. That looks
>> good. However, this card(LSI Logic / Symbios Logic MegaRAID SAS 2108 controllers)
>> locks-up when we attempt to read pci vpd area. So when the driver tries to
>> initialize the card it times out and panics eventually.
>>
> WHAT? Which brain-dead engineer designed _that_?
>
>> I tried to set up a new flag(PCI_DEV_FLAGS_VPD_ZERO) for this device using
>> DECLARE_PCI_FIXUP_CLASS_EARLY.
>>
>> Changes in pci_vpd_pci22_init
>>
>> if ((dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) ||
>> (dev->dev_flags & PCI_DEV_FLAGS_VPD_ZERO))
>> vpd->base.len = 0;
>> else
>> vpd->base.len = pci_vpd_pci22_size(dev);
>> return 0;
>>
>> Changes in driver/quirk.c
>> static void quirk_set_vpd_sero(struct pci_dev *dev)
>> {
>> dump_stack();
>> dev->dev_flags |= PCI_DEV_FLAGS_VPD_ZERO;
>> }
>>
>> DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_LSI_LOGIC, 0x0060,
>> PCI_CLASS_STORAGE_RAID, 8, quirk_set_vpd_sero);
>>
>> Even this did not help. I don't know sequence of these flags setting.
>> When we are inside the function pci_vpd_pci22_init none of these flags
>> are set up. I need to look at these early quirks again.
>>
> Yes, of course it didn't.
> Thing is, setting the attribute size to '0' doesn't mean there is no data, it just means "we don't know how much data we have".
> IE you can still read from that attribute (cf patch 3/3 in that series) and the mentioned stall is triggered.
>
> Actually, I've been thinking about this, too; with my current patchset we don't make any difference between 'no data' (eg when reading past the end) and 'invalid data' (eg when the card returns garbage or worse)
>
> So I guess I should modify my patch to return -EINVAL if the data is garbage, and integrate your patchset, too.
>
> Let me see ...
>
> Cheers,
>
> Hannes