Re: [E1000-devel] 3.11-rc4 ixgbevf: endless "Last Request of type 00 to PF Nacked" messages
From: Bjorn Helgaas
Date: Tue Aug 27 2013 - 19:01:57 EST
On Fri, Aug 23, 2013 at 3:41 PM, Skidmore, Donald C
<donald.c.skidmore@xxxxxxxxx> wrote:
>> -----Original Message-----
>> From: Bjorn Helgaas [mailto:bhelgaas@xxxxxxxxxx]
>> Sent: Friday, August 23, 2013 1:43 PM
>> To: Skidmore, Donald C
>> Cc: e1000-devel@xxxxxxxxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx;
>> linux-kernel@xxxxxxxxxxxxxxx; Don Dutile
>> Subject: Re: [E1000-devel] 3.11-rc4 ixgbevf: endless "Last Request of type 00
>> to PF Nacked" messages
>>
>> On Fri, Aug 23, 2013 at 2:37 PM, Skidmore, Donald C
>> <donald.c.skidmore@xxxxxxxxx> wrote:
>> >> -----Original Message-----
>> >> From: Bjorn Helgaas [mailto:bhelgaas@xxxxxxxxxx]
>> >> Sent: Friday, August 23, 2013 11:53 AM
>> >> To: Skidmore, Donald C
>> >> Cc: e1000-devel@xxxxxxxxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx;
>> >> linux-kernel@xxxxxxxxxxxxxxx; Don Dutile
>> >> Subject: Re: [E1000-devel] 3.11-rc4 ixgbevf: endless "Last Request of
>> >> type 00 to PF Nacked" messages
>> >>
>> >> On Fri, Aug 23, 2013 at 06:25:06PM +0000, Skidmore, Donald C wrote:
>> >> > > -----Original Message-----
>> >> > > From: Bjorn Helgaas [mailto:bhelgaas@xxxxxxxxxx]
>> >> > > Sent: Friday, August 23, 2013 9:53 AM
>> >> > > To: Skidmore, Donald C
>> >> > > Cc: e1000-devel@xxxxxxxxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx;
>> >> > > linux-kernel@xxxxxxxxxxxxxxx; Don Dutile
>> >> > > Subject: Re: [E1000-devel] 3.11-rc4 ixgbevf: endless "Last
>> >> > > Request of type 00 to PF Nacked" messages
>> >> > >
>> >> > > On Tue, Aug 20, 2013 at 5:37 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>> >> > > > On Tue, Aug 20, 2013 at 5:08 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>> >> > > >> On Tue, Aug 13, 2013 at 8:23 PM, Bjorn Helgaas <bhelgaas@xxxxxxxxxx> wrote:
>> >> > > >
>> >> > > >>> I played with this a little more and found this:
>> >> > > >>>
>> >> > > >>> 1) Magma card in z420, connected to chassis containing X540:
>> >> > > >>> fails (original report)
>> >> > > >>> 2) X540 in z420, Magma card in z420, connected to empty chassis:
>> >> > > >>> fails
>> >> > > >>> 3) X540 in z420, Magma card in z420 but no cable to chassis:
>> >> > > >>> works
>> >> > > >
>> >> > > > For what it's worth, I tried config 3 again with v3.11-rc6, and
>> >> > > > it failed the same way. I haven't bothered with config 2.
>> >> > > > It's not 100% reproducible, but at least it doesn't seem
>> >> > > > related to the expansion chassis.
>> >> > > >
>> >> > > > I attached the logs from config 3 to
>> >> > > > https://bugzilla.kernel.org/show_bug.cgi?id=60776
>> >> > >
>> >> > > Is there anything I can do to help debug this? Add
>> >> > > instrumentation, etc.? It seems like I'm doing the simplest
>> >> > > possible thing -- just writing to the sysfs sriov_num_vfs file
>> >> > > to enable VFs.
>> >> > >
>> >> > > I almost think it must be related to my config somehow if nobody
>> >> > > else is seeing this, but at the same time, my config also seems
>> >> > > the simplest possible, so I don't know what I could be doing
>> >> > > that's unusual.
>> >> > >
>> >> > > Bjorn
>> >> >
>> >> > Hey Bjorn,
>> >> >
>> >> > I may be a little confused, so bear with me.
>> >> >
>> >> > Option 1 = (your normal setup), Magma card plugged into chassis, X540
>> >> > in chassis.
>> >> > Option 2 = Magma card plugged into chassis, X540 in z420 system.
>> >> > Option 3 = Magma card UNplugged from chassis, X540 in z420 system.
>> >> >
>> >> > Options 1 & 2 - always fail
>> >> > Option 3 - sometimes fails (unsure at what rate failure occurs)
>> >> >
>> >> > Please correct me if I messed any of that up. :)
>> >>
>> >> Generally correct. I've seen failures in all three configs, so I'm
>> >> only concerned with the simplest for now (config 3, no expansion chassis).
>> >>
>> >> > Another question I have relates to the lspci output you supplied in
>> >> > the bugzilla. I'm not seeing the VF devices (i.e. 08:10.0). Did you
>> >> > run lspci before you created the VFs? If so, could we see one while
>> >> > the failure was occurring?
>> >>
>> >> That's correct, I collected the lspci output before reproducing the
>> >> problem. I can't easily collect lspci afterwards because the machine
>> >> isn't responsive after the problem starts.
>> >>
>> >> > Also, could you download the latest ixgbevf from SourceForge?
>> >> >
>> >> > https://sourceforge.net/projects/e1000/files/ixgbevf%20stable/
>> >> >
>> >> > If we add debugging messages it will be easier to patch this driver,
>> >> > and it contains our latest validated code base.
>> >>
>> >> I can do that if it turns out to be necessary. But John Haller gave
>> >> me a good clue off-list:
>> >>
>> >> John wrote:
>> >> > I assume you want the VFs to be instantiated in a VM. To do this,
>> >> > you need to blacklist the ixgbevf driver in the host (or not
>> >> > compile it into the host), or it will try to associate the driver
>> >> > in the host, rather than in the VM where you want it. Then, the VM
>> >> > needs the ixgbevf driver, which will hopefully do a better job of
>> >> > talking to the mailbox in the host. There is some work to assign
>> >> > the VF(s) to the VM, but I don't remember that offhand.
>> >>
>> >> I don't have any VMs (I started this whole thing because I was
>> >> looking at a PCI hotplug issue related to SR-IOV, so I don't really
>> >> care about VMs).
>> >>
>> >> So the ixgbevf driver on the *host* is claiming the new VFs, and it
>> >> sounds like maybe it can't handle that?
>> >>
>> >> Bjorn
>> >
>> > Not to speak for John, but I believe he was saying that if you want to
>> > use your VFs in a VM, you need to make sure you don't run the ixgbevf
>> > driver on the host, as it will "claim" the VFs. If you are NOT running
>> > any VMs, then it is perfectly fine to have both ixgbe and ixgbevf loaded.
>>
>> OK. It certainly *seemed* surprising to have the ixgbevf driver blow up,
>> even if it was an error on my part to load it in the host. Just let me know if
>> there's any more testing I can do.
>>
>> Bjorn
>
> Something is leading to the mbx messages being messed up, as evidenced by
> the "Last Request of type 03 to PF Nacked" messages. Have you tried
> resetting the ixgbevf port (ethtool -r <your port>)? Is it even possible
> to do this, given that you mentioned the machine isn't very responsive in
> the failure state?
>
> It might be worthwhile to add logging into the ixgbevf and ixgbe drivers
> around the mbx messages, in the hope that it would help show what is going
> on between the two. There have been some changes in that area of the
> ixgbevf code lately, so working off the latest SourceForge driver would
> be the easiest for me to send you a patch against. Sadly, we haven't been
> able to recreate the failure here, which makes it rather hard to debug.

I haven't been able to reproduce the problem with the 2.10.3 ixgbevf
driver from http://sourceforge.net/projects/e1000/files/ixgbevf%20stable/.
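
For anyone who wants to try reproducing this: the only trigger was a
write to the sysfs sriov_num_vfs file. Here is a minimal sketch of that
step in C. The helper name set_num_vfs is made up for illustration, and
the path and VF count passed to it are up to the caller; nothing here
comes from the driver itself.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a VF count to a sriov_num_vfs-style file.
 * Returns 0 on success, -1 on failure.
 */
int set_num_vfs(const char *path, int num_vfs)
{
    FILE *f;

    assert(path != NULL);

    f = fopen(path, "w");
    if (!f)
        return -1;

    if (fprintf(f, "%d\n", num_vfs) < 0) {
        fclose(f);
        return -1;
    }

    /* A buffered write may only fail when it is flushed, so check fclose(). */
    return fclose(f) == 0 ? 0 : -1;
}
```

On a real system the path would look like
/sys/bus/pci/devices/<domain:bus:dev.fn>/sriov_num_vfs for the PF, and
the call needs root.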
I did notice what looks like a printk format problem and what appears
to be a bare MAC address with no label:

[ 316.699504] ixgbevf: eth%d: ixgbevf_init_interrupt_scheme: Multiqueue Disabled: Rx Queue count = 1, Tx Queue count = 1
[ 316.710897] ixgbevf: eth3: ixgbevf_probe: Intel(R) X540 Virtual Function
[ 316.717608] 08:88:ff:ff:0d:ec

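
A guess at the "eth%d" part, not checked against the 2.10.3 source: a
freshly allocated Ethernet netdev carries the name template "eth%d"
until it is registered, so a message that prints the name with %s
before registration shows the template literally. The userspace sketch
below only imitates that ordering; struct fake_netdev, fake_register()
and format_msg() are made-up stand-ins, not kernel or driver functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FAKE_IFNAMSIZ 16

/* Made-up stand-in for the name field of a network device. */
struct fake_netdev {
    char name[FAKE_IFNAMSIZ];
};

/* Imitates registration: replace the "%d" in the template with a unit number. */
void fake_register(struct fake_netdev *dev, int unit)
{
    char *p;

    assert(dev != NULL);

    p = strstr(dev->name, "%d");
    if (p) {
        size_t prefix = (size_t)(p - dev->name);

        /* Overwrite the placeholder in place, staying inside name[]. */
        snprintf(p, sizeof(dev->name) - prefix, "%d", unit);
    }
}

/* Formats a log line the way a driver message using the device name might. */
void format_msg(char *buf, size_t len, const struct fake_netdev *dev,
                const char *msg)
{
    snprintf(buf, len, "ixgbevf: %s: %s", dev->name, msg);
}
```

Calling format_msg() on a device whose name is still "eth%d" gives the
first kind of line above; calling it again after fake_register(&dev, 3)
gives the "eth3" form. If that is what is happening, the fix would be a
matter of when the message is printed, not the format string.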
Sorry for wasting so much time on something that appears to be already fixed.
Bjorn