Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

From: Alexander Duyck
Date: Fri Jan 19 2018 - 11:23:07 EST


On Fri, Jan 19, 2018 at 5:36 AM, Benjamin Poirier
<benjamin.poirier@xxxxxxxxx> wrote:
> On 2018/01/19 17:59, Benjamin Poirier wrote:
>> On 2018/01/18 07:51, Alexander Duyck wrote:
>> > On Wed, Jan 17, 2018 at 10:50 PM, Benjamin Poirier <bpoirier@xxxxxxxx> wrote:
>> > > It was reported that emulated e1000e devices in vmware esxi 6.5 Build
>> > > 7526125 do not link up after commit 4aea7a5c5e94 ("e1000e: Avoid receiver
>> > > overrun interrupt bursts", v4.15-rc1). Some tracing shows that after
>> > > e1000e_trigger_lsc() is called, ICR reads out as 0x0 in e1000_msix_other()
>> > > on emulated e1000e devices. In comparison, on real e1000e 82574 hardware,
>> > > icr=0x80000004 (_INT_ASSERTED | _OTHER) in the same situation.
>> >
>> > Isn't 0x80000004 (_INT_ASSERTED | _LSC)? The assumption I based my
>>
>> Yes. The numeric value is correct. I made a mistake when writing down
>> the flag names.
>>
>> > patch on was that the VMware code was sending _OTHER instead of _LSC
>> > to trigger LSC events. As such in my version of the workaround I just
>>
>> It's not so deterministic, sadly. In my tests, upon entry into
>> e1000_msix_other() after e1000e_trigger_lsc(), with no workaround patch
>> I've seen:
>> icr=0x0
>> icr=0x3d
>> Reserved RXDMT0 Reserved LSC TXDW
>> icr=0x46
>> RXO LSC TXQE
>> and someone else reported:
>> icr=0x3c
>> Reserved RXDMT0 Reserved LSC
>>
>> > went through and did the testing if the _RXO bit was set, otherwise I
>> > assumed that whatever event was received must have been meant to
>> > trigger an _LSC type event since that worked in the past.
> ^...
>
> Re-reading that part, my thoughts are that it worked in the past, before
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1),
> because (presumably) Other was not set in EIAC. It worked after
> 16ecba59bc33 but before 4aea7a5c5e94 ("e1000e: Avoid receiver overrun
> interrupt bursts", v4.15-rc1) because e1000_msix_other() didn't read the
> value of icr. If it had, it would've found a bogus value, which is
> what's happening after 4aea7a5c5e94. I wonder if we're not just getting
> some uninitialized value from the emulation code... which makes me kind
> of uneasy about your approach of trying to make sense of the value.
> Maybe Shrikrishna can clarify where the icr value is coming from when
> Other is set in EIAC.

For now I still say we let my current patch go as-is in order to
address the immediate issue. We can follow-up and do a more refined
version of things once we get the final word from VMware on how this
actually works. If nothing else the current patch appears to resolve
the currently reported issue and is already submitted.

I'm of the mind that we need to cut down on the code thrash. This
driver is supposed to have been in a "maintenance" mode for the last
year or so as there aren't being any new parts added is my
understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e:
Do not read ICR in Other interrupt", v4.5-rc1) was submitted or
accepted in the first place. I don't see any notes about it fixing any
bug or addressing any issue and it seems like that is the start of all
the issues we have been having recently with RXO triggering more
interrupts, various link issues, and this most recent VMware issue.

- Alex