Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.

From: Benjamin Poirier
Date: Mon Jan 22 2018 - 02:12:25 EST


On 2018/01/20 09:21, Alexander Duyck wrote:
> On Fri, Jan 19, 2018 at 2:55 PM, Benjamin Poirier
> <benjamin.poirier@xxxxxxxxx> wrote:
> > On 2018/01/20 07:45, Benjamin Poirier wrote:
> > [...]
> >> >
> >> > I'm of the mind that we need to cut down on the code thrash. This
> >> > driver is supposed to have been in a "maintenance" mode for the last
> >> > year or so as there aren't being any new parts added is my
> >> > understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e:
> >> > Do not read ICR in Other interrupt", v4.5-rc1) was submitted or
> >> > accepted in the first place. I don't see any notes about it fixing any
> >> > bug or addressing any issue and it seems like that is the start of all
> >> > the issues we have been having recently with RXO triggering more
> >> > interrupts, various link issues, and this most recent VMware issue.
> >>
> >> I'm sorry to say but you're the one who suggested that change:
> >>
> >> http://lkml.iu.edu/hypermail/linux/kernel/1510.3/03528.html
> >>
> >> On 2015/10/28 23:08, Alexander Duyck wrote:
> >> > On 10/22/2015 05:32 PM, Benjamin Poirier wrote:
> >> [...]
> >> >
> >> > I would argue your first patch probably didn't go far enough to remove dead
> >> > code. Specifically you should only ever get into this function if LSC is
> >> > set. There are no other causes that should trigger this. As such you could
> >> > probably remove the ICR read, and instead replace it with an ICR write of
> >> > the LSC bit since OTHER is already cleared via EIAC.
> >> >
> >
> > ... The assumption that "There are no other causes that should trigger
> > this." turned out to be wrong and that caused the RXO problems. Clearing
> > OTHER via EIAC is causing the problems with vmware now. I don't think
> > you foresaw those problems back in 2015 and neither did I.
>
> Well that explains why I felt like I was explaining things to a
> younger version of myself. I was a bit more relaxed in terms of being
> willing to make arbitrary changes a few years ago. I tend to be a bit
> more conservative now, at least as far as having justifications as to
> why I want to do things. With any change you end up taking on risk,
> and so usually a patch has a justification as to why you are making
> the change.
>
> Unfortunately at the time I didn't have all the information and based
> my suggestion on a bad assumption. I would guess at the time I was
> thinking of doing general code cleanup. Other drivers such as igb work
> this way, but it led us down the path we are on now where we are
> having to make one fix after another. It is leading in the opposite
> direction of maintainability as this is all becoming more complex.
> Suggesting this was a bad decision on my part at the time. I'm only
> human, I make mistakes.. :-)

Thanks for the introspection.

>
> With further review of the code I am seeing various other issues that
> could still pop up as I am not certain we should even have the "other"
> interrupt messing with the NAPI polling or packet accounting logic at
> all. The question I would have at this point is if we revert
> 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1)
> and all the related fixes for it, what do we end up with?

The patch I submitted for the current vmware issue actually finishes
reverting commit 16ecba59bc33.

I believe the relevant commits to consider are:

16ecba59bc33 e1000e: Do not read ICR in Other interrupt (v4.5-rc1)
a61cfe4ffad7 e1000e: Do not write lsc to ics in msi-x mode (v4.5-rc1)
0a8047ac68e5 e1000e: Fix msi-x interrupt automask (v4.5-rc1)

19110cfbb34d e1000e: Separate signaling for link check/link up
(v4.15-rc1)
4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts (v4.15-rc1)

4110e02eb45e e1000e: Fix e1000_check_for_copper_link_ich8lan return
value. (v4.15-rc8)

(submitted) e1000e: Remove Other from EIAC.

commit 4aea7a5c5e94 essentially reverts a61cfe4ffad7 and part of
16ecba59bc33 (ICR read). The submitted patch reverts the rest of
16ecba59bc33 (EIAC clearing of Other).

> It seems
> like the code is slowly heading back in the direction of where it was
> originally anyway since there have been a number of partial reverts.
> I'm wondering what would happen if we were to just short-cut that and
> audit the patches involved to see what we really need and don't.
>
> Your patch as proposed is essentially another step in that direction.
> I'm thinking we may want to drop my currently proposed fix for now and
> instead look at going through and figure out what changes after that
> first one are still really needed.

If the patch that I submitted for the current vmware issue is merged,
the significant commits that are left are:

0a8047ac68e5 e1000e: Fix msi-x interrupt automask (v4.5-rc1)
Fixes a problem in the irq disabling of the napi implementation.

19110cfbb34d e1000e: Separate signaling for link check/link up
(v4.15-rc1)
Fixes link flapping caused by a race condition in link
detection. It was found because the Other interrupt was being
triggered sort of spuriously by rxo.

4aea7a5c5e94 e1000e: Avoid receiver overrun interrupt bursts (v4.15-rc1)
Fixes Other interrupt bursts during sustained rxo conditions.

I think all of those are still needed, or if they're removed, they need
to be reimplemented differently.

> It doesn't look like my fix will
> make it for 4.15 anyway so we might as well focus on making certain to
> have things as solid as possible by the time 4.16-rc1 rolls around.
>
> Thanks.
>
> - Alex