Re: [Intel-wired-lan] [RFC PATCH] e1000e: Remove Other from EIAC.
From: Alexander Duyck
Date: Sat Jan 20 2018 - 12:22:20 EST
On Fri, Jan 19, 2018 at 2:55 PM, Benjamin Poirier
<benjamin.poirier@xxxxxxxxx> wrote:
> On 2018/01/20 07:45, Benjamin Poirier wrote:
> [...]
>> >
>> > I'm of the mind that we need to cut down on the code thrash. This
>> > driver is supposed to have been in a "maintenance" mode for the last
>> > year or so as there aren't being any new parts added is my
>> > understanding. As-is I don't see any reason why 16ecba59bc33 ("e1000e:
>> > Do not read ICR in Other interrupt", v4.5-rc1) was submitted or
>> > accepted in the first place. I don't see any notes about it fixing any
>> > bug or addressing any issue and it seems like that is the start of all
>> > the issues we have been having recently with RXO triggering more
>> > interrupts, various link issues, and this most recent VMware issue.
>>
>> I'm sorry to say but you're the one who suggested that change:
>>
>> http://lkml.iu.edu/hypermail/linux/kernel/1510.3/03528.html
>>
>> On 2015/10/28 23:08, Alexander Duyck wrote:
>> > On 10/22/2015 05:32 PM, Benjamin Poirier wrote:
>> [...]
>> >
>> > I would argue your first patch probably didn't go far enough to remove dead
>> > code. Specifically you should only ever get into this function if LSC is
>> > set. There are no other causes that should trigger this. As such you could
>> > probably remove the ICR read, and instead replace it with an ICR write of
>> > the LSC bit since OTHER is already cleared via EIAC.
>> >
>
> ... The assumption that "There are no other causes that should trigger
> this." turned out to be wrong and that caused the RXO problems. Clearing
> OTHER via EIAC is causing the problems with vmware now. I don't think
> you foresaw those problems back in 2015 and neither did I.
Well that explains why I felt like I was explaining things to a
younger version of myself. I was a bit more relaxed in terms of being
willing to make arbitrary changes a few years ago. I tend to be a bit
more conservative now, at least as far as having justifications as to
why I want to do things. With any change you end up taking on risk,
and so usually a patch has a justification as to why you are making
the change.
Unfortunately at the time I didn't have all the information and based
my suggestion on a bad assumption. I would guess at the time I was
thinking of doing general code cleanup. Other drivers such as igb work
this way, but it led us down the path we are on now where we are
having to make one fix after another. It is leading in the opposite
direction of maintainability as this is all becoming more complex.
Suggesting this was a bad decision on my part at the time. I'm only
human, I make mistakes.. :-)
With further review of the code I am seeing various other issues that
could still pop up as I am not certain we should even have the "other"
interrupt messing with the NAPI polling or packet accounting logic at
all. The question I would have at this point is if we revert
16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1)
and all the related fixes for it, what do we end up with? It seems
like the code is slowly heading back in the direction of where it was
originally anyway since there have been a number of partial reverts.
I'm wondering what would happen if we were to just short-cut that and
audit the patches involved to see what we really need and don't.
Your patch as proposed is essentially another step in that direction.
I'm thinking we may want to drop my currently proposed fix for now and
instead look at going through and figure out what changes after that
first one are still really needed. It doesn't look like my fix will
make it for 4.15 anyway so we might as well focus on making certain to
have things as solid as possible by the time 4.16-rc1 rolls around.
Thanks.
- Alex