RE: Re: [PATCH] ena: Speed up initialization 90x by reducing poll delays

From: Jubran, Samih
Date: Sun Apr 12 2020 - 05:37:52 EST


Hi Josh,

I wanted to let you know that we are still looking into your patch.
After careful consideration, we have decided to set the value of
ENA_POLL_US to 100us. The rationale behind this choice is that the
device might take up to 1ms to complete the reset operation, and we
don't want to bombard the device with polls. We agree with most of your
patch and will be sending one based on it for review.
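
To make the trade-off concrete, here is a minimal userspace sketch of a
poll loop with a configurable interval. It is not the driver code:
`fake_dev`, `fake_reset_done()` and `poll_reset()` are hypothetical names
used only for illustration, and the clock is simulated rather than slept
on. Only the ENA_POLL_US value of 100us and the 1ms reset time come from
the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Poll interval discussed above, in microseconds. */
#define ENA_POLL_US 100u

/* Hypothetical stand-in for a device whose reset completes at ready_at_us. */
struct fake_dev {
	uint32_t now_us;      /* simulated clock */
	uint32_t ready_at_us; /* simulated reset completion time */
};

/* Simulates one read of a "reset done" status register. */
static bool fake_reset_done(const struct fake_dev *dev)
{
	return dev->now_us >= dev->ready_at_us;
}

/*
 * Poll until the reset completes or timeout_us has elapsed.
 * Returns the number of status reads performed, or -1 on timeout.
 */
static int poll_reset(struct fake_dev *dev, uint32_t poll_us,
		      uint32_t timeout_us)
{
	int reads = 0;

	assert(poll_us > 0);
	for (;;) {
		reads++;
		if (fake_reset_done(dev))
			return reads;
		if (dev->now_us >= timeout_us)
			return -1;
		/* A real driver would sleep here; the sketch advances the clock. */
		dev->now_us += poll_us;
	}
}
```

In this simulation, a reset that takes 1000us costs 11 status reads at a
100us interval and 1001 reads at a 1us interval, while the worst-case
extra latency from the coarser interval is bounded by one poll period.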

Thanks,
Sameeh

> -----Original Message-----
> From: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> Sent: Friday, March 13, 2020 2:28 PM
> To: Jubran, Samih <sameehj@xxxxxxxxxx>
> Cc: Machulsky, Zorik <zorik@xxxxxxxxxx>; Belgazal, Netanel
> <netanel@xxxxxxxxxx>; Kiyanovski, Arthur <akiyano@xxxxxxxxxx>;
> Tzalik, Guy <gtzalik@xxxxxxxxxx>; Bshara, Saeed <saeedb@xxxxxxxxxx>;
> netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: [EXTERNAL]Re: [PATCH] ena: Speed up initialization 90x by
> reducing poll delays
>
> On Wed, Mar 11, 2020 at 01:24:17PM +0000, Jubran, Samih wrote:
> > Hi Josh,
> >
> > Thanks for taking the time to write this patch. I hit a bug while
> > testing it. I haven't pinpointed the root cause yet, but it looks to
> > me like a race in the netlink infrastructure.
> >
> > Here is the bug scenario:
> > 1. Create a c5.24xlarge instance in AWS in the N. Virginia region using
> >    the default Amazon Linux 2 AMI.
> > 2. Apply your patch on top of net-next v5.2 and install the kernel
> >    (currently I'm able to boot net-next v5.2 only; higher versions of
> >    net-next suffer from errors during boot).
> > 3. Run "rmmod ena && insmod ena.ko" twice.
> >
> > Result:
> > The interface is not in the up state.
> >
> > Expected result:
> > The interface should be in the up state.
> >
> > What I know so far:
> > * ena_probe() seems to finish with no errors whatsoever
> > * adding prints / delays to ena_probe() causes the bug to vanish or
> > become less likely to occur, depending on how much delay I add
> > * ena_up() is not called at all when the bug occurs, so it's something
> > to do with netlink not invoking dev_open()
> >
> > Did you face such issues? Do you have any idea what might be causing this?
>
> I haven't observed anything like this. I didn't test with Amazon Linux 2,
> though.
>
> To rule out some possibilities, could you try disabling *all* userspace
> networking bits, so that userspace does nothing with a newly discovered
> interface, and then testing again? (The interface wouldn't be "up" in that
> case, but it should still have a link detected.)
>
> If that works, then I wonder if the userspace used in Amazon Linux 2 might
> have some kind of race where it's still using the previous incarnation of the
> device when you rmmod and insmod? Perhaps the previous delays made it
> difficult or impossible to trigger that race?
>
> - Josh Triplett