Re: [RFC PATCH 0/2] net: threadable napi poll loop
From: Paolo Abeni
Date: Wed May 11 2016 - 10:38:45 EST
On Wed, 2016-05-11 at 06:08 -0700, Eric Dumazet wrote:
> On Wed, 2016-05-11 at 11:48 +0200, Paolo Abeni wrote:
> > Hi Eric,
> > On Tue, 2016-05-10 at 15:51 -0700, Eric Dumazet wrote:
> > > On Wed, 2016-05-11 at 00:32 +0200, Hannes Frederic Sowa wrote:
> > >
> > > > We did not want to present this solely as a bugfix, but also as a
> > > > performance enhancement in the case of virtio (as you can see in the
> > > > cover letter). Given that a long time ago there was a tendency to
> > > > remove softirqs completely, we thought it might be very interesting
> > > > that a threaded napi in general seems to be absolutely viable
> > > > nowadays and might offer new features.
> > >
> > > Well, you did not fix the bug, you worked around it by adding yet
> > > another layer, with another sysctl that admins or programs have to
> > > manage.
> > >
> > > If you have a special need for virtio, do not hide it behind a 'bug
> > > fix' but add it as a feature request.
> > >
> > > This ksoftirqd issue is real and a fix looks very reasonable.
> > >
> > > Please try this patch, as I had very good success with it.
> >
> > Thank you for your time and your effort.
> >
> > I tested your patch in the bare metal "single core" scenario, disabling
> > the unneeded cores (cpus are numbered 0..nproc-1, so cpu0 stays online)
> > with:
> > CPUS=`nproc`
> > for I in `seq 1 $((CPUS - 1))`; do echo 0 > /sys/devices/system/node/node0/cpu$I/online; done
> >
> > And I got a:
> >
> > [ 86.925249] Broke affinity for irq <num>
> >
>
> Was it fatal, or simply a warning that you are removing the cpu that was
> the only allowed cpu in an affinity_mask ?
The above message is emitted with pr_notice() by the x86 version of
fixup_irqs(). It's not fatal; the host is alive and well after that. The
unpatched kernel does not emit it when cpus are disabled.
I'll try to look into this later.
> Looks like another bug to fix then ? We disabled CPU hotplug here at
> Google for our production, as it was notoriously buggy. No time to fix
> dozens of issues added by a crowd of developers who do not even know a
> cpu can be unplugged.
>
> Maybe some caller of local_bh_disable()/local_bh_enable() expected that
> current softirq would be processed. Obviously flaky even before the
> patches.
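For context, the pattern you mean is presumably something like this
(just a sketch, not code from the patches):

	local_bh_disable();
	/* ... critical section; a softirq may be raised here ... */
	raise_softirq(NET_RX_SOFTIRQ);
	local_bh_enable();	/* pending softirqs have historically run
				 * right here; a caller relying on that
				 * breaks if they are always deferred to
				 * ksoftirqd instead */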
>
> > for each irq number generated by a network device.
> >
> > In this scenario, your patch solves the ksoftirqd issue, performing
> > comparably to the napi threaded patches (with a negative delta within
> > the noise range) and introducing a minor single-flow regression, also
> > within the noise range (3%).
> >
> > As said in a previous mail, we actually experimented with something
> > similar, but it felt quite hackish.
>
> Right, we are networking guys, and we feel that messing with such core
> infra is not for us. So we feel comfortable adding a pure networking
> patch.
>
> >
> > AFAICS this patch adds three more tests in the fast path and affects
> > all other softirq use cases. I'm not sure how to check for regressions
> > there.
>
> It is obvious to me that the ksoftirqd mechanism is not working as
> intended.
>
> Fixing it might uncover bugs from parts of the kernel relying on the
> bug, indirectly or directly. Is it a good thing ?
>
> I can not tell before trying.
>
> Just by looking at /proc/{ksoftirqd_pid}/sched you can see the problem:
> we normally schedule ksoftirqd under stress, but most of the time the
> softirq items were processed by other tasks, as you found out.
>
>
> >
> > The napi thread patches are actually a new feature that also fixes the
> > ksoftirqd issue: hunting the ksoftirqd issue was the initial trigger
> > for this work. I'm sorry for not being clear enough in the cover
> > letter.
> >
> > The napi thread patches offer additional benefits, i.e. a further
> > relevant gain in the described test scenario, and do not impact other
> > subsystems/kernel entities.
> >
> > I still think they are worthwhile. I bet you would disagree, but could
> > you please articulate in more detail which parts concern you most
> > and/or seem most bloated ?
>
> Just look at the added code. napi_threaded_poll() is very buggy, but
> honestly I do not want to fix the bugs you added there. If you have only
> one vcpu, how can jiffies ever change while you block BHs ?
Uh, we likely have the same issue in the net_rx_action() function, which
also executes with BHs disabled and checks for jiffies changes even on
single-core hosts ?!?
Aren't jiffies updated by the timer interrupt, and thus even with BHs
disabled ?!?
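The check I have in mind looks roughly like this (a condensed
paraphrase of net_rx_action() in net/core/dev.c, not verbatim):

	static void net_rx_action(struct softirq_action *h)
	{
		unsigned long time_limit = jiffies + 2;	/* ~2 ticks */
		int budget = netdev_budget;

		for (;;) {
			/* ... poll one napi instance, decrement budget
			 * by the amount of work done ... */
			if (budget <= 0 ||
			    time_after_eq(jiffies, time_limit))
				break;	/* defer the rest; the softirq
					 * is raised again */
		}
	}

That time limit can only ever fire because the timer interrupt is a
hardirq, so jiffies keeps advancing even while BHs are disabled.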
> I was planning to remove cond_resched_softirq(), which we no longer use
> after my recent changes to the TCP stack, and you call it again (while
> it is obviously buggy, since it does not check whether a BH is pending,
> only whether a thread needs the cpu).
I missed that, thank you for pointing out.
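For reference, this is roughly what __cond_resched_softirq() does (a
condensed paraphrase from kernel/sched/core.c):

	int __sched __cond_resched_softirq(void)
	{
		BUG_ON(!in_softirq());

		if (should_resched(SOFTIRQ_DISABLE_OFFSET)) {
			local_bh_enable();	/* briefly re-enable BHs */
			preempt_schedule_common();	/* yield the cpu */
			local_bh_disable();
			return 1;
		}
		return 0;	/* nothing checks for a pending softirq */
	}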
> I prefer fixing the existing code, really. It took us years to
> understand it and maybe fix it.
>
> Just think of what will happen if you have 10 devices (10 new threads in
> your model) and one cpu.
>
> Instead of the nice existing netif_rx() doing 64-packet rounds per
> device, you'll now rely on process scheduler behavior that has no such
> granularity.
>
> Adding more threads is the natural answer of userland programmers, but
> in the kernel it is not the right answer. We already have mechanisms;
> just use them and fix them if they are broken.
>
> Sorry, I really do not think your patches are the way to go.
> But this thread is definitely interesting.
Oh, this is a far better comment than I would have expected ;-)
Cheers,
Paolo