Re: [PATCH v1] neighbour: Don't let neigh_forced_gc() disable preemption for long

From: Eric Dumazet
Date: Fri Dec 01 2023 - 12:36:03 EST


On Fri, Dec 1, 2023 at 6:16 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Fri, Dec 1, 2023 at 7:58 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> >
> > On Fri, Dec 1, 2023 at 4:16 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On Fri, Dec 1, 2023 at 1:10 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Dec 1, 2023 at 9:39 AM Judy Hsiao <judyhsiao@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > We are seeing cases where neigh_cleanup_and_release() is called by
> > > > > neigh_forced_gc() many times in a row with preemption turned off.
> > > > > When running on a low powered CPU at a low CPU frequency, this has
> > > > > been measured to keep preemption off for ~10 ms. That's not great on a
> > > > > system with HZ=1000 which expects tasks to be able to schedule in
> > > > > with ~1ms latency.
> > > >
> > > > This will not work in general, because this code runs with BH blocked.
> > > >
> > > > jiffies will stay untouched for many more ms on systems with only one CPU.
> > > >
> > > > I would rather not rely on jiffies here but ktime_get_ns() [1]
> > > >
> > > > Also if we break the loop based on time, we might be unable to purge
> > > > the last elements in gc_list.
> > > > We might need to use a second list to make sure to cycle over all
> > > > elements eventually.
> > > >
> > > >
> > > > [1]
> > > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> > > > index df81c1f0a57047e176b7c7e4809d2dae59ba6be5..e2340e6b07735db8cf6e75d23ef09bb4b0db53b4
> > > > 100644
> > > > --- a/net/core/neighbour.c
> > > > +++ b/net/core/neighbour.c
> > > > @@ -253,9 +253,11 @@ static int neigh_forced_gc(struct neigh_table *tbl)
> > > > {
> > > > int max_clean = atomic_read(&tbl->gc_entries) -
> > > > READ_ONCE(tbl->gc_thresh2);
> > > > + u64 tmax = ktime_get_ns() + NSEC_PER_MSEC;
> > >
> > > It might be nice to make the above timeout based on jiffies. On a
> > > HZ=100 system it's probably OK to keep preemption disabled for 10 ms
> > > but on a HZ=1000 system you'd want 1 ms. ...so maybe you'd want to use
> > > jiffies_to_nsecs(1)?
> >
> > I do not think so. 10ms would be awfully long.
> >
> > We have nsec based time service, why downgrading to jiffies resolution ???
>
> Well, the whole issue is that we're disabling preemption, right?
> Unless I'm mistaken, on a 1000 HZ system then a task's timeslice is
> 1ms and on a 100 HZ system then a task's timeslice is 10ms. When we
> disable preemption then the problem is that we can keep going past the
> end of our timeslice. This is bad for whatever task the system is
> trying to schedule instead of us since it will be blocked waiting for
> us to re-enable preemption.
>
> So essentially the problem here is really tied to jiffies resolution,
> right? Specifically, if jiffies is 100 Hz then it's actually
> inefficient to timeout every 1 ms--I think it would be better to use
> up our whole timeslice.

It is not because a kernel is built with HZ=100 that each thread has
to consume cpu times in 10ms slices.

Process scheduler does not use jiffies at all, but high resolution time service.

Keep in mind this code is run from soft-interrupt, not a dedicated processus.

>
>
> > > One worry might be that we disabled preemption _right before_ we were
> > > supposed to be scheduled out. In that case we'll end up blocking some
> > > other task for another full timeslice, but maybe there's not a lot we
> > > can do there?
> >
> > Can you tell us in which scenario this gc_list can be so big, other
> > than fuzzers ?
>
> The place we hit this wasn't actually with fuzzers but with normal
> usage in our labs. The only case where it was a really big problem was
> when neigh_forced_gc() was scheduled on a "little" CPU (in a
> big.LITTLE system) and that little CPU happened to be running at the
> lowest CPU frequency. Specifically Judy was testing on sc7180-trogdor
> and the lowest CPU Frequency of the "little" CPUs was 300 MHz. Since
> the littles are less powerful than the bigs, this is roughly the
> equivalent processing power of a big core running at 120 MHz.
>
> FWIW, we are apparently no longer seeing the bad latency after
> <https://crrev.com/c/4914309>, which does this:
>
> # Increase kernel neighbor table size.
> echo 1024 > /proc/sys/net/ipv4/neigh/default/gc_thresh1
> echo 4096 > /proc/sys/net/ipv4/neigh/default/gc_thresh2
> echo 8192 > /proc/sys/net/ipv4/neigh/default/gc_thresh3
> echo 1024 > /proc/sys/net/ipv6/neigh/default/gc_thresh1
> echo 4096 > /proc/sys/net/ipv6/neigh/default/gc_thresh2
> echo 8192 > /proc/sys/net/ipv6/neigh/default/gc_thresh3
>
> However, I still believe that we should land something like Judy's
> patch because, no matter what kernel tunings we have, the kernel
> shouldn't be disabling preemption for so long.

Sure, and I suggested a refinement, because as I said jiffies can
stick to a value.

Not sure why a refinement given by a network maintainer is not an option ?

I must be missing something.

>
> I will also note that, while I don't know the code at all, someone on
> our networking team commented this: High CPU usage / latency on
> neigh_cleanup_and_release is expected naturally because of our
> relatively-noisy lab environment: there are often hundreds if not
> thousands of devices + virtual devices in a single L2 network.

Well, I know this code a bit, trust me.