Re: [PATCH 0/3] softirq: uncontroversial change

From: Paolo Abeni
Date: Thu Apr 20 2023 - 13:25:09 EST

Next message: Yu Hao: "Re: BUG: divide error in ubi_attach_mtd_dev"
Previous message: Rafael J. Wysocki: "Re: [PATCH v4 6/6] thermal/drivers/intel_menlow: Make additionnal sysfs information optional"
Next in thread: Eric Dumazet: "Re: [PATCH 0/3] softirq: uncontroversial change"

Hi all,

On Thu, 2022-12-22 at 14:12 -0800, Jakub Kicinski wrote:
> Catching up on LWN I run across the article about softirq
> changes, and then I noticed fresh patches in Peter's tree.
> So probably wise for me to throw these out there.
>
> My (can I say Meta's?) problem is the opposite to what the RT
> sensitive people complain about. In the current scheme once
> ksoftirqd is woken no network processing happens until it runs.
>
> When networking gets overloaded - that's probably fair, the problem
> is that we confuse latency tweaks with overload protection. We have
> a needs_resched() in the loop condition (which is a latency tweak)
> Most often we defer to ksoftirqd because we're trying to be nice
> and let user space respond quickly, not because there is an
> overload. But the user space may not be nice, and sit on the CPU
> for 10ms+. Also the sirq's "work allowance" is 2ms, which is
> uncomfortably close to the timer tick, but that's another story.
>
> We have a sirq latency tracker in our prod kernel which catches
> 8ms+ stalls of net Tx (packets queued to the NIC but there is
> no NAPI cleanup within 8ms) and with these patches applied
> on 5.19 fully loaded web machine sees a drop in stalls from
> 1.8 stalls/sec to 0.16/sec. I also see a 50% drop in outgoing
> TCP retransmissions and ~10% drop in non-TLP incoming ones.
> This is not a network-heavy workload so most of the rtx are
> due to scheduling artifacts.
>
> The network latency in a datacenter is somewhere around neat
> 1000x lower than scheduling granularity (around 10us).
>
> These patches (patch 2 is "the meat") change what we recognize
> as overload. Instead of just checking if "ksoftirqd is woken"
> it also caps how long we consider ourselves to be in overload,
> a time limit which is different based on whether we yield due
> to real resource exhaustion vs just hitting that needs_resched().
>
> I hope the core concept is not entirely idiotic. It'd be great
> if we could get this in or fold an equivalent concept into ongoing
> work from others, because due to various "scheduler improvements"
> every time we upgrade the production kernel this problem is getting
> worse :(

Please allow me to revive this old thread.

My understanding is that we want to avoid adding more heuristics here,
preferring a consistent refactor.

I would like to propose a revert of:

4cd13c21b207 softirq: Let ksoftirqd do its job

and its follow-ups:

3c53776e29f8 Mark HI and TASKLET softirq synchronous
0f50524789fc softirq: Don't skip softirq execution when softirq thread is parking

The problem originally addressed by 4cd13c21b207 can now be tackled
with threaded NAPI, available since:

29863d41bb6e net: implement threaded-able napi poll loop support
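
For reference, threaded NAPI is switched on per device through the "threaded" sysfs attribute under /sys/class/net/<dev>/. A minimal sketch of doing that from C follows; the helper name and the sysfs_net parameter (which lets the root directory be overridden) are my own assumptions for illustration, and on a real system the write needs sufficient privileges.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "1" or "0" to <sysfs_net>/<ifname>/threaded.
 * On a real system sysfs_net would be "/sys/class/net".
 * Returns 0 on success, -1 on failure.
 */
static int napi_set_threaded(const char *sysfs_net, const char *ifname,
			     int enable)
{
	char path[256];
	FILE *f;
	int n, ok;

	assert(sysfs_net != NULL && ifname != NULL);

	n = snprintf(path, sizeof(path), "%s/%s/threaded", sysfs_net, ifname);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;	/* path did not fit in the buffer */

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such device, or no permission */

	ok = fputs(enable ? "1" : "0", f) >= 0;
	if (fclose(f) != 0)	/* the write may only fail at flush time */
		ok = 0;

	return ok ? 0 : -1;
}
```

For example, napi_set_threaded("/sys/class/net", "eth0", 1) would move NAPI polling for eth0 (a placeholder device name) into per-NAPI kernel threads, which the scheduler can then place and prioritize like any other task.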

Reverting the commits listed above should address the latency issues
Jakub describes (I verified that it solves a somewhat related problem
in my setup) and reduces the layering of heuristics in this area.

A refactor introducing uniform overload detection and proper resource
control would be better, but I admit it's beyond me, and in any case it
could still land afterwards.

Any opinion is more than welcome!

Thanks,

Paolo
