Re: softirq
From: K Prateek Nayak
Date: Thu Jun 27 2024 - 10:15:48 EST
Hello Julia,
Some data from my 3rd Generation EPYC machine.
On 6/27/2024 2:37 AM, Julia Lawall wrote:
On Wed, 26 Jun 2024, Vincent Guittot wrote:
On Wed, 26 Jun 2024 at 07:37, Julia Lawall <julia.lawall@xxxxxxxx> wrote:
Hello,
I'm not sure I understand how softirqs work. I see the code:
open_softirq(SCHED_SOFTIRQ, sched_balance_softirq);
Intuitively, I would expect that sched_balance_softirq would be run by
ksoftirqd. That is, I would expect ksoftirqd to be scheduled
By default, SCHED_SOFTIRQ and the others run in interrupt context.
ksoftirqd is woken up only in some cases, like when we spend too much
time processing softirqs in interrupt context or when a softirq is
raised outside of interrupt context.
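For reference, the overflow conditions Vincent describes live at the
bottom of __do_softirq() (kernel/softirq.c, abridged from a recent
tree; worth double-checking against your version):

#define MAX_SOFTIRQ_TIME	msecs_to_jiffies(2)
#define MAX_SOFTIRQ_RESTART	10
...
	pending = local_softirq_pending();
	if (pending) {
		/*
		 * Still pending work: keep going in interrupt context
		 * unless we ran too long, restarted too often, or a
		 * task is waiting; only then defer to ksoftirqd.
		 */
		if (time_before(jiffies, end) && !need_resched() &&
		    --max_restart)
			goto restart;

		wakeup_softirqd();
	}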
nohz_csd_func calls raise_softirq_irqoff, which does:
inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);

	/*
	 * If we're in an interrupt or softirq, we're done
	 * (this also catches softirq-disabled code). We will
	 * actually run the softirq once we return from
	 * the irq or softirq.
	 *
	 * Otherwise we wake up ksoftirqd to make sure we
	 * schedule the softirq soon.
	 */
	if (!in_interrupt() && should_wake_ksoftirqd())
I think it is the !in_interrupt() check that makes the difference
here. When I disable C2 (which is an I/O Port based C-state on AMD)
and only leave C0 (poll loop) and C1 (MWAIT based C-state), both of
which set TIF_POLLING_NRFLAG while idling, and I add the following log
line:

	trace_printk("raise_softirq_irqoff %d %lu\n",
		     preempt_count(),
		     in_interrupt());

just above the "if" condition quoted above, I see:
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [000] d..1. 364.875516: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 364.879504: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 365.299507: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 365.963524: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 367.291500: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 370.339504: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 371.875481: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
<idle>-0 [000] d..1. 374.875462: raise_softirq_irqoff_nohz: raise_softirq_irqoff 1 0
...
(Note, this is only for SCHED_SOFTIRQ being raised from nohz_csd_func())
Since "should_wake_ksoftirqd()" always returns true for
!CONFIG_PREEMPT_RT, we end up waking ksoftirqd for idle load
balancing. Note that the "hardirq/softirq" column is always "." since
nohz_csd_func() is executed from "flush_smp_call_function_queue()" on
the way out of do_idle().
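Two bits of context for the above, both abridged from recent kernel
sources (double-check against your tree):

/* kernel/softirq.c: the !CONFIG_PREEMPT_RT variant */
static inline bool should_wake_ksoftirqd(void)
{
	return true;
}

/* kernel/sched/idle.c: tail of do_idle(). With polling C-states the
 * nohz kick never arrives as a hardirq; it is flushed here, in task
 * context, once the idle loop breaks out. */
	__current_clr_polling();
	smp_mb__after_atomic();

	flush_smp_call_function_queue();
	schedule_idle();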
With C2 enabled, which is an I/O Port based C-state on AMD and does
not set TIF_POLLING_NRFLAG, I see:
# _-----=> irqs-off/BH-disabled
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / _-=> migrate-disable
# |||| / delay
# TASK-PID CPU# ||||| TIMESTAMP FUNCTION
# | | | ||||| | |
<idle>-0 [000] d.h1. 2880.497140: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.H2. 2882.193270: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65794 65792
<idle>-0 [000] d.h1. 2884.857103: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2886.769577: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2886.989832: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2887.281561: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2887.825556: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
<idle>-0 [000] d.h1. 2888.817564: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536
Raising the softirq here happens in "hardirq" context which, I
believe, leads to SCHED_SOFTIRQ being serviced on the way out of the
interrupt. With the softirq_entry and softirq_exit tracepoints
enabled, I see:
<idle>-0 [000] d.h1. 3309.994942: raise_softirq_irqoff_nohz: raise_softirq_irqoff 65537 65536 65536
<idle>-0 [000] ..s1. 3309.994943: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [000] ..s1. 3309.995026: softirq_exit: vec=7 [action=SCHED]
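For completeness, the hardirq-exit path that services it looks like
this (abridged from kernel/softirq.c):

static inline void __irq_exit_rcu(void)
{
	...
	preempt_count_sub(HARDIRQ_OFFSET);
	/* Run pending softirqs right here, in interrupt context,
	 * unless this irq interrupted an already-running softirq. */
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();
	...
}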
With the former, whenever "ksoftirqd" is running
"sched_balance_domains", nr_running on the runqueue is mostly just 1,
i.e. ksoftirqd itself:
ksoftirqd/0-16 [000] ..s.. 10153.805434: sched_balance_domains: nr_running: 1
ksoftirqd/168-1038 [168] ..s.. 10163.765221: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10166.761349: sched_balance_domains: nr_running: 1
ksoftirqd/120-747 [120] ..s.. 10166.809204: sched_balance_domains: nr_running: 2
ksoftirqd/132-820 [132] ..s.. 10166.813203: sched_balance_domains: nr_running: 1
ksoftirqd/246-1511 [246] ..s.. 10166.845532: sched_balance_domains: nr_running: 1
ksoftirqd/107-668 [107] ..s.. 10166.853201: sched_balance_domains: nr_running: 2
ksoftirqd/120-747 [120] ..s.. 10166.865359: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10191.273328: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10193.137307: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10235.057105: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10320.172832: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10323.708863: sched_balance_domains: nr_running: 1
ksoftirqd/0-16 [000] ..s.. 10338.912787: sched_balance_domains: nr_running: 1
--
Thanks and Regards,
Prateek
		wakeup_softirqd();
}
My impression was that wakeup_softirqd was getting called.
But it is true that if the code is being executed by idle, then
in_interrupt() should be true. So perhaps it is someone else who is
waking up ksoftirqd. When I switched to __raise_softirq_irqoff, the
behavior seemed to change, but I may not have fully understood why that
happened.
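For reference, __raise_softirq_irqoff() only marks the vector pending
and never wakes ksoftirqd (kernel/softirq.c), which would explain why
switching to it changes the behavior:

void __raise_softirq_irqoff(unsigned int nr)
{
	lockdep_assert_irqs_disabled();
	trace_softirq_raise(nr);
	/* Set the pending bit; whoever next runs softirqs picks it up. */
	or_softirq_pending(1UL << nr);
}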
(sched_switch event), then the various actions of sched_balance_softirq
to be executed, and then ksoftirqd to be unscheduled (another
sched_switch event).
But in practice, I see the code of sched_balance_softirq being executed
by the idle task, before the ksoftirqd is scheduled (see core 40):
What wakes up ksoftirqd? And which softirq finally runs in ksoftirqd?
<idle>-0 [040] 3611.432554: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: starting nohz
<idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: starting _nohz_idle_balance
bt.B.x-12022 [047] 3611.432554: softirq_entry: vec=1 [action=TIMER]
<idle>-0 [040] 3611.432554: bputs: _nohz_idle_balance.isra.0: searching for a cpu
bt.B.x-12033 [003] 3611.432554: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: ending _nohz_idle_balance
bt.B.x-12052 [011] 3611.432554: softirq_entry: vec=7 [action=SCHED]
<idle>-0 [040] 3611.432554: bputs: sched_balance_softirq: nohz returns true ending soft irq
<idle>-0 [040] 3611.432554: softirq_exit: vec=7 [action=SCHED]
For example, idle seems to be running the code in _nohz_idle_balance.
I updated the code of _nohz_idle_balance as follows:
trace_printk("searching for a cpu\n");
for_each_cpu_wrap(balance_cpu, nohz.idle_cpus_mask, this_cpu+1) {
if (!idle_cpu(balance_cpu))
continue;
trace_printk("found an idle cpu\n");
It prints "searching for a cpu" but not "found an idle cpu", because
the ksoftirqd on the core's runqueue makes the core not idle. This
makes the whole softirq seem fairly useless when the only idle core is
the one raising the softirq.
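This matches idle_cpu() (kernel/sched/core.c, abridged): a runnable
ksoftirqd is enough to fail both checks, so the CPU that raised the
softirq can never consider itself idle here:

int idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);

	if (rq->curr != rq->idle)
		return 0;

	if (rq->nr_running)
		return 0;
	...
	return 1;
}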
The typical behavior is:

CPUA                                     CPUB
                                         do_idle
                                         while (!need_resched()) {
                                         ...
kick_ilb
smp_call_function_single_async(CPUB)
send_call_function_single_ipi
raise_ipi -----------------------------> cpuidle exit event
                                         irq_handler_entry
                                         ipi_handler
                                         raise SCHED_SOFTIRQ
                                         irq_handler_exit
                                         softirq_entry
                                         sched_balance_softirq
                                         _nohz_idle_balance
                                         softirq_exit
                                         cpuidle_enter event
The softirq is handled in interrupt context right after the irq
handler, and CPUB never leaves the while (!need_resched()) loop.
In your case, I suspect that you have a race between the polling mode
and the fact that you leave the while (!need_resched()) loop and call
flush_smp_call_function_queue().
We don't use polling on arm64, so I can't even try to reproduce your
case.
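For anyone following along, the polling short-cut in question is below
(abridged from kernel/sched/core.c in a recent tree; the exact
structure varies across versions):

/* Called when queueing the nohz-kick CSD: if the remote idle task is
 * polling (TIF_POLLING_NRFLAG), just set TIF_NEED_RESCHED instead of
 * sending a hardware IPI. The target then breaks out of the
 * while (!need_resched()) loop and handles the call from
 * flush_smp_call_function_queue(), in task context. */
bool call_function_single_prep_ipi(int cpu)
{
	if (set_nr_if_polling(cpu_rq(cpu)->idle)) {
		trace_sched_wake_idle_without_ipi(cpu);
		return false;	/* skip the IPI */
	}

	return true;		/* send a real IPI */
}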
This is with Prateek's patch. So need_resched is not true any more.
thanks,
julia
This is all for the same scenario that I have discussed previously,
where there are two sockets, with an overload of one thread on one and
an underload of one thread on the other, and all the threads have been
marked by NUMA balancing as preferring to be where they are. Now I am
trying Prateek's patch series.
thanks,
julia