RE: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs
From: David Laight
Date: Tue Oct 04 2022 - 05:52:07 EST
From: John Stultz
> Sent: 03 October 2022 17:55
> On Wed, Sep 28, 2022 at 5:55 AM Qais Yousef <qais.yousef@xxxxxxx> wrote:
> > On 09/21/22 01:25, John Stultz wrote:
> > > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > > index a749a8663841..1d126b8495bc 100644
> > > --- a/include/linux/interrupt.h
> > > +++ b/include/linux/interrupt.h
> > > @@ -582,6 +582,12 @@ enum
> > > * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
> > > */
> > > #define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
> > > +/* Softirq's where the handling might be long: */
> > > +#define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ) | \
> > > + (1 << NET_RX_SOFTIRQ) | \
> > > + (1 << BLOCK_SOFTIRQ) | \
> > > + (1 << IRQ_POLL_SOFTIRQ) | \
> > > + (1 << TASKLET_SOFTIRQ))
> > I'm not sure about the TASKLET. I can understand NET and BLOCK require high
> > throughput, hence could end up in softirq for a long time. But TASKLET seems
> > allowing sloppiness. I don't feel strongly about it, but worth debating if we
> > really need to include it.
> That's fair. Digging through the patch history in the Android trees,
> the first pass was for all softirqs but then restricted to remove
> known short-running ones.
> From the bug history and what I can directly reproduce, the net and
> block softirqs have definitely caused trouble, but I don't see a
> specific example from TASKLET, so I'm ok dropping that for now, and
> should we get specific evidence we can argue for it in a future patch.
> So I'll drop TASKLET from the list here. Thanks for the suggestion!
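For reference, with TASKLET dropped the mask would presumably reduce to the
four remaining bits. A standalone sketch follows; the enum is a copy of the
kernel's softirq numbering so that it builds outside the tree, and
softirq_may_run_long() is a made-up helper name, not something in the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Copy of the softirq numbering from include/linux/interrupt.h. */
enum {
	HI_SOFTIRQ = 0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	IRQ_POLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,
	NR_SOFTIRQS
};

/* Softirqs where the handling might be long (TASKLET dropped). */
#define LONG_SOFTIRQ_MASK ((1u << NET_TX_SOFTIRQ) | \
			   (1u << NET_RX_SOFTIRQ) | \
			   (1u << BLOCK_SOFTIRQ)  | \
			   (1u << IRQ_POLL_SOFTIRQ))

/* Hypothetical helper: is softirq number 'nr' in the long-running set? */
bool softirq_may_run_long(unsigned int nr)
{
	assert(nr < NR_SOFTIRQS);
	return (LONG_SOFTIRQ_MASK & (1u << nr)) != 0;
}
```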
I've also seen the code that finally frees memory freed under rcu
take a long time.
That was a workload sending a lot of UDP/RTP from a raw socket using
IP_HDRINCL - each send allocated a structure (fib?) that was freed from
the rcu (timer?) softint callback.
But, actually, one of the biggest causes of RT wakeup latency
was a normal thread looping without a cond_resched() call.
In my case some graphics driver doing page flushes of the
> > > int target = find_lowest_rq(p);
> > > @@ -1656,11 +1699,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > > goto out_unlock;
> > >
> > > /*
> > > - * Don't bother moving it if the destination CPU is
> > > + * If cpu is non-preemptible, prefer remote cpu
> > > + * even if it's running a higher-prio task.
> > > + * Otherwise: Don't bother moving it if the destination CPU is
> > > * not running a lower priority task.
> > > */
> > > if (target != -1 &&
> > > - p->prio < cpu_rq(target)->rt.highest_prio.curr)
> > > + (may_not_preempt ||
> > > + p->prio < cpu_rq(target)->rt.highest_prio.curr))
> > > cpu = target;
> > I'm not sure this makes sense. You assume a higher priority task will cause
> > less delay than softirqs. Which I think is an optimistic assumption?
> > I think we should just mimic the same fallback behavior when we fail to find
> > a CPU that fits the capacity requirement. Keeps things more consistent IMO.
> This sounds reasonable. I do fret that long-running rt tasks are less
> common than the long running softirqs, so this may have an impact to
> the effectiveness of the patch, but I also suspect it's even more rare
> to have all the other cpus busy with rt tasks, so it's probably very
I've a workload that is very much like that :-)
The same RTP audio program.
Running 35 RT threads (on 40 cpu) that all might run for
9ms in every 10ms.
The other 5 cpu might also be running RT threads since I
have to use threaded NAPI and make the NAPI threads RT
in order to avoid dropping packets.
Most of the wakeups can just wait for the previous cpu
to become available, only the sleep on a high-res timer
would benefit from changing the cpu.
The real scheduling problem wasn't actually wakeups at all.
The problem is the softint code running while the RT thread
held a cv - which stopped all the other threads in their tracks.
I had to replace all the 'hot' cv with atomic operations.
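A rough userspace illustration of the idea, not the actual program: the
shared state here is assumed to be a plain counter, and hot_counter, worker()
and run_atomic_demo() are made-up names. The point is only that an atomic
update has no lock owner, so a thread that gets stalled by softint code
cannot block the others:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define ITERS    100000

/* Hypothetical 'hot' shared state, previously guarded by a lock + cv. */
static atomic_long hot_counter;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		/* No lock is held, so being preempted here stalls nobody else. */
		atomic_fetch_add_explicit(&hot_counter, 1, memory_order_relaxed);
	}
	return NULL;
}

/* Runs NTHREADS workers and returns the final counter value. */
long run_atomic_demo(void)
{
	pthread_t tid[NTHREADS];

	atomic_store(&hot_counter, 0);
	for (int i = 0; i < NTHREADS; i++) {
		int rc = pthread_create(&tid[i], NULL, worker, NULL);
		assert(rc == 0);
		(void)rc;
	}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	return atomic_load(&hot_counter);
}
```

Build with -pthread. Relaxed ordering is enough for a bare counter; anything
that publishes data to other threads would need acquire/release instead.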
The only 'wakeup' problem I had was that cv_broadcast() woke
each RT task in turn - so if one waited for the softint code
to finish then so would all the rest.
(Fixed by using a separate cv for each thread.)
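A minimal pthread sketch of the per-thread arrangement, assuming each worker
simply waits for a 'go' flag (struct worker and run_per_thread_cv_demo() are
names made up for the example). Each thread has its own mutex and cv, so a
delayed thread only delays itself:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NWORKERS 4

/* One lock + cv per thread instead of one shared cv and a broadcast. */
struct worker {
	pthread_t tid;
	pthread_mutex_t lock;
	pthread_cond_t cv;
	bool go;	/* predicate, protected by 'lock' */
	int ran;	/* set once the thread has been woken */
};

static struct worker workers[NWORKERS];

static void *worker_main(void *arg)
{
	struct worker *w = arg;

	pthread_mutex_lock(&w->lock);
	while (!w->go)			/* also copes with spurious wakeups */
		pthread_cond_wait(&w->cv, &w->lock);
	w->ran = 1;
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* Starts the workers, wakes each one individually, returns how many ran. */
int run_per_thread_cv_demo(void)
{
	int woken = 0;

	for (int i = 0; i < NWORKERS; i++) {
		struct worker *w = &workers[i];
		int rc;

		pthread_mutex_init(&w->lock, NULL);
		pthread_cond_init(&w->cv, NULL);
		w->go = false;
		w->ran = 0;
		rc = pthread_create(&w->tid, NULL, worker_main, w);
		assert(rc == 0);
		(void)rc;
	}

	/* Replaces the single broadcast: one signal per private cv. */
	for (int i = 0; i < NWORKERS; i++) {
		struct worker *w = &workers[i];

		pthread_mutex_lock(&w->lock);
		w->go = true;
		pthread_cond_signal(&w->cv);
		pthread_mutex_unlock(&w->lock);
	}

	for (int i = 0; i < NWORKERS; i++) {
		struct worker *w = &workers[i];

		pthread_join(w->tid, NULL);
		woken += w->ran;
		pthread_cond_destroy(&w->cv);
		pthread_mutex_destroy(&w->lock);
	}
	return woken;
}
```

Build with -pthread. The cost is that the waker has to loop over the threads
itself, which seemed a fair trade for not serialising the wakeups.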