Re: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

From: John Stultz
Date: Mon Oct 03 2022 - 12:55:22 EST


On Wed, Sep 28, 2022 at 5:55 AM Qais Yousef <qais.yousef@xxxxxxx> wrote:
> On 09/21/22 01:25, John Stultz wrote:
> > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > index a749a8663841..1d126b8495bc 100644
> > --- a/include/linux/interrupt.h
> > +++ b/include/linux/interrupt.h
> > @@ -582,6 +582,12 @@ enum
> > * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
> > */
> > #define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
> > +/* Softirq's where the handling might be long: */
> > +#define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ) | \
> > + (1 << NET_RX_SOFTIRQ) | \
> > + (1 << BLOCK_SOFTIRQ) | \
> > + (1 << IRQ_POLL_SOFTIRQ) | \
> > + (1 << TASKLET_SOFTIRQ))
>
> I'm not sure about the TASKLET. I can understand NET and BLOCK require high
> throughput, hence could end up in softirq for a long time. But TASKLET seems
> allowing sloppiness. I don't feel strongly about it, but worth debating if we
> really need to include it.

That's fair. Digging through the patch history in the Android trees,
the first pass was for all softirqs but then restricted to remove
known short-running ones.
>From the bug history and what I can directly reproduce, the net and
block softirqs have definitely caused trouble, but I don't see a
specific example from TASKLET, so I'm ok dropping that for now, and
should we get specific evidence we can argue for it in a future patch.

So I'll drop TASKLET from the list here. Thanks for the suggestion!

> > @@ -1284,6 +1284,16 @@ config SCHED_AUTOGROUP
> > desktop applications. Task group autogeneration is currently based
> > upon task session.
> >
> > +config RT_SOFTIRQ_OPTIMIZATION
> > + bool "Improve RT scheduling during long softirq execution"
> > + depends on SMP
>
> Not sure if we need to depend on !PREEMPT_RT. This optimization might not be
> desired for those systems. I'll defer to PREEMPT_RT folks to decide on that.

Probably a safer default to turn it off w/ PREEMPT_RT. Thanks for the
suggestion!


> > +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> > +/*
> > + * Return whether the task on the given cpu is currently non-preemptible
> > + * while handling a potentially long softirq, or if the task is likely
> > + * to block preemptions soon because it is a ksoftirq thread that is
> > + * handling slow softirq.
> > + */
> > +static bool task_may_preempt(struct task_struct *task, int cpu)
> > +{
> > + u32 softirqs = per_cpu(active_softirqs, cpu) |
> > + __cpu_softirq_pending(cpu);
> > + struct task_struct *cpu_ksoftirqd = per_cpu(ksoftirqd, cpu);
> > + struct task_struct *curr;
> > + struct rq *rq = cpu_rq(cpu);
> > + int ret;
> > +
> > + rcu_read_lock();
> > + curr = READ_ONCE(rq->curr); /* unlocked access */
> > + ret = !((softirqs & LONG_SOFTIRQ_MASK) &&
> > + (curr == cpu_ksoftirqd ||
> > + preempt_count() & SOFTIRQ_MASK));
> > + rcu_read_unlock();
> > + return ret;
> > +}
> > +#else
> > +static bool task_may_preempt(struct task_struct *task, int cpu)
> > +{
> > + return true;
> > +}
> > +#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */
> > +
> > +static bool rt_task_fits_capacity_and_may_preempt(struct task_struct *p, int cpu)
> > +{
> > + return task_may_preempt(p, cpu) && rt_task_fits_capacity(p, cpu);
> > +}
>
> Maybe better to rename to rt_task_fits_cpu() and make it generic?
>
> See below for more on that.
>
> > @@ -1641,9 +1683,10 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > * requirement of the task - which is only important on heterogeneous
> > * systems like big.LITTLE.
> > */
> > - test = curr &&
> > - unlikely(rt_task(curr)) &&
> > - (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> > + may_not_preempt = !task_may_preempt(curr, cpu);
> > + test = (curr && (may_not_preempt ||
> > + (unlikely(rt_task(curr)) &&
> > + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio))));
>
> I think this is unnecesary if you create new rt_task_fits_cpu() and ...
>
> >
> > if (test || !rt_task_fits_capacity(p, cpu)) {
>
> ... replace the call to rt_task_fits_capacity() with the new
> rt_task_fits_cpu()?


But is that really the same logic? Above we're doing:
if ((!task_may_preempt(curr, cpu)|| <other stuff >) ||
!rt_task_fits_capacity(p, cpu))

And you're suggestion switching it to
if (<other stuff> || !rt_task_fits_cpu(p, cpu))
which would expand to:
if( <other stuff > || !(task_may_preempt(p, cpu) &&
rt_task_fits_capacity(p, cpu)))

I worry we would be skipping the part where we compare against curr.


> > int target = find_lowest_rq(p);
> > @@ -1656,11 +1699,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > goto out_unlock;
> >
> > /*
> > - * Don't bother moving it if the destination CPU is
> > + * If cpu is non-preemptible, prefer remote cpu
> > + * even if it's running a higher-prio task.
> > + * Otherwise: Don't bother moving it if the destination CPU is
> > * not running a lower priority task.
> > */
> > if (target != -1 &&
> > - p->prio < cpu_rq(target)->rt.highest_prio.curr)
> > + (may_not_preempt ||
> > + p->prio < cpu_rq(target)->rt.highest_prio.curr))
> > cpu = target;
>
> I'm not sure this makes sense. You assume a higher priority task will cause
> less delay than softirqs. Which I think is an optimistic assumption?
>
> I think we should just mimic the same fallback behavior when we fail to find
> a CPU that fits the capacity requirement. Keeps things more consistent IMO.

This sounds reasonable. I do fret that long-running rt tasks are less
common then the long running softirqs, so this may have an impact to
the effectiveness of the patch, but I also suspect it's even more rare
to have all the other cpus busy with rt tasks, so its probably very
unlikely.



> > }
> >
> > @@ -1901,11 +1947,11 @@ static int find_lowest_rq(struct task_struct *task)
> >
> > ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> > task, lowest_mask,
> > - rt_task_fits_capacity);
> > + rt_task_fits_capacity_and_may_preempt);
> > } else {
> >
> > - ret = cpupri_find(&task_rq(task)->rd->cpupri,
> > - task, lowest_mask);
> > + ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> > + task, lowest_mask, task_may_preempt);
>
> I think we can simplify the code here to call cpupri_find_fitness()
> unconditionally passing the new rt_task_fits_cpu(). rt_task_fits_capacity()
> will always return true if the static key is disabled or uclamp is not
> configured. So rt_task_fits_cpu() will effectively depend on/boil down to
> task_may_preempt() in these cases.
>
> Note that I had this called unconditionally in the past, but Steve suggested
> doing it this way, IIRC, to avoid the cost of calling the fitness function when
> we don't need to. I'm not sure it matters to be honest, but if you follow my
> suggestion you might be asked to avoid the costs for the users who don't care
> about the fitness criteria.

I'll take another pass at it and see what I can do.

Thanks so much for the feedback!
-john