RE: sched/fair: scheduler not running high priority process on idle cpu

From: David Laight
Date: Wed Jan 15 2020 - 07:44:26 EST


From: Steven Rostedt
> Sent: 14 January 2020 17:48
>
> On Tue, 14 Jan 2020 17:33:50 +0000
> David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> > I have added a cond_resched() to the offending loop, but a close look implies
> > that code is called with a lock held in another (less common) path so that
> > can't be directly committed and so CONFIG_PREEMPT won't help.
> >
> > Indeed requiring CONFIG_PREEMPT doesn't help when customers are running
> > the application, nor (probably) on AWS since I doubt it is ever the default.
> >
> > Does the same apply to non-RT tasks?
> > I can select almost any priority, but RT ones are otherwise a lot better.
> >
> > I've also seen RT processes delayed by the network stack 'bh' that runs
> > in a softint from the hardware interrupt.
> > That can take a while (clearing up tx and refilling rx) and I don't think we
> > have any control over the cpu it runs on?
>
> Yes, even with CONFIG_PREEMPT, Linux has no guarantees of latency for
> any task regardless of priority. If you have latency requirements, then
> you need to apply the PREEMPT_RT patch (which may soon make it to
> mainline this year!), which spin locks and bh wont stop a task from
> scheduling (unless they need the same lock)

We're not trying to do anything life-threatening.
So the latency requirements are only moderate - failures mess up telephone
audio quality. There is also allowance for jitter elsewhere.
OTOH not running a high priority process when there are idle cpu seems 'sub-optimal'.

Code that runs with a spin-lock held (or otherwise disables preemption)
for significant periods probably ought to be detected and warned.
I'm not sure of a suitable limit, 100us is probably excessive on x86.

IIUC PREEMPT_RT adds overhead to quite a bit of code and is unlikely
to get enabled in 'distro' kernels.
Especially since they've not enabled CONFIG_PREEMPT which probably
has a lower impact - provided the cv+mutex wakeup has been arranged
to avoid the treble process switch.

Running the driver bh (which is often significant) from a high priority
worker thread instead of a softint (which isn't much different to the
'hardint' it is scheduled from) probably doesn't cost much (in-kernel
process switches shouldn't be much more than a stack switch).
That would benefit RT processes since they could be higher
priority than the bh code.
Although you'd probably want a 'strongly preferred' cpu for them.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)