Re: Scheduling tasks on idle cpu

From: Qais Yousef
Date: Thu Apr 14 2022 - 18:26:16 EST

Next message: Linus Torvalds: "Re: [PATCH 07/10] crypto: Use ARCH_DMA_MINALIGN instead of ARCH_KMALLOC_MINALIGN"
Previous message: Sidhartha Kumar: "Re: [PATCH 2/4] selftest/vm: verify remap destination address in mremap_test"
In reply to: David Laight: "RE: Scheduling tasks on idle cpu"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 04/14/22 06:09, David Laight wrote:

[...]

> > > That seems to be something different.
> > > Related to something else I've seen where a RT process is scheduled
> > > on its old cpu (to get the hot cache) but the process running on
> > > that cpu is looping in kernel - so the RT process doesn't start.
> >
> > I *think* you're hitting softirq latencies. Most likely it's the network RX
> > softirq processing the packets. If this latency is a problem, then PREEMPT_RT
> > [1] should help with this. For Android we hit this issue and there's a long
> > living out of tree patch that I'm trying to find an upstream replacement for.
>
> I suspect the costs of PREEMPT_RT would slow things down too much.

It shouldn't. If it does, it's worth reporting to the RT folks, or considering
whether some bad usage in userspace is causing the problem.

The linux-rt-users mailing list is a good place to ask questions. The details
are on the Linux Foundation realtime wiki page linked at [1].

> This test system has 40 cpu, 35 of them are RT and processing the same 'jobs'.
> It doesn't really matter if one is delayed by the network irq + softirq code.
> The problems arise if they all stop.
> The 'job' list was protected by a mutex - usually not too bad.
> But if a network irq interrupts the code while it holds the mutex then all
> the RT tasks stall until the softirq code completes.
> I've replaced the linked list with an array and used atomic_inc().

I see. So an interrupt that happens at the wrong time could block everything.

You can try the 'threadirqs' kernel parameter to see if this helps. PREEMPT_RT
will help with softirq latencies too, so I think this problem should be handled
by PREEMPT_RT.

There's _probably_ room for improving how userspace manages the job list too.
Do the readers have to block?

You can use irq affinities and task affinities to ensure the two never happen
on the same cpu.
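
A rough sketch of what that split could look like from userspace, in Python for
brevity. The 5/35 split mirrors the 40 cpu box described above, but which cpus
go where, the helper names, and the irq number passed to set_irq_affinity() are
all assumptions for illustration. Writing to /proc/irq needs root, and a daemon
such as irqbalance may rewrite the setting if it is running.

```python
import os

def format_cpulist(cpus):
    """Collapse CPU ids into kernel cpulist syntax, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    cpus = sorted(set(cpus))
    parts = []
    i = 0
    while i < len(cpus):
        start = end = cpus[i]
        # Extend the current run while the next id is consecutive.
        while i + 1 < len(cpus) and cpus[i + 1] == end + 1:
            i += 1
            end = cpus[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)

def split_cpus(total, n_rt):
    """Keep the first CPUs for irqs/housekeeping and the last n_rt for RT work."""
    housekeeping = list(range(total - n_rt))
    rt = list(range(total - n_rt, total))
    return housekeeping, rt

def pin_current_task(cpus):
    """Restrict the calling task to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, set(cpus))

def set_irq_affinity(irq, cpus):
    """Steer one irq to the given CPUs. Needs root and a real irq number."""
    with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as f:
        f.write(format_cpulist(cpus))

if __name__ == "__main__":
    housekeeping, rt = split_cpus(total=40, n_rt=35)
    print("irqs ->", format_cpulist(housekeeping))  # 0-4
    print("RT   ->", format_cpulist(rt))            # 5-39
```

Each RT thread would call pin_current_task() with cpus from the RT set, and the
network irqs (look them up in /proc/interrupts) would be steered to the
housekeeping set.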

> I can imagine that a PREEMPT_RT kernel will have the same problem
> because (I think) all the spin locks get replaced by sleep locks.

I don't think so. The point of PREEMPT_RT is to not block RT tasks. With
PREEMPT_RT + threadirqs, irqs and softirqs will run as kernel threads. I think
they run as RT tasks, so you can manage which is more important by assigning
the right priorities to your tasks vs the irq/softirq kthreads.
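
To see what you are competing against, something like the sketch below lists
the irq and softirq kernel threads with their current policy and priority, so
you can decide where your own tasks should sit. It assumes the usual
'irq/<nr>-<name>' and 'ksoftirqd/<cpu>' thread names; the function names are
made up for this example, and make_fifo() needs CAP_SYS_NICE (or root).

```python
import os

def is_irq_kthread(comm):
    """True for names like 'irq/24-eth0' (threaded irq) or 'ksoftirqd/3'."""
    return comm.startswith("irq/") or comm.startswith("ksoftirqd/")

def rt_priority_of(pid):
    """Return (policy, priority) for a pid using the sched_* syscalls."""
    return os.sched_getscheduler(pid), os.sched_getparam(pid).sched_priority

def list_irq_kthreads(proc="/proc"):
    """Map pid -> (comm, policy, priority) for irq and softirq kernel threads."""
    found = {}
    try:
        entries = os.listdir(proc)
    except OSError:
        return found  # no procfs at this path
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "comm")) as f:
                comm = f.read().strip()
            if is_irq_kthread(comm):
                found[int(entry)] = (comm, *rt_priority_of(int(entry)))
        except OSError:
            continue  # the thread exited while we were looking
    return found

def make_fifo(pid, priority):
    """Switch pid to SCHED_FIFO at the given priority (1..99 on Linux)."""
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))

if __name__ == "__main__":
    for pid, (comm, policy, prio) in sorted(list_irq_kthreads().items()):
        tag = "FIFO" if policy == os.SCHED_FIFO else str(policy)
        print(f"{pid:>7} {comm:<24} policy={tag} prio={prio}")
```

Whether your app should sit above or below the network irq threads is a policy
decision; the sketch only shows where the knobs are.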

>
> >
> > There's a new knob to reduce how long netdev spends in the loop. Might be worth
> > a try:
> >
> > https://lore.kernel.org/netdev/1492619830-7561-1-git-send-email-tedheadster@xxxxxxxxx/
> >
> > [1] https://wiki.linuxfoundation.org/realtime/start
>
> I think the patch that runs the softirq in a separate thread might help.
> But it probably needs a test to only do that if it would 'stall' a RT process.

I think people have been using this in rt-kernels for a long time now.
I believe you'd just need to be mindful about priorities since they'll run as
RT tasks.

The threadirqs kernel parameter is available in the mainline kernel too, but
the softirq part still hadn't been merged when I last checked, which was a
while ago. So in mainline, irqs will get threaded, but not softirqs.
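
If you want to confirm whether a box was actually booted with the parameter, a
small check against /proc/cmdline is enough. This is only a sketch and the
helper names are invented for the example.

```python
def has_kernel_param(cmdline, name):
    """True if 'name' or 'name=value' appears as a word on the command line."""
    return any(
        tok == name or tok.startswith(name + "=")
        for tok in cmdline.split()
    )

def booted_with_threadirqs(path="/proc/cmdline"):
    """Read the running kernel's command line and look for threadirqs."""
    with open(path) as f:
        return has_kernel_param(f.read(), "threadirqs")

if __name__ == "__main__":
    print("threadirqs:", booted_with_threadirqs())
```

With the parameter active, the 'irq/<nr>-<name>' threads should also show up
in ps.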

You might find good info here about tuning systems for RT from Red Hat:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/interrupt_and_process_binding

There's lots of advice regarding various aspects of the system, so it's worth
exploring if you haven't come across it before.

>
> > > I've avoided most of the pain that caused by not using a single
> > > cv_broadcast() to wake up the 34 RT threads (in this config).
> > > (Each kernel thread seemed to wake up the next one, so the
> > > delays were cumulative.)
> > > Instead there is a separate cv for each RT thread.
> > > I actually want the 'herd of wildebeest' :-)
> >
> > It seems you have a big RT app running in userspace. I thought initially you're
> > hitting issues with random kthreads or something. If you have control over
> > these tasks, then that should be easier to handle (as you suggest at the end).
>
> I've a big app with a lot of RT threads doing network send/receive.
> (All the packets are ~200 byte UDP, 50/sec on 1000+ port numbers.)
> But there are other things going on as well.
>
> > I'm not sure about the delays when using cv_broadcast(). Could it be the way
> > this library is implemented is causing the problem rather than a kernel
> > limitation?
>
> I was definitely seeing the threads wake up one by one.
> Every 10ms one of the RT threads wakes up and then wakes up all the others.
> There weren't any 'extra' system calls, once one thread was running
> in kernel the next one got woken up.
> Most (and always) noticeable were the delays getting each cpu out
> of its sleep state.

Oh yeah, idle states and DVFS are known sources of latency. You can prevent
the cpus from going into deep idle states.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/8/html-single/optimizing_rhel_8_for_real_time_for_low_latency_operation/index#con_power-saving-states_assembly_controlling-power-management-transitions
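
One common way to do this from the application itself is the PM QoS file
/dev/cpu_dma_latency: you write the maximum wakeup latency you can tolerate (in
microseconds, as a 32-bit integer) and the request stays in force for as long
as the file descriptor is kept open. The sketch below assumes that interface is
present and that the process may open it (usually root); the class and function
names are invented for the example.

```python
import os
import struct

def encode_latency(usec):
    """Pack the latency bound as the 32-bit integer the kernel expects."""
    return struct.pack("=i", usec)

class CpuDmaLatency:
    """Hold a cpu wakeup latency request for the duration of a 'with' block."""

    def __init__(self, usec=0, path="/dev/cpu_dma_latency"):
        self.usec = usec
        self.path = path
        self.fd = None

    def __enter__(self):
        self.fd = os.open(self.path, os.O_WRONLY)
        os.write(self.fd, encode_latency(self.usec))
        return self

    def __exit__(self, *exc):
        # Closing the descriptor drops the request again.
        os.close(self.fd)
        self.fd = None
        return False

if __name__ == "__main__":
    with CpuDmaLatency(usec=0):
        # Run the latency sensitive part of the app here.
        pass
```

A value of 0 is the most aggressive setting and costs power, so it is worth
measuring whether a small non-zero bound is already enough for the 10ms wakeups.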

> But if one of the required cpu was (eg) running the softint code
> none of the latter ones would wake up.
>

[...]

> > If you make it an RT task (which I think is a good idea), then the RT scheduler
> > will handle it in the push/pull remark that seems to have started this
> > discussion and it will get pushed/pulled to another CPU that is running a
> > lower priority task.
>
> The problem is that while I'd like this thread to start immediately
> what it is doing isn't THAT important.
> There are other things that might run on the CFS scheduler that are
> more important.
> I can make it RT for experiments.

You can isolate 35 cpus to run your RT app if you like and keep the remaining
5 cpus for everything else. It depends on what else you use the system for. The
Red Hat guide I pasted above has a section on using the isolated cpus feature.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/tuning_guide/isolating_cpus_using_tuned-profiles-realtime

Although this seems a bit of a stretch for your use case. You can still use
irq and task affinities to ensure certain things don't happen on the same CPU.

Cheers

--
Qais Yousef
