RT scheduler is suboptimal when an RT thread preempts another RT in terms of choosing a core to migrate
From: Rafikov, Rustem
Date: Thu Nov 14 2019 - 19:43:50 EST
Hi,
When an RT thread preempts another RT thread it migrates the latter one to a core.
The way RT scheduler chooses a core is quite suboptimal. Let me give an example from a "production" server with 32 total physical cores.
There are SCHED_NORMAL threads (affined to particular core each) and 2+ groups of RT threads (allowed to run everywhere).
Scheduler trace showed that most cases RT scheduler preempts a normal prio thread from a core to put evicted RT one on rather than using an idle core the system had a plenty of which according the trace.
I reproduced the behavior on a vanilla 4.18.0 kernel with a micro test where I created 10 SCHED_NORMAL affined to 10 cores,
3 RT/69 with 0xFFFFFFFF affinity and a few RT/79 threads kicking off other RTs from CPUs every 5 msec.
Other cores were idle but RT/69 never migrated to them.
The problem seems to be in how mapping in cpupri structure is updated:
1) Fair scheduler does not update/read from there. So we don't know if a SCHED_NORMAL left a cpu. Well, that may be OK.
2) RT scheduler uses cpupri to find a core to migrate to, but it updates it incorrectly:
- RT->RT works fine [2]
- But RT->IDLE or RT->SCHED_NORMAL [1] is not right - in both cases it sets RT_MAX(100) which is min NORMAL!
It's totally okay to set it to RT_MAX for all of NORMALs but not for IDLE. BTW - IDLE means swapper which has pri=120 :)
See below traced with kprobes.
[1] IDLE->RT/79->IDLE
#1. <idle>-0 [001] d.h. 14717592.107294: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0001
#2. <...>-157332 [001] d... 14717592.107313: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=64 oldpri=0051
Decoding the output at #1 cpu=1 newp=14 oldpri=0001
- cpu = 1 - it happens on core 1
- newp=14 - the priority of a thread being scheduled in is 0x14 which is RT-79 (our test thread)
- oldpri=0001 - a priority of previous thread on that CPU. "1" means NORMAL in 0-101 scale. This is incorrect by itself because the core was IDLE!
Let's try to figure out why it is not '0' (IDLE) by looking at the last line - cpu=1 newp=64 oldpri=0051
- newp=64 says that the priority of a thread being scheduled in is 0x64 which min NORMAL. So, it is not 140 how we could expect when switching to IDLE thread.
- oldpri=0051 this is 81 - priority of our RT-79 thread in 0-101 scale
[2] RT/69->RT/79>RT/69
#1. <...>-158253 [001] d.h. 14723119.396120: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=14 oldpri=0047 #2. <...>-158254 [001] d... 14723119.396122: myprobe3: (cpupri_set+0x0/0x100) cpu=1 newp=1e oldpri=0051 Line #1 - "cpu=1 newp=14 oldpri=0047" switching to 0x14, RT-79 thread
- old pri currently on cpu is 0x47 in 0-101 scale OR RT-69
Line#2 - switching to 0x1e - RT-69. This is correct value of the thread being scheduled in!
- oldppri=0051 - RT-69 in 0-101 scale
Thanks,
Rustem