sched: wakeup setting TIF_NEED_RESCHED too frequently

From: Jörn Engel
Date: Tue Apr 27 2021 - 17:17:25 EST


I came across something that looks wrong and is certainly inefficient.
Here is a trace of what happened. Kernel was 5.4, so not too ancient.

+ nanoseconds before present
|       + 0 for voluntary (going to sleep), 1 for involuntary (TASK_RUNNING)
|       | + PID of prev task
|       | |      + PID of next task
|       | |      |      + RT priority
|       | |      |      | + comm
v       v v      v      v v
8729751 0 6893   6894   0 one
8723802 0 6894   106051 0 one
8718671 1 106051 6899   0 other
8711411 0 6899   6900   0 one
8705643 0 6900   6902   0 one
8699444 0 6902   6903   0 one
8693987 0 6903   6906   0 one
8687832 0 6906   6904   0 one
8681188 0 6904   6901   0 one
8674083 0 6901   106051 0 one
8639249 1 106051 6893   0 other
8631961 0 6893   6894   0 one
8626684 0 6894   106051 0 one
8623375 1 106051 6899   0 other
8617434 0 6899   6900   0 one
8612341 0 6900   6901   0 one
8607215 0 6901   6902   0 one
8600824 0 6902   6903   0 one
8595214 0 6903   6906   0 one
8590161 0 6906   6904   0 one
8584825 0 6904   106051 0 one
8465395 1 106051 6893   0 other
8358696 0 6893   6903   0 one

The "one" process seems to frequently wake a bunch of threads that
do very little work before going back to sleep. Clearly inefficient,
but we are not interested in userspace problems here. The "other"
process has a thread that would like to continue running, but only gets
to run for very short periods, 3.3µs in one case. That also seems
inefficient and looks like either a scheduler or debatable policy.

My rule of thumb for task switch overhead is 1-3µs: probably 1µs for
the raw switch, plus another 2µs for cold cache effects. If we want to
amortize that overhead, time slices should be in the region of 1ms,
give or take; 3µs of switch overhead per 1ms slice is a 0.3% tax. I
would also be fine with the threads woken by the "one" process having
to wait up to 1ms before they actually get to run. So in my book the
observed behaviour is wrong, but opinions may differ.
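
(For calibration: the classic way to put a number on the raw switch
cost is a pipe ping-pong between two processes pinned to one CPU.
Rough sketch below, a quick hack rather than a rigorous benchmark; the
result includes pipe syscall overhead, so treat it as an upper bound.
Run it as "taskset -c 0 ./pingpong" so that each round trip really
forces two task switches.)

/* pingpong.c - rough estimate of task switch cost.
 * Build: gcc -O2 -o pingpong pingpong.c
 * Run:   taskset -c 0 ./pingpong
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERS 100000

int main(void)
{
        int ab[2], ba[2];       /* parent->child and child->parent pipes */
        char c = 0;
        struct timespec t0, t1;

        if (pipe(ab) || pipe(ba)) {
                perror("pipe");
                return 1;
        }

        if (fork() == 0) {
                /* child: echo every byte straight back */
                for (int i = 0; i < ITERS; i++) {
                        if (read(ab[0], &c, 1) != 1)
                                break;
                        write(ba[1], &c, 1);
                }
                _exit(0);
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
                write(ab[1], &c, 1);
                read(ba[0], &c, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* one round trip = two task switches when both share a CPU */
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + t1.tv_nsec - t0.tv_nsec;
        printf("~%.0f ns per switch (including pipe syscalls)\n",
               ns / (2.0 * ITERS));
        return 0;
}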

Anyway, while trying to find a cause, I noticed the following call
chain (innermost function first):
set_nr_if_polling()
ttwu_queue_remote()
ttwu_queue()
try_to_wake_up()
default_wake_function()
curr->func()
__wake_up_common()
__wake_up_common_lock()
__wake_up()
wake_up()
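
For reference, this is roughly what ttwu_queue_remote() looks like in
5.4 (paraphrased from my reading of kernel/sched/core.c, so check your
tree):

static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
{
        struct rq *rq = cpu_rq(cpu);

        p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

        /* llist_add() returns true iff wake_list was empty before */
        if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
                if (!set_nr_if_polling(rq->idle))
                        smp_send_reschedule(cpu);
                else
                        trace_sched_wake_idle_without_ipi(cpu);
        }
}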

The call chain above is manually assembled from the source code. The
closest sample I caught with instrumentation is missing the leaf calls
beneath try_to_wake_up():
_raw_spin_unlock_irqrestore+0x1f/0x40
try_to_wake_up+0x425/0x5e0
wake_up_q+0x3f/0x80
futex_wake+0x159/0x180
do_futex+0xcd/0xba0

Afaics, the result is that we set TIF_NEED_RESCHED on any wakeup that
takes this path, unless wake_list is already populated. Is that
actually intentional? And is that useful for performance or latency?
I think it isn't, but I am probably missing something here.
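
For completeness, set_nr_if_polling() as I read it in 5.4 (again
paraphrased, check your tree); it only sets TIF_NEED_RESCHED if the
target task is currently polling:

static bool set_nr_if_polling(struct task_struct *p)
{
        struct thread_info *ti = task_thread_info(p);
        typeof(ti->flags) old, val = READ_ONCE(ti->flags);

        for (;;) {
                /* bail out unless the task is polling on need_resched */
                if (!(val & _TIF_POLLING_NRFLAG))
                        return false;
                /* nothing to do if the flag is already set */
                if (val & _TIF_NEED_RESCHED)
                        return true;
                old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
                if (old == val)
                        break;
                val = old;
        }
        return true;
}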

Jörn

--
With a PC, I always felt limited by the software available. On Unix,
I am limited only by my knowledge.
-- Peter J. Schoenster