Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints

From: chris hyser
Date: Wed Feb 19 2020 - 12:17:43 EST

On 2/19/20 6:18 AM, David Laight wrote:
From: chris hyser
Sent: 18 February 2020 23:00
All, I was asked to take a look at the original latency_nice patchset.
First, to clarify objectives, Oracle is not interested in trading
throughput for latency. What we found is that the DB has specific tasks
which do very little but need to do this as absolutely quickly as
possible, ie extreme latency sensitivity. Second, the key to latency
reduction in the task wakeup path seems to be limiting variations of
"idle cpu" search. The latter particularly interests me as an example of
"platform size based latency" which I believe to be important given all
the varying size VMs and containers.

From my experiments there are a few things that seem to affect latency
of waking up real-time (SCHED_FIFO) tasks on a normal kernel:

Sorry. I was only ever talking about SCHED_OTHER, as per the original patchset. I realize the term "extreme latency sensitivity" may have caused confusion. What that means to DB people is no doubt different from what it means to audio people. :-)

1) The time taken for the (Intel x86) cpu to wake up from monitor/mwait.
If the cpu is allowed to enter deeper sleep states this can take 900us.
Any changes to this are system-wide not process specific.

2) If the cpu an RT process last ran on (ie the one it is woken on) is
running in kernel, the process switch won't happen until cond_resched()
is called.
On my system the code to flush the display frame buffer takes 3.3ms.
Compiling a kernel with CONFIG_PREEMPT=y will reduce this.

3) If a hardware interrupt happens just after the process is woken
then you have to wait until it finishes and any 'softint' work
that is scheduled on the same cpu finishes.
The ethernet driver transmit completions and receive ring filling
can easily take 1ms.
Booting with 'threadirqs' might help this.

4) If you need to acquire a lock/futex then you need to allow for the
process that holds it being delayed by a hardware interrupt (etc).
So even if the lock is only held for a few instructions it can take
a long time to acquire.
(I need to change some linked lists to arrays indexed by an atomically
incremented global index.)
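
To make the idea in point 4 concrete, here is a minimal sketch of an array indexed by an atomically incremented counter, assuming C11 atomics. The names slot_table, slot_claim, slot_get and SLOT_COUNT are hypothetical and chosen only for illustration; this is not code from any driver or from the patchset. The point is that a writer claims a unique index with a single fetch-and-add, so it never waits on a lock whose holder might be delayed by an interrupt.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SLOT_COUNT 1024u  /* hypothetical capacity, for illustration only */

struct slot_table {
    atomic_uint next;                  /* next free index, only ever incremented */
    _Atomic(void *) slots[SLOT_COUNT]; /* entries, NULL until published */
};

/*
 * Claim a unique index with one atomic increment and publish the item.
 * Returns the index, or -1 once the table is full. No lock is taken, so
 * a writer cannot be stalled behind a preempted or interrupted lock holder.
 * (Assumes far fewer than UINT_MAX claims, so the counter does not wrap.)
 */
static int slot_claim(struct slot_table *t, void *item)
{
    unsigned int idx = atomic_fetch_add_explicit(&t->next, 1u,
                                                 memory_order_relaxed);
    if (idx >= SLOT_COUNT)
        return -1;
    /* Release store: a reader that sees the pointer also sees the item. */
    atomic_store_explicit(&t->slots[idx], item, memory_order_release);
    return (int)idx;
}

/* Read an entry; returns NULL if the index is unused or not yet published. */
static void *slot_get(struct slot_table *t, unsigned int idx)
{
    if (idx >= SLOT_COUNT)
        return NULL;
    return atomic_load_explicit(&t->slots[idx], memory_order_acquire);
}
```

A zero-initialised (for example, static) table starts empty. The trade-off is a fixed capacity and no removal, which is often acceptable when the goal is a bounded worst case rather than flexibility.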

FWIW I can't imagine how a database can have anything that is that
latency sensitive.
We are doing lots of channels of audio processing and have a lot of work
to do within 10ms to avoid audible errors.

There are existing internal numbers, which I will ultimately have to duplicate, showing that simply short-cutting these idle cpu searches has a significant benefit on DB performance on large hardware. However, that was for a different patchset involving things I don't like, so I'm still exploring how to achieve similar results within the latency_nice framework.
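
As a rough illustration of what "short-cutting" an idle cpu search can mean, here is a user-space toy model, not kernel code and not the approach of either patchset. It assumes a 64-bit idle mask and a caller-supplied scan budget; pick_cpu_bounded, NR_CPUS_TOY and the limit parameter are hypothetical names. It only shows why bounding the scan makes the wakeup cost independent of platform size.

```c
#include <stdint.h>

#define NR_CPUS_TOY 64  /* toy model: one bit per cpu in a 64-bit mask */

/*
 * Scan at most `limit` cpus, starting at `prev` and wrapping around, and
 * return the first idle one. If none is found within the budget, give up
 * and return `prev`. Assumes 0 <= prev < NR_CPUS_TOY.
 */
static int pick_cpu_bounded(uint64_t idle_mask, int prev, int limit)
{
    for (int i = 0; i < limit && i < NR_CPUS_TOY; i++) {
        int cpu = (prev + i) % NR_CPUS_TOY;

        if (idle_mask & (UINT64_C(1) << cpu))
            return cpu;
    }
    return prev;  /* budget exhausted: stay on the previous cpu */
}
```

With a small limit the worst-case cost is the same on a 4-cpu VM and on a very large machine, at the price of sometimes missing an idle cpu that a full scan would have found.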