Re: [PATCH] sched: prefer an idle cpu vs an idle sibling for BALANCE_WAKE

From: Josef Bacik
Date: Wed May 27 2015 - 16:09:48 EST


On 05/26/2015 05:31 PM, Josef Bacik wrote:
At Facebook we have a pretty heavily multi-threaded application that is
sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code
because it gives us a pretty significant performance gain (around 20%). It
turns out this is because there are cases where the scheduler puts our task
on a busy CPU when there are idle CPUs in the system. We verified this by
reading cpu_delay_req_avg_us from the scheduler's netlink delay statistics;
with our (admittedly crappy) patch we get much lower delay numbers than
baseline.

SD_BALANCE_WAKE is supposed to find us an idle cpu to run on; however, it
only looks for an idle sibling, preferring affinity over all else. This is
not helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle
cpu, not to guarantee affinity. Fix this by first trying to find an idle
sibling, and then, if that cpu is not idle, falling through to the logic
that finds an idle cpu. With this patch we get slightly better performance
than with our forward port of SD_WAKE_IDLE. Thanks,
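
[Editor's note: not the actual diff (the real change threads through
select_task_rq_fair()'s slow path), but a toy userspace model of the policy
change, with all names made up, so the before/after behavior is concrete:]

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_CPUS 8

	static bool cpu_idle[NR_CPUS];	/* hypothetical idle map */

	static bool cpu_is_idle(int cpu) { return cpu_idle[cpu]; }

	/* old behavior: take whatever the affinity search returns,
	 * modeled on select_idle_sibling() with SMT pairs (0,1),(2,3),... */
	static int pick_sibling(int prev_cpu)
	{
		int sibling = prev_cpu ^ 1;
		return cpu_is_idle(sibling) ? sibling : prev_cpu;
	}

	/* new behavior: if the affine pick is busy, keep looking for
	 * any idle cpu (the kernel's wider find_idlest_* search is
	 * compressed here to a linear scan) */
	static int pick_wake_cpu(int prev_cpu)
	{
		int cpu = pick_sibling(prev_cpu);

		if (cpu_is_idle(cpu))
			return cpu;	/* affine and idle: best case */

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (cpu_is_idle(cpu))
				return cpu;	/* fall through */

		return prev_cpu;	/* nothing idle: stay put */
	}

	int main(void)
	{
		/* prev cpu 2 and its sibling 3 are busy, cpu 5 is idle */
		cpu_idle[5] = true;
		printf("old policy picks cpu %d\n", pick_sibling(2));
		printf("new policy picks cpu %d\n", pick_wake_cpu(2));
		return 0;
	}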


I rigged up a test script to run the perf bench sched tests and collect the numbers. Here they are:

4.0

Messaging: 56.934 Total runtime in seconds
Pipe: 105620.762 ops/sec

4.0 + my patch

Messaging: 47.374 Total runtime in seconds
Pipe: 113691.199 ops/sec

so that's ~20% better performance on the Messaging test (which behaves somewhat like HHVM) and ~8% better pipe performance. This box is a 2-socket, 16-core box. I've attached the script I'm using; basically I run each test 5 times, and for the perf bench sched pipe run I launch NR_CPUS/2 instances in parallel.
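
[Editor's note: for reference, the underlying invocations are the stock
perf ones; the 5x repetition and the NR_CPUS/2 fan-out live in the attached
script:]

	perf bench sched messaging	# the "Messaging" numbers above
	perf bench sched pipe		# NR_CPUS/2 of these in parallel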

If you are interested I'd be happy to show you numbers from our HHVM test, but they are less straightforward and require pretty pictures and a guide to reading them. Thanks,

Josef