At Facebook we have a pretty heavily multi-threaded application that is
sensitive to latency. We have been pulling forward the old SD_WAKE_IDLE code
because it gives us a pretty significant performance gain (about 20%). It turns
out this is because there are cases where the scheduler puts our task on a busy
CPU when there are idle CPUs in the system. We verified this by reading
cpu_delay_req_avg_us from the scheduler netlink interface. With our crappy patch
we get much lower numbers than baseline.

SD_BALANCE_WAKE is supposed to find us an idle CPU to run on, but it only
looks for an idle sibling, preferring affinity over all else. This is not
helpful in all cases, and SD_BALANCE_WAKE's job is to find us an idle CPU, not
to guarantee affinity. Fix this by first trying to find an idle sibling, and
then, if the CPU that search returns is not idle, falling through to the logic
that finds an idle CPU. With this patch we get slightly better performance than
with our forward port of SD_WAKE_IDLE.

Thanks,
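
For anyone who wants the intended ordering spelled out, here is a rough
userspace model of it. This is not the kernel code: struct model_sys,
MODEL_NR_CPUS and the model_* helpers are hypothetical names made up for
illustration, and the "domain" array is a simplifying assumption standing in
for the real sched domain topology.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8 /* hypothetical size, illustration only */

/* Toy system state: which CPUs are busy, and which domain each belongs to. */
struct model_sys {
	bool busy[MODEL_NR_CPUS];
	int domain[MODEL_NR_CPUS];
};

/* Step 1: prefer the target, then an idle CPU in the target's domain.
 * Returns the target itself (possibly busy) if no idle sibling exists. */
static int model_idle_sibling(const struct model_sys *s, int target)
{
	if (!s->busy[target])
		return target;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (s->domain[cpu] == s->domain[target] && !s->busy[cpu])
			return cpu;
	return target;
}

/* Step 2: look for any idle CPU in the system; -1 if every CPU is busy. */
static int model_any_idle(const struct model_sys *s)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (!s->busy[cpu])
			return cpu;
	return -1;
}

/* Wakeup placement: affinity first, but fall through if that CPU is busy. */
static int model_select_cpu(const struct model_sys *s, int target)
{
	assert(target >= 0 && target < MODEL_NR_CPUS);

	int cpu = model_idle_sibling(s, target);
	if (!s->busy[cpu])
		return cpu;

	int idle = model_any_idle(s);
	return idle >= 0 ? idle : cpu;
}
```

In this model, a wakeup targeting a CPU whose whole domain is busy still lands
on an idle CPU in another domain, which is the case the old behaviour missed.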