Re: [RFC PATCH 0/4] Gang scheduling in CFS

From: Rik van Riel
Date: Wed Jan 04 2012 - 11:47:31 EST


On 01/04/2012 09:41 AM, Avi Kivity wrote:
> On 01/04/2012 12:52 PM, Nikunj A Dadhania wrote:
>> On Mon, 02 Jan 2012 11:37:22 +0200, Avi Kivity <avi@xxxxxxxxxx> wrote:
>>> On 12/31/2011 04:21 AM, Nikunj A Dadhania wrote:

>>>> GangV2:
>>>>   27.45%  ebizzy  libc-2.12.so       [.] __memcpy_ssse3_back
>>>>   12.12%  ebizzy  [kernel.kallsyms]  [k] clear_page
>>>>    9.22%  ebizzy  [kernel.kallsyms]  [k] __do_page_fault
>>>>    6.91%  ebizzy  [kernel.kallsyms]  [k] flush_tlb_others_ipi
>>>>    4.06%  ebizzy  [kernel.kallsyms]  [k] get_page_from_freelist
>>>>    4.04%  ebizzy  [kernel.kallsyms]  [k] ____pagevec_lru_add
>>>>
>>>> GangBase:
>>>>   45.08%  ebizzy  [kernel.kallsyms]  [k] flush_tlb_others_ipi
>>>>   15.38%  ebizzy  libc-2.12.so       [.] __memcpy_ssse3_back
>>>>    7.00%  ebizzy  [kernel.kallsyms]  [k] clear_page
>>>>    4.88%  ebizzy  [kernel.kallsyms]  [k] __do_page_fault

>>> Looping in flush_tlb_others(). Rik, what trace can we run to find out
>>> why PLE directed yield isn't working as expected?
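
To make the connection explicit: the flush_tlb_others_ipi time above is the
sender side of the IPI-based remote TLB flush. Roughly paraphrased from the
x86 code of this era (field names are from memory and the per-vector locking
is left out), it looks like this:

static void flush_tlb_others_ipi(const struct cpumask *cpumask,
				 struct mm_struct *mm, unsigned long va)
{
	unsigned int sender = this_cpu_read(tlb_vector_offset);
	union smp_flush_state *f = &flush_state[sender];

	f->flush_mm = mm;
	f->flush_va = va;

	if (cpumask_andnot(to_cpumask(f->flush_cpumask), cpumask,
			   cpumask_of(smp_processor_id()))) {
		/* Ask every other CPU that may cache this mm to invalidate. */
		apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
				    INVALIDATE_TLB_VECTOR_START + sender);

		/*
		 * Busy-wait until every target has run its IPI handler and
		 * cleared its bit.  A target vCPU that the host has preempted
		 * cannot clear its bit until it runs again, so this PAUSE
		 * loop is where the sending vCPU burns its time.  PLE is
		 * supposed to catch exactly this loop and let
		 * kvm_vcpu_on_spin() direct a yield at a preempted vCPU.
		 */
		while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
			cpu_relax();
	}

	f->flush_mm = NULL;
	f->flush_va = 0;
}

If directed yield worked as intended, the PAUSE-loop exits taken inside that
while loop should boost the preempted flush targets so the mask clears quickly.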

>> I tried some experiments by adding a pause_loop_exits stat in
>> kvm_vcpu_stat.

> (that's deprecated, we use tracepoints these days for stats)
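
On the stat itself: the kvm_vcpu_stat route presumably looked something like
the sketch below (the actual patch is not quoted here, so the counter name and
hunks are guesses); the existing kvm:kvm_exit tracepoint already records
PAUSE-induced exits, which is why a new debugfs counter isn't really needed.

/* arch/x86/include/asm/kvm_host.h -- hypothetical new counter */
struct kvm_vcpu_stat {
	/* ... existing counters (halt_exits, io_exits, ...) ... */
	u32 pause_exits;
};

/* arch/x86/kvm/x86.c -- expose it next to the other debugfs counters */
struct kvm_stats_debugfs_item debugfs_entries[] = {
	/* ... */
	{ "pause_exits", VCPU_STAT(pause_exits) },
	{ NULL }
};

/* arch/x86/kvm/vmx.c -- count the exit where it is handled */
static int handle_pause(struct kvm_vcpu *vcpu)
{
	++vcpu->stat.pause_exits;	/* hypothetical */
	skip_emulated_instruction(vcpu);
	kvm_vcpu_on_spin(vcpu);
	return 1;
}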

>> Here are some observations for the baseline-only case (8 VMs):
>>
>>               | ple_gap=128 | ple_gap=64 | ple_gap=256 | ple_window=2048
>> --------------+-------------+------------+-------------+----------------
>> EbzyRecords/s |     2247.50 |    2132.75 |     2086.25 |         1835.62
>> PauseExits    |  7928154.00 | 6696342.00 |  7365999.00 |     50319582.00
>>
>> With ple_window = 2048, PauseExits is more than six times the default case.
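
For reference, ple_gap and ple_window are kvm-intel module parameters that are
written straight into the VMCS; the snippet below is paraphrased from
arch/x86/kvm/vmx.c of this era, with the setup pulled into an illustrative
helper for brevity.

/*
 * The CPU treats PAUSEs executed fewer than ple_gap TSC cycles apart as one
 * spin loop; once such a loop has lasted ple_window cycles it forces a
 * PAUSE-loop VM exit, where kvm_vcpu_on_spin() can attempt a directed yield.
 * A smaller ple_window therefore means earlier and more frequent exits,
 * consistent with the PauseExits column above.
 */
#define KVM_VMX_DEFAULT_PLE_GAP		128
#define KVM_VMX_DEFAULT_PLE_WINDOW	4096

static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
module_param(ple_gap, int, S_IRUGO);

static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
module_param(ple_window, int, S_IRUGO);

/* illustrative helper; in vmx.c these writes live in the vcpu setup path */
static void vmx_setup_ple(void)
{
	if (ple_gap) {		/* ple_gap=0 disables PLE exits entirely */
		vmcs_write32(PLE_GAP, ple_gap);
		vmcs_write32(PLE_WINDOW, ple_window);
	}
}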

> So it looks like the default is optimal, at least wrt the cases you
> tested and your test workload.

It depends on the workload.

I believe ebizzy synchronously bounces messages around between
userland threads, and may benefit from lower latency preemption
and re-scheduling.

Workloads like AMQP do asynchronous messaging, and are likely
to benefit from fewer context switches.

I do not know which kind of workload is more prevalent.

Another worry with gang scheduling is scalability. One of
the reasons Linux scales well to larger systems is that a
lot of things are done CPU-locally, without communicating
with other CPUs. Making the scheduling algorithm
system-global has the potential to add a lot of overhead.

Likewise, removing the ability to migrate workloads to idle
CPUs is likely to hurt a lot of real world workloads.

Benchmarks don't care, because they run full-out. However,
users do not run benchmarks nearly as much as they run
actual workloads...

--
All rights reversed