Re: [PATCH v1] sched: fix nohz idle load balancer issues

From: Ingo Molnar
Date: Tue Sep 27 2011 - 03:35:40 EST



* Srivatsa Vaddagiri <vatsa@xxxxxxxxxxxxxxxxxx> wrote:

> While trying to test the recently introduced cpu bandwidth control feature for
> non-realtime tasks, we noticed "more than expected" idle time, which
> dropped considerably when booting with nohz=off. This patch is an attempt
> to fix that discrepancy so that we see little variance in idle time between
> nohz=on and nohz=off.
>
> Test setup:
>
> Machine : 16-cpus (2 Quad-core w/ HT enabled)
> Kernel : Latest code (HEAD at 6e8d0472ea63969e2df77c7e84741495fedf7d9b) found
> at git://tesla.tglx.de/git/linux-2.6-tip
>
> Cgroups :
>
> 5 in number (/L1/L2/C1 - /L1/L2/C5), having {2, 2, 4, 8, 16} tasks
> respectively. /L1 and /L2 were added to the hierarchy to mimic the cgroup
> hierarchy created by libvirt and otherwise do not contain any tasks. Each
> cgroup has cpu.shares proportional to the number of tasks in it. For example,
> /L1/L2/C1's cpu.shares = 2 * 1024 = 2048, C3's cpu.shares = 4096, etc. Further,
> each task is placed in its own (sub-)cgroup with default shares of 1024 and a
> capped usage of 50% CPU.
>
> /L1/L2/C1/C1_1/Task1 -> capped at 50% cpu usage
> /L1/L2/C1/C1_2/Task2 -> capped at 50% cpu usage
> /L1/L2/C2/C2_1/Task3 -> capped at 50% cpu usage
> /L1/L2/C2/C2_2/Task4 -> capped at 50% cpu usage
> /L1/L2/C3/C3_1/Task5 -> capped at 50% cpu usage
> /L1/L2/C3/C3_2/Task6 -> capped at 50% cpu usage
> /L1/L2/C3/C3_3/Task7 -> capped at 50% cpu usage
> /L1/L2/C3/C3_4/Task8 -> capped at 50% cpu usage
> ...
> /L1/L2/C5/C5_16/Task32 -> capped at 50% cpu usage
>
> So we have 32 tasks, each capped at 50% CPU usage, running on a 16-CPU
> system, which one may expect to consume all CPU resources and leave no idle
> time. While that may be an "insane" expectation, the goal is to minimize idle
> time in this situation as much as possible.
>
> I am using a slightly modified version of the script provided at
> https://lkml.org/lkml/2011/6/7/352 to generate this test scenario -
> I can make that available if required.
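> 
> As a rough illustration of what the script sets up (this is not the script
> itself; the mount point /sys/fs/cgroup/cpu, the 100ms CFS period and the
> particular group are just assumptions for the sketch), configuring one of
> the capped per-task groups boils down to:
> 
> 	/* sketch: create /L1/L2/C1/C1_1 and cap it at 50% of one CPU */
> 	#include <stdio.h>
> 	#include <sys/stat.h>
> 	#include <sys/types.h>
> 
> 	static void write_val(const char *path, const char *val)
> 	{
> 		FILE *f = fopen(path, "w");
> 
> 		if (!f) {
> 			perror(path);
> 			return;
> 		}
> 		fprintf(f, "%s\n", val);
> 		fclose(f);
> 	}
> 
> 	int main(void)
> 	{
> 		mkdir("/sys/fs/cgroup/cpu/L1", 0755);
> 		mkdir("/sys/fs/cgroup/cpu/L1/L2", 0755);
> 		mkdir("/sys/fs/cgroup/cpu/L1/L2/C1", 0755);
> 		mkdir("/sys/fs/cgroup/cpu/L1/L2/C1/C1_1", 0755);
> 
> 		/* C1 holds 2 tasks -> 2 * 1024 shares */
> 		write_val("/sys/fs/cgroup/cpu/L1/L2/C1/cpu.shares", "2048");
> 
> 		/*
> 		 * Per-task group: default shares (1024), capped at 50% of
> 		 * one CPU via CFS bandwidth: quota = period / 2.
> 		 */
> 		write_val("/sys/fs/cgroup/cpu/L1/L2/C1/C1_1/cpu.cfs_period_us",
> 			  "100000");
> 		write_val("/sys/fs/cgroup/cpu/L1/L2/C1/C1_1/cpu.cfs_quota_us",
> 			  "50000");
> 
> 		/* the workload task's pid is then written into .../C1_1/tasks */
> 		return 0;
> 	}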
>
> Idle time was sampled every second (using vmstat) over a 60-second window,
> with the following results:
>
> Idle time          Average    Std-deviation    Min    Max
> ==========================================================
> 
> nohz=off              4%          0.5%          3%     5%
> nohz=on              10%          2.4%          5%    18%
> nohz=on + patch      5.3%         1.3%          3%     9%
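> 
> (The idle figures come from vmstat's "id" column; a rough equivalent that
> samples /proc/stat directly at the same 1-second interval over the same
> 60-second window would be something like the sketch below.)
> 
> 	/* sketch: print system-wide idle% once per second, like vmstat "id" */
> 	#include <stdio.h>
> 	#include <unistd.h>
> 
> 	int main(void)
> 	{
> 		unsigned long long prev[8] = {0}, cur[8];
> 		int i, iter;
> 
> 		for (iter = 0; iter <= 60; iter++) {
> 			unsigned long long total = 0;
> 			FILE *f = fopen("/proc/stat", "r");
> 
> 			if (!f)
> 				return 1;
> 			/* cpu  user nice system idle iowait irq softirq steal */
> 			if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
> 				   &cur[0], &cur[1], &cur[2], &cur[3],
> 				   &cur[4], &cur[5], &cur[6], &cur[7]) != 8) {
> 				fclose(f);
> 				return 1;
> 			}
> 			fclose(f);
> 
> 			for (i = 0; i < 8; i++)
> 				total += cur[i] - prev[i];
> 
> 			/* skip the first sample - it covers time since boot */
> 			if (iter && total)
> 				printf("idle: %.1f%%\n",
> 				       100.0 * (cur[3] - prev[3]) / total);
> 
> 			for (i = 0; i < 8; i++)
> 				prev[i] = cur[i];
> 			sleep(1);
> 		}
> 		return 0;
> 	}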
>
> The patch cuts down idle time significantly when the kernel is booted
> with 'nohz=on' (which is good for saving power when idle).

What are the running tasks doing - are they just burning CPU time? If
the tasks do something more complex, do you also have a measure of how
much work gets done by the workload per second?

Percentage changes in that metric would be nice to include in an
additional column - that way we can see that it's not only idle time
that has gone down, but that workload performance has gone up too.

In fact, even if there is only a CPU-burning loop in the workload, it
would be nice to make it somewhat more sophisticated by letting it
process some larger array that has a cache footprint. This mimics
real workloads, which don't just spin burning CPU time but do real data
processing.
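
Something along these lines would do (just an illustrative sketch - the
buffer size, the 60-second runtime and the "passes" metric are arbitrary
choices, not anything your script has to use):

	/* sketch: burn CPU by walking a buffer larger than the LLC and
	 * report passes per minute as a crude "work done" metric */
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	#define BUF_SIZE (64 * 1024 * 1024)	/* assumed to exceed the LLC */

	int main(void)
	{
		volatile unsigned char *buf = malloc(BUF_SIZE);
		unsigned long passes = 0;
		time_t start = time(NULL);
		size_t i;

		if (!buf)
			return 1;

		while (time(NULL) - start < 60) {
			unsigned long sum = 0;

			/* touch one byte per (assumed 64-byte) cache line */
			for (i = 0; i < BUF_SIZE; i += 64)
				sum += buf[i];
			(void)sum;	/* the volatile reads are the work */
			passes++;
		}
		printf("%lu passes over a %d MB buffer in 60s\n",
		       passes, BUF_SIZE >> 20);
		free((void *)buf);
		return 0;
	}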

For any non-trivial workload it's possible to reduce idle time without
much increase in work done - and in fact it's possible to decrease both
idle time *and* work done - so we need to see more clearly here and
make sure it's really an improvement.

Thanks,

Ingo
