[RFC PATCH v1 0/4] CFS Bandwidth Control

From: Paul Turner
Date: Fri Feb 12 2010 - 21:56:07 EST


Hi all,

Please find attached an early patchset for an alternate approach to CFS
bandwidth provisioning. Bharata's excellent original RFC motivating discussion
on this topic can be found at: http://lkml.org/lkml/2009/6/4/24.

As discussed in reply to the latest iteration of Bharata's patchset (at:
http://lkml.org/lkml/2010/1/5/44) we have significant concerns with the
adaptation of the RT-bandwidth code to the general case. In that discussion
an alternate approach was proposed in which a global per-taskgroup pool of
quota is maintained, with runqueues acquiring time directly from it (as
opposed to the RT all-rq-mapping-to-all-rq scheme). In particular, we felt
the original approach did not scale well with an increasing number of cores
or with small reservations.

The skeleton of our approach is as follows:
- As above, we maintain a global per-tg pool of unassigned quota. On it we
track the bandwidth period, the quota per period, and the runtime remaining
in the current period; as bandwidth is consumed within a period it is
decremented from runtime. Runtime is currently synchronized using a
spinlock; in the current implementation there's no reason this couldn't be
done with atomic ops instead, but the spinlock allows a little more
flexibility for experimentation with other schemes.
- When a cfs_rq participating in a bandwidth-constrained task_group executes,
it acquires time from the global pool in chunks of size
sysctl_sched_cfs_bandwidth_slice (currently 10ms by default). This
synchronizes under rq->lock and is part of the update_curr path. (A toy
sketch of this acquisition path follows the list below.)
- Throttled entities are dequeued immediately (as opposed to deferring this
operation to the put path); this avoids some potentially poor load-balancer
interactions and preserves the 'verbiage' of the put_task semantic.
Throttled entities are gated from participating in the tree at the
{enqueue, dequeue}_entity level. They are also skipped for load
balance in the same manner as Bharata's patch series employs.
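
To make the pool/slice interaction above a little more concrete, here is a
minimal user-space sketch (plain C). All identifiers in it (tg_pool,
local_quota, acquire_slice, ...) are invented for this illustration and a
pthread mutex stands in for the kernel spinlock; this is not the code in the
patches, just the shape of the accounting.

/*
 * Toy illustration of the global-pool / local-slice accounting.
 * Compile with: gcc -pthread sketch.c
 */
#include <stdio.h>
#include <pthread.h>

#define NSEC_PER_USEC 1000ULL

/* per-taskgroup global pool (cpu.cfs_period_us / cpu.cfs_quota_us) */
struct tg_pool {
	pthread_mutex_t lock;
	unsigned long long quota;	/* quota per period, ns */
	unsigned long long runtime;	/* unassigned runtime left in period, ns */
};

/* quota handed out to one cpu's cfs_rq */
struct local_quota {
	unsigned long long remaining;	/* locally usable runtime, ns */
};

static const unsigned long long slice_ns = 10000 * NSEC_PER_USEC; /* 10ms */

/* period refresh: refill the global pool (naive reset, no generations) */
static void pool_refresh(struct tg_pool *p)
{
	pthread_mutex_lock(&p->lock);
	p->runtime = p->quota;
	pthread_mutex_unlock(&p->lock);
}

/*
 * Called from the local accounting path (update_curr in the real code)
 * when local quota runs dry: pull up to one slice from the global pool.
 * A return of 0 means the pool is empty and the cfs_rq would be throttled.
 */
static unsigned long long acquire_slice(struct tg_pool *p, struct local_quota *lq)
{
	unsigned long long grant;

	pthread_mutex_lock(&p->lock);
	grant = p->runtime < slice_ns ? p->runtime : slice_ns;
	p->runtime -= grant;
	pthread_mutex_unlock(&p->lock);

	lq->remaining += grant;
	return grant;
}

int main(void)
{
	struct tg_pool pool = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.quota = 50000 * NSEC_PER_USEC,	/* 50ms of quota per period */
	};
	struct local_quota cpu0 = { 0 };

	pool_refresh(&pool);

	/* one cpu repeatedly drains its local quota and returns to the pool */
	while (acquire_slice(&pool, &cpu0)) {
		printf("granted %llu ns locally, %llu ns left in pool\n",
		       cpu0.remaining, pool.runtime);
		cpu0.remaining = 0;	/* pretend it was fully consumed */
	}
	printf("pool exhausted: this cfs_rq would now be throttled\n");
	return 0;
}

When acquire_slice() returns 0 the entity has hit the group's limit for the
current period and is dequeued as described above; the period refresh later
returns it to service.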

Interface:
----------
Two new cgroupfs files are added to the cpu subsystem:
- cpu.cfs_period_us : period over which bandwidth is to be regulated
- cpu.cfs_quota_us : bandwidth available for consumption per period

One important interface change that this introduces (versus the rate limits
proposal) is that the defined bandwidth becomes an absolute quantifier.

e.g. a quota of 5 seconds (cpu.cfs_quota_us=5000000) on a period of 1 second
(cpu.cfs_period_us=1000000) would allow 5 wall-seconds of cpu time to be
consumed every 1 wall-second (i.e. the group may use up to 5 cpus' worth of
time each period).
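
For illustration, the example above can also be configured programmatically
by writing the two values into the cgroupfs files. In the sketch below the
mount point /dev/cgroup and the group name mygroup are assumptions; only the
cpu.cfs_period_us and cpu.cfs_quota_us file names come from the interface
described above, and echoing the values from a shell works just as well.

#include <stdio.h>

/* write a single value into a cgroupfs control file */
static int write_val(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	/* assumed mount point and group; adjust for your setup */
	write_val("/dev/cgroup/mygroup/cpu.cfs_period_us", "1000000"); /* 1s period */
	write_val("/dev/cgroup/mygroup/cpu.cfs_quota_us", "5000000");  /* 5s of cpu per period */
	return 0;
}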

Benchmarks:
-----------
In the benchmarks below we compared the two approaches with each other and
with a third 'ideal' case of using cpu affinities to restrict the group to an
equivalent rate. Tests were performed on a 16-core AMD machine with no other
restrictions on affinity. [The 'bandwidth' column is quota / period, i.e. the
equivalent number of cpus; on this machine a bandwidth of N corresponds to an
ideal utilization of ~6.25*N%.]

Pardon the wall of raw results that follows; if you prefer to view the data in
graph format it's available at:
http://docs.google.com/View?id=dgjvsb78_6dhktmzhs

Workload 1: while(1)x16
Measured: % Utilization

period=100ms, workload=while(1)x16
bandwidth  affinity %busy  cfs bandwidth %busy (pct%)  cfs hardlimits %busy (pct%)
1 6.28 6.67 (+6.2%) 6.87 (+9.4%)
2 12.57 13.20 (+5.0%) 11.67 (-7.2%)
3 18.80 19.42 (+3.3%) 16.13 (-14.2%)
4 25.02 25.60 (+2.3%) 19.62 (-21.6%)
5 31.32 31.74 (+1.3%) 24.05 (-23.2%)
6 37.47 37.92 (+1.2%) 30.41 (-18.8%)
7 43.71 43.95 (+0.5%) 34.11 (-22.0%)
8 50.08 50.14 (+0.1%) 39.73 (-20.7%)
9 56.24 56.15 (-0.2%) 43.63 (-22.4%)
10 62.38 62.33 (-0.1%) 47.50 (-23.9%)
11 68.58 67.81 (-1.1%) 45.09 (-34.3%)
12 74.92 72.13 (-3.7%) 47.98 (-36.0%)
13 81.21 74.11 (-8.7%) 56.62 (-30.3%)
14 87.40 84.06 (-3.8%) 51.77 (-40.8%)
15 93.69 83.56 (-10.8%) 53.61 (-42.8%)
16 99.93 99.88 (-0.1%) 61.26 (-38.7%)

period=250ms, workload=while(1)x16
bandwidth  affinity %busy  cfs bandwidth %busy (pct%)  cfs hardlimits %busy (pct%)
1 6.28 6.59 (+4.9%) 6.53 (+4.0%)
2 12.49 12.81 (+2.6%) 12.78 (+2.3%)
3 18.74 19.03 (+1.5%) 18.08 (-3.5%)
4 25.11 25.23 (+0.5%) 22.72 (-9.5%)
5 31.26 31.45 (+0.6%) 30.27 (-3.2%)
6 37.58 37.69 (+0.3%) 34.69 (-7.7%)
7 43.79 43.90 (+0.3%) 37.91 (-13.4%)
8 49.99 50.10 (+0.2%) 43.00 (-14.0%)
9 56.34 56.29 (-0.1%) 46.90 (-16.8%)
10 62.45 62.57 (+0.2%) 50.22 (-19.6%)
11 68.70 68.49 (-0.3%) 59.10 (-14.0%)
12 74.86 74.97 (+0.1%) 58.59 (-21.7%)
13 81.17 80.92 (-0.3%) 68.39 (-15.7%)
14 87.34 87.17 (-0.2%) 71.73 (-17.9%)
15 93.66 91.74 (-2.0%) 72.26 (-22.8%)
16 99.89 99.79 (-0.1%) 66.16 (-33.8%)

period=500ms, workload=while(1)x16
bandwidth  affinity %busy  cfs bandwidth %busy (pct%)  cfs hardlimits %busy (pct%)
1 6.27 6.43 (+2.6%) 6.56 (+4.6%)
2 12.48 12.69 (+1.7%) 12.68 (+1.6%)
3 19.04 18.90 (-0.7%) 18.99 (-0.3%)
4 25.05 25.12 (+0.3%) 25.22 (+0.7%)
5 31.33 31.28 (-0.2%) 31.45 (+0.4%)
6 37.39 37.66 (+0.7%) 37.70 (+0.8%)
7 43.73 43.77 (+0.1%) 43.83 (+0.2%)
8 50.13 50.02 (-0.2%) 49.70 (-0.9%)
9 56.29 56.24 (-0.1%) 56.16 (-0.2%)
10 62.47 62.54 (+0.1%) 61.07 (-2.2%)
11 68.69 68.79 (+0.1%) 67.70 (-1.4%)
12 74.98 74.89 (-0.1%) 75.10 (+0.2%)
13 81.14 81.24 (+0.1%) 72.29 (-10.9%)
14 87.43 86.91 (-0.6%) 85.79 (-1.9%)
15 93.77 93.36 (-0.4%) 82.12 (-12.4%)
16 99.90 99.89 (-0.0%) 91.27 (-8.6%)

Workload 2: sysbench prime computation (100k primes, 16 threads)
Measured: time to completion
period=100ms:
bandwidth  affinity time(s)  cfs bandwidth time(s) (pct%)  cfs hardlimits time(s) (pct%)
1 422.44 435.19 (+3.0%) 405.66 (-4.0%)
2 211.54 217.40 (+2.8%) 255.73 (+20.9%)
3 140.89 146.98 (+4.3%) 178.67 (+26.8%)
4 105.56 110.94 (+5.1%) 136.83 (+29.6%)
5 84.45 90.88 (+7.6%) 103.81 (+22.9%)
6 70.46 72.34 (+2.7%) 90.22 (+28.0%)
7 60.38 59.97 (-0.7%) 81.56 (+35.1%)
8 52.79 54.81 (+3.8%) 66.93 (+26.8%)
9 46.92 48.33 (+3.0%) 59.08 (+25.9%)
10 42.27 42.82 (+1.3%) 51.38 (+21.6%)
11 38.47 45.50 (+18.3%) 54.27 (+41.1%)
12 35.19 35.88 (+2.0%) 52.35 (+48.8%)
13 32.44 41.01 (+26.4%) 44.11 (+36.0%)
14 30.04 34.95 (+16.4%) 46.40 (+54.5%)
15 27.98 31.13 (+11.2%) 42.98 (+53.6%)
16 26.25 26.23 (-0.1%) 46.69 (+77.9%)

period=250ms:
bandwidth  affinity time(s)  cfs bandwidth time(s) (pct%)  cfs hardlimits time(s) (pct%)
1 422.45 424.09 (+0.4%) 417.32 (-1.2%)
2 211.53 216.60 (+2.4%) 218.47 (+3.3%)
3 140.94 141.91 (+0.7%) 150.92 (+7.1%)
4 105.55 104.59 (-0.9%) 113.43 (+7.5%)
5 84.42 83.61 (-1.0%) 92.40 (+9.5%)
6 70.53 69.64 (-1.3%) 78.60 (+11.4%)
7 60.41 67.96 (+12.5%) 70.19 (+16.2%)
8 52.76 52.21 (-1.1%) 55.40 (+5.0%)
9 46.91 46.43 (-1.0%) 53.49 (+14.0%)
10 42.38 41.77 (-1.4%) 50.64 (+19.5%)
11 38.46 38.39 (-0.2%) 47.19 (+22.7%)
12 35.22 34.94 (-0.8%) 42.02 (+19.3%)
13 32.45 32.12 (-1.0%) 38.57 (+18.9%)
14 30.08 30.20 (+0.4%) 36.33 (+20.8%)
15 27.96 28.43 (+1.7%) 34.41 (+23.1%)
16 26.20 26.22 (+0.1%) 37.20 (+42.0%)

period=500ms
bandwidth  affinity time(s)  cfs bandwidth time(s) (pct%)  cfs hardlimits time(s) (pct%)
1 422.45 423.16 (+0.2%) 412.34 (-2.4%)
2 211.57 208.95 (-1.2%) 210.60 (-0.5%)
3 140.90 139.31 (-1.1%) 138.48 (-1.7%)
4 105.54 104.45 (-1.0%) 104.14 (-1.3%)
5 84.44 83.45 (-1.2%) 82.91 (-1.8%)
6 70.44 69.41 (-1.5%) 69.33 (-1.6%)
7 60.42 59.37 (-1.7%) 59.43 (-1.6%)
8 52.77 51.92 (-1.6%) 52.41 (-0.7%)
9 46.91 46.13 (-1.7%) 46.29 (-1.3%)
10 42.36 41.69 (-1.6%) 41.60 (-1.8%)
11 38.44 37.95 (-1.3%) 38.31 (-0.3%)
12 35.20 34.82 (-1.1%) 35.03 (-0.5%)
13 32.47 34.07 (+4.9%) 32.17 (-0.9%)
14 30.05 29.89 (-0.6%) 31.30 (+4.1%)
15 28.00 27.98 (-0.1%) 29.42 (+5.1%)
16 26.20 26.22 (+0.1%) 30.04 (+14.6%)

Again, this data is consumable more directly (graphed) at:
http://docs.google.com/View?id=dgjvsb78_6dhktmzhs

Todo:
-----
- hierarchical nr_tasks_running accounting:
This is a deficiency currently shared with SCHED_RT rate limiting. When an
entity is throttled, the running tasks it owns are not subtracted from
rq->nr_running. This results in us missing idle_balance() due to phantom
tasks, and in the load balancer's per-task weight calculations being
incorrect.

The code to fix this adds complexity which was both increasing the size of
the initial review for this patchset and is probably best reviewed
independently of this feature's scope. To that end we'll post a separate
patch for this issue against the current RT rate-limiting code and merge
whatever approach is converged on here as appropriate.

- throttle statistics:
Some statistics regarding the frequency and duration of throttling are
definitely in order.

- Expiration of quota from a previous period?
The local quota assigned to a cfs_rq currently represents some 'slack' in
the system. While the rate of overall consumption by the system is capped
by the input of bandwidth into the global pool, it is possible to locally
exceed bandwidth within a period if there is local quota remaining from a
previous period.

The maximum slack in the system at any time is bounded above by
weight(tg->allowed_cpus) * cfs_bandwidth_slice (e.g. 16 allowed cpus at the
default 10ms slice bounds the slack at 160ms). There are a number of
possible approaches to managing this slack, from naively resetting local
quota_assigned at period refresh to tracking when quota was issued using
generations (a toy sketch of the generation idea appears at the end of this
item).

We have an experimental patch which adds quota generations (the latter) to
this series; however, I just realized there's a potential rq->lock inversion
within it currently, so I'm going to have to defer that to the next version.
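
For illustration only, here is a standalone sketch of the generation idea
(it is not the deferred experimental patch, and every identifier in it is
invented): the global pool bumps a counter at each period refresh, and a
cfs_rq lazily discards any local quota that was issued under an older
generation before charging against it.

#include <stdbool.h>
#include <stdio.h>

/* per-taskgroup pool, with a generation bumped at every period refresh */
struct bw_pool {
	unsigned long long quota;	/* quota per period, ns */
	unsigned long long runtime;	/* unassigned runtime left, ns */
	unsigned long generation;
};

/* per-cfs_rq local quota, tagged with the generation it was issued under */
struct bw_local {
	unsigned long long remaining;	/* ns */
	unsigned long generation;
};

/* period timer: refill the pool and invalidate all outstanding slices */
static void bw_pool_refresh(struct bw_pool *p)
{
	p->runtime = p->quota;
	p->generation++;
}

/*
 * Before charging execution time locally, drop quota issued in an earlier
 * period so stale slack is not carried forward.
 */
static bool bw_local_usable(struct bw_pool *p, struct bw_local *lq)
{
	if (lq->generation != p->generation) {
		lq->remaining = 0;		/* stale: must reacquire */
		lq->generation = p->generation;
		return false;
	}
	return lq->remaining > 0;
}

int main(void)
{
	struct bw_pool pool = { .quota = 50000000ULL };		/* 50ms/period */
	struct bw_local cpu0 = { .remaining = 3000000ULL };	/* 3ms leftover */

	bw_pool_refresh(&pool);		/* new period: generation advances */
	printf("leftover quota still usable? %s\n",
	       bw_local_usable(&pool, &cpu0) ? "yes" : "no");
	return 0;
}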

Open Questions:
---------------
- per-tg slice tunable?
Does it make sense to allow tuning of the slice at the per-tg level, or is a
global tunable sufficient?
- I suspect 5ms may be a more sensible / better-performing default slice, but
the benchmarks above were performed at 10ms, so be it; maybe in v2 :).
It's also possible some sort of limited step function may make sense here
(e.g. favour repeated requests with larger slices, and infrequent or
likely-to-yield requests with smaller ones).

Acknowledgements:
-----------------
We would like to thank Bharata B Rao and Dhaval Giani for both the discussion
and their proposed patches; many elements of our patchset are directly based
on their work in this area.

Thanks also to Ken Chen for early review and comments.

Thanks for reading this far :)

- Paul and Nikhil


---

Paul Turner (4):
sched: introduce primitives to account for CFS bandwidth tracking
sched: accumulate per-cfs_rq cpu usage
sched: throttle cfs_rq entities which exceed their local quota
sched: unthrottle cfs_rq(s) who ran out of quota at period refresh


include/linux/sched.h | 4 +
init/Kconfig | 9 +
kernel/sched.c | 317 +++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched_fair.c | 206 ++++++++++++++++++++++++++++++--
kernel/sched_rt.c | 19 ---
kernel/sysctl.c | 10 ++
6 files changed, 512 insertions(+), 53 deletions(-)

--