[RFC v2 PATCH 0/8] CFS Hard limits - v2

From: Bharata B Rao
Date: Wed Sep 30 2009 - 08:51:17 EST


Hi,

Here is the v2 post of the hard limits feature for the CFS group scheduler.
This RFC post mainly adds a runtime borrowing feature and a new locking
scheme to protect the CFS runtime related fields.

It would be nice to have some comments on this set!

Changes
-------

RFC v2:
- Upgraded to 2.6.31.
- Added CFS runtime borrowing.
- New locking scheme
The hard limit specific fields of cfs_rq (cfs_runtime, cfs_time and
cfs_throttled) were being protected by rq->lock. This simple scheme will
not work once runtime rebalancing is introduced, because rebalancing needs
to look at these fields on other CPUs, which would require acquiring the
rq->lock of those CPUs. That is not feasible from update_curr().
Hence introduce a separate lock (rq->runtime_lock) to protect these
fields of all cfs_rq under it.
- Handle the task wakeup in a throttled group correctly.
- Make CFS_HARD_LIMITS dependent on CGROUP_SCHED (Thanks to Andrea Righi)
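
For illustration only, the new locking scheme described in the list above can
be sketched in user-space C. The field names (cfs_runtime, cfs_time,
cfs_throttled) and the runtime_lock name come from this mail; everything else
is an assumption made for the sketch: the struct layouts, the helper names,
the C11 atomic_flag standing in for a kernel spinlock, the "throttle when
cfs_time exceeds cfs_runtime" rule, and the borrowing policy. It is not the
code from the patches.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for a kernel spinlock. */
struct lock_sketch {
	atomic_flag held;
};

static void lock_sketch_acquire(struct lock_sketch *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		; /* spin until the current holder releases */
}

static void lock_sketch_release(struct lock_sketch *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Field names follow the cover letter; types and layout are assumed. */
struct cfs_rq_sketch {
	uint64_t cfs_runtime;	/* runtime allowed in the current period */
	uint64_t cfs_time;	/* runtime consumed in the current period */
	bool cfs_throttled;
};

struct rq_sketch {
	struct lock_sketch lock;		/* stands in for rq->lock; never taken below */
	struct lock_sketch runtime_lock;	/* stands in for rq->runtime_lock */
	struct cfs_rq_sketch cfs;
};

static void rq_sketch_init(struct rq_sketch *rq, uint64_t runtime)
{
	atomic_flag_clear(&rq->lock.held);
	atomic_flag_clear(&rq->runtime_lock.held);
	rq->cfs.cfs_runtime = runtime;
	rq->cfs.cfs_time = 0;
	rq->cfs.cfs_throttled = false;
}

/* Account delta units of execution; returns true if the group is throttled. */
static bool charge_runtime(struct rq_sketch *rq, uint64_t delta)
{
	bool throttled;

	lock_sketch_acquire(&rq->runtime_lock);
	rq->cfs.cfs_time += delta;
	if (rq->cfs.cfs_time > rq->cfs.cfs_runtime)	/* assumed throttle rule */
		rq->cfs.cfs_throttled = true;
	throttled = rq->cfs.cfs_throttled;
	lock_sketch_release(&rq->runtime_lock);
	return throttled;
}

/*
 * Move up to 'want' units of unused runtime from src to dst. Only the
 * runtime locks are taken, one at a time, so the remote rq->lock is
 * never needed.
 */
static uint64_t borrow_runtime(struct rq_sketch *dst, struct rq_sketch *src,
			       uint64_t want)
{
	uint64_t spare = 0, got;

	assert(dst != src);

	lock_sketch_acquire(&src->runtime_lock);
	if (src->cfs.cfs_runtime > src->cfs.cfs_time)
		spare = src->cfs.cfs_runtime - src->cfs.cfs_time;
	got = spare < want ? spare : want;
	src->cfs.cfs_runtime -= got;
	lock_sketch_release(&src->runtime_lock);

	lock_sketch_acquire(&dst->runtime_lock);
	dst->cfs.cfs_runtime += got;
	lock_sketch_release(&dst->runtime_lock);
	return got;
}
```

Taking the two runtime locks one after the other, rather than nested, is a
choice made for the sketch to sidestep lock ordering questions; the patches
may well do this differently.
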

RFC v1:
- First version of the patches with minimal features was posted at
http://lkml.org/lkml/2009/8/25/128

RFC v0:
- The CFS hard limits proposal was first posted at
http://lkml.org/lkml/2009/6/4/24

Testing and Benchmark numbers
-----------------------------
- This patchset has seen very minimal testing on a 24-way machine and is
expected to have bugs. I need to test this under more test scenarios.
- I have run a few common benchmarks to see if my patches introduce any visible
overhead. I am aware that the number of runs or the combinations I have
used may not be ideal, but the intention in this early stage is to catch any
serious regressions that the patches would have introduced.
- I plan to get numbers from more benchmarks in future releases. Any inputs
on specific benchmarks to try would be helpful.

- hackbench (hackbench -pipe N)
(hackbench was run as part of a group under root group)
-----------------------------------------------------------------------
                              Time
      -----------------------------------------------------------------
N     CFS_HARD_LIMITS=n    CFS_HARD_LIMITS=y     CFS_HARD_LIMITS=y
                           (infinite runtime)    (BW=450000/500000)
-----------------------------------------------------------------------
10    0.475                0.384                 0.253
20    0.610                0.670                 0.692
50    1.250                1.201                 1.295
100   1.981                2.174                 1.583
-----------------------------------------------------------------------
- BW = Bandwidth = runtime/period
- Infinite runtime means no hard limiting
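
As a quick check of the BW column: 450000/500000 works out to 0.9, i.e. a
bandwidth of 90%. A minimal sketch of that arithmetic follows; the helper name
is hypothetical, and the units of runtime and period are not stated in this
mail, so the function treats them as plain numbers in the same unit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bandwidth = runtime/period, as a whole percentage. */
static unsigned int bandwidth_percent(uint64_t runtime, uint64_t period)
{
	assert(period > 0);	/* a zero period has no meaningful bandwidth */
	return (unsigned int)(runtime * 100 / period);
}
```

For example, bandwidth_percent(450000, 500000) evaluates to 90.
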

- lmbench (lat_ctx -N 5 -s <size_in_kb> N)

(i) size_in_kb = 1024
-----------------------------------------------------------------------
                     Context switch time (us)
      -----------------------------------------------------------------
N     CFS_HARD_LIMITS=n    CFS_HARD_LIMITS=y     CFS_HARD_LIMITS=y
                           (infinite runtime)    (BW=450000/500000)
-----------------------------------------------------------------------
10    315.87               330.19                317.04
100   675.52               699.90                698.50
500   775.01               772.86                772.30
-----------------------------------------------------------------------

(ii) size_in_kb = 2048
-----------------------------------------------------------------------
                     Context switch time (us)
      -----------------------------------------------------------------
N     CFS_HARD_LIMITS=n    CFS_HARD_LIMITS=y     CFS_HARD_LIMITS=y
                           (infinite runtime)    (BW=450000/500000)
-----------------------------------------------------------------------
10    1319.01              1332.16               1328.09
100   1400.77              1372.67               1382.27
500   1479.40              1524.57               1615.84
-----------------------------------------------------------------------

- kernbench

Average Half load -j 12 Run (std deviation):
------------------------------------------------------------------------------
         CFS_HARD_LIMITS=n      CFS_HARD_LIMITS=y      CFS_HARD_LIMITS=y
                                (infinite runtime)     (BW=450000/500000)
------------------------------------------------------------------------------
Elapsed  5.716 (0.278711)       6.06 (0.479322)        5.41 (0.360694)
User     20.464 (2.22087)       22.978 (3.43738)       18.486 (2.60754)
System   14.82 (1.52086)        16.68 (2.3438)         13.514 (1.77074)
% CPU    615.2 (41.1667)        651.6 (43.397)         588.4 (42.0214)
CtxSwt   2727.8 (243.19)        3030.6 (425.338)       2536 (302.498)
Sleeps   4981.4 (442.337)       5532.2 (847.27)        4554.6 (510.532)
------------------------------------------------------------------------------

Average Optimal load -j 96 Run (std deviation):
------------------------------------------------------------------------------
         CFS_HARD_LIMITS=n      CFS_HARD_LIMITS=y      CFS_HARD_LIMITS=y
                                (infinite runtime)     (BW=450000/500000)
------------------------------------------------------------------------------
Elapsed  4.826 (0.276641)       4.776 (0.291599)       5.13 (0.50448)
User     21.278 (2.67999)       22.138 (3.2045)        21.988 (5.63116)
System   19.213 (5.38314)       19.796 (4.32574)       20.407 (8.53682)
% CPU    778.3 (184.522)        786.1 (154.295)        803.1 (244.865)
CtxSwt   2906.5 (387.799)       3052.1 (397.15)        3030.6 (765.418)
Sleeps   4576.6 (565.383)       4796 (990.278)         4576.9 (625.933)
------------------------------------------------------------------------------

Average Maximal load -j Run (std deviation):
------------------------------------------------------------------------------
         CFS_HARD_LIMITS=n      CFS_HARD_LIMITS=y      CFS_HARD_LIMITS=y
                                (infinite runtime)     (BW=450000/500000)
------------------------------------------------------------------------------
Elapsed  5.13 (0.530236)        5.062 (0.0408656)      4.94 (0.229891)
User     22.7293 (4.37921)      22.9973 (2.86311)      22.5507 (4.78016)
System   21.966 (6.81872)       21.9713 (4.72952)      22.0287 (7.39655)
% CPU    860 (202.295)          859.8 (164.415)        864.467 (218.721)
CtxSwt   3154.27 (659.933)      3172.93 (370.439)      3127.2 (657.224)
Sleeps   4602.6 (662.155)       4676.67 (813.274)      4489.2 (542.859)
------------------------------------------------------------------------------

Features TODO
-------------
- CFS runtime borrowing still needs some work. In particular, runtime
redistribution when a CPU goes offline needs to be handled.
- Bandwidth inheritance support (long term, not under consideration currently)
- This implementation doesn't work for the user group scheduler. Since the
user group scheduler will eventually go away, I don't plan to work on this.

Implementation TODO
-------------------
- It is possible to share some of the bandwidth handling code with RT, but
the intention of this post is to show the changes associated with hard limits.
Hence the sharing/cleanup will be done down the line when this patchset
itself becomes more acceptable.
- When a dequeued entity is enqueued back, I don't change its vruntime. The
entity might get undue advantage due to its old (lower) vruntime. Need to
address this.
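
For illustration only, one possible way to address the stale vruntime issue
in the last item above is to clamp the entity's vruntime to the runqueue's
min_vruntime when it is enqueued back. This is an assumption about a possible
fix, not something this patchset does, and requeue_vruntime() and
later_vruntime() are hypothetical names. The comparison uses a signed
difference so it stays correct across 64-bit wraparound, in the same spirit
as the existing max_vruntime() helper in kernel/sched_fair.c.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe "later of two vruntimes": compare via the signed difference. */
static uint64_t later_vruntime(uint64_t a, uint64_t b)
{
	int64_t delta = (int64_t)(b - a);

	return delta > 0 ? b : a;
}

/*
 * Hypothetical: vruntime to use when a previously dequeued entity is
 * enqueued back. An entity that fell behind min_vruntime while it was
 * off the runqueue is pulled up to min_vruntime, so it gets no credit
 * for the time it was away; an entity already ahead is left alone.
 */
static uint64_t requeue_vruntime(uint64_t se_vruntime, uint64_t min_vruntime)
{
	return later_vruntime(se_vruntime, min_vruntime);
}
```

With this sketch, requeue_vruntime(100, 500) gives 500, while
requeue_vruntime(700, 500) leaves the entity at 700.
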

Patches description
-------------------
This post has the following patches:

1/8 sched: Rename sched_rt_period_mask() and use it in CFS also
2/8 sched: Maintain aggregated tasks count in cfs_rq at each hierarchy level
3/8 sched: Bandwidth initialization for fair task groups
4/8 sched: Enforce hard limits by throttling
5/8 sched: Unthrottle the throttled tasks
6/8 sched: Add throttle time statistics to /proc/sched_debug
7/8 sched: CFS runtime borrowing
8/8 sched: Hard limits documentation

 Documentation/scheduler/sched-cfs-hard-limits.txt |   52 ++
 include/linux/sched.h                             |    9
 init/Kconfig                                      |   13
 kernel/sched.c                                    |  427 +++++++++++++++++++
 kernel/sched_debug.c                              |   21
 kernel/sched_fair.c                               |  432 +++++++++++++++++++-
 kernel/sched_rt.c                                 |   22 -
 7 files changed, 932 insertions(+), 44 deletions(-)

Regards,
Bharata.