[PATCH V4 00/15] blk-throttle: add .high limit

From: Shaohua Li
Date: Mon Nov 14 2016 - 17:22:56 EST


Hi,

The background is that we don't have an I/O scheduler for blk-mq yet, so we can't
prioritize processes/cgroups. This patch set adds basic arbitration between
cgroups to blk-throttle by introducing a new limit, io.high. It's only for
cgroup2.

io.max is a hard limit: cgroups with a max limit never dispatch more IO than
that limit. io.high, by contrast, is a best-effort limit: cgroups with a high
limit can run above it at appropriate times. Specifically, if all cgroups have
reached their high limits, all of them may run above those limits. If any
cgroup runs under its high limit, all other cgroups are throttled to their
high limits.

An example usage: a high-priority cgroup with a high io.high limit and a
low-priority cgroup with a low io.high limit. If the high-priority cgroup isn't
running, the low-priority cgroup can run above its high limit, so bandwidth
isn't wasted. When the high-priority cgroup runs and is below its high limit,
the low-priority cgroup is throttled to its high limit, protecting the
high-priority cgroup's share of resources. If both cgroups reach their high
limits, both can run above them (e.g., fully utilizing disk bandwidth). None of
this can be done with io.max alone.

The implementation is simple. The disk queue has two states, LIMIT_HIGH and
LIMIT_MAX. In each state, we throttle cgroups according to that state's limit:
the io.high limit in LIMIT_HIGH, the io.max limit in LIMIT_MAX. The disk state
is upgraded/downgraded between LIMIT_HIGH and LIMIT_MAX according to the rule
above. The initial state is LIMIT_MAX, and if no cgroup sets io.high, the disk
stays in LIMIT_MAX. Users who only set io.max will see no behavior change from
these patches.

The first 8 patches implement the basic framework: the interface plus the
upgrade and downgrade logic. Patch 8 detects a special case where a cgroup is
completely idle; in that case we ignore the cgroup's limit. Patches 9-15 add
more heuristics.

The basic framework has 2 major issues.
1. Fluctuation. When the state is upgraded from LIMIT_HIGH to LIMIT_MAX, a
cgroup's bandwidth can change dramatically, sometimes in unexpected ways. For
example, a cgroup's bandwidth may drop below its io.high limit very soon after
an upgrade. Patch 9 has more details about the issue.
2. Idle cgroups. A cgroup with an io.high limit doesn't always dispatch enough
IO. Under the upgrade rule above, the disk then stays in LIMIT_HIGH and no
other cgroup can dispatch above its high limit, wasting disk bandwidth. Patch
10 has more details about the issue.

For issue 1, we make cgroup bandwidth increase smoothly after an upgrade, which
reduces the chance that a cgroup's bandwidth rapidly drops below its high
limit. The smoothing means we may waste some bandwidth during the transition,
but we must pay something for sharing.

Issue 2 is very hard to solve. Patch 10 uses the 'think time check' idea
borrowed from CFQ to detect idle cgroups. It's not perfect; e.g., it doesn't
work well for high-IO-depth workloads. But it's the best approach I've tried so
far and it works well in practice. This definitely needs more tuning.

The big change in this version comes from patches 13-15. We add a latency
target for each cgroup, with the goal of solving issue 2. If a cgroup's average
IO latency exceeds its latency target, the cgroup is considered busy.

Please review, test, and consider merging.

Thanks,
Shaohua

V3->V4:
- Add latency target for cgroup
- Fix bugs

V2->V3:
- Rebase
- Fix several bugs
- Make the hard disk think time threshold bigger
http://marc.info/?l=linux-kernel&m=147552964708965&w=2

V1->V2:
- Drop the io.low interface for simplicity; it isn't a must-have for
prioritizing cgroups.
- Remove the 'trial' logic, which created too much fluctuation
- Add new idle cgroup detection
- Other bug fixes and improvements
http://marc.info/?l=linux-block&m=147395674732335&w=2

V1:
http://marc.info/?l=linux-block&m=146292596425689&w=2


Shaohua Li (15):
blk-throttle: prepare support multiple limits
blk-throttle: add .high interface
blk-throttle: configure bps/iops limit for cgroup in high limit
blk-throttle: add upgrade logic for LIMIT_HIGH state
blk-throttle: add downgrade logic
blk-throttle: make sure expire time isn't too big
blk-throttle: make throtl_slice tunable
blk-throttle: detect completed idle cgroup
blk-throttle: make bandwidth change smooth
blk-throttle: add a simple idle detection
blk-throttle: add interface to configure think time threshold
blk-throttle: ignore idle cgroup limit
blk-throttle: add a mechanism to estimate IO latency
blk-throttle: add interface for per-cgroup target latency
blk-throttle: add latency target support

block/bio.c | 2 +
block/blk-sysfs.c | 18 +
block/blk-throttle.c | 1035 ++++++++++++++++++++++++++++++++++++++++++---
block/blk.h | 9 +
include/linux/blk_types.h | 4 +
5 files changed, 1001 insertions(+), 67 deletions(-)

--
2.9.3