Re: [PATCHSET v3 block/for-linus] IO cost model based work-conserving proportional controller

From: Paolo Valente
Date: Thu Aug 29 2019 - 11:54:46 EST


Hi,
I see an important interface problem. Userspace has been waiting for
io.weight to eventually become the file name for setting the weight of
the proportional-share policy [1,2]. If this series takes that name,
how will we resolve the conflict?

Thanks,
Paolo

[1] https://github.com/systemd/systemd/issues/7057#issuecomment-522747575
[2] https://github.com/systemd/systemd/pull/13335#issuecomment-523035303

> On 29 Aug 2019, at 00:05, Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Changes from v2[2]:
>
> * Fixed a divide-by-zero bug in current_hweight().
>
> * pre_start_time and friends were renamed to alloc_time and now have
>   their own CONFIG option, which is selected by IOCOST.
>
> Changes from v1[1]:
>
> * Prerequisite patchsets received cosmetic changes and were merged.
>   This series is refreshed on top.
>
> * Renamed from ioweight to iocost. All source code and tools are
>   updated accordingly. Control knobs io.weight.qos and
>   io.weight.cost_model are renamed to io.cost.qos and io.cost.model
>   respectively. This is a more fitting name which won't become a
>   misnomer when, for example, cost-based io.max is added.
>
> * Various bug fixes and improvements. A few bugs were discovered
>   while testing against a high-iops nvme device. Auto parameter
>   selection was improved and verified across different classes of SSDs.
>
> * Dropped bpf iocost support for now.
>
> * Added coef generation script.
>
> * Verified on a high-iops nvme device. Results are included below.
>
> One challenge of controlling IO resources is the lack of a trivially
> observable cost metric. This distinguishes IO from CPU and memory,
> where wallclock time and the number of bytes serve as accurate enough
> approximations.
>
> Bandwidth and iops are the most commonly used metrics for IO devices,
> but depending on the type and specifics of the device, different IO
> patterns easily lead to variations of multiple orders of magnitude,
> rendering them useless for the purpose of IO capacity distribution.
> While on-device time, with a lot of crutches, could serve as a useful
> approximation for non-queued rotational devices, this is no longer
> viable with modern devices, even rotational ones.
>
> While there is no cost metric we can trivially observe, it isn't a
> complete mystery. For example, on a rotational device, seek cost
> dominates while a contiguous transfer contributes a smaller amount
> proportional to the size. If we can characterize at least the
> relative costs of these different types of IOs, it should be possible
> to implement a reasonable work-conserving proportional IO resource
> distribution.
>
> This patchset implements an IO cost model based work-conserving
> proportional controller. It currently has a simple linear cost model
> built in, where each IO is classified as sequential or random, given
> a base cost accordingly, and charged an additional size-proportional
> cost on top. Each IO is given a cost based on the model, and the
> controller issues IOs for each cgroup according to its hierarchical
> weight.
>
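> To make this concrete, here is a minimal sketch of such a linear cost
> model (illustrative only; the coefficient names and values below are
> made up and are not the kernel's defaults):
>
>   # Illustrative linear cost model in the spirit described above.
>   # Coefficients are hypothetical, not blk-iocost's defaults.
>   SEQ_BASE_COST  = 10    # fixed cost charged to a sequential IO
>   RAND_BASE_COST = 80    # fixed cost charged to a random (seek-heavy) IO
>   PAGE_COST      = 1     # additional cost per 4k of transfer
>
>   def io_cost(is_random, nr_bytes):
>       base = RAND_BASE_COST if is_random else SEQ_BASE_COST
>       return base + PAGE_COST * (nr_bytes // 4096)
>
>   # A 64k sequential IO is far cheaper than sixteen 4k random IOs,
>   # even though both transfer the same number of bytes.
>   print(io_cost(False, 65536))     # 10 + 16 = 26
>   print(io_cost(True, 4096) * 16)  # (80 + 1) * 16 = 1296
>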
> By default, the controller adapts its overall IO rate so that it
> doesn't build up buffer bloat in the request_queue layer, which
> guarantees that the controller doesn't lose a significant amount of
> total work. However, this may not provide sufficient differentiation
> as the underlying device may have a deep queue and not be fair in how
> the queued IOs are serviced. The controller provides extra QoS
> control knobs which allow tightening the control feedback loop as
> necessary.
>
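> As a rough illustration of this kind of feedback (a toy sketch, not
> the actual blk-iocost algorithm; the names and constants are
> invented), one can think of the controller as scaling the rate at
> which cost may be issued based on whether observed completion
> latencies stay within a target:
>
>   # Toy latency-driven feedback loop; not the in-kernel logic.
>   # 'vrate' scales how much cost per second the device may absorb.
>   def adjust_vrate(vrate, lat_p95_us, lat_target_us, vmin, vmax):
>       if lat_p95_us > lat_target_us:
>           vrate *= 0.95   # latencies too high: issue IO more slowly
>       else:
>           vrate *= 1.05   # healthy: creep back up, stay work-conserving
>       return max(vmin, min(vrate, vmax))
>
>   vrate = 1.0
>   for p95 in [8000, 12000, 15000, 9000, 7000]:  # sampled p95 latencies (us)
>       vrate = adjust_vrate(vrate, p95, lat_target_us=10000,
>                            vmin=0.5, vmax=4.0)
>       print(round(vrate, 3))
>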
> For more details on the control mechanism, implementation and
> interface, please refer to the comment at the top of
> block/blk-iocost.c and Documentation/admin-guide/cgroup-v2.rst changes
> in the "blkcg: implement blk-iocost" patch.
>
> Here are some test results. Each test run goes through the following
> combinations, with each combination running for a minute. All tests
> are performed against regular files on btrfs w/ deadline as the IO
> scheduler. Random IOs are direct w/ a queue depth of 64. Sequential
> IOs are normal buffered IOs. A rough sketch of how one combination
> could be reproduced follows the table.
>
> high priority (weight=500)   low priority (weight=100)
>
> Rand read                    None
>  ditto                       Rand read
>  ditto                       Seq read
>  ditto                       Rand write
>  ditto                       Seq write
> Seq read                     None
>  ditto                       Rand read
>  ditto                       Seq read
>  ditto                       Rand write
>  ditto                       Seq write
> Rand write                   None
>  ditto                       Rand read
>  ditto                       Seq read
>  ditto                       Rand write
>  ditto                       Seq write
> Seq write                    None
>  ditto                       Rand read
>  ditto                       Seq read
>  ditto                       Rand write
>  ditto                       Seq write
>
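> As a rough reproduction sketch for one combination, e.g.
> high-priority random read vs. low-priority sequential write (the file
> paths, sizes and cgroup placement are placeholders; only the IO
> pattern mirrors the table above):
>
>   # Hypothetical sketch; run each job from within its own cgroup
>   # (e.g. weight=500 and weight=100), which is omitted here.
>   import subprocess
>
>   def run_fio(name, path, *opts):
>       # Launch one fio job; the caller supplies the IO-pattern options.
>       return subprocess.Popen(["fio", "--name=" + name,
>                                "--filename=" + path, "--size=4g",
>                                "--time_based", "--runtime=60", *opts])
>
>   # High-priority side: direct random reads at QD=64.
>   hi = run_fio("hi-randread", "/mnt/btrfs/hi.img", "--rw=randread",
>                "--direct=1", "--ioengine=libaio", "--iodepth=64",
>                "--bs=4k")
>   # Low-priority side: normal buffered sequential writes.
>   lo = run_fio("lo-seqwrite", "/mnt/btrfs/lo.img", "--rw=write")
>   for job in (hi, lo):
>       job.wait()
>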
> * 7200RPM SATA hard disk
>   * No IO control
>     https://photos.app.goo.gl/1KBHn7ykpC1LXRkB8
>   * iocost, QoS: None
>     https://photos.app.goo.gl/MLNQGxCtBQ8wAmjm7
>   * iocost, QoS: rpct=95.00 rlat=40000 wpct=95.00 wlat=40000 min=25.00 max=200.00
>     https://photos.app.goo.gl/XqXHm3Mkbm9w6Db46
> * NCQ-blacklisted SATA SSD (QD==1)
>   * No IO control
>     https://photos.app.goo.gl/wCTXeu2uJ6LYL4pk8
>   * iocost, QoS: None
>     https://photos.app.goo.gl/T2HedKD2sywQgj7R9
>   * iocost, QoS: rpct=95.00 rlat=20000 wpct=95.00 wlat=20000 min=50.00 max=200.00
>     https://photos.app.goo.gl/urBTV8XQc1UqPJJw7
> * SATA SSD (QD==32)
>   * No IO control
>     https://photos.app.goo.gl/TjEVykuVudSQcryh6
>   * iocost, QoS: None
>     https://photos.app.goo.gl/iyQBsky7bmM54Xiq7
>   * iocost, QoS: rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 min=50.00 max=400.00
>     https://photos.app.goo.gl/q1a6URLDxPLMrnHy5
> * NVME SSD (ran with 8 concurrent fio jobs to achieve saturation)
>   * No IO control
>     https://photos.app.goo.gl/S6xjEVTJzcfb3w1j7
>   * iocost, QoS: None
>     https://photos.app.goo.gl/SjQUUotJBAGr7vqz7
>   * iocost, QoS: rpct=95.00 rlat=5000 wpct=95.00 wlat=5000 min=1.00 max=10000.00
>     https://photos.app.goo.gl/RsaYBd2muX7CegoN7
>
> Even without explicit QoS configuration, read-heavy scenarios achieve
> acceptable differentiation. However, in write-heavy scenarios, the
> deep buffering on the device side makes it difficult to maintain
> control. With QoS parameters set, the differentiation is acceptable
> across all combinations.
>
> The implementation comes with automatically selected default cost
> model parameters which should provide acceptable behavior across most
> common devices. The parameters for HDDs and consumer-grade SSDs seem
> pretty robust. The default parameter set and selection criteria for
> high-end SSDs might need further adjustment.
>
> It is fairly easy to configure the QoS parameters and, if needed, the
> cost model coefficients. We'll follow up with tooling and further
> documentation. As noted above, the RFC patch implementing support for
> a bpf-based custom cost function has been dropped for now. Originally
> we thought we'd need per-device-type cost functions, but the simple
> linear model now seems good enough to cover all common device
> classes. Should custom cost functions become necessary, we can fully
> develop the bpf-based extension and also easily add different
> built-in cost models.
>
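> For example, configuration of the knobs above from userspace might
> look roughly like the following (the cgroup path, the "8:16" device
> number, the per-device key syntax and the placement of io.cost.qos in
> the root cgroup are assumptions for illustration; see the interface
> documentation in the series for the authoritative format):
>
>   # Hedged sketch of writing the cgroup knobs described above.
>   def write_knob(path, value):
>       with open(path, "w") as f:
>           f.write(value)
>
>   # Give a cgroup the "high priority" weight used in the tests above.
>   write_knob("/sys/fs/cgroup/workload/io.weight", "500")
>
>   # Tighten the QoS feedback loop, mirroring the SATA SSD (QD==32)
>   # parameters from the results above.
>   write_knob("/sys/fs/cgroup/io.cost.qos",
>              "8:16 rpct=95.00 rlat=10000 wpct=95.00 wlat=20000 "
>              "min=50.00 max=400.00")
>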
> Andy Newell did the heavy lifting of analyzing IO workloads and device
> characteristics, exploring various cost models, and determining the
> default model and parameters to use.
>
> Josef Bacik implemented a prototype which explored the use of
> different types of cost metrics including on-device time and Andy's
> linear model.
>
> This patchset is on top of the current block/for-next 53fc55c817c3
> ("Merge branch 'for-5.4/block' into for-next") and contains the
> following 10 patches.
>
> 0001-blkcg-pass-q-and-blkcg-into-blkcg_pol_alloc_pd_fn.patch
> 0002-blkcg-make-cpd_init_fn-optional.patch
> 0003-blkcg-separate-blkcg_conf_get_disk-out-of-blkg_conf_.patch
> 0004-block-rq_qos-add-rq_qos_merge.patch
> 0005-block-rq_qos-implement-rq_qos_ops-queue_depth_change.patch
> 0006-blkcg-s-RQ_QOS_CGROUP-RQ_QOS_LATENCY.patch
> 0007-blk-mq-add-optional-request-alloc_time_ns.patch
> 0008-blkcg-implement-blk-iocost.patch
> 0009-blkcg-add-tools-cgroup-iocost_monitor.py.patch
> 0010-blkcg-add-tools-cgroup-iocost_coef_gen.py.patch
>
> 0001-0007 are prep patches.
> 0008 implements blk-iocost.
> 0009 adds the monitoring script.
> 0010 adds the linear cost model coefficient generation script.
>
> The patchset is also available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-iow-v2
>
> diffstat follows. Thanks.
>
> Documentation/admin-guide/cgroup-v2.rst | 97 +
> block/Kconfig | 13
> block/Makefile | 1
> block/bfq-cgroup.c | 5
> block/blk-cgroup.c | 71
> block/blk-core.c | 4
> block/blk-iocost.c | 2395 ++++++++++++++++++++++++++++++++
> block/blk-iolatency.c | 8
> block/blk-mq.c | 13
> block/blk-rq-qos.c | 18
> block/blk-rq-qos.h | 28
> block/blk-settings.c | 2
> block/blk-throttle.c | 6
> block/blk-wbt.c | 18
> block/blk-wbt.h | 4
> include/linux/blk-cgroup.h | 4
> include/linux/blk_types.h | 3
> include/linux/blkdev.h | 13
> include/trace/events/iocost.h | 174 ++
> tools/cgroup/iocost_coef_gen.py | 178 ++
> tools/cgroup/iocost_monitor.py | 270 +++
> 21 files changed, 3272 insertions(+), 53 deletions(-)
>
> --
> tejun
>
> [1] http://lkml.kernel.org/r/20190614015620.1587672-1-tj@xxxxxxxxxx
> [2] http://lkml.kernel.org/r/20190710205128.1316483-1-tj@xxxxxxxxxx
>