Re: [RFC 0/3] block: proportional based blk-throttling

From: Shaohua Li
Date: Thu Jan 21 2016 - 17:25:55 EST

On Thu, Jan 21, 2016 at 04:10:02PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
> On Wed, Jan 20, 2016 at 09:49:16AM -0800, Shaohua Li wrote:
> > Currently we have 2 iocontrollers. blk-throttling is bandwidth based. CFQ is
> Just a nit. blk-throttle is both bw and iops based.
> > weight based. It would be great if there were a unified iocontroller for the two.
> > And blk-mq doesn't support an ioscheduler, leaving blk-throttling the only option
> > for blk-mq. It's time to have a scalable iocontroller supporting both
> > bandwidth/weight based control and working with blk-mq.
> >
> > blk-throttling is a good candidate: it works for both blk-mq and the legacy queue.
> > It has a global lock, which is worrying for scalability, but it's not terrible in
> > practice. In my test, NVMe IOPS can reach 1M/s with all CPUs running IO. Enabling
> > blk-throttle costs around 2~3% IOPS and 10% CPU utilization. I'd expect
> > this isn't a big problem for today's workloads. This patchset then tries to make a
> > unified iocontroller, leveraging blk-throttling.
> Have you tried with some level, say 5, of nesting? IIRC, how it
> implements hierarchical control is rather braindead (and yeah I'm
> responsible for the damage).

Not yet. I agree nesting increases the locking time, but my test is
already an extreme case: 32 threads across 2 nodes running IO at 1M
IOPS. I don't think a real workload will behave like this. The locking
issue definitely should be revisited in the future, though.

> > The idea is pretty simple. If we know disk total bandwidth, we can calculate
> > cgroup bandwidth according to its weight. blk-throttling can use the calculated
> > bandwidth to throttle cgroup. Disk total bandwidth changes dramatically per IO
> > pattern. Long history is meaningless. The simple algorithm in patch 1 works
> > pretty well when IO pattern changes.
> So, that part is fine but I don't think it makes sense to make weight
> based control either bandwidth or iops based. The fundamental problem
> is that it's a false choice. It's like asking someone who wants a car
> to choose between accelerator and brake. It's a choice without a good
> answer. Both are wrong. Also note that there's an inherent
> difference from the currently implemented absolute limits. Absolute
> limits can be combined. Weights based on different metrics can't be.
> Even with modern SSDs, both iops and bandwidth play major roles in
> deciding how costly each IO is and I'm fairly confident that this is
> fundamental enough to be the case for quite a while. I *think* the
> cost model can be approximated from measurements. Devices are
> becoming more and more predictable in their behaviors after all. For
> weight based distribution, the unit of distribution should be IO time,
> not bandwidth or iops.

I disagree that IO time is a better choice. Actually I think IO time is
the last metric we should consider for SSD. Ideally, if we knew each
IO's cost and the total disk capability, things would be easy.
Unfortunately there is no way to know IO cost. Bandwidth isn't perfect,
but it might be the best we have.
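For reference, the bandwidth-proportional idea from the patch set can be
sketched roughly as below. This is only an illustration of the formula
(cgroup bandwidth = weight / total weight * estimated disk bandwidth);
the function and variable names are mine, not the kernel's symbols:

```python
# Illustrative sketch (not the actual kernel code): split an estimated
# disk bandwidth among cgroups in proportion to their weights.

def cgroup_bw_limits(disk_bw_estimate, weights):
    """weights: dict mapping cgroup name -> weight.
    Returns per-cgroup bandwidth limits in the same unit as the estimate."""
    total = sum(weights.values())
    return {cg: disk_bw_estimate * w / total for cg, w in weights.items()}

# With an estimated 1000 MB/s disk and weights 100 vs 300,
# the cgroups get 250 MB/s and 750 MB/s respectively.
limits = cgroup_bw_limits(1000, {"a": 100, "b": 300})
```

The disk bandwidth estimate itself must be recomputed from a short
history window, since total throughput changes dramatically with the IO
pattern (as the patch description notes).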

I don't know why you think devices are predictable; an SSD is never
predictable. I'm also not sure how you would measure IO time. A modern
SSD has a large queue depth (blk-mq supports a 10k queue depth), which
means we can dispatch 10k IOs within a few ns. Measuring IO start/finish
time doesn't help either: a 4k IO at queue depth 1 might take 10us,
while the same 4k IO at queue depth 100 might take more than 100us. The
measured IO time increases with queue depth. The fundamental problem is
that a disk with a large queue depth can buffer a nearly unlimited
number of IO requests. I think IO time only works for a queue-depth-1
disk.
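The queue-depth effect can be seen from Little's law: for a device with
a roughly fixed completion rate, per-IO completion time grows linearly
with the number of outstanding IOs, so the measured time reflects the
queue depth more than the IO's intrinsic cost. A small sketch with
illustrative numbers (the 100k IOPS figure is an assumption, not a
measurement):

```python
# Little's law: observed latency = outstanding IOs / completion rate.
# Assume a device that sustains 100k IOPS regardless of queue depth.

def observed_io_time_us(queue_depth, device_iops):
    """Average per-IO completion time in microseconds."""
    return queue_depth / device_iops * 1e6

print(observed_io_time_us(1, 100_000))    # 10.0 us at depth 1
print(observed_io_time_us(100, 100_000))  # 1000.0 us at depth 100
```

The same 4k IO thus "costs" 100x more by the IO-time metric purely
because of concurrency, which is the objection being made here.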

On the other hand, how would you utilize IO time? If we used an
algorithm similar to this patch set's (e.g., cgroup's IO time slice =
cgroup_share / all_cgroup_share * disk_IO_time_capability), how do you
get disk_IO_time_capability? Or we could use the CFQ algorithm (e.g.,
switch cgroups when a cgroup uses up its IO time slice). But CFQ is
known not to work well with NCQ unless the disk idles, because a disk
with a large queue depth can dispatch all of a cgroup's IO immediately.
And idling should of course be avoided on high-speed storage.