Re: [RFC 0/3] block: proportional based blk-throttling

From: Shaohua Li
Date: Thu Jan 21 2016 - 19:00:31 EST


Hi,
On Thu, Jan 21, 2016 at 05:41:57PM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Thu, Jan 21, 2016 at 02:24:51PM -0800, Shaohua Li wrote:
> > > Have you tried with some level, say 5, of nesting? IIRC, how it
> > > implements hierarchical control is rather braindead (and yeah I'm
> > > responsible for the damage).
> >
> > Not yet. I agree nesting increases the locking time. But my test is
> > already an extreme case: I had 32 threads on 2 nodes running IO at 1M
> > IOPS. I don't think a real workload will act like this. The locking
> > issue definitely should be revisited in the future though.
>
> The thing is that most of the possible contentions can be removed by
> implementing per-cpu cache which shouldn't be too difficult. 10%
> extra cost on current gen hardware is already pretty high.

I did think about this. A per-cpu cache does sound straightforward, but
it could severely impact fairness. For example, say we give each cpu a
budget of 1MB. As long as a cgroup hasn't used up its 1MB on a cpu, we
don't take the lock. But with 128 CPUs the cgroup can consume up to
128 * 1MB of extra budget, which badly breaks fairness. I don't see how
this can be fixed.
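
To make the concern concrete, here is a rough userspace sketch (not the
actual throttle code; all names and numbers are made up) of the kind of
per-cpu budget cache I have in mind, and where the slack comes from:

/*
 * Toy model of a per-cpu budget cache.  Each cpu caches a slice of the
 * cgroup's budget so the common path avoids the shared lock; the lock
 * is only taken to refill the local slice.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_CPUS		128
#define PER_CPU_SLICE	(1ULL << 20)	/* 1MB cached per cpu */

struct cgroup_budget {
	pthread_mutex_t lock;
	unsigned long long shared;		/* globally accounted budget */
	unsigned long long cpu_cache[NR_CPUS];	/* per-cpu cache, spent locklessly */
};

/* Charge @bytes on @cpu; take the lock only when the local slice is empty. */
static int charge_io(struct cgroup_budget *b, int cpu, unsigned long long bytes)
{
	if (b->cpu_cache[cpu] >= bytes) {	/* fast path, no lock */
		b->cpu_cache[cpu] -= bytes;
		return 0;
	}

	pthread_mutex_lock(&b->lock);		/* slow path: refill the slice */
	if (b->shared < PER_CPU_SLICE) {
		pthread_mutex_unlock(&b->lock);
		return -1;			/* out of budget: throttle */
	}
	b->shared -= PER_CPU_SLICE;
	b->cpu_cache[cpu] += PER_CPU_SLICE;
	pthread_mutex_unlock(&b->lock);
	return charge_io(b, cpu, bytes);
}

int main(void)
{
	static struct cgroup_budget b = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.shared = 256ULL << 20,		/* 256MB total budget */
	};

	/* One 4k IO on each cpu pulls a whole 1MB slice onto that cpu. */
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		charge_io(&b, cpu, 4096);

	printf("actually issued: %llu KB\n",
	       (unsigned long long)NR_CPUS * 4096 >> 10);
	printf("charged to the shared budget: %llu MB\n",
	       ((256ULL << 20) - b.shared) >> 20);
	return 0;
}

So a cgroup that happens to touch all 128 cpus can sit on 128 * 1MB of
budget (and later spend it) without ever coming back to the shared lock,
while a cgroup confined to a few cpus can't. Shrinking the slice to
bound the error just puts us back on the lock.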

> > I disagree that IO time is a better choice. Actually I think IO time will be
>
> If IO time isn't the right term, let's call it IO cost. Whatever the
> term, the actual fraction of cost that each IO is incurring.
>
> > the last thing we should consider for SSD. Ideally, if we knew each IO's
> > cost and the total disk capability, things would be easy. Unfortunately
> > there is no way to know the IO cost. Bandwidth isn't perfect, but it might
> > be the best we have.
> >
> > I don't know why you think devices are predictable. SSDs are never
> > predictable. I'm not sure how you would measure IO time. Modern SSDs have
> > large queue depths (blk-mq supports 10k queue depth). That means we can
> > send 10k IOs in several ns. Measuring IO start/finish time doesn't help
> > either: a 4k IO at queue depth 1 might take 10us, while a 4k IO at queue
> > depth 100 might take more than 100us. IO time increases with queue depth.
> > The fundamental problem is that a disk with a large queue depth can buffer
> > a practically unlimited number of IO requests. I think IO time only works
> > for a queue-depth-1 disk.
>
> They're way more predictable than rotational devices when measured
> over a period. I don't think we'll be able to measure anything
> meaningful at individual command level but aggregate numbers should be
> fairly stable. A simple approximation of IO cost such as fixed cost
> per IO + cost proportional to IO size would do a far better job than
> just depending on bandwidth or iops and that requires approximating
> two variables over time. I'm not sure how easy / feasible that
> actually would be tho.

It still sounds like IO time, otherwise I can't imagine how we could
measure the cost. If we use some sort of aggregate number, it's just a
variation of bandwidth, e.g. cost = bandwidth / ios.
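
If that's the direction, the fitting itself doesn't look like the hard
part. Just as a toy (userspace, all numbers invented), the two variables
can be estimated from nothing but aggregate per-window counters:

/*
 * Fit busy_us ~= a * ios + b * bytes by least squares over a few
 * aggregate sampling windows.  The counters and values below are made
 * up purely for illustration.
 */
#include <stdio.h>

struct win { double ios, bytes, busy_us; };

int main(void)
{
	struct win w[] = {
		{ 1000, 1000 * 4096.0,  12048 },
		{ 1000, 1000 * 65536.0, 42768 },
		{ 2000, 2000 * 16384.0, 36384 },
		{ 4000, 4000 * 4096.0,  48192 },
	};
	int n = sizeof(w) / sizeof(w[0]);
	double s11 = 0, s12 = 0, s22 = 0, s1y = 0, s2y = 0;

	for (int i = 0; i < n; i++) {
		s11 += w[i].ios * w[i].ios;
		s12 += w[i].ios * w[i].bytes;
		s22 += w[i].bytes * w[i].bytes;
		s1y += w[i].ios * w[i].busy_us;
		s2y += w[i].bytes * w[i].busy_us;
	}

	/* Solve the 2x2 normal equations. */
	double det = s11 * s22 - s12 * s12;
	double a = (s1y * s22 - s2y * s12) / det;	/* us per IO */
	double b = (s2y * s11 - s1y * s12) / det;	/* us per byte */

	printf("fixed cost %.3f us/io, size cost %.6f us/byte\n", a, b);
	printf("estimated cost of a 64k IO: %.3f us\n", a + b * 65536.0);
	return 0;
}

The problem is the busy-time style input on the left-hand side: on a
device with a deep queue I don't know where that number would come from,
and whatever we put there ends up being derived from bandwidth/iops
again.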

I understand you probably want something like: determine the disk's
total resource, predict the resource each IO consumes, and then use that
info to arbitrate between cgroups. I don't see how that's possible. A
disk that is already using all its resources can still accept newly
queued IO. Maybe someday a fancy device will export this info.

Thanks,
Shaohua