Re: [RFC 0/3] block: proportional based blk-throttling

From: Vivek Goyal
Date: Fri Jan 22 2016 - 10:52:43 EST


On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> Hello, Shaohua.
>
> On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > > The thing is that most of the possible contentions can be removed by
> > > implementing per-cpu cache which shouldn't be too difficult. 10%
> > > extra cost on current gen hardware is already pretty high.
> >
> > I did think about this. per-cpu cache does sound straightforward, but it
> > could severely impact fairness. For example, we give each cpu a budget,
> > see 1MB. If a cgroup doesn't use the 1M budget, we don't hold the lock.
> > But if we have 128 CPUs, the cgroup can use 128 * 1M more budget, which
> > breaks fairness very much. I have no idea how this can be fixed.
>
> Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> worth of IO cost (or bandwidth or iops) available to the cgroup. In
> practice, this may have to be adjusted down depending on the number of
> cgroups performing active IOs. For a given cgroup, B can be
> distributed among the CPUs that are actively issuing IOs in that
> cgroup. It will degenerate to round robin of small budget if there
> are too many active for the budget available but for most cases this
> will cut down most of cross-CPU traffic.
>
> > > They're way more predictable than rotational devices when measured
> > > over a period. I don't think we'll be able to measure anything
> > > meaningful at individual command level but aggregate numbers should be
> > > fairly stable. A simple approximation of IO cost such as fixed cost
> > > per IO + cost proportional to IO size would do a far better job than
> > > just depending on bandwidth or iops and that requires approximating
> > > two variables over time. I'm not sure how easy / feasible that
> > > actually would be tho.
> >
> > It still sounds like IO time, otherwise I can't imagine we can measure
> > the cost. If we use some sort of aggregate number, it likes a variation
> > of bandwidth. eg cost = bandwidth/ios.
>
> I think cost of an IO can be approxmiated by a fixed per-IO cost +
> cost proportional to the size, so
>
> cost = F + R * size
>

Hi Tejun,

May be we can throw in a cost differentiation for IO direction also here.
This still will not take care of cost based on IO pattern, but that's
another level of complexity which can be added to keep track of IO pattern
of cgroup and bump up cost accordingly.

Here are some random thoughts basically adding some more details to your idea.
I am not sure whether it makes sense or not or how difficult it is to
implement it.

Assume we ensure fairness in a time interval of T and have total of N
tokens for IO in that time interval T. When a new inteval starts, we
distribute these N tokens to the pending cgroups based on their weight and
proportional share. And keep on distributing N tokens after each time
interval.

We will have to come up with some sort of cost matrix to determine how many
tokens should be charged per IO (cost per IO). And how to adjust that cost
dynamically.

Both N and T will be variable and will have to be adjusted continuously.
For N we could start with some initial number. If we distributed too many
tokens then device can handle in time T, then in next cycle we will have
to reduce the value of N and distribute less tokens. If we distributed
too less tokens and device is fast and finished in less time than T,
then we can start next cycle sooner and distribute more tokens for next
cycle. So based on device throughput in a certain time interval, number
of tokens issued for next cycle will vary.

Initially I guess cost could be fixed also. That is say, 5 tokens for each
IO plus 1 token for each 4KB of IO size. If we underestimate the cost of
IO, then N tokens will not be consumed in time T and next time we will
distribute less tokens. If we overestimate the cost of IO, then N tokens
will finish fast and next time we will give more. So exact cost of IO
might not be a huge factor.

Thought of writing it down, irrespective of the fact whether it made much
sense or not.

Thanks
Vivek