Re: [RFC 0/3] block: proportional based blk-throttling
From: Vivek Goyal
Date: Fri Jan 22 2016 - 15:04:48 EST
On Fri, Jan 22, 2016 at 11:45:51AM -0800, Shaohua Li wrote:
> On Fri, Jan 22, 2016 at 02:09:10PM -0500, Vivek Goyal wrote:
> > On Fri, Jan 22, 2016 at 10:00:19AM -0800, Shaohua Li wrote:
> > > On Fri, Jan 22, 2016 at 10:52:36AM -0500, Vivek Goyal wrote:
> > > > On Fri, Jan 22, 2016 at 09:48:22AM -0500, Tejun Heo wrote:
> > > > > Hello, Shaohua.
> > > > >
> > > > > On Thu, Jan 21, 2016 at 04:00:16PM -0800, Shaohua Li wrote:
> > > > > > > The thing is that most of the possible contentions can be removed by
> > > > > > > implementing per-cpu cache which shouldn't be too difficult. 10%
> > > > > > > extra cost on current gen hardware is already pretty high.
> > > > > >
> > > > > > I did think about this. per-cpu cache does sound straightforward, but it
> > > > > > could severely impact fairness. For example, we give each cpu a budget,
> > > > > > see 1MB. If a cgroup doesn't use the 1M budget, we don't hold the lock.
> > > > > > But if we have 128 CPUs, the cgroup can use 128 * 1M more budget, which
> > > > > > breaks fairness very much. I have no idea how this can be fixed.
> > > > >
> > > > > Let's say per-cgroup buffer budget B is calculated as, say, 100ms
> > > > > worth of IO cost (or bandwidth or iops) available to the cgroup. In
> > > > > practice, this may have to be adjusted down depending on the number of
> > > > > cgroups performing active IOs. For a given cgroup, B can be
> > > > > distributed among the CPUs that are actively issuing IOs in that
> > > > > cgroup. It will degenerate to round robin of small budget if there
> > > > > are too many active for the budget available but for most cases this
> > > > > will cut down most of cross-CPU traffic.
> > > > >
> > > > > > > They're way more predictable than rotational devices when measured
> > > > > > > over a period. I don't think we'll be able to measure anything
> > > > > > > meaningful at individual command level but aggregate numbers should be
> > > > > > > fairly stable. A simple approximation of IO cost such as fixed cost
> > > > > > > per IO + cost proportional to IO size would do a far better job than
> > > > > > > just depending on bandwidth or iops and that requires approximating
> > > > > > > two variables over time. I'm not sure how easy / feasible that
> > > > > > > actually would be tho.
> > > > > >
> > > > > > It still sounds like IO time, otherwise I can't imagine we can measure
> > > > > > the cost. If we use some sort of aggregate number, it likes a variation
> > > > > > of bandwidth. eg cost = bandwidth/ios.
> > > > >
> > > > > I think cost of an IO can be approxmiated by a fixed per-IO cost +
> > > > > cost proportional to the size, so
> > > > >
> > > > > cost = F + R * size
> > > > >
> > > >
> > > > Hi Tejun,
> > > >
> > > > May be we can throw in a cost differentiation for IO direction also here.
> > > > This still will not take care of cost based on IO pattern, but that's
> > > > another level of complexity which can be added to keep track of IO pattern
> > > > of cgroup and bump up cost accordingly.
> > > >
> > > > Here are some random thoughts basically adding some more details to your idea.
> > > > I am not sure whether it makes sense or not or how difficult it is to
> > > > implement it.
> > > >
> > > > Assume we ensure fairness in a time interval of T and have total of N
> > > > tokens for IO in that time interval T. When a new inteval starts, we
> > > > distribute these N tokens to the pending cgroups based on their weight and
> > > > proportional share. And keep on distributing N tokens after each time
> > > > interval.
> > > >
> > > > We will have to come up with some sort of cost matrix to determine how many
> > > > tokens should be charged per IO (cost per IO). And how to adjust that cost
> > > > dynamically.
> > > >
> > > > Both N and T will be variable and will have to be adjusted continuously.
> > > > For N we could start with some initial number. If we distributed too many
> > > > tokens then device can handle in time T, then in next cycle we will have
> > > > to reduce the value of N and distribute less tokens. If we distributed
> > > > too less tokens and device is fast and finished in less time than T,
> > > > then we can start next cycle sooner and distribute more tokens for next
> > > > cycle. So based on device throughput in a certain time interval, number
> > > > of tokens issued for next cycle will vary.
> > >
> > > Note, we don't know if we dispatch too many/too less tokens. A device
> > > with large queue depth can accept all requests. If queue depth is 1,
> > > things would be easy.
> >
> > If device accepts too many requests then we will keep on increasing tokens
> > and cgroups will keep on submitting IOs in the proportion of their weight.
> > Once queue is full, then we will hit a wall and we will start decreasing
> > number of tokens. So I guess this should still work.
>
> queue will never be full. Typical application drives < 32 IO depth. NVMe
> SSD queue can have 64k queue depth for each hardware queue according to
> the spec.
These are really deep queues. Hmm.., so what's the solution? Should we
drive shallower queue depths in an attmept to achieve fairness? I guess
that will not fly and backfire.
If we can't fill up the queue, then yes, we will keep on increasing
number of tokens and every cgroup will get to submit whatever request
they have got and effectively not get any service differentiation.
That's the age old problem we have. Trying to achieve fairness without
compromising bandwidth will only works in a limited use cases. And
queue depths like 64K make it just worse.
In that case either we driver smaller queue depths (possibly compromising
on throughput) or service different mechanism need to be in hardware and
software needs to just tag the IO with relative priority. I can't think
what other options are there.
> > One problem with deep queue depths will be though that a heavy writer
> > will be able to fill up the queue in a very short interval and block
> > small readers behind it. I guess until and unless devices start doing
> > some prioritization of IO, this problem will be hard to solve. Driving
> > smaller queue depth is not an option as it makes the bandwidth drop.
>
> yep, this is the problem. Disk accepts new requests even all pending
> requests already exhaust its resources.
> > > > Initially I guess cost could be fixed also. That is say, 5 tokens for each
> > > > IO plus 1 token for each 4KB of IO size. If we underestimate the cost of
> > > > IO, then N tokens will not be consumed in time T and next time we will
> > > > distribute less tokens. If we overestimate the cost of IO, then N tokens
> > > > will finish fast and next time we will give more. So exact cost of IO
> > > > might not be a huge factor.
> > >
> > > we still need know the R. any idea for this?
> >
> > Hmm..., thinking loud. Will following work.
> >
> > Can we keep track of average bw and average iops of the queue. And then
> > use that to come up with per IO cost and BW cost.
> >
> > Say average queue bandwidth is ABW and average IOPS is AIOPS.
> >
> > So in interval T, all cgroup cumulatively can dispatch T * ABW size IO.
> >
> > A cgroup's fractional cost of IO = IO_size/(T * ABW)
> >
> > As we are supposed to dispatch N tokens in time T, cgroups cost of IO
> > in terms of tokens will be
> >
> > Cgroup_cost_BW = (N * IO_Size)/(T * ABW)
> >
> > Similarly, a cgroup's per IO cost based on IOPS will be.
> >
> > Cgroup_Cost_IOPS = (N * 1) /(T * AIOPS)
> >
> > So per IO per cgroup we could charge following tokens.
> >
> > Charged_tokens = Cgroup_cost_BW + Cgroup_cost_IO
> >
> > As we are charging cgroup twice (once based on bandwidth and once based
> > on IOPS), may be we can half the effective cost.
> >
> > Effectivey_charged_tokens = (Cgroup_cost_BW + Cgroup_cost_IO)/2
>
> So the cost = Cgroup_cost_BW * A + Cgroup_cost_IO * B
>
> bandwidth based: A = 1, B = 0
> IOPS based: A = 0, B = 1
> The proposal: A = 1/2, B = 1/2
>
> I'm sure people will invent other A/B combinations. It's hard to say
> which one is better. Maybe we really should have simple ones first, eg,
> either bandwidth based or IOPS based, and have a knob to choose. That
> pretty much shows the powerless from kernel side, but that's something
> we can offer.
Actually I kind of like adding iops and bw cost and dividing it by two,
instead of asking user to choose between the two. As Tejun said, there
are no good answers so this is a wrong question to ask.
Thanks
Vivek