Re: [PATCH 2/5] writeback: dirty position control

From: Wu Fengguang
Date: Thu Aug 25 2011 - 21:56:29 EST

Next message: Barry Song: "Re: [PATCH 0/4 v4] pin controller subsystem v4"
Previous message: Stephen Rothwell: "linux-next: manual merge of the slave-dma tree with Linus' tree"
In reply to: Vivek Goyal: "Re: [PATCH 2/5] writeback: dirty position control"
Next in thread: Peter Zijlstra: "Re: [PATCH 2/5] writeback: dirty position control"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Aug 26, 2011 at 06:20:01AM +0800, Vivek Goyal wrote:
> On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:
>
> [..]
> > > So you are trying to make one feedback loop aware of second loop so that
> > > if second loop is unbalanced, first loop reacts to that as well and not
> > > just look at dirty_rate and write_bw. So refining new balanced rate by
> > > pos_ratio helps.
> > > write_bw
> > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> > > dirty_bw
> > >
> > > Now if global dirty pages are imbalanced, balanced rate will still go
> > > down despite the fact that dirty_bw == write_bw. This will lead to
> > > further reduction in task dirty rate. Which in turn will lead to reduced
> > > number of dirty rate and should eventually lead to pos_ratio=1.
> >
> > Right, that's a good alternative viewpoint to the below one.
> >
> > write_bw
> > bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
> > dirty_bw
> >
> > (1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
> > (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
>
> Personally I found it much easier to understand the other representation.
> Once you have come up with equation.
>
> balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw
>
> Can you please put few lines of comments to explain that why above
> alone is not sufficient and we need to take pos_ratio also in to
> account to keep number of dirty pages in check. And then go onto
>
> balance_rate_(i+1) = balance_rate(i) * write_bw/dirty_bw * pos_ratio
>
> This kind of maintains the continuity of explanation and explains
> that why are we deviating from the theory we discussed so far.

Good point. Here is the commented code:

/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
task_ratelimit = (u64)dirty_ratelimit *
pos_ratio >> RATELIMIT_CALC_SHIFT;

/*
* A linear estimation of the "balanced" throttle rate. The theory is,
* if there are N dd tasks, each throttled at task_ratelimit, the bdi's
* dirty_rate will be measured to be (N * task_ratelimit). So the below
* formula will yield the balanced rate limit (write_bw / N).
*
* Note that the expanded form is not a pure rate feedback:
* rate_(i+1) = rate_(i) * (write_bw / dirty_rate) (1)
* but also takes pos_ratio into account:
* rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio (2)
*
* (1) is not realistic because pos_ratio also takes part in balancing
* the dirty rate. Consider the state
* pos_ratio = 0.5 (3)
* rate = 2 * (write_bw / N) (4)
* If (1) is used, it will stuck in that state! Because each dd will be
* throttled at
* task_ratelimit = pos_ratio * rate = (write_bw / N) (5)
* yielding
* dirty_rate = N * task_ratelimit = write_bw (6)
* put (6) into (1) we get
* rate_(i+1) = rate_(i) (7)
*
* So we end up using (2) to always keep
* rate_(i+1) ~= (write_bw / N) (8)
* regardless of the value of pos_ratio. As long as (8) is satisfied,
* pos_ratio is able to drive itself to 1.0, which is not only where
* the dirty count meet the setpoint, but also where the slope of
* pos_ratio is most flat and hence task_ratelimit is least fluctuated.
*/
balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
dirty_rate | 1);

> >
> > > A related question though I should have asked you this long back. How does
> > > throttling based on rate helps. Why we could not just work with two
> > > pos_ratios. One is gloabl postion ratio and other is bdi position ratio.
> > > And then throttle task gradually to achieve smooth throttling behavior.
> > > IOW, what property does rate provide which is not available just by
> > > looking at per bdi dirty pages. Can't we come up with bdi setpoint and
> > > limit the way you have done for gloabl setpoint and throttle tasks
> > > accordingly?
> >
> > Good question. If we have no idea of the balanced rate at all, but
> > still want to limit dirty pages within the range [freerun, limit],
> > all we can do is to throttle the task at eg. 1TB/s at @freerun and
> > 0 at @limit. Then you get a really sharp control line which will make
> > task_ratelimit fluctuate like mad...
> >
> > So the balanced rate estimation is the key to get smooth task_ratelimit,
> > while pos_ratio is the ultimate guarantee for the dirty pages range.
>
> Ok, that makes sense. By keeping an estimation of rate at which bdi
> can write, our range of throttling goes down. Say 0 to 300MB/s instead
> of 0 to 1TB/sec and that can lead to a more smooth behavior.

Yeah exactly, and even better, we can make the slope much more flat
around the setpoint to achieve excellent smoothness in stable state :)

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Barry Song: "Re: [PATCH 0/4 v4] pin controller subsystem v4"
Previous message: Stephen Rothwell: "linux-next: manual merge of the slave-dma tree with Linus' tree"
In reply to: Vivek Goyal: "Re: [PATCH 2/5] writeback: dirty position control"
Next in thread: Peter Zijlstra: "Re: [PATCH 2/5] writeback: dirty position control"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]