Re: [PATCH 2/5] writeback: dirty position control

From: Jan Kara
Date: Thu Aug 18 2011 - 15:16:24 EST


On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > + * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > + * => fast response on large errors; small oscillation near setpoint
> > > > > + */
> > > > > + setpoint = (freerun + limit) / 2;
> > > > > + x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > + limit - setpoint + 1);
> > > > > + pos_ratio = x;
> > > > > + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > + pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > + /*
> > > > > + * bdi setpoint
> > OK, so if I understand the code right, we now have the basic pos_ratio based
> > on the global situation. Now, in the following code, we might scale pos_ratio
> > further down if bdi_dirty is too far over the bdi's share, right?
>
> Right.
>
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
>
> Yes.
>
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
>
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counteracting any > 1 bdi pos_ratio.
OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...
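
As a purely illustrative aside (not part of the patch, and with made-up
freerun/limit page counts), a minimal userspace sketch of the quoted global
control line shows how sharply the cubic falls off towards @limit, which is
what makes the > 1 bdi-side scaling safe:

/*
 * Standalone sketch of the 3rd order global control line quoted above.
 * The freerun/limit values are arbitrary illustration numbers.  Plain
 * multiply/divide is used instead of the kernel's shifts and div_s64 so
 * the sketch stays well defined for negative intermediate values, at the
 * cost of slightly different rounding.
 */
#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

static long long global_pos_ratio(long dirty, long freerun, long limit)
{
	long setpoint = (freerun + limit) / 2;
	long long unit = 1 << RATELIMIT_CALC_SHIFT;
	long long x, pos_ratio;

	/* same cubic as the quoted bdi_position_ratio() hunk */
	x = (long long)(setpoint - dirty) * unit / (limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x / unit;
	pos_ratio = pos_ratio * x / unit;
	pos_ratio += unit;

	return pos_ratio;
}

int main(void)
{
	long freerun = 100000, limit = 200000, dirty;

	/* prints ~2.0 at freerun, 1.0 at the setpoint, ~0 at the limit */
	for (dirty = freerun; dirty <= limit; dirty += 12500)
		printf("dirty=%6ld  pos_ratio=%.3f\n", dirty,
		       (double)global_pos_ratio(dirty, freerun, limit) /
		       (1 << RATELIMIT_CALC_SHIFT));
	return 0;
}

It prints roughly 2.0 at freerun, exactly 1.0 at the setpoint and ~0 at the
limit, so the cubic does dominate any linear bdi-side scale-up near @limit.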

> > > > > + *
> > > > > + * f(dirty) := 1.0 + k * (dirty - setpoint)
> > ^^^^^^^ bdi_dirty? ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
>
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> all over the places.
>
> > > > > + *
> > > > > + * The main bdi control line is a linear function subject to
> > > > > + *
> > > > > + * (1) f(setpoint) = 1.0
> > > > > + * (2) k = - 1 / (8 * write_bw) (in single bdi case)
> > > > > + * or equally: x_intercept = setpoint + 8 * write_bw
> > > > > + *
> > > > > + * For single bdi case, the dirty pages are observed to fluctuate
> > > > > + * regularly within range
> > > > > + * [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > + * for various filesystems, where (2) can yield a reasonable 12.5%
> > > > > + * fluctuation range for pos_ratio.
> > > > > + *
> > > > > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > + * own size, so move the slope over accordingly.
> > > > > + */
> > > > > + if (unlikely(bdi_thresh > thresh))
> > > > > + bdi_thresh = thresh;
> > > > > + /*
> > > > > + * scale global setpoint to bdi's: setpoint *= bdi_thresh / thresh
> > > > > + */
> > > > > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > + setpoint = setpoint * (u64)x >> 16;
> > > > > + /*
> > > > > + * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > + * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in JBOD case.
> > > > > + */
> > > > > + span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > + (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > + thresh + 1);
> > > > I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > >
> > > Good idea!
> > >
> > > > > + x_intercept = setpoint + 2 * span;
> > ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
>
> Right.
>
> > So maybe you should use bdi_thresh/2 in the computation of span?
>
> Given that in some configurations bdi_thresh can fluctuate up to its own
> size, I guess the current slope of the control line is sharp enough.
>
> Given equations
>
> span = (x_intercept - bdi_setpoint) / 2
> k = df/dx = -0.5 / span
>
> and the values
>
> span = bdi_thresh
> dx = bdi_thresh
>
> we get
>
> df = - dx / (2 * span) = - 1/2
>
> That means, when bdi_dirty deviates by bdi_thresh, pos_ratio and
> hence the task ratelimit will fluctuate by -1/2. This is probably
> already more than users can tolerate?
OK, let's try that.
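
To double check those numbers with concrete (made-up) values rather than
patch code: the linear bdi control line f(bdi_dirty) = (x_intercept -
bdi_dirty) / (x_intercept - bdi_setpoint), with span = 4*write_bw and
x_intercept = bdi_setpoint + 2*span as in the patch, gives:

/* Arithmetic check of the linear bdi control line; all values made up. */
#include <stdio.h>

int main(void)
{
	double write_bw = 10000;		/* pages/s, arbitrary */
	double bdi_setpoint = 100000;		/* pages, arbitrary */
	double span = 4 * write_bw;		/* single bdi case */
	double x_intercept = bdi_setpoint + 2 * span;

	double f_sp   = (x_intercept - bdi_setpoint) / (2 * span);
	double f_bw2  = (x_intercept - (bdi_setpoint + write_bw / 2)) / (2 * span);
	double f_span = (x_intercept - (bdi_setpoint + span)) / (2 * span);

	printf("f(bdi_setpoint)              = %.4f\n", f_sp);		/* 1.0000 */
	printf("f(bdi_setpoint + write_bw/2) = %.4f\n", f_bw2);		/* 0.9375 */
	printf("f(bdi_setpoint + span)       = %.4f\n", f_span);	/* 0.5000 */
	return 0;
}

So a +-write_bw/2 swing around bdi_setpoint keeps pos_ratio within the 12.5%
band, while a full span of deviation halves it, matching the -1/2 figure above.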

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
>
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
>
> Old scheme is,
>                                          |
>              free run area               |      throttle area
> ----------------------------------------+---------------------------->
>                                   thresh^                dirty pages
>
> New scheme is,
>
> ^ task rate limit
> |
> |              *
> |                *
> |                  *
> |[free run]          *      [smooth throttled]
> |                       *
> |                          *
> |                             *
> ..bdi->dirty_ratelimit..........*
> |                               .   *
> |                               .       *
> |                               .           *
> |                               .               *
> |                               .                   *
> +-------------------------------.-----------------------*------------>
>                         setpoint^                  limit^  dirty pages
>
> The slope of the bdi control line should be
>
> 1) large enough to pull the dirty pages to setpoint reasonably fast
>
> 2) small enough to avoid big fluctuations in the resulting pos_ratio and
> hence task ratelimit
>
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within one second's worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
>
> Assume the bdi control line
>
> pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
>
> where k is the negative slope.
>
> If we target a 12.5% fluctuation range in pos_ratio when the dirty pages
> are fluctuating in range
>
> [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
>
> we get slope
>
> k = - 1 / (8 * write_bw)
>
> Letting pos_ratio(x_intercept) = 0, we get the parameter used in the code:
>
> x_intercept = bdi_setpoint + 8 * write_bw
>
> The global/bdi slopes nicely complement each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
>
> 1) slope of global control line => scaling to the control scope size
> 2) slope of main bdi control line => scaling to the write bandwidth
>
> so that
>
> - in memory-tight systems, (1) becomes strong enough to squeeze dirty
> pages inside the control scope
>
> - in large memory systems where the "gravity" of (1) for pulling the
> dirty pages to setpoint is too weak, (2) can back (1) up and drive
> dirty pages to bdi_setpoint ~= setpoint reasonably fast.
>
> Unfortunately in JBOD setups, the fluctuation range of the bdi threshold
> is related to memory size due to the interference between disks. In
> this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
>
> peter: use 3rd order polynomial for the global control line
>
> CC: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@xxxxxxx>

Honza

> ---
> fs/fs-writeback.c | 2
> include/linux/writeback.h | 1
> mm/page-writeback.c | 212 +++++++++++++++++++++++++++++++++++-
> 3 files changed, 209 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c 2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c 2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
> */
> #define BANDWIDTH_INTERVAL max(HZ/5, 1)
>
> +#define RATELIMIT_CALC_SHIFT 10
> +
> /*
> * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
> * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
> return x + 1; /* Ensure that we never return 0 */
> }
>
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> + unsigned long bg_thresh)
> +{
> + return (thresh + bg_thresh) / 2;
> +}
> +
> static unsigned long hard_dirty_limit(unsigned long thresh)
> {
> return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
> return bdi_dirty;
> }
>
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages to be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence the task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + * pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + * if (dirty < setpoint) scale up pos_ratio
> + * if (dirty > setpoint) scale down pos_ratio
> + *
> + * if (bdi_dirty < bdi_setpoint) scale up pos_ratio
> + * if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + * task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .   *
> + *     |            .                  .       *
> + *     |            .                  .           *
> + *     |            .                  .               *
> + *     |            .                  .                   *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The figure below illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *  o
> + *    o
> + *      o                                      [o] main control line
> + *        o                                    [*] auxiliary control line
> + *          o
> + *            o
> + *              o
> + *                o
> + *                  o
> + *                    o
> + *                      o--------------------- balance point, rate scale = 1
> + *                      | o
> + *                      |   o
> + *                      |     o
> + *                      |       o
> + *                      |         o
> + *                      |           o
> + *                      |             o------- connect point, rate scale = 1/2
> + *                      |<-- span --->| .*
> + *                      |                 .    *
> + *                      |                   .       *
> + *                      |                     .          *
> + *                      |                       .             *
> + *                      |                         .                *
> + *                      |                           .                   *
> + * [--------------------+-----------------------------.--------------------*]
> + * 0               bdi_setpoint                  x_intercept            limit
> + *
> + * The auxiliary control line allows smooth throttling of bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to a change in the JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> + unsigned long thresh,
> + unsigned long bg_thresh,
> + unsigned long dirty,
> + unsigned long bdi_thresh,
> + unsigned long bdi_dirty)
> +{
> + unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> + unsigned long limit = hard_dirty_limit(thresh);
> + unsigned long x_intercept;
> + unsigned long setpoint; /* dirty pages' target balance point */
> + unsigned long bdi_setpoint;
> + unsigned long span;
> + long long pos_ratio; /* for scaling up/down the rate limit */
> + long x;
> +
> + if (unlikely(dirty >= limit))
> + return 0;
> +
> + /*
> + * global setpoint
> + *
> + *                      setpoint - dirty 3
> + *   f(dirty) := 1.0 + (----------------)
> + *                      limit - setpoint
> + *
> + * it is a 3rd order polynomial subject to
> + *
> + * (1) f(freerun) = 2.0 => ramp up base_rate reasonably fast
> + * (2) f(setpoint) = 1.0 => the balance point
> + * (3) f(limit) = 0 => the hard limit
> + * (4) df/dx <= 0 => negative feedback control
> + * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> + * => fast response on large errors; small oscillation near setpoint
> + */
> + setpoint = (freerun + limit) / 2;
> + x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> + limit - setpoint + 1);
> + pos_ratio = x;
> + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> + pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> + pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> + /*
> + * We have computed the basic pos_ratio above based on the global situation.
> + * If the bdi is over/under its share of dirty pages, we want to scale
> + * pos_ratio further down/up. That is done by the following policies:
> + *
> + * For single bdi case, the dirty pages are observed to fluctuate
> + * regularly within range
> + * [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> + * for various filesystems, so choose a slope that can yield a
> + * reasonable 12.5% fluctuation range for pos_ratio.
> + *
> + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> + * own size, so move the slope over accordingly and choose a slope that
> + * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> + */
> +
> + /*
> + * bdi setpoint
> + *
> + * f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> + *
> + *                 x_intercept - bdi_dirty
> + *              := --------------------------
> + *                 x_intercept - bdi_setpoint
> + *
> + * The main bdi control line is a linear function subject to
> + *
> + * (1) f(bdi_setpoint) = 1.0
> + * (2) k = - 1 / (8 * write_bw) (in single bdi case)
> + * or equally: x_intercept = bdi_setpoint + 8 * write_bw
> + */
> + if (unlikely(bdi_thresh > thresh))
> + bdi_thresh = thresh;
> + /*
> + * scale global setpoint to bdi's:
> + * bdi_setpoint = setpoint * bdi_thresh / thresh
> + */
> + x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> + bdi_setpoint = setpoint * (u64)x >> 16;
> + /*
> + * Use span=(4*write_bw) in single bdi case as indicated by
> + * (thresh - bdi_thresh ~= 0) and transition to bdi_thresh in JBOD case.
> + *
> + *        bdi_thresh                    thresh - bdi_thresh
> + * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> + *          thresh                          thresh
> + */
> + span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> + (u64)x >> 16;
> + x_intercept = bdi_setpoint + 2 * span;
> +
> + if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> + if (unlikely(bdi_dirty > limit))
> + return 0;
> + if (x_intercept < limit) {
> + x_intercept = limit; /* auxiliary control line */
> + bdi_setpoint += span;
> + pos_ratio >>= 1;
> + }
> + }
> + pos_ratio *= x_intercept - bdi_dirty;
> + do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> + return pos_ratio;
> +}
> +
> static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> unsigned long elapsed,
> unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>
> void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long thresh,
> + unsigned long bg_thresh,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>
> static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long thresh,
> + unsigned long bg_thresh,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct
> if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
> return;
> spin_lock(&bdi->wb.list_lock);
> - __bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> - start_time);
> + __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> + bdi_thresh, bdi_dirty, start_time);
> spin_unlock(&bdi->wb.list_lock);
> }
>
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> - if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> + if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> + background_thresh))
> break;
>
> bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
> if (!bdi->dirty_exceeded)
> bdi->dirty_exceeded = 1;
>
> - bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> - bdi_thresh, bdi_dirty, start_time);
> + bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> + nr_dirty, bdi_thresh, bdi_dirty,
> + start_time);
>
> /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c 2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c 2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
> static void wb_update_bandwidth(struct bdi_writeback *wb,
> unsigned long start_time)
> {
> - __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> + __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
> }
>
> /*
> --- linux-next.orig/include/linux/writeback.h 2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h 2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>
> void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long thresh,
> + unsigned long bg_thresh,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
--
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR