Re: [PATCH 02/18] writeback: dirty position control

From: Nai Xia
Date: Sat Nov 12 2011 - 00:44:37 EST


Hello Fengguang,

Is this the similar idea&algo behind TCP congestion control
since 2.6.19 ?

Same situation: Multiple tcp connections contending for network
bandwidth V.S. multiple process contending for BDI bandwidth.

Same solution: Per connection(v.s. process) speed control with cubic
algorithm controlled balancing.

:-)

Then the validness and efficiency in essence has been verified
in real world for years in another similar situation. Good to see we
are going to have it in write-back too!


Thanks,
Nai


On Sun, Sep 4, 2011 at 9:53 AM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
>
> Old scheme is,
>                                          |
>                           free run area  |  throttle area
>  ----------------------------------------+---------------------------->
>                                    thresh^                  dirty pages
>
> New scheme is,
>
>  ^ task rate limit
>  |
>  |            *
>  |             *
>  |              *
>  |[free run]      *      [smooth throttled]
>  |                  *
>  |                     *
>  |                         *
>  ..bdi->dirty_ratelimit..........*
>  |                               .     *
>  |                               .          *
>  |                               .              *
>  |                               .                 *
>  |                               .                    *
>  +-------------------------------.-----------------------*------------>
>                          setpoint^                  limit^  dirty pages
>
> The slope of the bdi control line should be
>
> 1) large enough to pull the dirty pages to setpoint reasonably fast
>
> 2) small enough to avoid big fluctuations in the resulted pos_ratio and
>   hence task ratelimit
>
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
>
> Assume the bdi control line
>
>        pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
>
> where k is the negative slope.
>
> If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
>
>        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
>
> we get slope
>
>        k = - 1 / (8 * write_bw)
>
> Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
>
>        x_intercept = bdi_setpoint + 8 * write_bw
>
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
>
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the writeout bandwidth
>
> so that
>
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>  pages inside the control scope
>
> - in large memory systems where the "gravity" of (1) for pulling the
>  dirty pages to setpoint is too weak, (2) can back (1) up and drive
>  dirty pages to bdi_setpoint ~= setpoint reasonably fast.
>
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interferences between disks.  In
> this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
>
> Given equations
>
>        span = x_intercept - bdi_setpoint
>        k = df/dx = - 1 / span
>
> and the extremum values
>
>        span = bdi_thresh
>        dx = bdi_thresh
>
> we get
>
>        df = - dx / span = - 1.0
>
> That means, when bdi_dirty deviates bdi_thresh up, pos_ratio and hence
> task ratelimit will fluctuate by -100%.
>
> peter: use 3rd order polynomial for the global control line
>
> CC: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
> Acked-by: Jan Kara <jack@xxxxxxx>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
>  fs/fs-writeback.c         |    2
>  include/linux/writeback.h |    1
>  mm/page-writeback.c       |  213 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 210 insertions(+), 6 deletions(-)
>
> --- linux-next.orig/mm/page-writeback.c 2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/mm/page-writeback.c      2011-08-26 15:57:34.000000000 +0800
> @@ -46,6 +46,8 @@
>  */
>  #define BANDWIDTH_INTERVAL     max(HZ/5, 1)
>
> +#define RATELIMIT_CALC_SHIFT   10
> +
>  /*
>  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>  * will look to see if it needs to force writeback or throttling.
> @@ -409,6 +411,12 @@ int bdi_set_max_ratio(struct backing_dev
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +                                          unsigned long bg_thresh)
> +{
> +       return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>        return max(thresh, global_dirty_limit);
> @@ -493,6 +501,197 @@ unsigned long bdi_dirty_limit(struct bac
>        return bdi_dirty;
>  }
>
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +                                       unsigned long thresh,
> +                                       unsigned long bg_thresh,
> +                                       unsigned long dirty,
> +                                       unsigned long bdi_thresh,
> +                                       unsigned long bdi_dirty)
> +{
> +       unsigned long write_bw = bdi->avg_write_bandwidth;
> +       unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +       unsigned long limit = hard_dirty_limit(thresh);
> +       unsigned long x_intercept;
> +       unsigned long setpoint;         /* dirty pages' target balance point */
> +       unsigned long bdi_setpoint;
> +       unsigned long span;
> +       long long pos_ratio;            /* for scaling up/down the rate limit */
> +       long x;
> +
> +       if (unlikely(dirty >= limit))
> +               return 0;
> +
> +       /*
> +        * global setpoint
> +        *
> +        *                           setpoint - dirty 3
> +        *        f(dirty) := 1.0 + (----------------)
> +        *                           limit - setpoint
> +        *
> +        * it's a 3rd order polynomial that subjects to
> +        *
> +        * (1) f(freerun)  = 2.0 => rampup dirty_ratelimit reasonably fast
> +        * (2) f(setpoint) = 1.0 => the balance point
> +        * (3) f(limit)    = 0   => the hard limit
> +        * (4) df/dx      <= 0   => negative feedback control
> +        * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +        *     => fast response on large errors; small oscillation near setpoint
> +        */
> +       setpoint = (freerun + limit) / 2;
> +       x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +                   limit - setpoint + 1);
> +       pos_ratio = x;
> +       pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +       pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +       pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +       /*
> +        * We have computed basic pos_ratio above based on global situation. If
> +        * the bdi is over/under its share of dirty pages, we want to scale
> +        * pos_ratio further down/up. That is done by the following mechanism.
> +        */
> +
> +       /*
> +        * bdi setpoint
> +        *
> +        *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +        *
> +        *                        x_intercept - bdi_dirty
> +        *                     := --------------------------
> +        *                        x_intercept - bdi_setpoint
> +        *
> +        * The main bdi control line is a linear function that subjects to
> +        *
> +        * (1) f(bdi_setpoint) = 1.0
> +        * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +        *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +        *
> +        * For single bdi case, the dirty pages are observed to fluctuate
> +        * regularly within range
> +        *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +        * for various filesystems, where (2) can yield in a reasonable 12.5%
> +        * fluctuation range for pos_ratio.
> +        *
> +        * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +        * own size, so move the slope over accordingly and choose a slope that
> +        * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
> +        */
> +       if (unlikely(bdi_thresh > thresh))
> +               bdi_thresh = thresh;
> +       /*
> +        * scale global setpoint to bdi's:
> +        *      bdi_setpoint = setpoint * bdi_thresh / thresh
> +        */
> +       x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +       bdi_setpoint = setpoint * (u64)x >> 16;
> +       /*
> +        * Use span=(8*write_bw) in single bdi case as indicated by
> +        * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +        *
> +        *        bdi_thresh                    thresh - bdi_thresh
> +        * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
> +        *          thresh                            thresh
> +        */
> +       span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
> +       x_intercept = bdi_setpoint + span;
> +
> +       span >>= 1;
> +       if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +               if (unlikely(bdi_dirty > limit))
> +                       return 0;
> +               if (x_intercept < limit) {
> +                       x_intercept = limit;    /* auxiliary control line */
> +                       bdi_setpoint += span;
> +                       pos_ratio >>= 1;
> +               }
> +       }
> +       pos_ratio *= x_intercept - bdi_dirty;
> +       do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +       return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>                                       unsigned long elapsed,
>                                       unsigned long written)
> @@ -591,6 +790,7 @@ static void global_update_bandwidth(unsi
>
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                            unsigned long thresh,
> +                           unsigned long bg_thresh,
>                            unsigned long dirty,
>                            unsigned long bdi_thresh,
>                            unsigned long bdi_dirty,
> @@ -627,6 +827,7 @@ snapshot:
>
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>                                 unsigned long thresh,
> +                                unsigned long bg_thresh,
>                                 unsigned long dirty,
>                                 unsigned long bdi_thresh,
>                                 unsigned long bdi_dirty,
> @@ -635,8 +836,8 @@ static void bdi_update_bandwidth(struct
>        if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>                return;
>        spin_lock(&bdi->wb.list_lock);
> -       __bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -                              start_time);
> +       __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +                              bdi_thresh, bdi_dirty, start_time);
>        spin_unlock(&bdi->wb.list_lock);
>  }
>
> @@ -677,7 +878,8 @@ static void balance_dirty_pages(struct a
>                 * catch-up. This avoids (excessively) small writeouts
>                 * when the bdi limits are ramping up.
>                 */
> -               if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +               if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +                                                     background_thresh))
>                        break;
>
>                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -721,8 +923,9 @@ static void balance_dirty_pages(struct a
>                if (!bdi->dirty_exceeded)
>                        bdi->dirty_exceeded = 1;
>
> -               bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -                                    bdi_thresh, bdi_dirty, start_time);
> +               bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +                                    nr_dirty, bdi_thresh, bdi_dirty,
> +                                    start_time);
>
>                /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>                 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/fs/fs-writeback.c        2011-08-26 15:57:20.000000000 +0800
> @@ -675,7 +675,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>                                unsigned long start_time)
>  {
> -       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +       __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>
>  /*
> --- linux-next.orig/include/linux/writeback.h   2011-08-26 15:57:18.000000000 +0800
> +++ linux-next/include/linux/writeback.h        2011-08-26 15:57:20.000000000 +0800
> @@ -141,6 +141,7 @@ unsigned long bdi_dirty_limit(struct bac
>
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>                            unsigned long thresh,
> +                           unsigned long bg_thresh,
>                            unsigned long dirty,
>                            unsigned long bdi_thresh,
>                            unsigned long bdi_dirty,
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/