Re: [PATCH 19/27] writeback: dirty throttle bandwidth control

From: Wu Fengguang
Date: Tue Mar 29 2011 - 17:08:40 EST


Hi,

This is the hard core of the patchset. Sorry, the original changelog is
way too detail-oriented. I'll try to provide a more general overview
to help explain the main ideas.

There are two major code paths in this IO-less dirty throttling scheme.

(1) on write() syscall

balance_dirty_pages(pages_dirtied)
{
task_bandwidth = bdi->base_bandwidth * pos_ratio /
sqrt(task_dirty_weight);
pause = pages_dirtied / task_bandwidth;
sleep(pause);
}

where pos_ratio is calculated in

dirty_throttle_bandwidth()
{
pos_ratio = 1.0;

if (nr_dirty < goal) scale up pos_ratio
if (nr_dirty > goal) scale down pos_ratio
if (bdi_dirty < bdi_goal) scale up pos_ratio
if (bdi_dirty > bdi_goal) scale down pos_ratio

if (nr_dirty close to dirty limit) scale down pos_ratio
if (bdi_dirty close to 0) scale up pos_ratio
}
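
A rough user-space sketch of path (1), in floating point for readability.
It only models the global main control line (origin = 2 * thresh, goal =
thresh - thresh / DIRTY_SCOPE) and none of the regional adjustments. The
sketch_* names and the DIRTY_SCOPE value of 8 are assumptions made for
illustration, not taken from the patch.

```c
#include <assert.h>

#define DIRTY_SCOPE 8	/* assumed value; the real one is defined elsewhere */

/* Newton's method square root, so the sketch does not depend on libm. */
static double sketch_sqrt(double x)
{
	double r = x > 1.0 ? x : 1.0;
	int i;

	if (x <= 0.0)
		return 0.0;
	for (i = 0; i < 60; i++)
		r = 0.5 * (r + x / r);
	return r;
}

/*
 * Global main control line only: 1.0 at the goal, falling linearly to 0
 * at origin = 2 * thresh, and rising above 1.0 below the goal.
 */
static double sketch_pos_ratio(double thresh, double nr_dirty)
{
	double origin = 2.0 * thresh;
	double goal = thresh - thresh / DIRTY_SCOPE;

	if (nr_dirty >= origin)
		return 0.0;
	return (origin - nr_dirty) / (origin - goal);
}

/* Seconds to sleep after a task has dirtied pages_dirtied pages. */
static double sketch_pause(double base_bw, double pos_ratio,
			   double task_dirty_weight, double pages_dirtied)
{
	double task_bw;

	assert(task_dirty_weight > 0.0 && task_dirty_weight <= 1.0);
	task_bw = base_bw * pos_ratio / sketch_sqrt(task_dirty_weight);
	assert(task_bw > 0.0);
	return pages_dirtied / task_bw;
}
```

With thresh = 8000 pages the goal is 7000 pages and pos_ratio is 1.0 there.
A task that owns all the dirty pages (weight 1.0) and has just dirtied 100
pages against base_bw = 1000 pages/s then sleeps for about 0.1s; a task
with weight 0.25 is allowed twice the bandwidth and sleeps about 0.05s.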

(2) every 100ms

bdi_update_bandwidth()
{
update bdi->base_bandwidth
update bdi->write_bandwidth
update smoothed dirty pages
update smoothed dirty threshold/limit
}

where bdi->base_bandwidth is updated in bdi_update_throttle_bandwidth()
to make sure that the bdi's
- dirty bandwidth (the rate dirty pages are created)
- write bandwidth (the rate dirty pages are cleaned)
will match if pos_ratio=1. The skeleton logic is:

bdi_update_throttle_bandwidth()
{
if (common case: 1 task writing to 1 disk)
ref_bw = bdi->write_bandwidth;
else
ref_bw = bdi->base_bandwidth * pos_ratio *
(bdi->write_bandwidth / dirty_bw);

if (dirty pages are departing from the dirty goals)
step bdi->base_bandwidth closer to ref_bw;
}
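
The estimate-and-step idea can be sketched in floating point as follows.
The ref_bw formula is the one from the skeleton; the step rule (move 1/8 of
the remaining distance per update) is a hypothetical simplification, since
the real code only steps under certain conditions and limits the step size
in several more ways. The sketch_* names are made up for illustration.

```c
#include <assert.h>

/* ref_bw = base_bw * pos_ratio * (write_bw / dirty_bw), as in the skeleton. */
static double sketch_ref_bw(double base_bw, double pos_ratio,
			    double write_bw, double dirty_bw)
{
	assert(dirty_bw > 0.0);
	return base_bw * pos_ratio * (write_bw / dirty_bw);
}

/*
 * Hypothetical step rule: move 1/8 of the remaining distance towards
 * ref_bw on each update, so that one noisy sample cannot drag base_bw far.
 */
static double sketch_step(double base_bw, double ref_bw)
{
	return base_bw + (ref_bw - base_bw) / 8.0;
}
```

For example, with base_bw = 100, pos_ratio = 1, write_bw = 50 and
dirty_bw = 100 (pages are dirtied twice as fast as the disk can clean
them), ref_bw comes out at 50 and the first step moves base_bw to 93.75.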

To summarize the two core functions:

- dirty_throttle_bandwidth() is made of easy-to-understand policies,
although the large amount of integer arithmetic is not much fun.

- bdi_update_throttle_bandwidth() is a mechanical estimation/tracking
problem that is made tricky by lots of fluctuations. It does succeed
in getting a very smooth/stable bdi->base_bandwidth on top of the heavily
fluctuating pos_ratio, bdi->write_bandwidth and dirty_bw.
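
To give a flavour of the integer arithmetic: the quoted patch keeps ratios
and the base bandwidth as fixed-point numbers with BASE_BW_SHIFT fraction
bits, and shifts by half of that before multiplying so the product stays
within 64 bits. A minimal user-space sketch of that pattern follows; the
fixed_* helper names are made up for illustration and do not exist in the
patch.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_BW_SHIFT 24	/* same value as in the quoted patch */

/* num / den as a fixed-point number with BASE_BW_SHIFT fraction bits. */
static uint64_t fixed_ratio(uint64_t num, uint64_t den)
{
	assert(den > 0);
	assert(num < (UINT64_C(1) << 40));	/* the shift must not overflow */
	return (num << BASE_BW_SHIFT) / den;
}

/*
 * Multiply two fixed-point values. Dropping half of the fraction bits of
 * one operand before the multiply, and the other half afterwards, trades
 * a little precision for headroom against 64-bit overflow.
 */
static uint64_t fixed_mul(uint64_t a, uint64_t b)
{
	return ((a >> (BASE_BW_SHIFT / 2)) * b) >> (BASE_BW_SHIFT / 2);
}
```

For instance, scaling a base bandwidth of 100 pages/s (stored as
100 << BASE_BW_SHIFT) by fixed_ratio(1, 2) yields 50 << BASE_BW_SHIFT.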

Thanks,
Fengguang

On Thu, Mar 03, 2011 at 02:45:24PM +0800, Wu, Fengguang wrote:
> balance_dirty_pages() has been using a very simple and robust threshold
> based throttle scheme. It automatically limits the dirty rate, but
> in a very bumpy way that constantly blocks the dirtier tasks for
> hundreds of milliseconds on a local ext4.
>
> The new scheme is to expand the ON/OFF threshold to a larger scope in
> which both the number of dirty pages and the dirty rate are explicitly
> controlled. The basic ideas are
>
> - position feedback control
>
> At the center of the control scope is the setpoint/goal. When the
> number of dirty pages goes higher/lower than the goal, the dirty rate
> will be proportionally decreased/increased to prevent it from drifting
> away.
>
> When the dirty pages drop to the bottom of the control scope, or
> rush up to the upper limit, the dirty rate will quickly be scaled
> up/down, to the point of completely letting go of or completely
> blocking the dirtier task.
>
> - rate feedback control
>
> What's the balanced dirty rate if the dirty pages are exactly at the
> goal? If there are N tasks dirtying pages on 1 disk at rate task_bw MB/s,
> then task_bw should be balanced at write_bw/N where write_bw is the
> disk's write bandwidth. We call base_bw=write_bw/(N*sqrt(N)) the
> disk's base throttle bandwidth. Each task will be allowed to dirty at
> rate task_bw=base_bw/sqrt(task_weight) where task_weight=1/N reflects
> what fraction of the dirty pages in the system are dirtied by the task.
> So the overall dirty rate dirty_bw=N*task_bw will match write_bw exactly.
>
> In practice we don't know base_bw beforehand, because we don't know
> the exact value of N and cannot assume all tasks are equally weighted.
> So a reference bandwidth ref_bw is estimated as the target of base_bw.
> base_bw will be adjusted step by step towards ref_bw. In each step,
> ref_bw is calculated as (base_bw * pos_ratio * write_bw / dirty_bw):
> when the (unknown number of) tasks are rate limited based on the previous
> (base_bw*pos_ratio/sqrt(task_weight)), if the overall dirty rate
> dirty_bw is M times write_bw, then base_bw shall be scaled by 1/M to
> match/balance dirty_bw <=> write_bw. Note that pos_ratio is the result
> of position control; it will be 1 if the dirty pages are exactly at
> the goal.
>
> The ref_bw estimation would be pretty accurate in the absence of
> (1) noise
> (2) feedback delays between steps
> (3) the mismatch between the number of dirty and writeback events
> caused by user space truncate and file system redirty
>
> (1) can be smoothed out; (2) will decrease proportionally with the
> adjustment size as base_bw gets close to ref_bw.
>
> (3) can be ultimately fixed by accounting for the truncate/redirty events.
> But for now we can rely on the robustness of base_bw update algorithms
> to deal with the mismatches: no obvious imbalance is observed in ext4
> workloads which have bursts of redirty and large dirtied:written=3:2
> ratio. In theory when the truncate/redirty makes (write_bw/dirty_bw <
> 1), ref_bw and base_bw will go low, driving up the pos_ratio which
> then corrects (pos_ratio * write_bw / dirty_bw) back to 1, thus
> balance ref_bw at some point. What's more,
> bdi_update_throttle_bandwidth() dictates that base_bw will only be
> updated when ref_bw and pos_bw=base_bw*pos_ratio are both higher or
> lower than base_bw. So the higher pos_bw will effectively stop base_bw
> from approaching the lower ref_bw.
>
> In general, it's pretty safe and robust.
> - the upper/lower bounds in the position control provide the ultimate
> safeguard: in case the algorithms fly away, the worst case would be
> the dirty pages continuously hitting the bounds with big fluctuations in
> dirty rate -- basically similar to the current state.
> - the base bandwidth update rules are accurate and robust enough for
> base_bw to quickly adapt to a new workload and remain stable thereafter.
> This is confirmed by a wide range of tests: base_bw only becomes less
> stable when the control scope is smaller than the write bandwidth,
> in which case the pos_ratio is already fluctuating much more.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@xxxxxxxxx>
> ---
> include/linux/backing-dev.h | 10
> include/linux/writeback.h | 7
> mm/backing-dev.c | 1
> mm/page-writeback.c | 478 ++++++++++++++++++++++++++++++++++
> 4 files changed, 495 insertions(+), 1 deletion(-)
>
> --- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:44:22.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h 2011-03-03 14:44:27.000000000 +0800
> @@ -76,18 +76,26 @@ struct backing_dev_info {
> struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
>
> unsigned long bw_time_stamp;
> + unsigned long dirtied_stamp;
> unsigned long written_stamp;
> unsigned long write_bandwidth;
> unsigned long avg_bandwidth;
> + unsigned long long throttle_bandwidth;
> + unsigned long long reference_bandwidth;
> + unsigned long long old_ref_bandwidth;
> unsigned long avg_dirty;
> unsigned long old_dirty;
> unsigned long dirty_threshold;
> unsigned long old_dirty_threshold;
>
> -
> struct prop_local_percpu completions;
> int dirty_exceeded;
>
> + /* last time exceeded (limit - limit/DIRTY_MARGIN) */
> + unsigned long dirty_exceed_time;
> + /* last time dropped below (background_thresh + dirty_thresh) / 2 */
> + unsigned long dirty_free_run;
> +
> unsigned int min_ratio;
> unsigned int max_ratio, max_prop_frac;
>
> --- linux-next.orig/include/linux/writeback.h 2011-03-03 14:44:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h 2011-03-03 14:44:23.000000000 +0800
> @@ -46,6 +46,13 @@ extern spinlock_t inode_lock;
> #define DIRTY_MARGIN (DIRTY_SCOPE * 4)
>
> /*
> + * The base throttle bandwidth will be 1000 times smaller than write bandwidth
> + * when there are 100 concurrent heavy dirtiers. This shift can work with up to
> + * 40 bits dirty size and 2^16 concurrent dirtiers.
> + */
> +#define BASE_BW_SHIFT 24
> +
> +/*
> * fs/fs-writeback.c
> */
> enum writeback_sync_modes {
> --- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:23.000000000 +0800
> +++ linux-next/mm/page-writeback.c 2011-03-03 14:44:27.000000000 +0800
> @@ -496,6 +496,255 @@ static unsigned long dirty_rampup_size(u
> return MIN_WRITEBACK_PAGES / 8;
> }
>
> +/*
> + * last time exceeded (limit - limit/DIRTY_MARGIN)
> + */
> +static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
> + unsigned long time_window)
> +{
> + return jiffies - bdi->dirty_exceed_time <= time_window;
> +}
> +
> +/*
> + * last time dropped below (thresh - 2*thresh/DIRTY_SCOPE + thresh/DIRTY_MARGIN)
> + */
> +static bool dirty_free_run_recently(struct backing_dev_info *bdi,
> + unsigned long time_window)
> +{
> + return jiffies - bdi->dirty_free_run <= time_window;
> +}
> +
> +/*
> + * Position based bandwidth control.
> + *
> + * (1) hard dirty limiting areas
> + *
> + * The block area is required to stop large number of slow dirtiers, because
> + * the max pause area is only able to throttle a task at 1page/200ms=20KB/s.
> + *
> + * The max pause area is sufficient for normal workloads, and has the virtue
> + * of bounded latency for light dirtiers.
> + *
> + * The brake area is typically enough to hold off the dirtiers as long as the
> + * dirtyable memory is not so tight.
> + *
> + * The block area and max pause area are enforced inside the loop of
> + * balance_dirty_pages(). Others can be found in dirty_throttle_bandwidth().
> + *
> + * block area, loop until drop below the area           -------------------|<===
> + * max pause area, sleep(max_pause) and return          -----------|<=====>|
> + * brake area, bw scaled from 1 down to 0               ---|<=====>|
> + * --------------------------------------------------------o-------o-------o----
> + *                                                         ^       ^       ^
> + * limit - limit/DIRTY_MARGIN                           ---'       |       |
> + * limit                                                -----------'       |
> + * limit + limit/DIRTY_MARGIN                           -------------------'
> + *
> + * (2) global control areas
> + *
> + * The rampup area is for ramping up the base bandwidth whereas the above brake
> + * area is for scaling down the base bandwidth.
> + *
> + * The global thresh is typically equal to the above global limit. The
> + * difference is, @thresh is real-time computed from global_dirty_limits() and
> + * @limit is tracking @thresh at 100ms intervals in update_dirty_limit(). The
> + * point is to track @thresh slowly if it dropped below the number of dirty
> + * pages, so as to avoid unnecessarily entering the three areas in (1).
> + *
> + *rampup area                setpoint/goal
> + *|<=======>|                      v
> + * |-------------------------------*-------------------------------|------------
> + * ^                               ^                               ^
> + * thresh - 2*thresh/DIRTY_SCOPE   thresh - thresh/DIRTY_SCOPE     thresh
> + *
> + * (3) bdi control areas
> + *
> + * The bdi reserve area tries to keep a reasonable number of dirty pages for
> + * preventing block queue underrun.
> + *
> + * reserve area, scale up bw as dirty pages drop low  bdi_setpoint
> + * |<=============================================>|       v
> + * |-------------------------------------------------------*-------|----------
> + * 0                    bdi_thresh - bdi_thresh/DIRTY_SCOPE^       ^bdi_thresh
> + *
> + * (4) global/bdi control lines
> + *
> + * dirty_throttle_bandwidth() applies 2 main and 3 regional control lines for
> + * scaling up/down the base bandwidth based on the position of dirty pages.
> + *
> + * The two main control lines for the global/bdi control scopes do not end at
> + * thresh/bdi_thresh. They are centered at setpoint/bdi_setpoint and cover the
> + * whole [0, limit]. If the control line drops below 0 before reaching @limit,
> + * an auxiliary line will be setup to connect them. The below figure illustrates
> + * the main bdi control line with an auxiliary line extending it to @limit.
> + *
> + * This allows smoothly throttling down bdi_dirty back to normal if it starts
> + * high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + * card's bdi_dirty may rush to 5 times higher than bdi_setpoint.
> + * - the global/bdi dirty thresh/goal may be knocked down suddenly either on
> + * user request or on increased memory consumption.
> + *
> + *  o
> + *    o
> + *      o                                          [o] main control line
> + *        o                                        [*] auxiliary control line
> + *          o
> + *            o
> + *              o
> + *                o
> + *                  o
> + *                    o
> + *                      o--------------------- balance point, bw scale = 1
> + *                      | o
> + *                      |   o
> + *                      |     o
> + *                      |       o
> + *                      |         o
> + *                      |           o
> + *                      |             o------- connect point, bw scale = 1/2
> + *                      |               .*
> + *                      |                 .    *
> + *                      |                   .       *
> + *                      |                     .         *
> + *                      |                       .            *
> + *                      |                         .               *
> + *                      |                           .                 *
> + * [--------------------*-----------------------------.--------------------*]
> + * 0              bdi_setpoint                   bdi_origin             limit
> + *
> + * The bdi control line: if (bdi_origin < limit), an auxiliary control line (*)
> + * will be setup to extend the main control line (o) to @limit.
> + */
> +static unsigned long dirty_throttle_bandwidth(struct backing_dev_info *bdi,
> + unsigned long thresh,
> + unsigned long dirty,
> + unsigned long bdi_dirty,
> + struct task_struct *tsk)
> +{
> + unsigned long limit = default_backing_dev_info.dirty_threshold;
> + unsigned long bdi_thresh = bdi->dirty_threshold;
> + unsigned long origin;
> + unsigned long goal;
> + unsigned long long span;
> + unsigned long long bw;
> +
> + if (unlikely(dirty >= limit))
> + return 0;
> +
> + /*
> + * global setpoint
> + */
> + origin = 2 * thresh;
> + goal = thresh - thresh / DIRTY_SCOPE;
> +
> + bw = 1ULL << BASE_BW_SHIFT;
> + if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> + origin = limit;
> + goal = (goal + origin) / 2;
> + bw >>= 1;
> + }
> + bw *= origin - dirty;
> + do_div(bw, origin - goal + 1);
> +
> + /*
> + * brake area to prevent global dirty exceeding
> + */
> + if (dirty > limit - limit / DIRTY_MARGIN) {
> + bw *= limit - dirty;
> + do_div(bw, limit / DIRTY_MARGIN + 1);
> + }
> +
> + /*
> + * rampup area, immediately above the unthrottled free-run region.
> + * It's setup mainly to get an estimation of ref_bw for reliably
> + * ramping up the base bandwidth.
> + */
> + dirty = default_backing_dev_info.avg_dirty;
> + origin = thresh - thresh / (DIRTY_SCOPE/2) + thresh / DIRTY_MARGIN;
> + if (dirty < origin) {
> + span = (origin - dirty) * bw;
> + do_div(span, thresh / (8 * DIRTY_MARGIN) + 1);
> + bw += span;
> + }
> +
> + /*
> + * bdi setpoint
> + */
> + if (unlikely(bdi_thresh > thresh))
> + bdi_thresh = thresh;
> + goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
> + /*
> + * In JBOD case, bdi_thresh could fluctuate proportional to its own
> + * size. Otherwise the bdi write bandwidth is good for limiting the
> + * floating area, to compensate for the global control line being too
> + * flat in large memory systems.
> + */
> + span = (u64) bdi_thresh * (thresh - bdi_thresh) +
> + (2 * bdi->avg_bandwidth) * bdi_thresh;
> + do_div(span, thresh + 1);
> + origin = goal + 2 * span;
> +
> + dirty = bdi->avg_dirty;
> + if (unlikely(dirty > goal + span)) {
> + if (dirty > limit)
> + return 0;
> + if (origin < limit) {
> + origin = limit;
> + goal += span;
> + bw >>= 1;
> + }
> + }
> + bw *= origin - dirty;
> + do_div(bw, origin - goal + 1);
> +
> + /*
> + * bdi reserve area, safeguard against bdi dirty underflow and disk idle
> + */
> + origin = bdi_thresh - bdi_thresh / (DIRTY_SCOPE / 2);
> + if (bdi_dirty < origin)
> + bw = bw * origin / (bdi_dirty | 1);
> +
> + /*
> + * honour light dirtiers higher bandwidth:
> + *
> + * bw *= sqrt(1 / task_dirty_weight);
> + */
> + if (tsk) {
> + unsigned long numerator, denominator;
> + const unsigned long priority_base = 1024;
> + unsigned long priority = priority_base;
> +
> + /*
> + * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
> + * real-time tasks.
> + */
> + if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
> + priority *= 2;
> +
> + task_dirties_fraction(tsk, &numerator, &denominator);
> +
> + denominator <<= 10;
> + denominator = denominator * priority / priority_base;
> + bw *= int_sqrt(denominator / (numerator + 1)) *
> + priority / priority_base;
> + bw >>= 5 + BASE_BW_SHIFT / 2;
> + bw = (unsigned long)bw * bdi->throttle_bandwidth;
> + bw >>= 2 * BASE_BW_SHIFT - BASE_BW_SHIFT / 2;
> +
> + /*
> + * The avg_bandwidth bound is necessary because
> + * bdi_update_throttle_bandwidth() blindly sets base bandwidth
> + * to avg_bandwidth for more stable estimation, when it
> + * believes the current task is the only dirtier.
> + */
> + if (priority > priority_base)
> + return min((unsigned long)bw, bdi->avg_bandwidth);
> + }
> +
> + return bw;
> +}
> +
> static void bdi_update_dirty_smooth(struct backing_dev_info *bdi,
> unsigned long dirty)
> {
> @@ -631,6 +880,230 @@ static void bdi_update_dirty_threshold(s
> bdi->old_dirty_threshold = thresh;
> }
>
> +/*
> + * ref_bw typically fluctuates within a small range, with large isolated points
> + * from time to time. The smoothed reference_bandwidth can effectively filter
> + * out 1 such standalone point. When there comes 2+ isolated points together --
> + * observed in ext4 on sudden redirty -- reference_bandwidth may surge high and
> + * take long time to return to normal, which can mostly be counteracted by
> + * xref_bw and other update restrictions in bdi_update_throttle_bandwidth().
> + */
> +static void bdi_update_reference_bandwidth(struct backing_dev_info *bdi,
> + unsigned long ref_bw)
> +{
> + unsigned long old = bdi->old_ref_bandwidth;
> + unsigned long avg = bdi->reference_bandwidth;
> +
> + if (avg > old && old >= ref_bw && avg - old >= old - ref_bw)
> + avg -= (avg - old) >> 3;
> +
> + if (avg < old && old <= ref_bw && old - avg >= ref_bw - old)
> + avg += (old - avg) >> 3;
> +
> + bdi->reference_bandwidth = avg;
> + bdi->old_ref_bandwidth = ref_bw;
> +}
> +
> +/*
> + * Base throttle bandwidth.
> + */
> +static void bdi_update_throttle_bandwidth(struct backing_dev_info *bdi,
> + unsigned long thresh,
> + unsigned long dirty,
> + unsigned long bdi_dirty,
> + unsigned long dirtied,
> + unsigned long elapsed)
> +{
> + unsigned long limit = default_backing_dev_info.dirty_threshold;
> + unsigned long margin = limit / DIRTY_MARGIN;
> + unsigned long goal = thresh - thresh / DIRTY_SCOPE;
> + unsigned long bdi_thresh = bdi->dirty_threshold;
> + unsigned long bdi_goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
> + unsigned long long bw = bdi->throttle_bandwidth;
> + unsigned long long dirty_bw;
> + unsigned long long pos_bw;
> + unsigned long long delta;
> + unsigned long long ref_bw = 0;
> + unsigned long long xref_bw;
> + unsigned long pos_ratio;
> + unsigned long spread;
> +
> + if (dirty > limit - margin)
> + bdi->dirty_exceed_time = jiffies;
> +
> + if (dirty < thresh - thresh / (DIRTY_SCOPE/2) + margin)
> + bdi->dirty_free_run = jiffies;
> +
> + /*
> + * The dirty rate should match the writeback rate exactly, except when
> + * dirty pages are truncated before IO submission. The mismatches are
> + * hopefully small and hence ignored. So a continuous stream of dirty
> + * page truncates will result in errors in ref_bw; fortunately pos_bw
> + * can effectively stop the base bw from being driven away endlessly
> + * by the errors.
> + *
> + * It'd be nicer for the filesystems to not redirty too many pages
> + * either on IO or lock contention, or on sub-page writes. ext4 is
> + * known to redirty pages in big bursts, leading to
> + * - surges of dirty_bw, which can be mostly safeguarded by the
> + * min/max'ed xref_bw
> + * - the temporary drop of task weight and hence surge of task bw
> + * It could possibly be fixed in the FS.
> + */
> + dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> +
> + pos_ratio = dirty_throttle_bandwidth(bdi, thresh, dirty,
> + bdi_dirty, NULL);
> + /*
> + * pos_bw = task_bw, assuming 100% task dirty weight
> + *
> + * (pos_bw > bw) means the position of the number of dirty pages is
> + * lower than the global and/or bdi setpoints. It does not necessarily
> + * mean the base throttle bandwidth is larger than its balanced value.
> + * The latter is likely only when
> + * - (position) the dirty pages are at some distance from the setpoint,
> + * - (speed) and either stands still or is departing from the setpoint.
> + */
> + pos_bw = (bw >> (BASE_BW_SHIFT/2)) * pos_ratio >>
> + (BASE_BW_SHIFT/2);
> +
> + /*
> + * A typical desktop has only 1 task writing to 1 disk, in which case
> + * the dirtier task should be throttled at the disk's write bandwidth.
> + * Note that we ignore minor dirty/writeback mismatches such as
> + * redirties and truncated dirty pages.
> + */
> + if (bdi_thresh > thresh - thresh / 16) {
> + unsigned long numerator, denominator;
> +
> + task_dirties_fraction(current, &numerator, &denominator);
> + if (numerator > denominator - denominator / 16)
> + ref_bw = bdi->avg_bandwidth << BASE_BW_SHIFT;
> + }
> + /*
> + * Otherwise there may be
> + * 1) N dd tasks writing to the current disk, or
> + * 2) X dd tasks and Y "rsync --bwlimit" tasks.
> + * The below estimation is accurate enough for (1). For (2), where not
> + * all tasks' dirty rates can be changed proportionally by adjusting the
> + * base throttle bandwidth, it would require multiple adjust-reestimate
> + * cycles to approach the rate matching point. Which is not a big
> + * concern as we always do small steps to approach the target. The
> + * un-controllable tasks may only slow down the progress.
> + */
> + if (!ref_bw) {
> + ref_bw = pos_ratio * bdi->avg_bandwidth;
> + do_div(ref_bw, dirty_bw | 1);
> + ref_bw = (bw >> (BASE_BW_SHIFT/2)) * (unsigned long)ref_bw >>
> + (BASE_BW_SHIFT/2);
> + }
> +
> + /*
> + * The average dirty pages typically fluctuates within this scope.
> + */
> + spread = min(bdi->write_bandwidth / 8, bdi_thresh / DIRTY_MARGIN);
> +
> + /*
> + * Update the base throttle bandwidth rigidly: eg. only try lowering it
> + * when both the global/bdi dirty pages are away from their setpoints,
> + * and are either standing still or continue departing away.
> + *
> + * The "+ avg_dirty / 256" tricks mainly help btrfs, which behaves
> + * amazingly smoothly. Its average dirty pages simply tracks more and
> + * more close to the number of dirty pages without any overshooting,
> + * thus its dirty pages may be ever moving towards the setpoint and
> + * @avg_dirty ever approaching @dirty, slower and slower, but very hard
> + * to cross it to trigger a base bandwidth update. What the trick does
> + * is "when @avg_dirty is _close enough_ to @dirty, it indicates slowed
> + * down @dirty change rate, hence the other inequalities are now a good
> + * indication of something unbalanced in the current bdi".
> + *
> + * In the cases of hitting the upper/lower margins, it's obviously
> + * necessary to adjust the (possibly very unbalanced) base bandwidth,
> + * unless the opposite margin has also been hit recently, which
> + * indicates that the dirty control scope may be smaller than the bdi
> + * write bandwidth and hence the dirty pages are quickly fluctuating
> + * between the upper/lower margins.
> + */
> + if (bw < pos_bw) {
> + if (dirty < goal &&
> + dirty <= default_backing_dev_info.avg_dirty +
> + (default_backing_dev_info.avg_dirty >> 8) &&
> + bdi->avg_dirty + spread < bdi_goal &&
> + bdi_dirty <= bdi->avg_dirty + (bdi->avg_dirty >> 8) &&
> + bdi_dirty <= bdi->old_dirty)
> + goto adjust;
> + if (dirty < thresh - thresh / (DIRTY_SCOPE/2) + margin &&
> + !dirty_exceeded_recently(bdi, HZ))
> + goto adjust;
> + }
> +
> + if (bw > pos_bw) {
> + if (dirty > goal &&
> + dirty >= default_backing_dev_info.avg_dirty -
> + (default_backing_dev_info.avg_dirty >> 8) &&
> + bdi->avg_dirty > bdi_goal + spread &&
> + bdi_dirty >= bdi->avg_dirty - (bdi->avg_dirty >> 8) &&
> + bdi_dirty >= bdi->old_dirty)
> + goto adjust;
> + if (dirty > limit - margin &&
> + !dirty_free_run_recently(bdi, HZ))
> + goto adjust;
> + }
> +
> + goto out;
> +
> +adjust:
> + /*
> + * The min/max'ed xref_bw is an effective safeguard. The most dangerous
> + * case that could unnecessarily disturb the base bandwidth is: when the
> + * control scope is roughly equal to the write bandwidth, the dirty
> + * pages may rush into the upper/lower margins regularly. It typically
> + * hits the upper margin in a blink, making a sudden drop of pos_bw and
> + * ref_bw. Assume 5 points A, b, c, D, E, where b, c have the dropped
> + * down number of pages, and A, D, E are at normal level. At point b,
> + * the xref_bw will be the good A; at c, the xref_bw will be the
> + * dragged-down-by-b reference_bandwidth which is bad; at D and E, the
> + * still-low reference_bandwidth will no longer bring the base
> + * bandwidth down, as xref_bw will take the larger values from D and E.
> + */
> + if (pos_bw > bw) {
> + xref_bw = min(ref_bw, bdi->old_ref_bandwidth);
> + xref_bw = min(xref_bw, bdi->reference_bandwidth);
> + if (xref_bw > bw)
> + delta = xref_bw - bw;
> + else
> + delta = 0;
> + } else {
> + xref_bw = max(ref_bw, bdi->old_ref_bandwidth);
> + xref_bw = max(xref_bw, bdi->reference_bandwidth);
> + if (xref_bw < bw)
> + delta = bw - xref_bw;
> + else
> + delta = 0;
> + }
> +
> + /*
> + * Don't pursue 100% rate matching. It's impossible since the balanced
> + * rate itself is constantly fluctuating. So decrease the track speed
> + * when it gets close to the target. Also limit the step size in
> + * various ways to avoid overshooting.
> + */
> + delta >>= bw / (2 * delta + 1);
> + delta = min(delta, (u64)abs64(pos_bw - bw));
> + delta >>= 1;
> + delta = min(delta, bw / 8);
> +
> + if (pos_bw > bw)
> + bw += delta;
> + else
> + bw -= delta;
> +
> + bdi->throttle_bandwidth = bw;
> +out:
> + bdi_update_reference_bandwidth(bdi, ref_bw);
> +}
> +
> void bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long thresh,
> unsigned long dirty,
> @@ -640,12 +1113,14 @@ void bdi_update_bandwidth(struct backing
> static DEFINE_SPINLOCK(dirty_lock);
> unsigned long now = jiffies;
> unsigned long elapsed;
> + unsigned long dirtied;
> unsigned long written;
>
> if (!spin_trylock(&dirty_lock))
> return;
>
> elapsed = now - bdi->bw_time_stamp;
> + dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
> written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
>
> /* skip quiet periods when disk bandwidth is under-utilized */
> @@ -665,6 +1140,8 @@ void bdi_update_bandwidth(struct backing
> if (thresh) {
> update_dirty_limit(thresh, dirty);
> bdi_update_dirty_threshold(bdi, thresh, dirty);
> + bdi_update_throttle_bandwidth(bdi, thresh, dirty,
> + bdi_dirty, dirtied, elapsed);
> }
> __bdi_update_write_bandwidth(bdi, elapsed, written);
> if (thresh) {
> @@ -673,6 +1150,7 @@ void bdi_update_bandwidth(struct backing
> }
>
> snapshot:
> + bdi->dirtied_stamp = dirtied;
> bdi->written_stamp = written;
> bdi->bw_time_stamp = now;
> unlock:
> --- linux-next.orig/mm/backing-dev.c 2011-03-03 14:44:22.000000000 +0800
> +++ linux-next/mm/backing-dev.c 2011-03-03 14:44:27.000000000 +0800
> @@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
>
> bdi->write_bandwidth = INIT_BW;
> bdi->avg_bandwidth = INIT_BW;
> + bdi->throttle_bandwidth = (u64)INIT_BW << BASE_BW_SHIFT;
>
> bdi->avg_dirty = 0;
> bdi->old_dirty = 0;
>