Re: [PATCH 0/5] IO-less dirty throttling v8

From: Wu Fengguang
Date: Wed Aug 10 2011 - 23:21:50 EST

> [...] it only deals with controlling buffered write IO and nothing
> else. So on the same block device, other direct writes might be
> going on from same group and in this scheme a user will not have any
> control.

The IO-less balance_dirty_pages() will be able to throttle DIRECT
writes. There is nothing fundamental in the way.

The basic approach will be to add a balance_dirty_pages_ratelimited_nr()
call in the DIRECT write path, and to call into balance_dirty_pages()
regardless of the various dirty thresholds.

Then the IO-less balance_dirty_pages() has all the facilities to
throttle a task at any auto-estimated or user-specified ratelimit.
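
As a rough illustration of what "throttle a task at a ratelimit" can
mean, here is a user-space sketch (not the kernel code). It assumes the
throttling is done by putting the dirtier to sleep for
pages_dirtied / ratelimit; the function name and the choice of units
(pages per second, milliseconds) are made up for the example.

```c
#include <stdint.h>

/*
 * Hypothetical user-space model, not kernel code: how long a task that
 * has just dirtied `pages_dirtied` pages would sleep so that its
 * long-run dirtying rate matches `ratelimit_pps` (pages per second).
 * Returns the pause in milliseconds; a ratelimit of 0 is treated as
 * "no limit configured" and yields no pause.
 */
static uint64_t throttle_pause_ms(uint64_t pages_dirtied,
				  uint64_t ratelimit_pps)
{
	if (ratelimit_pps == 0)
		return 0;
	return pages_dirtied * 1000 / ratelimit_pps;
}
```

With these assumptions, a task limited to 2560 pages/s that has just
dirtied 256 pages would sleep for 100ms. Nothing here depends on whether
the pages were dirtied through the page cache or by a DIRECT write.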

> Another disadvantage is that throttling at page cache level does not
> take care of IO spikes at device level.

Yes, this is a problem. But it's a problem best fixed in the IO
scheduler. (I cannot go into details at this time; however, it does
_sound_ possible to me.)

> How do you implement proportional control here? From overall bdi bandwidth
> vary per cgroup bandwidth regularly based on cgroup weight? Again the
> issue here is that it controls only buffered WRITES and nothing else and
> in this case co-ordinating with CFQ will probably be hard. So I guess
> usage of proportional IO just for buffered WRITES will have limited
> usage.

"priority" may be a more suitable phrase. It will be implemented like
this (without the user interface):

@@ -1007,6 +1001,13 @@ static void balance_dirty_pages(struct a
 		max_pause = bdi_max_pause(bdi, bdi_dirty);
 
 		base_rate = bdi->dirty_ratelimit;
+		/*
+		 * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+		 * real-time tasks.
+		 */
+		if (current->flags & PF_LESS_THROTTLE || rt_task(current))
+			base_rate *= 2;
 		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
 					       background_thresh, nr_dirty,
 					       bdi_thresh, bdi_dirty);

That is, if we start 2 dd tasks A and B with priority_B=2, then the
resulting rate_B will be equal to 2*rate_A. Since rate_A + rate_B adds
up to write_bw, the ->dirty_ratelimit will auto-adapt to rate_A, or
equivalently (write_bw/3).
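
The arithmetic behind the write_bw/3 figure can be checked with a small
user-space model. This is only a sketch of the steady state, assuming
the per-task rates sum to the device write bandwidth; base_rate_for()
and task_rate() are hypothetical helpers, not kernel functions.

```c
#include <stddef.h>

/*
 * Hypothetical model: given the device write bandwidth and each task's
 * priority multiplier, return the base rate that a priority-1 task gets.
 * It assumes the steady state sum(priority[i] * base) == write_bw.
 */
static double base_rate_for(double write_bw, const unsigned *priority,
			    size_t n)
{
	unsigned total = 0;

	for (size_t i = 0; i < n; i++)
		total += priority[i];
	return total ? write_bw / total : 0.0;
}

/* Rate of a single task: its priority times the shared base rate. */
static double task_rate(double base, unsigned priority)
{
	return base * priority;
}
```

For write_bw = 90MB/s and priorities {1, 2}, the base rate comes out as
30MB/s (write_bw/3), so rate_A is 30MB/s and rate_B is 60MB/s.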

The same can be applied to cgroup. One may specify the whole cgroup's
dirty rate be throttled at N times that of a normal dd in the root cgroup,
or be throttled at some absolute 10MB/s rate. The corresponding
cgroup->dirty_ratelimit will be set to (N * bdi->dirty_ratelimit) for
the former and 10MB/s for the latter.
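
A minimal sketch of the two cgroup cases, in user-space C. The struct
and function names are invented for illustration (the user interface
does not exist yet); only the two formulas, N * bdi->dirty_ratelimit
and a fixed absolute rate, come from the description above.

```c
#include <stdint.h>

/* Hypothetical: how a cgroup's limit was specified by the user. */
enum cg_limit_kind { CG_RELATIVE, CG_ABSOLUTE };

struct cg_limit {
	enum cg_limit_kind kind;
	uint64_t value;	/* multiplier N if relative, bytes/s if absolute */
};

/*
 * Resolve the cgroup's dirty ratelimit in bytes/s.
 * Relative: N times the current bdi ratelimit, so it follows the bdi.
 * Absolute: the configured rate, independent of the bdi ratelimit.
 */
static uint64_t cgroup_dirty_ratelimit(const struct cg_limit *lim,
				       uint64_t bdi_dirty_ratelimit)
{
	if (lim->kind == CG_RELATIVE)
		return lim->value * bdi_dirty_ratelimit;
	return lim->value;
}
```

With a bdi ratelimit of 5MB/s, a relative limit of N=3 resolves to
15MB/s, while an absolute 10MB/s limit stays at 10MB/s whatever the bdi
ratelimit adapts to.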

The user can specify any combination of "priority" and "absolute
ratelimit" for any task and/or cgroup, tasks inside a cgroup, and so on.
We have a very powerful (bdi or cgroup)->dirty_ratelimit adaptation
mechanism to support the combinations :)

The "priority" can even be applied to DIRECT dirtiers, _as long as_
there are other buffered dirtiers to generate enough dirty pages. It's
not as easy to apply priorities when there are only DIRECT dirtiers.
In contrast, the absolute ratelimit is always applicable to all kinds
of tasks and cgroups.
