Re: [PATCH 0/5] IO-less dirty throttling v8

From: Vivek Goyal
Date: Tue Aug 09 2011 - 10:05:24 EST

On Tue, Aug 09, 2011 at 03:55:51PM +1000, Dave Chinner wrote:
> On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote:
> > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote:
> > > Hi all,
> > >
> > > The _core_ bits of the IO-less balance_dirty_pages().
> > > Heavily simplified and re-commented to make it easier to review.
> > >
> > > git:// dirty-throttling-v8
> > >
> > > Only the bare minimal algorithms are presented, so you will find some rough
> > > edges in the graphs below. But it's usable :)
> > >
> > >
> > >
> > > And an introduction to the (more complete) algorithms:
> > >
> > >
> > >
> > > Questions and reviews are highly appreciated!
> >
> > Hi Wu,
> >
> > I am going through the slide number 39 where you talk about it being
> > future proof and it can be used for IO control purposes. You have listed
> > following merits of this approach.
> >
> > * per-bdi nature, works on NFS and Software RAID
> > * no delayed response (working at the right layer)
> > * no page tracking, hence decoupled from memcg
> > * no interactions with FS and CFQ
> > * get proportional IO controller for free
> > * reuse/inherit all the base facilities/functions
> >
> > I would say that it will also be a good idea to list the demerits of
> > this approach in current form and that is that it only deals with
> > controlling buffered write IO and nothing else.
> That's not a demerit - that is all it is designed to do.

It is designed to improve the existing task throttling functionality and
we are trying to extend the same to cgroups too. So if by design something
does not gel well with existing pieces, it is a demerit to me. At least
there should be a good explanation of the design intention and how it is
going to be useful.

For example, how is this thing going to gel with the existing IO controller?
Are you going to create two separate mechanisms, one for controlling
writes while entering the cache and the other for controlling writes
at device level?

The fact that this mechanism does not know about any other IO in the
system/cgroup is a limiting factor. From a usability point of view, a
user expects to control any kind of IO happening from a group.

So are we planning to create a new controller? Or add additional files
in existing controller to control the per cgroup write throttling
behavior? Even if we create additional files, again then a user is
forced to put separate write policies for buffered writes and direct
writes. I was hoping a better interface would be that a user puts a
policy on writes and that takes effect and a user does not have to
worry whether the applications inside the cgroup are doing buffered
writes or direct writes.

> > So on the same block device, other direct writes might be going on
> > from same group and in this scheme a user will not have any
> > control.
> But it is taken into account by the IO write throttling.

You mean blkio controller?

It does. But my complaint is that we are trying to control two separate
knobs for two kinds of IO and I am trying to come up with a single
knob.
Current interface for write control in blkio controller looks like.


One can write to this file specifying the write limit of a cgroup
on a particular device. I was hoping that buffered write limits
will come out of the same limit, but with these patches it looks like we
shall have to create a new interface altogether which just controls
buffered writes and nothing else, and the user is supposed to know what
his application is doing and try to configure the limits accordingly.
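
To make the device-specific nature of such a limit concrete, here is a
minimal sketch in Python. It assumes the interface in question is the
cgroup v1 file blkio.throttle.write_bps_device, which takes rules of the
form "major:minor bytes_per_second". The helper name
format_write_bps_rule is made up for illustration and only builds the
string; it does not touch the cgroup filesystem.

```python
def format_write_bps_rule(major: int, minor: int, bytes_per_sec: int) -> str:
    """Build one rule line in the "major:minor bytes_per_second" format.

    Hypothetical helper: it only formats the string that would be
    written into a group's blkio.throttle.write_bps_device file.
    """
    if major < 0 or minor < 0:
        raise ValueError("device numbers must be non-negative")
    if bytes_per_sec < 0:
        raise ValueError("limit must be non-negative")
    return f"{major}:{minor} {bytes_per_sec}"


# Example: limit writes to device 8:16 to 1 MiB/s.
rule = format_write_bps_rule(8, 16, 1024 * 1024)
print(rule)
```

The key of the rule is a block device number, which is exactly the thing
a bdi without a backing block device (the NFS case) does not have.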

So my concern is what the overall interface would look like, how
well it will work with the existing controller, and how a user is
supposed to use it.

In fact the current IO controller does throttling at device level, so
the interface is device specific. One is supposed to know the major
and minor number of the device to specify. I am not sure what one is
supposed to do in this case, as it is bdi specific and for the
NFS case there is no device. So is one supposed to specify a bdi, or
are limits going to be global (system wide, independent of bdi
or block device)?

> > Another disadvantage is that throttling at page cache
> > level does not take care of IO spikes at device level.
> And that is handled as well.
> How? By the indirect effect other IO and IO spikes have on the
> writeback rate. That is, other IO reduces the writeback bandwidth,
> which then changes the throttling parameters via feedback loops.

Actually I was referring to the effect of buffered writes on other IO
going on the device. With control being at device level, one can
tightly control the WRITEs flowing out of a cgroup to the LUN, and that
can help a bit in knowing how bad it will be for other reads going on
the LUN.

With this scheme, flusher threads can suddenly throw tons of writes
at the LUN and then do no IO for another few seconds. So basically IO is
bursty at device level, and doing control at device level can make
it smoother.

So we have two ways to control buffered writes.

- Throttle them while entering the page cache
- Throttle them at device and feedback loop in turn throttles them at
page cache level based on dirty ratio.

Andrea and I had implemented the first approach (the same as what Wu is
suggesting now, with a different mechanism) and following was your
response.
To me it looked like that at that point of time you preferred precise
throttling at device level and now you seem to prefer precise throttling
at page cache level?

Again, I am not against cgroup parameter based throttling at page
cache level. It simplifies the implementation and probably is good
enough for lots of people. I am only worried about the interface
and how it works with existing interfaces.

In absolute throttling one does not have to care about feedback or
what the underlying bdi bandwidth is. So to me these patches are
good for work conserving IO control, where we want to determine how
fast we can write to the device and then throttle tasks accordingly. But
in absolute throttling one specifies the upper limit, and there we
don't need the mechanism to determine what the bdi bandwidth is or
how many dirty pages there are and throttle tasks accordingly.
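
The difference can be sketched with a toy model in Python. This is not
the algorithm from the patches; the function names and the simple "equal
share of the measured bandwidth per dirtying task" rule are assumptions
made only for illustration.

```python
def pause_for_absolute_limit(bytes_dirtied: int, limit_bps: float) -> float:
    """Seconds a task should sleep so that its dirtying rate stays at
    or below a fixed upper limit. Nothing about the device is needed."""
    if limit_bps <= 0:
        raise ValueError("limit must be positive")
    return bytes_dirtied / limit_bps


def pause_for_feedback(bytes_dirtied: int, measured_bdi_bps: float,
                       nr_dirtiers: int) -> float:
    """Seconds a task should sleep when the allowed rate is derived
    from the measured writeout bandwidth of the bdi (toy rule: each
    dirtying task gets an equal share)."""
    if measured_bdi_bps <= 0 or nr_dirtiers <= 0:
        raise ValueError("bandwidth and task count must be positive")
    task_bps = measured_bdi_bps / nr_dirtiers
    return bytes_dirtied / task_bps


MiB = 1024 * 1024
dirtied = 4 * MiB  # a task dirtied 4 MiB since it was last throttled

# Fixed cap of 8 MiB/s configured by the user.
print(pause_for_absolute_limit(dirtied, 8 * MiB))
# bdi measured at 64 MiB/s, shared by 4 dirtying tasks.
print(pause_for_feedback(dirtied, 64 * MiB, 4))
```

In the first function the only input is the configured limit. In the
second, the result changes whenever the measured bandwidth or the number
of dirtiers changes, which is the feedback part that absolute throttling
does not need.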

> The buffered write throttle is designed to reduce the page cache
> dirtying rate to the current cleaning rate of the backing device
> is. Increase the cleaning rate (i.e. device is otherwise idle) and
> it will throttle less. Decrease the cleaning rate (i.e. other IO
> spikes or block IO throttle activates) and it will throttle more.
> We have to do vary buffered write throttling like this to adapt to
> changing IO workloads (e.g. someone starting a read-heavy workload
> will slow down writeback rate, so we need to throttle buffered
> writes more aggressively), so it has to be independent of any sort
> of block layer IO controller.
> Simply put: the block IO controller still has direct control over
> the rate at which buffered writes drain out of the system. The
> IO-less write throttle simply limits the rate at which buffered
> writes come into the system to match whatever the IO path allows to
> drain out....

Ok, this makes sense. So it goes back to the previous design where
absolute cgroup based control happens at device level and IO less
throttle implements the feedback loop to slow down the writes into
page cache. That makes sense. But Wu's slides suggest that one can
directly implement cgroup based IO control in IO less throttling
and that's where I have concerns.

Anyway this stuff shall have to be made cgroup aware so that tasks
of different groups can see different throttling depending on how
much IO that group is able to do at device level.

> > Now I think one could probably come up with more sophisticated scheme
> > where throttling is done at bdi level but is also accounted at device
> > level at IO controller. (Something similar I had done in the past but
> > Dave Chinner did not like it).
> I don't like it because it is solution to a specific problem and
> requires complex coupling across multiple layers of the system. We
> are trying to move away from that throttling model. More
> fundamentally, though, is that it is not a general solution to the
> entire class of "IO writeback rate changed" problems that buffered
> write throttling needs to solve.
> > Anyway, keeping track of per cgroup rate and throttling accordingly
> > can definitely help implement an algorithm for per cgroup IO control.
> > We probably just need to find a reasonable way to account all this
> > IO to end device so that we have control of all kind of IO of a cgroup.
> > How do you implement proportional control here? From overall bdi bandwidth
> > vary per cgroup bandwidth regularly based on cgroup weight? Again the
> > issue here is that it controls only buffered WRITES and nothing else and
> > in this case co-ordinating with CFQ will probably be hard. So I guess
> > usage of proportional IO just for buffered WRITES will have limited
> > usage.
> The whole point of doing the throttling this way is that we don't
> need any sort of special connection between block IO throttling and
> page cache (buffered write) throttling. We significantly reduce the
> coupling between the layers by relying on feedback-driven control
> loops to determine the buffered write throttling thresholds
> adaptively. IOWs, the IO-less write throttling at the page cache
> will adjust automatically to whatever throughput the block IO
> throttling allows async writes to achieve.

This is good. But that's not the impression one gets from Wu's slides.

> However, before we have a "finished product", there is still another
> piece of the puzzle to be put in place - memcg-aware buffered
> writeback. That is, having a flusher thread do work on behalf of
> memcg in the IO context of the memcg. Then the IO controller just
> sees a stream of async writes in the context of the memcg the
> buffered writes came from in the first place. The block layer
> throttles them just like any other IO in the IO context of the
> memcg...

Yes that is still a piece remaining. I was hoping that Greg Thelen will
be able to extend his patches to submit writes in the context of
per cgroup flusher/worker threads and solve this problem.
