Re: [PATCH 0/5] IO-less dirty throttling v8

From: Greg Thelen
Date: Wed Aug 10 2011 - 03:41:33 EST

On Aug 9, 2011 7:04 AM, "Vivek Goyal" <vgoyal@xxxxxxxxxx> wrote:
> On Tue, Aug 09, 2011 at 03:55:51PM +1000, Dave Chinner wrote:
> > On Mon, Aug 08, 2011 at 10:01:27PM -0400, Vivek Goyal wrote:
> > > On Sat, Aug 06, 2011 at 04:44:47PM +0800, Wu Fengguang wrote:
> > > > Hi all,
> > > >
> > > > The _core_ bits of the IO-less balance_dirty_pages().
> > > > Heavily simplified and re-commented to make it easier to review.
> > > >
> > > >   git:// dirty-throttling-v8
> > > >
> > > > Only the bare minimal algorithms are presented, so you will find some rough
> > > > edges in the graphs below. But it's usable :)
> > > >
> > > >
> > > >
> > > > And an introduction to the (more complete) algorithms:
> > > >
> > > >
> > > >
> > > > Questions and reviews are highly appreciated!
> > >
> > > Hi Wu,
> > >
> > > I am going through the slide number 39 where you talk about it being
> > > future proof and it can be used for IO control purposes. You have listed
> > > following merits of this approach.
> > >
> > > * per-bdi nature, works on NFS and Software RAID
> > > * no delayed response (working at the right layer)
> > > * no page tracking, hence decoupled from memcg
> > > * no interactions with FS and CFQ
> > > * get proportional IO controller for free
> > > * reuse/inherit all the base facilities/functions
> > >
> > > I would say that it will also be a good idea to list the demerits of
> > > this approach in current form and that is that it only deals with
> > > controlling buffered write IO and nothing else.
> >
> > That's not a demerit - that is all it is designed to do.
> It is designed to improve the existing task throttling functionality and
> we are trying to extend the same to cgroups too. So if by design something
> does not gel well with existing pieces, it is demerit to me. Atleast
> there should be a good explanation of design intention and how it is
> going to be useful.
> For example, how this thing is going to gel with existing IO controller?
> Are you going to create two separate mechianisms. One for control of
> writes while entering the cache and other for controlling the writes
> at device level?
> The fact that this mechanism does not know about any other IO in the
> system/cgroup is a limiting factor. From usability point of view, a
> user expects any kind of IO happening from a group.
> So are we planning to create a new controller? Or add additional files
> in existing controller to control the per cgroup write throttling
> behavior? Even if we create additional files, again then a user is
> forced to put separate write policies for buffered writes and direct
> writes. I was hoping a better interface would be that user puts a
> policy on writes and that takes affect and a user does not have to
> worry whether the applications inside the cgroup are doing buffered
> writes or direct writes.
> >
> > > So on the same block device, other direct writes might be going on
> > > from same group and in this scheme a user will not have any
> > > control.
> >
> > But it is taken into account by the IO write throttling.
> You mean blkio controller?
> It does. But my complain is that we are trying to control two separate
> knobs for two kind of IOs and I am trying to come up with a single
> knob.
> Current interface for write control in blkio controller looks like.
> blkio.throtl.write_bps_device
> Once can write to this file specifying the write limit of a cgroup
> on a particular device. I was hoping that buffered write limits
> will come out of same limit but with these pathes looks like we
> shall have to create a new interface altogether which just controls
> buffered writes and nothing else and user is supposed to know what
> his application is doing and try to configure the limits accordingly.
> So my concern is that how the overall interface would look like and
> how well it will work with existing controller and how a user is
> supposed to use it.
> In fact current IO controller does throttling at device level so
> interface is device specific. One is supposed to know the major
> and minor number of device to specify. I am not sure in this
> case what one is supposed to do as it is bdi specific and for
> NFS case there is no device. So one is supposed to speciy bdi or
> limits are going to be global (system wide, independent of bdi
> or block device)?
> >
> > > Another disadvantage is that throttling at page cache
> > > level does not take care of IO spikes at device level.
> >
> > And that is handled as well.
> >
> > How? By the indirect effect other IO and IO spikes have on the
> > writeback rate. That is, other IO reduces the writeback bandwidth,
> > which then changes the throttling parameters via feedback loops.
> Actually I was referring to effect of buffered writes on other IO
> going on the device. With control being on device level, one can
> tightly control the WRITEs flowing out of a cgroup to Lun and that
> can help a bit knowing how bad it will be for other reads going on
> the lun.
> With this scheme, flusher threads can suddenly throw tons of writes
> on lun and then no IO for another few seconds. So basically IO is
> bursty at device level and doing control at device level can make
> it more smooth.
> So we have two ways to control buffered writes.
> - Throttle them while entering the page cache
> - Throttle them at device and feedback loop in turn throttles them at
>  page cache level based on dirty ratio.
> Myself and Andrea had implemented first appraoch (same what Wu is
> suggesting now with a different mechanism) and following was your
> response.
> To me it looked like that at that point of time you preferred precise
> throttling at device level and now you seem to prefer precise throttling
> at page cache level?
> Again, I am not against cgroup parameter based throttling at page
> cache level. It simplifies the implementation and probably is good
> enough for lots of people. I am only worried about that the interface
> and how does it work with existing interfaces.
> In absolute throttling one does not have to care about feedback or
> what is the underlying bdi bandwidth. So to me these patches are
> good for work conserving IO control where we want to determine how
> fast we can write to device and then throttle tasks accordingly. But
> in absolute throttling one specifies the upper limit and there we
> don't need the mechanism to determine what the bdi badnwidth or
> how many dirty pages are there and throttle tasks accordingly.
> >
> > The buffered write throttle is designed to reduce the page cache
> > dirtying rate to the current cleaning rate of the backing device
> > is. Increase the cleaning rate (i.e. device is otherwise idle) and
> > it will throttle less. Decrease the cleaning rate (i.e. other IO
> > spikes or block IO throttle activates) and it will throttle more.
> >
> > We have to do vary buffered write throttling like this to adapt to
> > changing IO workloads (e.g.  someone starting a read-heavy workload
> > will slow down writeback rate, so we need to throttle buffered
> > writes more aggressively), so it has to be independent of any sort
> > of block layer IO controller.
> >
> > Simply put: the block IO controller still has direct control over
> > the rate at which buffered writes drain out of the system. The
> > IO-less write throttle simply limits the rate at which buffered
> > writes come into the system to match whatever the IO path allows to
> > drain out....
> Ok, this makes sense. So it goes back to the previous design where
> absolute cgroup based control happens at device level and IO less
> throttle implements the feedback loop to slow down the writes into
> page cache. That makes sense. But Wu's slides suggest that one can
> directly implement cgroup based IO control in IO less throttling
> and that's where I have concerns.
> Anyway this stuff shall have to be made cgroup aware so that tasks
> of different groups can see different throttling depending on how
> much IO that group is able to do at device level.
> >
> > > Now I think one could probably come up with more sophisticated scheme
> > > where throttling is done at bdi level but is also accounted at device
> > > level at IO controller. (Something similar I had done in the past but
> > > Dave Chinner did not like it).
> >
> > I don't like it because it is solution to a specific problem and
> > requires complex coupling across multiple layers of the system. We
> > are trying to move away from that throttling model. More
> > fundamentally, though, is that it is not a general solution to the
> > entire class of "IO writeback rate changed" problems that buffered
> > write throttling needs to solve.
> >
> > > Anyway, keeping track of per cgroup rate and throttling accordingly
> > > can definitely help implement an algorithm for per cgroup IO control.
> > > We probably just need to find a reasonable way to account all this
> > > IO to end device so that we have control of all kind of IO of a cgroup.
> > > How do you implement proportional control here? From overall bdi bandwidth
> > > vary per cgroup bandwidth regularly based on cgroup weight? Again the
> > > issue here is that it controls only buffered WRITES and nothing else and
> > > in this case co-ordinating with CFQ will probably be hard. So I guess
> > > usage of proportional IO just for buffered WRITES will have limited
> > > usage.
> >
> > The whole point of doing the throttling this way is that we don't
> > need any sort of special connection between block IO throttling and
> > page cache (buffered write) throttling. We significantly reduce the
> > coupling between the layers by relying on feedback-driven control
> > loops to determine the buffered write throttling thresholds
> > adaptively. IOWs, the IO-less write throttling at the page cache
> > will adjust automatically to whatever throughput the block IO
> > throttling allows async writes to achieve.
> This is good. But that's not the impression one gets from Wu's slides.
> >
> > However, before we have a "finished product", there is still another
> > piece of the puzzle to be put in place - memcg-aware buffered
> > writeback. That is, having a flusher thread do work on behalf of
> > memcg in the IO context of the memcg. Then the IO controller just
> > sees a stream of async writes in the context of the memcg the
> > buffered writes came from in the first place. The block layer
> > throttles them just like any other IO in the IO context of the
> > memcg...
> Yes that is still a piece remaining. I was hoping that Greg Thelen will
> be able to extend his patches to submit writes in the context of
> per cgroup flusher/worker threads and solve this problem.
> Thanks
> Vivek

Are you suggesting multiple flushers per bdi (one per cgroup)?  I
thought the point of IO less was to one issue buffered writes from a
single thread.

Note: I have rebased the memcg writeback code to latest mmotm and am
testing it now.  These patches do not introduce additional threads;
the existing bdi flusher threads are used with an optional memcg
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at