Re: IO scheduler based IO controller V10

From: Vivek Goyal
Date: Mon Oct 05 2009 - 13:12:39 EST


On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > > Hi,
> > >
> > > Munehiro Ikeda <m-ikeda@xxxxxxxxxxxxx> wrote:
> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > > going through the request based dm-multipath paper. Will it make sense
> > > > > to implement request based dm-ioband? So basically we implement all the
> > > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > > to take the request and break it back into bios. This way we can keep
> > > > > all the group control at one place and also meet most of the requirements.
> > > > >
> > > > > So request based dm-ioband will have a request in hand once that request
> > > > > has passed group control and prio control. Because dm-ioband is a device
> > > > > mapper target, one can put it on higher level devices (practically taking
> > > > > CFQ at higher level device), and provide fairness there. One can also
> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > > them to use the IO scheduler.)
> > > > >
> > > > > > I am sure there will be many issues, but one big issue I can think of is
> > > > > > that CFQ thinks there is one device beneath it and dispatches requests
> > > > > > from one queue (in case of idling), and that would kill parallelism at
> > > > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > > >
> > > > > Thanks
> > > > > Vivek
> > > >
> > > > As long as using CFQ, your idea is reasonable for me. But how about for
> > > > other IO schedulers? In my understanding, one of the keys to guarantee
> > > > group isolation in your patch is to have per-group IO scheduler internal
> > > > queue even with as, deadline, and noop scheduler. I think this is
> > > > great idea, and to implement generic code for all IO schedulers was
> > > > concluded when we had so many IO scheduler specific proposals.
> > > > If we will still need per-group IO scheduler internal queues with
> > > > request-based dm-ioband, we have to modify elevator layer. It seems
> > > > out of scope of dm.
> > > > I might miss something...
> > >
> > > IIUC, the request based device-mapper could not break back a request
> > > into bio, so it could not work with block devices which don't use the
> > > IO scheduler.
> > >
> >
> > I think the current request based multipath driver does not do it, but can't
> > it be implemented so that requests are broken back into bios?
>
> I guess it would be hard to implement it, and we need to hold requests
> and throttle them there, and it would break the ordering by CFQ.
>
> > Anyway, I don't feel too strongly about this approach as it might
> > introduce more serialization at higher layer.
>
> Yes, I know it.
>
> > > How about adding a callback function to the higher level controller?
> > > CFQ calls it when the active queue runs out of time, then the higher
> > > level controller uses it as a trigger or a hint to move IO group, so
> > > I think a time-based controller could be implemented at higher level.
> > >
> >
> > Adding a call back should not be a big issue. But that means you are
> > planning to run only one group at higher layer at one time and I think
> > that's the problem because then we are introducing serialization at higher
> > layer. So any higher level device mapper target which has multiple
> > physical disks under it, we might be underutilizing these even more and
> > take a big hit on overall throughput.
> >
> > The whole design of doing proportional weight at lower layer is optimal
> > usage of system.
>
> But I think that the higher level approach makes it easy to configure
> against striped software raid devices.

How does a higher level controller make it easier to configure?

In case of lower level design, one just has to create cgroups and assign
weights to cgroups. This minimum step will be required in higher level
controller also (even if you get rid of the dm-ioband device setup step).
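
As a rough illustration of what "assign weights to cgroups" means for
proportional weight control, here is a minimal sketch. It is not code from
any of the patches discussed in this thread; the group names and weight
values are made up, and it assumes every group is continuously backlogged.

```python
def expected_shares(weights):
    """Map each cgroup name to the fraction of disk service it should get
    when all groups are continuously backlogged (an assumption)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical groups and weights, chosen only for illustration.
shares = expected_shares({"grp_a": 1000, "grp_b": 500, "grp_c": 500})
for name, share in shares.items():
    print(f"{name}: {share:.0%} of disk service under contention")
```

Whether "disk service" is measured in time, bytes, or number of IOs is a
separate policy choice; the weights only fix the ratio between groups.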

> If one would like to
> combine some physical disks into one logical device like a dm-linear,
> I think one should map the IO controller on each physical device and
> combine them into one logical device.
>

In fact this sounds like a more complicated step where one has to set up
one dm-ioband device on top of each physical device. But I am assuming
that this will go away once you move to a per request queue like implementation.

I think it should be the same in principle as my initial implementation of IO
controller on request queue, and I stopped development on it because of FIFO
dispatch.

So you seem to be suggesting that you will move dm-ioband to request queue
so that the additional device setup step is gone. You will also enable
it to do time based group policy, so that we don't run into issues on
seeky media. You will also enable dispatch from one group only at a time so
that we don't run into isolation issues and can do time accounting
accurately.

If yes, then that has the potential to solve the issue. At higher layer one
can think of enabling size of IO/number of IO policy both for proportional
BW and max BW type of control. At lower level one can enable pure time
based control on seeky media.
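
To make the difference between size based and time based accounting on
seeky media concrete, here is a toy sketch. It is not CFQ or dm-ioband code.
It assumes a simple weighted fair scheduler that always dispatches from the
group with the smallest virtual time, and the IO sizes and service times
are hypothetical numbers picked only for illustration.

```python
def simulate(groups, charge, total_ms):
    """Return each group's fraction of disk time after about total_ms.

    groups: name -> dict with 'weight', 'io_bytes', 'io_ms' (all hypothetical)
    charge: 'time' bills a group for the disk time its IO took,
            'size' bills a group for the bytes it transferred
    """
    vtime = {name: 0.0 for name in groups}
    disk_ms = {name: 0.0 for name in groups}
    elapsed = 0.0
    while elapsed < total_ms:
        # Serve the group that has been billed the least so far.
        name = min(vtime, key=vtime.get)
        group = groups[name]
        cost = group["io_ms"] if charge == "time" else group["io_bytes"]
        vtime[name] += cost / group["weight"]
        disk_ms[name] += group["io_ms"]
        elapsed += group["io_ms"]
    return {name: disk_ms[name] / elapsed for name in groups}


# Two equal-weight groups: one sequential reader, one seeky reader.
groups = {
    "seq":   {"weight": 1, "io_bytes": 512 * 1024, "io_ms": 4.0},
    "seeky": {"weight": 1, "io_bytes": 4 * 1024,   "io_ms": 8.0},
}
by_time = simulate(groups, "time", total_ms=10_000)
by_size = simulate(groups, "size", total_ms=10_000)
print("time based:", by_time)
print("size based:", by_size)
```

With these numbers, time based charging splits disk time roughly evenly,
while size based charging lets the seeky group hold the disk for nearly
the whole run, because each of its small IOs is billed very little.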

I think this will still leave us with the issue of prio within a group, as
group control is separate and you will not be maintaining separate queues for
each process. Similarly you will also have issues with read vs write
ratios as IO schedulers underneath change.

So I will be curious to see that implementation.

> > > My requirements for IO controller are:
> > > - Implement as a higher level controller, which is located at block
> > > layer and bio is grabbed in generic_make_request().
> >
> > How are you planning to handle the issue of buffered writes Andrew raised?
>
> I think that it would be better to use the higher-level controller
> along with the memory controller and have limits on memory usage for each
> cgroup. And as Kamezawa-san said, having limits of dirty pages would
> be better, too.
>

Ok. So if we plan to co-mount the memory controller with per memory group
dirty_ratio implemented, that can work with both higher level as well as
low level controller. Not sure if we also require some kind of per
memory group flusher thread infrastructure to make sure a higher weight
group gets more work done.

> > > - Can work with any type of IO scheduler.
> > > - Can work with any type of block devices.
> > > - Support multiple policies, proportional weight, max rate, time
> > > based, and so on.
> > >
> > > The IO controller mini-summit will be held next week, and I'm
> > > looking forward to meeting you all and discussing the IO controller.
> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> >
> > Is there a new version of dm-ioband now where you have solved the issue of
> > sync/async dispatch within a group? Before meeting at mini-summit, I am
> > trying to run some tests and come up with numbers so that we have a
> > clearer picture of pros/cons.
>
> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> dm-ioband handles sync/async IO requests separately and
> the write-starve-read issue you pointed out is fixed. I would
> appreciate it if you would try them.
> http://sourceforge.net/projects/ioband/files/

Cool. Will get to testing it.

Thanks
Vivek