[RFC PATCH 0/8] blk-throttle: Throttle buffered WRITE in balance_dirty_pages()

From: Vivek Goyal
Date: Fri Jun 03 2011 - 17:09:22 EST


Hi,

I have been trying to find ways to solve two problems with block IO controller
cgroups.

- Current throttling logic in the IO controller does not throttle buffered
WRITEs. Well, it does throttle all WRITEs at the device, but by that time
buffered WRITEs have lost the submitter's context and most of the IO
arrives at the device in the flusher thread's context. Hence buffered
write throttling is currently not supported.

- All WRITEs are throttled at device level and this can easily lead to
filesystem serialization.

One simple example: if a process writes some pages to the cache, then
does fsync(), and the process gets throttled, it locks up the
filesystem. With ext4, I noticed that even a simple "ls" does not make
progress. The reason boils down to the fact that filesystems are not
aware of cgroups, and one of the things which gets serialized is
journalling in ordered mode.

So even if we do something to carry the submitter's cgroup information
to the device and do throttling there, it will lead to serialization of
filesystems and is not a good idea.

So how do we go about fixing it? There seem to be two options.

- Throttling should still be done at device level. Make filesystems aware
of cgroups so that multiple transactions can make progress in parallel
(per cgroup) and there are no shared resources across cgroups in
filesystems which can lead to serialization.

- Throttle WRITEs while they are entering the cache and not after that.
Something like balance_dirty_pages(). Direct IO is still throttled
at device level. That way, we can avoid these journalling related
serialization issues w.r.t. throttling.

But the big issue with this approach is that we control the rate at
which IO enters the cache, not the IO rate at the device. So it can
happen that the flusher later submits lots of WRITEs to the device at
once, and we will see periodic IO spikes at the end node.

So this mechanism helps a bit but is not the complete solution. It
primarily helps those folks who have the system resources and plenty
of IO bandwidth available but don't want to give it to a customer
because it is not a premium customer, etc.
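
To make the spike concern concrete, here is a toy model. It is only a
sketch under assumed numbers: the 5 second writeback interval and the
"flush everything at once" behaviour are made up for illustration and
are not how the real flusher decides what to write.

```python
def device_writes_per_second(dirty_rate_bps, flush_interval_s, duration_s):
    """Bytes reaching the device in each one-second tick.

    Assumes a task dirties pages at a steady dirty_rate_bps (the rate
    enforced while entering the cache) and a hypothetical flusher that
    writes back everything dirtied so far every flush_interval_s seconds.
    """
    dirty = 0
    per_second = []
    for second in range(1, duration_s + 1):
        dirty += dirty_rate_bps
        if second % flush_interval_s == 0:
            per_second.append(dirty)   # whole backlog hits the device
            dirty = 0
        else:
            per_second.append(0)       # device sees nothing this second
    return per_second


timeline = device_writes_per_second(1024000, 5, 10)
print(timeline)
```

The average over the 10 seconds still matches the configured rate, but
the device sees it as two bursts of 5120000 bytes each rather than a
steady stream.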

Option 1 seems really hard to implement. Filesystems have not been written
with cgroups in mind. So I am really skeptical that I can convince file
system designers to make fundamental changes in filesystems and journalling
code to make them cgroup aware.

Hence with this patch series I have implemented option 2. Option 2 is not
the best solution, but at least it gives us some control, which is better
than having no control over buffered writes. Andrea Righi posted similar
patches in the past here:

https://lkml.org/lkml/2011/2/28/115

That patch series had issues w.r.t. the interaction between bio and task
throttling, so I redid it.

Design
------

The IO controller already has the capability to keep track of a group's
IO rate, enqueue bios in internal queues if the group exceeds its rate,
and dispatch those bios later.

This patch series also introduces the capability to throttle a dirtying
task in balance_dirty_pages_ratelimited_nr(). Now no WRITEs except
direct WRITEs will be throttled at device level. If a dirtying task
exceeds its configured IO rate, it is put on a group wait queue and
woken up when it can dirty more pages.
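
The idea of making a dirtying task wait can be sketched in user space
as below. This is a simplified model, not the code in the patches: the
GroupBudget class and its field names are hypothetical, and it ignores
the slice handling that blk-throttle does. It only shows how a wait
time falls out of a bytes-per-second limit.

```python
from dataclasses import dataclass


@dataclass
class GroupBudget:
    """Hypothetical per-group accounting; not a kernel structure."""
    bps_limit: int            # configured bytes per second
    bytes_dirtied: int = 0    # bytes charged since start_time
    start_time: float = 0.0   # when accounting started, in seconds

    def wait_before_dirtying(self, nbytes, now):
        """Seconds the task should sleep before dirtying nbytes more."""
        # Earliest time at which the running total stays within the limit.
        earliest = self.start_time + (self.bytes_dirtied + nbytes) / self.bps_limit
        return max(0.0, earliest - now)

    def charge(self, nbytes):
        """Account nbytes as dirtied by the group."""
        self.bytes_dirtied += nbytes


# A task trying to dirty ten 256KiB chunks back to back at 1024000 bytes/sec.
budget = GroupBudget(bps_limit=1024000)
now = 0.0
chunk = 256 * 1024
for _ in range(10):
    now += budget.wait_before_dirtying(chunk, now)  # "sleep" on the wait queue
    budget.charge(chunk)

print(f"finished after {now:.3f} s")
```

A task that dirties pages slower than the limit never waits in this
model; only a task running ahead of its budget is put to sleep.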

No new interface has been introduced; both direct IO and buffered IO
make use of the common IO rate limit.

How To
------
- Create a cgroup and limit its writes to device 8:16 to 1024000 bytes/sec
  (roughly 1MB/s).

  echo "8:16 1024000" > /cgroup/blk/test1/blkio.throttle.write_bps_device

- Launch dd in the cgroup.

  dd if=/dev/zero of=zerofile bs=4K count=1K

  1024+0 records in
  1024+0 records out
  4194304 bytes (4.2 MB) copied, 4.00428 s, 1.0 MB/s
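
As a sanity check on the numbers above, the expected run time can be
worked out from the limit. The helper names here are made up for the
example; only the rule format ("major:minor bytes_per_sec") and the
figures come from the commands and dd output above.

```python
def throttle_rule(major, minor, bytes_per_sec):
    """Line to write into blkio.throttle.write_bps_device."""
    return f"{major}:{minor} {bytes_per_sec}"


def expected_seconds(total_bytes, bytes_per_sec):
    """Time a fully throttled write of total_bytes should take."""
    return total_bytes / bytes_per_sec


rule = throttle_rule(8, 16, 1024000)
total = 4 * 1024 * 1024                  # dd bs=4K count=1K
expected = expected_seconds(total, 1024000)
observed_rate = 4194304 / 4.00428        # bytes and seconds from dd output

print(rule)
print(f"expected {expected:.3f} s, observed rate {observed_rate:.0f} bytes/sec")
```

So the dd run should take about 4.1 seconds at the configured limit,
and the reported 4.00428 s is within a few percent of that.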

Any feedback is welcome.

Thanks
Vivek

Vivek Goyal (8):
  blk-throttle: convert wait routines to return jiffies to wait
  blk-throttle: do not enforce first queued bio check in
    tg_wait_dispatch
  blk-throttle: use IO size and direction as parameters to wait
    routines
  blk-throttle: Specify number of IOs during dispatch update
  blk-throttle: Get rid of extend slice trace message
  blk-throttle: core logic to throttle task while dirtying pages
  blk-throttle: Do not throttle WRITEs at device level except direct IO
  blk-throttle: enable throttling of task while dirtying pages

 block/blk-cgroup.c        |    6 +-
 block/blk-cgroup.h        |    2 +-
 block/blk-throttle.c      |  506 +++++++++++++++++++++++++++++++++++---------
 block/cfq-iosched.c       |    2 +-
 block/cfq.h               |    6 +-
 fs/direct-io.c            |    1 +
 include/linux/blk_types.h |    2 +
 include/linux/blkdev.h    |    5 +
 mm/page-writeback.c       |    3 +
 9 files changed, 421 insertions(+), 112 deletions(-)

--
1.7.4.4
