[RFC] IO scheduler based IO controller V9
From: Vivek Goyal
Date: Fri Aug 28 2009 - 17:32:41 EST
Hi All,
Here is the V9 of the IO controller patches generated on top of 2.6.31-rc7.
For ease of patching, a consolidated patch is available here.
http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v9.patch
Changes from V8
===============
- Implemented bdi like congestion semantics for io group also. Now once an
io group gets congested, we don't clear the congestion flag until number
of requests goes below nr_congestion_off.
This helps in getting rid of Buffered write performance regression we
were observing with io controller patches.
Gui, can you please test it and see if this version is better in terms
of your buffered write tests.
- Moved some of the functions from blk-core.c to elevator-fq.c. This reduces
CONFIG_GROUP_IOSCHED ifdefs in blk-core.c and the code looks a little cleaner.
- Fixed issue of add_front where we go left on rb-tree if add_front is
specified in case of preemption.
- Requeue async ioq after one round of dispatch. This helps emulate
CFQ behavior.
- Pulled in v11 of io tracking patches and modified config option so that if
CONFIG_TRACK_ASYNC_CONTEXT is not enabled, blkio is not compiled in.
- Fixed some block tracepoints which were broken because of per group request
list changes.
- Fixed some logging messages.
- Got rid of extra call to update_prio as pointed out by Jerome and Gui.
- Merged the fix from Jerome for a crash while changing prio.
- Got rid of redundant slice_start assignment as pointed out by Gui.
- Merged an elv_ioq_nr_dispatched() cleanup from Gui.
- Fixed a compilation issue if CONFIG_BLOCK=n.
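The congestion semantics in the first item above amount to a hysteresis: the flag is set at one threshold and cleared only below a lower one. A minimal Python sketch of that idea follows. It is not code from the patchset; the class and method names are hypothetical, and the upper threshold name nr_congestion_on is an assumption borrowed from the request queue's bdi-style thresholds.

```python
class GroupCongestion:
    """Hypothetical per-io-group congestion flag with hysteresis."""

    def __init__(self, nr_congestion_on, nr_congestion_off):
        # The clear threshold must sit below the set threshold.
        assert nr_congestion_off < nr_congestion_on
        self.on = nr_congestion_on
        self.off = nr_congestion_off
        self.nr_requests = 0
        self.congested = False

    def _update(self):
        if self.nr_requests >= self.on:
            self.congested = True
        elif self.nr_requests < self.off:
            self.congested = False
        # Between the two thresholds the flag keeps its previous value.

    def alloc_request(self):
        self.nr_requests += 1
        self._update()

    def free_request(self):
        self.nr_requests -= 1
        self._update()
```

Without the gap between the two thresholds, a group hovering around a single limit would flip the flag on every request, which is the kind of behavior that can hurt buffered writers.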
What problem are we trying to solve
===================================
Provide group IO scheduling feature in Linux along the lines of other resource
controllers like cpu.
IOW, provide facility so that a user can group applications using cgroups and
control the amount of disk time/bandwidth received by a group based on its
weight.
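As a rough illustration of what "based on its weight" means, the sketch below maps group weights to the fraction of disk time each group would ideally receive. It assumes every group is continuously backlogged; the function name is hypothetical and this is not code from the patchset.

```python
def disk_shares(weights):
    """Map group name -> ideal fraction of disk time, proportional to weight.

    Assumes all listed groups always have IO pending.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Two groups with weights 1000 and 500 should split the disk 2:1.
shares = disk_shares({"group1": 1000, "group2": 500})
```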
How to solve the problem
=========================
Different people have solved the issue differently. There are now at least
three patchsets available (including this one).
IO throttling
-------------
This is a bandwidth controller which keeps track of IO rate of a group and
throttles the process in the group if it exceeds the user specified limit.
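To make the max-bandwidth idea concrete, here is a generic token-bucket sketch in Python. This is a stand-in technique chosen for illustration; the IO throttling patchset's actual algorithm is not described here, and all names below are hypothetical.

```python
class RateLimiter:
    """Generic token bucket: one bucket per group, refilled at the group limit."""

    def __init__(self, limit_bytes_per_sec, burst_bytes=None):
        self.rate = float(limit_bytes_per_sec)
        if burst_bytes is None:
            burst_bytes = limit_bytes_per_sec
        self.capacity = float(burst_bytes)
        self.tokens = self.capacity
        self.last = 0.0

    def delay_for(self, nbytes, now):
        """Charge nbytes at time `now`; return how long the submitter should sleep."""
        # Refill for the time elapsed since the last charge, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        # The deficit is paid back at the configured rate.
        return -self.tokens / self.rate
```

Note that such a limiter caps a group regardless of whether the disk is otherwise idle, which is the difference from a proportional weight controller.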
dm-ioband
---------
This is a proportional bandwidth controller implemented as device mapper
driver and provides fair access in terms of amount of IO done (not in terms
of disk time as CFQ does).
So one will setup one or more dm-ioband devices on top of physical/logical
block device, configure the ioband device and pass information like grouping
etc. Now this device will keep track of bios flowing through it and control
the flow of bios based on group policies.
IO scheduler based IO controller
--------------------------------
Here I have viewed the problem of IO controller as a hierarchical group
scheduling issue (along the lines of CFS group scheduling). Currently one can
view Linux IO schedulers as flat, where there is one root group and all the IO
belongs to that group.
This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes. I
have extended it to support group IO scheduling. I also took some of the code
out of CFQ and put it in a common layer so that the same group scheduling code
can be used by noop, deadline and AS to support group scheduling.
Pros/Cons
=========
There are pros and cons to each of the approaches. Following are some of my
thoughts.
- IO throttling is a max bandwidth controller and not a proportional one.
Additionally it provides fairness in terms of amount of IO done (and not in
terms of disk time as CFQ does).
Personally, I think that a proportional weight controller is useful to more
people than just a max bandwidth controller. In addition, an IO scheduler
based controller can also be enhanced to do max bandwidth control, if need be.
- dm-ioband also provides fairness in terms of amount of IO done, not in terms
of disk time. So a seeky process can still run away with a lot more disk time.
This raises an interesting question: how should fairness among groups be
viewed, and what is more relevant? Should fairness be based on the amount of
IO done or on the amount of disk time consumed, as CFQ does? The IO scheduler
based controller provides fairness in terms of disk time used.
- IO throttling and dm-ioband are both second level controllers. That is, these
controllers are implemented in higher layers than IO schedulers. So they
control the IO at a higher layer based on group policies and later IO
schedulers take care of dispatching these bios to disk.
Implementing a second level controller has the advantage of being able to
provide bandwidth control even on logical block devices in the IO stack
which don't have any IO scheduler attached. But they can also
interfere with the IO scheduling policy of the underlying IO scheduler and
change the effective behavior. Following are some of the issues which I think
should be visible in a second level controller in one form or another.
Prio with-in group
------------------
A second level controller can potentially interfere with the behavior of
different prio processes within a group. bios are buffered at a higher layer
in a single queue, and release of bios is FIFO and not proportionate to the
ioprio of the process. This can result in a particular prio level not
getting its fair share.
Buffering at a higher layer can delay read requests for more than the slice
idle period of CFQ (default 8 ms). That means it is possible that we are
waiting for a request from the queue but it is buffered at a higher layer, and
then the idle timer will fire. The queue will lose its share, and at the same
time overall throughput will be impacted as we lost those 8 ms.
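The idle timer interaction can be put in a few lines. The following is a deliberately simplified model, not CFQ code: the function name is hypothetical and the only number taken from the text is the 8 ms default.

```python
def idle_outcome(arrival_delay_ms, slice_idle_ms=8):
    """Model CFQ idling on a queue whose next request is `arrival_delay_ms` away.

    Returns (queue_keeps_slice, ms_the_disk_sat_idle).
    """
    if arrival_delay_ms <= slice_idle_ms:
        # The request shows up in time; the queue continues its slice.
        return True, arrival_delay_ms
    # The idle timer fires first: the slice is lost and the wait was wasted.
    return False, slice_idle_ms
```

A request that a higher layer holds back for, say, 12 ms gives `(False, 8)`: the queue is expired and the disk idled for the full period for nothing.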
Read Vs Write
-------------
Writes can overwhelm readers, hence a second level controller's FIFO release
will run into issues here. If there is a single queue maintained then reads
will suffer large latencies. If there are separate queues for reads and writes
then it will be hard to decide in what ratio to dispatch reads and writes, as
it is the IO scheduler's job to decide when and how much read/write to
dispatch. This is another place where a higher level controller will not be in
sync with the lower level IO scheduler and can change the effective policies
of the underlying IO scheduler.
Fairness in terms of disk time / size of IO
---------------------------------------------
A higher level controller will most likely be limited to providing fairness
in terms of size of IO done and will find it hard to provide fairness in
terms of disk time used (as CFQ provides between various prio levels). This
is because only the IO scheduler knows how much disk time a queue has used.
Not sure how useful it is to have fairness in terms of sectors, as CFQ has
been providing fairness in terms of disk time. So a seeky application will
still run away with a lot of disk time and bring down the overall throughput
of the disk more than usual.
CFQ IO context Issues
---------------------
Buffering at higher layer means submission of bios later with the help of
a worker thread. This changes the io context information at CFQ layer which
assigns the request to submitting thread. Change of io context info again
leads to issues of idle timer expiry and issue of a process not getting fair
share and reduced throughput.
Throughput with noop, deadline and AS
---------------------------------------------
I think a higher level controller will result in reduced overall throughput
(as compared to the IO scheduler based IO controller) and more seeks with noop,
deadline and AS.
The reason is that it is likely that IO within a group will be related
and will be relatively close as compared to IO across the groups. For example,
thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
control, IO from various groups will go into a single queue at lower level
controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
G4....) causing more seeks and reduced throughput. (Agreed that merging will
help up to some extent but still....).
Instead, in case of lower level controller, IO scheduler maintains one queue
per group hence there is no interleaving of IO between groups. And if IO is
related within a group, then we should get a reduced number/amount of seeks
and higher throughput.
Latency can be a concern but that can be controlled by reducing the time
slice length of the queue.
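The seek argument can be illustrated with a toy model. The sketch below assumes each group issues sequential IO inside its own disk region and that the lower level does no sorting or merging (closest to noop); the sector numbers are made up for illustration and the function names are hypothetical.

```python
def total_seek_distance(sectors, start=0):
    """Sum of head movements when requests are served in the given order."""
    pos, total = start, 0
    for sector in sectors:
        total += abs(sector - pos)
        pos = sector
    return total


def interleave(*streams):
    """Single shared queue: G1, G2, G3, G1, G2, G3, ..."""
    order = []
    for batch in zip(*streams):
        order.extend(batch)
    return order


def one_queue_per_group(*streams):
    """Per-group queues: each group is served for a whole slice in turn."""
    order = []
    for stream in streams:
        order.extend(stream)
    return order


# Three groups, each reading four sequential 8-sector requests in its own region.
g1 = [0, 8, 16, 24]
g2 = [1000000, 1000008, 1000016, 1000024]
g3 = [2000000, 2000008, 2000016, 2000024]

interleaved = total_seek_distance(interleave(g1, g2, g3))
per_group = total_seek_distance(one_queue_per_group(g1, g2, g3))
```

In this toy setup the interleaved order moves the head roughly seven times as far as the per-group order. Real schedulers sort and merge, so the gap would be smaller in practice, but the direction of the effect is the point.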
- The IO scheduler based controller has the limitation that it works only with
the bottom most devices in the IO stack where an IO scheduler is attached. The
question then is how important/relevant it is to control bandwidth at
higher level logical devices also. The actual contention for resources is
at the leaf block device, so it probably makes sense to do any kind of
control there and not at the intermediate devices. Secondly, it probably
also means better use of available resources.
For example, assume a user has created a linear logical device lv0 using
three underlying disks sda, sdb and sdc. Also assume there are two tasks
T1 and T2 in two groups doing IO on lv0. Also assume that weights of groups
are in the ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
         T1  T2
           \ /
           lv0
          / | \
       sda sdb sdc
Now suppose IO control is done at the lv0 level while T1 is doing IO only to
sda and T2's IO is going to sdc. In this case there is no need for resource
management, as the two IO streams don't contend where it matters. If we
try to do IO control at the lv0 device, it will not be an optimal usage of
resources and will bring down overall throughput.
IMHO, IO scheduler based IO controller is a reasonable approach to solve the
problem of group bandwidth control, and can do hierarchical IO scheduling
more tightly and efficiently. But I am all ears to alternative approaches and
suggestions on how things can be done better.
TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.
Open Issues
===========
- Currently for async requests like buffered writes, we get the io group
information from the page instead of the task context. How important is it
to determine the context from the page?
Can we put all the pdflush threads into a separate group and control system
wide buffered write bandwidth? Any buffered writes submitted by the process
directly will anyway go to the right group.
If it is acceptable then we can drop all the code associated with async io
context and that should simplify the patchset a lot.
Testing
=======
I have divided testing results in three sections.
- Latency
- Throughput and Fairness
- Group Fairness
Because I have enhanced CFQ to also do group scheduling, one of the concerns
has been that existing CFQ should not regress at least in flat setup. If
one creates groups and puts tasks in those, then this is new environment and
some properties can change because groups have this additional requirement
of providing isolation also.
Environment
==========
A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
Latency Testing
++++++++++++++++
Test1: fsync-test with torture test from linus as background writer
------------------------------------------------------------
I looked at the Ext3 fsync latency thread and picked fsync-test from Theodore
Ts'o and the torture test from Linus as background writer to see what the
fsync completion latencies look like. Following are the results.
Vanilla CFQ IOC IOC (with map async)
=========== ================= ====================
fsync time: 0.2515 fsync time: 0.8580 fsync time: 0.0531
fsync time: 0.1082 fsync time: 0.1408 fsync time: 0.8907
fsync time: 0.2106 fsync time: 0.3228 fsync time: 0.2709
fsync time: 0.2591 fsync time: 0.0978 fsync time: 0.3198
fsync time: 0.2776 fsync time: 0.3035 fsync time: 0.0886
fsync time: 0.2530 fsync time: 0.0903 fsync time: 0.3035
fsync time: 0.2271 fsync time: 0.2712 fsync time: 0.0961
fsync time: 0.1057 fsync time: 0.3357 fsync time: 0.1048
fsync time: 0.1699 fsync time: 0.3175 fsync time: 0.2582
fsync time: 0.1923 fsync time: 0.2964 fsync time: 0.0876
fsync time: 0.1805 fsync time: 0.0971 fsync time: 0.2546
fsync time: 0.2944 fsync time: 0.2728 fsync time: 0.3059
fsync time: 0.1420 fsync time: 0.1079 fsync time: 0.2973
fsync time: 0.2650 fsync time: 0.3103 fsync time: 0.2032
fsync time: 0.1581 fsync time: 0.1987 fsync time: 0.2926
fsync time: 0.2656 fsync time: 0.3048 fsync time: 0.1934
fsync time: 0.2666 fsync time: 0.3092 fsync time: 0.2954
fsync time: 0.1272 fsync time: 0.0165 fsync time: 0.2952
fsync time: 0.2655 fsync time: 0.2827 fsync time: 0.2394
fsync time: 0.0147 fsync time: 0.0068 fsync time: 0.0454
fsync time: 0.2296 fsync time: 0.2923 fsync time: 0.2936
fsync time: 0.0069 fsync time: 0.3021 fsync time: 0.0397
fsync time: 0.2668 fsync time: 0.1032 fsync time: 0.2762
fsync time: 0.1932 fsync time: 0.0962 fsync time: 0.2946
fsync time: 0.1895 fsync time: 0.3545 fsync time: 0.0774
fsync time: 0.2577 fsync time: 0.2406 fsync time: 0.3027
fsync time: 0.4935 fsync time: 0.7193 fsync time: 0.2984
fsync time: 0.2804 fsync time: 0.3251 fsync time: 0.1057
fsync time: 0.2685 fsync time: 0.1001 fsync time: 0.3145
fsync time: 0.1946 fsync time: 0.2525 fsync time: 0.2992
IOC--> With IO controller patches applied. CONFIG_TRACK_ASYNC_CONTEXT=n
IOC(map async) --> IO controller patches with CONFIG_TRACK_ASYNC_CONTEXT=y
If CONFIG_TRACK_ASYNC_CONTEXT=y, async requests are mapped to the group based
on cgroup info stored in the page; otherwise they are mapped to the cgroup
the submitting task belongs to.
Notes:
- It looks like the max fsync time is a bit higher with the IO controller
patches. Will dig more into it later.
Test2: read small files with multiple sequential readers (10) running
=====================================================================
Took Ingo's small file reader test and ran it while 10 sequential readers
were running.
Vanilla CFQ IOC (flat) IOC (10 readers in 10 groups)
0.12 seconds 0.11 seconds 1.62 seconds
0.05 seconds 0.05 seconds 1.18 seconds
0.05 seconds 0.05 seconds 1.17 seconds
0.03 seconds 0.04 seconds 1.18 seconds
1.15 seconds 1.17 seconds 1.29 seconds
1.18 seconds 1.16 seconds 1.17 seconds
1.17 seconds 1.16 seconds 1.17 seconds
1.18 seconds 1.15 seconds 1.28 seconds
1.17 seconds 1.15 seconds 1.17 seconds
1.16 seconds 1.18 seconds 1.18 seconds
1.15 seconds 1.15 seconds 1.17 seconds
1.17 seconds 1.15 seconds 1.18 seconds
1.17 seconds 1.15 seconds 1.17 seconds
1.17 seconds 1.16 seconds 1.18 seconds
1.17 seconds 1.15 seconds 1.17 seconds
0.04 seconds 0.04 seconds 1.18 seconds
1.17 seconds 1.16 seconds 1.17 seconds
1.18 seconds 1.15 seconds 1.17 seconds
1.18 seconds 1.15 seconds 1.28 seconds
1.18 seconds 1.15 seconds 1.18 seconds
1.17 seconds 1.16 seconds 1.18 seconds
1.17 seconds 1.18 seconds 1.17 seconds
1.17 seconds 1.15 seconds 1.17 seconds
1.16 seconds 1.16 seconds 1.17 seconds
1.17 seconds 1.15 seconds 1.17 seconds
1.16 seconds 1.15 seconds 1.17 seconds
1.15 seconds 1.15 seconds 1.18 seconds
1.18 seconds 1.16 seconds 1.17 seconds
1.16 seconds 1.16 seconds 1.17 seconds
1.17 seconds 1.16 seconds 1.17 seconds
1.16 seconds 1.16 seconds 1.17 seconds
In the third column, 10 readers have been put into 10 groups instead of
running in the root group. The small file reader runs in the root group.
Notes: It looks like read latencies here remain the same as with vanilla CFQ.
Test3: read small files with multiple writers (8) running
=========================================================
Again running small file reader test with 8 buffered writers running with
prio 0 to 7.
Latency results are in seconds. Tried to capture the output with multiple
configurations of IO controller to see the effect.
Columns, left to right:
1: Vanilla
2: IOC (flat)
3: IOC (groups) (f=0)
4: IOC (groups) (f=1)
5: IOC (map) (flat)
6: IOC (map) (groups) (f=0)
7: IOC (map) (groups) (f=1)
0.25 0.03 0.31 0.25 0.29 1.25 0.39
0.27 0.28 0.28 0.30 0.41 0.90 0.80
0.25 0.24 0.23 0.37 0.27 1.17 0.24
0.14 0.14 0.14 0.13 0.15 0.10 1.11
0.14 0.16 0.13 0.16 0.15 0.06 0.58
0.16 0.11 0.15 0.12 0.19 0.05 0.14
0.03 0.17 0.12 0.17 0.04 0.12 0.12
0.13 0.13 0.13 0.14 0.03 0.05 0.05
0.18 0.13 0.17 0.09 0.09 0.05 0.07
0.11 0.18 0.16 0.18 0.14 0.05 0.12
0.28 0.14 0.15 0.15 0.13 0.02 0.04
0.16 0.14 0.14 0.12 0.15 0.00 0.13
0.14 0.13 0.14 0.13 0.13 0.02 0.02
0.13 0.11 0.12 0.14 0.15 0.06 0.01
0.27 0.28 0.32 0.24 0.25 0.01 0.01
0.14 0.15 0.18 0.15 0.13 0.06 0.02
0.15 0.13 0.13 0.13 0.13 0.00 0.04
0.15 0.13 0.15 0.14 0.15 0.01 0.05
0.11 0.17 0.15 0.13 0.13 0.02 0.00
0.17 0.13 0.17 0.12 0.18 0.39 0.01
0.18 0.16 0.14 0.16 0.14 0.89 0.47
0.13 0.13 0.14 0.04 0.12 0.64 0.78
0.16 0.15 0.19 0.11 0.16 0.67 1.17
0.04 0.12 0.14 0.04 0.18 0.67 0.63
0.03 0.13 0.17 0.11 0.15 0.61 0.69
0.15 0.16 0.13 0.14 0.13 0.77 0.66
0.12 0.12 0.15 0.11 0.13 0.92 0.73
0.15 0.12 0.15 0.16 0.13 0.70 0.73
0.11 0.13 0.15 0.10 0.18 0.73 0.82
0.16 0.19 0.15 0.16 0.14 0.71 0.74
0.28 0.05 0.26 0.22 0.17 2.91 0.79
0.13 0.05 0.14 0.14 0.14 0.44 0.65
0.16 0.22 0.18 0.13 0.26 0.31 0.65
0.10 0.13 0.12 0.11 0.16 0.25 0.66
0.13 0.14 0.16 0.15 0.12 0.17 0.76
0.19 0.11 0.12 0.14 0.17 0.20 0.71
0.16 0.15 0.14 0.15 0.11 0.19 0.68
0.13 0.13 0.13 0.13 0.16 0.04 0.78
0.14 0.16 0.15 0.17 0.15 1.20 0.80
0.17 0.13 0.14 0.18 0.14 0.76 0.63
f(0/1)--> refers to the "fairness" tunable. This is a new tunable that is part
of CFQ. If set, we wait for requests from one queue to finish before a new
queue is scheduled in.
group ---> writers are running into individual groups and not in root group.
map---> buffered writes are mapped to group using info stored in page.
Notes: Except for columns 6 and 7, where writers are in separate groups
and we are mapping their writes to the respective group, latencies seem to be
fine. I think the latencies are higher for the last two cases because
now the reader can't preempt the writer.
          root
        /  \   \   \
       R   G1  G2  G3
           |   |   |
           W   W   W
Test4: Random Reader test in presence of 4 sequential readers and 4 buffered
writers
============================================================================
Used fio this time to run one random reader and see how it fares in
the presence of 4 sequential readers and 4 writers.
I have just pasted the output of random reader from fio.
Vanilla Kernel, Three runs
--------------------------
read : io=20,512KiB, bw=349KiB/s, iops=10, runt= 60075msec
clat (usec): min=944, max=2,675K, avg=93715.04, stdev=305815.90
read : io=13,696KiB, bw=233KiB/s, iops=7, runt= 60035msec
clat (msec): min=2, max=1,812, avg=140.26, stdev=382.55
read : io=13,824KiB, bw=235KiB/s, iops=7, runt= 60185msec
clat (usec): min=766, max=2,025K, avg=139310.55, stdev=383647.54
IO controller kernel, Three runs
--------------------------------
read : io=10,304KiB, bw=175KiB/s, iops=5, runt= 60083msec
clat (msec): min=2, max=2,654, avg=186.59, stdev=524.08
read : io=10,176KiB, bw=173KiB/s, iops=5, runt= 60054msec
clat (usec): min=792, max=2,567K, avg=188841.70, stdev=517154.75
read : io=11,040KiB, bw=188KiB/s, iops=5, runt= 60003msec
clat (usec): min=779, max=2,625K, avg=173915.56, stdev=508118.60
Notes:
- Looks like vanilla CFQ gives a bit more disk access to random reader. Will
dig into it.
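To put a number on "a bit more", the bw= fields of the six runs above can be averaged. A small Python sketch follows; the helper names are hypothetical, and it only handles the KiB/s form that appears in these lines.

```python
import re

_BW_RE = re.compile(r"bw=([\d,]+)KiB/s")


def bw_kib_per_sec(fio_line):
    """Extract the bw= value (KiB/s) from one fio summary line."""
    match = _BW_RE.search(fio_line)
    if match is None:
        raise ValueError("no bw= field in: %r" % fio_line)
    # fio prints thousands separators, e.g. "16,818KiB/s".
    return int(match.group(1).replace(",", ""))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


vanilla = [
    "read : io=20,512KiB, bw=349KiB/s, iops=10, runt= 60075msec",
    "read : io=13,696KiB, bw=233KiB/s, iops=7, runt= 60035msec",
    "read : io=13,824KiB, bw=235KiB/s, iops=7, runt= 60185msec",
]
ioc = [
    "read : io=10,304KiB, bw=175KiB/s, iops=5, runt= 60083msec",
    "read : io=10,176KiB, bw=173KiB/s, iops=5, runt= 60054msec",
    "read : io=11,040KiB, bw=188KiB/s, iops=5, runt= 60003msec",
]

vanilla_mean = mean(bw_kib_per_sec(line) for line in vanilla)
ioc_mean = mean(bw_kib_per_sec(line) for line in ioc)
```

Over these three runs each, the random reader averages about 272 KiB/s on the vanilla kernel and about 179 KiB/s with the IO controller. Three runs is a small sample, so this is only a rough indication.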
Throughput and Fairness
+++++++++++++++++++++++
Test5: Bandwidth distribution between 4 sequential readers and 4 buffered
writers
==========================================================================
Used fio to launch 4 sequential readers and 4 buffered writers and watched
how BW is distributed.
Vanilla kernel, Three sets
--------------------------
read : io=962MiB, bw=16,818KiB/s, iops=513, runt= 60008msec
read : io=969MiB, bw=16,920KiB/s, iops=516, runt= 60077msec
read : io=978MiB, bw=17,063KiB/s, iops=520, runt= 60096msec
read : io=922MiB, bw=16,106KiB/s, iops=491, runt= 60057msec
write: io=235MiB, bw=4,099KiB/s, iops=125, runt= 60049msec
write: io=226MiB, bw=3,944KiB/s, iops=120, runt= 60049msec
write: io=215MiB, bw=3,747KiB/s, iops=114, runt= 60049msec
write: io=207MiB, bw=3,606KiB/s, iops=110, runt= 60049msec
READ: io=3,832MiB, aggrb=66,868KiB/s, minb=16,106KiB/s, maxb=17,063KiB/s,
mint=60008msec, maxt=60096msec
WRITE: io=882MiB, aggrb=15,398KiB/s, minb=3,606KiB/s, maxb=4,099KiB/s,
mint=60049msec, maxt=60049msec
read : io=1,002MiB, bw=17,513KiB/s, iops=534, runt= 60020msec
read : io=979MiB, bw=17,085KiB/s, iops=521, runt= 60080msec
read : io=953MiB, bw=16,637KiB/s, iops=507, runt= 60092msec
read : io=920MiB, bw=16,057KiB/s, iops=490, runt= 60108msec
write: io=215MiB, bw=3,560KiB/s, iops=108, runt= 63289msec
write: io=136MiB, bw=2,361KiB/s, iops=72, runt= 60502msec
write: io=127MiB, bw=2,101KiB/s, iops=64, runt= 63289msec
write: io=233MiB, bw=3,852KiB/s, iops=117, runt= 63289msec
READ: io=3,855MiB, aggrb=67,256KiB/s, minb=16,057KiB/s, maxb=17,513KiB/s,
mint=60020msec, maxt=60108msec
WRITE: io=711MiB, aggrb=11,771KiB/s, minb=2,101KiB/s, maxb=3,852KiB/s,
mint=60502msec, maxt=63289msec
read : io=985MiB, bw=17,179KiB/s, iops=524, runt= 60149msec
read : io=974MiB, bw=17,025KiB/s, iops=519, runt= 60002msec
read : io=962MiB, bw=16,772KiB/s, iops=511, runt= 60170msec
read : io=932MiB, bw=16,280KiB/s, iops=496, runt= 60057msec
write: io=177MiB, bw=2,933KiB/s, iops=89, runt= 63094msec
write: io=152MiB, bw=2,637KiB/s, iops=80, runt= 60323msec
write: io=240MiB, bw=3,983KiB/s, iops=121, runt= 63094msec
write: io=147MiB, bw=2,439KiB/s, iops=74, runt= 63094msec
READ: io=3,855MiB, aggrb=67,174KiB/s, minb=16,280KiB/s, maxb=17,179KiB/s,
mint=60002msec, maxt=60170msec
WRITE: io=715MiB, aggrb=11,877KiB/s, minb=2,439KiB/s, maxb=3,983KiB/s,
mint=60323msec, maxt=63094msec
IO controller kernel three sets
-------------------------------
read : io=944MiB, bw=16,483KiB/s, iops=503, runt= 60055msec
read : io=941MiB, bw=16,433KiB/s, iops=501, runt= 60073msec
read : io=900MiB, bw=15,713KiB/s, iops=479, runt= 60040msec
read : io=866MiB, bw=15,112KiB/s, iops=461, runt= 60086msec
write: io=244MiB, bw=4,262KiB/s, iops=130, runt= 60040msec
write: io=177MiB, bw=3,085KiB/s, iops=94, runt= 60042msec
write: io=158MiB, bw=2,758KiB/s, iops=84, runt= 60041msec
write: io=180MiB, bw=3,137KiB/s, iops=95, runt= 60040msec
READ: io=3,651MiB, aggrb=63,718KiB/s, minb=15,112KiB/s, maxb=16,483KiB/s,
mint=60040msec, maxt=60086msec
WRITE: io=758MiB, aggrb=13,243KiB/s, minb=2,758KiB/s, maxb=4,262KiB/s,
mint=60040msec, maxt=60042msec
read : io=960MiB, bw=16,734KiB/s, iops=510, runt= 60137msec
read : io=917MiB, bw=16,001KiB/s, iops=488, runt= 60122msec
read : io=897MiB, bw=15,683KiB/s, iops=478, runt= 60004msec
read : io=908MiB, bw=15,824KiB/s, iops=482, runt= 60149msec
write: io=209MiB, bw=3,563KiB/s, iops=108, runt= 61400msec
write: io=177MiB, bw=3,030KiB/s, iops=92, runt= 61400msec
write: io=200MiB, bw=3,409KiB/s, iops=104, runt= 61400msec
write: io=204MiB, bw=3,489KiB/s, iops=106, runt= 61400msec
READ: io=3,682MiB, aggrb=64,194KiB/s, minb=15,683KiB/s, maxb=16,734KiB/s,
mint=60004msec, maxt=60149msec
WRITE: io=790MiB, aggrb=13,492KiB/s, minb=3,030KiB/s, maxb=3,563KiB/s,
mint=61400msec, maxt=61400msec
read : io=968MiB, bw=16,867KiB/s, iops=514, runt= 60158msec
read : io=925MiB, bw=16,135KiB/s, iops=492, runt= 60142msec
read : io=875MiB, bw=15,286KiB/s, iops=466, runt= 60003msec
read : io=872MiB, bw=15,221KiB/s, iops=464, runt= 60049msec
write: io=213MiB, bw=3,720KiB/s, iops=113, runt= 60162msec
write: io=203MiB, bw=3,536KiB/s, iops=107, runt= 60163msec
write: io=208MiB, bw=3,620KiB/s, iops=110, runt= 60162msec
write: io=203MiB, bw=3,538KiB/s, iops=107, runt= 60163msec
READ: io=3,640MiB, aggrb=63,439KiB/s, minb=15,221KiB/s, maxb=16,867KiB/s,
mint=60003msec, maxt=60158msec
WRITE: io=827MiB, aggrb=14,415KiB/s, minb=3,536KiB/s, maxb=3,720KiB/s,
mint=60162msec, maxt=60163msec
Notes: It looks like vanilla CFQ favors readers a bit more over writers as
compared to io controller cfq. Will dig into it.
Test6: Bandwidth distribution between readers of diff prio
==========================================================
Using fio, ran 8 readers of prio 0 to 7 and let it run for 30 seconds and
watched for overall throughput and who got how much IO done.
Vanilla kernel, Three sets
---------------------------
read : io=454MiB, bw=15,865KiB/s, iops=484, runt= 30004msec
read : io=382MiB, bw=13,330KiB/s, iops=406, runt= 30086msec
read : io=325MiB, bw=11,330KiB/s, iops=345, runt= 30074msec
read : io=294MiB, bw=10,253KiB/s, iops=312, runt= 30062msec
read : io=238MiB, bw=8,321KiB/s, iops=253, runt= 30048msec
read : io=145MiB, bw=5,061KiB/s, iops=154, runt= 30032msec
read : io=99MiB, bw=3,456KiB/s, iops=105, runt= 30021msec
read : io=67,040KiB, bw=2,280KiB/s, iops=69, runt= 30108msec
READ: io=2,003MiB, aggrb=69,767KiB/s, minb=2,280KiB/s, maxb=15,865KiB/s,
mint=30004msec, maxt=30108msec
read : io=450MiB, bw=15,727KiB/s, iops=479, runt= 30001msec
read : io=371MiB, bw=12,966KiB/s, iops=395, runt= 30040msec
read : io=325MiB, bw=11,321KiB/s, iops=345, runt= 30099msec
read : io=296MiB, bw=10,332KiB/s, iops=315, runt= 30086msec
read : io=238MiB, bw=8,319KiB/s, iops=253, runt= 30056msec
read : io=152MiB, bw=5,290KiB/s, iops=161, runt= 30070msec
read : io=100MiB, bw=3,483KiB/s, iops=106, runt= 30020msec
read : io=68,832KiB, bw=2,340KiB/s, iops=71, runt= 30118msec
READ: io=2,000MiB, aggrb=69,631KiB/s, minb=2,340KiB/s, maxb=15,727KiB/s,
mint=30001msec, maxt=30118msec
read : io=450MiB, bw=15,691KiB/s, iops=478, runt= 30068msec
read : io=369MiB, bw=12,882KiB/s, iops=393, runt= 30032msec
read : io=364MiB, bw=12,732KiB/s, iops=388, runt= 30015msec
read : io=283MiB, bw=9,889KiB/s, iops=301, runt= 30002msec
read : io=228MiB, bw=7,935KiB/s, iops=242, runt= 30091msec
read : io=144MiB, bw=5,018KiB/s, iops=153, runt= 30103msec
read : io=97,760KiB, bw=3,327KiB/s, iops=101, runt= 30083msec
read : io=66,784KiB, bw=2,276KiB/s, iops=69, runt= 30046msec
READ: io=1,999MiB, aggrb=69,625KiB/s, minb=2,276KiB/s, maxb=15,691KiB/s,
mint=30002msec, maxt=30103msec
IO controller kernel, Three sets
--------------------------------
read : io=404MiB, bw=14,103KiB/s, iops=430, runt= 30072msec
read : io=344MiB, bw=11,999KiB/s, iops=366, runt= 30035msec
read : io=294MiB, bw=10,257KiB/s, iops=313, runt= 30052msec
read : io=254MiB, bw=8,888KiB/s, iops=271, runt= 30021msec
read : io=238MiB, bw=8,311KiB/s, iops=253, runt= 30086msec
read : io=177MiB, bw=6,202KiB/s, iops=189, runt= 30001msec
read : io=158MiB, bw=5,517KiB/s, iops=168, runt= 30118msec
read : io=99MiB, bw=3,464KiB/s, iops=105, runt= 30107msec
READ: io=1,971MiB, aggrb=68,604KiB/s, minb=3,464KiB/s, maxb=14,103KiB/s,
mint=30001msec, maxt=30118msec
read : io=375MiB, bw=13,066KiB/s, iops=398, runt= 30110msec
read : io=326MiB, bw=11,409KiB/s, iops=348, runt= 30003msec
read : io=308MiB, bw=10,758KiB/s, iops=328, runt= 30066msec
read : io=256MiB, bw=8,937KiB/s, iops=272, runt= 30091msec
read : io=232MiB, bw=8,088KiB/s, iops=246, runt= 30041msec
read : io=192MiB, bw=6,695KiB/s, iops=204, runt= 30077msec
read : io=144MiB, bw=5,014KiB/s, iops=153, runt= 30051msec
read : io=96,224KiB, bw=3,281KiB/s, iops=100, runt= 30026msec
READ: io=1,928MiB, aggrb=67,145KiB/s, minb=3,281KiB/s, maxb=13,066KiB/s,
mint=30003msec, maxt=30110msec
read : io=405MiB, bw=14,162KiB/s, iops=432, runt= 30021msec
read : io=354MiB, bw=12,386KiB/s, iops=378, runt= 30007msec
read : io=303MiB, bw=10,567KiB/s, iops=322, runt= 30062msec
read : io=261MiB, bw=9,126KiB/s, iops=278, runt= 30040msec
read : io=228MiB, bw=7,946KiB/s, iops=242, runt= 30048msec
read : io=178MiB, bw=6,222KiB/s, iops=189, runt= 30074msec
read : io=152MiB, bw=5,286KiB/s, iops=161, runt= 30093msec
read : io=99MiB, bw=3,446KiB/s, iops=105, runt= 30110msec
READ: io=1,981MiB, aggrb=68,996KiB/s, minb=3,446KiB/s, maxb=14,162KiB/s,
mint=30007msec, maxt=30110msec
Notes:
- It looks like overall throughput is 1-3% less in case of io controller.
- Bandwidth distribution between various prio levels has changed a bit. CFQ
seems to have a 100ms slice length for prio4, and this slice grows by 20%
for each step toward higher priority and shrinks by 20% for each step
toward lower priority. So the IO controller does not seem to be doing too
badly at meeting that distribution.
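The nominal distribution mentioned in the note can be written down and compared with the measurements. The sketch below reads the 20% step linearly, as a fixed 20 ms per prio level around the 100 ms base; that reading is an assumption, as is the assumption that fio lists the readers in prio order 0 to 7. The names are hypothetical.

```python
BASE_SLICE_MS = 100   # slice length for prio 4, per the note above
SLICE_SCALE = 5       # one step = base / 5 = 20% of the base slice


def prio_slice_ms(prio, base=BASE_SLICE_MS):
    """Nominal slice for ioprio 0 (highest) to 7 (lowest)."""
    return base + (base // SLICE_SCALE) * (4 - prio)


def expected_shares():
    slices = [prio_slice_ms(prio) for prio in range(8)]
    total = sum(slices)
    return [s / total for s in slices]


def measured_shares(bw_kib):
    total = sum(bw_kib)
    return [bw / total for bw in bw_kib]


def max_deviation(bw_kib):
    """Largest absolute gap between a measured and a nominal share."""
    pairs = zip(measured_shares(bw_kib), expected_shares())
    return max(abs(measured - nominal) for measured, nominal in pairs)


# Per-reader bandwidth in KiB/s, first set of each kernel above.
VANILLA_SET1 = [15865, 13330, 11330, 10253, 8321, 5061, 3456, 2280]
IOC_SET1 = [14103, 11999, 10257, 8888, 8311, 6202, 5517, 3464]
```

Under these assumptions the nominal slices run from 180 ms at prio 0 down to 40 ms at prio 7, and `max_deviation(IOC_SET1)` comes out smaller than `max_deviation(VANILLA_SET1)` for the first set.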
Group Fairness
+++++++++++++++
Test7 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on host in two partitions
and gave one partition to each virtual machine. Put both the virtual machines
in two different cgroups of weight 1000 and 500 respectively. Virtual machines
created ext3 file systems on the partitions exported from the host and did
buffered writes. The host sees these writes as synchronous, and the virtual
machine with higher weight gets double the disk time of the virtual machine
of lower weight. Used the deadline scheduler in this test case.
Some more details about configuration are in documentation patch.
Test8 (Fairness for synchronous reads)
======================================
- Two dd in two cgroups with cgroup weights 1000 and 500 (with CFQ scheduler
and /sys/block/<device>/queue/fairness = 1).
The higher weight dd finishes first, and at that point my script reads the
cgroup files io.disk_time and io.disk_sectors for both groups and displays
the results.
dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &
group1 time=8:16 2452 group1 sectors=8:16 457856
group2 time=8:16 1317 group2 sectors=8:16 247008
234179072 bytes (234 MB) copied, 3.90912 s, 59.9 MB/s
234179072 bytes (234 MB) copied, 5.15548 s, 45.4 MB/s
First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.
This patchset tries to provide fairness in terms of disk time received. group1
got almost double the disk time of group2 (at the time the first dd finished).
These time and sectors statistics can be read using the io.disk_time and
io.disk_sectors files in the cgroup. More about it in the documentation file.
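The ratio behind "almost double" can be computed directly from the statistics. This Python sketch assumes each statistic is a "MAJ:MIN VALUE" pair as displayed above; the exact file format is described in the documentation patch, and the function names are hypothetical.

```python
def parse_stat(entry):
    """Parse 'MAJ:MIN VALUE' into ((major, minor), value)."""
    dev, value = entry.split()
    major, minor = dev.split(":")
    return (int(major), int(minor)), int(value)


def stat_ratio(entry_a, entry_b):
    """Ratio of two statistics, refusing to compare different devices."""
    dev_a, value_a = parse_stat(entry_a)
    dev_b, value_b = parse_stat(entry_b)
    if dev_a != dev_b:
        raise ValueError("statistics refer to different devices")
    return value_a / value_b


time_ratio = stat_ratio("8:16 2452", "8:16 1317")
sector_ratio = stat_ratio("8:16 457856", "8:16 247008")
```

For the numbers above the disk time ratio is about 1.86 and the sector ratio about 1.85, against a configured weight ratio of 2.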
Test9 (Reader Vs Buffered Writes)
================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. IO controller can provide isolation between readers and
buffered (async) writers.
First I ran the test without io controller to see the severity of the issue.
Ran a hostile writer and then after 10 seconds started a reader and then
monitored the completion time of reader. Reader reads a 256 MB file. Tested
this with noop scheduler.
sample script
------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 \
conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &
Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)
Now it was time to test whether the io controller can provide isolation
between readers and writers with noop. I created two cgroups of weight 1000
each, put the reader in group1 and the writer in group2, and ran the test
again. Upon completion of the reader, my scripts read the io.disk_time and
io.disk_sectors cgroup files to get an estimate of how much disk time each
group got and how many sectors each group did IO for.
For more accurate accounting of disk time for buffered writes with queuing
hardware I had to set /sys/block/<disk>/queue/iosched/fairness to "1".
sample script
-------------
echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
sleep 10
echo noop > /sys/block/$BLOCKDEV/queue/scheduler
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
wait $!
# Some code for reading cgroup files upon completion of reader.
-------------------------
Results
=======
268435456 bytes (268 MB) copied, 6.92248 s, 38.8 MB/s
group1 time=8:16 3185 group1 sectors=8:16 524824
group2 time=8:16 3190 group2 sectors=8:16 503848
Note that the reader now finishes in much less time (6.9 seconds instead of
96.5 seconds), and group1 and group2 each got about 3.2 seconds of disk time.
Hence the io controller provides isolation from buffered writes.
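To put numbers on the two observations above, here is a small sketch that
computes the reader speedup and the share of disk time group1 received. The
helper names speedup and share are made up for illustration; the inputs are
the figures from the two result listings of this test.

```shell
#!/bin/sh
# speedup OLD_SECONDS NEW_SECONDS: how many times faster the reader finished.
speedup() {
    awk -v t_old="$1" -v t_new="$2" 'BEGIN { printf "%.1f\n", t_old / t_new }'
}

# share A B: percentage of the combined disk time that group A received.
share() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

speedup 96.5237 6.92248   # reader time without vs. with the io controller
share 3185 3190           # group1 vs. group2 disk time in ms
```

This prints 13.9 and 50.0: the reader completes about 14 times faster, and
the two equal-weight groups split the disk time evenly.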
Test10 (AIO)
===========
AIO reads
-----------
Set up two fio AIO read jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.
---------------------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
    --output=/mnt/$BLOCKDEV/fio1/test1.log \
    --exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &
echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
    --output=/mnt/$BLOCKDEV/fio2/test2.log &
----------------------------------------------------------------
test1 and test2 are two groups with weight 1000 and 500 respectively.
"read-and-display-group-stats.sh" is one small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got till first fio job finished.
Results
------
test1 statistics: time=8:16 17955 sectors=8:16 1049656 dq=8:16 2
test2 statistics: time=8:16 9217 sectors=8:16 602592 dq=8:16 1
The above shows that by the time the first fio job (higher weight) finished,
group test1 had got 17955 ms of disk time and group test2 had got 9217 ms of
disk time. The statistics for the number of sectors transferred are also
shown. Note that the disk time given to group test1 is almost double that of
group test2.
AIO writes
----------
Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.
------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
    --output=/mnt/$BLOCKDEV/fio1/test1.log \
    --exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &
echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
    --output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------
test1 and test2 are two groups with weight 1000 and 500 respectively.
"read-and-display-group-stats.sh" is one small script which reads the
test1 and test2 cgroup files to determine how much disk time each group
got till first fio job finished.
Following are the results.
test1 statistics: time=8:16 25452 sectors=8:16 1049664 dq=8:16 2
test2 statistics: time=8:16 12939 sectors=8:16 532184 dq=8:16 4
The above shows that by the time the first fio job (higher weight) finished,
group test1 had got almost double the disk time of group test2.
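The "almost double" observation in both AIO tests can be checked against the
configured weights with a few lines of shell. This is only a sketch: the
function name ratio_check is made up for illustration, and the inputs are the
weights and the disk time values (in ms) from the AIO read and AIO write
results above.

```shell
#!/bin/sh
# ratio_check W1 W2 T1 T2: print the configured weight ratio next to the
# measured disk-time ratio of the two groups.
ratio_check() {
    awk -v w1="$1" -v w2="$2" -v t1="$3" -v t2="$4" 'BEGIN {
        printf "weights %.2f  disk time %.2f\n", w1 / w2, t1 / t2
    }'
}

ratio_check 1000 500 17955 9217    # AIO reads
ratio_check 1000 500 25452 12939   # AIO writes
```

This prints a disk time ratio of 1.95 for reads and 1.97 for writes against a
weight ratio of 2.00.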
Test11 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky. The biggest reason is that async writes
are cached in higher layers (page cache), and possibly in the file system
layer as well (btrfs, xfs etc), and are not necessarily dispatched to lower
layers in a proportional manner.
For example, consider two dd threads reading /dev/zero as input and writing
huge files. Very soon we will cross vm_dirty_ratio and a dd thread will be
forced to write out some pages to disk before more pages can be dirtied. But
the dirty pages picked are not necessarily those of the same thread. Writeout
can very well pick the inode of the lower weight dd thread. So effectively the
higher weight dd is doing writeouts of the lower weight dd's pages and we
don't see service differentiation.
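One rough way to watch the dirty page build-up described above while the two
dd writers run is to sample the "Dirty:" line of /proc/meminfo. This is a
sketch only and not something I used in these tests. The function names
dirty_kb and watch_dirty are made up for illustration, and it shows the
system-wide dirty total, not a per-thread or per-cgroup breakdown.

```shell
#!/bin/sh
# dirty_kb [MEMINFO_FILE]: print the Dirty value (in kB) from a meminfo-style
# file. Defaults to /proc/meminfo.
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }' "${1:-/proc/meminfo}"
}

# watch_dirty COUNT [MEMINFO_FILE]: print COUNT samples, one per second.
watch_dirty() {
    n=$1
    while [ "$n" -gt 0 ]; do
        echo "dirty=$(dirty_kb "${2:-}") kB"
        n=$((n - 1))
        sleep 1
    done
}
```

For example, "watch_dirty 30" run alongside the writers prints 30 samples.
The configured threshold percentage can be read from
/proc/sys/vm/dirty_ratio for comparison.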
IOW, the core problem with buffered write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep its queue
continuously backlogged. In my testing, there are many 0.2 to 0.8 second
intervals where the higher weight queue is empty, and in that time the lower
weight queue gets a lot of work done, giving the impression that there was no
service differentiation.
In summary, from the IO controller point of view async write support is there.
But because the page cache has not been designed so that a higher prio/weight
writer can do more writeout than a lower prio/weight writer, getting service
differentiation is hard. It is visible in some cases and not in others.
Previous versions of the patches were posted here.
------------------------------------------------
(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204
Thanks
Vivek