[PATCH][RFC][12+2][v3] An expanded CFQ scheduler for cgroups

From: Satoshi UCHIDA
Date: Wed Nov 12 2008 - 03:19:51 EST



This patchset extends the traditional CFQ scheduler to support cgroups,
and improves on the previous version of the patchset.

The improvements are as follows.

* Modularizing our new CFQ scheduler.
The expanded CFQ scheduler is registered/unregistered as a new I/O
elevator scheduler called "cfq-cgroups". As a result, the traditional CFQ
scheduler, which does not handle cgroups, and our new CFQ scheduler, which
handles cgroups, can be used at the same time on different devices.

* Allowing parameters to be set per device.
The expanded CFQ scheduler allows users to set parameters per device.
As a result, users can decide the share (priority) per device.

--- Optional functions ---

* Adding a validation flag for 'think time'. (Opt-1 patch)
CFQ shows poor scalability. One of the causes is the think time.
The think time is used to improve I/O performance by handling queues
with poor I/O as IDLE class. However, when many tasks issue I/O requests,
the think time of those tasks becomes long and all queues end up being
handled as IDLE class. As a result, the dispatching of I/O requests is
dispersed and I/O performance falls. The think time valid flag controls
how the think time is judged.

* Adding ioprio class for cgroups. (Opt-2 patch)
The previous expanded CFQ scheduler could not implement ioprio classes.
This optional patch implements a prototype of them. It provides basic
service tree control for the ioprio class of cgroups, but it does not yet
provide the preempt function, the completed function and so on.



1. Introduction.

This patchset introduces "Yet Another" I/O bandwidth controlling
subsystem for cgroups based on CFQ (called 2 layer CFQ).

The idea of 2 layer CFQ is to build per-group fairness control on top of
the existing CFQ control.
We added a new data structure called CFQ driver data (cfqdd) on top of
cfqd in order to control I/O bandwidth for cgroups.
The CFQ driver data controls the cfq_datas using a service tree (rb-tree)
and applies the CFQ algorithm when the I/O is synchronous.
The active cfqd controls its cfq queues using a service tree.
In other words, the CFQ driver data controls the traditional CFQ data,
and the CFQ data runs conventionally.

          cfqdd      cfqdd   (cfqdd = cfq driver data)
           |          |
  cfqc -- cfqd ----- cfqd    (cfqd = cfq data,
           |          |       cfqc = cfq cgroup data)
  cfqc --[cfqd]----- cfqd
           ^
           |
  conventional control.

This patchset is against 2.6.28-rc2.


2. Build

i. Apply this patchset (series 01 - 12) to kernel 2.6.28-rc2.

If you want to use the optional functions, also apply the opt-1/opt-2
patches to kernel 2.6.28-rc2.

ii. Build the kernel with the IOSCHED_CFQ_CGROUP=y option.

iii. Reboot into the new kernel.


3. Usage of 2 layer CFQ

* Preparation for using 2 layer CFQ

i. Mount the cfq cgroup subsystem on a directory.
ex.
    mkdir /dev/cgroup
    mount -t cgroup -o cfq cfq /dev/cgroup

ii. Change the elevator scheduler of the device to "cfq-cgroups".
ex.
    echo cfq-cgroups > /sys/block/sda/queue/scheduler
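
The second preparation step can be guarded by a check that the running
kernel actually offers the new scheduler. The following is only a sketch:
the helper names has_scheduler and use_cfq_cgroups are hypothetical and not
part of this patchset. It relies on the fact that the sysfs scheduler file
lists the available schedulers, with the active one shown in brackets.

```shell
#!/bin/sh
# Sketch only: has_scheduler and use_cfq_cgroups are hypothetical helpers,
# not part of the patchset.

# Return 0 if scheduler NAME is listed in the given sysfs scheduler file.
# The file lists available schedulers; the active one is in [brackets].
has_scheduler() {
    sched_file=$1
    name=$2
    tr -d '[]' < "$sched_file" | tr ' ' '\n' | grep -qxF -- "$name"
}

# Switch device DEV to cfq-cgroups, but only if the kernel offers it.
# The optional second argument overrides the sysfs path (useful for testing).
use_cfq_cgroups() {
    dev=$1
    sched_file=${2:-/sys/block/$dev/queue/scheduler}
    if has_scheduler "$sched_file" cfq-cgroups; then
        echo cfq-cgroups > "$sched_file"
    else
        echo "cfq-cgroups is not available for $dev" >&2
        return 1
    fi
}
```

With these helpers, "use_cfq_cgroups sda" would replace the plain echo
above and fail with a message on a kernel built without the patchset.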


* Usage of grouping control.
- Create a new group.
Make a new directory under /dev/cgroup.
For example, the following command generates a 'test1' group.
mkdir /dev/cgroup/test1

- Insert a task into a group.
Write the process id (pid) to the "tasks" entry of the corresponding group.
For example, the following command puts the task with pid 1100 into the
'test1' group.
    echo 1100 > /dev/cgroup/test1/tasks

New child tasks of this task are also inserted into the 'test1' group.

- Change the I/O priority of a group.
Write the priority to the "cfq.ioprio" entry of the corresponding group.
For example, the following command sets the priority of the 'test1'
group to 2.

    echo 2 > /dev/cgroup/test1/cfq.ioprio

The I/O priority for cgroups takes a value from 0 to 7, the same as the
existing per-task CFQ priority.

If you want to change the I/O priority only for a specific device and group,
add the device name as a second parameter.
For example, the following command sets the priority of the 'test1'
group to 2 for the 'sda' device.

    echo 2 sda > /dev/cgroup/test1/cfq.ioprio


If you want to change the I/O priority of a specific device and group via
sysfs, add the cgroup path of the group as a second parameter.
For example, the following command sets the priority of the 'test1'
group to 2 for the 'sda' device via sysfs.

    echo 2 /test1 > /sys/block/sda/queue/iosched/ioprio

You can also change the parameters of cfq_data (slice_sync,
back_seek_penalty and so on) for a specific device and group.
If you write only one parameter via sysfs, the setting is applied to all
groups.

When you set the elevator scheduler of a device to cfq-cgroups, each group
gets a default I/O priority for the new device. If you want to change
this default priority, write the priority followed by "default" as the
second parameter to the "cfq.ioprio" entry of the corresponding group.
For example,

    echo 2 default > /dev/cgroup/test1/cfq.ioprio

- Change the I/O priority of a task.
Use the existing "ionice" command.
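
The group operations above can be wrapped in small shell helpers that
check the 0 to 7 priority range before writing. This is only a sketch: the
CGROOT variable and the function names are hypothetical, and it assumes the
cfq cgroup is mounted on /dev/cgroup as in the preparation step.

```shell
#!/bin/sh
# Sketch only: CGROOT and the function names below are hypothetical helpers,
# not part of the patchset. CGROOT is where the cfq cgroup is mounted.
CGROOT=${CGROOT:-/dev/cgroup}

# Create a group directory under the cgroup mount point.
create_group() {
    mkdir -p "$CGROOT/$1"
}

# Move the task with the given pid into a group.
add_task() {
    echo "$2" > "$CGROOT/$1/tasks"
}

# Set the priority (0-7) of a group. The optional third argument is a
# device name such as "sda", or the word "default".
set_group_ioprio() {
    group=$1
    prio=$2
    target=${3:-}
    case $prio in
        [0-7]) ;;
        *) echo "priority must be between 0 and 7: $prio" >&2
           return 1 ;;
    esac
    if [ -n "$target" ]; then
        echo "$prio $target" > "$CGROOT/$group/cfq.ioprio"
    else
        echo "$prio" > "$CGROOT/$group/cfq.ioprio"
    fi
}
```

For example, "set_group_ioprio test1 2 sda" writes "2 sda" to
/dev/cgroup/test1/cfq.ioprio, while "set_group_ioprio test1 8" is rejected
before anything is written.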


4. Usage of Optional Functions.

i. Usage of a validation flag for 'think time'

This parameter can be used via sysfs in the same way as the other cfq data
parameters. Its entry name is 'ttime_valid'.

This flag decides whether the think time is checked.
With the value 0, queues are always handled as idle class.
In practice, the idle_window flag is cleared.
With the value 1, queues are handled in the same way as in traditional CFQ.
With the value 2, the think time is treated as invalid.
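
Setting the flag could look like the following sketch. The helper name
set_ttime_valid is hypothetical, and the default path is an assumption: it
supposes that 'ttime_valid' appears under queue/iosched next to the other
cfq data parameters of the 'sda' device.

```shell
#!/bin/sh
# Sketch only: set_ttime_valid is a hypothetical helper. The default path
# assumes 'ttime_valid' sits next to the other iosched entries of sda.
set_ttime_valid() {
    value=$1
    file=${2:-/sys/block/sda/queue/iosched/ttime_valid}
    case $value in
        # 0: always idle class, 1: traditional CFQ, 2: think time invalid
        0|1|2) echo "$value" > "$file" ;;
        *) echo "ttime_valid must be 0, 1 or 2: $value" >&2
           return 1 ;;
    esac
}
```

Under those assumptions, "set_ttime_valid 2" would make the think time
invalid for 'sda', and any value other than 0, 1 or 2 is refused.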


ii. Usage of ioprio class for cgroups.

The ioprio class is used via cgroupfs in the same way as ioprio.
Its entry name is 'cfq.ioprio_class'.

The values of the ioprio class are the same as the I/O classes of
traditional CFQ.
0: IOPRIO_CLASS_NONE (is equal to IOPRIO_CLASS_BE)
1: IOPRIO_CLASS_RT
2: IOPRIO_CLASS_BE
3: IOPRIO_CLASS_IDLE
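
The numeric values above can be hidden behind short class names. This is
only a sketch: the helper names and the short names "none", "rt", "be" and
"idle" are chosen for illustration, and CGROOT is a hypothetical variable
for the cgroup mount point (/dev/cgroup in the examples above).

```shell
#!/bin/sh
# Sketch only: the helpers and the short class names are hypothetical.

# Map a short class name to the number expected by cfq.ioprio_class.
ioprio_class_number() {
    case $1 in
        none) echo 0 ;;   # IOPRIO_CLASS_NONE (equal to IOPRIO_CLASS_BE)
        rt)   echo 1 ;;   # IOPRIO_CLASS_RT
        be)   echo 2 ;;   # IOPRIO_CLASS_BE
        idle) echo 3 ;;   # IOPRIO_CLASS_IDLE
        *)    echo "unknown ioprio class: $1" >&2
              return 1 ;;
    esac
}

# Write the class of a group to its cfq.ioprio_class entry.
set_group_ioprio_class() {
    group=$1
    num=$(ioprio_class_number "$2") || return 1
    echo "$num" > "${CGROOT:-/dev/cgroup}/$group/cfq.ioprio_class"
}
```

For example, "set_group_ioprio_class test1 rt" writes 1 to
/dev/cgroup/test1/cfq.ioprio_class.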


5. Future work.
The following still needs to be implemented.
* Handle buffered I/O.
