[PATCH 0/5 v3] SYNC_NOIDLE preemption for ancestor cgroups

From: Jan Kara
Date: Tue Jan 12 2016 - 10:27:31 EST


Hello Jens,

This is v3 of the patch series.

Changes since v2:
* added acks from Tejun
* fixed some typos pointed out by Tejun
* fixed condition checking for too long think time in patch 1/5

Changes since v1:
* added missing export of cgroup_is_descendant()

If you find the series too hasty for the current merge window, just postpone it
to the next one.

---

Recently we have been debugging a regression of basically any IO workload
that appeared when systemd started enabling the blkio controller for user
sessions (due to the delegation feature). Using the blkio controller
certainly has its costs, but some of the hits seemed just too heavy: e.g.
dbench4 throughput dropped from ~150 MB/s to ~26 MB/s for ext4 with the
barrier=0 mount option on an ordinary SATA drive. The reason for the drop is
visible in the following blktrace:

0.000383426 5122 A WS 27691328 + 8 <- (259,851968) 21473600
0.000384039 5122 Q WS 27691328 + 8 [jbd2/sdb3-8]
0.000385944 5122 G WS 27691328 + 8 [jbd2/sdb3-8]
0.000386315 5122 P N [jbd2/sdb3-8]
...
0.000394031 5122 A WS 27691384 + 8 <- (259,851968) 21473656
0.000394210 5122 Q WS 27691384 + 8 [jbd2/sdb3-8]
0.000394569 5122 M WS 27691384 + 8 [jbd2/sdb3-8]
0.000395239 5122 I WS 27691328 + 64 [jbd2/sdb3-8]
0.000396572 0 m N cfq5122SN / insert_request
0.000397389 0 m N cfq5122SN / add_to_rr
0.000398458 5122 U N [jbd2/sdb3-8] 1

<<< Here we wait 7.5 ms for the idle timer on the dbench sync-noidle queue to fire

0.008001111 0 m N cfq idle timer fired
0.008003152 0 m N cfq5174SN /dbench slice expired t=0
0.008004871 0 m N /dbench served: vt=24796020 min_vt=24771438
0.008006508 0 m N cfq5174SN /dbench sl_used=2 disp=1 charge=2 iops=0 sect=24
0.008007509 0 m N cfq5174SN /dbench del_from_rr
0.008008197 0 m N /dbench del_from_rr group
0.008008771 0 m N cfq schedule dispatch
0.008013506 0 m N cfq workload slice:16
0.008014979 0 m N cfq5122SN / set_active wl_class:0 wl_type:1
0.008017229 0 m N cfq5122SN / fifo= (null)
0.008018149 0 m N cfq5122SN / dispatch_insert
0.008019863 0 m N cfq5122SN / dispatched a request
0.008020829 0 m N cfq5122SN / activate rq, drv=1
0.008021578 389 D WS 27691328 + 64 [kworker/5:1H]
0.008491262 0 C WS 27691328 + 64 [0]
0.008498654 0 m N cfq5122SN / complete rqnoidle 1
0.008500202 0 m N cfq5122SN / set_slice=19
0.008501797 0 m N cfq5122SN / arm_idle: 2 group_idle: 0
0.008502073 0 m N cfq schedule dispatch
0.008517281 5122 A WS 27691392 + 8 <- (259,851968) 21473664
0.008517627 5122 Q WS 27691392 + 8 [jbd2/sdb3-8]
0.008519126 5122 G WS 27691392 + 8 [jbd2/sdb3-8]
0.008519534 5122 I WS 27691392 + 8 [jbd2/sdb3-8]
0.008520560 0 m N cfq5122SN / insert_request
0.008521908 0 m N cfq5122SN / dispatch_insert
0.008522798 0 m N cfq5122SN / dispatched a request
0.008523558 0 m N cfq5122SN / activate rq, drv=1
0.008523841 5122 D WS 27691392 + 8 [jbd2/sdb3-8]
0.008718527 0 C WS 27691392 + 8 [0]
0.008721911 0 m N cfq5122SN / complete rqnoidle 1
0.008723186 0 m N cfq5122SN / arm_idle: 2 group_idle: 0
0.008723578 0 m N cfq schedule dispatch
0.009062333 5174 A WS 23276680 + 24 <- (259,851968) 17058952
0.009062950 5174 Q WS 23276680 + 24 [dbench4]
0.009065427 5174 G WS 23276680 + 24 [dbench4]
0.009065717 5174 P N [dbench4]
0.009067472 5174 I WS 23276680 + 24 [dbench4]
0.009069038 0 m N cfq5174SN /dbench insert_request
0.009069913 0 m N cfq5174SN /dbench add_to_rr
0.009071190 5174 U N [dbench4] 1

<<< Here we wait another 7 ms for the idle timer on the jbd2 sync-noidle queue to fire

0.016001504 0 m N cfq idle timer fired
0.016002924 0 m N cfq5122SN / slice expired t=0
0.016004424 0 m N / served: vt=24783779 min_vt=24771488
0.016005888 0 m N cfq5122SN / sl_used=2 disp=2 charge=2 iops=0 sect=72
0.016006635 0 m N cfq5122SN / del_from_rr
0.016007152 0 m N / del_from_rr group
0.016007613 0 m N cfq schedule dispatch
0.016014571 0 m N cfq workload slice:24
0.016015679 0 m N cfq5174SN /dbench set_active wl_class:0 wl_type:1
0.016016794 0 m N cfq5174SN /dbench fifo= (null)
0.016017652 0 m N cfq5174SN /dbench dispatch_insert
0.016018883 0 m N cfq5174SN /dbench dispatched a request
0.016019714 0 m N cfq5174SN /dbench activate rq, drv=1
0.016019973 382 D WS 23276680 + 24 [kworker/6:1H]
0.016347056 0 C WS 23276680 + 24 [0]
0.016357022 0 m N cfq5174SN /dbench complete rqnoidle 1
0.016358509 0 m N cfq5174SN /dbench set_slice=24
0.016360127 0 m N cfq5174SN /dbench arm_idle: 2 group_idle: 0
0.016360508 0 m N cfq schedule dispatch
...
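
To make the two stalls in the trace above easy to quantify, here is a small
userspace sketch (not part of the series) that reports how long the queue sat
idle before each "cfq idle timer fired" message. The function name
idle_gaps_ms() and the EXCERPT string are made up for illustration; the four
lines in EXCERPT are copied from the trace above, and the script assumes the
timestamp is the first whitespace-separated field of each line.

```python
def idle_gaps_ms(trace_text):
    """Return the wait, in ms, preceding each 'cfq idle timer fired' line.

    The wait is measured from the timestamp of the previous trace line.
    Lines that do not start with a timestamp ('...', '<<< ...') are skipped.
    """
    gaps = []
    prev_ts = None
    for line in trace_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        try:
            ts = float(fields[0])
        except ValueError:
            continue  # annotation or elision, not a trace event
        if "idle timer fired" in line and prev_ts is not None:
            gaps.append(round((ts - prev_ts) * 1000.0, 3))
        prev_ts = ts
    return gaps


# Four lines taken verbatim from the trace: the last event before each
# stall, followed by the idle timer firing.
EXCERPT = """\
0.000398458 5122 U N [jbd2/sdb3-8] 1
0.008001111 0 m N cfq idle timer fired
0.009071190 5174 U N [dbench4] 1
0.016001504 0 m N cfq idle timer fired
"""

print(idle_gaps_ms(EXCERPT))
```

On this excerpt the two gaps come out at roughly 7.6 ms and 6.9 ms, which
matches the waits annotated in the trace.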

When dbench isn't in a separate cgroup, the dbench and jbd2 sync-noidle queues
just freely preempt each other. When dbench is contained in a dedicated blkio
cgroup, preemption is not allowed and the throughput drops.

The idling happens because we want to provide separation of IO between
different blkio cgroups, and thus we idle to avoid starving a cgroup in which
a process is submitting only dependent IO. I am of the opinion that when an
ancestor cgroup would like to preempt a descendant cgroup, there is no strong
reason to provide the separation and we can save at least one of the idle
times (when switching from the dbench to the jbd2 thread). Thus the following
patch set, which improves the throughput of dbench4 from ~26 MB/s to ~48 MB/s.
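
The rule argued for above can be modelled in a few lines of userspace Python.
This is only a sketch of the idea, not the code from the patches: the Cgroup
class and the helpers is_proper_ancestor() and may_preempt() are hypothetical
names, and the real cfq_should_preempt() weighs other conditions that are not
modelled here.

```python
class Cgroup:
    """Toy cgroup node: just a name and a link to its parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def is_proper_ancestor(ancestor, cgrp):
    """True if 'ancestor' lies strictly above 'cgrp' in the hierarchy."""
    node = cgrp.parent
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False


def may_preempt(new_cg, active_cg):
    """Model of the proposed rule (assumption, simplified).

    Queues in the same cgroup may preempt each other as before; across
    cgroups, only a queue from an ancestor may preempt a descendant.
    """
    if new_cg is active_cg:
        return True
    return is_proper_ancestor(new_cg, active_cg)


root = Cgroup("/")                       # where jbd2 runs in the trace
dbench = Cgroup("/dbench", parent=root)  # dedicated dbench cgroup

print(may_preempt(root, dbench))   # jbd2 preempting dbench -> True
print(may_preempt(dbench, root))   # dbench preempting jbd2 -> False
```

In this model only the switch from dbench to jbd2 skips the idle wait; the
switch back from jbd2 to dbench still waits for the idle timer, which is why
only one of the two idle periods is saved.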

The first patch in the patch set is just an unrelated improvement where I've
spotted some asymmetry in how slice_idle and group_idle are handled. Patches 2
and 3 prepare cfq_should_preempt() to be able to work on service trees of
different cgroups, and patch 4 then adds the logic in cfq_should_preempt() to
allow preemption by an ancestor cgroup.

Comments welcome!

Honza