Re: trying to understand READ_META, READ_SYNC, WRITE_SYNC & co

From: Corrado Zoccolo
Date: Tue Jun 29 2010 - 05:07:26 EST

On Sun, Jun 27, 2010 at 5:44 PM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> Vivek Goyal <vgoyal@xxxxxxxxxx> writes:
>> On Fri, Jun 25, 2010 at 01:03:20PM +0200, Christoph Hellwig wrote:
>>> On Wed, Jun 23, 2010 at 09:44:20PM -0400, Vivek Goyal wrote:
>>> I see the point of this logic for reads where various workloads have
>>> dependent reads that might be close to each other, but I don't really
>>> see any point for writes.
>>> > So looks like fsync path will do bunch of IO and then will wait for jbd thread
>>> > to finish the work. In this case idling is waste of time.
>>> Given that ->writepage already does WRITE_SYNC_PLUG I/O which includes
>>> REQ_NODILE I'm still confused why we still have that issue.
>> In current form, cfq honors REQ_NOIDLE conditionally and that's why we
>> still have the issue. If you look at cfq_completed_request(), we continue
>> to idle in following two cases.
>> - If we classifed the queue as SYNC_WORKLOAD.
>> - If there is another random read/write happening on sync-noidle service
>> Â tree.
>> SYNC_WORKLOAD means that cfq thinks this particular queue is doing sequential
>> IO. For random IO queues, we don't idle on each individual queue but a
>> group of queue.
>> In jeff's testing, fsync thread/queue sometimes is viewed as sequential
>> workload and goes on SYNC_WORKLOAD tree. In that case even if request is
>> REQ_NOIDLE, we will continue to idle hence fsync issue.
> I'm now testing OCFS2, and I'm seeing performance that is not great
> (even with the blk_yield patches applied). ÂWhat happens is that we
> successfully yield the queue to the journal thread, but then idle on the
> journal thread (even though RQ_NOIDLE was set).
> So, can we just get rid of idling when RQ_NOIDLE is set?
Hi Jeff,
I think I spotted a problem with the initial implementation of the
tree-wide idle when RQ_NOIDLE is set: I assumed that a queue would
either send possibly-idling requests or no-idle requests, but it seems
that RQ_NOIDLE is being used to mark the end of a stream of
possibly-idling requests (in my initial implementation, this will then
cause an unintended idle). The attached patch should fix it, and I
think the logic is simpler than Vivek's. Can you give it a spin?
Otherwise, I think that reverting the "noidle_tree_requires_idle"
behaviour completely may be better than adding complexity, since it is
really trying to solve corner cases (that maybe happen only on
synthetic workloads), but affecting negatively more common cases.

About what it is trying to solve, since I think it was not clear:
- we have a workload of 2 queues, both issuing requests that are being
put in the no-idle tree (e.g. they are random) + 1 queue issuing
idling requests (e.g. sequential).
- if one of the 2 "random" queues marks its requests as RQ_NOIDLE,
then the timeslice for the no-idle tree is not preserved, causing
unfairness, as soon as an RQ_NOIDLE request is serviced and the tree
is empty.


> Vivek sent me this patch to test, and it got rid of the performance
> issue for the fsync workload. ÂCan we discuss its merits?
> Thanks,
> Jeff
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c Â2010-06-25 15:57:33.832125786 -0400
> +++ linux-2.6/block/cfq-iosched.c    2010-06-25 15:59:19.788876361 -0400
> @@ -318,6 +318,7 @@
> Â Â Â ÂCFQ_CFQQ_FLAG_split_coop, Â Â Â /* shared cfqq will be splitted */
> Â Â Â ÂCFQ_CFQQ_FLAG_deep, Â Â Â Â Â Â /* sync cfqq experienced large depth */
> Â Â Â ÂCFQ_CFQQ_FLAG_wait_busy, Â Â Â Â/* Waiting for next request */
> + Â Â Â CFQ_CFQQ_FLAG_group_idle, Â Â Â /* This queue is doing group idle */
> Â};
> Â#define CFQ_CFQQ_FNS(name) Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â \
> @@ -347,6 +348,7 @@
> ÂCFQ_CFQQ_FNS(split_coop);
> ÂCFQ_CFQQ_FNS(deep);
> ÂCFQ_CFQQ_FNS(wait_busy);
> +CFQ_CFQQ_FNS(group_idle);
> Â#undef CFQ_CFQQ_FNS
> @@ -1613,6 +1615,7 @@
> Â Â Â Âcfq_clear_cfqq_wait_request(cfqq);
> Â Â Â Âcfq_clear_cfqq_wait_busy(cfqq);
> + Â Â Â cfq_clear_cfqq_group_idle(cfqq);
> Â Â Â Â/*
> Â Â Â Â * If this cfqq is shared between multiple processes, check to
> @@ -3176,6 +3179,13 @@
> Â Â Â Âif (cfq_class_rt(new_cfqq) && !cfq_class_rt(cfqq))
> Â Â Â Â Â Â Â Âreturn true;
> + Â Â Â /*
> + Â Â Â Â* If were doing group_idle and we got new request in same group,
> + Â Â Â Â* preempt the queue
> + Â Â Â Â*/
> + Â Â Â if (cfq_cfqq_group_idle(cfqq))
> + Â Â Â Â Â Â Â return true;
> +
> Â Â Â Âif (!cfqd->active_cic || !cfq_cfqq_wait_request(cfqq))
> Â Â Â Â Â Â Â Âreturn false;
> @@ -3271,6 +3281,7 @@
> Â Â Â Âstruct cfq_queue *cfqq = RQ_CFQQ(rq);
> Â Â Â Âcfq_log_cfqq(cfqd, cfqq, "insert_request");
> + Â Â Â cfq_clear_cfqq_group_idle(cfqq);
> Â Â Â Âcfq_init_prio_data(cfqq, RQ_CIC(rq)->ioc);
> Â Â Â Ârq_set_fifo_time(rq, jiffies + cfqd->cfq_fifo_expire[rq_is_sync(rq)]);
> @@ -3416,10 +3427,12 @@
> Â Â Â Â Â Â Â Â Â Â Â Â * SYNC_NOIDLE_WORKLOAD idles at the end of the tree
> Â Â Â Â Â Â Â Â Â Â Â Â * only if we processed at least one !rq_noidle request
> Â Â Â Â Â Â Â Â Â Â Â Â */
> - Â Â Â Â Â Â Â Â Â Â Â if (cfqd->serving_type == SYNC_WORKLOAD
> - Â Â Â Â Â Â Â Â Â Â Â Â Â || cfqd->noidle_tree_requires_idle
> - Â Â Â Â Â Â Â Â Â Â Â Â Â || cfqq->cfqg->nr_cfqq == 1)
> + Â Â Â Â Â Â Â Â Â Â Â if (cfqd->noidle_tree_requires_idle)
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â cfq_arm_slice_timer(cfqd);
> + Â Â Â Â Â Â Â Â Â Â Â else if (cfqq->cfqg->nr_cfqq == 1) {
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â cfq_mark_cfqq_group_idle(cfqq);
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âcfq_arm_slice_timer(cfqd);
> + Â Â Â Â Â Â Â Â Â Â Â }
> Â Â Â Â Â Â Â Â}
> Â Â Â Â}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at Â
> Please read the FAQ at Â


dott. Corrado Zoccolo mailto:czoccolo@xxxxxxxxx
PhD - Department of Computer Science - University of Pisa, Italy
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

From 30bdf702604f304607ac8a986d7d89285c7141fa Mon Sep 17 00:00:00 2001
From: Corrado Zoccolo <czoccolo@xxxxxxxxx>
Date: Tue, 29 Jun 2010 10:48:21 +0200
Subject: [PATCH] cfq-iosched: fix tree-wide handling of rq_noidle

Patch: 8e55063 cfq-iosched: fix corner cases in idling logic
introduced the possibility of iding even on no-idle requests
in the no_idle tree, if any previous request in the current slice
could idle. The implementation had a problem, though:
- if a queue sent several possibly idling requests and a noidle
request as last one in the same time slice, the tree was still
marked as idle.
This patch fixes this misbehaviour, by using a 31-bucket hash to
keep idling/non-idling status for queues in the noidle tree.

Signed-off-by: Corrado Zoccolo <czoccolo@xxxxxxxxx>
block/cfq-iosched.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d1ad066..5ef9a5d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -214,7 +214,7 @@ struct cfq_data {
enum wl_type_t serving_type;
unsigned long workload_expires;
struct cfq_group *serving_group;
- bool noidle_tree_requires_idle;
+ u32 noidle_tree_requires_idle;

* Each priority tree is sorted by next_request position. These
@@ -2066,7 +2066,7 @@ static void choose_service_tree(struct cfq_data *cfqd, struct cfq_group *cfqg)
slice = max_t(unsigned, slice, CFQ_MIN_TT);
cfq_log(cfqd, "workload slice:%d", slice);
cfqd->workload_expires = jiffies + slice;
- cfqd->noidle_tree_requires_idle = false;
+ cfqd->noidle_tree_requires_idle = 0U;

static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd)
@@ -3349,7 +3349,12 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
cfq_slice_expired(cfqd, 1);
else if (sync && cfqq_empty &&
!cfq_close_cooperator(cfqd, cfqq)) {
- cfqd->noidle_tree_requires_idle |= !rq_noidle(rq);
+ u32 bitmask = 1U << (((int)cfqq) % 31);
+ if (rq_noidle(rq))
+ cfqd->noidle_tree_requires_idle &= ~bitmask;
+ else
+ cfqd->noidle_tree_requires_idle |= bitmask;
* Idling is enabled for SYNC_WORKLOAD.
* SYNC_NOIDLE_WORKLOAD idles at the end of the tree