Re: [OOPS] 2.6.21-rc6-git5 in cfq_dispatch_insert

From: Jens Axboe
Date: Wed Apr 25 2007 - 05:48:38 EST

On Wed, Apr 25 2007, Neil Brown wrote:
> On Wednesday April 25, jens.axboe@xxxxxxxxxx wrote:
> >
> > That's pretty close to where I think the problem is (the front merging
> > and cfq_reposition_rq_rb()). The issue with that is that you'd only get
> > aliases for O_DIRECT and/or raw IO, and that doesn't seem to be the case
> > here. Given that front merges are equally not very likely, I'd be
> > surprised is something like that has ever happened.
> Well it certainly doesn't happen very often....
> And I can imagine a filesystem genuinely wanting to read the same
> block twice - maybe a block that contained packed tails of two
> different files.

But that usually happens through the page cache, so it's nicely
serialized. You'd only ever get duplicate/alias blocks if you go raw, or
have some layer on top of your block device (like md) that issues io on
its own.

> > BUT... That may explain while we are only seeing it on md. Would md
> > ever be issuing such requests that trigger this condition?
> Can someone remind me which raid level(s) was/were involved?
> I think one was raid0 - that just passes requests on from the
> filesystem, so md would only issue requests like that if the
> filesystem did.

Yep, one was raid0.

> I guess it could happen with raid4/5/6. A read request that was
> properly aligned (and we do encourage proper alignment) will be passed
> directly to the underlying device. A concurrent write elsewhere could
> require the same block to be read into the stripe-cache to enable a
> parity calculation. So you could get two reads at the same block
> address.
> Getting a front-merge would probably require two stripe-heads to be
> processed in reverse-sector order, and it tends to preserve the order
> of incoming requests (though that isn't firmly enforced).
> raid1 is much like raid0 (with totally different code) in that the
> request pattern seen by the underlying device matches the request
> pattern generated by the filesystem.
> If I read the debugging output correctly, the request which I
> hypothesise was the subject of a front-merge is a 'sync' request.
> raid5 does not generate sync requests to fill the stripe cache (maybe
> it should?) so I really think it must have come directly from the
> filesystem.
> (just checked previous email for more detail of when it hits)
> The fact that it hits degraded arrays more easily is interesting.
> Maybe we try to read a block on the missing device and so schedule
> reads for the other devices. Then we try to read a block on a good
> device and issue a request for exactly the same block that raid5 asked
> for. That still doesn't explain the 'sync' and the 'front merge'.
> But that is quite possible, just not common maybe.
> It doesn't help us understand the raid0 example though. May it is
> just a 'can happen, but only rarely' thing.

It looks to be extremely rare. Aliases are extremely rare, front merges
are rare. And you need both to happen with the details you outlined. But
it's a large user base, and we've had 3-4 reports on this in the past
months. So it obviously does happen. I could not make it trigger without
doctoring the unplug code when I used aio.

Here's a fix for it, confirmed.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9e37971..f965be7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -532,6 +532,11 @@ static void cfq_add_rq_rb(struct request *rq)

if (!cfq_cfqq_on_rr(cfqq))
cfq_add_cfqq_rr(cfqd, cfqq);
+ /*
+ * check if this request is a better next-serve candidate
+ */
+ cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);

static inline void
@@ -1639,12 +1644,6 @@ cfq_rq_enqueued(struct cfq_data *cfqd, struct cfq_queue *cfqq,

- * check if this request is a better next-serve candidate)) {
- */
- cfqq->next_rq = cfq_choose_req(cfqd, cfqq->next_rq, rq);
- BUG_ON(!cfqq->next_rq);
- /*
* we never wait for an async request and we don't allow preemption
* of an async request. so just return early

Jens Axboe

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at