Re: [SCSI][REGRESSION][BISECTED] Disk errors loop forever in 2.6.29

From: James Bottomley
Date: Thu Feb 19 2009 - 13:41:38 EST


On Thu, 2009-02-19 at 08:52 -0800, Sitsofe Wheeler wrote:
> > From: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>
> >
> > On Thu, 19 Feb 2009, Sitsofe Wheeler wrote:
> >
> > > Hi,
> > >
> > > There appears to be a regression from 2.6.28 in how disk errors are
> > > handled in 2.6.29rc5 - rather than trying and eventually giving up, it
> > > appears to try (and report) forever.
> >
> > See this thread and patch:
> >
> > http://marc.info/?l=linux-kernel&m=123490148422684&w=2
>
> The patch there (actually I downloaded it from http://patchwork.kernel.org/patch/7989/ )
> did not make any difference. I fear my disk will soon have torn itself to bits but until then I
> can trigger the error at will so I can test any patches that are suggested...

Can you try this patch ... it was something I meant to get into 2.6.29
but forgot about. The key problem that you seem to be hitting is that
the requeue evades the timeout check. Moving the timeout check to block
should fix that.
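
Roughly, the check boils down to the following userspace sketch. The
struct and function names here are made up for illustration (they are
not kernel symbols); only the arithmetic mirrors the patch below, and
the comparison copies the wraparound-safe idea behind time_before().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few request fields the check reads. */
struct fake_request {
	unsigned long start_time;	/* tick count when the request was first issued */
	unsigned long timeout;		/* per-attempt timeout, in ticks */
	unsigned int retries;		/* number of retries allowed */
};

/* Wraparound-safe "a is before b", the same trick as time_before(). */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Return true when the request has been outstanding for longer than
 * (retries + 1) * timeout ticks, i.e. when a requeue should be turned
 * into a failure instead of another attempt.
 */
static bool requeue_budget_exceeded(const struct fake_request *rq,
				    unsigned long now)
{
	unsigned long wait_for = (rq->retries + 1) * rq->timeout;

	return tick_before(rq->start_time + wait_for, now);
}
```

Because the test runs on every requeue, it does not matter why the
command came back; any path that requeues goes through the same
deadline.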

James

---

From 5546538f37a1f4319ec4dbdb6f2e7261ce986e61 Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 16 Dec 2008 17:00:44 -0500
Subject: block: move SCSI timeout check into block

We can eliminate the SCSI command timed out check entirely if the block
layer does this for us. The way to do this in block is to check how
long the request has been outstanding whenever a requeue is requested
and end it if we've gone over (retries + 1) * timeout.

This will also eliminate many cases in SCSI where we evade the command
timeout for various reasons (like initial success converted to requeue).

Signed-off-by: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
---
block/blk-core.c | 10 +++++++++-
1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 29bcfac..3928ec8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -937,6 +937,8 @@ EXPORT_SYMBOL(blk_start_queueing);
  */
 void blk_requeue_request(struct request_queue *q, struct request *rq)
 {
+	unsigned long wait_for = (rq->retries + 1) * rq->timeout;
+
 	blk_delete_timer(rq);
 	blk_clear_rq_complete(rq);
 	trace_block_rq_requeue(q, rq);
@@ -944,7 +946,13 @@ void blk_requeue_request(struct request_queue *q, struct request *rq)
 	if (blk_rq_tagged(rq))
 		blk_queue_end_tag(q, rq);

-	elv_requeue_request(q, rq);
+	if (time_before(rq->start_time + wait_for, jiffies)) {
+		printk(KERN_ERR "%s: timing out command, waited %lus\n",
+		       rq->rq_disk ? rq->rq_disk->disk_name : "?",
+		       wait_for/HZ);
+		blk_end_request(rq, -EIO, blk_rq_bytes(rq));
+	} else
+		elv_requeue_request(q, rq);
 }
 EXPORT_SYMBOL(blk_requeue_request);

--
1.5.6.6


