Re: 2.6.23.1: mdadm/raid5 hung/d-state

From: BERTRAND Joël
Date: Wed Nov 07 2007 - 11:49:53 EST


Chuck Ebbert wrote:
On 11/05/2007 03:36 AM, BERTRAND Joël wrote:
Neil Brown wrote:
On Sunday November 4, jpiszcz@xxxxxxxxxxxxxxx wrote:
# ps auxww | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 273 0.0 0.0 0 0 ? D Oct21 14:40 [pdflush]
root 274 0.0 0.0 0 0 ? D Oct21 13:00 [pdflush]

After several days/weeks, this is the second time this has happened:
while doing regular file I/O (decompressing a file), everything on
the device went into D-state.
At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that the patch applied badly and needed to be fixed with
the following patch (not in git yet).
Both have been sent to stable@ and should be in the queue for 2.6.23.2.
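
For reference, a quick way to confirm that at least the first change is
present in a given tree (this assumes a git checkout; it says nothing about
the follow-up patch, which is not in git yet):

        # Show the commit that was meant to fix the hang, and which files it touches:
        git show --stat 4ae3f847e49e3787eca91bced31f8fd328d50496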
My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long
time:

...
        spin_lock(&sh->lock);
        clear_bit(STRIPE_HANDLE, &sh->state);
        clear_bit(STRIPE_DELAYED, &sh->state);

        s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
        s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
        s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
        /* Now to look around and see what can be done */

        /* clean-up completed biofill operations */
        if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
        }

        rcu_read_lock();
        for (i = disks; i--; ) {
                mdk_rdev_t *rdev;
                struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.


Did that chunk starting with "clean-up completed biofill operations" end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.
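
One way to catch this before anything is modified is a dry run (the patch
file name below is only a placeholder):

        # GNU patch reports each hunk's landing line, plus any offset or
        # fuzz, without modifying the tree:
        patch -p1 --dry-run < raid5-biofill-fix.patch

        # After a real apply, confirm the chunk sits in the intended
        # function (grep -n gives the line number to check against):
        grep -n "clean-up completed biofill" drivers/md/raid5.c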

I always apply this kind of patch by hand, not with the patch command. The last patch sent here seems to fix this bug:

gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
      1464725632 blocks [2/1] [U_]
      [=====>...............] recovery = 27.1% (396992504/1464725632) finish=1040.3min speed=17104K/sec

Regards,

JKB