RE: LOTS OF BAD STUFF in raid0: raid0145-19990824-2.2.11 is unstable

Roeland M.J. Meyer (rmeyer@mhsc.com)
Sun, 7 Nov 1999 09:16:45 -0800


Uh oh! This is beginning to smell like some sort of "race" condition.

> -----Original Message-----
> From: owner-linux-raid@vger.rutgers.edu
> [mailto:owner-linux-raid@vger.rutgers.edu]On Behalf Of Mike Black
> Sent: Sunday, November 07, 1999 4:53 AM
> To: P.A.M. van Dam; Florian Lohoff
> Cc: linux-raid@vger.rutgers.edu; linux-kernel@vger.rutgers.edu
> Subject: Re: LOTS OF BAD STUFF in raid0: raid0145-19990824-2.2.11 is
> unstable
>
>
> I had problems with the "beyond end of device" message quite regularly until
> I switched the device from non-RAID to RAID and replaced the CPU.
>
> I was able to reproduce this with an application that did extremely heavy
> I/O on a single SCSI disk (this was linear file access on about 20G of
> data). This was on a dual PIII-450. Once I switched to a 3x50G RAID device
> the problems went away (I even had one of the "beyond" messages while I was
> copying). However, I then started having lockups instead of "beyond" -- and
> I finally traced it down to one of the CPUs (turns out one of them was
> overheating). Since I replaced the PIII-450's with two new PIII-500's the
> system has been rock solid (I'm now running 2.2.13 with the 2.2.11
> RAID patch compiled on gcc 2.95-2).
>
> I have another box, a non-SMP (the kernel is SMP though) PII/233 with IDE,
> on which I saw the "beyond end of device" message once. So, I think that
> eliminates SCSI, RAID, and SMP as likely candidates.
>
> So, I think there really is a bug (or maybe just a behaviour) lurking in the
> I/O buffers that just gets exacerbated by other problems: heavy I/O, cache
> problems (like an overheated CPU), cables, etc.
>
> Maybe we could put a "marker" in the buffer_head at certain points to see if
> it gets walked on?
> ----- Original Message -----
> From: P.A.M. van Dam <nucleus@ramoth.xs4all.nl>
> To: Florian Lohoff <flo@rfc822.org>
> Cc: David Mansfield <david@cobite.com>; <linux-raid@vger.rutgers.edu>;
> <linux-kernel@vger.rutgers.edu>; Ingo Molnar
> <mingo@chiara.csoma.elte.hu>;
> Alan Cox <alan@lxorguk.ukuu.org.uk>
> Sent: Saturday, November 06, 1999 2:40 PM
> Subject: Re: LOTS OF BAD STUFF in raid0: raid0145-19990824-2.2.11 is
> unstable
>
>
> On Sat, Nov 06, 1999 at 03:44:47PM +0100, Florian Lohoff wrote:
> >
> > Easy setup
> >
> > raid0 of raid5s ...
> >
> > read:
> >
> > /dev/md0 raid0 /dev/md1 /dev/md2
> > /dev/md1 raid5 scsi disks (tried with 6 and 3)
> > /dev/md2 raid5 scsi disks (tried with 6 and 3)
> >
> > As soon as you initiate a "mke2fs /dev/md0" the machine
> > hangs with "Got md request, not good" (md.c -> do_md_request())
>
> Same here, but with Mauelshagen's LVM. An mkfs -t ext2 for the
> first time (after bootup) on a striped LV will trigger the "Got LVM
> request, not good" message (accompanied by some "trying to ...
> beyond end of device" messages). The second attempt will not generate
> these messages.
>
> I'm sorry, but I still strongly believe that this is neither hardware
> nor (only) RAID related. There must be some bug lurking that we have to
> stamp out before we can take 2.4 seriously.
>
>
> >
> > I reproduced this with different hardware (machines, harddrives, SCSI
> > controllers etc. - even on sparc ...)
>
> Ok. Is SMP the common denominator in your tests? Almost all reports
> are with SMP machines.
>
> >
> > From the manpage this is a feature which should work.
>
> Best regards,
>
> Pascal
>
> >
> > Flo
> > --
> > Florian Lohoff flo@rfc822.org +49-5241-470566
> > ... The failure can be random; however, when it does occur, it is
> > catastrophic and is repeatable ... Cisco Field Notice
> >
> >
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.rutgers.edu
> > Please read the FAQ at http://www.tux.org/lkml/
>
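
For what it's worth, Mike's "marker" idea quoted above could look roughly
like the following. This is only a userspace sketch under my own
assumptions: struct demo_buf, MARKER_MAGIC, marker_init(), marker_ok() and
request_in_range() are made-up names for illustration, and demo_buf is a
stand-in, not the kernel's real struct buffer_head. The range check is only
in the spirit of the "beyond end of device" message, not the actual kernel
test.

```c
#include <assert.h>
#include <stdint.h>

/* Arbitrary magic value (an assumption); any unlikely bit pattern will do. */
#define MARKER_MAGIC 0xB0FFE4EDu

/* Hypothetical stand-in for a buffer header, NOT the kernel's buffer_head. */
struct demo_buf {
    uint32_t marker_head;     /* guard word in front of the real fields */
    unsigned long blocknr;    /* block number the request refers to */
    unsigned long size;       /* size of the buffer in bytes */
    uint32_t marker_tail;     /* guard word behind the real fields */
};

/* Fill in the fields and set both guard words. */
static void marker_init(struct demo_buf *b, unsigned long blocknr,
                        unsigned long size)
{
    b->marker_head = MARKER_MAGIC;
    b->blocknr = blocknr;
    b->size = size;
    b->marker_tail = MARKER_MAGIC;
}

/* Returns 1 if both guard words are intact, 0 if something walked on them. */
static int marker_ok(const struct demo_buf *b)
{
    return b->marker_head == MARKER_MAGIC && b->marker_tail == MARKER_MAGIC;
}

/* Returns 1 if the block number lies inside a device of dev_blocks blocks. */
static int request_in_range(const struct demo_buf *b, unsigned long dev_blocks)
{
    return b->blocknr < dev_blocks;
}
```

Calling marker_ok() at several points along the request path would narrow
down where a header gets overwritten: if the markers are still intact but
request_in_range() fails, the bad block number was probably computed that
way rather than scribbled over by someone else.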
