Re: ext2fs corruption under heavy load?

Peter Rival (frival@zk3.dec.com)
Wed, 02 Jun 1999 12:54:20 -0400



I really hate replying to my own posts, but... With help from Henry
Hall, it appears that the blame for this has been taken away from ext2
and placed squarely on the shoulders of the QLogic drivers. The ISP1020
cards were particularly bad, so I replaced them with ISP1040B cards. I
am now seeing the same _SILENT_ corruption as before, only now it shows
up at a somewhat higher load (~80 users instead of 60).

The reason I emphasized the word "silent" is that there are no errors,
no warnings, and no SCSI complaints either on the console or in the
logs; files just end up "disappearing", at least until I can unmount
the disk and fsck the filesystem. Is this just an Alpha bug, or has
anyone else tried to really push their system like this (configuration
below) and gotten it to work? Or should I just chuck the QLogic cards
and go try to beg for some BusLogic/Symbios Logic cards?

- Pete (who's going to go back to testing the shared workloads to
relieve the stress ;)

Peter Rival wrote:

> Hi,
>
> I've been attempting to benchmark linux (2.2.9+AXP SMP patches) on
> a 2-CPU AS4100. Everything runs fine, and the numbers are quite good,
> until somewhere in the (simulated) 50-60 user range. I'm running AIM
> VII on a system with 26 disks attached (does this sound familiar
> again? ;) so while IO isn't the problem, we are still beating on the
> filesystem (fserver benchmark). Anyway, once I get into this user
> range, I will start seeing errors in this code path:
>
>     if ((fd = creat(flist[CREAT][index],
>                     S_IRWXU | S_IRWXG | S_IRWXO)) < 0) { /* try create */
>         perror("creat() in dsearch()");          /* handle error */
>         sprintf(errbuf,"dsearch():can't creat '%s'\n",
>                 flist[CREAT][index]);            /* build error message */
>         chdir(cwd);                              /* change directories */
>         cl_list(flist);                          /* clear list */
>         return(-1);                              /* return error */
>     }                                            /* end of error */
>     close(fd);                                   /* close the file */
>     if (unlink(flist[CREAT][index])) {           /* unlink it */
>         perror("unlink() in dsearch()");         /* handle error */
>         getcwd(ncwd,256);
>         chdir(cwd);                              /* change directories */
>         cl_list(flist);                          /* clear list */
>         return(-1);                              /* return error */
>     }                                            /* end of error */
>
>
> It complains about not being able to unlink a file with an "unlink()
> in dsearch(): No such file or directory". Well, that's true
> enough...the file doesn't actually exist. The strange part is that in
> some of the filesystems, I wind up with unattached inodes at the next
> fsck (one, every time...). As I have said, this has happened on
> almost all of the 26 work disks at one time or another, and on all 4
> of the SCSI controllers (three QLogic ISP 1020s and one 1040). I can
> reproduce this problem every time I run the benchmark.
>
> I have looked through the sys_creat (or really, sys_open) and
> sys_unlink paths and don't see anything that should be wrong with
> them. Does anyone have any ideas? The system is running a stock RH6.0
> install with the 2.2.9 kernel plus the AXP SMP patches that Richard
> Henderson posted.
>
> Thanks!
>
> - Pete
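
For anyone who wants to poke at this without setting up AIM VII, the
creat/close/unlink sequence quoted above can be approximated with a
small standalone loop. This is only a rough sketch, not code from the
benchmark: the function name creat_unlink_cycle, the scratch file name,
and the idea of counting failures are all made up for illustration, and
a single process looping like this may well not generate enough load to
trigger the problem.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of AIM VII): repeat the creat/close/unlink
 * sequence on one scratch file inside `dir` and return how many iterations
 * reported an error from creat() or unlink().
 */
int creat_unlink_cycle(const char *dir, int iterations)
{
    char path[4096];
    int failures = 0;

    /* Scratch name is an assumption; the pid keeps parallel runs apart. */
    snprintf(path, sizeof(path), "%s/cu_test.%ld", dir, (long)getpid());

    for (int i = 0; i < iterations; i++) {
        int fd = creat(path, S_IRWXU | S_IRWXG | S_IRWXO);
        if (fd < 0) {
            fprintf(stderr, "iter %d: creat: %s\n", i, strerror(errno));
            failures++;
            continue;           /* nothing to close or unlink */
        }
        close(fd);

        /* The reported symptom is ENOENT here despite a successful creat(). */
        if (unlink(path) != 0) {
            fprintf(stderr, "iter %d: unlink: %s\n", i, strerror(errno));
            failures++;
        }
    }
    return failures;
}
```

Calling it from main() with a directory on one of the work filesystems
and a large iteration count, in several processes at once, would be the
closest analogue to the benchmark; a nonzero return value means at least
one creat() or unlink() failed.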


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/