Re: Bug: App causes 2.0.34 kernel infinite loop in ext2

Berend Reitsma (berend@asset-control.com)
Thu, 26 Nov 1998 18:20:03 +0000 (WET)


On Wed, 25 Nov 1998, Don Bennett wrote:

> A heavily multithreaded application with lots of disk i/o causes a kernel
> infinite loop in the ext2 code.
>
> The application may run anywhere from an hour to a few days
> before the problem occurs.
>
>
> Base release: RedHat 5.1
> Kernel version: 2.0.34, 2.0.35
> libc version: 2.0.7
>
> A thread will unexpectedly start to use all available CPU cycles.
>
> Gdb is sometimes unable to attach to this thread.
>
[SNIP]
>
> I have also been able to catch it at lines 331 and 343 in truncate.c.
>
[SNIP]
>
> If you have any ideas on what else I can do to track down this
> problem, let me know.

I have some information that may be related to this problem.
We also have a multithreaded application that does a lot of I/O. At
the moment I have restricted the application to only two
worker threads. Using more threads is guaranteed to crash the server.

Interestingly, when the server does only reads and no writes, I am
unable to crash it. Admittedly this is not hard evidence, but the read-only
server has been running for more than half a year without the problems I
have seen when writing is involved.

Because the 2.0.x kernels have an additional problem with file
locking, I assumed everything was related to that.

A typical session is:
- mutex lock on file (between threads)
- open file (rw)
- lock file (read)
- read file
- upgrade lock on file (write)
- open file.bak (w)
- lock file.bak (write)
- write file.bak
- sync file.bak
- close file.bak
- rename file.bak file
- close file
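
For anyone who wants to reproduce this, the session above can be sketched
roughly as follows. This is only a minimal sketch under assumptions, not the
actual application code: the names rewrite_file, lock_whole_file and
file_mutex are hypothetical, the locks are assumed to be whole-file fcntl()
locks, "sync" is assumed to mean fsync(), and a single global mutex stands
in for the per-file mutex between threads.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical mutex; serializes the worker threads of one process. */
static pthread_mutex_t file_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Place (or convert) a whole-file fcntl lock; blocks until granted. */
static int lock_whole_file(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;          /* F_RDLCK or F_WRLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;              /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLKW, &fl);
}

/*
 * Hypothetical helper: read `path`, write `data` to `bak_path`, then
 * rename `bak_path` over `path`. Returns 0 on success, -1 on failure.
 */
int rewrite_file(const char *path, const char *bak_path,
                 const char *data, size_t len)
{
    char buf[4096];
    int fd = -1, bak = -1, rc = -1;

    pthread_mutex_lock(&file_mutex);                  /* mutex lock on file */

    fd = open(path, O_RDWR);                          /* open file (rw) */
    if (fd < 0) goto out;
    if (lock_whole_file(fd, F_RDLCK) < 0) goto out;   /* lock file (read) */
    if (read(fd, buf, sizeof buf) < 0) goto out;      /* read file */
    if (lock_whole_file(fd, F_WRLCK) < 0) goto out;   /* upgrade lock (write) */

    /* open file.bak (w); O_TRUNC truncates any leftover copy */
    bak = open(bak_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (bak < 0) goto out;
    if (lock_whole_file(bak, F_WRLCK) < 0) goto out;  /* lock file.bak (write) */
    if (write(bak, data, len) != (ssize_t)len) goto out; /* write file.bak */
    if (fsync(bak) < 0) goto out;                     /* sync file.bak */
    if (close(bak) < 0) { bak = -1; goto out; }       /* close file.bak */
    bak = -1;

    if (rename(bak_path, path) < 0) goto out;         /* rename file.bak file */
    rc = 0;
out:
    if (bak >= 0) close(bak);
    if (fd >= 0) close(fd);                           /* close file (drops lock) */
    pthread_mutex_unlock(&file_mutex);
    return rc;
}
```

Each worker thread would call something like rewrite_file("file",
"file.bak", data, len) in a loop. Note that the rename replaces the old
file, so the old inode is removed once the last descriptor on it is closed.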

It is very likely that truncating or removing the file is involved in the
problem as well.

I have not investigated this problem much because I had no real
need for more than 2 threads for writing.
At the moment I cannot do much testing because the machine is
now used as a sort of semi-production test environment. Testing the
application itself is more important to our company :-(

Please note that this application does work correctly on Solaris 2.6 with
over 100 worker-threads.

Regards,
Berend.

--

Berend Reitsma

Asset Control International | Phone: +31 (0)513 469100
P.O. Box 10                 | Fax:   +31 (0)513 461588
8408 ZH Lippenhuizen        | Email: berend@asset-control.com
The Netherlands             | Web:   www.asset-control.com
