2.4.18/2.4.20 filemap.c pmd bug (was Re: Problem with mm in 2.4.19 and 2.4.20)

From: Harald Welte
Date: Mon Aug 11 2003 - 02:44:15 EST


Przemys?aw Maciuszko wrote:

I have a problem with one news server (feeder) box running INN.
Under heavy load i get the following error on the console:

filemap.c:2084: bad pmd 2bc001e3

This showed few times during last few days and few times server 'hanged up'
after this.

I can confirm this problem. It happens on one of my newsservers as well,
currently at least once per day. It is a dual PIII 650MHz, 1GB RAM,
200GB spool (scsi hardware raid array attached to adaptec aic7xxx), six
seperate SCSI disks attached to a seperate aic7xxx controller for
overview, running inn-2.3.2.

We've tried RedHat kernels 2.4.18-3, 2.4.18-17.7, 2.4.20-19.7 and 2.4.20-19.7bigmem as well as a kernel.org 2.4.20 - all with the same problem.

After the filemap.c / pmd_ERROR() printk, the box either hangs (no further printout, not that often) or has a stack overflow (most of the time):

filemap.c:2258: bad pmd c0003000(00000000000001e3).
do_IRQ: stack overflow: -864
c0252845 fffffca0 206d6564 c2426000 00000000 c0117b20 c0101018 c024bd2c
c2426000 00000018 00000018 00000000 c0117b20 c0101018 c2426470 6f6e0018
40320018 ffffff00 c0117b43 00000010 00000202 7369636e 3e65642e 613c200a
Call Trace: [<c0117b20>] do_page_fault [kernel] 0x0 (0xc242634c))
[<c0117b20>] do_page_fault [kernel] 0x0 (0xc2426368))
[<c0117b43>] do_page_fault [kernel] 0x23 (0xc2426380))
[<c0117b20>] do_page_fault [kernel] 0x0 (0xc242645c))
[<c0108cc4>] error_code [kernel] 0x34 (0xc2426464))
[<c0117fc5>] do_page_fault [kernel] 0x4a5 (0xc2426498))
[<c0117b20>] do_page_fault [kernel] 0x0 (0xc2426574))
[<c0108cc4>] error_code [kernel] 0x34 (0xc242657c))
[<c0117fc5>] do_page_fault [kernel] 0x4a5 (0xc24265b0))
[<c0117b20>] do_page_fault [kernel] 0x0 (0xc242668c))
[<c0108cc4>] error_code [kernel] 0x34 (0xc2426694))
[<c0117fc5>] do_page_fault [kernel] 0x4a5 (0xc24266c8))
[<c0117b20>] do_page_fault [kernel] 0x0 (0xc24267a4))
[<c0108cc4>] error_code [kernel] 0x34 (0xc24267ac))

The messages are always preceded by a '(scsi0:A:0:0): Locking max tag count at 64' message. The scsi device number is changing, so it cannot be a single device

Anyone has an idea what can cause it?

Unfortunately I'm not very familiar with the linux MM subsystem. But since I consider this now as a confirmed bug, maybe some of the other lkml folks have an idea what might be going on.

I'm using Linux Debian on dual PIII 1.1Ghz, 1GB RAM, LVM version 1.0.6
Qlogic FC 2200F driver version 6.01

We don't use lvm, so the similarities seem to be: Dual PIII, SCSI, INN

--
- Harald Welte <laforge@xxxxxxxxxxxx> http://www.gnumonks.org/
============================================================================
Programming is like sex: One mistake and you have to support it your lifetime

Attachment: pgp00003.pgp
Description: PGP signature