Re: Linux 2.4.25-rc1

From: Marcelo Tosatti
Date: Mon Mar 01 2004 - 10:22:37 EST

Hi David,

On Sun, 29 Feb 2004, David Luyer wrote:

> On Thu, Feb 05, 2004 at 10:44:31AM -0200, Marcelo Tosatti wrote:
> > Here goes the first release candidate.
> >
> > It contains mostly networking updates, XFS update, amongst others.
> >
> > This release contains a fix for excessive inode memory pressure with
> > highmem boxes. Help is especially wanted with testing this on heavy-load
> > highmem machines.
>
> How is this likely to manifest itself?

Basically, this modification makes the inode reclaiming code rip inodes
with highmem pagecache attached when it's necessary.

What happened before was that low memory could get filled with
unreclaimable inodes, which would screw up performance badly (and
probably crash the system in extreme situations).
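
To make the before/after difference concrete, here is a toy model of the
policy change described above. It is not kernel code: the Inode class, the
reclaimable() and prune() helpers, and the rule "reclaim only when every
attached page is in highmem" are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    # Hypothetical stand-in for an in-core inode; not the kernel structure.
    ino: int
    lowmem_pages: int = 0    # attached pagecache pages living in low memory
    highmem_pages: int = 0   # attached pagecache pages living in high memory
    in_use: bool = False     # still referenced, so never reclaimable


def reclaimable(inode, allow_highmem_pagecache):
    """Toy rule: an unused inode with no attached pages can always go.
    With the flag set, an unused inode whose attached pages are all in
    highmem can go too (assumed rule, for illustration only)."""
    if inode.in_use:
        return False
    if inode.lowmem_pages == 0 and inode.highmem_pages == 0:
        return True
    return allow_highmem_pagecache and inode.lowmem_pages == 0


def prune(inodes, goal, allow_highmem_pagecache):
    """Try to free up to `goal` inodes; return (kept, freed)."""
    kept, freed = [], []
    for inode in inodes:
        if len(freed) < goal and reclaimable(inode, allow_highmem_pagecache):
            freed.append(inode)
        else:
            kept.append(inode)
    return kept, freed


if __name__ == "__main__":
    cache = [
        Inode(1),
        Inode(2, highmem_pages=8),
        Inode(3, lowmem_pages=2),
        Inode(4, highmem_pages=4, in_use=True),
    ]
    for flag in (False, True):
        kept, freed = prune(cache, goal=10, allow_highmem_pagecache=flag)
        print(flag, [i.ino for i in freed], [i.ino for i in kept])
```

With the flag off, inode 2 stays pinned in low memory by its highmem
pagecache; with the flag on, it becomes a reclaim candidate. That is the
kind of inode that could previously pile up in low memory.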

> We just had a box which crashed only 2 hours after deployment, and
> reading over the recent changes this seems like a potential cause
> (although being new, faulty hardware is always a possibility); the
> last items on its serial console were:
>
> INIT: Sending processes the TERM signal
> memory.c:100: bad pmd 000001e3.
> memory.c

This looks like a hardware fault to me, or maybe (not sure) a badly
behaving driver. The inode-highmem modifications can't cause such
breakage, as far as I can see.
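
For anyone who wants to read the reported value, the low bits of
000001e3 can be decoded as x86 page-directory entry flags. This is a
minimal sketch, assuming the non-PAE two-level layout that HIGHMEM4G
implies; the bit positions are the standard x86 ones, while the
decode_entry() helper and the flag labels are made up for illustration.
It only shows what the entry contains, not why it ended up there.

```python
# Flag bits of a 32-bit, non-PAE x86 page-directory entry.
# The labels are illustrative names, not kernel identifiers.
FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",
    0x010: "PCD",
    0x020: "ACCESSED",
    0x040: "DIRTY",
    0x080: "PSE",      # large-page entry
    0x100: "GLOBAL",
}
ADDR_MASK = 0xFFFFF000  # upper 20 bits hold the frame address


def decode_entry(value):
    """Split an entry into (frame address, list of set flag names)."""
    names = [name for bit, name in FLAGS.items() if value & bit]
    return value & ADDR_MASK, names


if __name__ == "__main__":
    frame, names = decode_entry(0x000001E3)
    print(hex(frame), names)
```

Decoded this way, the value has the present, read/write, accessed, dirty,
large-page, and global bits set, with a frame address of zero.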

Rik, Andrew ?

> Was still responding to ICMP after crash.
>
> Details:
>
> * running 2.4.25 (release with small local patch to put MPT SCSI devices
> before Adaptec SCSI devices, as "scsihosts" cannot do this)
>
> * IBM x335
>
> * dual Xeon 3.066GHz (hyperthreads in 2.4.25)
>
> * 2.5Gb RAM (HIGHMEM4G)
>
> * high CPU load (two processes around 75% of a CPU each at time of crash,
> being bzip2 compression of gigabytes of data, and some other processes
> using somewhat less CPU for network and disk IO)
>
> * moderate IO load (3Mbps on tg3 ethernet, more than double that on each
> of MPT SCSI and AIC SCSI)
>
> * high inode / file descriptor load -- a single process may open hundreds
> of thousands of file descriptors over a 5 minute period and then close
> them all at once; file-max is set to 1024^2
>
> * newly deployed (ie no track record of stability to refer to)
>
> Role of system is basically to receive a constant 3Mbps stream of UDP data
> which is then written to the internal RAID array, this data is then read
> from the internal array and written to an external RAID array (in the process
> doubling in volume; each piece of data ends up in two places; and sometimes
> being read/written a few times before ending up in the right place) and
> ultimately compressed and archived.
>
> I've rebooted into 2.4.25 for a second chance but if it fails again,
> will reboot to 2.4.24 and then if that fails, revert to old hardware
> and kernel (which was running kernel 2.4.24 on an old Intel ISP2150).

OK, waiting for your input.
