Re: 2.4.26 SMP lockup problem

From: Norman Weathers
Date: Wed Jun 09 2004 - 08:24:51 EST



No, ACPI is disabled in the kernel on this build.
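For anyone who wants to confirm this kind of thing on their own build, here is a minimal sketch that reads kernel .config text and reports how an option is set. It is only an illustration: the helper names and the sample excerpt are hypothetical, and the CONFIG_ACPI line in the sample is an assumption, since the config posted in this thread was cut before that section. Quote markers are stripped first so that text pasted from an email also parses.

```python
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")
QUOTE_RE = re.compile(r"^(>\s?)+")  # leading "> > " email quote markers


def parse_kernel_config(text):
    """Map option name -> value; '# CONFIG_X is not set' maps to 'n'."""
    options = {}
    for raw in text.splitlines():
        line = QUOTE_RE.sub("", raw).strip()
        m = SET_RE.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def option_state(options, name):
    """Return the recorded value, or 'absent' if the option never appears."""
    return options.get(name, "absent")


# Hypothetical excerpt. The ACPI line is assumed, not taken from the post.
SAMPLE = """\
> > CONFIG_SMP=y
> > CONFIG_X86_IO_APIC=y
> > # CONFIG_ACPI is not set
"""

if __name__ == "__main__":
    opts = parse_kernel_config(SAMPLE)
    for name in ("CONFIG_ACPI", "CONFIG_SMP", "CONFIG_HOTPLUG"):
        print(name, option_state(opts, name))
```

An option reported as "absent" only means it was not in the text that was parsed, which is exactly the ambiguity a partial config leaves open.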

On Tuesday 08 June 2004 06:30 pm, Willy Tarreau wrote:
> Hi,
>
> Do you have ACPI enabled? I don't see it in your partial config. I believe
> the ACPI code was changed in 2.4.22.
>
> Regards,
> Willy
>
> On Tue, Jun 08, 2004 at 05:57:28PM -0500, Norman Weathers wrote:
> > Hello All.
> >
> > During an interesting round of kernel updates, I found a very interesting
> > problem. I have several "hundred" nodes in a cluster that I am currently
> > updating from kernel 2.4.21 to 2.4.26. These nodes are all running
> > RedHat 7.3 (old, I know, but this is the OS that our software currently
> > works on). During this round of updates, I have updated about 150 PIII
> > 800 MHz nodes, all of which are currently being used and work just fine
> > (1 GB Ram, e100 ethernet driver, IDE drives, fairly generic). Also, I
> > have a few PIII 1260 nodes (Tyan Motherboard, 2 GB Ram, e100 ethernet
> > driver, again, fairly generic) that have also been updated and run fine.
> > I have even started testing fairly new P4 3060 IBM blades. They also
> > seem to work just fine.
> >
> > Now to the problem. I have "several hundred" Tyan Thunder Motherboards
> > (older AMD 760MP chipset). I have rebooted ~ 200 of these nodes with the
> > new 2.4.26 kernel and about half of these nodes have suffered a hard
> > lockup during bootup. The lockup is hard enough that I cannot even issue
> > sys request keys over serial or at the local keyboard to cause them to
> > reboot or output a trace. These nodes have 2 GB of ram, dual 3Com 100 Mb
> > NICS, and IDE drives. Again, fairly generic for a cluster. I had a
> > vanilla + trond patched 2.4.21 kernel running on these boxes just fine.
> > (The new 2.4.26 kernel also has the trond patches for 2.4.26). Has
> > anyone seen this happen to them?
> >
> > Here is some info on the kernel config for the 2.4.26 kernel:
> >
> > #
> > # Automatically generated by make menuconfig: don't edit
> > #
> > CONFIG_X86=y
> > # CONFIG_SBUS is not set
> > CONFIG_UID16=y
> >
> > #
> > # Code maturity level options
> > #
> > CONFIG_EXPERIMENTAL=y
> >
> > #
> > # Loadable module support
> > #
> > CONFIG_MODULES=y
> > CONFIG_MODVERSIONS=y
> > CONFIG_KMOD=y
> >
> > #
> > # Processor type and features
> > #
> > # CONFIG_M386 is not set
> > # CONFIG_M486 is not set
> > # CONFIG_M586 is not set
> > # CONFIG_M586TSC is not set
> > # CONFIG_M586MMX is not set
> > # CONFIG_M686 is not set
> > CONFIG_MPENTIUMIII=y
> > # CONFIG_MPENTIUM4 is not set
> > # CONFIG_MK6 is not set
> > # CONFIG_MK7 is not set
> > # CONFIG_MK8 is not set
> > # CONFIG_MELAN is not set
> > # CONFIG_MCRUSOE is not set
> > # CONFIG_MWINCHIPC6 is not set
> > # CONFIG_MWINCHIP2 is not set
> > # CONFIG_MWINCHIP3D is not set
> > # CONFIG_MCYRIXIII is not set
> > # CONFIG_MVIAC3_2 is not set
> > CONFIG_X86_WP_WORKS_OK=y
> > CONFIG_X86_INVLPG=y
> > CONFIG_X86_CMPXCHG=y
> > CONFIG_X86_XADD=y
> > CONFIG_X86_BSWAP=y
> > CONFIG_X86_POPAD_OK=y
> > # CONFIG_RWSEM_GENERIC_SPINLOCK is not set
> > CONFIG_RWSEM_XCHGADD_ALGORITHM=y
> > CONFIG_X86_L1_CACHE_SHIFT=5
> > CONFIG_X86_HAS_TSC=y
> > CONFIG_X86_GOOD_APIC=y
> > CONFIG_X86_PGE=y
> > CONFIG_X86_USE_PPRO_CHECKSUM=y
> > CONFIG_X86_F00F_WORKS_OK=y
> > CONFIG_X86_MCE=y
> > # CONFIG_TOSHIBA is not set
> > # CONFIG_I8K is not set
> > CONFIG_MICROCODE=y
> > # CONFIG_X86_MSR is not set
> > # CONFIG_X86_CPUID is not set
> > # CONFIG_EDD is not set
> > # CONFIG_NOHIGHMEM is not set
> > # CONFIG_HIGHMEM4G is not set
> > CONFIG_HIGHMEM64G=y
> > CONFIG_HIGHMEM=y
> > CONFIG_X86_PAE=y
> > CONFIG_HIGHIO=y
> > # CONFIG_MATH_EMULATION is not set
> > CONFIG_MTRR=y
> > CONFIG_SMP=y
> > CONFIG_NR_CPUS=32
> > # CONFIG_X86_NUMA is not set
> > # CONFIG_X86_TSC_DISABLE is not set
> > CONFIG_X86_TSC=y
> > CONFIG_HAVE_DEC_LOCK=y
> >
> > #
> > # General setup
> > #
> > CONFIG_NET=y
> > CONFIG_X86_IO_APIC=y
> > CONFIG_X86_LOCAL_APIC=y
> > CONFIG_PCI=y
> > # CONFIG_PCI_GOBIOS is not set
> > # CONFIG_PCI_GODIRECT is not set
> > CONFIG_PCI_GOANY=y
> > CONFIG_PCI_BIOS=y
> > CONFIG_PCI_DIRECT=y
> > CONFIG_ISA=y
> > CONFIG_PCI_NAMES=y
> > # CONFIG_EISA is not set
> > # CONFIG_MCA is not set
> > CONFIG_HOTPLUG=y
> > ---- Rest cut -------
> >
> > I have the noapic option passed on the lilo boot prompt line, otherwise
> > we get the APIC error after about a month or two in service.
> >
> > We tried to make the kernel somewhat generic because we want this kernel
> > to boot on the largest hardware base possible. Is there something
> > obvious that I have missed? (I used these same options on the 2.4.21
> > kernel that we ran on all of the nodes, with the exception of the 64 GB
> > memory option.)
> >
> > Any help would be appreciated. If any dumps need to be made (or at least
> > attempted), I can do that, as I have about 200 nodes right now that are
> > candidates for testing.
> >
> > Please contact me at the email address listed below as I am not on the list.
> >
> >
> > Email: norman.r.weathers@xxxxxxxxxxxxxxxxxx
> >
> >
> > Thanks in advance.
> >
> > --
> >
> > Norman Weathers
> > SIP Linux Cluster
> > TCE UNIX
> > ConocoPhillips
> > Houston, TX
> >
> > Office: LO2003
> > Phone: ETN 639-2727
> > or (281) 293-2727
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> > in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
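
Since the quoted message says the 2.4.26 config reuses the 2.4.21 options apart from the 64 GB memory setting, one way to double-check that is to diff the two .config files option by option. The following is a rough sketch under stated assumptions, not a tested tool: the function names are made up for illustration, and both config fragments are hypothetical (in particular, the old kernel's highmem setting is a guess, not something stated in this thread).

```python
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def load_options(text):
    """Parse .config text into {option: value}; unset options become 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = SET_RE.match(line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        m = UNSET_RE.match(line)
        if m:
            options[m.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return sorted (option, old, new) tuples for every option that differs.

    An option missing from one side is reported as 'absent' on that side.
    """
    old, new = load_options(old_text), load_options(new_text)
    changes = []
    for name in sorted(set(old) | set(new)):
        before = old.get(name, "absent")
        after = new.get(name, "absent")
        if before != after:
            changes.append((name, before, after))
    return changes


# Hypothetical fragments standing in for the 2.4.21 and 2.4.26 configs.
OLD_CONFIG = """\
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_SMP=y
"""
NEW_CONFIG = """\
# CONFIG_HIGHMEM4G is not set
CONFIG_HIGHMEM64G=y
CONFIG_X86_PAE=y
CONFIG_SMP=y
"""

if __name__ == "__main__":
    for name, before, after in diff_configs(OLD_CONFIG, NEW_CONFIG):
        print(f"{name}: {before} -> {after}")
```

Anything that shows up in the diff besides the expected memory options would be a candidate to test on one of the affected nodes.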

--

Norman Weathers
SIP Linux Cluster
TCE UNIX
ConocoPhillips
Houston, TX

Office: LO2003
Phone: ETN 639-2727
or (281) 293-2727