Re: Hard Hang with __alloc_pages: 0-order allocation failed (gfp=0x20/1)- Not out of memory

From: Doug Dumitru
Date: Wed May 26 2004 - 13:55:21 EST


Marcelo Tosatti wrote:

On Tue, May 25, 2004 at 04:12:12PM -0700, David S. Miller wrote:

On Tue, 25 May 2004 15:26:30 -0700
Doug Dumitru <doug@xxxxxxxxxx> wrote:


This is the original trap dump from an __alloc_pages error

__alloc_pages: 0-order allocation failed (gfp=0x20/1)

0x20 means GFP_ATOMIC which means it's fine to fail
and e1000 is doing nothing wrong. GFP_ATOMIC in interrupts
is a fine condition.
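
To make the gfp=0x20 reading concrete, here is a minimal sketch that decodes a gfp mask into flag names. The bit values in the table are an assumption taken from a 2.4-series include/linux/mm.h and should be checked against the tree actually in use; decode_gfp and can_sleep are hypothetical helper names, not kernel functions.

```python
# Assumed 2.4-series flag bits (verify against include/linux/mm.h in your tree).
GFP_FLAGS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_HIGHIO",
    0x100: "__GFP_FS",
}


def decode_gfp(mask):
    """Return the names of the known flag bits set in mask, lowest bit first."""
    names = []
    known = 0
    for bit in sorted(GFP_FLAGS):
        known |= bit
        if mask & bit:
            names.append(GFP_FLAGS[bit])
    leftover = mask & ~known
    if leftover:
        # Bits this table does not know about are reported as raw hex.
        names.append(hex(leftover))
    return names


def can_sleep(mask):
    """An allocation may block only if the (assumed) __GFP_WAIT bit is set."""
    return bool(mask & 0x10)


print(decode_gfp(0x20))
print(can_sleep(0x20))
```

Under these assumed values, 0x20 carries only __GFP_HIGH and lacks __GFP_WAIT, which matches the description above: the allocator cannot sleep to reclaim memory in interrupt context, so it is allowed to fail and the caller must cope.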


Yeap, but the crash is not a fine condition... I suspect
what can be happening is extreme gigabit traffic resulting in memory shortage.

Doug said the load average is really high. Doug, you're not
using NAPI, right? Can you try it?

Prior to the __alloc_pages hang, the loadavg shoots way up, so something is spinning, but it is hard to tell what. This has persisted for as long as 8-10 minutes on one hang, although it is usually shorter (1-2 minutes). One of my concerns is that the e1000 issue might only be a symptom of the page tables getting clobbered by something else. I have been trying to get the system to hang during more "controlled" usage, but have been unable to. I have even run tsl (telnet scripting language) scripts to log on 250 processes and beat the CPU and disk up, creating and destroying processes along the way. I was able to drive loadavg > 50 and LowFree to < 5000K, but could never create a hang. I suspect that I might need truly "random" inbound traffic to find the bug (but this is a guess).
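
Since the interesting window is the minute or so of rising loadavg before the hang, one option is to log loadavg and LowFree once a second and read the tail of the log after the reboot. The following is a minimal sketch, assuming /proc/loadavg and a LowFree line in /proc/meminfo are available; the function names and the lowfree.log path are hypothetical.

```python
import os
import time


def parse_meminfo(text):
    """Return {field: value} for lines shaped like 'LowFree:   4820 kB'."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip header lines and anything without a leading number.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg contents."""
    return float(text.split()[0])


def sample():
    """Read the current 1-minute loadavg and LowFree (None if absent)."""
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    return load, mem.get("LowFree")


def watch(interval=1.0, log_path="lowfree.log"):
    """Append one line per interval; fsync so the tail survives a hard hang."""
    with open(log_path, "a") as log:
        while True:
            load, low_free = sample()
            log.write("%d load=%.2f LowFree=%s kB\n"
                      % (time.time(), load, low_free))
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval)
```

Calling watch() starts the loop. The fsync on every line costs some disk I/O, but without it the last few samples before a hard hang would likely be lost in the page cache.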

In terms of network traffic, the system is busy, but not obnoxiously so. The load on the server is primarily terminal traffic from about 200 "real humans", so there are a lot of small, random packets. In terms of network bandwidth, it is not all that bad, maybe 2-3 megabits (a guess). The arp table is reasonably big (> 200 entries) but this is not that bad either. I have not looked for arp storms or other network anomalies on the LAN. The system is on a local LAN and gets no internet traffic.

I am unfamiliar with NAPI, so I have not tried it.

On another topic, I am trying to build a 2.4.26 kernel that reserves more LowFree. The mm/page_alloc.c file describes a boot parameter called "lower_zone_reserve" that should tune this. Unfortunately, this parameter seems to be read after the zone tables are initialized (which is probably a bug).

--

--------------------------------------------------------------------
Doug Dumitru 800-470-2756 (610-237-2000)
EasyCo LLC doug@xxxxxxxxxx http://easyco.com
--------------------------------------------------------------------