Re: OOM killer/load problems with RAID/LVM and AoE

From: Rik van Riel
Date: Thu Mar 16 2006 - 17:20:37 EST

On Mon, 13 Mar 2006, Joshua Kugler wrote:

> RAID or LVM problem? AoE drivers? Network driver badness (for both of them)?

You could simply be hitting a fundamental problem that's present
on most operating systems. It happens roughly like this:

1) free memory gets low, so kswapd starts evicting pages
2) in order to write pages out over the network, the kernel
needs to allocate memory for the network packets, their
headers, etc.
3) if kswapd writes out a bunch of pages at once, or simply
if memory was already low when we hit (1), there may not
be enough free memory left to receive either the ACK
packets from the NAS box confirming that the data arrived,
or the packets indicating that the data was written to
disk so the kernel can complete the IO. Until those
packets are received the IO cannot complete, so the pages
being written out are never freed.
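The sequence above can be sketched as a toy Python model. Everything here (the `Memory` class, `evict_batch`, and the one-page costs for building a packet and for receiving a completion) is a hypothetical simplification, not kernel code: a batch of pages is sent first, and each page is freed only once memory can be allocated to receive its completion.

```python
# Toy model of the network writeback deadlock. All names and costs are
# hypothetical; real allocations would block, here they just fail.

class Memory:
    def __init__(self, free_pages: int):
        self.free = free_pages

    def alloc(self, n: int) -> bool:
        """Allocate n pages if available; return False instead of blocking."""
        if self.free >= n:
            self.free -= n
            return True
        return False

    def release(self, n: int) -> None:
        self.free += n


def evict_batch(mem: Memory, batch: int, tx_cost: int = 1, rx_cost: int = 1) -> int:
    """Write out `batch` dirty pages over the network; return pages freed."""
    # Steps 1-2: kswapd sends a batch, allocating packet memory for each page.
    sent = 0
    for _ in range(batch):
        if not mem.alloc(tx_cost):
            break
        sent += 1

    # Step 3: each IO completes only if we can allocate memory to receive
    # the ACK/completion packet from the server.
    completed = 0
    for _ in range(sent):
        if not mem.alloc(rx_cost):
            break  # no memory to receive the ACK: a real kernel would hang here
        # Completion received: free the rx buffer, the tx buffer and the page.
        mem.release(rx_cost + tx_cost + 1)
        completed += 1
    return completed
```

For example, `evict_batch(Memory(4), 2)` completes both pages, while `evict_batch(Memory(4), 4)` spends all free memory on outgoing packets and completes none; in a real kernel the writeback would simply stall at that point rather than return.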

Locally attached disks do not have this problem because the
kernel keeps a number of reserved buffer heads around to get
us out of this deadlock.
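The effect of a reserve can be sketched with a toy allocator that holds back a few pages which only allocations needed to complete an in-flight IO may use. This is loosely in the spirit of the kernel's mempools, but `ReservedMemory`, `for_io_completion` and `evict_batch` are hypothetical names, and the sketch assumes the allocator can tell which allocations serve IO completion.

```python
# Toy allocator with an emergency reserve for IO completion.
# All names are hypothetical; this is not kernel code.

class ReservedMemory:
    def __init__(self, free_pages: int, reserve: int):
        self.free = free_pages
        self.reserve = reserve  # pages ordinary allocations may not touch

    def alloc(self, n: int, for_io_completion: bool = False) -> bool:
        # Ordinary allocations must leave the reserve intact; allocations
        # that let an in-flight IO complete may dip into it.
        floor = 0 if for_io_completion else self.reserve
        if self.free - n >= floor:
            self.free -= n
            return True
        return False

    def release(self, n: int) -> None:
        self.free += n


def evict_batch(mem: ReservedMemory, batch: int, tx_cost: int = 1, rx_cost: int = 1) -> int:
    """Write out `batch` dirty pages over the network; return pages freed."""
    sent = 0
    for _ in range(batch):
        if not mem.alloc(tx_cost):
            break  # ordinary memory exhausted: stop sending, reserve stays untouched
        sent += 1

    completed = 0
    for _ in range(sent):
        # Receiving the completion may use the reserve, so it can always proceed
        # as long as reserve >= rx_cost.
        if not mem.alloc(rx_cost, for_io_completion=True):
            break
        mem.release(rx_cost + tx_cost + 1)  # rx buffer, tx buffer and the page
        completed += 1
    return completed
```

With 4 free pages and a one-page reserve, a batch of 4 sends 3 pages and completes all 3, ending with 7 free pages; with `reserve=0` the same batch sends 4 and completes none.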

Networking will need something similar. Because this is
slowly turning into an FAQ, I've written down the problem
and a proposed solution:

