Please don't beat me up (was Re: Bugs and wishes in memory management area)

Kevin Buhr (buhr@stat.wisc.edu)
26 Nov 1996 12:13:54 -0600


-----BEGIN PGP SIGNED MESSAGE-----

An amusing anecdote:

One day, right out of the blue, my poor little 8 meg machine went
loco. It began generating reams and reams of "Couldn't get a free
page" messages. I was away from the console, and it churned madly
away for several hours before I was able to power cycle it.

"Fortunately", I'd added the priority and size to the "Couldn't get a
free page" message in my kernel (2.0.13 vintage, I believe), and I
immediately realized that I was seeing request after request for a
2-page block at GFP_NFS priority. Eventually, I traced it back to
this culprit in "fs/nfs/proc.c":

static inline int *nfs_rpc_alloc(int size)
{
	int *i;

	while (!(i = (int *)kmalloc(size+NFS_SLACK_SPACE,GFP_NFS))) {
		schedule();
	}
	return i;
}

Get it? *All* my runnable processes wanted a 2-page block of memory
for nefarious NFS-read-related purposes, and the kernel was viciously
failing to grant any of them. Why, you ask? Well, observe the
following from the "mm/page_alloc.c" code:

	if ((priority==GFP_ATOMIC) || nr_free_pages > reserved_pages) {
		RMQUEUE(order, dma);
		restore_flags(flags);
		return 0;
	}
	restore_flags(flags);
	if (priority != GFP_BUFFER && try_to_free_page(priority, dma, 1))
		goto repeat;
	return 0;

Imagine that kernel memory is sufficiently fragmented that, even
though there are lots of pages available (say more than
"free_pages_high" which is 32 on my little box), there are no 2-page
blocks around. Note that, provided "nr_free_pages" is large enough,
we never even *get* to "try_to_free_page". Every multipage memory
request will be flatly refused until the memory becomes magically
"defragmented", which isn't always likely to happen, particularly if
"kswapd" doesn't run.

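To make the failure mode concrete, here is a minimal user-space
sketch (not kernel code). It models free pages as a boolean map and
mimics only the shape of the test quoted above: if the free-page
count exceeds the reserve, try to dequeue a block and return, whether
or not that worked. Every name in it (toy_alloc, toy_find_block, the
enum values) is hypothetical, the GFP_ATOMIC and GFP_BUFFER cases are
left out, and the aligned-block scan is a simplification of the buddy
free lists. With every other page free, a 1-page request succeeds
while a 2-page request fails without reclaim ever being considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model: free_map[i] is true when page i is free. */
enum toy_outcome { TOY_OK, TOY_FAIL_NO_RECLAIM, TOY_WOULD_RECLAIM };

/* Count the free pages, standing in for "nr_free_pages". */
static size_t toy_count_free(const bool *free_map, size_t n)
{
	size_t count = 0;

	assert(free_map != NULL);
	for (size_t i = 0; i < n; i++)
		if (free_map[i])
			count++;
	return count;
}

/*
 * Return the first page of an aligned run of (1 << order) free pages,
 * or -1 if no such run exists.  Alignment mimics buddy blocks.
 */
static long toy_find_block(const bool *free_map, size_t n, unsigned order)
{
	size_t len = (size_t)1 << order;

	for (size_t base = 0; base + len <= n; base += len) {
		size_t i = 0;

		while (i < len && free_map[base + i])
			i++;
		if (i == len)
			return (long)base;
	}
	return -1;
}

/*
 * Same shape as the quoted test: with enough free pages we only try
 * to dequeue and then return, so a fragmented map fails outright.
 * Reclaim is reached only when the free count is at or below the
 * reserve.
 */
static enum toy_outcome toy_alloc(const bool *free_map, size_t n,
				  unsigned order, size_t reserved)
{
	if (toy_count_free(free_map, n) > reserved) {
		if (toy_find_block(free_map, n, order) >= 0)
			return TOY_OK;
		return TOY_FAIL_NO_RECLAIM;
	}
	return TOY_WOULD_RECLAIM;
}
```
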
Since that horrible experience, I've hacked up my kernel so that
"kmalloc" retries any multipage request that is neither GFP_ATOMIC
nor GFP_BUFFER and that can't be satisfied from the "kmalloc" cache
(and so would otherwise result in a "Couldn't get a free page"
message). The retry runs at a magical "GFP_DEFRAG" priority that will
"try_to_free_page" anyway, no matter how many "nr_free_pages" there
may be.

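For illustration only, the retry idea can be modelled in user space
roughly as follows. This is not the actual patch: the toy_* names,
the TOY_* priorities and the "free the lowest-numbered in-use page"
reclaim step are all assumptions standing in for the real allocator
and for "try_to_free_page". The point is just the control flow: a
failed multipage request that is neither atomic nor buffer priority
gets one more pass at a priority that keeps reclaiming until a block
turns up or nothing is left to free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical priorities; TOY_DEFRAG plays the role of GFP_DEFRAG. */
enum toy_prio { TOY_ATOMIC, TOY_BUFFER, TOY_NFS, TOY_DEFRAG };

struct toy_mem {
	bool  *is_free;		/* is_free[i] is true when page i is free */
	size_t npages;
	size_t reserved;	/* stand-in for the reserved-page threshold */
};

static size_t toy_nr_free(const struct toy_mem *m)
{
	size_t count = 0;

	assert(m != NULL && m->is_free != NULL);
	for (size_t i = 0; i < m->npages; i++)
		if (m->is_free[i])
			count++;
	return count;
}

/* Take the first aligned free run of (1 << order) pages, or return -1. */
static long toy_rmqueue(struct toy_mem *m, unsigned order)
{
	size_t len = (size_t)1 << order;

	for (size_t base = 0; base + len <= m->npages; base += len) {
		size_t i = 0;

		while (i < len && m->is_free[base + i])
			i++;
		if (i == len) {
			for (i = 0; i < len; i++)
				m->is_free[base + i] = false;
			return (long)base;
		}
	}
	return -1;
}

/* Toy reclaim: free the lowest-numbered in-use page, if there is one. */
static int toy_try_to_free_page(struct toy_mem *m)
{
	for (size_t i = 0; i < m->npages; i++) {
		if (!m->is_free[i]) {
			m->is_free[i] = true;
			return 1;
		}
	}
	return 0;
}

static long toy_get_pages(struct toy_mem *m, enum toy_prio prio,
			  unsigned order)
{
	for (;;) {
		if (prio == TOY_ATOMIC || toy_nr_free(m) > m->reserved) {
			long base = toy_rmqueue(m, order);

			/* Only TOY_DEFRAG goes on to reclaim after a miss. */
			if (base >= 0 || prio != TOY_DEFRAG)
				return base;
		}
		if (prio == TOY_BUFFER || !toy_try_to_free_page(m))
			return -1;
	}
}

/* Retry failed multipage, non-atomic, non-buffer requests at TOY_DEFRAG. */
static long toy_alloc_retry(struct toy_mem *m, enum toy_prio prio,
			    unsigned order)
{
	long base = toy_get_pages(m, prio, order);

	if (base < 0 && order > 0 && prio != TOY_ATOMIC && prio != TOY_BUFFER)
		base = toy_get_pages(m, TOY_DEFRAG, order);
	return base;
}
```
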
My hack works like a dream: the only "Couldn't get a free page"
messages I get now are at GFP_ATOMIC priority (though, disturbingly
enough, they are multipage requests for 4388 bytes---anyone know what
these are?), and instead I get dozens of "Free list fragmented"
messages associated with 2-page NFS requests whenever the going gets
tough. On the other hand, I'm well aware that I've merely traded one
race condition for another: for one thing, "try_to_free_page" produces
blocks of consecutive free pages by accident, not on purpose.

* * *

I remember that a long time ago, some brave soul was criticizing the
buddy-system allocator; he or she wanted to replace it completely
with a kernel page table that could produce multipage blocks
automagically, thus eliminating the scourge of memory fragmentation
forever.

(Now you probably know why I chose the "Subject" line I did.)

If I remember correctly, the primary argument against this was the
performance penalty of invalidating the cache after every kernel
memory allocation. Besides which, it was pretty gross compared to the
superefficient buddy system.

Was this the only argument against the proposed scheme? Is it a bad
idea to have the kernel use a page-table system to back up the buddy
system? I'm thinking that a (non-GFP_DMA) multipage request that's in
danger of failing merely because of fragmentation would be satisfied
by RMQUEUEing a bunch of single pages and bundling them together in
the kernel page table up past the physical memory high-water mark. On
"free", the pages would be "unbundled" and returned to the queue.

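A rough user-space sketch of the bundling idea, under heavy
assumptions: the "page table" is just an array of virtual slots that
sit above a fixed physical high-water mark, and all the toy_* names
and sizes are made up for illustration. Scattered single pages are
taken one at a time and recorded in consecutive virtual slots, so the
caller sees a contiguous virtual range even though the physical pages
are not adjacent. Unbundling hands each page back and clears the
slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PHYS_PAGES 16	/* physical pages; also the high-water mark */
#define TOY_VIRT_SLOTS 8	/* virtual slots above the high-water mark */
#define TOY_UNMAPPED   (-1)

/* Hypothetical stand-in for a kernel page table plus the free queue. */
struct toy_kmap {
	bool phys_free[TOY_PHYS_PAGES];
	int  virt_to_phys[TOY_VIRT_SLOTS];
};

static void toy_kmap_init(struct toy_kmap *k)
{
	for (int p = 0; p < TOY_PHYS_PAGES; p++)
		k->phys_free[p] = true;
	for (int s = 0; s < TOY_VIRT_SLOTS; s++)
		k->virt_to_phys[s] = TOY_UNMAPPED;
}

/* Find `count` consecutive unmapped virtual slots; -1 if there are none. */
static int toy_find_virt_run(const struct toy_kmap *k, int count)
{
	int run = 0;

	for (int slot = 0; slot < TOY_VIRT_SLOTS; slot++) {
		run = (k->virt_to_phys[slot] == TOY_UNMAPPED) ? run + 1 : 0;
		if (run == count)
			return slot - count + 1;
	}
	return -1;
}

/*
 * Bundle `count` single free pages into one virtually contiguous
 * range.  Returns the first virtual page number (at or above
 * TOY_PHYS_PAGES), or -1 if pages or virtual slots run out.
 */
static int toy_bundle(struct toy_kmap *k, int count)
{
	int nfree = 0;
	int first, slot;

	if (count <= 0)
		return -1;
	for (int p = 0; p < TOY_PHYS_PAGES; p++)
		if (k->phys_free[p])
			nfree++;
	first = toy_find_virt_run(k, count);
	if (nfree < count || first < 0)
		return -1;

	slot = first;
	for (int p = 0; p < TOY_PHYS_PAGES && slot < first + count; p++) {
		if (!k->phys_free[p])
			continue;
		k->phys_free[p] = false;	/* take one single page */
		k->virt_to_phys[slot++] = p;	/* map it into the run */
	}
	return TOY_PHYS_PAGES + first;
}

/* Unbundle: return each page to the free pool and clear its slot. */
static void toy_unbundle(struct toy_kmap *k, int vpage, int count)
{
	int first = vpage - TOY_PHYS_PAGES;

	assert(first >= 0 && first + count <= TOY_VIRT_SLOTS);
	for (int slot = first; slot < first + count; slot++) {
		int p = k->virt_to_phys[slot];

		if (p == TOY_UNMAPPED)
			continue;
		k->phys_free[p] = true;
		k->virt_to_phys[slot] = TOY_UNMAPPED;
	}
}
```
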
I have to claim ignorance here: I don't know how many drivers would
break if there were a virtual/physical address dichotomy in
(non-GFP_DMA) kmalloced blocks. Perhaps we'd have to add a GFP_VIRTOK
flag or something.

Comments? Flames?

Kevin <buhr@stat.wisc.edu>

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3
Charset: noconv
Comment: Processed by Mailcrypt 3.4, an Emacs/PGP interface

iQBVAwUBMpszTomVIQW1OgXhAQG7NgH+JNIA0t6Aj+EkjJQKfEELOl53JnaKXpH+
jrDCBZSaODeBEW5GFEUqw9UVQ5r/fO10T+ywCBi+N6Aao6RqwHOt7Q==
=aTg6
-----END PGP SIGNATURE-----