Re: Please don't beat me up (was Re: Bugs and wishes in memory management area)

Mark Hemment (markhe@nextd.demon.co.uk)
Thu, 28 Nov 1996 23:20:34 +0000 (GMT)


On Thu, 28 Nov 1996, Mike Jagdis wrote:
> Small allocations, yes. The problem with buddy is that to make, say,
> a block of 8 pages it needs to combine two neighbouring blocks of
> 4 pages and so on. This leads it to only allow certain alignments
> of blocks which reduces the possibilities as you go to larger sizes
> (and increases the wastage).

I think you're tackling the problem from the wrong end.

While better management of free pages is a good idea, what is really
needed is better reaping of allocated pages, with the ability to free a
page of a desired 'type'.

At present, when out (or almost out) of pages, Linux asks different
sub-systems to release a page (yes, mostly only a page at a time).
A sub-system normally (and rightly) gives up a page which it hasn't used
recently.
Unfortunately, when searching for a page to move into the free memory
pool, no weight is given to whether releasing that page will improve the
quality of the memory pool (e.g. by allowing coalescing).

How to do this?
Well, not with ptr indirection!!!
Instead, change the page-scanning code to walk along the page memory map
(mem_map). This ages the pages by examining the 'referenced' bit in the
pte's. Two major problems here:
1) Some archs don't have a h/w ref-bit
2) Finding the pte's that point to the physical page
Point 2) is not that difficult (it requires hanging a linked list of
structs off the page-map). This is _only_ slightly inefficient to manage
(and not that bad when you use my SLAB allocator...).
Point 1) requires software simulation of the ref-bit: unmap the page
from the address space of the process(es), and set the bit if a fault
occurs on the page. Now, for named pages (that is, pages which are
associated with a file), finding the page after it has been unmapped is
not difficult; look in the inode page cache, and if it's not there go to
the file system.
For both points, everything becomes more complicated for anonymous
pages (e.g. stack pages). For anon pages, we need another new structure
(or rather, several) hanging off the vm_area_struct which covers the
anon area.
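
To give a feel for point 2), here's a rough sketch of what I mean by
hanging a linked list of structs off the page-map. All the names here
(pte_chain, the page's age field, MAX_PAGE_AGE) are made up for
illustration; only the pte macros are the real ones:

    /* Purely illustrative: one node per pte that maps the physical
     * page, hanging off the mem_map entry.  The nodes would be
     * allocated from a SLAB cache, of course. */
    struct pte_chain {
        struct pte_chain *next;     /* next mapping of this page */
        pte_t            *ptep;     /* a pte mapping this page   */
    };

    /* One ageing pass over a page: test-and-clear the h/w ref-bit
     * in every pte that maps it, then adjust the page's age. */
    static void age_page(struct page *page)
    {
        struct pte_chain *pc;
        int referenced = 0;

        for (pc = page->pte_chain; pc; pc = pc->next) {
            if (pte_young(*pc->ptep)) {         /* ref-bit set?  */
                set_pte(pc->ptep, pte_mkold(*pc->ptep));
                referenced = 1;
            }
        }
        if (referenced)
            page->age = MAX_PAGE_AGE;           /* recently used */
        else if (page->age > 0)
            page->age--;                        /* one tick older */
    }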

The above (with a few extra additions, such as page-level locking (as
opposed to simply marking the vm_area_struct as locked), an I/O lock
count, etc) gives a good way of ageing pageable pages.

When the free memory pool falls below a low-water mark, we can scan for
pages that have reached an old age. These (of course) are the pages we
wish to reap. But they are not freed immediately. Instead, update the
necessary pte's (and anon structs), place the pages on a reclaim list,
and tell the paging-out daemon to start.
If a process suddenly decides it needs one of the pages we are reaping,
it can reclaim it from the list (provided it's still there) when it
page faults.
The paging-out daemon examines each page on the reclaim list, and
decides what to do with it: does it need writing to swap, or can it
simply be thrown away? The page is then added to the free memory pool.
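
Very roughly, in code (the reclaim-list helpers, unmap_page() and the
scan bounds are all made-up names; this is just the shape of it):

    /* Sketch only: queue old, unused pages for reaping. */
    static void reap_old_pages(void)
    {
        struct page *page;

        for (page = mem_map; page < mem_map + num_physpages; page++) {
            if (page->age != 0 || PageLocked(page))
                continue;                   /* young or busy: skip  */
            unmap_page(page);               /* fix pte's/anon structs */
            add_to_reclaim_list(page);      /* NOT freed yet        */
        }
        wake_up(&kswapd_wait);              /* start the paging-out daemon */
    }

    /* On a fault, a process may snatch its page straight back. */
    static struct page *reclaim_page(struct page *page)
    {
        if (on_reclaim_list(page)) {
            remove_from_reclaim_list(page);
            return page;                    /* cheap fault, no I/O  */
        }
        return NULL;                        /* gone; read from swap/fs */
    }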

OK, I've simplified the above quite a bit. But do you get the idea?
If you think you've seen the above before, you're correct. It's what
some other UNIX OSes do, and for good reasons.

So what are the benefits?
1) Pages to be freed/written to swap can be found quickly.
2) It's possible to make the scanning intelligent, so that it:
   i) Identifies physically contiguous pages (see the sketch
      after this list). This helps to keep the quality of the
      free memory pool high, and improves the performance of
      swap (when reading back, we read several pages back -
      locality of reference on a larger-than-page granularity).
   ii) Identifies DMA pages.
   iii) Identifies pages that allow coalescing in the free
      memory pool (actually not efficient to do, and maybe
      not worthwhile, but the ability is there for small
      memory systems).
3) Makes it possible to page out the page tables (and even
   the middle-page tables, for archs which have them, and
   the page directories as well).
   This can be important on v. small memory systems
   (correctly ageing these pages on some archs is inefficient).
4) Better ageing of pages, therefore better selection of the
   pages to reap.
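
On 2) i): since the scan walks mem_map in physical page order, spotting
runs of reapable pages costs almost nothing - a streak counter is
enough. Sketch only (reap_run() is made up):

    /* Count the current run of old, unlocked pages; when the run
     * reaches the wanted order, reap the whole block at once. */
    static int run_len;

    static void note_page(struct page *page, int order)
    {
        if (page->age == 0 && !PageLocked(page)) {
            if (++run_len == (1 << order)) {
                reap_run(page - (1 << order) + 1);  /* run start */
                run_len = 0;
            }
        } else
            run_len = 0;
    }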

The down-sides:
1) Several more dynamic structures, and time taken to manage them.
2) The better ageing adds an overhead.

> Note that I talked about page allocation above. For kmalloc it
> seems better to forget about coalescing fragments. The dominant
> kmalloc activity is at relatively small sizes so collecting fragments
> is a good thing (when we get a completely free page in the free
> lists we pull all the fragments and release the page - modulo a
> small delay to avoid thrashing). For larger requests we just grab
> more pages and peel off the fragment we want. With the lower size
> queues being kept stuffed wih fragments we will have a tendency
> to keep a few blocks on the active larger queues. It's like a
> cascade through a meat grind - pages go in the top, get fragmented
> as needed, and when the little bits drop out the bottom we just
> sweep them up and recycle them. It's pretty much self balancing.

See my slab allocator at http://www.nextd.demon.co.uk/
It only frees pages when asked to do so by the VM sub-system, and avoids
a lot of the coalescing problems. Plus it's L1-cache friendly, and
allows state to be maintained in free objects (structures). No doc
available (sorry), walk the code.
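
For a feel of the idea (this is just the shape of it - the exact
signatures are in the code, not here): a cache holds objects of one
type, a constructor runs when objects are first carved out of a page,
and a freed object goes back to its cache still initialised.

    struct foo {
        struct foo *queue;
        int         lock;
    };

    static struct kmem_cache *foo_cachep;

    static void foo_ctor(void *obj)
    {
        struct foo *f = obj;
        f->lock  = 0;               /* runs once per object,   */
        f->queue = NULL;            /* not once per allocation */
    }

    void foo_init(void)
    {
        foo_cachep = kmem_cache_create("foo", sizeof(struct foo),
                                       0, 0, foo_ctor);
    }

    struct foo *foo_get(void)
    {
        /* comes back already constructed - no re-init cost, and
         * probably still warm in the L1 cache */
        return kmem_cache_alloc(foo_cachep, GFP_KERNEL);
    }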

> Even single page requests fail eventually :-). Code should be prepared
> for things to fail. The likelyhood of something failing is inversely
> proportional to the cost of making it work under extreme conditions.
> As with everything we just have to do our best within the budget
> available :-).

Will always be a problem.
We can avoid the worst problems with user pages by using an 'eager'
allocation of swap space (the kernel _never_ over-commits its swap
space). It does get tricky when a swap device is removed (or shows a
bad block)....
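
By 'eager' I mean reserving the backing store when the page is handed
out, roughly like this (all names made up):

    /* Never hand out an anonymous page without a swap slot reserved
     * to back it; refuse the allocation instead. */
    static unsigned long swap_total;        /* slots on all swap devices */
    static unsigned long swap_reserved;     /* slots already promised    */

    int alloc_anon_page_eager(struct vm_area_struct *vma,
                              unsigned long addr)
    {
        if (swap_reserved >= swap_total)
            return -ENOMEM;                 /* refuse, don't over-commit */
        swap_reserved++;                    /* promise the slot now      */
        return map_zero_page(vma, addr);    /* made-up helper            */
    }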

If you want to save more pages, try unifying the inode-page-cache and
the buffer cache (this will only be an improvement for some mixtures of
apps, but is worthwhile doing).

To get _really_ dirty:
1) Mess with the inode-page-cache.
   It's possible to use pages in this cache as a fall-back
   for the free-lists. When a cached page has a ref count of
   only 1, perform 'conditional' coalescing with the free-lists.
   If it coalesces to a reasonably high order (say, that
   required by NFS), mark the page and make the coalesced block
   an 'honorary' member of the free memory-pool.
   Doesn't quite work this way (gets v. smelly), but it's close
   enough. Means adding a partial ordering to the free-lists
   (actually maintainable in constant time!).
2) Page named-pages out to swap (faster access time, as has already
   been mentioned in this thread).
3) Implement madvise() so that applications can tell the kernel
   how they are going to be using an address space (see the
   sketch after this list). This isn't that messy. If an app
   maps a file in, and tells the kernel it will be performing a
   sequential read of the data only once, the kernel knows not
   to cache the old pages - provided no one else has a different
   usage pattern for the file. Note, this can also help the
   implementation of page colouring - L2-cache friendly.
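
From the application side, madvise() would look like it does on other
UNIXes, e.g.:

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int process_file(const char *name)
    {
        struct stat st;
        char *p;
        int fd;

        fd = open(name, O_RDONLY);
        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }

        /* One sequential pass: the kernel can read ahead
         * aggressively and drop the pages behind us, rather than
         * letting them push out more useful cached pages. */
        madvise(p, st.st_size, MADV_SEQUENTIAL);

        /* ... read through p[0 .. st.st_size-1] once ... */

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }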

Regards,
markhe