This is _not_ a hack. The core of the problem is fragmentation in the
free-page pool. The odds of having two physically contiguous pages in the
pool are _much_ higher than the odds of having four contiguous pages.
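To put crude numbers on it (assuming, unrealistically, that each page in
the pool is free independently with probability 1/4): an aligned pair of
pages is entirely free with probability (1/4)^2, about 6%, while an
aligned group of four is entirely free with probability (1/4)^4, about
0.4% - sixteen times rarer, and real fragmentation patterns are usually
worse than independence suggests.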
Fixing the free-page pool fragmentation would be the ideal solution. To
do this cleanly, the whole Memory Management sub-system needs to be
rewritten so that all references to an allocated page can be found
quickly - this involves PTE chains, which are fat and heavy.
Change the value to 1. It's a good solution.
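For those editing the source rather than waiting for the boot-line option
mentioned below, the change amounts to a one-liner in mm/slab.c - the
exact context depends on your kernel version, but roughly:

	#define	SLAB_BREAK_GFP_ORDER	1	/* sketch only - the default in your tree is probably higher */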
> > This gives page-colouring (and a few other performance improvements)
> > coupled with a weak fragmentation control. I'm calling the control weak,
> > as I've ripped out all the heavy control stuff I was doing (well, was
> > doing this morning). It will also speed up most CPU/memory intensive
> > tasks.
>
> Does this fragmentation control actually defragment if needed? Not being
> able to allocate memory while enough free pages are present (though
> fragmented) looks like a deficiency to me. What about the following points?
> Forgive me if these are absolute nonsense or already implemented.
In the patch, the page allocator is a lazy-buddy. If the reserved number
of high orders drops below a watermark, the laziness is suspended and all
the locally free clusters are moved to be globally free and coalesced.
This is usually enough to push the level back above the watermark. If it
isn't, kswapd gently starts to reap pages, becoming more aggressive as
the level drops further. Hopefully, there are still enough high orders in
the free-pool to satisfy any requests while the engine is in
"coalesce-mode" - that depends upon the setting of the watermark and the
rate of high-order allocations.
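In rough, made-up code (this is only an illustration of the mechanism
described above, not the patch itself - all the names are invented):

	static void lazy_free_page(struct page *page)
	{
		/* Lazy path: park the page on a local list rather than
		 * coalescing it into the global buddy free lists now. */
		list_add(&page->list, &local_free_pages);
		nr_local_free++;

		if (nr_free_high_order() < high_order_watermark) {
			/* Laziness suspended: make every locally free page
			 * globally free, letting buddies coalesce. */
			flush_local_free_pages();

			/* Still below the mark?  Ask kswapd to start
			 * reaping, gently at first. */
			if (nr_free_high_order() < high_order_watermark)
				wake_up(&kswapd_wait);
		}
	}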
> - In the case where enough free pages are present, but there is not a
> contiguous block to be found that is big enough: is it not possible to
> rearrange the pages (possibly using the VM's virtual-to-physical memory
> mapping)? I can imagine that 'just' moving data around really upsets the
> system, but the VM hardware can take care of the address->physical address
> translation.
The VM hardware does take care of the virt->phys, but if a page is moved
all the PTEs which refer to the page need to be updated. The task is
finding _all_ the PTEs (remember page sharing) efficiently. This is what
PTE chains could do... if they were there.
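For illustration only (a sketch of the idea, not code from any tree - the
field and helper names are invented here), a PTE chain is just a per-page
list of every PTE mapping that page, which is exactly what a page-mover
needs:

	struct pte_chain {
		struct pte_chain *next;
		pte_t *ptep;			/* one PTE mapping the page */
	};

	/* Hypothetical helper: point every mapping of 'old' at 'new'. */
	static void remap_all_ptes(struct page *old, struct page *new)
	{
		struct pte_chain *pc;

		for (pc = old->pte_chain; pc; pc = pc->next)
			set_pte(pc->ptep, mk_pte(new, pte_pgprot(*pc->ptep)));
		/* ...plus the TLB flushing, the locking, and the cost of
		 * maintaining these chains on every map/unmap - the "fat
		 * and heavy" part. */
	}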
> - If the allocation still fails, try to page out some memory to free a few
> pages. This should only occur if we want to allocate more pages than there
> are physically free. A swapping system is better than a hanging system to me!
> Is it possible to do something like do_try_to_free_page() from vmscan.c?
Unfortunately, for large memory systems, simply freeing a few pages does
not always lead to coalescing - especially to a high order. It's late, and
I can't be bothered to work out the maths... left as an exercise for the
reader.
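If anyone does want to play with it, a throwaway user-space toy (nothing
to do with the kernel code, numbers picked arbitrarily) shows the scale of
the problem - scatter a few thousand free pages over a pretend 512MB
machine and count the aligned order-3 (8-page) groups that end up
entirely free:

	#include <stdio.h>
	#include <stdlib.h>

	#define PAGES	(128 * 1024)	/* 512MB of 4K pages */
	#define ORDER	3		/* want 8 contiguous, aligned pages */

	int main(void)
	{
		static char free_map[PAGES];
		int i, j, n, blocks = 0;

		srand(1);
		for (n = 0; n < PAGES / 50; n++)	/* free ~2% of pages */
			free_map[rand() % PAGES] = 1;

		for (i = 0; i < PAGES; i += (1 << ORDER)) {
			for (j = 0; j < (1 << ORDER); j++)
				if (!free_map[i + j])
					break;
			if (j == (1 << ORDER))
				blocks++;
		}
		printf("pages freed: ~%d, free order-%d blocks: %d\n",
		       PAGES / 50, ORDER, blocks);
		return 0;
	}

It typically reports zero free order-3 blocks despite roughly 10MB of
free pages.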
> > The patch also allows the SLAB_BREAK_GFP_ORDER to be set from the boot
> > line. To set it to 1, use: "slab=4,1" (don't worry too much about the
> > 4, it is the minimum objects per slab the allocator _tries_ to use for a
> > cache. Just keep it at '4').
>
> So a "slab=4,0" will make the slab allocator happy with one free page.
> The allocation problem then still exists for caches > pagesize.
No, it means the allocator will use smaller slabs (a slab contains
objects, where an object is the memory you are requesting). Smaller slabs
mean (obviously) a smaller number of objects per slab. This leads to more
internal management overhead (the slab chains are partially ordered; when
all the objects in a slab become active/inactive it may need to be moved
to keep the ordering - this helps to keep down external fragmentation).
Also, small slabs tend to lead to internal fragmentation (unused memory).
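As a purely illustrative example (ignoring the per-slab management
overhead): 1600-byte objects fit twice into an order-0 (4096-byte) slab,
wasting roughly 22% of it, but five times into an order-1 (8192-byte)
slab, wasting only about 2% - and the bigger slab also means fewer slab
descriptors and list operations per object.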
SLAB_BREAK_GFP_ORDER tells kmem_cache_create() when to give up trying to
satisfy the minimum objects-per-slab value. But if the internal
fragmentation is still too high, it may well still select a higher order.
This is a cache-creation-time setting, not an allocation-time variable.
So, "slab=4,1" tells the allocator to try initialise a cache such that
each slab has at least four objects, but to not use a page-order greater
than 1 (2 physcially contigious pages) _unless_ this is too small for one
object or the internal fragmentation is too high.
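To make that concrete (illustrative numbers, ignoring per-slab overhead):
for a cache of 4000-byte objects, four objects per slab would need an
order-2 (16384-byte) slab, but with the break order at 1 the allocator
settles for an order-1 (8192-byte) slab holding just two objects, since
each object fits and very little space is wasted. Only if an object were
bigger than 8192 bytes, or the order-1 wastage were severe, would it go
above the break order.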
> At this moment I have to allocate the DMA buffers for the sound driver
> during bootup. If I don't, the allocation of 128kB buffers fails (not enough
> contiguous memory). Anyone (Mark?) know if solving the fragmentation will
> also solve this DMA buffer allocation problem?
Managing DMA pages is tricky. I ripped my changes for DMA out of the
patch because they were too heavy (taking too many CPU cycles).
It is by no means impossible, just a balancing act.
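The usual workaround is the one you are already using: grab the buffer
once, early in driver init while memory is still largely unfragmented,
and keep it for the lifetime of the driver. Purely as an illustration
(made-up driver name; GFP_DMA is only needed for ISA-style devices that
cannot address all of memory):

	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/mm.h>
	#include <linux/errno.h>

	static unsigned long snd_buf;	/* 128kB = 32 pages = order 5 with 4K pages */

	static int __init mydrv_init(void)
	{
		snd_buf = __get_free_pages(GFP_KERNEL | GFP_DMA, 5);
		if (!snd_buf) {
			printk(KERN_ERR "mydrv: no 128kB contiguous buffer\n");
			return -ENOMEM;
		}
		return 0;
	}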
In summary, change SLAB_BREAK_GFP_ORDER to 1.
Regards,
markhe