Re: [patch 12/26] SLUB: Slab defragmentation core

From: Andrew Morton
Date: Tue Jun 26 2007 - 04:21:52 EST


On Mon, 18 Jun 2007 02:58:50 -0700 clameter@xxxxxxx wrote:

> Slab defragmentation occurs either
>
> 1. Unconditionally when kmem_cache_shrink is called on a slab cache, either
> by the kernel or by slabinfo triggering slab shrinking. This
> form performs defragmentation on all nodes of a NUMA system.
>
> 2. Conditionally when kmem_cache_defrag(<percentage>, <node>) is called.
>
> The defragmentation is only performed if the fragmentation of the slab
> is higher than the specified percentage. Fragmentation ratios are measured
> by calculating the percentage of objects in use compared to the total
> number of objects that the slab cache could hold.
>
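
As a rough illustration of the ratio described above (not code from the
patch): a minimal user-space sketch that computes the percentage of object
slots in use. The helper name usage_percent and the handling of an empty
cache are assumptions. How the result is compared against the percentage
passed to kmem_cache_defrag is deliberately left out, since that is defined
by the patch itself.

```c
#include <assert.h>

/*
 * Hypothetical helper: percentage of object slots in use, given the
 * number of allocated objects and the total the cache could hold.
 */
static unsigned long usage_percent(unsigned long inuse, unsigned long capacity)
{
	assert(inuse <= capacity);
	if (capacity == 0)
		return 100;	/* assumption: an empty cache counts as full */
	return inuse * 100 / capacity;
}
```

For example, 25 objects in use out of a possible 100 gives 25.
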
> kmem_cache_defrag takes a node parameter. This can either be -1 if
> defragmentation should be performed on all nodes, or a node number.
> If a node number is specified then defragmentation is only performed
> on that node.
>
> Slab defragmentation is a memory intensive operation that can be
> sped up in a NUMA system if mostly node local memory is accessed. That
> is the case if we have just performed reclaim on a node.
>
> For defragmentation SLUB first generates a sorted list of partial slabs.
> Sorting is performed according to the number of objects allocated.
> Thus the slabs with the fewest objects will be at the end.
>
> We extract slabs off the tail of that list until we have either reached a
> minimum number of slabs or until we encounter a slab that has more than a
> quarter of its objects allocated. Then we attempt to remove the objects
> from each of the slabs taken.
>
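
The selection step described above can be sketched in user space roughly
as follows. This is an illustration under stated assumptions, not the
patch's code: struct slab_info, pick_candidates and the limit parameter
are hypothetical names, an array stands in for the partial list, and
"reached a minimum number of slabs" is read here as a cap on how many
slabs are taken.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a partial slab: allocated vs. total objects. */
struct slab_info {
	unsigned int inuse;
	unsigned int objects;
};

/* Sort so that the slabs with the most allocated objects come first. */
static int cmp_inuse_desc(const void *a, const void *b)
{
	const struct slab_info *x = a, *y = b;

	return (x->inuse < y->inuse) - (x->inuse > y->inuse);
}

/*
 * Sort the slabs, then walk from the tail (fewest objects) and count
 * candidates until 'limit' is reached or a slab has more than a quarter
 * of its objects allocated. Returns the number of slabs selected; they
 * are the last 'taken' entries of the sorted array.
 */
static size_t pick_candidates(struct slab_info *slabs, size_t n, size_t limit)
{
	size_t taken = 0;

	assert(slabs != NULL || n == 0);
	if (n == 0)
		return 0;
	qsort(slabs, n, sizeof(*slabs), cmp_inuse_desc);
	while (taken < n && taken < limit) {
		const struct slab_info *s = &slabs[n - 1 - taken];

		if (s->inuse * 4 > s->objects)
			break;	/* more than a quarter in use: stop here */
		taken++;
	}
	return taken;
}
```
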
> In order for a slabcache to support defragmentation a couple of functions
> must be defined via kmem_cache_ops. These are
>
> void *get(struct kmem_cache *s, int nr, void **objects)
>
> Must obtain a reference to the listed objects. SLUB guarantees that
> the objects are still allocated. However, other threads may be blocked
> in slab_free attempting to free objects in the slab. These may succeed
> as soon as get() returns to the slab allocator. The function must
> be able to detect the situation and avoid attempts to handle such
> objects (by for example voiding the corresponding entry in the objects
> array).
>
> No slab operations may be performed in get_reference(). Interrupts

s/get_reference/get/, yes?

> are disabled. What can be done is very limited. The slab lock
> for the page with the object is taken. Any attempt to perform a slab
> operation may lead to a deadlock.
>
> get() returns a private pointer that is passed to kick. Should we
> be unable to obtain all references then that pointer may indicate
> to the kick() function that it should not attempt any object removal
> or move but simply remove the reference counts.
>
> void kick(struct kmem_cache *, int nr, void **objects, void *get_result)
>
> After SLUB has established references to the objects in a
> slab it will drop all locks and then use kick() to move objects out
> of the slab. The existence of the object is guaranteed by virtue of
> the earlier obtained references via get(). The callback may perform
> any slab operation since no locks are held at the time of call.
>
> The callback should remove the object from the slab in some way. This
> may be accomplished by reclaiming the object and then running
> kmem_cache_free() or reallocating it and then running
> kmem_cache_free(). Reallocation is advantageous because the partial
> slabs were just sorted to have the partial slabs with the most objects
> first. Reallocation is likely to result in filling up a slab in
> addition to freeing up one slab so that it also can be removed from
> the partial list.
>
> kick() does not return a result. SLUB will check the number of
> remaining objects in the slab. If all objects were removed then
> we know that the operation was successful.
>
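
To make the get()/kick() contract above concrete, here is a minimal
user-space sketch. Only the two callback signatures come from the
changelog. struct my_obj, its fields, and the my_get/my_kick names are
hypothetical, plain ints stand in for whatever reference counting and
locking a real cache would use, and setting 'released' stands in for
reclaiming the object and calling kmem_cache_free().

```c
#include <assert.h>
#include <stddef.h>

struct kmem_cache;		/* opaque in this sketch */

/* Hypothetical object type; not taken from the patch. */
struct my_obj {
	int refcount;		/* references currently held */
	int dying;		/* set once a free is in flight */
	int released;		/* stand-in for kmem_cache_free() */
};

/* Take a reference on each live object; void entries that are going away. */
static void *my_get(struct kmem_cache *s, int nr, void **objects)
{
	int i;

	(void)s;
	assert(objects != NULL);
	for (i = 0; i < nr; i++) {
		struct my_obj *o = objects[i];

		if (o->dying)
			objects[i] = NULL;	/* kick() must skip this one */
		else
			o->refcount++;
	}
	return NULL;	/* NULL private pointer: kick() may remove objects */
}

/* Drop the references from my_get() and release objects nobody else holds. */
static void my_kick(struct kmem_cache *s, int nr, void **objects, void *private)
{
	int i;

	(void)s;
	for (i = 0; i < nr; i++) {
		struct my_obj *o = objects[i];

		if (!o)
			continue;
		o->refcount--;
		if (!private && o->refcount == 0)
			o->released = 1;
	}
}
```

In this model an object that is still referenced elsewhere survives
kick(), so the slab would not become empty and the attempt would count as
unsuccessful.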

Nice changelog ;)

> +static int __kmem_cache_vacate(struct kmem_cache *s,
> + struct page *page, unsigned long flags, void *scratch)
> +{
> + void **vector = scratch;
> + void *p;
> + void *addr = page_address(page);
> + DECLARE_BITMAP(map, s->objects);

A variable-sized local. We have a few of these in-kernel.

What's the worst-case here? With 4k pages and 4-byte objects it's 128 bytes
of stack? Seems acceptable.
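
The arithmetic behind that estimate, as a small user-space sketch. It
assumes one page per slab and one bit per object, rounded up to whole
unsigned longs the way a bitmap declaration would be. The helper name
bitmap_bytes is made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Bytes of stack needed for a bitmap with one bit per object, assuming
 * a slab of 'slab_size' bytes holding objects of 'object_size' bytes.
 */
static size_t bitmap_bytes(size_t slab_size, size_t object_size)
{
	size_t bits, bits_per_long, longs;

	assert(object_size > 0);
	bits = slab_size / object_size;
	bits_per_long = sizeof(unsigned long) * CHAR_BIT;
	longs = (bits + bits_per_long - 1) / bits_per_long;
	return longs * sizeof(unsigned long);
}
```

bitmap_bytes(4096, 4) comes to 1024 bits, i.e. 128 bytes. The figure
scales linearly with the slab size, so it would grow for larger pages or
multi-page slabs.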

(What's the smallest sized object slub will create? 4 bytes?)



To hold off a concurrent free while defragging, the code relies upon
slab_lock() on the current page, yes?

But slab_lock() isn't taken for slabs whose objects are larger than PAGE_SIZE.
How's that handled?



Overall: looks good. It'd be nice to get a buffer_head shrinker in place,
see how that goes from a proof-of-concept POV.


How much testing has been done on this code, and of what form, and with
what results?
