The usual thing is this:
- Each processor has a private cache that can be accessed unlocked
(or with whatever uniprocessor locking is required, which is at
most a cli), and
- If the private cache is empty, it falls back to a global slab system,
which does require locking, but that is hopefully rare.
Thus, the race condition that you refer to in the second point cannot
happen.
The *structors are to let you cache setup work between slab allocations.
For example, dentries have some linked lists whose initial null
state (forward and back pointers pointing to self) has to be set up
on allocation, but holds again on deallocation. Thus, when you
recycle an entry, you don't have to reinitialize them. The
constructor and destructor are for when data is received from and
returned to the system pool.
Jeff Bonwick's version has a debugging option that calls them on every
object allocation and deallocation. Things like wait queues, reference
counts, and so on that begin as zero and end as zero are initializations
that are useful to avoid repeating.
As for the first problem, one suggestion I have is to tackle the hit rate
head-on. Establish a target front-end hit rate, say 99%. Then do the
following on each allocation:
if (there's something in the local cache) {
    allocate it for return
    if (--score < 0) {
        take another object from the local cache and return it to the
        global pool
        score += TUNABLE
    }
} else {
    score += 100-TUNABLE;
    allocate something from the global pool;
}
This will maintain the pool size at the level necessary to keep a
99% hit rate. If it's repeated allocation and deallocation of one
object, the local cache will contain one object. If it varies a lot,
the local cache will be bigger.
The logic is easiest to see if TUNABLE is 0, but over the long run,
the number of allocations from and frees to the global pool will be
the same, so TUNABLE can have any value and maintain the same hit rate.
A low value of TUNABLE causes the cache to explode initially, as all
the allocations miss and score goes up a lot. It takes a long time
to settle down, as it won't free objects until the overall miss rate
stays below 1%, and there are a lot of initial misses to make up for.
A high value reduces that effect, but causes the cache to be slow to
free resources when usage goes down.
Anyway, maybe it can be fiddled with. The main point is that the
common case (decrement, test high bit, and do nothing) is very fast.
I'm a little unclear on David Miller's design constraint
"2) No cli()'s on any code path whatsoever".
Doesn't __get_free_pages do all kinds of locking (including possibly
synchronous disk I/O!), and doesn't *any* kernel memory allocator need
to call that sometimes? David, can you clarify?
-- -Colin