[RFC] A new per-cpu memory allocator for userspace in librseq

From: Mathieu Desnoyers
Date: Wed Mar 20 2024 - 12:26:48 EST


Hi!

When looking at what is missing to make librseq a generally usable
project for supporting per-cpu data structures in user-space, I
noticed that what we lack is a per-cpu memory allocator conceptually
similar to the one the Linux kernel provides internally [1].

The per-CPU memory allocator is analogous to TLS (Thread-Local
Storage) memory: whereas TLS provides Thread-Local Storage, the
per-CPU memory allocator provides CPU-Local Storage.
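
To make the analogy concrete, here is a minimal sketch. The
__rseq_percpu annotation is my reading of mempool.h [2]: it marks
pointers which must be offset by (cpu * stride) before dereference.

/* TLS: the toolchain hands each thread its own instance. */
static __thread long thread_counter;

/* CPU-Local Storage: the mempool hands each CPU its own instance,
 * all reachable from a single annotated pointer. */
static long __rseq_percpu *cpu_counter;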

My goal is to improve locality and remove the need to waste precious
cache lines with padding when indexing per-cpu data as an array of
items.
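
As a reminder, the array-of-items pattern this replaces looks like
the following sketch, where each item is padded to a cache line to
avoid false sharing (the 64-byte line size and the NR_CPUS_MAX bound
are illustrative placeholders):

struct counter {
        long v;
        /* The rest of the cache line is wasted as padding. */
} __attribute__((aligned(64)));

static struct counter counters[NR_CPUS_MAX];    /* counters[cpu].v */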

So we decided to go ahead and implement a per-cpu allocator for
userspace in the librseq project [2,3] with the following
characteristics:

* Allocations are performed from memory pools (mempools). Items
have a fixed, power-of-2 size, configured at pool creation.

* Memory pools can be added to a pool set to allow allocation of
variable-size records.

* Allocating "items" from a memory pool allocates memory for all
CPUs.

* The "stride" to index per-cpu data is user-configurable. Indexing
per-cpu data from an allocated pointer is as simple as:

(uintptr_t) ptr + (cpu * stride)

Where the multiplication is actually a shift because stride is
a power of 2 constant.

* Pools consist of a linked list of "ranges" (each holding a
stride's worth of item allocations), making the pool extensible
when running out of space, up to a user-configurable limit.

* Freeing a pointer only requires the pointer to free as input
(and the pool stride constant). Finding the range and pool
associated with the pointer is done by applying a mask to the
pointer: the memory mappings of the ranges are aligned so that
this mask yields the range base, which in turn allows accessing
the range structure placed in a header page immediately before
it (see the mask sketch after this list).
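
Here is a rough usage sketch of the resulting API. The function and
macro names below follow my reading of mempool.h [2], so treat the
exact signatures as approximate; error handling is omitted.

#include <rseq/mempool.h>

struct counter {
        long v;
};

static void example(int cpu)
{
        struct rseq_mempool *pool;
        struct counter __rseq_percpu *c;

        /* Pool of fixed-size items; NULL selects default attributes. */
        pool = rseq_mempool_create("counters", sizeof(struct counter), NULL);

        /* Allocate one zeroed item instance per CPU. */
        c = (struct counter __rseq_percpu *) rseq_mempool_percpu_zmalloc(pool);

        /* Index the instance of "cpu": (uintptr_t) c + (cpu * stride). */
        rseq_percpu_ptr(c, cpu)->v++;

        /* Free takes only the pointer (plus the stride constant). */
        rseq_mempool_percpu_free(c);

        rseq_mempool_destroy(pool);
}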
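
And here is a sketch of the mask trick used when freeing; the
structure and helper names are illustrative, not the internal ones.

#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the internal range metadata. */
struct range_header;

static struct range_header *range_of(void *ptr, uintptr_t stride,
                size_t header_len)
{
        /* Range mappings are aligned on the power-of-2 stride, so
         * masking the low bits of any item pointer yields the range
         * base. */
        uintptr_t base = (uintptr_t) ptr & ~(stride - 1);

        /* The range structure sits in a header page immediately
         * before the range base. */
        return (struct range_header *) (base - header_len);
}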

One interesting problem we faced is how to prevent wasting memory
on the allocation of useless pages on a system which has many
configured CPUs, but where only a few are actually used by the
application due to a combination of CPU affinity, cpusets, and CPU
hotplug. Minimizing the amount of page allocation while offering
the ability to allocate zeroed (or pre-initialized) items is the
crux of this issue.

We thus came up with two approaches based on copy-on-write (COW)
to tackle this, which we call the "pool populate policy":

* RSEQ_MEMPOOL_POPULATE_COW_INIT (default):

Rely on copy-on-write (COW) of per-cpu pages to populate per-cpu pages
from the initial values pages on first write.

The COW_INIT approach maps an extra "initial values" stride with each
pool range as MAP_SHARED from a memfd. All per-cpu strides map these
initial values as MAP_PRIVATE, so the first write access from an active
CPU will trigger a COW page allocation. The downsides of this
scheme are that its use of MAP_SHARED is not compatible with using
the pool from child processes after fork, and its use of COW is not
compatible with shared memory use-cases.
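
A simplified sketch of this mapping scheme (one per-cpu stride shown,
error handling omitted; the real code must lay the per-cpu strides
out contiguously so that the (cpu * stride) indexing works):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

static void cow_init_sketch(size_t stride)
{
        /* "Initial values" stride: MAP_SHARED from a memfd. */
        int fd = memfd_create("mempool-init", 0);
        void *init_values, *cpu_stride;

        ftruncate(fd, stride);
        init_values = mmap(NULL, stride, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

        /* Each per-cpu stride maps the same memfd MAP_PRIVATE: the
         * first write from an active CPU triggers a COW page
         * allocation, copying the matching initial values page. */
        cpu_stride = mmap(NULL, stride, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE, fd, 0);

        (void) init_values;
        (void) cpu_stride;
}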

* RSEQ_MEMPOOL_POPULATE_COW_ZERO:

Rely on copy-on-write (COW) of per-cpu pages to populate per-cpu
pages from the zero page on first write. As long as the user only
uses malloc, zmalloc, or malloc_init with zeroed content to allocate
items, COW of all per-cpu pages is not triggered: the zero page
stays in place until an active CPU writes to its per-cpu item.

The COW_ZERO approach maps the per-cpu strides as private anonymous
memory, and therefore only triggers COW page allocation when a CPU
writes over those zero pages. As a downside, this scheme will trigger
COW page allocation for all possible CPUs when using malloc_init()
to populate non-zeroed initial values for an item. Its upsides are
that this scheme can be used across fork and can eventually be used
over shared memory.
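
The corresponding sketch for COW_ZERO is even simpler:

#include <sys/mman.h>

static void cow_zero_sketch(size_t stride)
{
        /* A per-cpu stride is private anonymous memory: every page
         * is backed by the shared zero page until a CPU writes to
         * it, which triggers a COW page allocation. */
        void *cpu_stride = mmap(NULL, stride, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        (void) cpu_stride;
}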

Other noteworthy features: this mempool allocator can also be used
as a global allocator, and it has an optional "robust" attribute
which enables checks for memory corruption and double-free.
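
For instance, creating a pool with the robust checks enabled could
look like this, reusing struct counter from the sketch above (the
attribute function names follow my reading of mempool.h [2]):

struct rseq_mempool_attr *attr;
struct rseq_mempool *pool;

attr = rseq_mempool_attr_create();
/* Enable memory corruption and double-free checks. */
rseq_mempool_attr_set_robust(attr);
pool = rseq_mempool_create("counters", sizeof(struct counter), attr);
rseq_mempool_attr_destroy(attr);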

Users with more custom use-cases can register an "init" callback
which is called after each new range/CPU is allocated.
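
A hypothetical example of such a callback. The registration function
and the callback signature below are illustrative only; the actual
ones are defined in mempool.h [2].

#include <string.h>

/* Called for each newly allocated range/CPU, with the address and
 * length of the per-cpu memory to pre-initialize. */
static int range_init(void *priv, void *addr, size_t len, int cpu)
{
        memset(addr, 0, len);   /* custom pre-initialization */
        return 0;
}

/* Registered through the pool attributes, e.g.:
 *   rseq_mempool_attr_set_init(attr, range_init, NULL);
 */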

Feedback is welcome!

Thanks,

Mathieu

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/percpu.h
[2] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/mempool.h
[3] https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/src/rseq-mempool.c

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com