Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
From: Josef Bacik
Date: Wed Jul 19 2017 - 15:11:28 EST
On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@xxxxxxxxx>
>
> The percpu memory allocator is experiencing scalability issues when
> allocating and freeing large numbers of counters as in BPF.
> Additionally, there is a corner case where iteration is triggered over
> all chunks if the contig_hint is the right size, but wrong alignment.
>
> Implementation:
> This patch removes the area map allocator in favor of a bitmap allocator
> backed by metadata blocks. The primary goal is to provide consistency
> in performance and memory footprint with a focus on small allocations
> (< 64 bytes). The bitmap removes the heavy memmove from the freeing
> critical path and provides a consistent memory footprint. The metadata
> blocks provide a bound on the amount of scanning required by maintaining
> a set of hints.
>
> The chunks previously were managed by free_size, a value maintained in
> bytes. Now, the chunks are managed in terms of bits, which is just a
> scaled value of free_size down by PCPU_MIN_ALLOC_SIZE.
>
> There is one caveat with this implementation. In an effort to make
> freeing fast, the only time metadata is updated on the free path is if a
> whole block becomes free or the freed area spans across metadata blocks.
> This causes the chunkâs contig_hint to be potentially smaller than what
> it could allocate by up to a block. If the chunkâs contig_hint is
> smaller than a block, a check occurs and the hint is kept accurate.
> Metadata is always kept accurate on allocation and therefore the
> situation where a chunk has a larger contig_hint than available will
> never occur.
>
> Evaluation:
> I have primarily done testing against a simple workload of allocation of
> 1 million objects of varying size. Deallocation was done by in order,
> alternating, and in reverse. These numbers were collected after rebasing
> ontop of a80099a152. I present the worst-case numbers here:
>
> Area Map Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
> 4B | 335 | 4960
> 16B | 485 | 1150
> 64B | 445 | 280
> 128B | 505 | 177
> 1024B | 3385 | 140
>
> Bitmap Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
> 4B | 725 | 70
> 16B | 760 | 70
> 64B | 855 | 80
> 128B | 910 | 90
> 1024B | 3770 | 260
>
> This data demonstrates the inability for the area map allocator to
> handle less than ideal situations. In the best case of reverse
> deallocation, the area map allocator was able to perform within range
> of the bitmap allocator. In the worst case situation, freeing took
> nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
> dramatically improves the consistency of the free path. The small
> allocations performed nearly identical regardless of the freeing
> pattern.
>
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 milion 4-byte objects with 8-byte alignment.
>
This was a bear to review, I feel like it could be split into a few smaller
pieces. You are changing hinting, allocating/freeing, and how you find chunks.
Those seem like good logical divisions to me. Overall the design seems sound
and I didn't spot any major problems. Once you've split them up I'll do another
thorough comb through and then add my reviewed by. Thanks,
Josef