Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
From: Josef Bacik
Date: Wed Jul 19 2017 - 15:16:46 EST
On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@xxxxxxxxx>
>
> The percpu memory allocator is experiencing scalability issues when
> allocating and freeing large numbers of counters as in BPF.
> Additionally, there is a corner case where iteration is triggered over
> all chunks if the contig_hint is the right size, but wrong alignment.
>
> Implementation:
> This patch removes the area map allocator in favor of a bitmap allocator
> backed by metadata blocks. The primary goal is to provide consistency
> in performance and memory footprint with a focus on small allocations
> (< 64 bytes). The bitmap removes the heavy memmove from the freeing
> critical path and provides a consistent memory footprint. The metadata
> blocks provide a bound on the amount of scanning required by maintaining
> a set of hints.
>
> The chunks were previously managed by free_size, a value maintained in
> bytes. Now, the chunks are managed in terms of bits, which is simply
> free_size scaled down by PCPU_MIN_ALLOC_SIZE.
>
> There is one caveat with this implementation. In an effort to make
> freeing fast, the only time metadata is updated on the free path is if a
> whole block becomes free or the freed area spans across metadata blocks.
> This causes the chunk's contig_hint to be potentially smaller than what
> it could allocate by up to a block. If the chunk's contig_hint is
> smaller than a block, a check occurs and the hint is kept accurate.
> Metadata is always kept accurate on allocation and therefore the
> situation where a chunk has a larger contig_hint than available will
> never occur.
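
For illustration, the byte-to-bit bookkeeping described above boils down
to something like the following (a hypothetical helper, not part of the
patch; it only assumes PCPU_MIN_ALLOC_SHIFT/PCPU_MIN_ALLOC_SIZE from
include/linux/percpu.h):

	/*
	 * Illustrative only: each allocation-map bit covers
	 * PCPU_MIN_ALLOC_SIZE (4) bytes.
	 */
	static inline int pcpu_size_to_bits(size_t size)
	{
		return ALIGN(size, PCPU_MIN_ALLOC_SIZE) >> PCPU_MIN_ALLOC_SHIFT;
	}

So a 6-byte request rounds up to 8 bytes and consumes 2 bits, and a
chunk's free_bits is simply its free byte count scaled down by
PCPU_MIN_ALLOC_SHIFT.
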
>
> Evaluation:
> I have primarily tested against a simple workload that allocates
> 1 million objects of varying size. Deallocation was done in order,
> alternating, and in reverse. These numbers were collected after
> rebasing on top of a80099a152. I present the worst-case numbers here:
>
> Area Map Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
> 4B | 335 | 4960
> 16B | 485 | 1150
> 64B | 445 | 280
> 128B | 505 | 177
> 1024B | 3385 | 140
>
> Bitmap Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
> 4B | 725 | 70
> 16B | 760 | 70
> 64B | 855 | 80
> 128B | 910 | 90
> 1024B | 3770 | 260
>
> This data demonstrates the inability of the area map allocator to
> handle less-than-ideal situations. In the best case of reverse
> deallocation, the area map allocator performed within range of the
> bitmap allocator. In the worst case, freeing took nearly 5 seconds
> for 1 million 4-byte objects. The bitmap allocator dramatically
> improves the consistency of the free path; the small allocations
> performed nearly identically regardless of the freeing pattern.
>
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem, the
> additional scanning triggered by a misaligned contig_hint, can leave
> the area map allocator taking 52 minutes to complete. The same
> workload takes only 14 seconds with the bitmap allocator. This was
> produced under a more contrived scenario of allocating 1 million
> 4-byte objects with 8-byte alignment.
>
> Signed-off-by: Dennis Zhou <dennisszhou@xxxxxxxxx>
> ---
> include/linux/percpu.h | 10 +-
> init/main.c | 1 -
> mm/percpu-internal.h | 70 ++-
> mm/percpu-stats.c | 99 ++--
> mm/percpu.c | 1280 ++++++++++++++++++++++++++++++------------------
> 5 files changed, 923 insertions(+), 537 deletions(-)
>
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index a5cedcd..8f62b10 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -26,6 +26,15 @@
> #define PCPU_MIN_ALLOC_SHIFT 2
>
> /*
> + * This determines the size of each metadata block. There are several subtle
> + * constraints around this variable. The reserved_region and dynamic_region
> + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE. This is
> + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks
> + * that are backing multiple pages, this needs to be accounted for.
> + */
> +#define PCPU_BITMAP_BLOCK_SIZE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)
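
For concreteness (assuming a 4 KiB PAGE_SIZE): PCPU_BITMAP_BLOCK_SIZE
works out to 4096 >> 2 = 1024 bits, i.e. exactly one metadata block per
page at the 4-byte minimum allocation granularity.
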
> +
> +/*
> * Percpu allocator can serve percpu allocations before slab is
> * initialized which allows slab to depend on the percpu allocator.
> * The following two parameters decide how much resource to
> @@ -120,7 +129,6 @@ extern bool is_kernel_percpu_address(unsigned long addr);
> #if !defined(CONFIG_SMP) || !defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
> extern void __init setup_per_cpu_areas(void);
> #endif
> -extern void __init percpu_init_late(void);
>
> extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);
> extern void __percpu *__alloc_percpu(size_t size, size_t align);
> diff --git a/init/main.c b/init/main.c
> index 052481f..c9a9fff 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -500,7 +500,6 @@ static void __init mm_init(void)
> page_ext_init_flatmem();
> mem_init();
> kmem_cache_init();
> - percpu_init_late();
> pgtable_init();
> vmalloc_init();
> ioremap_huge_init();
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index f0776f6..2dac644 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -4,6 +4,21 @@
> #include <linux/types.h>
> #include <linux/percpu.h>
>
> +/*
> + * pcpu_bitmap_md is the metadata block struct.
> + * All units are in terms of bits.
> + */
> +struct pcpu_bitmap_md {
> + int contig_hint; /* contig hint for block */
> + int contig_hint_start; /* block relative starting
> + position of the contig hint */
> + int left_free; /* size of free space along
> + the left side of the block */
> + int right_free; /* size of free space along
> + the right side of the block */
> + int first_free; /* block position of first free */
> +};
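
To make these fields concrete with made-up numbers: in a 1024-bit block
whose only allocation covers bits 100-199, first_free is 0, left_free is
100, right_free is 824, and contig_hint is 824 with contig_hint_start at
200.
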
> +
> struct pcpu_chunk {
> #ifdef CONFIG_PERCPU_STATS
> int nr_alloc; /* # of allocations */
> @@ -11,17 +26,20 @@ struct pcpu_chunk {
> #endif
>
> struct list_head list; /* linked to pcpu_slot lists */
> - int free_size; /* free bytes in the chunk */
> - int contig_hint; /* max contiguous size hint */
> + int free_bits; /* free bits in the chunk */
> + int contig_hint; /* max contiguous size hint
> + in bits */
> + int contig_hint_start; /* contig_hint starting
> + bit offset */
> void *base_addr; /* base address of this chunk */
>
> - int map_used; /* # of map entries used before the sentry */
> - int map_alloc; /* # of map entries allocated */
> - int *map; /* allocation map */
> - struct list_head map_extend_list;/* on pcpu_map_extend_chunks */
> + unsigned long *alloc_map; /* allocation map */
> + unsigned long *bound_map; /* boundary map */
> + struct pcpu_bitmap_md *md_blocks; /* metadata blocks */
>
> void *data; /* chunk data */
> - int first_free; /* no free below this */
> + int first_free_block; /* block that contains the first
> + free bit */
> bool immutable; /* no [de]population allowed */
> bool has_reserved; /* indicates if the region this chunk
> is responsible for overlaps with
> @@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk;
> extern struct pcpu_chunk *pcpu_reserved_chunk;
> extern unsigned long pcpu_reserved_offset;
>
> +/*
> + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks
> + * @chunk: chunk of interest
> + *
> + * This conversion is from the number of physical pages that the chunk
> + * serves to the number of bitmap blocks required. It converts to bytes
> + * served to bits required and then blocks used.
> + */
> +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk)
> +{
> + return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE /
> + PCPU_BITMAP_BLOCK_SIZE;
> +}
> +
> +/*
> + * pcpu_pages_to_bits - converts the pages to size of bitmap
> + * @pages: number of physical pages
> + *
> + * This conversion is from physical pages to the number of bits
> + * required in the bitmap.
> + */
> +static inline int pcpu_pages_to_bits(int pages)
> +{
> + return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
> +}
> +
> +/*
> + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap
> + * @chunk: chunk of interest
> + *
> + * This conversion is from the number of physical pages that the chunk
> + * serves to the number of bits in the bitmap.
> + */
> +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk)
> +{
> + return pcpu_pages_to_bits(chunk->nr_pages);
> +}
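
With 4 KiB pages (assumed for the arithmetic), these helpers collapse
nicely: pcpu_pages_to_bits(1) is 4096 / 4 = 1024 = PCPU_BITMAP_BLOCK_SIZE,
so pcpu_nr_pages_to_blocks() is simply chunk->nr_pages, i.e. one metadata
block per page.
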
> +
> #ifdef CONFIG_PERCPU_STATS
>
> #include <linux/spinlock.h>
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
> index 6fc04b1..8dbef0c 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -29,64 +29,85 @@ static int cmpint(const void *a, const void *b)
> }
>
> /*
> - * Iterates over all chunks to find the max # of map entries used.
> + * Iterates over all chunks to find the max nr_alloc entries.
> */
> -static int find_max_map_used(void)
> +static int find_max_nr_alloc(void)
> {
> struct pcpu_chunk *chunk;
> - int slot, max_map_used;
> + int slot, max_nr_alloc;
>
> - max_map_used = 0;
> + max_nr_alloc = 0;
> for (slot = 0; slot < pcpu_nr_slots; slot++)
> list_for_each_entry(chunk, &pcpu_slot[slot], list)
> - max_map_used = max(max_map_used, chunk->map_used);
> + max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc);
>
> - return max_map_used;
> + return max_nr_alloc;
> }
>
> /*
> * Prints out chunk state. Fragmentation is considered between
> * the beginning of the chunk to the last allocation.
> + *
> + * All statistics are in bytes unless stated otherwise.
> */
> static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
> int *buffer)
> {
> - int i, s_index, last_alloc, alloc_sign, as_len;
> + int i, last_alloc, as_len, start, end;
> int *alloc_sizes, *p;
> /* statistics */
> int sum_frag = 0, max_frag = 0;
> int cur_min_alloc = 0, cur_med_alloc = 0, cur_max_alloc = 0;
>
> alloc_sizes = buffer;
> - s_index = chunk->has_reserved ? 1 : 0;
> -
> - /* find last allocation */
> - last_alloc = -1;
> - for (i = chunk->map_used - 1; i >= s_index; i--) {
> - if (chunk->map[i] & 1) {
> - last_alloc = i;
> - break;
> - }
> - }
>
> - /* if the chunk is not empty - ignoring reserve */
> - if (last_alloc >= s_index) {
> - as_len = last_alloc + 1 - s_index;
> -
> - /*
> - * Iterate through chunk map computing size info.
> - * The first bit is overloaded to be a used flag.
> - * negative = free space, positive = allocated
> - */
> - for (i = 0, p = chunk->map + s_index; i < as_len; i++, p++) {
> - alloc_sign = (*p & 1) ? 1 : -1;
> - alloc_sizes[i] = alloc_sign *
> - ((p[1] & ~1) - (p[0] & ~1));
> + /*
> + * find_last_bit returns the start value if nothing found.
> + * Therefore, we must determine if it is a failure of find_last_bit
> + * and set the appropriate value.
> + */
> + last_alloc = find_last_bit(chunk->alloc_map,
> + pcpu_nr_pages_to_bits(chunk) - 1);
> + last_alloc = test_bit(last_alloc, chunk->alloc_map) ?
> + last_alloc + 1 : 0;
> +
> + start = as_len = 0;
> + if (chunk->has_reserved)
> + start = pcpu_reserved_offset;
> +
> + /*
> + * If a bit is set in the allocation map, the bound_map identifies
> + * where the allocation ends. If the allocation is not set, the
> + * bound_map does not identify free areas as it is only kept accurate
> + * on allocation, not free.
> + *
> + * Positive values are allocations and negative values are free
> + * fragments.
> + */
> + while (start < last_alloc) {
> + if (test_bit(start, chunk->alloc_map)) {
> + end = find_next_bit(chunk->bound_map, last_alloc,
> + start + 1);
> + alloc_sizes[as_len] = 1;
> + } else {
> + end = find_next_bit(chunk->alloc_map, last_alloc,
> + start + 1);
> + alloc_sizes[as_len] = -1;
> }
>
> - sort(alloc_sizes, as_len, sizeof(chunk->map[0]), cmpint, NULL);
> + alloc_sizes[as_len++] *= (end - start) * PCPU_MIN_ALLOC_SIZE;
> +
> + start = end;
> + }
> +
> + /*
> + * The negative values are free fragments and thus sorting gives the
> + * free fragments at the beginning in largest first order.
> + */
> + if (as_len > 0) {
> + sort(alloc_sizes, as_len, sizeof(int), cmpint, NULL);
>
> - /* Iterate through the unallocated fragements. */
> + /* iterate through the unallocated fragments */
> for (i = 0, p = alloc_sizes; *p < 0 && i < as_len; i++, p++) {
> sum_frag -= *p;
> max_frag = max(max_frag, -1 * (*p));
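
As a worked example of the alloc_map/bound_map walk above (made-up
state): if bits 0-3 hold one allocation and bits 6-7 another, alloc_map
has bits 0-3 and 6-7 set while bound_map has bits 0, 4, 6 and 8 set.
Starting at bit 0, the walk finds bound bit 4 and reports a 16-byte
allocation, then reports bits 4-5 as an 8-byte free fragment, and so on.
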
> @@ -100,8 +121,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
> P("nr_alloc", chunk->nr_alloc);
> P("max_alloc_size", chunk->max_alloc_size);
> P("empty_pop_pages", chunk->nr_empty_pop_pages);
> - P("free_size", chunk->free_size);
> - P("contig_hint", chunk->contig_hint);
> + P("first_free_block", chunk->first_free_block);
> + P("free_size", chunk->free_bits * PCPU_MIN_ALLOC_SIZE);
> + P("contig_hint", chunk->contig_hint * PCPU_MIN_ALLOC_SIZE);
> P("sum_frag", sum_frag);
> P("max_frag", max_frag);
> P("cur_min_alloc", cur_min_alloc);
> @@ -113,22 +135,23 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
> static int percpu_stats_show(struct seq_file *m, void *v)
> {
> struct pcpu_chunk *chunk;
> - int slot, max_map_used;
> + int slot, max_nr_alloc;
> int *buffer;
>
> alloc_buffer:
> spin_lock_irq(&pcpu_lock);
> - max_map_used = find_max_map_used();
> + max_nr_alloc = find_max_nr_alloc();
> spin_unlock_irq(&pcpu_lock);
>
> - buffer = vmalloc(max_map_used * sizeof(pcpu_first_chunk->map[0]));
> + /* there can be at most this many free and allocated fragments */
> + buffer = vmalloc((2 * max_nr_alloc + 1) * sizeof(int));
> if (!buffer)
> return -ENOMEM;
>
> spin_lock_irq(&pcpu_lock);
>
> /* if the buffer allocated earlier is too small */
> - if (max_map_used < find_max_map_used()) {
> + if (max_nr_alloc < find_max_nr_alloc()) {
> spin_unlock_irq(&pcpu_lock);
> vfree(buffer);
> goto alloc_buffer;
> diff --git a/mm/percpu.c b/mm/percpu.c
> index fb01841..569df63 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -4,6 +4,9 @@
> * Copyright (C) 2009 SUSE Linux Products GmbH
> * Copyright (C) 2009 Tejun Heo <tj@xxxxxxxxxx>
> *
> + * Copyright (C) 2017 Facebook Inc.
> + * Copyright (C) 2017 Dennis Zhou <dennisszhou@xxxxxxxxx>
> + *
> * This file is released under the GPLv2 license.
> *
> * The percpu allocator handles both static and dynamic areas. Percpu
> @@ -34,19 +37,20 @@
> * percpu variables from kernel modules. Finally, the dynamic section
> * takes care of normal allocations.
> *
> - * Allocation state in each chunk is kept using an array of integers
> - * on chunk->map. A positive value in the map represents a free
> - * region and negative allocated. Allocation inside a chunk is done
> - * by scanning this map sequentially and serving the first matching
> - * entry. This is mostly copied from the percpu_modalloc() allocator.
> - * Chunks can be determined from the address using the index field
> - * in the page struct. The index field contains a pointer to the chunk.
> - *
> - * These chunks are organized into lists according to free_size and
> - * tries to allocate from the fullest chunk first. Each chunk maintains
> - * a maximum contiguous area size hint which is guaranteed to be equal
> - * to or larger than the maximum contiguous area in the chunk. This
> - * helps prevent the allocator from iterating over chunks unnecessarily.
> + * The allocator organizes chunks into lists according to free size and
> + * tries to allocate from the fullest chunk first. Each chunk is managed
> + * by a bitmap with metadata blocks. The allocation map is updated on
> + * every allocation to reflect the current state while the boundary map
> + * is only updated on allocation. Each metadata block contains
> + * information to help mitigate the need to iterate over large portions
> + * of the bitmap. The reverse mapping from page to chunk is stored in
> + * the page's index. Lastly, units are lazily backed and grow in unison.
> + *
> + * There is a unique conversion that goes on here between bytes and bits.
> + * The chunk tracks the number of pages it is responsible for in nr_pages.
> + * From there, helper functions are used to convert from physical pages
> + * to bitmap bits and bitmap blocks. All hints are managed in bits
> + * unless explicitly stated.
> *
> * To use this allocator, arch code should do the following:
> *
> @@ -86,10 +90,13 @@
>
> #include "percpu-internal.h"
>
> -#define PCPU_SLOT_BASE_SHIFT 5 /* 1-31 shares the same slot */
> -#define PCPU_DFL_MAP_ALLOC 16 /* start a map with 16 ents */
> -#define PCPU_ATOMIC_MAP_MARGIN_LOW 32
> -#define PCPU_ATOMIC_MAP_MARGIN_HIGH 64
> +/*
> + * The metadata is managed in terms of bits with each bit mapping to
> + * a fragment of size PCPU_MIN_ALLOC_SIZE. Thus, the slots are calculated
> + * with respect to the number of bits available.
> + */
> +#define PCPU_SLOT_BASE_SHIFT 3
> +
> #define PCPU_EMPTY_POP_PAGES_LOW 2
> #define PCPU_EMPTY_POP_PAGES_HIGH 4
>
> @@ -156,10 +163,11 @@ unsigned long pcpu_reserved_offset __ro_after_init;
> DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */
> static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */
>
> -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */
> -
> -/* chunks which need their map areas extended, protected by pcpu_lock */
> -static LIST_HEAD(pcpu_map_extend_chunks);
> +/*
> + * Chunk list slots. These slots order the chunks by the number of
> + * free bits available in the bitmap.
> + */
> +struct list_head *pcpu_slot __ro_after_init;
>
> /*
> * The number of empty populated pages, protected by pcpu_lock. The
> @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr)
> pcpu_reserved_chunk->nr_pages * PAGE_SIZE;
> }
>
> -static int __pcpu_size_to_slot(int size)
> +static int __pcpu_size_to_slot(int bit_size)
> {
> - int highbit = fls(size); /* size is in bytes */
> + int highbit = fls(bit_size); /* size is in bits */
> return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1);
> }
>
> -static int pcpu_size_to_slot(int size)
> +static int pcpu_size_to_slot(int bit_size)
> {
> - if (size == pcpu_unit_size)
> + if (bit_size == pcpu_pages_to_bits(pcpu_unit_pages))
> return pcpu_nr_slots - 1;
> - return __pcpu_size_to_slot(size);
> + return __pcpu_size_to_slot(bit_size);
> }
>
> static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
> {
> - if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int))
> + if (chunk->free_bits == 0 || chunk->contig_hint == 0)
> return 0;
>
> - return pcpu_size_to_slot(chunk->free_size);
> + return pcpu_size_to_slot(chunk->free_bits);
> }
>
> /* set the pointer to a chunk in a page struct */
> @@ -277,6 +285,37 @@ static void __maybe_unused pcpu_next_pop(struct pcpu_chunk *chunk,
> }
>
> /*
> + * The following are helper functions to help access bitmaps and convert
> + * between bitmap offsets to actual address offsets.
> + */
> +static unsigned long *pcpu_index_alloc_map(struct pcpu_chunk *chunk, int index)
> +{
> + return chunk->alloc_map +
> + (index * PCPU_BITMAP_BLOCK_SIZE / BITS_PER_LONG);
> +}
> +
> +static unsigned long pcpu_off_to_block_index(int off)
> +{
> + return off / PCPU_BITMAP_BLOCK_SIZE;
> +}
> +
> +static unsigned long pcpu_off_to_block_off(int off)
> +{
> + return off & (PCPU_BITMAP_BLOCK_SIZE - 1);
> +}
> +
> +static unsigned long pcpu_block_off_to_off(int index, int off)
> +{
> + return index * PCPU_BITMAP_BLOCK_SIZE + off;
> +}
> +
> +static unsigned long pcpu_block_get_first_page(int index)
> +{
> + return PFN_DOWN(index * PCPU_BITMAP_BLOCK_SIZE *
> + PCPU_MIN_ALLOC_SIZE);
> +}
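
A quick sanity check of these conversions, assuming a 1024-bit block
(4 KiB pages): bit offset 2500 maps to block index 2 and block offset
452, pcpu_block_off_to_off(2, 452) recovers 2500, and
pcpu_block_get_first_page(2) is PFN_DOWN(2 * 1024 * 4) = page 2.
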
> +
> +/*
> * (Un)populated page region iterators. Iterate over (un)populated
> * page regions between @start and @end in @chunk. @rs and @re should
> * be integer variables and will be set to start and end page index of
> @@ -329,38 +368,6 @@ static void pcpu_mem_free(void *ptr)
> }
>
> /**
> - * pcpu_count_occupied_pages - count the number of pages an area occupies
> - * @chunk: chunk of interest
> - * @i: index of the area in question
> - *
> - * Count the number of pages chunk's @i'th area occupies. When the area's
> - * start and/or end address isn't aligned to page boundary, the straddled
> - * page is included in the count iff the rest of the page is free.
> - */
> -static int pcpu_count_occupied_pages(struct pcpu_chunk *chunk, int i)
> -{
> - int off = chunk->map[i] & ~1;
> - int end = chunk->map[i + 1] & ~1;
> -
> - if (!PAGE_ALIGNED(off) && i > 0) {
> - int prev = chunk->map[i - 1];
> -
> - if (!(prev & 1) && prev <= round_down(off, PAGE_SIZE))
> - off = round_down(off, PAGE_SIZE);
> - }
> -
> - if (!PAGE_ALIGNED(end) && i + 1 < chunk->map_used) {
> - int next = chunk->map[i + 1];
> - int nend = chunk->map[i + 2] & ~1;
> -
> - if (!(next & 1) && nend >= round_up(end, PAGE_SIZE))
> - end = round_up(end, PAGE_SIZE);
> - }
> -
> - return max_t(int, PFN_DOWN(end) - PFN_UP(off), 0);
> -}
> -
> -/**
> * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
> * @chunk: chunk of interest
> * @oslot: the previous slot it was on
> @@ -386,385 +393,770 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
> }
>
> /**
> - * pcpu_need_to_extend - determine whether chunk area map needs to be extended
> + * pcpu_cnt_pop_pages - counts populated backing pages in range
> * @chunk: chunk of interest
> - * @is_atomic: the allocation context
> + * @start: start index
> + * @end: end index
> *
> - * Determine whether area map of @chunk needs to be extended. If
> - * @is_atomic, only the amount necessary for a new allocation is
> - * considered; however, async extension is scheduled if the left amount is
> - * low. If !@is_atomic, it aims for more empty space. Combined, this
> - * ensures that the map is likely to have enough available space to
> - * accomodate atomic allocations which can't extend maps directly.
> - *
> - * CONTEXT:
> - * pcpu_lock.
> + * Calculates the number of populated pages in the region [start, end).
> + * This lets us keep track of how many empty populated pages are available
> + * and decide if we should schedule async work.
> *
> * RETURNS:
> - * New target map allocation length if extension is necessary, 0
> - * otherwise.
> + * The nr of populated pages.
> */
> -static int pcpu_need_to_extend(struct pcpu_chunk *chunk, bool is_atomic)
> +static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk,
> + int start, int end)
> {
> - int margin, new_alloc;
> -
> - lockdep_assert_held(&pcpu_lock);
> + return bitmap_weight(chunk->populated, end) -
> + bitmap_weight(chunk->populated, start);
> +}
>
> - if (is_atomic) {
> - margin = 3;
> +/**
> + * pcpu_chunk_update_hint - updates metadata about a chunk
> + * @chunk: chunk of interest
> + *
> + * Responsible for iterating over metadata blocks to aggregate the
> + * overall statistics of the chunk.
> + *
> + * Updates:
> + * chunk->contig_hint
> + * chunk->contig_hint_start
> + * nr_empty_pop_pages
> + */
> +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk)
> +{
> + bool is_page_empty = true;
> + int i, off, cur_contig, nr_empty_pop_pages, l_pop_off;
> + struct pcpu_bitmap_md *block;
> +
> + chunk->contig_hint = cur_contig = 0;
> + off = nr_empty_pop_pages = 0;
> + l_pop_off = pcpu_block_get_first_page(chunk->first_free_block);
> +
> + for (i = chunk->first_free_block, block = chunk->md_blocks + i;
> + i < pcpu_nr_pages_to_blocks(chunk); i++, block++) {
> + /* Manage nr_empty_pop_pages.
> + *
> + * This is tricky. So the background work function is
> + * triggered when there are not enough free populated pages.
> + * This is necessary to make sure atomic allocations can
> + * succeed.
> + *
> + * The first page of each block is kept track of here allowing
> + * this to scale in both situations where there are > 1 page
> + * per block and where a block may be a portion of a page.
> + */
> + int pop_off = pcpu_block_get_first_page(i);
> +
> + if (pop_off > l_pop_off) {
> + if (is_page_empty)
> + nr_empty_pop_pages +=
> + pcpu_cnt_pop_pages(chunk, l_pop_off,
> + pop_off);
> + l_pop_off = pop_off;
> + is_page_empty = true;
> + }
> + if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE)
> + is_page_empty = false;
>
> - if (chunk->map_alloc <
> - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) {
> - if (list_empty(&chunk->map_extend_list)) {
> - list_add_tail(&chunk->map_extend_list,
> - &pcpu_map_extend_chunks);
> - pcpu_schedule_balance_work();
> + /* continue from prev block adding to the cur_contig hint */
> + if (cur_contig) {
> + cur_contig += block->left_free;
> + if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
> + continue;
> + } else if (cur_contig > chunk->contig_hint) {
> + chunk->contig_hint = cur_contig;
> + chunk->contig_hint_start = off;
> }
> + cur_contig = 0;
> }
> - } else {
> - margin = PCPU_ATOMIC_MAP_MARGIN_HIGH;
> + /* check if the block->contig_hint is larger */
> + if (block->contig_hint > chunk->contig_hint) {
> + chunk->contig_hint = block->contig_hint;
> + chunk->contig_hint_start =
> + pcpu_block_off_to_off(i,
> + block->contig_hint_start);
> + }
> + /* let the next iteration catch the right_free */
> + cur_contig = block->right_free;
> + off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free;
> }
>
> - if (chunk->map_alloc >= chunk->map_used + margin)
> - return 0;
> -
> - new_alloc = PCPU_DFL_MAP_ALLOC;
> - while (new_alloc < chunk->map_used + margin)
> - new_alloc *= 2;
> + /* catch last iteration if the last block ends with free space */
> + if (cur_contig > chunk->contig_hint) {
> + chunk->contig_hint = cur_contig;
> + chunk->contig_hint_start = off;
> + }
>
> - return new_alloc;
> + /*
> + * Keep track of nr_empty_pop_pages.
> + *
> + * The chunk maintains the previous number of free pages it held,
> + * so the delta is used to update the global counter. The reserved
> + * chunk is not part of the free page count as they are populated
> + * at init and are special to serving reserved allocations.
> + */
> + if (is_page_empty) {
> + nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off,
> + chunk->nr_pages);
> + }
> + if (chunk != pcpu_reserved_chunk)
> + pcpu_nr_empty_pop_pages +=
> + (nr_empty_pop_pages - chunk->nr_empty_pop_pages);
> + chunk->nr_empty_pop_pages = nr_empty_pop_pages;
> }
>
> /**
> - * pcpu_extend_area_map - extend area map of a chunk
> + * pcpu_block_refresh_hint
> * @chunk: chunk of interest
> - * @new_alloc: new target allocation length of the area map
> + * @index: block index of the metadata block
> *
> - * Extend area map of @chunk to have @new_alloc entries.
> + * Full scan over the entire block to recalculate block-level metadata.
> + */
> +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
> +{
> + unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
> + struct pcpu_bitmap_md *block = chunk->md_blocks + index;
> + bool is_left_free = false, is_right_free = false;
> + int contig;
> + unsigned long start, end;
> +
> + block->contig_hint = 0;
> + start = end = block->first_free;
> + while (start < PCPU_BITMAP_BLOCK_SIZE) {
> + /*
> + * Scans the allocation map corresponding to this block
> + * to find free fragments and update metadata accordingly.
> + */
> + start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE,
> + start);
> + if (start >= PCPU_BITMAP_BLOCK_SIZE)
> + break;
> + /* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */
> + end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start);
> + /* update left_free */
> + contig = end - start;
> + if (start == 0) {
> + block->left_free = contig;
> + is_left_free = true;
> + }
> + /* update right_free */
> + if (end == PCPU_BITMAP_BLOCK_SIZE) {
> + block->right_free = contig;
> + is_right_free = true;
> + }
> + /* update block contig_hints */
> + if (block->contig_hint < contig) {
> + block->contig_hint = contig;
> + block->contig_hint_start = start;
> + }
> + start = end;
> + }
> +
> + /* clear left/right free hints */
> + if (!is_left_free)
> + block->left_free = 0;
> + if (!is_right_free)
> + block->right_free = 0;
> +}
> +
> +/**
> + * pcpu_block_update_hint_alloc - update hint on allocation path
> + * @chunk: chunk of interest
> + * @bit_off: bitmap offset
> + * @bit_size: size of request in allocation units
> *
> - * CONTEXT:
> - * Does GFP_KERNEL allocation. Grabs and releases pcpu_lock.
> + * Updates metadata for the allocation path. The metadata only has to be
> + * refreshed by a full scan iff we break the largest contig region.
> *
> * RETURNS:
> - * 0 on success, -errno on failure.
> + * Bool if we need to update the chunk's metadata. This occurs only if we
> + * break the chunk's contig hint.
> */
> -static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc)
> +static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
> + int bit_size)
> {
> - int *old = NULL, *new = NULL;
> - size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
> - unsigned long flags;
> -
> - lockdep_assert_held(&pcpu_alloc_mutex);
> + bool update_chunk = false;
> + int i;
> + int s_index, e_index, s_off, e_off;
> + struct pcpu_bitmap_md *s_block, *e_block, *block;
>
> - new = pcpu_mem_zalloc(new_size);
> - if (!new)
> - return -ENOMEM;
> + /* calculate per block offsets */
> + s_index = pcpu_off_to_block_index(bit_off);
> + e_index = pcpu_off_to_block_index(bit_off + bit_size);
> + s_off = pcpu_off_to_block_off(bit_off);
> + e_off = pcpu_off_to_block_off(bit_off + bit_size);
>
> - /* acquire pcpu_lock and switch to new area map */
> - spin_lock_irqsave(&pcpu_lock, flags);
> + /*
> + * If the offset is the beginning of the next block, set it to the
> + * end of the previous block as the last bit is exclusive.
> + */
> + if (e_off == 0) {
> + e_off = PCPU_BITMAP_BLOCK_SIZE;
> + e_index--;
> + }
>
> - if (new_alloc <= chunk->map_alloc)
> - goto out_unlock;
> + s_block = chunk->md_blocks + s_index;
> + e_block = chunk->md_blocks + e_index;
>
> - old_size = chunk->map_alloc * sizeof(chunk->map[0]);
> - old = chunk->map;
> + /*
> + * Update s_block.
> + *
> + * block->first_free must be updated if the allocation takes its place.
> + * If the allocation breaks the contig_hint, a scan is required to
> + * restore this hint.
> + */
> + if (s_off == s_block->first_free)
> + s_block->first_free = find_next_zero_bit(
> + pcpu_index_alloc_map(chunk, s_index),
> + PCPU_BITMAP_BLOCK_SIZE,
> + s_off + bit_size);
> +
> + if (s_off >= s_block->contig_hint_start &&
> + s_off < s_block->contig_hint_start + s_block->contig_hint) {
> + pcpu_block_refresh_hint(chunk, s_index);
> + } else {
> + /* update left and right contig manually */
> + s_block->left_free = min(s_block->left_free, s_off);
> + if (s_index == e_index)
> + s_block->right_free = min_t(int, s_block->right_free,
> + PCPU_BITMAP_BLOCK_SIZE - e_off);
> + else
> + s_block->right_free = 0;
> + }
>
> - memcpy(new, old, old_size);
> + /*
> + * Update e_block.
> + * If s_index != e_index, then e_block's first_free is guaranteed to
> + * be at or after e_off. first_free must be updated and a scan
> + * over e_block is issued.
> + */
> + if (s_index != e_index) {
> + e_block->first_free = find_next_zero_bit(
> + pcpu_index_alloc_map(chunk, e_index),
> + PCPU_BITMAP_BLOCK_SIZE, e_off);
>
> - chunk->map_alloc = new_alloc;
> - chunk->map = new;
> - new = NULL;
> + pcpu_block_refresh_hint(chunk, e_index);
> + }
>
> -out_unlock:
> - spin_unlock_irqrestore(&pcpu_lock, flags);
> + /* update in-between md_blocks */
> + for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index;
> + i++, block++) {
> + block->contig_hint = 0;
> + block->left_free = 0;
> + block->right_free = 0;
> + }
>
> /*
> - * pcpu_mem_free() might end up calling vfree() which uses
> - * IRQ-unsafe lock and thus can't be called under pcpu_lock.
> + * The only time a full chunk scan is required is if the global
> + * contig_hint is broken. Otherwise, it means a smaller space
> + * was used and therefore the global contig_hint is still correct.
> */
> - pcpu_mem_free(old);
> - pcpu_mem_free(new);
> + if (bit_off >= chunk->contig_hint_start &&
> + bit_off < chunk->contig_hint_start + chunk->contig_hint)
> + update_chunk = true;
>
> - return 0;
> + return update_chunk;
> }
>
> /**
> - * pcpu_fit_in_area - try to fit the requested allocation in a candidate area
> - * @chunk: chunk the candidate area belongs to
> - * @off: the offset to the start of the candidate area
> - * @this_size: the size of the candidate area
> - * @size: the size of the target allocation
> - * @align: the alignment of the target allocation
> - * @pop_only: only allocate from already populated region
> - *
> - * We're trying to allocate @size bytes aligned at @align. @chunk's area
> - * at @off sized @this_size is a candidate. This function determines
> - * whether the target allocation fits in the candidate area and returns the
> - * number of bytes to pad after @off. If the target area doesn't fit, -1
> - * is returned.
> - *
> - * If @pop_only is %true, this function only considers the already
> - * populated part of the candidate area.
> + * pcpu_block_update_hint_free - updates the block hints on the free path
> + * @chunk: chunk of interest
> + * @bit_off: bitmap offset
> + * @bit_size: size of request in allocation units
> + *
> + * Updates the hint along the free path by taking advantage of current metadata
> + * to minimize scanning of the bitmap. Triggers a global update if an entire
> + * block becomes free or the free spans across blocks. This tradeoff is to
> + * minimize global scanning to update the chunk->contig_hint. The
> + * chunk->contig_hint may be off by up to a block, but a chunk->contig_hint
> + * will never be more than the available space. If the chunk->contig_hint is
> + * in this block, it will be accurate.
> + *
> + * RETURNS:
> + * Bool if we need to update the chunk's metadata. This occurs if a larger
> + * contig region is created along the edges or we free across blocks.
> */
> -static int pcpu_fit_in_area(struct pcpu_chunk *chunk, int off, int this_size,
> - int size, int align, bool pop_only)
> +static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
> + int bit_size)
> {
> - int cand_off = off;
> + bool update_chunk = false;
> + int i;
> + int s_index, e_index, s_off, e_off;
> + int start, end, contig;
> + struct pcpu_bitmap_md *s_block, *e_block, *block;
>
> - while (true) {
> - int head = ALIGN(cand_off, align) - off;
> - int page_start, page_end, rs, re;
> + /* calculate per block offsets */
> + s_index = pcpu_off_to_block_index(bit_off);
> + e_index = pcpu_off_to_block_index(bit_off + bit_size);
> + s_off = pcpu_off_to_block_off(bit_off);
> + e_off = pcpu_off_to_block_off(bit_off + bit_size);
> +
> + /*
> + * If the offset is the beginning of the next block, set it to the
> + * end of the previous block as the last bit is exclusive.
> + */
> + if (e_off == 0) {
> + e_off = PCPU_BITMAP_BLOCK_SIZE;
> + e_index--;
> + }
> +
> + s_block = chunk->md_blocks + s_index;
> + e_block = chunk->md_blocks + e_index;
> +
> + /*
> + * Check if the freed area aligns with the block->contig_hint.
> + * If it does, then the scan to find the beginning/end of the
> + * larger free area can be avoided.
> + *
> + * start and end refer to beginning and end of the free region
> + * within their respective blocks. This is not necessarily
> + * the entire free region as it may span blocks past the beginning
> + * or end of the block.
> + */
> + start = s_off;
> + if (s_off == s_block->contig_hint + s_block->contig_hint_start) {
> + start = s_block->contig_hint_start;
> + } else {
> + int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index),
> + start);
> + start = (start == l_bit) ? 0 : l_bit + 1;
> + }
> +
> + end = e_off;
> + if (e_off == e_block->contig_hint_start)
> + end = e_block->contig_hint_start + e_block->contig_hint;
> + else
> + end = find_next_bit(pcpu_index_alloc_map(chunk, e_index),
> + PCPU_BITMAP_BLOCK_SIZE, end);
>
> - if (this_size < head + size)
> - return -1;
> + /* freeing in the same block */
> + if (s_index == e_index) {
> + contig = end - start;
>
> - if (!pop_only)
> - return head;
> + if (start == 0)
> + s_block->left_free = contig;
>
> + if (end == PCPU_BITMAP_BLOCK_SIZE)
> + s_block->right_free = contig;
> +
> + s_block->first_free = min(s_block->first_free, start);
> + if (contig > s_block->contig_hint) {
> + s_block->contig_hint = contig;
> + s_block->contig_hint_start = start;
> + }
> +
> + } else {
> /*
> - * If the first unpopulated page is beyond the end of the
> - * allocation, the whole allocation is populated;
> - * otherwise, retry from the end of the unpopulated area.
> + * Freeing across md_blocks.
> + *
> + * If the start is at the beginning of the block, just
> + * reset the block instead.
> */
> - page_start = PFN_DOWN(head + off);
> - page_end = PFN_UP(head + off + size);
> -
> - rs = page_start;
> - pcpu_next_unpop(chunk, &rs, &re, PFN_UP(off + this_size));
> - if (rs >= page_end)
> - return head;
> - cand_off = re * PAGE_SIZE;
> + if (start == 0) {
> + s_index--;
> + } else {
> + /*
> + * Knowing that the free is across blocks, this means
> + * the hint can be updated on the right side and the
> + * left side does not need to be touched.
> + */
> + s_block->first_free = min(s_block->first_free, start);
> + contig = PCPU_BITMAP_BLOCK_SIZE - start;
> + s_block->right_free = contig;
> + if (contig > s_block->contig_hint) {
> + s_block->contig_hint = contig;
> + s_block->contig_hint_start = start;
> + }
> + }
> + /*
> + * If end is the entire e_block, just reset the block
> + * as well.
> + */
> + if (end == PCPU_BITMAP_BLOCK_SIZE) {
> + e_index++;
> + } else {
> + /*
> + * The hint must only be on the left side, so
> + * update accordingly.
> + */
> + e_block->first_free = 0;
> + e_block->left_free = end;
> + if (end > e_block->contig_hint) {
> + e_block->contig_hint = end;
> + e_block->contig_hint_start = 0;
> + }
> + }
> +
> + /* reset md_blocks in the middle */
> + for (i = s_index + 1, block = chunk->md_blocks + i;
> + i < e_index; i++, block++) {
> + block->first_free = 0;
> + block->contig_hint_start = 0;
> + block->contig_hint = PCPU_BITMAP_BLOCK_SIZE;
> + block->left_free = PCPU_BITMAP_BLOCK_SIZE;
> + block->right_free = PCPU_BITMAP_BLOCK_SIZE;
> + }
> }
> +
> + /*
> + * The hint is only checked in the s_block and e_block when
> + * freeing and particularly only when it is self contained within
> + * its own block. A scan is required if the free space spans
> + * blocks or makes a block whole as the scan will take into
> + * account free space across blocks.
> + */
> + if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) ||
> + s_index != e_index) {
> + update_chunk = true;
> + } else if (s_block->contig_hint > chunk->contig_hint) {
> + /* check if block contig_hint is bigger */
> + chunk->contig_hint = s_block->contig_hint;
> + chunk->contig_hint_start =
> + pcpu_block_off_to_off(s_index,
> + s_block->contig_hint_start);
> + }
> +
> + return update_chunk;
> }
>
> /**
> - * pcpu_alloc_area - allocate area from a pcpu_chunk
> + * pcpu_is_populated - determines if the region is populated
> * @chunk: chunk of interest
> - * @size: wanted size in bytes
> - * @align: wanted align
> - * @pop_only: allocate only from the populated area
> - * @occ_pages_p: out param for the number of pages the area occupies
> + * @index: block index
> + * @block_off: offset within the bitmap
> + * @bit_size: size of request in allocation units
> + * @next_index: return value for next block index that is populated
> *
> - * Try to allocate @size bytes area aligned at @align from @chunk.
> - * Note that this function only allocates the offset. It doesn't
> - * populate or map the area.
> + * For atomic allocations, we must check if the backing pages are populated.
> *
> - * @chunk->map must have at least two free slots.
> + * RETURNS:
> + * Bool if the backing pages are populated. next_index is to skip over
> + * unpopulated blocks in pcpu_find_block_fit.
> + */
> +static bool pcpu_is_populated(struct pcpu_chunk *chunk, int index,
> + int block_off, int bit_size, int *next_index)
> +{
> + int page_start, page_end, rs, re;
> + int off = pcpu_block_off_to_off(index, block_off);
> + int e_off = off + bit_size * PCPU_MIN_ALLOC_SIZE;
> +
> + page_start = PFN_DOWN(off);
> + page_end = PFN_UP(e_off);
> +
> + rs = page_start;
> + pcpu_next_unpop(chunk, &rs, &re, PFN_UP(e_off));
> + if (rs >= page_end)
> + return true;
> + *next_index = re * PAGE_SIZE / PCPU_BITMAP_BLOCK_SIZE;
> + return false;
> +}
> +
> +/**
> + * pcpu_find_block_fit - finds the block index to start searching
> + * @chunk: chunk of interest
> + * @bit_size: size of request in allocation units
> + * @align: alignment of area (max PAGE_SIZE)
> + * @pop_only: use populated regions only
> *
> - * CONTEXT:
> - * pcpu_lock.
> + * Given a chunk and an allocation spec, find the offset to begin searching
> + * for a free region. This is done by iterating over the bitmap metadata
> + * blocks and then only returning regions that will be guaranteed to fit
> + * alignment by comparing against the block->contig_hint_start or a correctly
> + * aligned offset. Iteration is used within a block as an allocation may be
> + * able to be served prior to the contig_hint.
> + *
> + * Note: This errs on the side of caution by only selecting blocks guaranteed
> + * to have a fit in the chunk's contig_hint. Poor alignment can cause
> + * us to skip over chunks that have valid vacancies.
> *
> * RETURNS:
> - * Allocated offset in @chunk on success, -1 if no matching area is
> - * found.
> + * The offset in the bitmap to begin searching.
> + * -1 if no offset is found.
> */
> -static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align,
> - bool pop_only, int *occ_pages_p)
> +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size,
> + size_t align, bool pop_only)
> {
> - int oslot = pcpu_chunk_slot(chunk);
> - int max_contig = 0;
> - int i, off;
> - bool seen_free = false;
> - int *p;
> -
> - for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) {
> - int head, tail;
> - int this_size;
> -
> - off = *p;
> - if (off & 1)
> - continue;
> + int i, cur_free;
> + int s_index, block_off, next_index, end_off; /* interior alloc index */
> + struct pcpu_bitmap_md *block;
> + unsigned long *alloc_map;
>
> - this_size = (p[1] & ~1) - off;
> + lockdep_assert_held(&pcpu_lock);
>
> - head = pcpu_fit_in_area(chunk, off, this_size, size, align,
> - pop_only);
> - if (head < 0) {
> - if (!seen_free) {
> - chunk->first_free = i;
> - seen_free = true;
> - }
> - max_contig = max(this_size, max_contig);
> + cur_free = block_off = 0;
> + s_index = chunk->first_free_block;
> + for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk);
> + i++) {
> + alloc_map = pcpu_index_alloc_map(chunk, i);
> + block = chunk->md_blocks + i;
> +
> + /* continue from prev block */
> + cur_free += block->left_free;
> + if (cur_free >= bit_size) {
> + end_off = bit_size;
> + goto check_populated;
> + } else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) {
> continue;
> }
>
> /*
> - * If head is small or the previous block is free,
> - * merge'em. Note that 'small' is defined as smaller
> - * than sizeof(int), which is very small but isn't too
> - * uncommon for percpu allocations.
> + * Can this block hold this alloc?
> + *
> + * Here the block->contig_hint is used to guarantee a fit,
> + * but the block->first_free is returned as we may be able
> + * to serve the allocation earlier. The population check
> + * must take into account the area beginning at first_free
> + * through the end of the contig_hint.
> */
> - if (head && (head < sizeof(int) || !(p[-1] & 1))) {
> - *p = off += head;
> - if (p[-1] & 1)
> - chunk->free_size -= head;
> - else
> - max_contig = max(*p - p[-1], max_contig);
> - this_size -= head;
> - head = 0;
> + cur_free = 0;
> + s_index = i;
> + block_off = ALIGN(block->contig_hint_start, align);
> + block_off -= block->contig_hint_start;
> + if (block->contig_hint >= block_off + bit_size) {
> + block_off = block->first_free;
> + end_off = block->contig_hint_start - block_off +
> + bit_size;
> + goto check_populated;
> }
>
> - /* if tail is small, just keep it around */
> - tail = this_size - head - size;
> - if (tail < sizeof(int)) {
> - tail = 0;
> - size = this_size - head;
> + /* check right */
> + block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free,
> + align);
> + /* reset to start looking in the next block */
> + if (block_off >= PCPU_BITMAP_BLOCK_SIZE) {
> + s_index++;
> + cur_free = block_off = 0;
> + continue;
> }
> + cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off;
> + if (cur_free >= bit_size) {
> + end_off = bit_size;
> +check_populated:
> + if (!pop_only ||
> + pcpu_is_populated(chunk, s_index, block_off,
> + end_off, &next_index))
> + break;
>
> - /* split if warranted */
> - if (head || tail) {
> - int nr_extra = !!head + !!tail;
> -
> - /* insert new subblocks */
> - memmove(p + nr_extra + 1, p + 1,
> - sizeof(chunk->map[0]) * (chunk->map_used - i));
> - chunk->map_used += nr_extra;
> -
> - if (head) {
> - if (!seen_free) {
> - chunk->first_free = i;
> - seen_free = true;
> - }
> - *++p = off += head;
> - ++i;
> - max_contig = max(head, max_contig);
> - }
> - if (tail) {
> - p[1] = off + size;
> - max_contig = max(tail, max_contig);
> - }
> + i = next_index - 1;
> + s_index = next_index;
> + cur_free = block_off = 0;
> }
> + }
>
> - if (!seen_free)
> - chunk->first_free = i + 1;
> + /* nothing found */
> + if (i == pcpu_nr_pages_to_blocks(chunk))
> + return -1;
>
> - /* update hint and mark allocated */
> - if (i + 1 == chunk->map_used)
> - chunk->contig_hint = max_contig; /* fully scanned */
> - else
> - chunk->contig_hint = max(chunk->contig_hint,
> - max_contig);
> + return s_index * PCPU_BITMAP_BLOCK_SIZE + block_off;
> +}
>
> - chunk->free_size -= size;
> - *p |= 1;
>
> - *occ_pages_p = pcpu_count_occupied_pages(chunk, i);
> - pcpu_chunk_relocate(chunk, oslot);
> - return off;
> - }
> +/**
> + * pcpu_alloc_area - allocates area from a pcpu_chunk
> + * @chunk: chunk of interest
> + * @bit_size: size of request in allocation units
> + * @align: alignment of area (max PAGE_SIZE)
> + * @start: bit_off to start searching
> + *
> + * This function takes in a start bit_offset to begin searching. It
> + * searches the allocation bitmap to verify that the offset is available
> + * as block->first_free is provided when allocation within a block is
> + * available.
> + *
> + * RETURNS:
> + * Allocated addr offset in @chunk on success,
> + * -1 if no matching area is found
> + */
> +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size,
> + size_t align, int start)
> +{
> + size_t align_mask = (align) ? (align - 1) : 0;
> + int i, bit_off, oslot;
> + struct pcpu_bitmap_md *block;
> +
> + lockdep_assert_held(&pcpu_lock);
> +
> + oslot = pcpu_chunk_slot(chunk);
> +
> + /* search to find fit */
> + bit_off = bitmap_find_next_zero_area(chunk->alloc_map,
> + pcpu_nr_pages_to_bits(chunk),
> + start, bit_size, align_mask);
> +
> + if (bit_off >= pcpu_nr_pages_to_bits(chunk))
> + return -1;
> +
> + /* update alloc map */
> + bitmap_set(chunk->alloc_map, bit_off, bit_size);
> + /* update boundary map */
> + set_bit(bit_off, chunk->bound_map);
> + bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1);
> + set_bit(bit_off + bit_size, chunk->bound_map);
> +
Actually I decided I do want to complain about this. Have you considered making
chunks statically sized, like slab does? We could avoid this whole bound_map
thing completely and save quite a few cycles trying to figure out how big our
allocation was. Thanks,
Josef