segregated list + slab merging is much better than original SLOB

From: Hyeonggon Yoo
Date: Sun Oct 17 2021 - 09:36:31 EST


On Sun, Oct 17, 2021 at 04:28:52AM +0000, Hyeonggon Yoo wrote:
> I've been reading the SLUB/SLOB code for a while. SLUB recently became
> real-time compatible by reducing the scope of its locking.
>
> For now, SLUB is the only slab allocator for PREEMPT_RT, because
> it works better than SLAB on RT and SLOB uses a non-deterministic method,
> sequential fit.
>
> But the memory usage of SLUB is too high for systems with little memory.
> So in my local repository I made SLOB use the segregated free list
> method, which is more deterministic, to provide bounded latency.
>
> This can be done by managing a global list of partial pages for every
> power-of-two size (8, 16, 32, ..., PAGE_SIZE) per NUMA node.
> The minimal allocation size is the size of a pointer, so that a free
> object can hold the pointer to the next free object, as in SLUB.
>
> By making all objects in the same page the same size, there's no
> need to iterate over the free blocks in a page. (Iterating over pages
> isn't needed either.)
>
> Some cleanups and more tests (especially with NUMA/RT configs) are needed,
> but I want to hear your opinion about the idea. I have not tested on RT yet.
>
> Below are the benchmark results and memory usage (on !RT).
> With a 13% increase in memory usage, it is nine times faster, has
> bounded fragmentation, and, importantly, provides predictable execution time.
>
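
As a rough illustration of the power-of-two size classes described in the
quoted text, here is a minimal user-space sketch. It is not taken from the
patch: toy_size_class(), TOY_MIN_SIZE and TOY_PAGE_SIZE are made-up names,
and a 4096-byte page is assumed.

```c
#include <stddef.h>

#define TOY_MIN_SIZE  8      /* smallest class, as in the list (8, 16, 32, ...) */
#define TOY_PAGE_SIZE 4096   /* assumed page size, largest class */

/*
 * Map a request size to the index of the smallest power-of-two class
 * that fits it: 0 for sizes up to 8, 1 for up to 16, and so on.
 * Returns -1 if the request is larger than a page.
 */
static int toy_size_class(size_t size)
{
    size_t class_size = TOY_MIN_SIZE;
    int idx = 0;

    if (size > TOY_PAGE_SIZE)
        return -1;

    while (class_size < size) {
        class_size <<= 1;
        idx++;
    }
    return idx;
}
```

The number of loop iterations is bounded by the number of classes, so the
lookup cost does not depend on how fragmented the pages are.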

Hello linux-mm, I improved it: it now uses less memory than the original
SLOB and is 9x~13x faster. It also shows much less fragmentation
after hackbench.

Rather than managing a global freelist for each power-of-2 size,
I made each kmem_cache manage its own freelist (one per NUMA node) and
added support for slab merging. So it now looks quite like a lightweight SLUB.
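
To illustrate the per-cache freelist idea, here is a minimal user-space
sketch, not the actual patch. It assumes one 4096-byte page per cache and
leaves out locking, NUMA handling and slab merging; struct toy_cache,
toy_cache_init(), toy_alloc() and toy_free() are hypothetical names.
Each free object stores the pointer to the next free object in its first
word, so both allocation and free are constant-time list operations.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096   /* assumed page size */

/* Hypothetical stand-in for a kmem_cache that owns a single page. */
struct toy_cache {
    size_t obj_size;                              /* all objects share this size */
    void *freelist;                               /* first free object, or NULL */
    void *page[TOY_PAGE_SIZE / sizeof(void *)];   /* pointer-aligned backing page */
};

static void toy_cache_init(struct toy_cache *c, size_t size)
{
    unsigned char *base = (unsigned char *)c->page;
    size_t align = sizeof(void *);
    size_t i;

    /* Round up so every free object can hold the next-free pointer. */
    size = (size + align - 1) / align * align;
    if (size == 0)
        size = align;
    assert(size <= TOY_PAGE_SIZE);

    c->obj_size = size;
    c->freelist = NULL;

    /* Push objects from the last to the first so the list starts at offset 0. */
    for (i = TOY_PAGE_SIZE / size; i-- > 0; ) {
        void **obj = (void **)(base + i * size);
        *obj = c->freelist;
        c->freelist = obj;
    }
}

/* Pop the first free object; no searching inside the page is needed. */
static void *toy_alloc(struct toy_cache *c)
{
    void **obj = c->freelist;

    if (!obj)
        return NULL;
    c->freelist = *obj;
    return obj;
}

/* Push the object back onto the head of the freelist. */
static void toy_free(struct toy_cache *c, void *p)
{
    void **obj = p;

    *obj = c->freelist;
    c->freelist = obj;
}
```

In this sketch a freed object is the next one handed out (LIFO), and
toy_alloc() returns NULL once the page is exhausted; a real allocator
would take another page at that point.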

I'll send an RFC patch after some testing and code cleanup.

I think it is more RT-friendly because it uses a more deterministic
algorithm (but the lock is still shared among CPUs). Any opinions on RT?

current SLOB:
memory usage:
after boot:
Slab: 7908 kB
after hackbench:
Slab: 8544 kB

hackbench:
Time: 189.947
Performance counter stats for 'hackbench -g 4 -l 10000':
     379413.20 msec cpu-clock         #    1.997 CPUs utilized
       8818226      context-switches  #   23.242 K/sec
        375186      cpu-migrations    #  988.859 /sec
          3954      page-faults       #   10.421 /sec
  269923095290      cycles            #    0.711 GHz
  212341582012      instructions      #     0.79 insn per cycle
    2361087153      branch-misses
   58222839688      cache-references  #  153.455 M/sec
    6786521959      cache-misses      #   11.656 % of all cache refs

190.002062273 seconds time elapsed

3.486150000 seconds user
375.599495000 seconds sys

SLOB with segregated list + slab merging:
memory usage:
after boot:
Slab: 7560 kB
after hackbench:
Slab: 7836 kB

hackbench:
Time: 20.780
Performance counter stats for 'hackbench -g 4 -l 10000':
      41509.79 msec cpu-clock         #    1.996 CPUs utilized
        630032      context-switches  #   15.178 K/sec
          8287      cpu-migrations    #  199.640 /sec
          4036      page-faults       #   97.230 /sec
   57477161020      cycles            #    1.385 GHz
   62775453932      instructions      #     1.09 insn per cycle
     164902523      branch-misses
   22559952993      cache-references  #  543.485 M/sec
     832404011      cache-misses      #    3.690 % of all cache refs

20.791893590 seconds time elapsed

1.423282000 seconds user
40.072449000 seconds sys
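
For reference, the ratios implied by the numbers above can be checked with
two small helpers. This is only arithmetic on the figures reported in this
mail; speedup() and percent_saved() are names made up for illustration.

```c
/* How many times faster the new run is: old elapsed time / new elapsed time. */
static double speedup(double old_seconds, double new_seconds)
{
    return old_seconds / new_seconds;
}

/* Relative reduction, in percent, going from old_value to new_value. */
static double percent_saved(double old_value, double new_value)
{
    return (old_value - new_value) / old_value * 100.0;
}
```

With the hackbench times above, speedup(189.947, 20.780) is about 9.1.
For Slab usage, percent_saved(7908, 7560) is about 4.4 after boot and
percent_saved(8544, 7836) is about 8.3 after hackbench.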
-
Thanks,
Hyeonggon