Re: [RFC PATCH v3 00/10] kmem_cache instances with static storage duration

From: Harry Yoo

Date: Mon Jun 29 2026 - 11:33:50 EST



[Adding a new slab reviewer Hao Li and dropping my old email]

The thread:
https://lore.kernel.org/linux-mm/20260613050951.855141-1-viro@xxxxxxxxxxxxxxxxxx

Trying to follow the discussion here...

On 6/24/26 9:48 AM, Al Viro wrote:
> On Tue, Jun 23, 2026 at 10:09:41AM +0200, Vlastimil Babka (SUSE) wrote:
>
>> But the argument for doing the static duration support is that it should be
>> faster, not just "not slower"? So is runtime_const equivalent or for some
>> fundamental reason it's slower than plain &?
>
> Yes, on any 64bit RISC. And if nothing else, arm64 has enough users to care
> about.

Agreed that arm64 has enough users to care about performance.

Out of curiosity, could you please share the workload where this
(dereferencing cache pointers) showed up in the profile, and how bad it was?

> Compiler does *not* build the address of global variable in a sequence of
> shifts and bitwise operations when it needs to pass it to a function.
>
> runtime_const_ptr() must be able to handle an arbitrary address; it can't
> avoid doing the general "build a 64bit value in register", which tends to
> be nasty on RISC.

It'd be nice to have some examples in the cover letter.

So if I'm following correctly, arm64, for example, uses four
instructions to build a 64-bit value in a register (one instruction per
16-bit chunk) in runtime_const_ptr().
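
To make sure I have the mental model right, here is a rough userspace
sketch of that four-step build, assuming the usual MOVZ + 3x MOVK
pattern. The helper names (chunk16, movz, movk, build_const64) are made
up for illustration; this is not the kernel's code, only an emulation of
what each instruction does to the register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the 16-bit chunk of val at position idx (0 = lowest). */
static uint16_t chunk16(uint64_t val, unsigned int idx)
{
	assert(idx < 4);
	return (uint16_t)(val >> (16 * idx));
}

/* Emulates MOVZ: place one 16-bit immediate at 'shift', zero everything else. */
static uint64_t movz(uint16_t imm, unsigned int shift)
{
	return (uint64_t)imm << shift;
}

/* Emulates MOVK: replace one 16-bit field at 'shift', keep the other bits. */
static uint64_t movk(uint64_t reg, uint16_t imm, unsigned int shift)
{
	reg &= ~((uint64_t)0xffff << shift);
	return reg | ((uint64_t)imm << shift);
}

/* Four steps, each one depending on the result of the previous one. */
static uint64_t build_const64(uint64_t val)
{
	uint64_t reg;

	reg = movz(chunk16(val, 0), 0);
	reg = movk(reg, chunk16(val, 1), 16);
	reg = movk(reg, chunk16(val, 2), 32);
	reg = movk(reg, chunk16(val, 3), 48);
	return reg;
}
```

Since the value is arbitrary, none of the four steps can be dropped in
the general case.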

And by making it plain &, the compiler can use clever tricks to get away
with fewer instructions?

e.g., a quick search [1] says that on arm64 the address of any global
variable can be built with two instructions: one (ADRP) to generate the
page-aligned address using PC-relative addressing (within +/- 4GB), and
another (ADD) to fill in the lower 12 bits.

That's what makes it faster than runtime_const(), right?
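
Again as a rough userspace sketch of my understanding (not kernel code):
ADRP takes the 4KB page of the current PC and adds a signed 21-bit page
offset, and ADD then supplies the low 12 bits of the target. The helper
names (page_delta, adrp, addr_of) and the addresses used to exercise
them are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Signed distance in 4KB pages from pc's page to target's page. */
static int64_t page_delta(uint64_t pc, uint64_t target)
{
	return (int64_t)(target >> 12) - (int64_t)(pc >> 12);
}

/*
 * Emulates ADRP: page of pc plus a signed 21-bit page offset.
 * 21 bits of pages at 4KB each is where the +/- 4GB range comes from.
 */
static uint64_t adrp(uint64_t pc, int64_t delta)
{
	assert(delta >= -(1LL << 20) && delta < (1LL << 20));
	/* Unsigned arithmetic so that negative deltas wrap as intended. */
	return (pc & ~0xfffULL) + (uint64_t)delta * 4096;
}

/* ADRP followed by ADD with the low 12 bits of the target. */
static uint64_t addr_of(uint64_t pc, uint64_t target)
{
	uint64_t page = adrp(pc, page_delta(pc, target));

	return page + (target & 0xfffULL);
}
```

The two immediates depend only on the distance between the code and the
variable, which is why this works for plain & but not for an arbitrary
pointer value.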

[1] https://devblogs.microsoft.com/oldnewthing/20220809-00/?p=106955

> If you want real ugliness, take a look at riscv - AFAICS, they wanted to
> avoid a long dependency chain, so they load chunks of constant into 4 registers,
> then shift and combine those.

--
Cheers,
Harry / Hyeonggon