Re: [RFC PATCH 00/11] Add a percpu subsection for hot data

From: Ingo Molnar
Date: Sun Feb 23 2025 - 04:37:14 EST



* Brian Gerst <brgerst@xxxxxxxxx> wrote:

> Add a new percpu subsection for data that is frequently accessed and
> exclusive to each processor. This is intended to replace the pcpu_hot
> struct on X86, and is available to all architectures.
>
> The one caveat with this approach is that it depends on the linker to
> efficiently pack data that is smaller than machine word size. The
> binutils linker does this properly:
>
> ffffffff842f6000 D __per_cpu_hot_start
> ffffffff842f6000 D softirq_pending
> ffffffff842f6002 D hardirq_stack_inuse
> ffffffff842f6008 D hardirq_stack_ptr
> ffffffff842f6010 D __ref_stack_chk_guard
> ffffffff842f6010 D __stack_chk_guard
> ffffffff842f6018 D const_cpu_current_top_of_stack
> ffffffff842f6018 D cpu_current_top_of_stack
> ffffffff842f6020 D const_current_task
> ffffffff842f6020 D current_task
> ffffffff842f6028 D __preempt_count
> ffffffff842f602c D cpu_number
> ffffffff842f6030 D this_cpu_off
> ffffffff842f6038 D __x86_call_depth
> ffffffff842f6040 D __per_cpu_hot_end
>
> The LLVM linker doesn't do as well with packing smaller data objects,
> causing it to spill over into a second cacheline.

Ok, so I like how it decentralizes the decision about what is 'hot'
and what is not:

--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c

DEFINE_PER_CPU_SHARED_ALIGNED(irq_cpustat_t, irq_stat);
EXPORT_PER_CPU_SYMBOL(irq_stat);

+DEFINE_PER_CPU_HOT(u16, softirq_pending);

This can also be a drawback if it's abused by random driver code - so I
think it should at minimum be documented as being reserved for core &
arch code. Maybe also add a build #error if it's defined in modular
code?
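
Something like the sketch below could do it. (Just a sketch:
DEFINE_PER_CPU_SECTION() is the existing percpu-defs.h helper, but the
"..hot" section name and the MODULE guard are my assumptions, not
necessarily what the series does. Also, #error cannot live inside a
macro expansion, so static_assert() is the closest equivalent.)

#ifdef MODULE
# define DEFINE_PER_CPU_HOT(type, name)					\
	static_assert(0, "DEFINE_PER_CPU_HOT is for core & arch code only")
#else
# define DEFINE_PER_CPU_HOT(type, name)					\
	DEFINE_PER_CPU_SECTION(type, name, "..hot")
#endif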

Other variants like DEFINE_PER_CPU_SHARED_ALIGNED aren't really being
abused AFAICS, so maybe this isn't too much of a concern.

One potential drawback is that previously the layout was hand-ordered:

struct pcpu_hot {
	union {
		struct {
			struct task_struct *current_task;
			int preempt_count;
			int cpu_number;
#ifdef CONFIG_MITIGATION_CALL_DEPTH_TRACKING
			u64 call_depth;
#endif
			unsigned long top_of_stack;
			void *hardirq_stack_ptr;
			u16 softirq_pending;
#ifdef CONFIG_X86_64
			bool hardirq_stack_inuse;
#else
			void *softirq_stack_ptr;
#endif
		};
		u8 pad[64];
	};
};

... while now it's linker-ordered. But on the other hand that can be an
advantage too: the linker will try to (or at least has a chance to)
order the fields optimally for cache density, while hand-packing always
has the potential to bitrot without any outside, actionable indicator
that it has done so.
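
(If an actionable indicator is wanted for the linker-ordered variant as
well, a boot-time check could provide one. Just a sketch, reusing the
__per_cpu_hot_start/__per_cpu_hot_end section bounds visible in the nm
dump above; the initcall name is made up:)

extern char __per_cpu_hot_start[], __per_cpu_hot_end[];

static int __init check_per_cpu_hot_size(void)
{
	unsigned long size = __per_cpu_hot_end - __per_cpu_hot_start;

	/* Warn if the hot data no longer fits in a single cacheline: */
	WARN_ONCE(size > SMP_CACHE_BYTES,
		  "per-cpu hot data spills into a second cacheline (%lu bytes)",
		  size);
	return 0;
}
core_initcall(check_per_cpu_hot_size);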

One naming suggestion: wouldn't it be better to make it explicit that
the 'hot' qualifier is about cache locality:

+DEFINE_PER_CPU_CACHE_HOT(u16, softirq_pending);

It makes definitions/declarations more of a mouthful to write, but the
actual per-cpu usage sites are unaffected, as this too is otherwise
part of the generic percpu namespace.
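
(For reference, the renamed variant could stay a one-liner on top of
the existing DEFINE_PER_CPU_SECTION() helper - again just a sketch,
with the "..hot" section name being an assumption:)

#define DEFINE_PER_CPU_CACHE_HOT(type, name)				\
	DEFINE_PER_CPU_SECTION(type, name, "..hot")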

... and yes, DEFINE_PER_CPU_ALIGNED should probably have been named
DEFINE_PER_CPU_CACHE_ALIGNED too. (Because 'aligned' often means
machine-word alignment, so the current naming is a bit ambiguous.)

I.e. in an ideal world the complete set of DEFINE_PER_CPU_XXX
attributes should be something like:

DEFINE_PER_CPU_CACHE_HOT
DEFINE_PER_CPU_CACHE_ALIGNED # was: DEFINE_PER_CPU_ALIGNED
DEFINE_PER_CPU_CACHE_ALIGNED_SHARED # was: DEFINE_PER_CPU_SHARED_ALIGNED

DEFINE_PER_CPU_PAGE_ALIGNED

DEFINE_PER_CPU_READ_MOSTLY
DEFINE_PER_CPU_DECRYPTED

But I digress...

Anyway, I've Cc:-ed various potentially interested parties, please
speak up now or forever hold your peace. ;-)

Thanks,

Ingo