Re: [PATCH v3 31/35] lib: add memory allocations report in show_mem()
From: Kent Overstreet
Date: Thu Feb 15 2024 - 19:33:36 EST
On Thu, Feb 15, 2024 at 07:21:41PM -0500, Steven Rostedt wrote:
> On Thu, 15 Feb 2024 18:51:41 -0500
> Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>
> > Most of that is data (505024), not text (68582, or 66k).
> >
>
> And the 4K extra would have been data too.

"It's not that much" isn't an argument for being wasteful.

> > The data is mostly the alloc tags themselves (one per allocation
> > callsite, and you compiled the entire kernel), so that's expected.
> >
> > Of the text, a lot of that is going to be slowpath stuff - module load
> > and unload hooks, formatting and printing the output, other assorted bits.
> >
> > Then there's allocating and deallocating obj extension vectors - not
> > slowpath but not super fast path, not every allocation.
> >
> > The fastpath instruction count overhead is pretty small:
> > - actually doing the accounting - the core of slub.c, page_alloc.c,
> > percpu.c
> > - setting/restoring the alloc tag: this is overhead we add to every
> > allocation callsite, so it's the most relevant - but it's just a few
> > instructions.
> >
> > So that's the breakdown. Definitely not zero overhead, but that fixed
> > memory overhead (and additionally, the percpu counters) is the price we
> > pay for very low runtime CPU overhead.
>
> But where are the benchmarks that are not micro-benchmarks? How much
> overhead does this cause to those? Is it in the noise, or is it noticeable?

Microbenchmarks are how we magnify the effect of a change like this to
the most we'll ever see. Barring cache effects, it'll be in the noise.

Cache effects are a concern here because we're now touching task_struct
in the allocation fast path; that is where the
"compiled-in-but-turned-off" overhead comes from, because we can't add
static keys for that code without doubling the icache footprint, and I
don't think that would be a great tradeoff.

So: if your code has fastpath allocations where the hot part of
task_struct isn't in cache, then this will be noticeable overhead to
you, otherwise it won't be.
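
To make the "setting/restoring the alloc tag" cost concrete, here is a
rough userspace sketch of the pattern. It is not the code from the patch
series: every name in it (alloc_tag_sketch, task_sketch, current_task,
tag_save, tag_restore, alloc_with_tag, TAGGED_MALLOC) is made up for
illustration, and plain counters stand in for the percpu counters. The
point is only that each callsite does one load and two stores to a field
in the task structure, which is cheap when that cacheline is hot and a
miss when it isn't.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-callsite tag; not the kernel's struct. */
struct alloc_tag_sketch {
	const char *file;
	int line;
	unsigned long bytes;	/* plain counter; stands in for percpu counters */
	unsigned long calls;
};

/* Hypothetical stand-in for the part of task_struct touched per allocation. */
struct task_sketch {
	struct alloc_tag_sketch *alloc_tag;
};

static struct task_sketch current_task;

/* Install a tag for the current task and return the previous one. */
static struct alloc_tag_sketch *tag_save(struct alloc_tag_sketch *tag)
{
	struct alloc_tag_sketch *old = current_task.alloc_tag;

	current_task.alloc_tag = tag;
	return old;
}

/* Put back whatever tag was active before tag_save(). */
static void tag_restore(struct alloc_tag_sketch *old)
{
	current_task.alloc_tag = old;
}

/* The allocator core: charge the allocation to whichever tag is active. */
static void *accounted_malloc(size_t size)
{
	struct alloc_tag_sketch *tag = current_task.alloc_tag;

	if (tag) {
		tag->bytes += size;
		tag->calls++;
	}
	return malloc(size);
}

/* Fast-path wrapper: save the tag, allocate, restore the tag. */
static void *alloc_with_tag(struct alloc_tag_sketch *tag, size_t size)
{
	struct alloc_tag_sketch *old;
	void *p;

	assert(tag != NULL);
	old = tag_save(tag);
	p = accounted_malloc(size);
	tag_restore(old);
	return p;
}

/* One static tag per callsite, which is why the tags show up as data. */
#define TAGGED_MALLOC(ptr, size)					\
	do {								\
		static struct alloc_tag_sketch _tag = {			\
			__FILE__, __LINE__, 0, 0			\
		};							\
		(ptr) = alloc_with_tag(&_tag, (size));			\
	} while (0)
```

A caller would write something like `TAGGED_MALLOC(buf, 64);` at each
allocation site. The static tag per callsite is the fixed data overhead
discussed above, and the tag_save()/tag_restore() pair is the handful of
instructions added to every allocation.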