Re: [PATCH v2 3/3] mm: don't account memmap per node

From: Pasha Tatashin
Date: Thu Aug 08 2024 - 10:38:42 EST


On Thu, Aug 8, 2024 at 1:36 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
>
> On Wed, Aug 7, 2024 at 9:04 PM Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> >
> > On Wed, Aug 7, 2024 at 10:59 PM Muchun Song <muchun.song@xxxxxxxxx> wrote:
> > >
> > >
> > >
> > > > On Aug 8, 2024, at 05:19, Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> wrote:
> > > >
> > > > Currently, when memory is hot-plugged or hot-removed the accounting is
> > > > done based on the assumption that memmap is allocated from the same node
> > > > as the hot-plugged/hot-removed memory, which is not always the case.
> > > >
> > > > In addition, there are challenges with keeping the node id of the memory
> > > > that is being remove to the time when memmap accounting is actually
> > > > performed: since this is done after remove_pfn_range_from_zone(), and
> > > > also after remove_memory_block_devices(). Meaning that we cannot use
> > > > pgdat nor walking though memblocks to get the nid.
> > > >
> > > > Given all of that, account the memmap overhead system wide instead.
> > >
> > > Hi Pasha,
> > >
> > > You've changed it to vm event mechanism. But I found a comment (below) say
> > > "CONFIG_VM_EVENT_COUNTERS". I do not know why it has such a rule
> > > sice 2006. Now the rule should be changed, is there any effect to users of
> > > /proc/vmstat?
> >
> > There should not be any effect on the users of the /proc/vmstat, the
> > values for nr_memap and nr_memmap_boot before and after are still in
> > /proc/vmstat under the same names.
> >
> > >
> > > /*
> > > * Light weight per cpu counter implementation.
> > > *
> > > * Counters should only be incremented and no critical kernel component
> > > * should rely on the counter values.
> > > *
> > > * Counters are handled completely inline. On many platforms the code
> > > * generated will simply be the increment of a global address.
> >
> > Thank you for noticing this. Based on my digging, it looks like this
> > comment means that the increment only produces the most efficient code
> > on some architectures (i.e. i386, ia64):
> >
> > Here is the original commit message from 6/30/06:
> > f8891e5e1f93a1 [PATCH] Lightweight event counters
> >
> > Relevant information:
> > The implementation of these counters is through inline code that hopefully
> > results in only a single instruction increment instruction being emitted
> > (i386, x86_64) or in the increment being hidden though instruction
> > concurrency (EPIC architectures such as ia64 can get that done).
> >
> > My patch does not change anything in other places where vm_events are
> > used, so it won't introduce performance regression anywhere. Memmap,
> > increment and decrement can happen based on the value of delta. I have
> > tested, and it works correctly. Perhaps we should update the comment.
>
> I think there may be a semantic inconsistency here.
>
> I am not so sure about this code, but for memcg stats, there is a
> semantic distinction between stat (or state) and event.
>
> Per-memcg events (which are a subset of NR_VM_EVENT_ITEMS) are
> basically counting the number of times a certain event happened (e.g.
> PGFAULT). This naturally cannot be decremented because the number of
> page faults that happened cannot decrease.

>From what I can tell, for users, there is no difference, at the end
everything is provided through /proc/vmstat, depending on a counter
name they can either only-grow or go up and down. The separation is an
internal only concept.

> Per-memcg state are things that represent the current state of the
> system (e.g. NR_SWAPCACHE). This can naturally go up or down.
>
> It seems like the code here follows the same semantics, and this
> change breaks that. Also, now these stats depend on
> CONFIG_VM_EVENT_COUNTERS .

Yes, nr_memmap_* won't show up without this config with my change.

>
> Looking at NR_VMSTAT_ITEMS, it looks like it's composed of:
> NR_VM_ZONE_STAT_ITEMS,
> NR_VM_NUMA_EVENT_ITEMS,
> NR_VM_NODE_STAT_ITEMS,
> NR_VM_WRITEBACK_STAT_ITEMS,
> NR_VM_EVENT_ITEMS (with CONFIG_VM_EVENT_COUNTERS)
> Semantically, the memmap stats do not fit into any of the above
> categories if we do not want them to be per-node. Maybe they should
> have their own category like NR_VM_WRITEBACK_STAT_ITEMS, or maybe we
> should consolidate both of them into a global stat items category
> (e.g. NR_VM_STAT_ITEMS)?

I like the idea of renaming NR_VM_WRITEBACK_STAT_ITEMS with
NR_GLOBAL_STAT_ITEMS, and add counters there, let me do that.

Pasha