Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
From: Leon Huang Fu
Date: Tue Nov 11 2025 - 01:13:12 EST

Hi Harry,

On Mon, Nov 10, 2025 at 7:52 PM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
>
> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu wrote:
> > Memory cgroup statistics are updated asynchronously with periodic
> > flushing to reduce overhead. The current implementation uses a flush
> > threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> > determining when to aggregate per-CPU memory cgroup statistics. On
> > systems with high core counts, this threshold can become very large
> > (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> > statistics when userspace reads memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data
> > that is thousands of updates out of date.
> >
> > Introduce a new write-only file, memory.stat_refresh, that allows
> > userspace to explicitly trigger an immediate flush of memory statistics.
> >
> > Writing any value to this file forces a synchronous flush via
> > __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> > descendants, ensuring that subsequent reads of memory.stat and
> > memory.numa_stat reflect current data.
> >
> > This approach follows the pattern established by /proc/sys/vm/stat_refresh
> > and memory.peak, where the written value is ignored, keeping the
> > interface simple and consistent with existing kernel APIs.
> >
> > Usage example:
> > echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> > cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > The feature is available in both cgroup v1 and v2 for consistency.
> >
> > Signed-off-by: Leon Huang Fu <leon.huangfu@xxxxxxxxxx>
> > ---
> > v2 -> v3:
> > - Flush stats by memory.stat_refresh (per Michal)
> > - https://lore.kernel.org/linux-mm/20251105074917.94531-1-leon.huangfu@xxxxxxxxxx/
> >
> > v1 -> v2:
> > - Flush stats when the file is written (per Michal).
> > - https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@xxxxxxxxxx/
> >
> > Documentation/admin-guide/cgroup-v2.rst | 21 +++++++++++++++++--
> > mm/memcontrol-v1.c | 4 ++++
> > mm/memcontrol-v1.h | 2 ++
> > mm/memcontrol.c | 27 ++++++++++++++++++-------
> > 4 files changed, 45 insertions(+), 9 deletions(-)
>
> Hi Leon, I have a few questions on the patch.
>
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 3345961c30ac..ca079932f957 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
> > cgroup is within its effective low boundary, the cgroup's
> > memory won't be reclaimed unless there is no reclaimable
> > memory available in unprotected cgroups.
> > - Above the effective low boundary (or
> > + Above the effective low boundary (or
>
> Is this a whitespace change? It looks the same as before.
>

Yes, that hunk just trims the trailing whitespace.
If you'd prefer to avoid the churn, I'm happy to drop it from the series.

> > effective min boundary if it is higher), pages are reclaimed
> > proportionally to the overage, reducing reclaim pressure for
> > smaller overages.
> > @@ -1785,6 +1785,23 @@ The following nested keys are defined.
> > up if hugetlb usage is accounted for in memory.current (i.e.
> > cgroup is mounted with the memory_hugetlb_accounting option).
> >
> > + memory.stat_refresh
> > + A write-only file which exists on non-root cgroups.
>
> Why don't we create the file for the root cgroup?
>

Thanks for pointing that out. I copied the wording from the memory.stat
section without double-checking. All three files,
memory.{stat,numa_stat,stat_refresh}, are created for the root cgroup.

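For anyone who wants to confirm this on a test machine, something like the
following should do. It is only a sketch: the helper name
list_memcg_stat_files is made up for illustration, and it assumes a kernel
with this patch applied and cgroup v2 mounted at /sys/fs/cgroup.

```shell
# Sketch only: list_memcg_stat_files is a hypothetical helper name.
# Reports which of the three memcg stat files exist in a cgroup directory.
list_memcg_stat_files() {
    cgroup_dir=$1
    for f in memory.stat memory.numa_stat memory.stat_refresh; do
        if [ -e "$cgroup_dir/$f" ]; then
            echo "$f: present"
        else
            echo "$f: missing"
        fi
    done
}
```

Running "list_memcg_stat_files /sys/fs/cgroup" checks the root cgroup;
passing a child cgroup directory checks a non-root one.
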
> > + Writing any value to this file forces an immediate flush of
> > + memory statistics for this cgroup and its descendants. This
> > + ensures subsequent reads of memory.stat and memory.numa_stat
> > + reflect the most current data.
> > +
> > + This is useful on high-core count systems where per-CPU caching
> > + can lead to stale statistics, or when precise memory usage
> > + information is needed for monitoring or debugging purposes.
> > +
> > + Example::
> > +
> > + echo 1 > memory.stat_refresh
> > + cat memory.stat
> > +
> > memory.numa_stat
> > A read-only nested-keyed file which exists on non-root cgroups.
> >
> > @@ -2173,7 +2190,7 @@ of the two is enforced.
> >
> > cgroup writeback requires explicit support from the underlying
> > filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
> > -btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> > +btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
> > attributed to the root cgroup.
>
> Same here, not sure what's changed...

That hunk also just trims trailing whitespace.

>
> > There are inherent differences in memory and writeback management
> > diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> > index 6358464bb416..a14d4d74c9aa 100644
> > --- a/mm/memcontrol-v1.h
> > +++ b/mm/memcontrol-v1.h
> > @@ -4666,6 +4675,10 @@ static struct cftype memory_files[] = {
> > .name = "stat",
> > .seq_show = memory_stat_show,
> > },
> > + {
> > + .name = "stat_refresh",
> > + .write = memory_stat_refresh_write,
>
> I think we should use the CFTYPE_NOT_ON_ROOT flag to avoid creating
> the file for the root cgroup if that's intended?
>
I kept memory.stat_refresh aligned with the existing memory.stat entry, so
I left CFTYPE_NOT_ON_ROOT unset.
That said, the documentation is behind the current behavior; I'll update
it to spell out that the files exist on the root cgroup too.
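For illustration, a monitoring script could wrap the refresh-then-read
sequence roughly as below. This is only a sketch: the function name
read_fresh_memory_stat is hypothetical, and it assumes a kernel carrying
this patch. On kernels without memory.stat_refresh it falls back to a
plain read.

```shell
# Sketch only: read_fresh_memory_stat is a hypothetical helper name, and
# memory.stat_refresh exists only on kernels carrying this patch.
read_fresh_memory_stat() {
    cgroup_dir=$1

    # Skip the flush when the file is absent (kernel without the patch).
    if [ -w "$cgroup_dir/memory.stat_refresh" ]; then
        # The written value is ignored; the write itself triggers the flush.
        echo 1 > "$cgroup_dir/memory.stat_refresh" || return 1
    fi

    # Read the statistics after the flush (if any) has completed.
    cat "$cgroup_dir/memory.stat"
}
```

With the path from the usage example in the commit message, this would be
called as "read_fresh_memory_stat /sys/fs/cgroup/mygroup".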

Thanks,
Leon