Re: [PATCH] Make SPLIT_RSS_COUNTING configurable

From: Kirill A. Shutemov
Date: Sun Oct 06 2019 - 20:11:49 EST

Next message: kernelci.org bot: "Re: [PATCH 5.3 000/166] 5.3.5-stable review"
Previous message: Jarkko Sakkinen: "Re: [PATCH] KEYS: asym_tpm: Switch to get_random_bytes()"
In reply to: Michal Hocko: "Re: [PATCH] Make SPLIT_RSS_COUNTING configurable"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Oct 04, 2019 at 06:45:21AM -0700, Daniel Colascione wrote:
> On Fri, Oct 4, 2019 at 6:26 AM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
> > On Fri, Oct 04, 2019 at 02:33:49PM +0200, Michal Hocko wrote:
> > > On Wed 02-10-19 19:08:16, Daniel Colascione wrote:
> > > > On Wed, Oct 2, 2019 at 6:56 PM Qian Cai <cai@xxxxxx> wrote:
> > > > > > On Oct 2, 2019, at 4:29 PM, Daniel Colascione <dancol@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > Adding the correct linux-mm address.
> > > > > >
> > > > > >
> > > > > >> +config SPLIT_RSS_COUNTING
> > > > > >> + bool "Per-thread mm counter caching"
> > > > > >> + depends on MMU
> > > > > >> + default y if NR_CPUS >= SPLIT_PTLOCK_CPUS
> > > > > >> + help
> > > > > >> + Cache mm counter updates in thread structures and
> > > > > >> + flush them to visible per-process statistics in batches.
> > > > > >> + Say Y here to slightly reduce cache contention in processes
> > > > > >> + with many threads at the expense of decreasing the accuracy
> > > > > >> + of memory statistics in /proc.
> > > > > >> +
> > > > > >> endmenu
> > > > >
> > > > > All those vague words are going to make developers almost
> > > > > impossible to decide the right selection here. It sounds like we
> > > > > should kill SPLIT_RSS_COUNTING at all to simplify the code as
> > > > > the benefit is so small vs the side-effect?
> > > >
> > > > Killing SPLIT_RSS_COUNTING would be my first choice; IME, on mobile
> > > > and a basic desktop, it doesn't make a difference. I figured making it
> > > > a knob would help allay concerns about the performance impact in more
> > > > extreme configurations.
> > >
> > > I do agree with Qian. Either it is really helpful (is it? probably on
> > > the number of cpus) and it should be auto-enabled or it should be
> > > dropped altogether. You cannot really expect people know how to enable
> > > this without a deep understanding of the MM internals. Not to mention
> > > all those users using distro kernels/configs.
> > >
> > > A config option sounds like a bad way forward.
> >
> > And I don't see much point anyway. Reading RSS counters from proc is
> > inherently racy. It can just either way after the read due to process
> > behaviour.
>
> Split RSS accounting doesn't make reading from mm counters racy. It
> makes these counters *wrong*. We flush task mm counters to the
> mm_struct once every 64 page faults that a task incurs or when that
> task exits. That means that if a thread takes 63 page faults and then
> sleeps for a week, that thread's process's mm counters are wrong by 63
> pages *for a week*. And some processes have a lot of threads,
> compounding the error. Split RSS accounting means that memory usage
> numbers don't add up. I don't think it's unreasonable to want a mode
> where memory counters to agree with other indicators of system
> activity.

It's documented behaviour that is upstream for 9 years. Why is your workload
special?

The documentation suggests to use smaps if you want to have precise data.
Why would it not fly for you?

> Nobody has demonstrated that split RSS accounting actually helps in
> the real world.

The original commit 34e55232e59f ("mm: avoid false sharing of mm_counter")
shows numbers on cache misses. It's not a real world workload, but you
don't have any numbers at all to back your claim.

> But I've described above, concretely, how split RSS
> accounting hurts. I've been trying for over a year to either disable
> split RSS accounting or to let people opt out of it. If you won't
> remove split RSS accounting and you won't let me add a configuration
> knob that lets people opt out of it, what will you accept?

Keeping stats precise is welcome, but often expensive. It might be
negligible for small machine, but becomes a problem on multisocket
machine with dozens or hundreds of cores. We need to keep kernel scalable.

We have other stats that update asynchronously (i.e. /proc/vmstat). Would
you like to convert them too?

--
Kirill A. Shutemov

Next message: kernelci.org bot: "Re: [PATCH 5.3 000/166] 5.3.5-stable review"
Previous message: Jarkko Sakkinen: "Re: [PATCH] KEYS: asym_tpm: Switch to get_random_bytes()"
In reply to: Michal Hocko: "Re: [PATCH] Make SPLIT_RSS_COUNTING configurable"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]