Re: [PATCH v2] mm: emit tracepoint when RSS changes by threshold

From: Joel Fernandes
Date: Wed Sep 04 2019 - 12:28:12 EST


On Wed, Sep 4, 2019 at 11:38 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Wed 04-09-19 11:32:58, Joel Fernandes wrote:
> > On Wed, Sep 04, 2019 at 10:45:08AM +0200, Michal Hocko wrote:
> > > On Tue 03-09-19 16:09:05, Joel Fernandes (Google) wrote:
> > > > Useful to track how RSS is changing per TGID to detect spikes in RSS and
> > > > memory hogs. Several Android teams have been using this patch in various
> > > > kernel trees for half a year now. Many reported to me it is really
> > > > useful so I'm posting it upstream.
> > > >
> > > > Initial patch developed by Tim Murray. Changes I made from original patch:
> > > > o Prevent any additional space consumed by mm_struct.
> > > > o Keep overhead low by checking if tracing is enabled.
> > > > o Add some noise reduction and lower overhead by emitting only on
> > > > threshold changes.
> > >
> > > Does this have any pre-requisite? I do not see trace_rss_stat_enabled in
> > > the Linus tree (nor in linux-next).
> >
> > No, this is generated automatically by the tracepoint infrastructure when a
> > tracepoint is added.
>
> OK, I was not aware of that.
>
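
To expand on this a bit: declaring an event with TRACE_EVENT() in an
include/trace/events/ header generates both the trace_<name>() call and a
static-key-backed trace_<name>_enabled() helper, so rss_stat gets its
_enabled() check for free. A rough sketch of such a declaration (the field
layout here is only illustrative, not the exact patch):

/* Illustrative sketch only -- the field layout is an assumption. */
#include <linux/tracepoint.h>
#include <linux/mm_types.h>

TRACE_EVENT(rss_stat,

	TP_PROTO(struct mm_struct *mm, int member, long count),

	TP_ARGS(mm, member, count),

	TP_STRUCT__entry(
		__field(int,  member)
		__field(long, size)
	),

	TP_fast_assign(
		__entry->member	= member;
		__entry->size	= count << PAGE_SHIFT;
	),

	TP_printk("member=%d size=%ldB", __entry->member, __entry->size)
);

/*
 * Besides trace_rss_stat() itself, the TRACE_EVENT() machinery above also
 * emits a static-key-backed trace_rss_stat_enabled() helper, which is why
 * it does not show up as a separate definition anywhere in the tree.
 */
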
> > > Besides that why do we need batching in the first place. Does this have a
> > > measurable overhead? How does it differ from any other tracepoints that we
> > > have in other hotpaths (e.g. page allocator doesn't do any checks).
> >
> > We do need batching not only for overhead reduction,
>
> What is the overhead?

The overhead is occasionally higher without the threshold (that is, if we
trace every counter change). Overall I would say the performance of the two
cases is about the same and within the noise.

For a memset of 1GB of data:

With threshold:
Total time for 1GB data: 684172499 nanoseconds.
Total time for 1GB data: 692379986 nanoseconds.
Total time for 1GB data: 760023463 nanoseconds.
Total time for 1GB data: 669291457 nanoseconds.
Total time for 1GB data: 729722783 nanoseconds.

Without threshold:
Total time for 1GB data: 722505810 nanoseconds.
Total time for 1GB data: 648724292 nanoseconds.
Total time for 1GB data: 643102853 nanoseconds.
Total time for 1GB data: 641815282 nanoseconds.
Total time for 1GB data: 828561187 nanoseconds. <-- outlier but it did happen.
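
The numbers come from timing a memset over a 1GB buffer, roughly along these
lines (the clock source and exact structure of the harness are incidental):

/* Minimal user-space sketch of the timing harness; details are assumed. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE (1UL << 30)	/* 1GB */

int main(void)
{
	struct timespec start, end;
	char *buf = malloc(SIZE);

	if (!buf)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &start);
	memset(buf, 1, SIZE);	/* faults in every page, bumping anon RSS */
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("Total time for 1GB data: %ld nanoseconds.\n",
	       (end.tv_sec - start.tv_sec) * 1000000000L +
	       (end.tv_nsec - start.tv_nsec));

	free(buf);
	return 0;
}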

> > but also for reducing
> > tracing noise. Flooding the traces makes it less useful for long traces and
> > post-processing of traces. IOW, the overhead reduction is a bonus.
>
> This is not really anything special for this tracepoint though.
> Basically any tracepoint in a hot path is in the same situation and I do
> not see a point why each of them should really invent its own way to
> throttle. Maybe there is some way to do that in the tracing subsystem
> directly.

I am not sure there is an easy way to do this in the tracing subsystem. Add
to that the fact that, without a threshold, you still have to call into the
trace event on every counter change. Why call into it at all if you can
filter in advance and have a sane filtering default?
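
Concretely, the filtering is just a cheap check in the counter-update path,
roughly along these lines (the names and the threshold value here are
illustrative, not the exact patch):

/* Illustrative sketch of the filtering idea; not the exact patch. */
#define RSS_TRACE_THRESHOLD	128	/* pages; example value only */

static void trace_rss_change(struct mm_struct *mm, int member,
			     long count, long delta)
{
	long thresh_mask = ~(RSS_TRACE_THRESHOLD - 1);

	/* Static-key check: effectively free when the event is off. */
	if (!trace_rss_stat_enabled())
		return;

	/* Only emit when the counter crosses a threshold boundary. */
	if ((count & thresh_mask) != ((count - delta) & thresh_mask))
		trace_rss_stat(mm, member, count);
}

Keeping the threshold a power of two means the boundary-crossing test is a
mask compare rather than a divide, so the cost when tracing is enabled stays
at a handful of arithmetic ops.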

The bigger improvement from the threshold is that the number of trace records
is almost halved: in my test it went from 4.6K to 2.6K records.

I don't see any drawbacks to using a threshold, and there is no extra
overhead either way. For systems without split RSS accounting, the reduction
in the number of trace records would be even higher, significantly reducing
the consumption of the ftrace buffer and the noise that people have to deal
with.

Hope you agree now?

thanks,

- Joel