Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is disabled by default v4

From: Mel Gorman
Date: Wed Feb 03 2016 - 08:32:55 EST


On Wed, Feb 03, 2016 at 01:49:21PM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> > On Wed, Feb 03, 2016 at 12:28:49PM +0100, Ingo Molnar wrote:
> > >
> > > * Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > > Changelog since v3
> > > > o Force enable stats during profiling and latencytop
> > > >
> > > > Changelog since V2
> > > > o Print stats that are not related to schedstat
> > > > o Reintroduce a static inline for update_stats_dequeue
> > > >
> > > > Changelog since V1
> > > > o Introduce schedstat_enabled and address Ingo's feedback
> > > > o More schedstat-only paths eliminated, particularly ttwu_stat
> > > >
> > > > schedstats is very useful during debugging and performance tuning but it
> > > > incurs overhead. As such, even though it can be disabled at build time,
> > > > it is often enabled as the information is useful. This patch adds a
> > > > kernel command-line and sysctl tunable to enable or disable schedstats on
> > > > demand. It is disabled by default as someone who knows they need it can
> > > > also learn to enable it when necessary.
> > > >
> > > > The benefits are workload-dependent but when it gets down to it, the
> > > > difference will be whether cache misses are incurred updating the shared
> > > > stats or not. [...]
> > >
> > > Hm, which shared stats are those?
> >
> > Extremely poor phrasing on my part. The stats share a cache line and the impact
> > partially depends on whether unrelated stats share a cache line or not during
> > updates.
>
> Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there
> any 'global' (or per node) counters that we keep touching and which keep
> generating cache-misses?
>

I haven't specifically identified them, as I consider the calculations for
some of them to be expensive in their own right, even without accounting for
cache misses. Moving to per-cpu counters would not eliminate all cache misses
either: a stat updated on one CPU for a task that is woken on a different CPU
will still trigger a cache miss. Even if such counters were identified and
moved to separate cache lines, the calculation overhead would remain.
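
To illustrate the sort of per-event work involved, here is a simplified
sketch (not the kernel code, the names are illustrative) of what the
fair-class wait-time accounting boils down to. Every event reads a clock,
computes a delta and updates several fields of a per-task structure, and
that work is done regardless of whether the fields happen to be cache-hot:

#include <stdint.h>

/* Per-task fields, in the spirit of sched_statistics. */
struct wait_stats {
        uint64_t wait_start;
        uint64_t wait_max;
        uint64_t wait_count;
        uint64_t wait_sum;
};

/* Called when the task stops waiting; 'now' is the runqueue clock. */
static void stats_wait_end(struct wait_stats *ws, uint64_t now)
{
        uint64_t delta = now - ws->wait_start;

        if (delta > ws->wait_max)
                ws->wait_max = delta;
        ws->wait_count++;
        ws->wait_sum += delta;
        ws->wait_start = 0;
}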

> > > I think we should really fix those as well: those shared stats should be
> > > percpu collected as well, with no extra cache misses in any scheduler fast
> > > path.
> >
> > I looked into that but converting those stats to per-cpu counters would incur
> > sizable memory overhead. There are a *lot* of them and the basic structure for
> > the generic percpu-counter is
> >
> > struct percpu_counter {
> >         raw_spinlock_t lock;
> >         s64 count;
> > #ifdef CONFIG_HOTPLUG_CPU
> >         struct list_head list;  /* All percpu_counters are on a list */
> > #endif
> >         s32 __percpu *counters;
> > };
>
> We don't have to reuse percpu_counter().
>

No, but rolling a specialised solution for a debugging feature is overkill,
and the calculation overhead would remain either way. It would be specialised
code with very little upside.
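
For scale, a rough back-of-envelope, treating the numbers as purely
illustrative: on a 64-bit kernel without lock debugging the struct above is
about 40 bytes (4-byte raw_spinlock_t plus padding, 8-byte count, 16-byte
list_head, 8-byte pointer), plus a 4-byte s32 per possible CPU for the percpu
data, so roughly 40 + 64*4 = 296 bytes per counter on a 64-CPU machine. With
sched_statistics carrying on the order of two dozen fields per sched_entity,
10000 tasks would work out at something like 24 * 10000 * 296 bytes, i.e.
around 70MB, before any runtime cost is considered.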

The main gain from the patch is that the calculation overhead is
avoided. Avoiding any potential cache misses is a bonus.

> > That's not taking the associated runtime overhead such as synchronising them.
>
> Why do we have to synchronize them in the kernel?

Because some of them simply require it, or are not suitable for moving to
per-cpu counters at all. sleep_start is an obvious one, as the task that set
it can wake on another CPU.
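
As a rough illustration (simplified, not the kernel code), the timestamp is
written on the CPU where the task blocks and consumed on whichever CPU later
wakes it, so the same per-task field is touched from both sides:

/* Per-task sleep accounting, in the spirit of sched_statistics. */
struct sleep_stats {
        unsigned long long sleep_start;        /* set when the task blocks */
        unsigned long long sum_sleep_runtime;  /* accumulated at wakeup */
};

/* Runs on the CPU the task goes to sleep on. */
static void stats_dequeue_sleep(struct sleep_stats *ss, unsigned long long now)
{
        ss->sleep_start = now;
}

/* Runs on the (possibly different) CPU the task is woken on. */
static void stats_enqueue_wakeup(struct sleep_stats *ss, unsigned long long now)
{
        if (ss->sleep_start) {
                ss->sum_sleep_runtime += now - ss->sleep_start;
                ss->sleep_start = 0;
        }
}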

> User-space can recover them on a
> percpu basis and add them up if it wishes to. We can update the schedstat utility
> to handle the more spread out fields as well.
>

Any user of /proc/pid/sched would also need updating, including latencytop,
and all of them would need to handle CPU hotplug or else deal with output
from all possible CPUs instead of just the currently online ones.
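
For reference, those consumers currently parse single summed values per field
from /proc/pid/sched; from memory the lines look roughly like this (the exact
spacing and values here are illustrative):

        se.statistics.wait_sum              :          123.456789
        se.statistics.sum_sleep_runtime     :          456.123456

Splitting each field per CPU would change that layout for every one of those
parsers.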

--
Mel Gorman
SUSE Labs