Re: [PATCH 0/2] quickstats, kernel sample collector

From: Luigi Rizzo
Date: Wed Feb 26 2020 - 06:40:42 EST


On Wed, Feb 26, 2020 at 2:15 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Feb 26, 2020 at 01:52:25AM -0800, Luigi Rizzo wrote:
> > On Wed, Feb 26, 2020 at 12:10 AM Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, Feb 25, 2020 at 06:30:25PM -0800, Luigi Rizzo wrote:
> > > > This patchset introduces a small library to collect per-cpu samples and
> > > > accumulate distributions to be exported through debugfs.
> > >
> > > Shouldn't this be part of the tracing infrastructure instead of being
> > > "stand-alone"?
> >
> > That's an option. My reasoning for making it standalone was that
> > there are no dependencies in the (trivial) collection/aggregation part,
> > so that code might conveniently replace/extend existing snippets of
> > code that collect distributions in ad-hoc and perhaps suboptimal ways.
>
> But that's what perf and tracing already does today, right?

Maybe I am mistaken, but I believe there are substantial performance
and use-case differences between kstats and the existing perf/tracing
code, as described below.

kstats is meant to a) be used for manual code annotations and b) be as
fast as possible.

For a), there are already several places in the kernel (a grep
indicates fs/fscache, drivers/md/, some drivers; I am sure there are
more) where we accumulate and export metrics in ad-hoc ways (packet
sizes, memory requests, request execution times). There are other
places where we would in principle have the information (e.g.
CONFIG_IRQ_TIME_ACCOUNTING knows the intervals spent in soft/hard
interrupts; napi calls report how much of the budget has been used;
NIC drivers know actual batch sizes) but we do not try to accumulate
it, even though it would be precious for performance tuning. kstats,
in my view, fits this use case; a sketch of the kind of ad-hoc pattern
it could replace follows.
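
To make (a) concrete, here is a sketch (hypothetical names, not taken
from any particular driver) of the kind of open-coded, ad-hoc
accumulation that exists in the tree today:

    /* hypothetical ad-hoc pattern: an open-coded per-cpu log2
     * histogram of service times, dumped through a hand-rolled
     * debugfs file elsewhere in the driver.
     */
    #include <linux/percpu.h>
    #include <linux/ktime.h>
    #include <linux/log2.h>

    #define FOO_BUCKETS 64

    struct foo_hist {
            u64 buckets[FOO_BUCKETS];
    };
    static DEFINE_PER_CPU(struct foo_hist, foo_service_time);

    static void foo_account(u64 ns)
    {
            unsigned int slot = ns ? ilog2(ns) : 0; /* log2 bucketing */

            if (slot >= FOO_BUCKETS)
                    slot = FOO_BUCKETS - 1;
            this_cpu_inc(foo_service_time.buckets[slot]);
    }

    static void foo_handle_request(void)
    {
            u64 t = ktime_get_ns();

            /* ... actual request processing ... */
            foo_account(ktime_get_ns() - t);
    }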

For b), the manual annotations are as fast as possible: kstats_record()
takes only about 5ns with a hot cache and 250ns with a cold cache
(probably the same cost as the existing code it is meant to replace),
and it inherits the accuracy of the base clock (ktime_get_ns() takes
about 20ns on x86). This means we can reliably tell apart samples that
differ by O(50ns), which is the order of magnitude of cache misses, and
instrument even sub-microsecond sections of code with limited impact on
performance. Networking code, other high-speed drivers,
scheduler-related functions, signaling latencies etc. are a significant
use case here.
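
As a concrete illustration, an annotation is just a timestamp pair
around the section of interest. Only kstats_record() is named in this
message; the constructor hinted at below is shorthand for however the
kstats object gets created, see the patches for the exact interface:

    #include <linux/kstats.h>       /* header added by this patchset */
    #include <linux/ktime.h>

    /* assumed to be created at init time, e.g. by a kstats_new()-style
     * call that sets up the per-cpu tables and the debugfs entry
     */
    static struct kstats *foo_latency;

    static void foo_do_work(void)
    {
            u64 dt = ktime_get_ns();        /* ~20ns on x86 */

            /* ... sub-microsecond section of code to instrument ... */

            dt = ktime_get_ns() - dt;
            kstats_record(foo_latency, dt); /* ~5ns hot, ~250ns cold */
    }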

The tracepoint/kprobe/kretprobe solution is much more expensive: from
my measurements, the hooks that invoke the various handlers take ~250ns
with a hot cache and 1500+ns with a cold cache, and tracing an empty
function this way reports 90ns with a hot cache and 500ns with a cold
cache. As a consequence, enabling tracing through those hooks is only
viable over much longer time intervals, and the much coarser accuracy
(anything shorter than those 90..500ns is hidden in the noise) would
mask shorter phenomena.
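
For reference, one of the mechanisms being compared here is the
standard kretprobe path; stripped down (and with a hypothetical target
symbol) it looks roughly like this, and every hit goes through both the
entry and the return handler:

    #include <linux/kprobes.h>
    #include <linux/ktime.h>

    /* data carried from function entry to function return */
    struct lat_data {
            u64 t_entry;
    };

    static int lat_entry(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
            ((struct lat_data *)ri->data)->t_entry = ktime_get_ns();
            return 0;
    }

    static int lat_return(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
            u64 dt = ktime_get_ns() - ((struct lat_data *)ri->data)->t_entry;

            pr_debug("latency: %llu ns\n", dt);     /* or feed a histogram */
            return 0;
    }

    static struct kretprobe lat_probe = {
            .entry_handler  = lat_entry,
            .handler        = lat_return,
            .data_size      = sizeof(struct lat_data),
            .kp.symbol_name = "target_function",    /* hypothetical */
    };

    /* register_kretprobe(&lat_probe) / unregister_kretprobe(&lat_probe)
     * from module init/exit.
     */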

> You need to
> integrate into the existing subsystems of the kernel and not duplicate
> things, creating new user/kernel apis whenever possible.

For the above reasons, I am not sure this is duplication.
Perhaps part of the problem is that "perf and tracing" are very general
terms: while at a high level they encompass every possible monitoring
activity, their actual implementation seems to me orthogonal to kstats.
Of course we could fold the 300 lines of kstats into perf/tracing, but
then I wonder: do we need to bring in the whole thing when all we need
is just the smaller component?

cheers
luigi