Re: [PATCH v2 0/5] Statsfs: a new ram-based file sytem for Linux kernel statistics

From: Amit Kucheria
Date: Thu Jun 04 2020 - 08:00:03 EST


On Tue, May 5, 2020 at 3:07 AM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Mon, 4 May 2020, Emanuele Giuseppe Esposito wrote:
>
> > There is currently no common way for Linux kernel subsystems to expose
> > statistics to userspace shared throughout the Linux kernel; subsystems
> > have to take care of gathering and displaying statistics by themselves,
> > for example in the form of files in debugfs. For example KVM has its own
> > code section that takes care of this in virt/kvm/kvm_main.c, where it sets
> > up debugfs handlers for displaying values and aggregating them from
> > various subfolders to obtain information about the system state (i.e.
> > displaying the total number of exits, calculated by summing all exits of
> > all cpus of all running virtual machines).
> >
> > Allowing each section of the kernel to do so has two disadvantages. First,
> > it will introduce redundant code. Second, debugfs is anyway not the right
> > place for statistics (for example it is affected by lockdown)
> >
> > In this patch series I introduce statsfs, a synthetic ram-based virtual
> > filesystem that takes care of gathering and displaying statistics for the
> > Linux kernel subsystems.
> >
>
> This is exciting, we have been looking in the same area recently. Adding
> Jonathan Adams <jwadams@xxxxxxxxxx>.
>
> In your diffstat, one thing I notice that is omitted: an update to
> Documentation/* :) Any chance of getting some proposed Documentation/
> updates with structure of the fs, the per subsystem breakdown, and best
> practices for managing the stats from the kernel level?
>
> > The file system is mounted on /sys/kernel/stats and would be already used
> > by kvm. Statsfs was initially introduced by Paolo Bonzini [1].
> >
> > Statsfs offers a generic and stable API, allowing any kind of
> > directory/file organization and supporting multiple kind of aggregations
> > (not only sum, but also average, max, min and count_zero) and data types
> > (all unsigned and signed types plus boolean). The implementation, which is
> > a generalization of KVMâs debugfs statistics code, takes care of gathering
> > and displaying information at run time; users only need to specify the
> > values to be included in each source.
> >
> > Statsfs would also be a different mountpoint from debugfs, and would not
> > suffer from limited access due to the security lock down patches. Its main
> > function is to display each statistics as a file in the desired folder
> > hierarchy defined through the API. Statsfs files can be read, and possibly
> > cleared if their file mode allows it.
> >
> > Statsfs has two main components: the public API defined by
> > include/linux/statsfs.h, and the virtual file system which should end up
> > in /sys/kernel/stats.
> >
> > The API has two main elements, values and sources. Kernel subsystems like
> > KVM can use the API to create a source, add child
> > sources/values/aggregates and register it to the root source (that on the
> > virtual fs would be /sys/kernel/statsfs).
> >
> > Sources are created via statsfs_source_create(), and each source becomes a
> > directory in the file system. Sources form a parent-child relationship;
> > root sources are added to the file system via statsfs_source_register().
> > Every other source is added to or removed from a parent through the
> > statsfs_source_add_subordinate and statsfs_source_remote_subordinate APIs.
> > Once a source is created and added to the tree (via add_subordinate), it
> > will be used to compute aggregate values in the parent source.
> >
> > Values represent quantites that are gathered by the statsfs user. Examples
> > of values include the number of vm exits of a given kind, the amount of
> > memory used by some data structure, the length of the longest hash table
> > chain, or anything like that. Values are defined with the
> > statsfs_source_add_values function. Each value is defined by a struct
> > statsfs_value; the same statsfs_value can be added to many different
> > sources. A value can be considered "simple" if it fetches data from a
> > user-provided location, or "aggregate" if it groups all values in the
> > subordinates sources that include the same statsfs_value.
> >
>
> This seems like it could have a lot of overhead if we wanted to
> periodically track the totality of subsystem stats as a form of telemetry
> gathering from userspace. To collect telemetry for 1,000 different stats,
> do we need to issue lseek()+read() syscalls for each of them individually
> (or, worse, open()+read()+close())?
>
> Any thoughts on how that can be optimized? A couple of ideas:
>
> - an interface that allows gathering of all stats for a particular
> interface through a single file that would likely be encoded in binary
> and the responsibility of userspace to disseminate, or
>
> - an interface that extends beyond this proposal and allows the reader to
> specify which stats they are interested in collecting and then the
> kernel will only provide these stats in a well formed structure and
> also be binary encoded.

Something akin to how ftrace allows you specify the list of functions
in /sys/kernel/debug/tracing/set_ftrace_filter would make this a lot
easier to use than the one-file-per-stat interface.

That would be useful, e.g. in capturing correlated stats periodically
e.g. scheduler, power and thermal stats

> We've found that the one-file-per-stat method is pretty much a show
> stopper from the performance view and we always must execute at least two
> syscalls to obtain a single stat.
>
> Since this is becoming a generic API (good!!), maybe we can discuss
> possible ways to optimize gathering of stats in mass?
>
> > For more information, please consult the kerneldoc documentation in patch
> > 2 and the sample uses in the kunit tests and in KVM.
> >
> > This series of patches is based on my previous series "libfs: group and
> > simplify linux fs code" and the single patch sent to kvm "kvm_host: unify
> > VM_STAT and VCPU_STAT definitions in a single place". The former
> > simplifies code duplicated in debugfs and tracefs (from which statsfs is
> > based on), the latter groups all macros definition for statistics in kvm
> > in a single common file shared by all architectures.
> >
> > Patch 1 adds a new refcount and kref destructor wrappers that take a
> > semaphore, as those are used later by statsfs. Patch 2 introduces the
> > statsfs API, patch 3 provides extensive tests that can also be used as
> > example on how to use the API and patch 4 adds the file system support.
> > Finally, patch 5 provides a real-life example of statsfs usage in KVM.
> >
> > [1] https://lore.kernel.org/kvm/5d6cdcb1-d8ad-7ae6-7351-3544e2fa366d@xxxxxxxxxx/?fbclid=IwAR18LHJ0PBcXcDaLzILFhHsl3qpT3z2vlG60RnqgbpGYhDv7L43n0ZXJY8M
> >
> > Signed-off-by: Emanuele Giuseppe Esposito <eesposit@xxxxxxxxxx>
> >
> > v1->v2 remove unnecessary list_foreach_safe loops, fix wrong indentation,
> > change statsfs in stats_fs
> >
> > Emanuele Giuseppe Esposito (5):
> > refcount, kref: add dec-and-test wrappers for rw_semaphores
> > stats_fs API: create, add and remove stats_fs sources and values
> > kunit: tests for stats_fs API
> > stats_fs fs: virtual fs to show stats to the end-user
> > kvm_main: replace debugfs with stats_fs
> >
> > MAINTAINERS | 7 +
> > arch/arm64/kvm/Kconfig | 1 +
> > arch/arm64/kvm/guest.c | 2 +-
> > arch/mips/kvm/Kconfig | 1 +
> > arch/mips/kvm/mips.c | 2 +-
> > arch/powerpc/kvm/Kconfig | 1 +
> > arch/powerpc/kvm/book3s.c | 6 +-
> > arch/powerpc/kvm/booke.c | 8 +-
> > arch/s390/kvm/Kconfig | 1 +
> > arch/s390/kvm/kvm-s390.c | 16 +-
> > arch/x86/include/asm/kvm_host.h | 2 +-
> > arch/x86/kvm/Kconfig | 1 +
> > arch/x86/kvm/Makefile | 2 +-
> > arch/x86/kvm/debugfs.c | 64 --
> > arch/x86/kvm/stats_fs.c | 56 ++
> > arch/x86/kvm/x86.c | 6 +-
> > fs/Kconfig | 12 +
> > fs/Makefile | 1 +
> > fs/stats_fs/Makefile | 6 +
> > fs/stats_fs/inode.c | 337 ++++++++++
> > fs/stats_fs/internal.h | 35 +
> > fs/stats_fs/stats_fs-tests.c | 1088 +++++++++++++++++++++++++++++++
> > fs/stats_fs/stats_fs.c | 773 ++++++++++++++++++++++
> > include/linux/kref.h | 11 +
> > include/linux/kvm_host.h | 39 +-
> > include/linux/refcount.h | 2 +
> > include/linux/stats_fs.h | 304 +++++++++
> > include/uapi/linux/magic.h | 1 +
> > lib/refcount.c | 32 +
> > tools/lib/api/fs/fs.c | 21 +
> > virt/kvm/arm/arm.c | 2 +-
> > virt/kvm/kvm_main.c | 314 ++-------
> > 32 files changed, 2772 insertions(+), 382 deletions(-)
> > delete mode 100644 arch/x86/kvm/debugfs.c
> > create mode 100644 arch/x86/kvm/stats_fs.c
> > create mode 100644 fs/stats_fs/Makefile
> > create mode 100644 fs/stats_fs/inode.c
> > create mode 100644 fs/stats_fs/internal.h
> > create mode 100644 fs/stats_fs/stats_fs-tests.c
> > create mode 100644 fs/stats_fs/stats_fs.c
> > create mode 100644 include/linux/stats_fs.h
> >
> > --
> > 2.25.2
> >
> >