Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

From: Mathieu Desnoyers
Date: Tue Mar 03 2020 - 11:58:50 EST


----- On Mar 3, 2020, at 1:57 AM, Greg Kroah-Hartman gregkh@xxxxxxxxxxxxxxxxxxx wrote:

> On Mon, Mar 02, 2020 at 03:17:07PM -0500, Mathieu Desnoyers wrote:
>> ----- On Mar 2, 2020, at 2:39 PM, Greg Kroah-Hartman gregkh@xxxxxxxxxxxxxxxxxxx
>> wrote:
>>
>> > On Mon, Mar 02, 2020 at 08:36:58PM +0100, Greg Kroah-Hartman wrote:
[...]
>> >>
>> >> I hate to ask, but why does anyone need block_class? or disk_name or
>> >> disk_type? I need to put them behind a driver core namespace or
>> >> something soon...
>> >
>>
>> In LTTng, we have a "statedump" which dumps all the relevant state of
>> the kernel at trace start (or when the user asks for it) into the
>> kernel trace. It is used to collect information which helps translating
>> compact numeric data into human-readable information at post-processing.
>>
>> For block devices, the statedump contains the mapping between the
>> device number and the disk name [1]. It uses the "block_class",
>> "disk_name", and "disk_type" symbols for this. Here is the
>> post-processing output:
>>
>> [14:48:41.388934812] (+?.?????????) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 1048576, diskname = "ram0" }
>> [...]
>> [14:48:41.442548745] (+0.003574998) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 1048591, diskname = "ram15" }
>> [14:48:41.446064977] (+0.003516232) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289728, diskname = "vda" }
>> [14:48:41.449579781] (+0.003514804) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289729, diskname = "vda1" }
>> [14:48:41.453113808] (+0.003534027) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289744, diskname = "vdb" }
>> [14:48:41.456640876] (+0.003527068) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289745, diskname = "vdb1" }
>>
>> This information is then used in our I/O analyses to show information
>> comprehensible to a user.
>
> But all of that is available to you today in userspace, why dig through
> random kernel symbols?
>
> Look in /sys/dev/block/ or in /sys/block/ for all of that information.
> Is there something that you can only find by the internal symbols that
> is not present today in sysfs?

There is indeed additional information that we get by iterating directly
over the data structures and emitting tracepoints from within the
kernel, and that is lost when the data is copied to user-space: the
time-stamp at which the data is observed.

Please note that the statedump approach is applied not only to block
devices, but also to namespaces, thread scheduling, process memory
mappings, file descriptor tables, interrupt handlers, network
interfaces, and CPU topology. Those are more or less long-running states
which can change dynamically while a trace is being recorded. Our trace
post-processing tools model the overall kernel state over time by
reconciling tracepoints tracking all state changes (e.g.
insertions/removals) with the statedump information. The time-stamps
available for both state-change events and statedump events are what
allow us to do this modeling.

In the case of block/genhd.c, we care about the mapping between the
device number and the disk name, which can change dynamically and is
thus valid only for a given time-range in the trace.

What we are trying to do is gather all the relevant information needed
for trace analyses to re-create a precise model of the kernel through
time. In the case of genhd, that would be (device number, name) tuples,
each valid for a given time-range.

These are recorded by the "lttng_statedump_block_device" event, which
has a timestamp generated by the LTTng tracer, along with the
(device_nr, disk_name) event payload. You will notice from what I stated
above that we are missing information about disks being
registered/unregistered later in the trace. Ideally, we would also like
to add tracepoints into register_disk() and del_gendisk() (or more
relevant functions nearby) to trace those state changes. We have those
state-change tracking tracepoints in other subsystems, but unfortunately
not in the block layer.
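
As a rough sketch of the reconciliation described above (Python, not
taken from the LTTng analyses): a statedump record opens a validity
interval for a (device_nr, disk_name) pair, and state-change events
close or replace it. The "register" and "unregister" event kinds are
hypothetical, since those tracepoints do not exist in the block layer
today, and the function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class Interval:
    name: str
    start: float
    end: Optional[float] = None  # None: still valid at the end of the trace


# (timestamp, kind, dev, name); kind is "statedump", "register" or
# "unregister". The last two are hypothetical state-change events.
Event = Tuple[float, str, int, str]


def build_intervals(events: Iterable[Event]) -> Dict[int, List[Interval]]:
    """Build per-device validity intervals from time-ordered events."""
    timeline: Dict[int, List[Interval]] = {}
    for ts, kind, dev, name in events:
        spans = timeline.setdefault(dev, [])
        current = spans[-1] if spans and spans[-1].end is None else None
        if kind in ("statedump", "register"):
            if current is not None and current.name == name:
                continue  # the statedump only confirms the known mapping
            if current is not None:
                current.end = ts  # the name changed: close the old interval
            spans.append(Interval(name, ts))
        elif kind == "unregister" and current is not None:
            current.end = ts
    return timeline


def lookup(timeline: Dict[int, List[Interval]], dev: int,
           ts: float) -> Optional[str]:
    """Return the disk name valid for dev at time ts, or None if unknown."""
    for span in timeline.get(dev, []):
        if span.start <= ts and (span.end is None or ts < span.end):
            return span.name
    return None
```

A lookup before the first statedump record, or between an unregister
and the next register, returns None rather than guessing a name.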

This information is then used at post-processing to augment the block
events (see include/trace/events/block.h) by converting the device
numbers to human-readable strings. It is also used to provide
human-readable row, column and graph labels when presenting block I/O
state over time, and aggregated summaries, on a per-disk basis.
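
For illustration, a minimal Python sketch of that conversion. It
assumes the "dev" field carries the kernel-internal dev_t encoding
(major << 20 | minor, as in include/linux/kdev_t.h), which is consistent
with the statedump values quoted earlier in this thread (1048576 is 1:0,
265289728 is 253:0). The helper names are hypothetical.

```python
MINORBITS = 20  # same value as MINORBITS in include/linux/kdev_t.h
MINORMASK = (1 << MINORBITS) - 1

# Mapping taken from the statedump output quoted earlier in this thread.
NAMES = {
    1048576: "ram0",
    1048591: "ram15",
    265289728: "vda",
    265289729: "vda1",
    265289744: "vdb",
    265289745: "vdb1",
}


def split_dev(dev):
    """Split a kernel-internal dev_t value into (major, minor)."""
    return dev >> MINORBITS, dev & MINORMASK


def label(dev, names=NAMES):
    """Return the disk name if known, else fall back to 'major:minor'."""
    major, minor = split_dev(dev)
    return names.get(dev, f"{major}:{minor}")
```

The fallback to "major:minor" keeps block events readable even when the
statedump did not cover a device.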

If we instead went through sysfs to export this block device
information, we would end up copying the relevant information to
user-space, and only then writing the data into a trace buffer. There
would be a delay between the point at which the state is observed within
the kernel and the sampling of the time-stamp describing when it was
observed. This introduces many interesting race scenarios where disk
registration/unregistration happens in parallel with a statedump and
with block device activity emitting tracepoints into the trace. We would
lose the ability to provide a reliable time-range for which the
(device_nr, disk_name) mapping is valid. Going through uevent also lacks
time-stamp consistency with the block tracepoint events.
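
A toy Python illustration of that race, with made-up time-stamps and
disk names: any state change which lands between the moment the state
is observed and the moment the record is time-stamped cannot be ordered
reliably against the statedump record. The function name is
hypothetical.

```python
def changes_in_window(observed_at, stamped_at, changes):
    """Return the state changes falling in [observed_at, stamped_at).

    These are the changes whose ordering relative to the statedump
    record is ambiguous for a post-processing tool.
    """
    return [c for c in changes if observed_at <= c[0] < stamped_at]


# Made-up scenario: a disk is replaced shortly after being observed.
changes = [(105.0, "unregister", "sda"), (107.0, "register", "sdb")]

# In-kernel statedump: observation and time-stamp coincide.
in_kernel = changes_in_window(100.0, 100.0, changes)

# Through sysfs: state read at 100.0, record time-stamped at 110.0.
via_sysfs = changes_in_window(100.0, 110.0, changes)
```

With the in-kernel statedump the window is empty, whereas in the sysfs
case both changes fall inside it, so the record claims a mapping which
was no longer valid at its time-stamp.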

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com