Re: [rfc] object collection tracing (was: [PATCH 5/5] proc: exportmore page flags in /proc/kpageflags)

From: Frederic Weisbecker
Date: Tue May 12 2009 - 09:01:43 EST


On Tue, Apr 28, 2009 at 09:31:08PM +0800, Wu Fengguang wrote:
> On Tue, Apr 28, 2009 at 08:17:51PM +0800, Ingo Molnar wrote:
> > tent-Transfer-Encoding: quoted-printable
> > Status: RO
> > Content-Length: 5480
> > Lines: 161
> >
> >
> > * Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
> >
> > > > The above 'get object state' interface (which allows passive
> > > > sampling) - integrated into the tracing framework - would serve
> > > > that goal, agreed?
> > >
> > > Agreed. That could in theory a good complement to dynamic
> > > tracings.
> > >
> > > Then what will be the canonical form for all the 'get object
> > > state' interfaces - "object.attr=value", or whatever? [...]
> >
> > Lemme outline what i'm thinking of.
> >
> > I'd call the feature "object collection tracing", which would live
> > in /debug/tracing, accessed via such files:
> >
> > /debug/tracing/objects/mm/pages/
> > /debug/tracing/objects/mm/pages/format
> > /debug/tracing/objects/mm/pages/filter
> > /debug/tracing/objects/mm/pages/trace_pipe
> > /debug/tracing/objects/mm/pages/stats
> > /debug/tracing/objects/mm/pages/events/
> >
> > here's the (proposed) semantics of those files:
> >
> > 1) /debug/tracing/objects/mm/pages/
> >
> > There's a subsystem / object basic directory structure to make it
> > easy and intuitive to find our way around there.
> >
> > 2) /debug/tracing/objects/mm/pages/format
> >
> > the format file:
> >
> > /debug/tracing/objects/mm/pages/format
> >
> > Would reuse the existing dynamic-tracepoint structured-logging
> > descriptor format and code (this is upstream already):
> >
> > [root@phoenix sched_signal_send]# pwd
> > /debug/tracing/events/sched/sched_signal_send
> >
> > [root@phoenix sched_signal_send]# cat format
> > name: sched_signal_send
> > ID: 24
> > format:
> > field:unsigned short common_type; offset:0; size:2;
> > field:unsigned char common_flags; offset:2; size:1;
> > field:unsigned char common_preempt_count; offset:3; size:1;
> > field:int common_pid; offset:4; size:4;
> > field:int common_tgid; offset:8; size:4;
> >
> > field:int sig; offset:12; size:4;
> > field:char comm[TASK_COMM_LEN]; offset:16; size:16;
> > field:pid_t pid; offset:32; size:4;
> >
> > print fmt: "sig: %d task %s:%d", REC->sig, REC->comm, REC->pid
> >
> > These format descriptors enumerate fields, types and sizes, in a
> > structured way that user-space tools can parse easily. (The binary
> > records that come from the trace_pipe file follow this format
> > description.)
> >
> > 3) /debug/tracing/objects/mm/pages/filter
> >
> > This is the tracing filter that can be set based on the 'format'
> > descriptor. So with the above (signal-send tracepoint) you can
> > define such filter expressions:
> >
> > echo "(sig == 10 && comm == bash) || sig == 13" > filter
> >
> > To restrict the 'scope' of the object collection along pretty much
> > any key or combination of keys. (Or you can leave it as it is and
> > dump all objects and do keying in user-space.)
> >
> > [ Using in-kernel filtering is obviously faster that streaming it
> > out to user-space - but there might be details and types of
> > visualization you want to do in user-space - so we dont want to
> > restrict things here. ]
> >
> > For the mm object collection tracepoint i could imagine such filter
> > expressions:
> >
> > echo "type == shared && file == /sbin/init" > filter
> >
> > To dump all shared pages that are mapped to /sbin/init.
> >
> > 4) /debug/tracing/objects/mm/pages/trace_pipe
> >
> > The 'trace_pipe' file can be used to dump all objects in the
> > collection, which match the filter ('all objects' by default). The
> > record format is described in 'format'.
> >
> > trace_pipe would be a reuse of the existing trace_pipe code: it is a
> > modern, poll()-able, read()-able, splice()-able pipe abstraction.
> >
> > 5) /debug/tracing/objects/mm/pages/stats
> >
> > The 'stats' file would be a reuse of the existing histogram code of
> > the tracing code. We already make use of it for the branch tracers
> > and for the workqueue tracer - it could be extended to be applicable
> > to object collections as well.
> >
> > The advantage there would be that there's no dumping at all - all
> > the integration is done straight in the kernel. ( The 'filter'
> > condition is listened to - increasing flexibility. The filter file
> > could perhaps also act as a default histogram key. )
> >
> > 6) /debug/tracing/objects/mm/pages/events/
> >
> > The 'events' directory offers links back to existing dynamic
> > tracepoints that are under /debug/tracing/events/. This would serve
> > as an additional coherent force that keeps dynamic tracepoints
> > collected by subsystem and by object type as well. (Tools could make
> > use of this information as well - without being aware of actual
> > object semantics.)
> >
> >
> > There would be a number of other object collections we could
> > enumerate:
> >
> > tasks:
> >
> > /debug/tracing/objects/sched/tasks/
> >
> > active inodes known to the kernel:
> >
> > /debug/tracing/objects/fs/inodes/
> >
> > interrupts:
> >
> > /debug/tracing/objects/hw/irqs/
> >
> > etc.
> >
> > These would use the same 'object collection' framework. Once done we
> > can use it for many other thing too.
> >
> > Note how organically integrated it all is with the tracing
> > framework. You could start from an 'object view' to get an overview
> > and then go towards a more dynamic view of specific object
> > attributes (or specific objects), as you drill down on a specific
> > problem you want to analyze.
> >
> > How does this all sound to you?
>
> Great! I saw much opportunity to adapt the not yet submitted
> /proc/filecache interface to the proposed framework.
>
> Its basic form is:
>
> # ino size cached cached% refcnt state age accessed process dev file
> [snip]
> 320 1 4 100 1 D- 50443 1085 udevd 00:11(tmpfs) /.udev/uevent_seqnum
> 460725 123 124 100 35 -- 50444 6795 touch 08:02(sda2) /lib/libpthread-2.9.so
> 460727 31 32 100 14 -- 50444 2007 touch 08:02(sda2) /lib/librt-2.9.so
> 458865 97 80 82 1 -- 50444 49 mount 08:02(sda2) /lib/libdevmapper.so.1.02.1
> 460090 15 16 100 1 -- 50444 48 mount 08:02(sda2) /lib/libuuid.so.1.2
> 458866 46 48 100 1 -- 50444 47 mount 08:02(sda2) /lib/libblkid.so.1.0
> 460732 43 44 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_nis-2.9.so
> 460739 87 88 100 73 -- 50444 3597 rcS 08:02(sda2) /lib/libnsl-2.9.so
> 460726 31 32 100 69 -- 50444 3581 rcS 08:02(sda2) /lib/libnss_compat-2.9.so
> 458804 250 252 100 11 -- 50445 8175 rcS 08:02(sda2) /lib/libncurses.so.5.6
> 229540 780 752 96 3 -- 50445 7594 init 08:02(sda2) /bin/bash
> 460735 15 16 100 89 -- 50445 17581 init 08:02(sda2) /lib/libdl-2.9.so
> 460721 1344 1340 99 117 -- 50445 48732 init 08:02(sda2) /lib/libc-2.9.so
> 458801 107 104 97 24 -- 50445 3586 init 08:02(sda2) /lib/libselinux.so.1
> 671870 37 24 65 1 -- 50446 1 swapper 08:02(sda2) /sbin/init
> 175 1 24412 100 1 -- 50446 0 swapper 00:01(rootfs) /dev/root
>
> The patch basically does a traversal through one or more of the inode
> lists to produce the output:
> inode_in_use
> inode_unused
> sb->s_dirty
> sb->s_io
> sb->s_more_io
> sb->s_inodes
>
> The filtering feature is a necessity for this interface - or it will
> take considerable time to do a full listing. It supports the following
> filters:
> { LS_OPT_DIRTY, "dirty" },
> { LS_OPT_CLEAN, "clean" },
> { LS_OPT_INUSE, "inuse" },
> { LS_OPT_EMPTY, "empty" },
> { LS_OPT_ALL, "all" },
> { LS_OPT_DEV, "dev=%s" },
>
> There are two possible challenges for the conversion:
>
> - One trick it does is to select different lists to traverse on
> different filter options. Will this be possible in the object
> tracing framework?



Yeah, I guess.



> - The file name lookup(last field) is the performance killer. Is it
> possible to skip the file name lookup when the filter failed on the
> leading fields?


objects collection lays on trace events where filters basically ignore
a whole entry in case of non-matching. Not sure if we can easily only
ignore one field.

But I guess we can do something about the performances...

Could you send us the (sob'ed) patch you made which implements this.
I could try to adapt it to object collection.

Thanks,
Frederic.


> Will the object tracing interface allow such flexibilities?
> (Sorry I'm not yet familiar with the tracing framework.)
>
> > Can you see any conceptual holes in the scheme, any use-case that
> > /proc/kpageflags supports but the object collection approach does
> > not?
>
> kpageflags is simply a big (perhaps sparse) binary array.
> I'd still prefer to retain its current form - the kernel patches and
> user space tools are all ready made, and I see no benefits in
> converting to the tracing framework.
>
> > Would you be interested in seeing something like this, if we tried
> > to implement it in the tracing tree? The majority of the code
> > already exists, we just need interest from the MM side and we have
> > to hook it all up. (it is by no means trivial to do - but looks like
> > a very exciting feature.)
>
> Definitely! /proc/filecache has another 'page view':
>
> # head /proc/filecache
> # file /bin/bash
> # flags R:referenced A:active M:mmap U:uptodate D:dirty W:writeback X:readahead P:private O:owner b:buffer d:dirty w:writeback
> # idx len state refcnt
> 0 1 RAMU________ 4
> 3 8 RAMU________ 4
> 12 1 RAMU________ 4
> 14 5 RAMU________ 4
> 20 7 RAMU________ 4
> 27 2 RAMU________ 5
> 29 1 RAMU________ 4
>
> Which is also a good candidate. However I still need to investigate
> whether it offers considerable margins over the mincore() syscall.
>
> Thanks and Regards,
> Fengguang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/