Re: [PATCH 0/7] [RFC] kernel: add a netlink interface to get information about processes

From: Andrew Vagin
Date: Thu Feb 19 2015 - 16:39:50 EST


On Wed, Feb 18, 2015 at 05:18:38PM -0800, Andy Lutomirski wrote:
> On Feb 18, 2015 6:27 AM, "Andrew Vagin" <avagin@xxxxxxxxxxxxx> wrote:
> >
> > On Tue, Feb 17, 2015 at 11:05:31AM -0800, Andy Lutomirski wrote:
> > > On Feb 17, 2015 12:40 AM, "Andrey Vagin" <avagin@xxxxxxxxxx> wrote:
> > > >
> > > > Here is a preview version. It provides a restricted set of functionality.
> > > > I would like to collect feedback about this idea.
> > > >
> > > > Currently we use the proc file system, where all information is
> > > > presented in text files, which is convenient for humans. But if we need
> > > > to get information about processes from code (e.g. in C), procfs
> > > > doesn't look so cool.
> > > >
> > > > From code we would prefer to get information in binary format and to be
> > > > able to specify which information is required and for which tasks. Here
> > > > is a new interface with all these features, which is called task_diag.
> > > > In addition it's much faster than procfs.
> > > >
> > > > task_diag is based on netlink sockets and looks like socket-diag, which
> > > > is used to get information about sockets.
> > > >
> > > > A request is described by the task_diag_pid structure:
> > > >
> > > > struct task_diag_pid {
> > > >         __u64 show_flags; /* specify which information are required */
> > > >         __u64 dump_stratagy; /* specify a group of processes */
> > > >
> > > >         __u32 pid;
> > > > };
> > > >
> > > > A response is a set of netlink messages. Each message describes one task.
> > > > All task properties are divided into groups. A message contains the
> > > > TASK_DIAG_MSG group and other groups if they have been requested in
> > > > show_flags. For example, if show_flags contains TASK_DIAG_SHOW_CRED, a
> > > > response will contain the TASK_DIAG_CRED group which is described by the
> > > > task_diag_creds structure.
> > > >
> > > > struct task_diag_msg {
> > > >         __u32 tgid;
> > > >         __u32 pid;
> > > >         __u32 ppid;
> > > >         __u32 tpid;
> > > >         __u32 sid;
> > > >         __u32 pgid;
> > > >         __u8 state;
> > > >         char comm[TASK_DIAG_COMM_LEN];
> > > > };
> > > >
> > > > Another good feature of task_diag is the ability to request information
> > > > for a few processes. Currently there are two strategies:
> > > > TASK_DIAG_DUMP_ALL - get information for all tasks
> > > > TASK_DIAG_DUMP_CHILDREN - get information for children of a specified
> > > > task
> > > >
> > > > task_diag is much faster than the proc file system. We don't need to
> > > > create a new file descriptor for each task. We only need to send a request
> > > > and get a response. This allows getting information for several tasks in
> > > > one request-response iteration.
> > > >
> > > > I have compared performance of procfs and task-diag for the
> > > > "ps ax -o pid,ppid" command.
> > > >
> > > > A test stand contains 10348 processes.
> > > > $ ps ax -o pid,ppid | wc -l
> > > > 10348
> > > >
> > > > $ time ps ax -o pid,ppid > /dev/null
> > > >
> > > > real 0m1.073s
> > > > user 0m0.086s
> > > > sys 0m0.903s
> > > >
> > > > $ time ./task_diag_all > /dev/null
> > > >
> > > > real 0m0.037s
> > > > user 0m0.004s
> > > > sys 0m0.020s
> > > >
> > > > And here are statistics about syscalls which were called by each
> > > > command.
> > > > $ perf stat -e syscalls:sys_exit* -- ps ax -o pid,ppid 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > > 20,713 syscalls:sys_exit_open
> > > > 20,710 syscalls:sys_exit_close
> > > > 20,708 syscalls:sys_exit_read
> > > > 10,348 syscalls:sys_exit_newstat
> > > > 31 syscalls:sys_exit_write
> > > >
> > > > $ perf stat -e syscalls:sys_exit* -- ./task_diag_all 2>&1 | grep syscalls | sort -n -r | head -n 5
> > > > 114 syscalls:sys_exit_recvfrom
> > > > 49 syscalls:sys_exit_write
> > > > 8 syscalls:sys_exit_mmap
> > > > 4 syscalls:sys_exit_mprotect
> > > > 3 syscalls:sys_exit_newfstat
> > > >
> > > > You can find the test program from this experiment in the last patch.
> > > >
> > > > The idea of this functionality was suggested by Pavel Emelyanov
> > > > (xemul@), when he found that operations on /proc form a significant
> > > > part of checkpointing time.
> > > >
> > > > Ten years ago there was an attempt to add a netlink interface for
> > > > accessing /proc information:
> > > > http://lwn.net/Articles/99600/
> > >
> > > I don't suppose this could use real syscalls instead of netlink. If
> > > nothing else, netlink seems to conflate pid and net namespaces.
> >
> > What do you mean by "conflate pid and net namespaces"?
>
> A netlink socket is bound to a network namespace, but you should be
> returning data specific to a pid namespace.

That is a good question. When we mount a procfs instance, the current
pidns is saved in its superblock. If we then read data from
this procfs from another pidns, we will see pids from the pidns where
this procfs was mounted.

$ unshare -p -- bash -c '(bash)'
$ cat /proc/self/status | grep ^Pid:
Pid: 15770
$ echo $$
1
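
The same mismatch can be checked from C. This is only a minimal sketch:
parse_status_pid() and proc_self_pid() are helper names made up for this
mail, and it assumes the usual "Pid:<tab><number>" line format of
/proc/<pid>/status.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the value of the "Pid:" line from text in
 * /proc/<pid>/status format, or -1 if there is no such line. */
static long parse_status_pid(const char *text)
{
	const char *line = text;

	while (line && *line) {
		/* Match only at the start of a line, so "PPid:" is skipped. */
		if (strncmp(line, "Pid:", 4) == 0)
			return strtol(line + 4, NULL, 10);
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Hypothetical helper: the pid that procfs reports for the caller,
 * or -1 if /proc/self/status cannot be read. */
static long proc_self_pid(void)
{
	char buf[4096];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_status_pid(buf);
}
```

When procfs was mounted from the caller's own pidns, proc_self_pid() should
match getpid(). In a setup like the one above the two would be expected to
differ.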

The situation is similar with socket_diag. A socket_diag socket is bound to a
network namespace. If we open a socket_diag socket and then change the network
namespace, it will still return information about the initial netns.

In this version I always use the current pid namespace.
But to be consistent with other kernel logic, a task_diag socket has to be
linked with the pidns where it was created.

>
> On a related note, how does this interact with hidepid? More

Currently it always works like procfs with hidepid=2 (the highest level of
security).

> generally, what privileges are you requiring to obtain what data?

It dumps information about a task only if
ptrace_may_access(tsk, PTRACE_MODE_READ) returns true.
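
To make the effect of that filter concrete, here is a much-simplified
userspace model. It is not the kernel code: struct cred_model,
may_read_model() and count_visible() are names invented for this mail, only
the uid/gid comparison and a CAP_SYS_PTRACE override are modelled, and the
dumpable and LSM checks are left out.

```c
#include <stdbool.h>

/* Simplified, hypothetical model of a task's credentials. */
struct cred_model {
	unsigned int uid, euid, suid;
	unsigned int gid, egid, sgid;
	bool cap_sys_ptrace;	/* meaningful for the caller only */
};

/* Rough model of the uid/gid part of a PTRACE_MODE_READ check.
 * Dumpability and LSM hooks are deliberately not modelled. */
static bool may_read_model(const struct cred_model *caller,
			   const struct cred_model *target)
{
	if (caller->uid == target->uid &&
	    caller->uid == target->euid &&
	    caller->uid == target->suid &&
	    caller->gid == target->gid &&
	    caller->gid == target->egid &&
	    caller->gid == target->sgid)
		return true;
	return caller->cap_sys_ptrace;
}

/* Count the tasks a dump would report to this caller: tasks that fail
 * the check are skipped silently, as with hidepid=2. */
static int count_visible(const struct cred_model *caller,
			 const struct cred_model *tasks, int n)
{
	int i, visible = 0;

	for (i = 0; i < n; i++)
		if (may_read_model(caller, &tasks[i]))
			visible++;
	return visible;
}
```

In this model an unprivileged user sees only its own tasks, while a caller
with CAP_SYS_PTRACE sees all of them.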

>
> >
> > >
> > > Also, using an asynchronous interface (send, poll?, recv) for
> > > something that's inherently synchronous (ask the kernel a local
> > > question) seems awkward to me.
> >
> > Actually all requests are handled synchronously. We call sendmsg to send
> > a request and it is handled in this syscall.
> >  2)               |  netlink_sendmsg() {
> >  2)               |    netlink_unicast() {
> >  2)               |      taskdiag_doit() {
> >  2)   2.153 us    |        task_diag_fill();
> >  2)               |        netlink_unicast() {
> >  2)   0.185 us    |          netlink_attachskb();
> >  2)   0.291 us    |          __netlink_sendskb();
> >  2)   2.452 us    |        }
> >  2) + 33.625 us   |      }
> >  2) + 54.611 us   |    }
> >  2) + 76.370 us   |  }
> >  2)               |  netlink_recvmsg() {
> >  2)   1.178 us    |    skb_recv_datagram();
> >  2) + 46.953 us   |  }
> >
> > If we request information for a group of tasks (NLM_F_DUMP), the first
> > portion of data is filled in by the sendmsg syscall. Then, when we read
> > it, the kernel fills in the next portion.
> >
> >  3)               |  netlink_sendmsg() {
> >  3)               |    __netlink_dump_start() {
> >  3)               |      netlink_dump() {
> >  3)               |        taskdiag_dumpid() {
> >  3)   0.685 us    |          task_diag_fill();
> > ...
> >  3)   0.224 us    |          task_diag_fill();
> >  3) + 74.028 us   |        }
> >  3) + 88.757 us   |      }
> >  3) + 89.296 us   |    }
> >  3) + 98.705 us   |  }
> >  3)               |  netlink_recvmsg() {
> >  3)               |    netlink_dump() {
> >  3)               |      taskdiag_dumpid() {
> >  3)   0.594 us    |        task_diag_fill();
> > ...
> >  3)   0.242 us    |        task_diag_fill();
> >  3) + 60.634 us   |      }
> >  3) + 72.803 us   |    }
> >  3) + 88.005 us   |  }
> >  3)               |  netlink_recvmsg() {
> >  3)               |    netlink_dump() {
> >  3)   2.403 us    |      taskdiag_dumpid();
> >  3) + 26.236 us   |    }
> >  3) + 40.522 us   |  }
> >  0) + 20.407 us   |  netlink_recvmsg();
> >
> >
> > netlink is really good for this type of task. It allows us to create an
> > extensible interface which can be easily customized for different needs.
> >
> > I don't think that we would want to create another similar interface
> > just to be independent of the network subsystem.
>
> I guess this is a bit streamy in that you ask one question and get
> multiple answers.

It's like seq_file in procfs. The kernel allocates a buffer, fills
it, copies it into userspace, fills it again, and so on.
And we can read the data from the file in portions.
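
On the netlink side, each recv() on a dump returns one such portion: a
buffer holding several netlink messages, with NLMSG_DONE closing the last
portion. Below is a minimal sketch of walking one portion with the standard
macros from <linux/netlink.h>. walk_dump_chunk() is a made-up name, and the
payload of each message is ignored here.

```c
#include <linux/netlink.h>

/* Hypothetical helper: walk one received buffer of netlink messages.
 * Returns the number of data messages found, or -1 on NLMSG_ERROR.
 * *done is set to 1 if NLMSG_DONE was seen, which ends the dump. */
static int walk_dump_chunk(void *buf, int len, int *done)
{
	struct nlmsghdr *nlh = buf;
	int count = 0;

	*done = 0;
	for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE) {
			*done = 1;
			break;
		}
		if (nlh->nlmsg_type == NLMSG_ERROR)
			return -1;	/* treated as a failure in this sketch */
		count++;		/* one message, e.g. one task */
	}
	return count;
}
```

A caller would loop on recv(), pass each buffer to this function, and stop
once *done is set.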

Actually, here is one more analogy. When we open a file in procfs,
we send a request to the kernel, and the file path is the request body in
this case. But with procfs we can't construct requests; we only
have a set of predefined requests.
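
A small sketch of that contrast, under some assumptions: the structure
mirrors task_diag_pid from the cover letter (with fixed-width types from
<stdint.h>), while EXAMPLE_SHOW_CRED and EXAMPLE_DUMP_CHILDREN are
placeholders with invented values standing in for TASK_DIAG_SHOW_CRED and
TASK_DIAG_DUMP_CHILDREN. proc_request() and diag_request() are helper names
made up for this mail.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Placeholder values, invented for this example only. */
#define EXAMPLE_SHOW_CRED	(1ULL << 0)
#define EXAMPLE_DUMP_CHILDREN	1

/* Mirrors the request structure from the cover letter. */
struct task_diag_pid {
	uint64_t show_flags;	/* which information is required */
	uint64_t dump_stratagy;	/* which group of processes */
	uint32_t pid;
};

/* procfs: the "request" is one of a fixed set of paths. */
static int proc_request(char *buf, size_t len, int pid)
{
	return snprintf(buf, len, "/proc/%d/status", pid);
}

/* task_diag: the caller composes what to show and for which tasks. */
static struct task_diag_pid diag_request(uint32_t pid, uint64_t show_flags,
					 uint64_t strategy)
{
	struct task_diag_pid req;

	memset(&req, 0, sizeof(req));
	req.show_flags = show_flags;
	req.dump_stratagy = strategy;
	req.pid = pid;
	return req;
}
```

With procfs the only free parameter is the pid in the path. With the
structure, the same request can name both the groups to return and the set
of tasks to walk.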

>
> >
> > Thanks,
> > Andrew
> >
> > >
> > > --Andy