Re: Re: [PATCH] fs/proc: introduce /proc/stat2 file

From: Alexey Dobriyan
Date: Mon Oct 29 2018 - 20:09:32 EST

On Mon, Oct 29, 2018 at 11:40:47PM +0000, Daniel Colascione wrote:
> On Mon, Oct 29, 2018 at 11:34 PM, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:
> >> I'd much rather move to a model in which userspace *explicitly* tells
> >> the kernel which fields it wants, with the kernel replying with just
> >> those particular fields, maybe in their raw binary representations.
> >> The ASCII-text bag-of-everything files would remain available for
> >> ad-hoc and non-performance critical use, but programs that cared about
> >> performance would have an efficient bypass. One concrete approach is
> >> to let users open up today's proc files and, instead of read(2)ing a
> >> text blob, use an ioctl to retrieve specified and targeted information
> >> of the sort that would normally be encoded in the text blob. Because
> >> callers would open the same file when using either the text or binary
> >> interfaces, little would have to change, and it'd be easy to implement
> >> fallbacks when a particular system doesn't support a particular
> >> fast-path ioctl.
> >
> > You've just reinvented system calls.
>
> I don't know why you say so. There are important benefits that come
> from using an ioctl on a proc file FD instead of a plain system call.
> Procfs files have file permissions, auditing, SCM_RIGHTS-ability, PID
> race immunity, and other things that you wouldn't get from a plain
> "get this information about this PID" system call.

This whole thread started because /proc/stat is slow and every number in
/proc/stat is system global.

If you continue adding stuff to /proc, one day someone will notice that
core VFS adds considerable overhead, and at that point there is nothing
anyone can do.

I'd strongly advise looking at what this DB actually needs and delivering
just that.
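
To illustrate what a consumer does today, here is a minimal sketch of
pulling one number out of the /proc/stat text. The helper names
slurp() and stat_field() are hypothetical, and which field the DB
actually wants is an assumption; "ctxt" is only an example. The point
is that the whole file has to be generated and read even when one
value is needed.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to cap-1 bytes of a file into buf and
 * NUL-terminate it. Returns the byte count, or -1 on error. */
ssize_t slurp(const char *path, char *buf, size_t cap)
{
	size_t off = 0;
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while (off < cap - 1 && (n = read(fd, buf + off, cap - 1 - off)) > 0)
		off += (size_t)n;
	close(fd);
	buf[off] = '\0';
	return (ssize_t)off;
}

/* Hypothetical helper: find the line that starts with "key " and parse
 * the first number after it. Returns 0 on success, -1 if not found. */
int stat_field(const char *buf, const char *key, unsigned long long *out)
{
	size_t klen = strlen(key);
	const char *line = buf;

	while (line && *line) {
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
			const char *start = line + klen + 1;
			char *end;

			errno = 0;
			unsigned long long v = strtoull(start, &end, 10);
			if (errno || end == start)
				return -1;
			*out = v;
			return 0;
		}
		/* Advance to the character after the next newline. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would slurp("/proc/stat", buf, sizeof(buf)) and then call
stat_field(buf, "ctxt", &v); every other line in the file is paid for
and thrown away.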

Very few of those other things apply to /proc/stat:
* system call auditing exists,
* /proc/stat is world readable and continues to be so,
* thus passing a descriptor around is pretty useless,
* $PID race doesn't apply.

Additionally, passing descriptors feels like a party trick.
I suspect that's not how people use statistics in /proc: they run
processes and one sufficiently privileged monitoring daemon collects
the data; otherwise userspace would need to cooperate with the
monitoring userspace, which of course doesn't happen.

PID race is solved by giving out descriptors which pin "struct pid".
Which is how the race is solved currently: dentry pins inode, inode
pins "struct pid".
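
From userspace, relying on that pinning looks roughly like the sketch
below: open /proc/<pid> once and do later lookups relative to that
descriptor with openat(2). The helper names proc_dir_open() and
proc_dir_read() are hypothetical. The expectation, stated here as an
assumption rather than a guarantee, is that once the process exits,
lookups through the held descriptor fail with an error instead of
resolving to a new process that reused the PID number.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open /proc/<pid> as a directory descriptor.
 * Holding this fd is what keeps later lookups tied to this process. */
int proc_dir_open(pid_t pid)
{
	char path[32];

	snprintf(path, sizeof(path), "/proc/%ld", (long)pid);
	return open(path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Hypothetical helper: read a file such as "stat" relative to the held
 * directory fd. Returns the byte count, or -1 on error. */
ssize_t proc_dir_read(int dirfd, const char *name, char *buf, size_t cap)
{
	ssize_t n;
	int fd = openat(dirfd, name, O_RDONLY | O_CLOEXEC);

	if (fd < 0)
		return -1;
	n = read(fd, buf, cap - 1);
	close(fd);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}
```

None of this helps with /proc/stat itself, which has no $PID in its
path.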