Re: pidfd design

From: Christian Brauner
Date: Mon Mar 25 2019 - 16:45:51 EST


On Mon, Mar 25, 2019 at 10:45:29AM -0700, Linus Torvalds wrote:
> On Fri, Mar 22, 2019 at 11:34 AM Michael Tirado <mtirado418@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 20, 2019 at 8:08 PM Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:
> > >
> > > pidfd code should be backed out immediately. Forget about /proc.
> >
> > Seems like Torvalds just merges this sort of "stuff" without reading
> > it now, or there's something that auto accepted pull request to RC tree?
>
> There is no auto-accept.
>
> But there also didn't seem to be any valid arguments against it, and
> the android people had arguments for it.
>
> Arguing against it based on "I don't like /proc" is pointless. The
> fact is, /proc is our system interface for a lot of things.
>
> Arguing against it based on "I worry about the _other_
> non-signal-sending things that could be done with this" is also
> pointless. What other things? The only thing that got merged was the
> signalling.

To back Linus defense up with a glimpse into the future.

We will not be to rely on dirfds from proc to do general process
management. That is even in the commit message for the pidfd_send_signal
syscall, that we intend to decouple this from procfs, i.e. decouple
process management from process metadata reading.
We have an ongoing discussion and what a lot of people agree upon is
that pidfds will be anon inode file descriptors that stash a reference
to struct pid in their private_data member. They can be pollable if ever
need be and they are just conceptually cleaner and way simpler and
mirror what will happen in the new mount api as well.
The idea is to translate these pidfds e.g. via a simple ioctl()
interface that takes a pidfd and gives back - with standard permissions
applied as are today - a corresponding /proc/<pid> fd that can be used
to read metadata of a process (see the suggestion by Andy and Jann [1]).
The advantage is that this means that pidfd_clone() or something similar
can simply return a pidfd and does not need to care about what procfs
the process is supposed to be located in/reference and is in general way
safer.

But there is absolutely nothing wrong with allowing users to use
/proc/<pid> to signal processes. One of the reasons why I did this is
that it is so intuitive to users that non-kernel people have requested
this be possible over and over.
As mentioned in the orignal patchset the future was always to decouple
this from procfs (see the references in there) and this is what the new
pidctl() syscall is for that transparently translates between the
pid-based api and the pidfd-based api.

[1]: https://lore.kernel.org/lkml/CAG48ez3VMjLJBC_F3BxC2sc2s-28NdsrUduR=jX66XH0w2O-Qg@xxxxxxxxxxxxxx/

>
> Now, arguing that signalling should use the open-time credentials
> might make sense, but this isn't read/write. You can't fool some suid
> program to do magic randon system calls for you, and if you can, then
> arguing about pidfd is kind of pointless.
>
> So the model of using a file descriptor instead of a 'pid' for signal
> handling is actually very unix-like. Maybe that's how pid's should
> have worked to begin with. Remember that whole "everything is a file"
> thing?
>
> Now, the fact that fork() and clone() return a pid obviously means
> that pidfd isn't the primary model (not to decades of just history),
> but that doesn't make pidfd wrong.
>
> And namespace issues etc are all also kind of irrelevant. If you open
> random files in /proc and randomly do pidfd_send_signal() on those,
> you get random results. If that worries you, then DON'T DO THAT THEN,
> for chrissake! That's not a sane model to begin with, but it's not the
> usage model for this, so it's another completely specious argument.
>
> So yes, I thought about the pidfd pull (which was why it happened at
> the very end of the merge window), and I found the arguments against
> it bad.
>
> Linus