Re: [PATCH 17/19] sched: Inherit task cookie on fork()

From: Peter Zijlstra
Date: Wed May 12 2021 - 05:06:06 EST


On Mon, May 10, 2021 at 05:38:18PM -0400, Chris Hyser wrote:
> On 5/10/21 4:47 PM, Joel Fernandes wrote:
> > On Mon, May 10, 2021 at 12:23 PM Chris Hyser <chris.hyser@xxxxxxxxxx> wrote:
>
> > > > > +void sched_core_fork(struct task_struct *p)
> > > > > +{
> > > > > + RB_CLEAR_NODE(&p->core_node);
> > > > > + p->core_cookie = sched_core_clone_cookie(current);
> > > >
> > > > Does this make sense also for !CLONE_THREAD forks?
> > >
> > > Yes. Given the absence of a cgroup interface, fork inheritance
> > > (clone the cookie) is the best way to create shared cookie
> > > hierarchies. The security issue you mentioned was handled in my
> > > original code by setting a unique cookie on 'exec', but Peter took
> > > that out for the reason mentioned above. It was part of the "lets
> > > get this in compromise" effort.

Right, not only that, given all this is moot when parent and child have
the same PTRACE permissions, since if they do, they can inspect one
another's innards anyway, exec()/fork() just fundamentally isn't a
magical boundary we should care about.

The only special case there is SUID exec(), because in that case the
actual credentials change and the PTRACE permissions do actually change.

I sorta had a patch to do that, but it's yuck because that cred change
happens after the point of no return and we need an allocation for the
new cookie. Now, we could rely on the fact that a task context
allocation (GFP_KERNEL) for something as small as our cookie will never
fail and hence we shouldn't be bothered by it, we should do the error
path and yuck.

> > Thanks for sharing the history of it. I guess one can argue that this
> > policy is better to be hardcoded in userspace since core-scheduling
> > can be used for non-security usecases as well. Maybe one could simply
> > call the prctl(2) from userspace if they so desire, before calling
> > exec() ?
>
> I think the defining use case is a container's init. If the cookie is set
> for it by the container creator and without any other user code knowing
> about core_sched, every descendant spawned will have the same cookie and be
> in the same core_sched group much like the cgroup interface had provided. If
> we create a unique cookie in the kernel either on fork or exec, we are
> secure, but we will now have 1000's of core sched groups.
>
> CLEAR was also removed (temporarily, I hope) because a core_sched
> knowledgeable program in the example core_sched container group should not
> be able to remove itself from _all_ core sched groups. It can modify it's
> cookie, but that is no different than the normal case.

Note that much of clear is possible by using SHARE_FROM on your parent
to reset the cookie.

> Both of these beg for a kernel policy, but that discussion was TBD.

Right, I need a Champion that actually cares about cgroups and has
use-cases to go argue with TJ on this. I've proposed code that I think
has sane semantics, but I'm not in a position to argue for it, given I
think a world without cgroups is a better world :-)))