Re: [PATCH -tip 22/32] sched: Split the cookie and setup per-task cookie on fork

From: Joel Fernandes
Date: Sun Dec 06 2020 - 12:50:58 EST


Hi Peter,

On Tue, Dec 01, 2020 at 08:34:51PM +0100, Peter Zijlstra wrote:
> On Tue, Dec 01, 2020 at 02:20:28PM -0500, Joel Fernandes wrote:
> > On Wed, Nov 25, 2020 at 12:10:14PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 17, 2020 at 06:19:52PM -0500, Joel Fernandes (Google) wrote:
> > > > +void sched_core_tag_requeue(struct task_struct *p, unsigned long cookie, bool group)
> > > > +{
> > > > + if (!p)
> > > > + return;
> > > > +
> > > > + if (group)
> > > > + p->core_group_cookie = cookie;
> > > > + else
> > > > + p->core_task_cookie = cookie;
> > > > +
> > > > + /* Use up half of the cookie's bits for task cookie and remaining for group cookie. */
> > > > + p->core_cookie = (p->core_task_cookie <<
> > > > + (sizeof(unsigned long) * 4)) + p->core_group_cookie;
> > >
> > > This seems dangerous; afaict there is nothing that prevents cookie
> > > collision.
> >
> > This is fixed in a later patch by Josh "sched: Refactor core cookie into
> > struct" where we are having independent fields for each type of cookie.
>
> So I don't think that later patch is right... That is, it works, but
> afaict it's massive overkill.
>
> COOKIE_CMP_RETURN(task_cookie);
> COOKIE_CMP_RETURN(group_cookie);
> COOKIE_CMP_RETURN(color);
>
> So if task_cookie matches, we consider group_cookie, if that matches we
> consider color.

Yes, the cookie is a compound one. A cookie structure is created here which
has all 3 components. The final cookie value is a pointer to this compound
structure.

>
> Now, afaict that's semantically exactly the same as just using the
> narrowest cookie. That is, use the task cookie if there is, and then,
> walking the cgroup hierarchy (up) pick the first cgroup cookie.
[..]
> Which means you only need a single active cookie field.

Its not the same. The "compounded" cookie is needed to enforce the CGroup
delegation model. This is exactly how both uclamp and cpusets work. For
uclamp, if we take uclamp.min as an example, if the CGroup sets the
uclamp.min of a group to 0, then even if you use the per-task interface
(sched_setattr) for setting the task's clamp - the "effective uclamp.min" of
the task as seen in /proc/pid/sched will still be 0.

Similar thing here, if 2 tasks belong to 2 different CGroups and each group is
tagged with its own tag, then if you use the per-task interface to make the 2
tasks share a core, the "effective" sharing is still such that the tasks will
not share a core -- because the CGroup decided to make it so.

I would like to really maintain this model. Doing it any other way is
confusing - we have already discussed doing it this way before. With that way
you end up failing one interface if another one was already used. Been
there, done that. It sucks a lot.

> IOW, you're just making things complicated and expensive.

The cost of the additional comparisons you mentioned is only in the slow path
(i.e. when someone joins or leaves a group). Once the task_struct's cookie
field is set, the cost is not any more than what it was before.

Any other thoughts?

thanks,

- Joel