Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer

From: Roman Gushchin
Date: Tue Nov 13 2018 - 15:55:28 EST


On Tue, Nov 13, 2018 at 11:15:41AM -0800, Tejun Heo wrote:
> Hello, Roman.
>
> On Tue, Nov 13, 2018 at 06:47:55PM +0000, Roman Gushchin wrote:
> > > > + /* Should the cgroup and its descendants be frozen. */
> > > > + bool freeze;
> > >
> > > Why not have this in freezer too?
> >
> > I thought that this variable is just the state of the cgroup.freeze knob,
> > where the freezer field contains the internal state of the freezer, and
> > can in theory be allocated dynamically.
> >
> > Not a strong preference, I can move it there too, if you prefer to.
>
> Yeah, let's just put it together.
>
> > > > +void cgroup_freezer_enter(void);
> > > > +void cgroup_freezer_leave(void);
> > >
> > > So, if we use freeze, freezing, frozen instead, the aboves can be
> > > cgroup_frozen_enter() and cgroup_frozen_leave() (or begin/end).
> >
> > Idk, maybe cgroup_enter_frozen()/cgroup_leave_frozen() ?
>
> Sure.
>
> > > > + /* task is in the cgroup freezer loop */
> > >
> > > The above comment isn't strictly true, right?
> >
> > Why so?
> >
> > It actually means that the task is looping somewhere in the signal delivery loop
> > after entering cgroup_freezer_enter() and before cgroup_freezer_leave().
> >
> > Maybe simple "task is frozen by the cgroup freezer"?
>
> Yeah, sounds good.
>
> > > > @@ -5642,6 +5700,23 @@ void cgroup_post_fork(struct task_struct *child)
> > > > cset->nr_tasks++;
> > > > css_set_move_task(child, NULL, cset, false);
> > > > }
> > > > +
> > > > + if (unlikely(cgroup_frozen(child) &&
> > > > + (child->flags & ~PF_KTHREAD))) {
> > >
> > > I don't think we need explicit PF_KTHREAD test here. We don't allow
> > > kthreads in non-root cgroups anyway and if we wanna change that there
> > > are a bunch of other things which need updating anyway.
> >
> > Don't we? I think we do. I've proposed a patch to fix this some time ago
> > (https://lkml.org/lkml/2017/10/12/556), but was NAKed by Peter.
>
> Ah, right, I thought that went in. Oh well, let's keep the test then.
>
> > > > + /*
> > > > + * Did we race with fork() or exit()? Np, everything is still frozen.
> > > > + */
> > > > + if (frozen == test_bit(CGRP_FROZEN, &cgrp->flags))
> > > > + return;
> > > > +
> > > > + if (frozen)
> > > > + set_bit(CGRP_FROZEN, &cgrp->flags);
> > > > + else
> > > > + clear_bit(CGRP_FROZEN, &cgrp->flags);
> > >
> > > I'm not sure this is wrong but it feels a bit weird to tie the actual
> > > state transition to notification. Wouldn't it be more
> > > straight-forward if CGRP_FROZEN bit is purely determined by whether
> > > the tasks are frozen or not and the notification just checks that
> > > against the last notified state and generate a notification if they're
> > > different?
> >
> > So, maybe cgroup_notify_frozen() is not the best name, maybe
> > cgroup_propagate_frozen() better reflects what's happening here.
> > We need to recalc the state of ancestor cgroups, and we have to do it
> > with cgroup_mutex held, this is why we do it from the delayed work
> > context (on hot paths).
>
> Can't we protect that state with css_set_lock too? That's what task
> states are protected by and the cgroup state is a mere aggregation of
> task states.

See below.

>
> > The first pat of the function can be probably separated and called
> > immediately. Is this what you're suggesting?
>
> Pretty much. I think separating out state transitions and
> notifications would make it more straightforward.
>
> > > So that all these state transitions are synchronous with the actual
> > > freezing events and we can just queue per-cgroup work items all the
> > > way to the top if the new state is different from the last one
> > > cgroup-by-cgroup?
> >
> > Hm, Idk. Why it's better?
>
> So, the pieces are - 1. task states, 2. cgroup states and
> 3. notifications. The current code ties together #2 and #3 together
> which is weird because #2 is a mere aggregation of #1. Also, that
> way, notifications become a lot more robust because whether to
> generate a notification or not can be solely determined from #2
> flipping. ie. sth like the following
>
> change_task_frozen_state()
> {
> update counters
> if (cgroup state needs to change) {
> change cgroup state;
> queue notification work;
> repeat for the parent;
> }
> }
>
> where notification work always notifies should work and trivially
> satisfies the requirement (there should be at least one notification
> since the last state transition) without any further work. Wouldn't
> this be easier and more robust? The current code depends on
> annotating each possible transition event, which is kinda fragile.

Ok, it looks like I can additionally synchronize descendant counters
using css_set_lock, and then eliminate the whole thing with delayed
transitions/notifications.

Will do in v3.

>
> > > > + if (lock_task_sighand(task, &flags)) {
> > > > + if (test_bit(CGRP_FREEZE, &dst->flags))
> > > > + task->jobctl |= JOBCTL_TRAP_FREEZE;
> > > > + else
> > > > + task->jobctl &= ~JOBCTL_TRAP_FREEZE;
> > >
> > > How are these flags synchronized?
> >
> > Using the css_set_lock.
>
> But other JOBCTL_TRAP bits aren't synchronized by css_set_lock, right?


But if we don't touch this bit anywhere else, should be fine, right?

Thanks!