Re: [PATCH v2 3/6] cgroup: cgroup v2 freezer

From: Roman Gushchin
Date: Tue Nov 13 2018 - 17:00:12 EST


Hi Oleg!

On Tue, Nov 13, 2018 at 04:48:25PM +0100, Oleg Nesterov wrote:
> On 11/12, Roman Gushchin wrote:
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -83,7 +83,8 @@ struct task_group;
> > #define TASK_WAKING 0x0200
> > #define TASK_NOLOAD 0x0400
> > #define TASK_NEW 0x0800
> > -#define TASK_STATE_MAX 0x1000
> > +#define TASK_FROZEN 0x1000
> > +#define TASK_STATE_MAX 0x2000
>
> Just noticed the new task state... Why? Can't we avoid it?

We can, but it's nice to show to userspace that tasks are frozen,
rather than just stuck somewhere in the kernel...
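
To illustrate what I mean, here is a rough userspace sketch (not kernel code) of how the extra bit fits into the state word. The TASK_* values are copied from the hunk above, TASK_INTERRUPTIBLE is 0x0001 as in sched.h, and is_single_bit() and state_is_frozen() are hypothetical helpers that exist only for this example:

```c
#include <assert.h>
#include <stdbool.h>

/* Values from the quoted hunk; TASK_INTERRUPTIBLE as defined in sched.h. */
#define TASK_INTERRUPTIBLE	0x0001
#define TASK_WAKING		0x0200
#define TASK_NOLOAD		0x0400
#define TASK_NEW		0x0800
#define TASK_FROZEN		0x1000	/* new bit added by the patch */
#define TASK_STATE_MAX		0x2000	/* moved up by one bit */

/* Hypothetical helper: true if x has exactly one bit set. */
static bool is_single_bit(unsigned long x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical helper: does this state word carry the frozen bit? */
static bool state_is_frozen(unsigned long state)
{
	return (state & TASK_FROZEN) != 0;
}
```

With the bit set, a reader of the state word can tell a frozen task apart from one that is merely sleeping interruptibly.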

>
> ...
>
> > +void cgroup_freezer_enter(void)
> > +{
> > +	long state = current->state;
>
> Why? it must be TASK_RUNNING?
>
> If not set_current_state() at the end is simply wrong... Yes, __refrigerator()
> does this, but at least it has a reason although it is wrong too.
>
> > +	struct cgroup *cgrp;
> > +
> > +	if (!current->frozen) {
> > +		spin_lock_irq(&css_set_lock);
> > +		current->frozen = true;
> > +		cgrp = task_dfl_cgroup(current);
> > +		cgrp->freezer.nr_frozen_tasks++;
> > +
> > +		WARN_ON_ONCE(cgrp->freezer.nr_tasks_to_freeze <
> > +			     cgrp->freezer.nr_frozen_tasks);
> > +
> > +		if (cgrp->freezer.nr_tasks_to_freeze ==
> > +		    cgrp->freezer.nr_frozen_tasks)
> > +			cgroup_queue_notify_frozen(cgrp);
> > +		spin_unlock_irq(&css_set_lock);
> > +	}
> > +
> > +	/* refrigerator */
> > +	set_current_state(TASK_WAKEKILL | TASK_INTERRUPTIBLE | TASK_FROZEN);
>
> Why not __set_current_state() ?

Hm, it's not a hot path at all, so set_current_state() is good enough.
Not a strong preference, of course.

>
> If ->state include TASK_INTERRUPTIBLE, why do we need TASK_WAKEKILL?
>
> And again, why TASK_FROZEN?

So, should it be just TASK_INTERRUPTIBLE | TASK_FROZEN ?

>
> > +	clear_thread_flag(TIF_SIGPENDING);
> > +	schedule();
> > +	recalc_sigpending();
>
> I simply can't understand these 3 lines above but I bet this is not correct ;)

So, yeah, the problem is that if the TIF_SIGPENDING bit is set, schedule()
returns immediately, so we get pretty much a busy loop here.
Clearing the flag is a nasty workaround.

I believe we can clear TIF_SIGPENDING and not call recalc_sigpending() at all.
Does that seem correct?
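
To make the busy loop concrete, here is a toy userspace model (not kernel code). It assumes the usual rule that schedule() puts a task back to TASK_RUNNING instead of descheduling it when the state is interruptible and a signal is pending. struct toy_task, toy_schedule() and toy_freeze_spins() are made-up names for this sketch only:

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_RUNNING		0x0000
#define TASK_INTERRUPTIBLE	0x0001

/* Toy stand-in for the bits of task_struct that matter here. */
struct toy_task {
	unsigned long state;
	bool sigpending;	/* models TIF_SIGPENDING */
};

/*
 * Toy schedule(): returns true if the task really slept.
 * An interruptible task with a pending signal is made runnable
 * again right away, so the call returns without sleeping.
 */
static bool toy_schedule(struct toy_task *t)
{
	if (t->state == TASK_RUNNING)
		return false;
	if ((t->state & TASK_INTERRUPTIBLE) && t->sigpending) {
		t->state = TASK_RUNNING;
		return false;
	}
	t->state = TASK_RUNNING;	/* slept, then woken up */
	return true;
}

/*
 * Count how many times the freeze loop spins without sleeping,
 * giving up after max_spins iterations.
 */
static int toy_freeze_spins(struct toy_task *t, bool clear_sigpending,
			    int max_spins)
{
	int spins = 0;

	assert(max_spins >= 0);
	while (spins < max_spins) {
		t->state = TASK_INTERRUPTIBLE;
		if (clear_sigpending)
			t->sigpending = false;
		if (toy_schedule(t))
			break;
		spins++;
	}
	return spins;
}
```

In this model, leaving the pending flag set makes the loop spin until the cap, while clearing it first lets the task sleep on the first try. It says nothing about the ->siglock race, which is a separate problem.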

>
> if nothing else recalc_sigpending() without ->siglock is wrong, it can race
> with signal_wakeup/etc.
>
> > +	set_current_state(state);
>
> see above...

Thank you for the review!
I'm looking forward to more comments from you!