Re: [PATCH] cgroup_freezer: Freezing and task move race fix

From: Tomasz Buchert
Date: Tue Aug 17 2010 - 21:13:17 EST


Le 12/08/2010 22:13, Matt Helsley a écrit :
On Thu, Aug 12, 2010 at 02:53:25AM +0200, Tomasz Buchert wrote:
Matt Helsley a écrit :
On Wed, Aug 11, 2010 at 09:35:43AM +0200, Tomasz Buchert wrote:
Matt Helsley a écrit :
On Wed, Aug 11, 2010 at 12:18:44AM +0200, Tomasz Buchert wrote:
Matt Helsley a écrit :
On Tue, Aug 10, 2010 at 09:53:21PM +0200, Tomasz Buchert wrote:
Writing 'FROZEN' to freezer.state file does not
forbid the task to be moved away from its cgroup
(for a very short time). Nevertheless the moved task
can become frozen OUTSIDE its cgroup which puts
discussed task in a permanent 'D' state.

This patch forbids migration of either FROZEN
or FREEZING tasks.

This behavior was observed and easily reproduced on
a single core laptop. Program and instructions how
to reproduce the bug can be fetched from:
http://pentium.hopto.org/~thinred/repos/linux-misc/freezer_bug.c
Thanks for the report and the test code.

I'm will try to reproduce this race in the next few hours and analyze
it since I'm not sure the patch really fixes the race -- it may only
make the race trigger less frequently.

At the very least the patch won't break the current code since it's
essentially a more-strict version of is_task_frozen_enough() -- it lets
fewer tasks attach/detach to/from frozen cgroups.

Cheers,
-Matt Helsley
Hi Matt!
I am a novice if it comes to the kernel and I find the cgroup_freezer
code especially complicated, so definetely this may be not enough to fix that.
Notice also that if you uncomment the line 55 in my testcase this will also
trigger the race! This, however, makes sense since process may not be in the cgroup anymore
and consequently won't be thawed.
OK, I triggered it with that. Interesting.

Good!

I think that this patch fixes these problems because it does the flag checking in a right order:
first freezing() is used and then frozen() which assures (see frozen_process()) that
the race will not happen. Right? :)
I see what you mean. It still seems like it wouldn't actually fix the race -- just make it
harder to trigger. I think you're saying this is what happens without the patch:

Time "bug" goes through these states cgroup code checks for these states
-----------------------------------------------------------------------------------
| freezing
| is_frozen? Nope.
| frozen
| is_freezing? Nope.
| <move>
V

My first scenario was a bit different:
Time "bug" goes through these states cgroup code checks for these states
-----------------------------------------------------------------------------------
| freezing
| is_task_frozen_enough? Nope.
| <move>
| frozen
V
but the problem is the same.

I think I found a bug in the cgroup freezer which your patch fixes.
However I don't think it's a race.

/* Task is frozen or will freeze immediately when next it gets woken */
static bool is_task_frozen_enough(struct task_struct *task)
{
return frozen(task) ||
(task_is_stopped_or_traced(task)&& freezing(task));
}

So it will only refuse to attach freezing tasks which have been stopped
or traced! Yet for attach we need to refuse to move _any_ freezing tasks.

Though stopped/traced _is_ relevant to the cgroup freezer state itself.
If we uses frozen(task) || freezing(task) to determine whether a cgroup
is frozen then it would be possible for the task to still be active
when the cgroup is finally reported FROZEN. However that's not possible
when the task is stopped/traced *and* freezing since even if woken it
won't exit the kernel before entering the refrigerator.

But, without having carefully investigated the details, this could just as easily happen
with your patch:

Time "bug" goes through these states cgroup code checks for these states
-----------------------------------------------------------------------------------
| is_freezing? Nope.
| is_frozen? Nope.
| freezing
| <move>
| frozen
V

This can't happen as far as I know because there is cgroup_lock around the code in freezer_write()
and freezer_can_attach().
The task can't enter 'freezing' state when can_attach is executed.

You're right, cgroup_mutex should protect against that. However presumably
that same logic applies to the first case. So again I don't think the bug
is a race.

<snipped the remaining cases which should be the same>

At this point I think the bug description in your patch needs to change
but the patch itself is perfect. Assuming you agree with my assessment
of the bug I think the process is: you'll rewrite the description, add -stable
maintaners to your current Cc's (since this bug is simple, exists in previous
versions, and is somewhat nasty), add:

Reviewed-by: Matt Helsley<matthltc@xxxxxxxxxx>
Tested-by: Matt Helsley<matthltc@xxxxxxxxxx>

and send it to Andrew Morton. Hopefully then (if not before) Paul and Li
will ack it.

Thanks!

Cheers,
-Matt Helsley

I agree with your assessment, Matt. The wrong the definition of being 'frozen enough'
allows a task to sneak out of its freezing cgroup.
The problem is that I found another, closely related bug. Right now, I have
a new queue of 3 patches to fix both bugs in a more general setting. They are not well tested
yet but I'm fairly confident that they do the right thing. I'm going to test them tomorrow.
Do you still want me to push the first patch? I propose to let me work a bit on
the new patches and prepare the code for the incoming fix to the related bug.
What is your opinion?

OK, I'll have a look at the 3 new patches and your test(s) then we can discuss
what to do.

Cheers,
-Matt

Matt,
Am I supposed to do something right now? The discussion
became a bit quiet recently.

Cheers,
Tomasz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/