Re: [PATCH] cgroup: use an ordered workqueue for cgroup destruction

From: Tejun Heo
Date: Fri Feb 07 2014 - 15:35:23 EST


Hello, Hugh.

On Fri, Feb 07, 2014 at 12:20:44PM -0800, Hugh Dickins wrote:
> > But even with offline being called outside cgroup_mutex, IIRC, the
> > described problem would still be able to deadlock as long as the tree
> > depth is deeper than max concurrency level of the destruction
> > workqueue. Sure, we can give it large enough number but it's
> > generally nasty.
>
> You worry me there: I certainly don't want to be introducing new
> deadlocks. You understand workqueues much better than most of us: I'm
> not sure what "max concurrency level of the destruction workqueue" is,
> but it sounds uncomfortably like an ordered workqueue's max_active 1.

Ooh, max_active is always a finite number. The only reason we usually
don't worry about it is because they are large enough for the existing
dependency chains to cause deadlocks. The theoretical problem with
cgroup is that the dependency chain can grow arbitrarily long and
multiple removals along different subhierarchies can overlap which
means that there can be multiple long dependency chains among work
items. The probability would be extremely low but deadlock might be
possible even with relatively high max_active.

Besides, the reason we reduced max_active in the first place was
because destruction work items tend to just stack up without any
actual concurrency benefits, so increasing concurrncy level seems a
bit nasty to me (but probably a lot of those traffic jam was from
cgroup_mutex and once we take that out of the picture, it could become
fine).

> You don't return to this concern in the following mails of the thread:
> did you later decide that it actually won't be a problem? I'll assume
> so for the moment, since you took the patch, but please reassure me.

I was just worrying about a different solution where we take
css_offline invocation outside of cgroup_mutex and bumping up
max_active. There's nothing to worry about your patch. Sorry about
not being clear. :)

> > One thing I don't get is why memcg has such reverse dependency at all.
> > Why does the parent wait for its descendants to do something during
> > offline? Shouldn't it be able to just bail and let whatever
> > descendant which is stil busy propagate things upwards? That's a
> > usual pattern we use to tree shutdowns anyway. Would that be nasty to
> > implement in memcg?
>
> I've no idea how nasty it would be to change memcg around, but Michal
> and Hannes appear very open to doing so. I do think that memcg's current
> expectation is very reasonable: it's perfectly normal that a rmdir cannot
> succeed until the directory is empty, and to depend upon that fact; but
> the use of workqueue made some things asynchronous which were not before,
> which has led to some surprises.

Maybe. The thing is that ->css_offline() isn't really comparable to
rmdir. ->css_free() is and is fully ordered through refcnts as one
would expect. Whether ->css_offline() should be ordered similarly so
that the parent's offline is called iff all its children finished
offlining, I'm not sure. Maybe it'd be something nice to have but I
kinda wanna keep the offline hook and its usages simple and limited.
It's not where the actual destruction should happen. It's just a
notification to get ready.

Looks like Johannes's patch is headed towards that direction - moving
destruction from ->css_offline to ->css_free(), so if that works out,
I think we should be good for the time being.

Thanks!

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/