Re: [PATCH 8/9] [RFC] Example multi-bindable subsystem: a per-cgroup notes field

From: Paul Menage
Date: Thu Jul 02 2009 - 03:23:23 EST


On Wed, Jul 1, 2009 at 7:56 PM, Paul Menage<menage@xxxxxxxxxx> wrote:
>> Hmm, do we need this "info" file as a subsys? How about making this
>> part of the default file set? (if there are users.)
>>
>
> That would certainly be possible, and would be an alternative to
> having multi-bindable subsystem support.
>
> The advantage of adding multi-bindable subsystems is that you can
> avoid bloating the core cgroups code, by putting individual small
> cgroups features in their own code modules, and you get to decide at
> mount time which features are actually mounted; if they were part of
> the core cgroups files, then there would either need to be special
> mount options for each separate feature, or else no way to pick which
> features were mounted on each hierarchy.

BTW, just to give a balanced argument: I agree that these example
multi-bindable subsystems are somewhat weak justifications for the new
feature. They each supply a single control file, they're not
connected to anything in the kernel outside of the core cgroups
framework, and they add almost zero overhead when they're not actively
used, so making them part of the cgroups framework directly wouldn't
be totally unreasonable.

An example of a less-trivial multi-bindable subsystem could be cpuacct
- logically there's no reason that you couldn't track CPU usage in
multiple different hierarchies, keeping totals aggregated in different
ways for the groupings in different hierarchies, and the overhead
associated with tracking would mean that you wouldn't want to
automatically link cpuacct into every hierarchy. The practical problem
with this would be that finding the cgroup for a process would be
slower since there wouldn't be a 1:1 mapping from a task to a cpuacct
cgroup state object.
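
As a rough illustration of the cpuacct case, here is a minimal userspace
sketch, not kernel code. Every name in it (struct acct_state, struct
acct_link, struct task, charge, demo) is made up for the example and is not
part of the cgroups API. It assumes each task carries a list with one
accounting state per hierarchy, so the same CPU time gets aggregated under
two different groupings:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cgroup CPU accounting state. */
struct acct_state {
    uint64_t usage_ns;
};

/* Hypothetical link: one entry per hierarchy the subsystem is bound to. */
struct acct_link {
    struct acct_state *state;
    struct acct_link *next;
};

/* Hypothetical task: a list of states instead of one direct pointer. */
struct task {
    struct acct_link *links;
};

/* Charge delta_ns to the task's cgroup in every hierarchy. Returns the
 * number of states touched, i.e. the extra work a 1:1 mapping avoids. */
static size_t charge(struct task *t, uint64_t delta_ns)
{
    size_t n = 0;

    assert(t != NULL);
    for (struct acct_link *l = t->links; l != NULL; l = l->next) {
        l->state->usage_ns += delta_ns;
        n++;
    }
    return n;
}

/* Two tasks of one user running different jobs: hierarchy A groups by
 * user, hierarchy B groups by job. Fills totals[] with {user, job1, job2}. */
static void demo(uint64_t totals[3])
{
    struct acct_state user = {0}, job1 = {0}, job2 = {0};
    struct acct_link t1_b = { &job1, NULL }, t1_a = { &user, &t1_b };
    struct acct_link t2_b = { &job2, NULL }, t2_a = { &user, &t2_b };
    struct task t1 = { &t1_a }, t2 = { &t2_a };

    charge(&t1, 100);
    charge(&t2, 50);

    totals[0] = user.usage_ns;
    totals[1] = job1.usage_ns;
    totals[2] = job2.usage_ns;
}
```

Calling demo() leaves 150 in the per-user total and 100 and 50 in the two
per-job totals: the same charges, summed along different groupings. The
sketch ignores locking entirely.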

Without that 1:1 mapping, each task would have multiple such states, and
to update the usage accounting on each of them you'd have to do a list
traversal rather than a direct lookup. Worse, right now that list
traversal can only be done while holding cgroup_mutex, which is
impossible when doing cpuacct charging from the guts of the scheduler.
I can see how to extend the multi-bindable support to make this cheaper
and to require less synchronization (i.e. walking an RCU-safe array to
find the various state objects rather than doing a list traversal).
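
A sketch of what the array-walking idea might look like, in userspace C11
rather than kernel code. This is an assumption about the shape of such a
design, not the proposed implementation: struct acct_table, table_set,
table_charge and MAX_HIER are all hypothetical names, and C11
acquire/release atomics stand in for the kernel's rcu_dereference() and
rcu_assign_pointer(). It does not model grace periods or freeing of old
state objects, which real RCU usage would have to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HIER 4  /* hypothetical upper bound on bindings */

/* Hypothetical per-cgroup accounting state, updated without a lock. */
struct acct_state {
    _Atomic uint64_t usage_ns;
};

/* Hypothetical per-task table: slot i holds the state for hierarchy i,
 * or NULL if the subsystem is not bound there. */
struct acct_table {
    struct acct_state *_Atomic slot[MAX_HIER];
};

/* Writer side (rare: bind, unbind, task migration). The release store
 * stands in for rcu_assign_pointer(). */
static void table_set(struct acct_table *tbl, size_t i, struct acct_state *s)
{
    assert(i < MAX_HIER);
    atomic_store_explicit(&tbl->slot[i], s, memory_order_release);
}

/* Reader side (hot path): no mutex, just a bounded array walk. The
 * acquire load stands in for rcu_dereference(). Returns the number of
 * states charged. */
static size_t table_charge(struct acct_table *tbl, uint64_t delta_ns)
{
    size_t n = 0;

    for (size_t i = 0; i < MAX_HIER; i++) {
        struct acct_state *s =
            atomic_load_explicit(&tbl->slot[i], memory_order_acquire);

        if (s == NULL)
            continue;
        atomic_fetch_add_explicit(&s->usage_ns, delta_ns,
                                  memory_order_relaxed);
        n++;
    }
    return n;
}
```

The point of the fixed-size array is that the charging path touches a
bounded number of slots and never takes a sleeping lock, which is what
would make it usable from scheduler context.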

Before doing that, though, I guess it would be worth asking whether
anyone would actually *want* to aggregate CPU usage in different ways
for different hierarchies, even if it makes logical sense to be able to
do so.

Paul