On 11/15/2011 07:59 AM, Glauber Costa wrote:
> Of course it is. But not in *this* patchset. If you look at the last
> one I sent, with all the functionality, before a split attempt, you
> will see that this is indeed used.
>
> This patch makes the scheduler statistics, such as user ticks,
> system ticks, etc., per-cgroup. With this information, we are
> able to display the same information we currently do in the cpuacct
> cgroup, but within the normal cpu cgroup.
Hmm,
So this goes a little beyond the existing stats exported by cpuacct.
Currently we have:
CPUACCT_STAT_USER
CPUACCT_STAT_SYSTEM (in cpuacct.stat)
and
cpuacct.usage / cpuacct.usage_per_cpu
Arguably the last two stats are the *most* useful information exported
by cpuacct (and the ones we get for free from existing sched_entity
accounting), but their functionality is not maintained here; see the
proposal in: https://lkml.org/lkml/2011/11/11/265
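(To illustrate the "for free" part -- untested sketch, assuming
CONFIG_FAIR_GROUP_SCHED; tg_usage_cpu() is a made-up helper, not
something in tree:)

	/*
	 * A group's cpu usage already falls out of the sched_entity
	 * runtime accounting we maintain. (The root group has no
	 * entities, so it would need the global sum instead.)
	 */
	static u64 tg_usage_cpu(struct task_group *tg, int cpu)
	{
		return tg->se[cpu]->sum_exec_runtime;	/* ns on @cpu */
	}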
I'm not sure we really want to bring the other stats /within/ the CPU
controller.
Furthermore, given your stated goal of virtualizing some of the /proc
interfaces using this export, it definitely seems like these fields
(and any future behavioral changes they may enable) should be
independent of core cpu.
(/me ... reads through patch then continues thoughts at bottom.)
OK.

> +static struct jump_label_key sched_cgroup_enabled;
This name does not really suggest what this jump label is used for.
Something like task_group_sched_stats_enabled is much clearer.
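For instance (illustrative only -- the enable path below is
hypothetical and would sit wherever a group first turns its stats on):

	static struct jump_label_key task_group_sched_stats_enabled;

	static void task_group_stats_enable(void)
	{
		/* patch the static_branch() sites to the taken path */
		jump_label_inc(&task_group_sched_stats_enabled);
	}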
> +static int sched_has_sched_stats = 0;
> +
> +struct kernel_cpustat *task_group_kstat(struct task_struct *p)
> +{
> +	if (static_branch(&sched_cgroup_enabled)) {
> +		struct task_group *tg;
> +		tg = task_group(p);
> +		return tg->cpustat;
> +	}
> +
> +	return &kernel_cpustat;
> +}
> +EXPORT_SYMBOL(task_group_kstat);
> +
+
> /* Change a task's cfs_rq and parent entity if it moves across
>    CPUs/groups */
> static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
> {
> @@ -784,9 +806,36 @@ static inline struct task_group *task_group(struct task_struct *p)
> {
> 	return NULL;
> }
> -
> #endif /* CONFIG_CGROUP_SCHED */
> +static inline void task_group_account_field(struct task_struct *p,
> +					    u64 tmp, int index)
> +{
> +	/*
> +	 * Since all updates are sure to touch the root cgroup, we
> +	 * get ourselves ahead and touch it first. If the root cgroup
> +	 * is the only cgroup, then nothing else should be necessary.
> +	 */
> +	__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +	if (static_branch(&sched_cgroup_enabled)) {
> +		struct kernel_cpustat *kcpustat;
> +		struct task_group *tg;
> +
> +		rcu_read_lock();
> +		tg = task_group(p);
> +		while (tg && (tg != &root_task_group)) {
You could use for_each_entity starting from &p->se here.
How exactly? I thought this_cpu_ptr was always needed in the case of
dynamically allocated percpu variables.

> +			kcpustat = this_cpu_ptr(tg->cpustat);
This is going to have to do the this_cpu_ptr work at every level; we
already know what cpu we're on and can reference it directly.
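I.e., something like this (untested; assumes tg->cpustat was allocated
with alloc_percpu() and that tg->parent chains up to &root_task_group):

	/*
	 * Resolve the cpu once, then index each level's percpu area
	 * directly, instead of redoing the this_cpu_ptr() lookup per
	 * level. (this_cpu_ptr() does work on dynamically allocated
	 * percpu memory; the point is only to hoist the cpu lookup.)
	 */
	int cpu = smp_processor_id();
	struct task_group *tg;

	rcu_read_lock();
	for (tg = task_group(p); tg && tg != &root_task_group;
	     tg = tg->parent) {
		struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);

		kcpustat->cpustat[index] += tmp;
	}
	rcu_read_unlock();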
> 	local_irq_save(flags);
> 	latest_ns = this_cpu_read(cpu_hardirq_time);
> +	kstat_lock();
This protects what, exactly?
> struct cgroup_subsys cpu_cgroup_subsys = {
So I'm not seeing any touch-points that intrinsically benefit from being
a part of the cpu sub-system. The hierarchy walk in
task_group_account_field() is completely independent from the rest of
the controller.
The argument for merging {usage, usage_per_cpu} into cpu is almost
entirely performance based -- the stats are very useful from a
management perspective and we already maintain (hidden) versions of them
in cpu. Whereas, as it stands, this really would seem not to suffer
at all from being in its own controller. I previously suggested that
this might want to be a "co-controller" (e.g. one that only ever exists
mounted adjacent to cpu so that we could leverage the existing hierarchy
without over-loading the core of "cpu"). But I'm not even sure that is
required or beneficial given that this isn't going to add value or make
anything cheaper.
From that perspective, perhaps what you're looking for *really* is best
served just by greatly extending the stats exported by cpuacct (as above).
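e.g. a rough sketch of what a widened cpuacct.stat might carry (the
field set and index names are illustrative only, mirroring the
kernel_cpustat indices this series introduces):

	static const struct {
		const char *name;
		int idx;
	} cpuacct_extra_stats[] = {
		{ "user",	CPUTIME_USER },
		{ "system",	CPUTIME_SYSTEM },
		{ "irq",	CPUTIME_IRQ },
		{ "softirq",	CPUTIME_SOFTIRQ },
		{ "steal",	CPUTIME_STEAL },
		{ "guest",	CPUTIME_GUEST },
	};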
We'll still pull {usage, usage_per_cpu} into "cpu" for the common case
but those who really want everything else could continue using "cpuacct".
Reasonable?