Re: Perf event operation with hotplug cpus and cgroups

From: Peter Zijlstra
Date: Fri Mar 20 2015 - 16:20:57 EST


On Fri, Mar 20, 2015 at 03:41:54PM -0400, William Cohen wrote:
>
> There isn't any desire to aggregate the different cgroup data
> together. The desired grouping is measurements per cgroup, kind of
> like the pid scoping for perf but for a cgroup. It is just that the
> way perf event measurements work for cgroups means that the
> measurements need to be taken system-wide.

Still doesn't make any sense; if you want to monitor just the vcpu,
attach to the one task already.

Without the vcpu per cgroup thing you'll never end up with O(n^2). You
get cgroups * cpus, which is what it is.

Your specific complaint was about this weird setup where you place
nr_cpus tasks in nr_cpus cgroups and then end up with O(n^2) fds.

Also this isn't perf specific, cgroups _are_ system wide, so obviously
it needs system-wide measurement.
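
As a rough sketch of what "system-wide measurement" of a cgroup looks
like from user space: with PERF_FLAG_PID_CGROUP, the pid argument of
perf_event_open() is a file descriptor for the cgroup directory and a
specific cpu has to be given, so one call is made per cpu. The code
below only builds the argument list and does not issue any syscall; the
names OpenArgs and cgroup_open_plan are hypothetical, and the cgroup fd
value used in the example is a placeholder.

```python
from typing import Iterable, List, NamedTuple

# Flag value as defined in <linux/perf_event.h>.
PERF_FLAG_PID_CGROUP = 1 << 2


class OpenArgs(NamedTuple):
    """Arguments (besides the attr struct) for one perf_event_open() call."""
    pid: int       # in cgroup mode: an fd opened on the cgroup directory
    cpu: int       # must be a real cpu number in cgroup mode
    group_fd: int  # -1 means the event is not grouped with another event
    flags: int


def cgroup_open_plan(cgroup_fd: int, cpus: Iterable[int]) -> List[OpenArgs]:
    """Return one set of call arguments per cpu for a single cgroup."""
    return [OpenArgs(cgroup_fd, cpu, -1, PERF_FLAG_PID_CGROUP) for cpu in cpus]


# Example: a placeholder cgroup fd of 7 on a 4-cpu machine gives 4 calls.
plan = cgroup_open_plan(7, range(4))
for args in plan:
    print(args)
```

Each entry in the plan corresponds to one fd, which is where the
cgroups * cpus count comes from.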

> > Just measure the parent cgroup of the vcpu cgroups if you're really only
> > interested in the virtual machine crap thing.
> >
> >> Given the issues with these use cases is user-space setting up the
> >> counters for each cpu in the system the best solution? Would it be
> >> better to allow the system-wide data collection to be selected with
> >> one perf event open with pid==-1 and cpu==-1? Is setup of per cpu
> >> monitoring and aggregation of the counters across processors too
> >> difficult to do in the kernel?
> >
> > Not hard at all, but useless, you need a fd per cpu in order to get your
> > data out. Remember that the ring buffers are strictly per cpu.
> >
>
> Are the ring buffers needed just for the sampling or are they also
> needed for "perf stat" type information?

No, counting could do this; but even there I'd worry about scalability.
We'd need to fold the value into the 'global' counter on every cgroup
switch, now imagine all 80 cpus context switching at high rates between
cgroups.
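
A toy model of the scalability concern, added for illustration and not
taken from the original message. It compares folding into one shared
counter at every cgroup switch with keeping a per-cpu slot that is only
summed when the value is read. The class names and the figure of 1000
switches per cpu are assumptions; the model only counts writes to the
shared location and does not measure real cache-line contention.

```python
class FoldedCount:
    """One 'global' counter that every cpu updates at each cgroup switch."""

    def __init__(self) -> None:
        self.total = 0
        self.shared_writes = 0  # writes that hit the shared location

    def fold(self, delta: int) -> None:
        self.total += delta
        self.shared_writes += 1


class PerCpuCount:
    """One slot per cpu; slots are only combined when the value is read."""

    def __init__(self, nr_cpus: int) -> None:
        self.counts = [0] * nr_cpus

    def add(self, cpu: int, delta: int) -> None:
        self.counts[cpu] += delta  # touches only this cpu's slot

    def read(self) -> int:
        return sum(self.counts)    # aggregation happens once, at read time


NR_CPUS = 80             # machine size from this thread
SWITCHES_PER_CPU = 1000  # assumed number of cgroup switches per cpu

folded = FoldedCount()
per_cpu = PerCpuCount(NR_CPUS)
for cpu in range(NR_CPUS):
    for _ in range(SWITCHES_PER_CPU):
        folded.fold(1)
        per_cpu.add(cpu, 1)

print(folded.total, per_cpu.read(), folded.shared_writes)
```

Both approaches arrive at the same total, but the folded version writes
to the shared counter once per switch on every cpu, while the per-cpu
version leaves the summing to whoever reads the value.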

Also we'd need to somehow manage multiple events with a single fd,
that's complexity we really do not need.

When we started out with perf we had such global constructs and we had
to quickly kill them for much smaller systems than this 80 cpu machine
you talk about.