Re: [PATCH v3 1/2] perf/core: Share an event with multiple cgroups

From: Namhyung Kim
Date: Tue Apr 20 2021 - 14:37:27 EST


Hi Peter,

On Tue, Apr 20, 2021 at 7:28 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 16, 2021 at 06:49:09PM +0900, Namhyung Kim wrote:
> > On Thu, Apr 15, 2021 at 11:51 PM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > > +static void perf_update_cgroup_node(struct perf_event *event, struct cgroup *cgrp)
> > > > +{
> > > > + u64 delta_count, delta_time_enabled, delta_time_running;
> > > > + int i;
> > > > +
> > > > + if (event->cgrp_node_count == 0)
> > > > + goto out;
> > > > +
> > > > + delta_count = local64_read(&event->count) - event->cgrp_node_count;
>
> From here...
>
> > > > + delta_time_enabled = event->total_time_enabled - event->cgrp_node_time_enabled;
> > > > + delta_time_running = event->total_time_running - event->cgrp_node_time_running;
> > > > +
> > > > + /* account delta to all ancestor cgroups */
> > > > + for (i = 0; i <= cgrp->level; i++) {
> > > > + struct perf_cgroup_node *node;
> > > > +
> > > > + node = find_cgroup_node(event, cgrp->ancestor_ids[i]);
> > > > + if (node) {
> > > > + node->count += delta_count;
> > > > + node->time_enabled += delta_time_enabled;
> > > > + node->time_running += delta_time_running;
> > > > + }
> > > > + }
>
> ... till here, NMI could hit and increment event->count, which then
> means that:
>
> > > > +
> > > > +out:
> > > > + event->cgrp_node_count = local64_read(&event->count);
>
> This load doesn't match the delta_count load and events will go missing.
>
> Obviously correct solution is:
>
> event->cgrp_node_count += delta_count;
>
>
> > > > + event->cgrp_node_time_enabled = event->total_time_enabled;
> > > > + event->cgrp_node_time_running = event->total_time_running;
>
> And while total_time doesn't have that problem, consistency would then
> have you do:
>
> event->cgrp_node_time_foo += delta_time_foo;
>
> > >
> > > This is wrong; there's no guarantee these are the same values you read
> > > at the begin, IOW you could be loosing events.
> >
> > Could you please elaborate?
>
> You forgot NMI.

Thanks for your explanation. Maybe I'm missing something but
this event is basically for counting and doesn't allow sampling.
Do you say it's affected by other sampling events? Note that
it's not reading from the PMU here, what it reads is a snapshot
of last pmu->read(event) afaik.

Thanks,
Namhyung