[PATCH v2] events/core: fix acoount failure for event's total_enable_time

From: Yeoreum Yun
Date: Fri Dec 20 2024 - 05:02:20 EST


The perf core code fails to account for total_enable_time of event
when its state is inactive.

Here is the error case for failure to account for total_enable_time for
core event:

sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
...
armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440

Performance counter stats for 'stress-ng --pthread=2 -t 2s':

1,138,698,008 armv8_pmuv3_0/event=0x08/ (94.99%)
1,826,791,390 armv8_pmuv3_1/event=0x08/ (43.47%)

Since above two events are belong to the same task context and
mutually-exclusive per cpu (they couldn't be active at the same time on the same cpu),
the total_enable_time should be the same (the 2nd number among 3 numbers)
and the summation of ratio should be 100%.

This account failure of total_enable_time for event could happen in below sequence.

1. two event opened with:
- first event (e0) is opened with pmu0(p0) which could be added on CPU0.
- second event (e1) is opened with pmu1(p1) which could be added on CPU1.
- these two event belongs to the same task_ctx.

at this point:
task_ctx->time = 0
e0->total_enable_time = 0
e0->total_running_time = 0
e1->total_enable_time = 0
e1->total_running_time = 0

2. the task_ctx is sched in CPU0.
- In ctx_sched_in(), the task_ctx->time doesn't updated.
- In event_sched_in() e0 is activated so, its state becomes ACTIVE.
- In event_sched_in() e1 is activated, but soon becomes !ACTIVE
because pmu1 doesn't support CPU1 so it failed in pmu1->add().

at this point:
task_ctx->time = 0
e0->total_enable_time = 0
e0->total_running_time = 0
e1->total_enable_time = 0
e1->total_running_time = 0

3. the task_ctx is sched out from CPU0. and suppose scheulded time as t1.
- In ctx_sched_out(), the task_ctx->time updated with t1.
- e0 was ACTIVE, so In event_sched_out(),
e0->total_enable_time and e0->total_running_time are updated with t1.
- e1 was !ACTIVE, so it doesn't update
e1->total_enable_time and e1->total_running_time.

at this point:
task_ctx->time = t1
e0->total_enable_time = t1
e0->total_running_time = t1
e1->total_enable_time = 0
e1->total_running_time = 0

4. the task_ctx is sched in to CPU1.
- task_ctx->time doesn't changed.
- In event_sched_in() e0 is activated, but soon becomes !ACTIVE
because pmu1 doesn't support CPU1 so it failed in pmu1->add().
- In event_sched_in() e1 is activated, and when it changes event
state to ACTIVE, e1->total_enable_time is updated with
task_ctx->time.

at this point:
task_ctx->time = t1
e0->total_enable_time = t1
e0->total_running_time = t1
e1->total_enable_time = t1
e1->total_running_time = 0

5. the task_ctx is sched out from CPU1. and suppose scheulded time as t2.
- In ctx_sched_out(), the task_ctx->time updated with t1 + t2.
- e0 was !ACTIVE, so it doesn't update
e0->total_enable_time and e0->total_running_time.
- e1 was ACTIVE, so In event_sched_out(),
e1->total_enable_time and e1->total_running_time are updated with t2.

at this point:
task_ctx->time = t1 + t2
e0->total_enable_time = t1
e0->total_running_time = t1
e1->total_enable_time = t1 + t2
e1->total_running_time = t2

6. the task_ctx is sched in CPU0 again.
- task_ctx->time doesn't changed.
- In event_sched_in() e0 is activated, and when it changes event
state to ACTIVE, e0->total_enable_time is updated with
task_ctx->time.
- In event_sched_in() e1 is activated, but soon becomes !ACTIVE
because pmu1 doesn't support CPU1 so it failed in pmu1->add().

at this point:
task_ctx->time = t1 + t2
e0->total_enable_time = t1 + t2
e0->total_running_time = t1
e1->total_enable_time = t1 + t2
e1->total_running_time = t2

7. the task exit and task_ctx is sched out. and suppose scheulded time as t3.
- In ctx_sched_out(), the task_ctx->time updated with t1 + t2 + t3.
- e0 was ACTIVE, so In event_sched_out(),
e0->total_enable_time and e1->total_running_time are updated with t3.
- e1 was !ACTIVE, so it doesn't update
e1->total_enable_time and e1->total_running_time.

at this point:
task_ctx->time = t1 + t2 + t3
e0->total_enable_time = t1 + t2 +t3
e0->total_running_time = t1 + t3
e1->total_enable_time = t1 + t2
e1->total_running_time = t2

In step 7, amount of t3 enable time is missing in e1.
So, the perf prints error ratio for total_running_time/total_enable_time
for e1.

Here is the summary of change of time as timeline:

CPU0 CPU1 CPU0
| in | out | | in | out | | in | out (exit) |
------------------------------------------------------------------------------
ctx->time | 0 | t1 | | t1 | t1 + t2 | | t1 + t2 | t1 + t2 + t3 |
------------------------------------------------------------------------------
e0->ena | 0 | t1 | | t1 | t1 | | t1 + t2 | t1 + t2 + t3 |
------------------------------------------------------------------------------
e0->run | 0 | t1 | | t1 | t1 | | t1 + t2 | t1 + t3 |
------------------------------------------------------------------------------
e1->ena | 0 | 0 | | t1 | t1 + t2 | | t1 + t2 | t1 + t2 |
------------------------------------------------------------------------------
e1->run | 0 | 0 | | 0 | t2 | | t1 + t2 | t2 |
------------------------------------------------------------------------------

To address this, update total_enable_time in event_sched_out() when event state
is PERF_EVENT_STATE_INACTIVE.

After this patch, this problem is gone like:

sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 10s
...
armv8_pmuv3_0/event=0x08/: 15396770398 32157963940 21898169000
armv8_pmuv3_1/event=0x08/: 22428964974 32157963940 10259794940

Performance counter stats for 'stress-ng --pthread=2 -t 10s':

15,396,770,398 armv8_pmuv3_0/event=0x08/ (68.10%)
22,428,964,974 armv8_pmuv3_1/event=0x08/ (31.90%)

Signed-off-by: Yeoreum Yun <yeoreum.yun@xxxxxxx>
---
kernel/events/core.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 065f9188b44a..308850449a84 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2311,6 +2311,11 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
WARN_ON_ONCE(event->ctx != ctx);
lockdep_assert_held(&ctx->lock);

+ if (event->state == PERF_EVENT_STATE_INACTIVE) {
+ perf_event_update_time(event);
+ return;
+ }
+
if (event->state != PERF_EVENT_STATE_ACTIVE)
return;

@@ -2351,7 +2356,7 @@ group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
{
struct perf_event *event;

- if (group_event->state != PERF_EVENT_STATE_ACTIVE)
+ if (group_event->state < PERF_EVENT_STATE_INACTIVE)
return;

perf_assert_pmu_disabled(group_event->pmu_ctx->pmu);
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}