Re: [External] Re: [PATCH] cgroup/rstat: record the cumulative per-cpu time of cgroup and its descendants

From: Hao Jia
Date: Sun Aug 06 2023 - 23:36:34 EST




On 2023/8/3 Tejun Heo wrote:
I couldn't come up with an answer. Let's go ahead with adding the field but
can you please do the followings?


Thank you for your suggestion. I am very willing to do it.

* Name it to something like subtree_bstat instead of cumul_bstat. The
counters are all cumulative.

I did it in v2 patch.


* Are you sure the upward propagation logic is correct? It's calculating
global delta and then propagating to the per-cpu delta of the parent. Is
that correct because the two delta calculations always end up the same?

Sorry, I made a mistake and misled you. These two deltas are not always equal.

I found and reproduced a bug case:
We build /sys/fs/cgroup/test /sys/fs/cgroup/test/t1, /sys/fs/cgroup/test/t1/tt1 in turn.
And there are only tasks in /sys/fs/cgroup/test/t1/tt1. After applying this patch, some operations and corresponding data changes are as follows:

Step 1、cat /sys/fs/cgroup/test/t1/tt1/cpu.stat
*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 257341
/test/t1/tt1 cgrp->bstat utime 0 stime 0 sum_exec_runtime 257341
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 0 sum_exec_runtime 257341
parent(/test/t1) ->bstat utime 0 stime 0 sum_exec_runtime 257341
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 0 sum_exec_runtime 257341
}


Step 2、cat /sys/fs/cgroup/test/t1/tt1/cpu.stat
*cpu 12* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cumul_bstat utime 0 stime 1000000 sum_exec_runtime 747042
/test/t1/tt1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1004383
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 1000000 sum_exec_runtime 747042
parent(/test/t1) ->bstat utime 0 stime 1000000 sum_exec_runtime 1004383
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 1000000 sum_exec_runtime 747042
}


Step 3、cat /sys/fs/cgroup/test/cpu.stat
(cgroup fulsh /test/t1/tt1 -> /test/t1 -> /test in turn)

*cpu 6* current flush cgroup /test/t1/tt1
per-cpu delta utime 0 stime 0 sum_exec_runtime 263468
/test/t1/tt1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 520809
/test/t1/tt1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 0 sum_exec_runtime 263468
parent(/test/t1) ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
parent(/test/t1) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test/t1) cumul_bstat utime 0 stime 0 sum_exec_runtime 520809
}

*cpu 6* current flush cgroup /test/t1
per-cpu delta utime 0 stime 0 sum_exec_runtime 0
/test/t1 cumul_bstat(cpu 6) utime 0 stime 0 sum_exec_runtime 520809
/test/t1 cgrp->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
if (cgroup_parent(parent)) {
cgrp delta utime 0 stime 1000000 sum_exec_runtime 1267851 <---
parent(/test) ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
parent(/test) last_bstat utime 0 stime 0 sum_exec_runtime 0
parent(/test) cumul_bstat (cpu 6) utime 0 stime 1000000 sum_exec_runtime 1267851 <--- *error*
******
Here cgrp delta is *not equal* to per-cpu delta.
The frequency of cgroup (/test) and its chiled cgroup (/test/t1/tt1) flush is inconsistent.
In other words (when we call cgroup_base_stat_flush(), we will not necessarily flush to the highest-level cgroup except root(like step 1 and 2 above)).
Therefore, cgrp delta may contain the cumulative value of multiple per-cpu deltas.

The correct value of parent(/test) cumul_bstat should be utime 0 stime 0 sum_exec_runtime 520809.
******
}

*cpu 6* current flush cgroup /test
per-cpu delta utime 0 stime 0 sum_exec_runtime 0
cumul_bstat utime 0 stime 1000000 sum_exec_runtime 1267851
/test ->bstat utime 0 stime 1000000 sum_exec_runtime 1267851
cgroup_parent(parent) is NULL end.


So we should add a per-cpu variable subtree_last_bstat similar to cgrp->last_bstat to record the last value.

I have sent v2 patch, please review it again.
v2 link:
https://lore.kernel.org/all/20230807032930.87785-1-jiahao.os@xxxxxxxxxxxxx


* Please add a comment explaining that the field is not currently used
outside of being read from bpf / drgn and what not and that we're still
trying to determine how to expose that in the cgroupfs interface.


Thanks, I did it in v2 patch.

Thanks,
Hao