Hi, Michael
Actually, I just created a container limited to 1 CPU and ran two busy-loop tasks (while (1)) inside it. I then read cpu.stat at 1-second intervals, but the wait_sum difference between samples was almost 0.
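In case it helps to reproduce, here is a rough sketch of what I ran. The cgroup path below is only an assumption for a cgroup v1 mount (adjust for your setup), and the program has to be started from inside that cgroup:

/* Sketch: spawn two busy-loop tasks, then dump cpu.stat once per second
 * so the wait_sum deltas can be compared. The cgroup path is an example. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CPU_STAT "/sys/fs/cgroup/cpu/test/cpu.stat"	/* assumed path */

static void busy_loop(void)
{
	for (;;)
		;
}

int main(void)
{
	char buf[4096];
	int i;

	/* two CPU hogs inside a 1-CPU container */
	for (i = 0; i < 2; i++) {
		if (fork() == 0)
			busy_loop();
	}

	/* sample cpu.stat once per second */
	for (;;) {
		FILE *f = fopen(CPU_STAT, "r");
		size_t n;

		if (!f) {
			perror("fopen");
			exit(1);
		}
		n = fread(buf, 1, sizeof(buf) - 1, f);
		buf[n] = '\0';
		fclose(f);
		printf("%s\n", buf);
		sleep(1);
	}
	return 0;
}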
Frankly speaking, this sounds like a supplement rather than a missing piece.
I implemented 'hierarchy wait_sum' with reference to cpuacct.usage. The hierarchy wait_sum sums the se wait_sum of all the tasks within the hierarchy of the task group, so we can calculate the wait rate of a task group from the hierarchy wait_sum and cpuacct.usage.
The extra overhead of maintaining the hierarchy wait_sum is walking the cfs_rq se chain from the target task's se up to the se of root_task_group's children.
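For example, a rough user-space sketch of the wait-rate calculation (the helper name and the sample numbers are only illustrative, not part of the patch; both counters are in nanoseconds):

/* Sketch: wait rate of a task group from two samples of the hierarchy
 * wait_sum (cpu.stat) and cpuacct.usage. Sample values are made up. */
#include <stdio.h>

static double wait_rate(unsigned long long last_ws, unsigned long long cur_ws,
			unsigned long long last_usage, unsigned long long cur_usage)
{
	unsigned long long dws = cur_ws - last_ws;
	unsigned long long dusage = cur_usage - last_usage;

	if (dws + dusage == 0)
		return 0.0;
	/* fraction of the interval spent waiting on the runqueue */
	return (double)dws / (double)(dusage + dws);
}

int main(void)
{
	/* e.g. 0.5s waited vs 1.5s ran over the sampling interval -> 25% */
	printf("wait rate: %.2f%%\n",
	       100.0 * wait_rate(0, 500000000ULL, 0, 1500000000ULL));
	return 0;
}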
Regards,
Yuzhoujian
Michael Wang <yun.wang@xxxxxxxxxxxxxxxxx> wrote on Fri, Jan 25, 2019 at 11:12:
On 2019/1/23 5:46, ufo19890607@xxxxxxxxx wrote:
> From: yuzhoujian <yuzhoujian@xxxxxxxxxxxxxxx>
>
> We can monitor the sum wait time of a task group since commit 3d6c50c27bd6
> ("sched/debug: Show the sum wait time of a task group"). However, this
> wait_sum only represents the conflict between different task groups, since
> it simply sums the wait time of the task_group's cfs_rqs. We still cannot
> evaluate the conflict among all the tasks within the hierarchy of this
> group, so the hierarchy wait time is still needed.
Could you please give us a scenario in which we really need this hierarchy
wait_sum, despite the extra overhead?
Regards,
Michael Wang
>
> Thus we introduce hierarchy wait_sum which summarizes the total wait sum of
> all the tasks in the hierarchy of a group.
>
> The 'cpu.stat' file is modified to show the new statistic, like:
>
>       nr_periods 0
>       nr_throttled 0
>       throttled_time 0
>       intergroup wait_sum 2842251984
>       hierarchy wait_sum 6389509389332798
>
> From now on we can monitor both the intergroup and the hierarchy wait_sum,
> which helps a system administrator see how intense the CPU competition is
> both within a task group and between different task groups. We can
> calculate the wait rate of a task group from the hierarchy wait_sum and
> cpuacct.usage.
>
> For example:
>       X% = (current_wait_sum - last_wait_sum) /
>            ((current_usage - last_usage) + (current_wait_sum - last_wait_sum))
>
> That means the task group spent X percent of its time on the runqueue
> waiting for the CPU.
>
> Signed-off-by: yuzhoujian <yuzhoujian@xxxxxxxxxxxxxxx>
> ---
>  kernel/sched/core.c  | 11 +++++++----
>  kernel/sched/fair.c  | 17 +++++++++++++++++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee77636..172e6fb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6760,13 +6760,16 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>  	seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>
>  	if (schedstat_enabled() && tg != &root_task_group) {
> -		u64 ws = 0;
> +		u64 inter_ws = 0, hierarchy_ws = 0;
>  		int i;
>
> -		for_each_possible_cpu(i)
> -			ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> +		for_each_possible_cpu(i) {
> +			inter_ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> +			hierarchy_ws += tg->cfs_rq[i]->hierarchy_wait_sum;
> +		}
>
> -		seq_printf(sf, "wait_sum %llu\n", ws);
> +		seq_printf(sf, "intergroup wait_sum %llu\n", inter_ws);
> +		seq_printf(sf, "hierarchy wait_sum %llu\n", hierarchy_ws);
>  	}
>
>  	return 0;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2ff4b6..35e89ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
>  }
>
>  static inline void
> +update_hierarchy_wait_sum(struct sched_entity *se,
> +			  u64 delta_wait)
> +{
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		if (cfs_rq->tg != &root_task_group)
> +			__schedstat_add(cfs_rq->hierarchy_wait_sum,
> +					delta_wait);
> +	}
> +}
> +
> +static inline void
>  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  	struct task_struct *p;
> @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
>  			return;
>  		}
>  		trace_sched_stat_wait(p, delta);
> +		update_hierarchy_wait_sum(se, delta);
>  	}
>
>  	__schedstat_set(se->statistics.wait_max,
> @@ -10273,6 +10287,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #ifndef CONFIG_64BIT
>  	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>  #endif
> +#ifdef CONFIG_SCHEDSTATS
> +	cfs_rq->hierarchy_wait_sum = 0;
> +#endif
>  #ifdef CONFIG_SMP
>  	raw_spin_lock_init(&cfs_rq->removed.lock);
>  #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d27c1a5..c01ab99 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -496,6 +496,9 @@ struct cfs_rq {
>  #ifndef CONFIG_64BIT
>  	u64			min_vruntime_copy;
>  #endif
> +#ifdef CONFIG_SCHEDSTATS
> +	u64			hierarchy_wait_sum;
> +#endif
>
>  	struct rb_root_cached	tasks_timeline;
>
>