Re: [PATCH] sched/debug: Show intergroup and hierarchy sum wait time of a task group

From: Michael Wang
Date: Sun Jan 27 2019 - 21:53:18 EST




On 2019/1/25 at 3:55 PM, Yuzhoujian wrote:
> Hi Michael,
> Actually, I just created a container which had just 1 CPU, and I ran two tasks (while-1 loops) in this container. Then I read cpu.stat at 1-second intervals, but the wait_sum difference was almost 0.

Task competition inside a cgroup won't be counted as the cgroup's
competition; please try creating another cgroup with a dead loop on
each CPU.
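
Something along these lines would do (an untested sketch; it assumes
cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu, and
"contender" is just a made-up group name):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/sysinfo.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char *grp = "/sys/fs/cgroup/cpu/contender";
    const char *procs = "/sys/fs/cgroup/cpu/contender/cgroup.procs";
    int i, ncpu = get_nprocs();
    FILE *f;
    pid_t pid;

    mkdir(grp, 0755);                   /* may already exist */

    for (i = 0; i < ncpu; i++) {
        pid = fork();
        if (pid < 0) {
            perror("fork");
            exit(1);
        }
        if (pid == 0)                   /* child: dead loop */
            for (;;)
                ;

        f = fopen(procs, "w");          /* move the child into the group */
        if (!f) {
            perror(procs);
            exit(1);
        }
        fprintf(f, "%d\n", (int)pid);
        fclose(f);
    }

    pause();    /* keep the loops alive; kill this process to clean up */
    return 0;
}

While that runs, successive reads of your container's cpu.stat should
show the wait_sum difference climbing between samples.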

> I checked your patch and found that you just sum the task group's se wait_sum rather than the tasks' se wait_sum. So the increment of 'wait_sum' will always be 0 even while there are running tasks inside this task group.

Having running tasks doesn't mean there is no competition; that would
only be true if the cgroup occupied the CPU exclusively at that moment.

> Unfortunately, this is not what we want. Our expectation is to evaluate the CPU contention in a task group, just as I said. We need to ensure that containers with high priority have sufficient CPU resources.

No offense, but I'm afraid you misunderstand the problem we're trying
to solve with wait_sum. If your purpose is to have a way to tell whether
there is sufficient CPU inside a container, please try lxcfs + top: if
there is almost no idle time and the load is high, then the CPU
resources are not sufficient.


> I implemented 'hierarchy wait_sum' by referring to cpuacct.usage. The hierarchy wait_sum summarizes all the tasks' se wait_sum within the hierarchy of this task group. We
> can calculate the wait rate of a task group based on hierarchy wait_sum and cpuacct.usage.
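
(For illustration, with made-up numbers and both counters in
nanoseconds: if over a 1-second sample the hierarchy wait_sum grew by
250 ms and cpuacct.usage by 750 ms, the group spent
250 / (750 + 250) = 25% of its runnable time waiting for a CPU.)
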
Frankly speaking, this sounds like a supplement rather than a missing piece.
Although we don't rely on lxcfs and have modified the kernel ourselves to
support our container environment, I still don't think this kind of solution
belongs in the kernel.

Regards,
Michael Wang


> The extra overhead of calculating the hierarchy wait_sum is traversing the cfs_rq se chain from the target task's se up to the root_task_group's child se, i.e. one counter update per ancestor level each time a wait ends.

> Regards,
> Yuzhoujian


Michael Wang <yun.wang@xxxxxxxxxxxxxxxxx> wrote on Fri, Jan 25, 2019 at 11:12 AM:



On 2019/1/23 at 5:46 PM, ufo19890607@xxxxxxxxx wrote:
> From: yuzhoujian <yuzhoujian@xxxxxxxxxxxxxxx>
>
> We can monitor the sum wait time of a task group since commit 3d6c50c27bd6
> ("sched/debug: Show the sum wait time of a task group"). However, this
> wait_sum only represents the conflict between different task groups, since
> it simply sums the wait time of the task group's cfs_rq. We still cannot
> evaluate the conflict between all the tasks within the hierarchy of this
> group, so the hierarchy wait time is still needed.

Could you please give us a scenario in which we really need this hierarchy
wait_sum, despite the extra overhead?

Regards,
Michael Wang

>
> Thus we introduce hierarchy wait_sum, which summarizes the total wait sum
> of all the tasks in the hierarchy of a group.
>
> The 'cpu.stat' file is modified to show the statistics, like:
>
>     nr_periods 0
>     nr_throttled 0
>     throttled_time 0
>     intergroup wait_sum 2842251984
>     hierarchy wait_sum 6389509389332798
>
> From now on we can monitor both the intergroup and the hierarchy wait_sum,
> which will inevitably help a system administrator know how intense the CPU
> competition is within a task group and between different task groups. We
> can calculate the wait rate of a task group based on hierarchy wait_sum and
> cpuacct.usage.
>
> For example:
>     X% = (current_wait_sum - last_wait_sum) / ((current_usage -
>     last_usage) + (current_wait_sum - last_wait_sum))
>
> That means the task group spent X percent of its time on the runqueue
> waiting for the CPU.
>
> Signed-off-by: yuzhoujian <yuzhoujian@xxxxxxxxxxxxxxx>
> ---
>  kernel/sched/core.c  | 11 +++++++----
>  kernel/sched/fair.c  | 17 +++++++++++++++++
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ee77636..172e6fb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6760,13 +6760,16 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
>      seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time);
>
>      if (schedstat_enabled() && tg != &root_task_group) {
> -        u64 ws = 0;
> +        u64 inter_ws = 0, hierarchy_ws = 0;
>          int i;
>
> -        for_each_possible_cpu(i)
> -            ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> +        for_each_possible_cpu(i) {
> +            inter_ws += schedstat_val(tg->se[i]->statistics.wait_sum);
> +            hierarchy_ws += tg->cfs_rq[i]->hierarchy_wait_sum;
> +        }
>
> -        seq_printf(sf, "wait_sum %llu\n", ws);
> +        seq_printf(sf, "intergroup wait_sum %llu\n", inter_ws);
> +        seq_printf(sf, "hierarchy wait_sum %llu\n", hierarchy_ws);
>      }
>
>      return 0;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2ff4b6..35e89ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -858,6 +858,19 @@ static void update_curr_fair(struct rq *rq)
>  }
>
>  static inline void
> +update_hierarchy_wait_sum(struct sched_entity *se,
> +                          u64 delta_wait)
> +{
> +    for_each_sched_entity(se) {
> +        struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +        if (cfs_rq->tg != &root_task_group)
> +            __schedstat_add(cfs_rq->hierarchy_wait_sum,
> +                            delta_wait);
> +    }
> +}
> +
> +static inline void
>  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>      struct task_struct *p;
> @@ -880,6 +893,7 @@ static void update_curr_fair(struct rq *rq)
>              return;
>          }
>          trace_sched_stat_wait(p, delta);
> +        update_hierarchy_wait_sum(se, delta);
>      }
>
>      __schedstat_set(se->statistics.wait_max,
> @@ -10273,6 +10287,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #ifndef CONFIG_64BIT
>      cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
>  #endif
> +#ifdef CONFIG_SCHEDSTATS
> +    cfs_rq->hierarchy_wait_sum = 0;
> +#endif
>  #ifdef CONFIG_SMP
>      raw_spin_lock_init(&cfs_rq->removed.lock);
>  #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d27c1a5..c01ab99 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -496,6 +496,9 @@ struct cfs_rq {
>  #ifndef CONFIG_64BIT
>      u64                 min_vruntime_copy;
>  #endif
> +#ifdef CONFIG_SCHEDSTATS
> +    u64                 hierarchy_wait_sum;
> +#endif
>
>      struct rb_root_cached   tasks_timeline;
>
>
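
For completeness, here is a small userspace sketch of the X% calculation
from the changelog above (untested; it assumes a kernel with this patch
applied, cgroup v1 paths, and "mygroup" as a placeholder group name):

#include <inttypes.h>
#include <stdio.h>
#include <unistd.h>

/* scan cpu.stat for the "hierarchy wait_sum" line added by the patch */
static uint64_t hierarchy_wait_sum(const char *path)
{
    char line[128];
    uint64_t val = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return 0;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "hierarchy wait_sum %" SCNu64, &val) == 1)
            break;
    fclose(f);
    return val;
}

/* read a single counter file such as cpuacct.usage */
static uint64_t read_u64(const char *path)
{
    uint64_t val = 0;
    FILE *f = fopen(path, "r");

    if (f) {
        if (fscanf(f, "%" SCNu64, &val) != 1)
            val = 0;
        fclose(f);
    }
    return val;
}

int main(void)
{
    const char *stat  = "/sys/fs/cgroup/cpu/mygroup/cpu.stat";
    const char *usage = "/sys/fs/cgroup/cpuacct/mygroup/cpuacct.usage";
    uint64_t w0 = hierarchy_wait_sum(stat);
    uint64_t u0 = read_u64(usage);
    uint64_t dw, du;

    sleep(1);           /* sample interval */

    dw = hierarchy_wait_sum(stat) - w0;
    du = read_u64(usage) - u0;

    if (dw + du)
        printf("waited %.1f%% of runnable time\n",
               100.0 * (double)dw / (double)(dw + du));
    return 0;
}

Sampling both counters over the same interval keeps the numerator and
denominator of the changelog formula consistent.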