Re: [PATCH v2 for-4.12-fixes 1/2] sched/fair: Use task_groups instead of leaf_cfs_rq_list to walk all cfs_rqs

From: Tim Chen
Date: Fri May 26 2017 - 21:25:39 EST




On 05/25/2017 07:39 AM, Tejun Heo wrote:
> On Wed, May 24, 2017 at 04:40:34PM -0700, Tim Chen wrote:
>> We did some preliminary testing of this patchset for a well
>> known database benchmark on a 4 socket Skylake server system.
>> It provides a 3.7% throughput boost which is significant for
>> this benchmark.
>
> That's great to hear. Yeah, the walk can be noticeably expensive even
> with moderate number of cgroups. Thanks for sharing the result.


Yes, the walk in update_blocked_averages scales badly because it
iterates over *all* leaf cfs_rqs, which makes it very expensive. It
consumes 11.7% of our CPU cycles for this benchmark when CGROUP
is on. Your patchset skips unused cgroups and reduces the overhead to
10.4%. The CPU cycles profile is attached below for your reference.
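
As a rough illustration of why the walk scales with the number of
cgroups, here is a toy model in Python. It is not kernel code: the
names LeafCfsRq, walk_all and walk_and_prune are made up for this
sketch, and the decay factor and threshold are arbitrary assumptions.
It only shows that the cost of each pass tracks the length of the list
unless fully decayed entries are dropped from it.

```python
from dataclasses import dataclass


@dataclass
class LeafCfsRq:
    """Hypothetical stand-in for one per-cgroup runqueue on a CPU."""
    name: str
    load: float  # blocked load that still has to decay


def walk_all(leaves, decay=0.5):
    """Decay every entry on every pass; return how many entries were visited."""
    visited = 0
    for rq in leaves:
        rq.load *= decay
        visited += 1
    return visited


def walk_and_prune(leaves, decay=0.5, eps=1e-3):
    """Decay every entry, then keep only those with load left.

    Returns (visited, remaining); the next pass walks `remaining` only.
    """
    visited = 0
    remaining = []
    for rq in leaves:
        rq.load *= decay
        visited += 1
        if rq.load > eps:
            remaining.append(rq)
    return visited, remaining


def make_leaves(n_idle, n_busy):
    """Build a list with many idle entries and a few loaded ones."""
    idle = [LeafCfsRq(f"idle{i}", 0.0) for i in range(n_idle)]
    busy = [LeafCfsRq(f"busy{i}", 1024.0) for i in range(n_busy)]
    return idle + busy


if __name__ == "__main__":
    leaves = make_leaves(n_idle=1000, n_busy=10)
    walk_all(leaves)
    print("no pruning, second pass visits:", walk_all(leaves))

    leaves = make_leaves(n_idle=1000, n_busy=10)
    _, leaves = walk_and_prune(leaves)
    visited, _ = walk_and_prune(leaves)
    print("with pruning, second pass visits:", visited)
```

In this model the unpruned walk pays for all 1010 entries on every
pass, while the pruned walk pays for them once and then only for the
entries that still carry load.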

The scheduler's frequent updates of cgroup load averages, and
having to iterate over all the leaf cfs_rqs on each load balance, make
update_blocked_averages one of the most expensive functions in the
system, which makes CGROUP costly. Without CGROUP, schedule costs only
3.3% of CPU cycles vs. 16.4% with CGROUP turned on. Your patchset
reduces that to 14.9%.
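
To put those percentages in relative terms, the arithmetic can be
sketched as below. The figures are taken from the profiles in this
mail; the helper name relative_reduction is only for illustration.

```python
def relative_reduction(before_pct, after_pct):
    """Relative drop, in percent, from before_pct to after_pct."""
    return (before_pct - after_pct) / before_pct * 100.0


# Share of CPU cycles, from the profiles in this mail.
UBA_BEFORE, UBA_AFTER = 11.71, 10.37      # update_blocked_averages
SCHED_BEFORE, SCHED_AFTER = 16.42, 14.90  # schedule, CGROUP on
SCHED_NO_CGROUP = 3.33                    # schedule, CGROUP off

uba_drop = relative_reduction(UBA_BEFORE, UBA_AFTER)
sched_drop = relative_reduction(SCHED_BEFORE, SCHED_AFTER)

# Extra cycles spent in schedule that are attributable to CGROUP.
extra_before = SCHED_BEFORE - SCHED_NO_CGROUP
extra_after = SCHED_AFTER - SCHED_NO_CGROUP

if __name__ == "__main__":
    print(f"update_blocked_averages relative drop: {uba_drop:.1f}%")
    print(f"schedule relative drop: {sched_drop:.1f}%")
    print(f"CGROUP extra cost in schedule: {extra_before:.2f} -> {extra_after:.2f} points")
```

So the patchset trims roughly a tenth of the time spent in schedule,
but most of the extra cost of CGROUP relative to the no-CGROUP case
is still there.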

This benchmark has thousands of running tasks, so it puts a good
deal of stress on the scheduler.

Tim


CPU cycles profile:

4.11 Before your patchset with CGROUP:
---------------------------------------

16.42% 0.03% 280 [kernel.vmlinux] [k] schedule
|
 --16.39%--schedule
           |
            --16.31%--__sched_text_start
                      |
                      |--12.85%--pick_next_task_fair
                      |          |
                      |           --11.71%--update_blocked_averages
                      |                     |
                      |                      --5.00%--update_load_avg
                      |
                      |--2.04%--finish_task_switch
                      |         |
                      |         |--0.85%--ret_from_intr
                      |         |         |
                      |         |          --0.85%--do_IRQ
                      |         |
                      |          --0.75%--apic_timer_interrupt
                      |                   |
                      |                    --0.75%--smp_apic_timer_interrupt
                      |                              |
                      |                               --0.55%--irq_exit
                      |                                        |
                      |                                         --0.55%--__do_softirq
                      |
                       --0.51%--deactivate_task


4.11 After your patchset with CGROUP:
-------------------------------------

14.90% 0.04% 337 [kernel.vmlinux] [k] schedule
|
 --14.86%--schedule
           |
            --14.78%--__sched_text_start
                      |
                      |--11.51%--pick_next_task_fair
                      |          |
                      |           --10.37%--update_blocked_averages
                      |                     |
                      |                      --4.55%--update_load_avg
                      |
                      |--1.79%--finish_task_switch
                      |         |
                      |         |--0.77%--ret_from_intr
                      |         |         |
                      |         |          --0.77%--do_IRQ
                      |         |
                      |          --0.65%--apic_timer_interrupt
                      |                   |
                      |                    --0.65%--smp_apic_timer_interrupt
                      |
                       --0.53%--deactivate_task

4.11 with No CGROUP:
--------------------

3.33% 0.04% 336 [kernel.vmlinux] [k] schedule
|
 --3.29%--schedule
          |
           --3.19%--__sched_text_start
                    |
                     --1.45%--pick_next_task_fair
                              |
                               --1.15%--load_balance
                                        |
                                         --0.87%--find_busiest_group
                                                  |
                                                   --0.82%--update_sd_lb_stats