First of.. WTF is v1?
Secondly, please always CC the authors of the code you're changing.
On Wed, Jan 15, 2014 at 09:22:36PM -0500, Waiman Long wrote:It was found that with a perf profile of a compute workload (at 1500So why not for SMP?
users) of the AIM7 benchmark running on a glueless 4-socket 40-core
Westmere-EX system (HT on) on a 3.13-rc8 kernel that the scheduling
tick related functions account for quite a significant portion of
the total kernel cpu cycles.
0.62% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
0.47% reaim [kernel.kallsyms] [k] entity_tick
0.10% reaim [kernel.kallsyms] [k] update_cfs_shares
0.03% reaim [kernel.kallsyms] [k] update_curr
The scheduling tick functions account for about 1.22% of the total
CPU cycles. Of the top 2 function in the above list, the reading
and writing of the tg->load_avg variable account for over 90% of the
CPU cycles:
atomic_long_add(tg_contrib,&tg->load_avg);
atomic_long_read(&tg->load_avg) + 1);
This patch reduces the contention on the load_avg variable (and
secondarily on the runnable_avg variable) by the following 2 measures:
1. Make the load_avg and runnable_avg fields of the task_group
structure sit in their own cacheline without sharing it with others.
This only applies if the kernel is built for NUMA systems with
multiple sockets.
Also, what's the difference between having both of them in the same
cacheline as opposed to a cacheline each?
They're both touched from the same tick, so it makes sense to have them
in one cacheline. Now you get to move two lines into exclusive state,
instead of just the one.
2. Use atomic_long_add_return() to update the fields and save theThen why aren't this two patches?
returned value in a temporary location in the cfs structure to
be used later instead of reading the fields directly.
Furthermore, I completely hate the way you implemented this; the stuff
like in the first hunk below makes the entire code flow horrid. Its
already difficult code, using conditional variables makes it even worse.
Who's to say your 'cached' value is recent? You didn't put in a call
chain analysis to show you always first pass through the add_return()
before using the cached value.
The second change does require some changes in the ordering of howMight have is no good, either you work through it and make damn sure its
some of the average counts are being computed and hence may have a
slight effect on their behavior.
solid or you walk.
Preserved the rest for the added Cc's.