Re: [PATCH v4 2/5] sched_clock: Optimize cache line usage

From: Stephen Boyd
Date: Mon Feb 09 2015 - 21:37:47 EST


On 02/09/15 01:47, Daniel Thompson wrote:
> On 09/02/15 09:28, Will Deacon wrote:
>> On Sun, Feb 08, 2015 at 12:02:37PM +0000, Daniel Thompson wrote:
>>> Currently sched_clock(), a very hot code path, is not optimized to
>>> minimise its cache profile. In particular:
>>>
>>> 1. cd is not ____cacheline_aligned,
>>>
>>> 2. struct clock_data does not distinguish between hotpath and
>>> coldpath data, reducing locality of reference in the hotpath,
>>>
>>> 3. Some hotpath data is missing from struct clock_data and is marked
>>> __read_mostly (which more or less guarantees it will not share a
>>> cache line with cd).
>>>
>>> This patch corrects these problems by extracting all hotpath data
>>> into a separate structure and using ____cacheline_aligned to ensure
>>> the hotpath uses a single (64 byte) cache line.
>> Have you got any performance figures for this change, or is this just a
>> theoretical optimisation? It would be interesting to see what effect this
>> has on systems with 32-byte cachelines and also scenarios where there's
>> contention on the sequence counter.
> Most of my testing has focused on proving the NMI safety parts of the
> patch work as advertised so its mostly theoretical.
>
> However there are some numbers from simple tight loop calls to
> sched_clock (Stephen Boyd's results are more interesting than mine
> because I observe pretty wild quantization effects that render the
> results hard to trust):
> http://thread.gmane.org/gmane.linux.kernel/1871157/focus=1879265
>
> Not sure what useful figures would be useful for a contended sequence
> counter. Firstly the counter is taken for write at 7/8 wrap time of the
> times so even for the fastest timers the interval is likely to be >3s
> and is very short duration. Additionally, the NMI safety changes make it
> possible to read the timer whilst it is being updated so it is only
> during the very short struct-copy/write/struct-copy/write update
> sequence that we will observe the extra cache line used for a read.
> Benchmarks that show the effect of update are therefore non-trivial to
> construct.
>

Here's the raw numbers for the tight loop. I noticed that if I don't use
perf I get a larger number of calls per 10s, most likely because we
aren't doing anything else. These are very lightly loaded systems, i.e.
busybox ramdisk with nothing going on. Kernel version is v3.19-rc4. The
CPU is Krait on msm8960 and msm8974, except on msm8974 it has the arm
architected timers backing sched_clock() vs. our own custom timer IP on
msm8960. The cache line size is 64 bytes. I also ran it on msm8660 which
is a Scorpion CPU with the same timer as msm8960 (custom timer IP) and a
cache line size of 32 bytes. Unfortunately nobody has ported Scorpion
over to perf events, so we don't hardware events.

msm8960 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14528449 calls in 10000000290 ns
Made 14528925 calls in 10000000142 ns
Made 14524549 calls in 10000000587 ns
Made 14528164 calls in 10000000734 ns
Made 14524468 calls in 10000000290 ns
Made 14527198 calls in 10000000438 ns
Made 14523508 calls in 10000000734 ns
Made 14527894 calls in 10000000290 ns
Made 14529609 calls in 10000000734 ns
Made 14523114 calls in 10000000142 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10009.635016 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 16.16% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003806350 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
921924235 instructions # 0.23 insns per cycle ( +- 0.01% )
0 branches # 0.000 K/sec
58521151 branch-misses # 5.846 M/sec ( +- 0.01% )

10.011767657 seconds time elapsed ( +- 0.00% )

msm8960 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 19626366 calls in 10000000587 ns
Made 19623708 calls in 10000000142 ns
Made 19623282 calls in 10000000290 ns
Made 19625304 calls in 10000000290 ns
Made 19625151 calls in 10000000291 ns
Made 19624906 calls in 10000000290 ns
Made 19625383 calls in 10000000143 ns
Made 19625235 calls in 10000000290 ns
Made 19624969 calls in 10000000290 ns
Made 19625209 calls in 10000000438 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10009.883401 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 15.88% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003901511 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1164635790 instructions # 0.29 insns per cycle ( +- 0.00% )
0 branches # 0.000 K/sec
20039814 branch-misses # 2.002 M/sec ( +- 0.00% )

10.012092383 seconds time elapsed ( +- 0.00% )

msm8974 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289694 calls in 10000000083 ns
Made 21289072 calls in 10000000082 ns
Made 21289550 calls in 10000000395 ns
Made 21288892 calls in 10000000291 ns
Made 21288987 calls in 10000000135 ns
Made 21289140 calls in 10000000395 ns
Made 21289161 calls in 10000000395 ns
Made 21288911 calls in 10000000239 ns
Made 21289204 calls in 10000000135 ns
Made 21288738 calls in 10000000135 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10003.839348 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
4 context-switches # 0.000 K/sec ( +- 3.70% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6146323757 cycles # 0.614 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1155527762 instructions # 0.19 insns per cycle ( +- 0.00% )
107186099 branches # 10.714 M/sec ( +- 0.00% )
35548359 branch-misses # 33.17% of all branches ( +- 0.00% )

10.004769053 seconds time elapsed ( +- 0.00% )

msm8974 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289357 calls in 10000000239 ns
Made 21384961 calls in 10000000396 ns
Made 22105925 calls in 10000000238 ns
Made 27384126 calls in 10000000239 ns
Made 22107737 calls in 10000000134 ns
Made 21368867 calls in 10000000239 ns
Made 22106065 calls in 10000000395 ns
Made 27384196 calls in 10000000083 ns
Made 22107334 calls in 10000000291 ns
Made 21365426 calls in 10000000291 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10003.753333 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 18.18% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6837664600 cycles # 0.684 GHz ( +- 6.74% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1148993903 instructions # 0.17 insns per cycle ( +- 3.32% )
115049358 branches # 11.501 M/sec ( +- 3.31% )
42520513 branch-misses # 36.96% of all branches ( +- 5.00% )

10.004769533 seconds time elapsed ( +- 0.00% )

msm8660 (before patch)
----------------------

# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14099029 calls in 10000000586 ns
Made 14099227 calls in 10000000735 ns
Made 14098763 calls in 10000000439 ns
Made 14099042 calls in 10000000291 ns
Made 14099273 calls in 10000000290 ns
Made 14100377 calls in 10000000586 ns
Made 14100183 calls in 10000000586 ns
Made 14099220 calls in 10000000586 ns
Made 14098853 calls in 10000000587 ns
Made 14099368 calls in 10000000142 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10006.700528 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
11 context-switches # 0.001 K/sec ( +- 10.38% )
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec

10.008796161 seconds time elapsed ( +- 0.00% )

msm8660 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 20555901 calls in 10000000438 ns
Made 15510019 calls in 10000000142 ns
Made 15510371 calls in 10000000587 ns
Made 15509184 calls in 10000000439 ns
Made 15509068 calls in 10000000291 ns
Made 15510719 calls in 10000000439 ns
Made 15508899 calls in 10000000291 ns
Made 15509206 calls in 10000000587 ns
Made 15509057 calls in 10000000290 ns
Made 15509178 calls in 10000000735 ns

Performance counter stats for 'modprobe sched_clock_test' (10 runs):

10009.491416 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
13 context-switches # 0.001 K/sec ( +- 10.82% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec

10.011834087 seconds time elapsed ( +- 0.00% )

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/