[RFC v8 27/28] clockevents: optimize struct clock_event_device layout

From: Nicolai Stange
Date: Sat Nov 19 2016 - 11:12:39 EST


Benchmarks have shown that clockevents_adjust_all_freqs() is memory bound.
The members of struct clock_event_device accessed therein, namely
->features, ->mult_adjusted, ->shift, ->mult and ->list are split across
two cachelines.

In what follows, a cacheline size of 64 bytes will be assumed.

I.) sizeof(long) == 4
At least for sizeof(long) == sizeof(void*) == 4, e.g. on ARM, this
situation can be improved by reordering the members of
struct clock_event_device.

This patch exploits the fact that with the current layout, there is some
padding inserted within the first cacheline on ARM: right now, the first
three members, which are of pointer type and 12 bytes in total, are
followed by a ktime_t. Basically being an s64, the latter gets aligned on
an 8 byte boundary. Thus, bytes 12 through 15 are occupied by unused
padding.

Furthermore, for sizeof(unsigned long) == 4, this patch series has achieved
another net saving of 4 bytes within the first cacheline so far.

Summarizing: by reordering the first cacheline's members, a total of
8 bytes can be made available compared to the state before this series.
This allows the 12 bytes made up by ->list and ->mult to be placed within
the first cacheline while moving only one 4 byte member out.

This victim will be ->set_state_periodic(), which will be moved next to
its ->set_state_*() relatives within the second cacheline.

II.) sizeof(long) == 8
For the case of sizeof(long) == sizeof(void*) == 8, the first cacheline's
contents will be unchanged.
The ->list and ->mult members are moved from the third to the start of
the second cacheline. Compared to the situation before this series,
this will move ->broadcast() and ->suspend() from the end of the second
cacheline to the start of the third one.
A nice side effect is that, due to the elimination of padding in the third
cacheline together with the patch ("clockevents: degrade ->retries to
unsigned int"), struct clock_event_device now fits exactly into three
cachelines.

Thus, in struct clock_event_device:
- move the ->event_handler(), ->set_next_event() and ->set_next_ktime()
in order to avoid padding in front of ->next_event on ARM.
- move the ->list and ->mult members right before the ->retries member.

Benchmark results:
The following measurements have been carried out on a Raspberry Pi 2B
(armv7, 4 cores, 900MHz). The adjustment process has been driven by
periodically injecting a certain offset via adjtimex(2) approximately
every 3.7h. A 'stress --vm 8 --vm-bytes 32M' was running. The runtime of
clockevents_adjust_all_freqs() has been measured.

- Before this patch:
Mean: 8750.73 +- 1034.77 ns
Quantiles:
0% 25% 50% 75% 100%
3854 8171 8901 9452 13593 (ns)

- After this patch:
Mean: 6916.90 +- 782.63 ns
Quantiles:
0% 25% 50% 75% 100%
3072 6412 6927 7430 10989 (ns)

Signed-off-by: Nicolai Stange <nicstange@xxxxxxxxx>
---
include/linux/clockchips.h | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h
index 878a802..17ef940 100644
--- a/include/linux/clockchips.h
+++ b/include/linux/clockchips.h
@@ -74,10 +74,6 @@ enum clock_event_state {

/**
* struct clock_event_device - clock event device descriptor
- * @event_handler: Assigned by the framework to be called by the low
- * level handler of the event source
- * @set_next_event: set next event function using a clocksource delta
- * @set_next_ktime: set next event function using a direct ktime value
* @next_event: local storage for the next event in oneshot mode
* @max_delta_ns: maximum delta value in ns
* @max_delta_ticks: maximum delta value in ticks
@@ -86,6 +82,12 @@ enum clock_event_state {
* @shift: nanoseconds to cycles divisor (power of two)
* @state_use_accessors:current state of the device, assigned by the core code
* @features: features
+ * @event_handler: Assigned by the framework to be called by the low
+ * level handler of the event source
+ * @set_next_event: set next event function using a clocksource delta
+ * @set_next_ktime: set next event function using a direct ktime value
+ * @list: list head for the management code
+ * @mult: ns to cycles multiplier stored for reconfiguration
* @retries: number of forced programming retries
* @set_state_periodic: switch state to periodic
* @set_state_oneshot: switch state to oneshot
@@ -95,18 +97,13 @@ enum clock_event_state {
* @broadcast: function to broadcast events
* @name: ptr to clock event name
* @min_delta_ticks: minimum delta value in ticks stored for reconfiguration
- * @mult: ns to cycles multiplier stored for reconfiguration
* @rating: variable to rate clock event devices
* @irq: IRQ number (only for non CPU local devices)
* @bound_on: Bound on CPU
* @cpumask: cpumask to indicate for which CPUs this device works
- * @list: list head for the management code
* @owner: module reference
*/
struct clock_event_device {
- void (*event_handler)(struct clock_event_device *);
- int (*set_next_event)(unsigned long evt, struct clock_event_device *);
- int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
ktime_t next_event;
u64 max_delta_ns;
unsigned long max_delta_ticks;
@@ -115,6 +112,11 @@ struct clock_event_device {
u32 shift;
enum clock_event_state state_use_accessors:8;
unsigned int features:24;
+ void (*event_handler)(struct clock_event_device *);
+ int (*set_next_event)(unsigned long evt, struct clock_event_device *);
+ int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
+ struct list_head list;
+ u32 mult;
unsigned int retries;

int (*set_state_periodic)(struct clock_event_device *);
@@ -129,12 +131,10 @@ struct clock_event_device {

const char *name;
unsigned int min_delta_ticks;
- u32 mult;
int rating;
int irq;
int bound_on;
const struct cpumask *cpumask;
- struct list_head list;
struct module *owner;
} ____cacheline_aligned;

--
2.10.2