[RFC 0/1] Alternate history mechanism for the TEO governor

From: Pratik Rajesh Sampat
Date: Mon May 11 2020 - 10:11:26 EST

Next message: Wei Hu: "RE: [PATCH v3 2/2] PCI: hv: Retry PCI bus D0 entry when the first attempt failed with invalid device state"
Previous message: Luis Chamberlain: "Re: [PATCH 00/15] net: taint when the device driver firmware crashes"
Next in thread: Pratik Rajesh Sampat: "[RFC 1/1] Weighted approach to gather and use history in TEO governor"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

First RFC posting: https://lkml.org/lkml/2020/2/22/27
Changelog
1. Upated benchmarks

Currently the TEO governor apart from the TEO timer and hit/miss/early
hit buckets; especially for non-timer events, also gathers history of 8
intervals and if there are significant idle durations less than the
current, then it decides if a shallower state must be chosen.

The current sliding history window does do fair job for the prediction
of the non-timer events, so to improve upon the history mechanism,
one may just hypothesize increasing the interval window so that more
history can influence better decisions.

However, this may not be true. The current algorithm can be limited due
to its linearity, while also average of interval durations may not be a
suitable metric when the data presents skewness. This can especially be
true for the when number of intervals taken into account increases.

Benchmark: Schbench
Provides latency statistics for scheduler wakeups
Machine - IBM Power 9

Metric of measurement
1. Performance - latency 99th percentile - usec
2. Power - watts
Performance and Power statistics are normalized

Latency - Normalized
8 intervals is baseline
+---------+-------------+--------------+--------------+--------------+
| Threads | 8 Intervals | 16 Intervals | 32 Intervals | 64 Intervals |
+---------+-------------+--------------+--------------+--------------+
| 2 | 100 | 102.81 | 106.93 | 100.86 |
+---------+-------------+--------------+--------------+--------------+
| 4 | 100 | 104.67 | 96.58 | 96.58 |
+---------+-------------+--------------+--------------+--------------+
| 8 | 100 | 116.26 | 100 | 121.44 |
+---------+-------------+--------------+--------------+--------------+
| 16 | 100 | 85.94 | 81.46 | 80.86 |
+---------+-------------+--------------+--------------+--------------+
+---------+---------------+
| Threads | 128 Intervals |
+---------+---------------+
| 2 | 83.09 |
+---------+---------------+
| 4 | 74.75 |
+---------+---------------+
| 8 | 87.15 |
+---------+---------------+
| 16 | 76.98 |
+---------+---------------+

The latency number are kind of a mixed bag. Some intervals perform
similar, while some slightly better, while some considerably worse

Standout: 128 latency intervals always performs well in terms of latency

Power - Normalized
8 intervals in baseline
+---------+-------------+--------------+--------------+--------------+
| Threads | 8 Intervals | 16 Intervals | 32 Intervals | 64 Intervals |
+---------+-------------+--------------+--------------+--------------+
| 2 | 100 | 100 | 85.29 | 97.05 |
+---------+-------------+--------------+--------------+--------------+
| 4 | 100 | 115.9 | 111.36 | 145.45 |
+---------+-------------+--------------+--------------+--------------+
| 8 | 100 | 109.63 | 114.45 | 102.4 |
+---------+-------------+--------------+--------------+--------------+
| 16 | 100 | 104.2 | 103.36 | 102.52 |
+---------+-------------+--------------+--------------+--------------+
+---------+---------------+
| Threads | 128 Intervals |
+---------+---------------+
| 2 | 123.52 |
+---------+---------------+
| 4 | 159.09 |
+---------+---------------+
| 8 | 114.45 |
+---------+---------------+
| 16 | 106.72 |
+---------+---------------+

Power considerably regresses in almost all cases, showing that with
increasing history there seems to be skewness presented to shallow
states.

Weighted TEO Governor
---------------------

To complement the current moving window history, an approach is devised
where each idle state separately maintains a weight for itself and its
counterpart idle states to form a probability distribution.

When a decision needs to be made, the TEO governor selects an idle state
based on its timer and other hits/early hits metric. After which, the
probability distribution of that selected idle state is looked at which
gives insight into how probable that state is to occur if picked.

The probability distribution is nothing but a n*n matrix, where
n - drv->state_count.
Each entry in the array signifies a weight for that row.
The weights can vary from the range [0-10000].

For example:
state_mat[2][1] - 3000 means that when state 2 is entered idle with, the
probability that the interval will last long enough to satisfy state 1's
residency is 30%.
The trailing zeros correspond to having more resolution while increasing
or reducing the weights for correction.

Currently, for selection of an idle state based on probabilities, a
weighted random number generator is used to choose one of the idle
states. Naturally, the states with higher weights are more likely to be
chosen.

On wakeup, the weights are updated. The state with which it should have
woken up with (could be the hit / miss / early hit state) is increased
in weight by the "LEARNING_RATE" % and the rest of the states for that
index are reduced by the same factor.

A possible disadvantage of this approach is that history for that
thread may make the weights stale and influence the choosing of idle
states which may not be relevant anymore. Aging the weights could be a
solution for that, where over time, the highest weight for that index
can be decayed and the weight equally spread across the rest of the
states.
Although this RFC does not cover the implementation for that as there
seemed to be too much run to run variance with this approach.

Having a finer view of the history in addition to weighted randomized
salt seems to show some promise in terms of saving power without
compromising performance.

Why not just pick the state with the highest probability?
- If there are multiple states close in probabilities competing for
dominance they should be given a fair chance
Of course, if the prediction was wrong the algorithm will self correct,
however by that time the pattern may have changed
- If this contention pattern is exhibited often then the prediction
algorithm will always stay playing catch-up.

Benchmarks:

Metric of measurement:
1. Performance (latency / throughput)
2. Power (watts)
3. Accuracy %
a. Correct prediction - The idle state predicted aligns with the
actual sleep duration
b. Undershoot prediction (US) - The idle state predicted is deeper for
the actual sleep duration
c. Overshoot prediction (OS) - The idle state predicted is shallower
for the actual sleep duration

Performance and Power numbers are normalized.
However accuracy numbers are deliberately kept as-is to show how well
the vanilla governor performed in the first place

Schbench
--------
Benchmarks scheduler wakeup latencies

1. latency 99th percentile - usec
2. Power - watts
Machine - IBM Power 9

Latency and Power - Normalized
+---------+--------------+-----------------+---------------+
| Threads | TEO Baseline | Wt. TEO Latency | Wt. TEO Power |
+---------+--------------+-----------------+---------------+
| 2 | 100 | 101.3 | 85.29 |
+---------+--------------+-----------------+---------------+
| 4 | 100 | 105.06 | 113.63 |
+---------+--------------+-----------------+---------------+
| 8 | 100 | 92.32 | 90.36 |
+---------+--------------+-----------------+---------------+
| 16 | 100 | 99.1 | 92.43 |
+---------+--------------+-----------------+---------------+

Accuracy

Vanilla TEO Governor - Prediction distribution %
+---------+------+------+------+-------+-------+-------+---------+
| Threads | US 1 | US 2 | US 3 | US 4 | US 5 | US 6 | Correct |
+---------+------+------+------+-------+-------+-------+---------+
| 2 | 6.12 | 1.08 | 1.76 | 20.41 | 9.2 | 28.74 | 22.51 |
+---------+------+------+------+-------+-------+-------+---------+
| 4 | 8.54 | 1.56 | 1.25 | 20.24 | 10.75 | 25.17 | 22.67 |
+---------+------+------+------+-------+-------+-------+---------+
| 8 | 5.88 | 2.67 | 1.09 | 13.72 | 17.08 | 32.04 | 22.95 |
+---------+------+------+------+-------+-------+-------+---------+
| 16 | 6.29 | 2.43 | 0.86 | 13.21 | 15.33 | 26.52 | 29.34 |
+---------+------+------+------+-------+-------+-------+---------+
+---------+------+------+------+
| Threads | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 2 | 1.77 | 1.27 | 7.14 |
+---------+------+------+------+
| 4 | 1.8 | 1.31 | 6.71 |
+---------+------+------+------+
| 8 | 0.65 | 0.72 | 3.2 |
+---------+------+------+------+
| 16 | 0.63 | 1.71 | 3.68 |
+---------+------+------+------+

Weighted TEO Governor - Prediction distribution %
+---------+------+------+------+-------+-------+-------+---------+
| Threads | US 1 | US 2 | US 3 | US 4 | US 5 | US 6 | Correct |
+---------+------+------+------+-------+-------+-------+---------+
| 2 | 7.26 | 2.07 | 0.02 | 15.85 | 13.29 | 36.26 | 22.13 |
+---------+------+------+------+-------+-------+-------+---------+
| 4 | 4.33 | 1.45 | 0.15 | 14.17 | 14.68 | 40.36 | 21.01 |
+---------+------+------+------+-------+-------+-------+---------+
| 8 | 4.73 | 2.46 | 0.12 | 12.48 | 14.68 | 32.38 | 28.9 |
+---------+------+------+------+-------+-------+-------+---------+
| 16 | 7.68 | 1.25 | 0.98 | 12.15 | 11.19 | 24.91 | 35.92 |
+---------+------+------+------+-------+-------+-------+---------+
+---------+------+------+------+
| Threads | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 2 | 0.39 | 0.42 | 2.31 |
+---------+------+------+------+
| 4 | 0.45 | 0.51 | 2.89 |
+---------+------+------+------+
| 8 | 0.53 | 0.66 | 3.06 |
+---------+------+------+------+
| 16 | 0.97 | 1.9 | 3.05 |
+---------+------+------+------+

Sleeping Ebizzy
---------------
Program to generate workloads resembling web server workloads.
The benchmark is customized to allow for a sleep interval -i

1. Number of records
2. Power - watts
Machine - IBM Power 9

Parameters:
1. -m -> Always use mmap instead of malloc
2. -M -> Never use mmap
3. -S <seconds> -> Number of seconds to run
4. -i <interval> -> Sleep interval

Number of records and power normalized
+-------------------+---------------+------------------+-----------------+
| Parameters | TEO baseline | Wt TEO records | Wt. TEO Power |
+-------------------+---------------+------------------+-----------------+
| -S 60 -i 10000 | 100 | 106.56 | 93.95 |
+-------------------+---------------+------------------+-----------------+
| -m -S 60 -i 10000 | 100 | 100.62 | 82.14 |
+-------------------+---------------+------------------+-----------------+
| -M -S 60 -i 10000 | 100 | 104.97 | 95.19 |
+-------------------+---------------+------------------+-----------------+

Accuracy

Vanilla TEO Governor - Prediction distribution %
+-------------------+-------+------+------+-------+------+-------+
| Parameters | US 1 | US 2 | US 3 | US 4 | US 5 | US 6 |
+-------------------+-------+------+------+-------+------+-------+
| -S 60 -i 10000 | 45.46 | 0.52 | 1.5 | 15.34 | 2.44 | 8.61 |
+-------------------+-------+------+------+-------+------+-------+
| -m -S 60 -i 10000 | 4.22 | 2.08 | 0.71 | 90.01 | 0 | 0.01 |
+-------------------+-------+------+------+-------+------+-------+
| -M -S 60 -i 10000 | 15.78 | 1.42 | 2.4 | 22.39 | 1.68 | 11.25 |
+-------------------+-------+------+------+-------+------+-------+
+-------------------+---------+------+------+------+------+
| Parameters | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
+-------------------+---------+------+------+------+------+
| -S 60 -i 10000 | 17.03 | 1.73 | 1.1 | 6.27 | 0 |
+-------------------+---------+------+------+------+------+
| -m -S 60 -i 10000 | 2.44 | 0.18 | 0.13 | 0.22 | 0 |
+-------------------+---------+------+------+------+------+
| -M -S 60 -i 10000 | 31.65 | 3.45 | 1.8 | 8.18 | 0 |
+-------------------+---------+------+------+------+------+

Weigted TEO Governor - Prediction distribution %
+-------------------+-------+------+------+-------+------+-------+
| Parameters | US 1 | US 2 | US 3 | US 4 | US 5 | US 6 |
+-------------------+-------+------+------+-------+------+-------+
| -S 60 -i 10000 | 8.25 | 0.87 | 0.98 | 19.23 | 4.05 | 26.35 |
+-------------------+-------+------+------+-------+------+-------+
| -m -S 60 -i 10000 | 7.69 | 4.35 | 0.93 | 82.74 | 0.01 | 0.01 |
+-------------------+-------+------+------+-------+------+-------+
| -M -S 60 -i 10000 | 3.73 | 3.29 | 0.73 | 13.33 | 7.38 | 18.61 |
+-------------------+-------+------+------+-------+------+-------+
+-------------------+---------+------+------+------+------+
| Parameters | Correct | OS 1 | OS 2 | OS 3 | OS 4 |
+-------------------+---------+------+------+------+------+
| -S 60 -i 10000 | 32.86 | 3.27 | 2.05 | 2.09 | 0 |
+-------------------+---------+------+------+------+------+
| -m -S 60 -i 10000 | 3.4 | 0.29 | 0.28 | 0.3 | 0 |
+-------------------+---------+------+------+------+------+
| -M -S 60 -i 10000 | 48.19 | 1.8 | 0.93 | 1.97 | 0.04 |
+-------------------+---------+------+------+------+------+

Pgbench
-------
pgbench is a simple program for running benchmark tests on PostgreSQL.
It runs the same sequence of SQL commands over and over, possibly in
multiple concurrent database sessions, and then calculates the average
transaction rate (transactions per second).
Scale of measurement:
1. Number of transactions
2. Power (watts)
Machine - IBM Power 9

Number of transactions and power is normalized

+---------+---------------+---------------------+-----------------+
| Clients | TEO Baseline | Wt. TEO transations | Wt. TEO power |
+---------+---------------+---------------------+-----------------+
| 4 | 100 | 105.93 | 85.18 |
+---------+---------------+---------------------+-----------------+
| 8 | 100 | 98.11 | 100 |
+---------+---------------+---------------------+-----------------+
| 16 | 100 | 98.73 | 104.16 |
+---------+---------------+---------------------+-----------------+
| 32 | 100 | 101.05 | 95 |
+---------+---------------+---------------------+-----------------+

Accuracy

Vanilla TEO Governor - Prediction distribution %
+---------+-------+-------+-------+-------+------+------+---------+
| Clients | US 1 | US 2 | US 3 | US 4 | US 5 | US 6 | Correct |
+---------+-------+-------+-------+-------+------+------+---------+
| 4 | 59.97 | 2.56 | 0.46 | 1.45 | 0.13 | 0.87 | 32.09 |
+---------+-------+-------+-------+-------+------+------+---------+
| 8 | 82.76 | 7.31 | 0.12 | 1.23 | 0.08 | 1.22 | 7.2 |
+---------+-------+-------+-------+-------+------+------+---------+
| 16 | 2 | 13.46 | 73.27 | 10.7 | 0.04 | 0.12 | 0.38 |
+---------+-------+-------+-------+-------+------+------+---------+
| 32 | 1.33 | 0.76 | 44.9 | 40.8 | 0.07 | 12 | 0.12 |
+---------+-------+-------+-------+-------+------+------+---------+
+---------+------+------+------+
| Clients | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 4 | 2.44 | 0.01 | 0.02 |
+---------+------+------+------+
| 8 | 0.04 | 0.02 | 0.02 |
+---------+------+------+------+
| 16 | 0.02 | 0 | 0.01 |
+---------+------+------+------+
| 32 | 0.02 | 0 | 0 |
+---------+------+------+------+

Weigted TEO Governor - Prediction distribution %
+---------+-------+-------+-------+-------+------+------+---------+
| Clients | US 1 | US 2 | US 3 | US 4 | US 5 | US 6 | Correct |
+---------+-------+-------+-------+-------+------+------+---------+
| 4 | 42.79 | 1.79 | 0.01 | 0.72 | 0.1 | 0.78 | 46.99 |
+---------+-------+-------+-------+-------+------+------+---------+
| 8 | 76.52 | 8.02 | 0.12 | 0.42 | 0.06 | 0.5 | 14.35 |
+---------+-------+-------+-------+-------+------+------+---------+
| 16 | 0.72 | 51.95 | 42.02 | 4.65 | 0.05 | 0.31 | 0.3 |
+---------+-------+-------+-------+-------+------+------+---------+
| 32 | 1.06 | 1.59 | 42.89 | 53.11 | 0.05 | 0.42 | 0.87 |
+---------+-------+-------+-------+-------+------+------+---------+
+---------+------+------+------+
| Clients | OS 1 | OS 2 | OS 3 |
+---------+------+------+------+
| 4 | 6.81 | 0 | 0.01 |
+---------+------+------+------+
| 8 | 0 | 0 | 0.01 |
+---------+------+------+------+
| 16 | 0 | 0 | 0 |
+---------+------+------+------+
| 32 | 0 | 0 | 0.01 |
+---------+------+------+------+

Hackbench
---------
Creates a specified number of pairs of schedulable entities
which communicate via either sockets or pipes and time how long it
takes for each pair to send data back and forth.

Machine - IBM Power 9

Scale of measurement:
1. Time (s)
2. Power (watts)
Time is normalized

+---------+----------+----------------------+-------------------+
| Loops | TEO Time | Wt. TEO Time Sockets | Wt. TEO Time Pipe |
+---------+----------+----------------------+-------------------+
| 100 | 100 | 95.23 | 87.09 |
+---------+----------+----------------------+-------------------+
| 1000 | 100 | 105.81 | 98.67 |
+---------+----------+----------------------+-------------------+
| 10000 | 100 | 99.33 | 92.73 |
+---------+----------+----------------------+-------------------+
| 100000 | 100 | 98.88 | 101.99 |
+---------+----------+----------------------+-------------------+
| 1000000 | 100 | 100.04 | 100.2 |
+---------+----------+----------------------+-------------------+

Power :Socket: Consistent between 135-140 watts for both TEO and Wt. TEO
Pipe: Consistent between 125-130 watts for both TEO and Wt. TEO

Pratik Rajesh Sampat (1):
Weighted approach to gather and use history in TEO governor

drivers/cpuidle/governors/teo.c | 96 +++++++++++++++++++++++++++++++--
1 file changed, 91 insertions(+), 5 deletions(-)

--
2.17.1

Next message: Wei Hu: "RE: [PATCH v3 2/2] PCI: hv: Retry PCI bus D0 entry when the first attempt failed with invalid device state"
Previous message: Luis Chamberlain: "Re: [PATCH 00/15] net: taint when the device driver firmware crashes"
Next in thread: Pratik Rajesh Sampat: "[RFC 1/1] Weighted approach to gather and use history in TEO governor"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]