Issue: measure /proc/loadavg is inaccurate and missing proper documentation
From: Rolf Stenholm
Date: Wed Oct 23 2024 - 05:13:38 EST
Issue : critical measure /proc/loadavg is inaccurate and
missing proper documentation
Kernel Files : kernel/sched/loadavg.c, include/linux/sched/loadavg.h
Version : Any kernel up to October 2024; the issue is present in
           6.12-rc4 from https://github.com/torvalds/linux and has
           existed since the first Linux kernels in the 1990s.
Reproduce : Sample /proc/loadavg while generating load on the machine
           (run "openssl speed -multi $(grep -ci processor
           /proc/cpuinfo)" to create 100% load and
           "watch -n 1 'cat /proc/loadavg >> /tmp/loadavg.txt'"
           to collect samples into a 'CSV' file for statistics),
           or use a math package to examine the quality of the
           algorithm (or lack thereof).
Description : The measure /proc/loadavg is critical for administrators
           everywhere, yet both its documentation and its output are
           inaccurate. This is a bug report aiming at least to improve
           the documentation and, in the future, the accuracy of the
           /proc/loadavg values. You cannot understand the measure
           unless you actually read the kernel source code, and even
           then only if you already know the types of methods used for
           similar problems. Comments like 'Its a silly number but
           people think its important.' are not helpful: loadavg
           matters to admins everywhere who need to know whether the
           system is overloaded.
Load avg in /proc/loadavg is thoroughly discussed online, but often
without understanding of the measure. A good article about loadavg,
including history and background, is
https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
The article misses some critical ideas behind the calculation, but it
shows accuracy issues with the current loadavg and that the
documentation of loadavg is lacking. The kernel code shows no
understanding of the original reason for the measurement, so the
following text tries to fill that gap: what the measure is and why it
was created.
The purpose of the loadavg calculation, which dates back to the 70s,
can be summed up in the following sentence:
Approximate the average CPU load over 1 min, 5 min, 15 min.
(For Linux this is only an approximate average load over 1 min, 5 min
and 15 min, and Linux additionally counts some idle processes in
loadavg that are not CPU related.)
The old 70s TENEX system (see the link above) needed a cheap way to
calculate a load average without a large computation and memory
footprint. The solution was to approximate the normal average load
(T1 + T2 + ... + TN)/N with the difference equation
ApproxAvg[N+1] = ApproxAvg[N] * C1 + Load[N] * C2
The idea is that with the right constants C1 and C2 this roughly
tracks the load average over time. It should be noted, however, that
when a system goes from no load to full load, the 1-minute measure
only reaches about 0.62 after 60 seconds and passes 0.9 only after
about 150 seconds, making the measure quite inaccurate.
There is a dangerous belief that the measure is exponential, and
indeed there is an exponential constant in the code, but that does
not make the difference equation itself exponential. The TENEX code
used C1 = EXP(-T/C), C2 = 1 - C1, where T is 5 seconds and C is 60,
300 or 900, stored as constants in the original code.
Why C1 = (C-T)/C was not used is not documented in the 70s TENEX
code, but the Taylor series of the TENEX constant is
exp(-T/C) = 1 - T/C + T^2/(2*C^2) - ..., which is roughly the same
as C1 = (C-T)/C.
Because loadavg is an approximation of the actual mean, it is
possible to create a simple least-squares error measure for loadavg
and use it to quickly evaluate alternative methods. The measure is
the sum of squared errors
E = (ApproxAvg[1]-Avg[1])^2 + (ApproxAvg[2]-Avg[2])^2 + ...
    + (ApproxAvg[N]-Avg[N])^2
With this measure we can compare some alternative calculations using
100 measuring points, where the load goes from 0 to 1 instantly and
only the points where the approximation is still converging toward
the actual average are included.
Examples:
C1=99/100, C2=1/100 , E = 12.4
C1=69/70 , C2=1/70 , E = 6.98
C1=exp(-1/100), C2=1-exp(-1/100) , E = 12.57
C1=69/70 , C2=1/70 avg over last 10 values , E = 5.23
This shows that the constants currently used in the calculation are
not optimal and can be improved. Using additional data points
improves the measure further, but only slowly. If memory consumption
is no concern, all load samples for a given timeframe could be stored
in an integer queue together with a running sum; updating the sum on
enqueue and dequeue allows constant-time calculation of the actual
average.
It can be noted about the current code in loadavg.c that:
- The function fixed_power_int calculates an exponential even though
  the metric does not benefit from this
- There is a distributed summation over CPUs, which is generally not
  required for a difference equation
- The code comments are "funny" but not very helpful for
  understanding the metric, and this metric is important
And about loadavg.h that:
- EXP_1, EXP_5 and EXP_15 all look like copies of the 70s TENEX
  constants; the values could be improved
Because loadavg is a critical measure for many admins and is not well
understood, it would be good to document what loadavg is trying to
calculate and how it can be improved in later kernel versions.
Improving loadavg could save quite a bit of resources for certain
admins and data centers.
The exact solution may depend on how easy the alternatives are to
implement, but allowing loadavg to be sampled from another loaded
kernel module, or read from a user-space application that replaces
the current values, could help admins everywhere to get a better,
customized measurement point. For instance, the kernel could read
from a mounted device file like /dev/myloadavg and use that as the
loadavg measure, based on hardware measurements or cleverer
algorithms.