I have become accustomed to the new way of calculating load averages,
and normally see a load of around 0.2-0.5 on my machine when it's
idling.
However, a couple of days ago I noticed that the load was always over
1.2, never dropping below 1.0. At the time I put it down to real load
on the system, since quite a bit of activity was going on.
Later, when no-one other than myself was logged on, I still had this
persistently latched-up load of 1.2 or more. This was causing jobs I'd
submitted via batch(1) not to be executed, so I began to investigate.
(At this point, I might add that the reason I was still running 1.3.77
was that I had booted 1.3.77 when it was released, and had not had to
reboot until this incident a couple of days ago. If it's relevant,
/proc/uptime's second value (idle time) was always 0 for this
version.)
top(1) showed no processes chewing CPU (in fact it claimed 80% idle),
and "ps -aux" showed that all processes were either running ('R') or
sleeping ('S'), i.e. there were no processes blocked in uninterruptible
sleep ('D') or stopped ('T'). Three processes were zombied, but they are
a (normal) product of my .xsession, and had not affected load in the
past.
Seeing no processes causing the load, I concluded it must be interrupt
load on the kernel. tcpdump showed a light trickle of packets being
handled, nothing vast. The counters in /proc/interrupts weren't
increasing unusually fast. So, it didn't seem to be the kernel either.
No messages appeared in any logs.
I wrote a little program to read and write arbitrary pieces of kernel
memory, and used it to write the value 0 to the avenrun[0] location.
This immediately took the load down to 0, of course. But it crept
back up to 1.2: for some reason the kernel definitely thought there
was some cause for this load.
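
For anyone wanting to repeat the experiment, here is a minimal sketch of
such a peek/poke helper. It is not the program described above: it
assumes kernel memory is reachable through a seekable device such as
/dev/kmem, and that the address of avenrun has been looked up separately
(for example in System.map). The names mem_peek and mem_poke are
hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read len bytes at byte offset addr from the
 * file or device at path (assumed to be something like /dev/kmem).
 * Returns 0 on success, -1 on any error or short read. */
int mem_peek(const char *path, off_t addr, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = lseek(fd, addr, SEEK_SET) == addr &&
             read(fd, buf, len) == (ssize_t)len;
    close(fd);
    return ok ? 0 : -1;
}

/* Hypothetical helper: write len bytes from buf at byte offset addr.
 * Returns 0 on success, -1 on any error or short write. */
int mem_poke(const char *path, off_t addr, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = lseek(fd, addr, SEEK_SET) == addr &&
             write(fd, buf, len) == (ssize_t)len;
    close(fd);
    return ok ? 0 : -1;
}
```

Zeroing avenrun[0] would then be a mem_poke() of one zeroed unsigned
long at the avenrun address, run as root. Writing to a live kernel this
way is obviously at your own risk.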
The next step was clearly to walk the process table. Doing this showed
that all processes were in the -1 state (sleeping), other than the
swapper (pid 0) and the process reading kernel memory. I walked the
process table in 2 different ways:
1) starting at task[0] and following next_task pointers
2) starting at task[511] and going backwards through the table of
pointers to struct task_structs, viewing each non-NULL entry.
Both came to the same conclusion: no processes in any unusual states.
I then re-implemented the load calculation code in user-space, using
method (2) to walk the process table, and a sleep(5) to get
approximately the right frequency of calculation. The results were in
the region 0.2 to 0.8 or so, much more in keeping with the actual
loading of the system.
So, I have 2 questions, really:
a) has anyone else seen this load latch-up phenomenon?
b) the calc_load() code is called from do_timer(). Could it be that
a race in the timer interrupt handling code causes a glitch,
   whereby an extra process is marked as running when calc_load() is
   called?
Ideas, people?
Austin