Stats and Algorithms and the Kernel...need suggestions for a Resource Monitor

Russell Leighton (russ@taylor.digex.net)
Sun, 01 Nov 1998 22:05:19 -0500


I am developing resource monitoring software for Linux.
The short version is that this is an extensible
framework for monitoring the system at runtime.
For more detailed info see the bottom of this message.
I have a specific request for kernel developers
and hardcore sysadmins: which resources do you
think would be most useful to monitor, on either
development or production systems?

One (of many) of the uses I foresee for this software
is to make it *really easy* to calculate stats for
various runtime parameters. This information can be
used to evaluate algorithms and/or tune the system.
The key idea here is that it is easy to run and easy
to extend. I know full well the people reading this
list can write software to compute stats for the
things they care about, but I want to make it *really
easy*. Note: Initially, I am targeting stats over long periods
of time (e.g., sampling intervals of > 1/2 second).

I hope this software will be able to help resolve
some issues I have seen discussed on this list that
take the following form:

The [fill in your favorite algorithm here]
is good because it uses the theoretical
usage statistics as its assumptions.

Nay! some say, [fill in your favorite algorithm here]
may be "optimal" theoretically, BUT this is
the real world and the assumptions used to
derive the theoretical usage statistics do
not reflect the real world (or [fill in your
special case here] situation).

The idea is that there is an easy-to-configure daemon
(config files first, GNOME/GTK app later) where you
name the parameters you want to sample and how often,
then leave it running for long periods of
time. For all parameters we can optionally:
1. log values to a file for later analysis
2. compute: min, max, mean and a histogram
                                 ^^^^^^^^^
3. define any action you want...
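
A rough sketch of what option 2 could look like for a single
parameter. This is not the actual code from the package: the
names (struct stats, stats_init, stats_add, stats_mean), the
fixed bin count NBINS, and the caller-supplied range are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NBINS 8  /* hypothetical fixed number of histogram bins */

/* Running statistics for one sampled parameter (hypothetical layout). */
struct stats {
    double lo, hi;             /* histogram covers [lo, hi) */
    double min, max, sum;
    unsigned long count;
    unsigned long bins[NBINS];
    unsigned long under, over; /* samples that fell outside [lo, hi) */
};

void stats_init(struct stats *s, double lo, double hi)
{
    assert(s != NULL && hi > lo);
    s->lo = lo;
    s->hi = hi;
    s->min = s->max = s->sum = 0.0;
    s->count = s->under = s->over = 0;
    for (size_t i = 0; i < NBINS; i++)
        s->bins[i] = 0;
}

/* Fold one sample into min, max, sum and the histogram. */
void stats_add(struct stats *s, double x)
{
    if (s->count == 0 || x < s->min) s->min = x;
    if (s->count == 0 || x > s->max) s->max = x;
    s->sum += x;
    s->count++;

    if (x < s->lo) {
        s->under++;
    } else if (x >= s->hi) {
        s->over++;
    } else {
        size_t i = (size_t)((x - s->lo) / (s->hi - s->lo) * NBINS);
        if (i >= NBINS)        /* guard against rounding at the top edge */
            i = NBINS - 1;
        s->bins[i]++;
    }
}

/* Mean of all samples seen so far, or 0.0 if there are none. */
double stats_mean(const struct stats *s)
{
    return s->count ? s->sum / (double)s->count : 0.0;
}
```

Feeding it samples clustered around two values shows why the
histogram matters: the mean lands between the clusters, where
almost no samples fall, while the bins show the two peaks.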

It's the histogram that is the powerful concept. This
can be used to tell if your stats really reflect
the designer's assumptions (many times people assume
unimodal but *really* it's bimodal!). It is also
a good visual guide for spotting trends.

So the questions:

What are the really *interesting* values to
monitor when running Linux?

What combinations of these would be useful
to measure at the same time? (E.g., on a heavily
used web server we might want to monitor
memory usage, load, http transaction time...
then record histograms of the distributions
of these parameters every 15 min...might be
interesting to compare heavily loaded to
lightly loaded times...)

I would like to deliver the first version of this
software with a daemon set up to monitor many of
the suggested parameters.

Any help with making this useful to the community
would be very welcome.

Thanks.

Russ

------------------ More on the Software -------------------

The software is conceptually simple. The core library
implements a "Monitor" which has 2 threads:

Probe Thread: executes "Probes" which sample
system resources periodically and send events
to the Handler Thread.

Handler Thread: Each declared "Handler" registers
interests for certain events which it will receive
when any "Probe" generates such events.

I have tried to code this very tightly so that we impose
minimal disturbance on the system. For example, the
dispatching of events to handlers is very efficient
and should scale to very large numbers of "Handlers".
All C, no C++...gotta be lean and mean.

Applications link to the core library, declare "Probes"
and "Handlers", create a "Monitor", then start the
"Monitor" (which can be paused/restarted/stopped).
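
To make the Probe/Handler/Monitor relationship concrete, here
is a simplified, single-threaded sketch. It is not the library's
real API: every name in it (struct monitor, monitor_poll, the
EV_* bits, fixed_load_probe, count_handler) is hypothetical, the
two threads are collapsed into one polling pass, and handlers are
matched by a plain linear scan rather than anything tuned for
very large numbers of handlers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event types, one bit each so interests can be OR-ed. */
enum { EV_LOAD = 1u << 0, EV_MEM = 1u << 1, EV_USERS = 1u << 2 };

struct event { unsigned type; double value; };

/* A probe fills in *out and returns 1 if it produced an event. */
typedef int  (*probe_fn)(struct event *out);
typedef void (*handler_fn)(const struct event *ev, void *ctx);

struct handler { unsigned interests; handler_fn fn; void *ctx; };

#define MAX_PROBES   8
#define MAX_HANDLERS 8

struct monitor {
    probe_fn probes[MAX_PROBES];
    size_t nprobes;
    struct handler handlers[MAX_HANDLERS];
    size_t nhandlers;
};

void monitor_init(struct monitor *m)
{
    assert(m != NULL);
    m->nprobes = 0;
    m->nhandlers = 0;
}

int monitor_add_probe(struct monitor *m, probe_fn p)
{
    if (m->nprobes == MAX_PROBES) return -1;
    m->probes[m->nprobes++] = p;
    return 0;
}

/* Register a handler together with the event types it wants. */
int monitor_add_handler(struct monitor *m, unsigned interests,
                        handler_fn fn, void *ctx)
{
    if (m->nhandlers == MAX_HANDLERS) return -1;
    struct handler h = { interests, fn, ctx };
    m->handlers[m->nhandlers++] = h;
    return 0;
}

/* One sampling pass: run every probe and deliver each event to the
 * handlers interested in its type. Returns the number of deliveries. */
size_t monitor_poll(struct monitor *m)
{
    size_t delivered = 0;
    for (size_t i = 0; i < m->nprobes; i++) {
        struct event ev;
        if (!m->probes[i](&ev))
            continue;
        for (size_t j = 0; j < m->nhandlers; j++) {
            if (m->handlers[j].interests & ev.type) {
                m->handlers[j].fn(&ev, m->handlers[j].ctx);
                delivered++;
            }
        }
    }
    return delivered;
}

/* Stand-in probe that always reports the same made-up load value. */
int fixed_load_probe(struct event *out)
{
    out->type = EV_LOAD;
    out->value = 0.5;
    return 1;
}

/* Stand-in handler that counts the events it receives in *ctx. */
void count_handler(const struct event *ev, void *ctx)
{
    (void)ev;
    ++*(int *)ctx;
}
```

An application would call monitor_init, register its probes and
handlers, then call monitor_poll once per sampling interval.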

I have created a few "Probes" that will come
with the distribution:
- number of users
- free memory
- load average
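
As an illustration of how small a probe can be, here is a sketch
of sampling the load average on Linux by reading /proc/loadavg,
whose first three fields are the 1, 5 and 15 minute averages.
The function names parse_loadavg and probe_loadavg are made up
for this example and are not taken from the package.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the first three fields of a /proc/loadavg-style line.
 * Returns 0 on success, -1 if the line does not start with
 * three numbers. */
int parse_loadavg(const char *line, double avg[3])
{
    assert(line != NULL && avg != NULL);
    return sscanf(line, "%lf %lf %lf",
                  &avg[0], &avg[1], &avg[2]) == 3 ? 0 : -1;
}

/* Sample the current load averages from /proc/loadavg.
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
int probe_loadavg(double avg[3])
{
    char buf[128];
    FILE *f = fopen("/proc/loadavg", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, sizeof buf, f);
    fclose(f);
    return ok != NULL ? parse_loadavg(buf, avg) : -1;
}
```

Keeping the parsing separate from the file read makes the probe
easy to check against a canned line without touching /proc.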

Handlers I currently have are:
- dump to ascii log
- compute stats (see above), which
can be dumped to file periodically

The first release will have a daemon that
should be pretty easy to configure to monitor
the system.

The software package is called "Aware" and will
be released in 6-8 weeks...depends on
how busy I am...Open Source license
of some sort...

The feature list for the project is long
(e.g., "alarm" handlers, networked comms
between "monitors" so whole groups of
systems can be centrally monitored, feedback
control systems s.t. the system can
control/tune itself, nice visualization
front end (GNOME/Gtk/Guppie)....).
Anyone interested in helping, drop me a note.

-- 
----------------------------------------------------------------
Russell Leighton  russ@taylor.digex.net  http://taylor.digex.net
      Less is More....Speed is Everything....Adapt or Die
----------------------------------------------------------------
