First off, just a reminder that this is inherently a netlink flow control issue...which was being exacerbated

On Thu, 29 Jun 2006 09:44:08 -0700
Paul Jackson <pj@xxxxxxx> wrote:
You're probably correct on that model. However, it all depends on the actual
workload. Are people who actually have large-CPU (>256) systems actually
running fork()-heavy things like webservers on them, or are they running things
like database servers and computations, which tend to have persistent
processes?

It may well be mostly as you say - the large-CPU systems not running
the fork()-heavy jobs.
Sooner or later, someone will want to run a fork()-heavy job on a
large-CPU system. On a 1024 CPU system, it would apparently take
just 14 exits/sec/CPU to hit this bottleneck, if Jay's number of
14000 applied.
Chris Sturdivant's reply is reasonable -- we'll hit it sooner or later,
and deal with it then.
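
As a rough check of the 14 exits/sec/CPU figure above: it is just the overall limit divided by the CPU count. This is only a sketch, and it assumes Jay's 14000 is a system-wide ceiling in exits per second; the function name is made up for illustration.

```python
def per_cpu_exit_rate(total_exits_per_sec: float, num_cpus: int) -> float:
    """Exit rate each CPU must sustain to reach the system-wide total."""
    if num_cpus <= 0:
        raise ValueError("num_cpus must be positive")
    return total_exits_per_sec / num_cpus


# Assumption: 14000 exits/sec is the system-wide bottleneck (Jay's number).
rate = per_cpu_exit_rate(14000, 1024)
print(f"{rate:.2f} exits/sec/CPU")  # prints "13.67 exits/sec/CPU"
```

So on 1024 CPUs, a little under 14 exits per second on each CPU would already add up to the bottleneck rate.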
I agree, and I'm viewing this as blocking the taskstats merge. Because if
this _is_ a problem then it's a big one because fixing it will be
intrusive, and might well involve userspace-visible changes.
One of the unused features of genetlink that's meant for high volume data output from the kernel is

The only ways I can see of fixing the problem generally are to either
a) throw more CPU(s) at stats collection: allow userspace to register for
"stats generated by CPU N", then run a stats collection daemon on each
CPU or
b) make the kernel recognise when it's getting overloaded and switch to
some degraded mode where it stops trying to send all the data to
userspace - just send a summary, or a "we goofed" message or something.
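
To make option (b) a bit more concrete, here is a toy userspace model of the "degraded mode" idea. It is not kernel code and not the taskstats or netlink API; the class, the per-window budget, and the message tags are all hypothetical. Full per-task records are sent until a budget is used up, after which exits are only counted and a single summary goes out at the end of the window.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExitStatsSender:
    """Toy model of option (b); hypothetical, not the taskstats interface."""
    budget_per_window: int  # assumed cap on full records per time window
    sent: List[Tuple[str, int]] = field(default_factory=list)
    used: int = 0      # full records sent in the current window
    dropped: int = 0   # exits that were only counted, not reported in full

    def on_task_exit(self, pid: int) -> None:
        if self.used < self.budget_per_window:
            # Normal mode: send the full per-task record.
            self.used += 1
            self.sent.append(("stats", pid))
        else:
            # Degraded mode: stop sending, just remember how many were skipped.
            self.dropped += 1

    def end_window(self) -> None:
        # Send one summary ("we goofed") message if anything was skipped.
        if self.dropped:
            self.sent.append(("overflow", self.dropped))
        self.used = 0
        self.dropped = 0


sender = ExitStatsSender(budget_per_window=3)
for pid in range(100, 105):
    sender.on_task_exit(pid)
sender.end_window()
print(sender.sent)
# [('stats', 100), ('stats', 101), ('stats', 102), ('overflow', 2)]
```

The point of the summary record is that userspace at least learns that data was lost and how much, instead of silently missing exits.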