Re: [Lse-tech] [rebalance at: do_fork() vs. do_execve()] NUMA scheduling

From: Andy Pfiffer (andyp@osdl.org)
Date: Mon Feb 25 2002 - 18:35:11 EST


> > Would be interesting to hear opinions on initial balancing. What are the
> > pros and cons of balancing at do_fork() or do_execve()? And it would be
> > interesting to learn about other approaches, too...

I worked on a system several years ago that supported single-system
image and shared no memory between nodes (NORMA == NO Remote Memory
Access), but it did have a very high-performance, low-latency
interconnect (hundreds of megabytes/sec, a few tens of ns latency for
communication startup).

The ratios between CPU clock rate / local memory / off-board memory
were (at a gross level) similar to today's GHz CPUs with on-chip L1,
off-die L2, local DRAM, and wires + state machines + DRAM on some other
node.

There was initially much debate about whether to balance load at fork
time or at exec time (static placement), followed by when and how often
to rebalance already-running processes (dynamic rebalancing).

We eventually chose to statically balance at exec time, using a
possibly stale metric, because we wouldn't have to spend time creating
an address space remotely (parent on node A, child on node B), only to
have it torn down a few clocks later by a subsequent exec. (Our
workload consisted almost entirely of fork+exec rather than
fork+fork+fork... ).

The analogy here is that committing modified data-cache lines owned by
CPU A and reheating the cache with them on CPU B, only to throw them
away on an exec a few clocks later, may be similar to the "rfork() vs.
rexec()" choice we faced.

If you rebalance at do_execve(), you are starting with an empty working
set and a cold cache anyway.

To balance at exec time, we used a separate process that would
periodically query (via a spanning tree) the nodes within the
"interactive partition" for their current load average, compute which
node was the least loaded (a heuristic that combined several factors:
CPU utilization, number of processes, memory utilization, whether there
was a shell over there, etc.), and then update the nodes (again via a
spanning tree) with the current global opinion as to the least loaded
node.
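
For what it's worth, the scoring step amounted to something like the
sketch below. This is user-space C written for illustration only: the
stats structure, the weights, and the "shell bonus" are all made up
here, not the code we actually ran.

/*
 * Illustrative "least loaded node" heuristic.  The fields, weights,
 * and sample numbers are invented for this sketch.
 */
#include <stdio.h>

struct node_stats {
    int    id;
    double cpu_util;    /* 0.0 .. 1.0 */
    int    nr_procs;
    double mem_util;    /* 0.0 .. 1.0 */
    int    has_shell;   /* interactive shell running there? */
};

/* Lower score == less loaded.  Weights are arbitrary. */
static double load_score(const struct node_stats *n)
{
    double score = 0.0;

    score += 4.0 * n->cpu_util;
    score += 0.1 * n->nr_procs;
    score += 2.0 * n->mem_util;
    if (n->has_shell)
        score += 0.5;   /* mildly avoid interactive nodes */
    return score;
}

/* Pick the lowest-scoring node from a (possibly stale) snapshot. */
static int least_loaded(const struct node_stats *nodes, int count)
{
    int i, best = 0;

    for (i = 1; i < count; i++)
        if (load_score(&nodes[i]) < load_score(&nodes[best]))
            best = i;
    return nodes[best].id;
}

int main(void)
{
    struct node_stats snapshot[] = {
        { 0, 0.90, 80, 0.70, 1 },
        { 1, 0.20, 15, 0.30, 0 },
        { 2, 0.55, 40, 0.60, 0 },
    };

    printf("least loaded node: %d\n", least_loaded(snapshot, 3));
    return 0;
}

The important property is that it works from a snapshot that may be
stale; freshness mattered less than not blocking the exec path on a
global query.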

In the context of this discussion, computing the loading metric at
regular intervals *when otherwise idle* would appear to be similar to
the approach we used.

The static inter-node load leveling worked pretty well in practice
(although some said it wasn't much better than pure round-robin across
the nodes), and it was non-fatal if the initial node selection turned
out to be a poor choice.

The main problem was minimizing the storm of inbound rexec()'s when a
sudden burst of activity (say, a make -j) on multiple nodes at once
caused N-1 nodes to throw *everything* at the least loaded node. I
don't think this would be a problem in this case, because there is a
single entity making a single choice, not multiple entities making
multiple choices in isolation.
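
To make that concrete, here is a toy user-space sketch (the load
numbers and the one-task-equals-one-unit charge are invented) of why
N-1 nodes deciding in isolation from the same stale snapshot pile
everything onto one target, while a single decision-maker that charges
each placement against its own view spreads the burst out:

/* Toy illustration of the rexec() storm; all numbers are made up. */
#include <stdio.h>

#define NODES   4
#define BURST   8   /* rexec()s arriving "at once" */

static int argmin(const double *load, int n)
{
    int i, best = 0;

    for (i = 1; i < n; i++)
        if (load[i] < load[best])
            best = i;
    return best;
}

int main(void)
{
    double stale[NODES] = { 3.0, 0.5, 2.0, 2.5 };
    double live[NODES]  = { 3.0, 0.5, 2.0, 2.5 };
    int i;

    /* N-1 nodes deciding in isolation: everything hits node 1. */
    printf("independent choices: ");
    for (i = 0; i < BURST; i++)
        printf("%d ", argmin(stale, NODES));
    printf("\n");

    /* One arbiter, one choice at a time, charging each placement. */
    printf("single arbiter:      ");
    for (i = 0; i < BURST; i++) {
        int target = argmin(live, NODES);

        live[target] += 1.0;    /* assume one task ~ one unit of load */
        printf("%d ", target);
    }
    printf("\n");
    return 0;
}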

Dynamic load leveling (moving an entire process from one node to
another) was always problematic for highly interactive workloads, and
it raised a rash of complexity issues well off topic for this
discussion, but it worked well for long-running, CPU-bound tasks.

Andy
