Re: [Lse-tech] NUMA scheduling

From: Davide Libenzi (davidel@xmailserver.org)
Date: Mon Feb 25 2002 - 15:18:43 EST


On Mon, 25 Feb 2002, Larry McVoy wrote:

> On Mon, Feb 25, 2002 at 02:49:40PM -0500, Bill Davidsen wrote:
> > On Mon, 25 Feb 2002, Larry McVoy wrote:
> >
> > > If you read the early hardware papers on SMP, they all claim "Symmetric
> > > Multi Processor", i.e., you can run any process on any CPU. Skip forward
> > > three years and read the cache affinity papers from the same hardware
> > > people. You have to step back and squint, but what you'll see is that
> > > these papers could be summarized in one sentence:
> > >
> > > "Oops, we lied, it's not really symmetric at all"
> > >
> > > You should treat each CPU as a mini-system, think of rescheduling a
> > > process someplace else as a checkpoint/restart, and assume that is
> > > heavyweight. In fact, I'd love to see the scheduler code forcibly sleep
> > > the process for 500 milliseconds each time it lands on a different CPU.
> > > Tune the system to work well with that, then take out the sleep, and
> > > you'll have the right answer.
> >
> > Unfortunately, this is an overly simple view of how SMP works. The only
> > justification for CPU affinity is to preserve cache contents. Trying to
> > express that benefit as a single number is bound to produce suboptimal
> > results.
>
> And here is the other side of the coin. Remember what we are doing:
> we're in the middle of a context switch, trying to figure out where we
> should run this process. We would like context switches to be fast.
> Any work we do here is directly at odds with that goal. SGI took the
> approach your statements would imply, i.e., approximate the cache
> footprint, figure out whether it was big or small, and use that to
> decide where to land the process. This has two fatal flaws:
> a) Because there is no generic hardware interface to say "how many cache
> lines are mine", you approximate that by looking at how much of its
> timeslice the process used; if it used a lot, you guess it filled the
> cache. This doesn't work at all for I/O-bound processes, which
> typically run in short bursts. So IRIX would bounce these around for
> no good reason, resulting in crappy I/O perf. I got about another 20%
> in BDS by locking down the processes (BDS delivered 3.2GBytes/sec of
> NFS traffic, sustained, in 1996).
> b) All of the "thinking" you do to figure out where to land the process
> contributes directly to the cost of the context switch. Linux has
> nice light context switches; let's keep it that way.
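
Just to make (a) concrete for the list: the guess Larry describes boils
down to something like the fragment below. This is only my sketch, in C,
with made-up names -- it is not IRIX code:

/* Hypothetical cache-warmth guess, per Larry's point (a). */
#define TIMESLICE_TICKS 10              /* full timeslice, in ticks */

struct task {
        int last_cpu;                   /* CPU the task last ran on */
        int ticks_used;                 /* ticks burned last timeslice */
};

/*
 * If the task burned most of its timeslice, guess that it filled the
 * cache and keep it on last_cpu; otherwise let it migrate.
 */
static int pick_cpu(struct task *t, int idle_cpu)
{
        if (t->ticks_used > TIMESLICE_TICKS / 2)
                return t->last_cpu;     /* assume warm cache */
        return idle_cpu;                /* assume cold, migrate freely */
}

An I/O-bound task runs in short bursts, so ticks_used stays low, the
test never fires, and the heuristic migrates exactly the tasks IRIX was
bouncing around.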

Obviously, Larry, you do not want to do such work on an active CPU. Idle
time is balancing time ...
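
What I mean, concretely: let the idle loop pay for the balancing, not
schedule(). A sketch only -- every helper below is a hypothetical
stand-in, not a patch against the current scheduler:

/* Assumed primitives -- stand-ins, not real kernel interfaces. */
int  find_busiest_cpu(int this_cpu);    /* scan the run queues */
int  runqueue_empty(int cpu);
void pull_one_task(int from_cpu, int to_cpu);
void schedule(void);
void halt_until_interrupt(void);

static void idle_balance(int this_cpu)
{
        int busiest = find_busiest_cpu(this_cpu);

        if (busiest < 0)
                return;                 /* nothing worth stealing */

        /*
         * Pull a task whose cache is likely cold anyway, so the
         * migration costs nothing at context-switch time.
         */
        pull_one_task(busiest, this_cpu);
}

void cpu_idle_loop(int this_cpu)
{
        for (;;) {
                while (!runqueue_empty(this_cpu))
                        schedule();
                idle_balance(this_cpu); /* balance only while idle */
                halt_until_interrupt();
        }
}

This keeps schedule() itself dumb and fast, which is exactly what you
want in (b).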

- Davide
