Re: latest linus-2.5 BK broken

From: Larry McVoy (lm@bitmover.com)
Date: Thu Jun 20 2002 - 00:24:44 EST


> I totally agree; mostly I was playing devil's advocate. The model
> actually in my head is one where you have multiple kernels, but they talk
> well enough that the applications only have to care in areas where it
> doesn't make a performance difference (there's got to be one of those).

....

> The compute cluster problem is an interesting one. The big items
> I see on the todo list are:
>
> - Scalable fast distributed file system (Lustre looks like a
> possibility)
> - Sub application level checkpointing.
>
> Services like schedulers already exist.
>
> Basically, the job of a cluster scheduler gets much easier, and the
> scheduler more powerful, once it gets the ability to suspend jobs.
> Checkpointing buys three things: the ability to preempt jobs, the
> ability to migrate processes, and the ability to recover from failed
> nodes (assuming the failed hardware didn't corrupt your job's
> checkpoint).
>
> Once solutions to the cluster problems become well understood I
> wouldn't be surprised if some of the supporting services started to
> live in the kernel like nfsd. Parts of the distributed filesystem
> certainly will.

http://www.bitmover.com/cc-pitch

I've been trying to get Linus to listen to this for years and he keeps
on flogging the tired SMP horse instead. DEC did it and Sun has been
passing around these slides for a few weeks, so maybe they'll do it too.
Then Linux can join the party after it has become a fine-grained,
locked-to-hell-and-back, soft-"realtime", NUMA-enabled, bloated piece
of crap like all the other kernels, and we'll get to go through the
"let's reinvent Unix for the 3rd time in 40 years" exercise all over
again. What fun. Not.

Sorry to be grumpy. Go read the slides; I'll be at OLS, and I'd be happy
to talk it over with anyone who wants to think about it. Paul McKenney
from IBM came down to San Francisco to talk to me about it, put me
through an 8 or 9 hour session which felt like a PhD exam, and,
after trying to poke holes in it, grudgingly let on that maybe it was
a good idea. He was kind enough to write up what he took away
from it; here it is.

--lm

From: "Paul McKenney" <Paul.McKenney@us.ibm.com>
To: lm@bitmover.com, tytso@mit.edu
Subject: Greatly enjoyed our discussion yesterday!
Date: Fri, 9 Nov 2001 18:48:56 -0800

Hello!

I greatly enjoyed our discussion yesterday! Here are the pieces of it that
I recall; I know that you will not be shy about correcting any errors and
omissions.

                         Thanx, Paul

Larry McVoy's SMP Clusters

Discussion on November 8, 2001

Larry McVoy, Ted T'so, and Paul McKenney

What is SMP Clusters?

     SMP Clusters is a method of partitioning an SMP (symmetric
     multiprocessing) machine's CPUs, memory, and I/O devices
     so that multiple "OSlets" run on this machine. Each OSlet
     owns and controls its partition. A given partition is
     expected to contain 4 to 8 CPUs, its share of memory,
     and its share of I/O devices. A machine large enough to
     have SMP Clusters profitably applied is expected to have
     enough of the standard I/O adapters (e.g., ethernet,
     SCSI, FC, etc.) so that each OSlet would have at least
     one of each.

     Each OSlet has the same data structures that an isolated
     OS would have for the same amount of resources. Unless
     interactions with other OSlets are required, an OSlet runs
     very nearly the same code over very nearly the same data
     as would a standalone OS.

     Although each OSlet is in most ways its own machine, the
     full set of OSlets appears as one OS to any user programs
     running on any of the OSlets. In particular, processes on
     one OSlet can share memory with processes on other OSlets,
     can send signals to processes on other OSlets, communicate
     via pipes and Unix-domain sockets with processes on other
     OSlets, and so on. Performance of operations spanning
     multiple OSlets may be somewhat slower than operations local
     to a single OSlet, but the difference will not be noticeable
     except to users who are engaged in careful performance
     analysis.

     The goals of the SMP Cluster approach are:

     1. Allow the core kernel code to use simple locking designs.
     2. Present applications with a single-system view.
     3. Maintain good (linear!) scalability.
     4. Not degrade the performance of a single CPU beyond that
          of a standalone OS running on the same resources.
     5. Minimize modification of core kernel code. Modified or
          rewritten device drivers, filesystems, and
          architecture-specific code are permitted, perhaps even
          encouraged. ;-)
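
     A minimal sketch of the kind of per-OSlet resource description
     implied above, in C; every name here is invented purely for
     illustration, and this is not an existing kernel structure:

          /* Hypothetical per-OSlet resource descriptor -- a sketch. */
          #define MAX_OSLETS 16

          struct oslet_desc {
                  int           id;          /* index of this OSlet       */
                  unsigned long cpu_mask;    /* the 4-8 CPUs it owns      */
                  unsigned long first_pfn;   /* start of its physical mem */
                  unsigned long last_pfn;    /* end of its physical mem   */
                  int           ndevices;    /* I/O adapters it owns      */
                  /* ... whatever else the boot firmware describes ...    */
          };

          /* Filled in from the boot-time partition tables (see below). */
          extern struct oslet_desc oslet_table[MAX_OSLETS];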

OS Boot

     Early-boot code/firmware must partition the machine, and prepare
     tables for each OSlet that describe the resources that each
     OSlet owns. Each OSlet must be made aware of the existence of
     all the other OSlets, and will need some facility to allow
     efficient determination of which OSlet a given resource belongs
     to (for example, to determine which OSlet a given page is owned
     by).
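
     For example, the "which OSlet owns this page" determination could
     be a simple range check against those boot-time tables; a sketch,
     reusing the hypothetical oslet_table above:

          /* Return the OSlet owning page frame 'pfn', or -1 if none.
           * A linear scan is fine: ownership is fixed at boot and the
           * table has one entry per OSlet. */
          static inline int oslet_owning_pfn(unsigned long pfn)
          {
                  int i;

                  for (i = 0; i < MAX_OSLETS; i++)
                          if (pfn >= oslet_table[i].first_pfn &&
                              pfn <= oslet_table[i].last_pfn)
                                  return i;
                  return -1;
          }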

     At some point in the boot sequence, each OSlet creates a "proxy
     task" for each of the other OSlets that provides shared services
     to them.

     Issues:

     1. Some systems may require device probing to be done
          by a central program, possibly before the OSlets are
          spawned. Systems that react in an unfriendly manner
          to failed probes might be in this class.

     2. Interrupts must be set up very carefully. On some
          systems, the interrupt system may constrain the ways
          in which the system is partitioned.

Shared Operations

     This section describes some possible implementations and issues
     with a number of the shared operations.

     Shared operations include:

     1. Page fault on memory owned by some other OSlet.
     2. Manipulation of processes running on some other OSlet.
     3. Access to devices owned by some other OSlet.
     4. Reception of network packets intended for some other OSlet.
     5. SysV msgq and sema operations on msgq and sema objects
          accessed by processes running on multiple OSlets.
     6. Access to filesystems owned by some other OSlet. The
          /tmp directory gets special mention.
     7. Pipes connecting processes in different OSlets.
     8. Creation of processes that are to run on a different
          OSlet than their parent.
     9. Processing of exit()/wait() pairs involving processes
          running on different OSlets.

     Page Fault

          As noted earlier, each OSlet maintains a proxy process
          for each other OSlet (so that for an SMP Cluster made
          up of N OSlets, there are N*(N-1) proxy processes).

          When a process in OSlet A wishes to map a file
          belonging to OSlet B, it makes a request to B's proxy
          process corresponding to OSlet A. The proxy process
          maps the desired file and takes a page fault at the
          desired address (translated as needed, since the file
          will usually not be mapped to the same location in the
          proxy and client processes), forcing the page into
          OSlet B's memory. The proxy process then passes the
          corresponding physical address back to the client
          process, which maps it.
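
          The client/proxy exchange described above might look roughly
          like the following; every name is made up for illustration,
          and the IPC mechanism is deliberately left abstract:

               /* All of this is illustrative, not proposed API. */
               enum proxy_op { PROXY_PAGEIN };

               struct proxy_request {
                       enum proxy_op op;
                       int           file;       /* owner's file handle */
                       unsigned long offset;
                       unsigned long phys_addr;  /* filled in by owner  */
               };

               /* IPC to the owner OSlet's proxy process dedicated to
                * this OSlet; mechanism unspecified here. */
               int proxy_call(int owner_oslet, struct proxy_request *req);

               /* OSlet A, client side: fault on a file owned by B. */
               static int remote_pagein(int owner_oslet, int remote_file,
                                        unsigned long offset,
                                        unsigned long *phys_out)
               {
                       struct proxy_request req = {
                               .op     = PROXY_PAGEIN,
                               .file   = remote_file,
                               .offset = offset,   /* already translated */
                       };
                       int err;

                       /* Ask B's proxy to touch the page, forcing it
                        * into B's memory. */
                       err = proxy_call(owner_oslet, &req);
                       if (err)
                               return err;

                       /* B replied with the physical address; the
                        * client process then maps it directly. */
                       *phys_out = req.phys_addr;
                       return 0;
               }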

          Issues:

          o How to coordinate pageout? Two approaches:

               1. Use mlock in the proxy process so that
                    only the client process can do the pageout.

               2. Make the two OSlets coordinate their
                    pageouts. This is more complex, but will
                    be required in some form or another to
                    prevent OSlets from "ganging up" on one
                    of their number, exhausting its memory.

          o When OSlet A ejects the memory from its working
               set, where does it put it?

               1. Throw it away, and go to the proxy process
                    as needed to get it back.

               2. Augment core VM as needed to track the
                    "guest" memory. This may be needed for
                    performance, but...

          o Some code is required in the pagein() path to
               figure out that the proxy must be used.

               1. Larry stated that he is willing to be
                    punched in the nose to get this code in. ;-)
                    The amount of this code is minimized by
                    creating SMP-clusters-specific filesystems,
                    which have their own functions for mapping
                    and releasing pages. (Does this really
                    cover OSlet A's paging out of this memory?)

          o How are pagein()s going to be even halfway fast
               if IPC to the proxy is involved?

               1. Just do it. Page faults should not be
                    all that frequent with today's memory
                    sizes. (But then why do we care so
                    much about page-fault performance???)

               2. Use "doors" (from Sun), which are very
                    similar to protected procedure call
                    (from K42/Tornado/Hurricane). The idea
                    is that the CPU in OSlet A that is handling
                    the page fault temporarily -becomes- a
                    member of OSlet B by using OSlet B's page
                     tables for the duration (a sketch of this
                     appears at the end of the Page Fault issues
                     below). This results in some interesting
                     issues:

                    a. What happens if a process wants to
                         block while "doored"? Does it
                         switch back to being an OSlet A
                         process?

                    b. What happens if a process takes an
                         interrupt (which corresponds to
                         OSlet A) while doored (thus using
                         OSlet B's page tables)?

                         i. Prevent this by disabling
                              interrupts while doored.
                              This could pose problems
                              with relatively long VM
                              code paths.

                         ii. Switch back to OSlet A's
                              page tables upon interrupt,
                              and switch back to OSlet B's
                              page tables upon return
                              from interrupt. On machines
                              not supporting ASID, take a
                              TLB-flush hit in both
                              directions. Also likely
                              requires common text (at
                              least for low-level interrupts)
                              for all OSlets, making it more
                              difficult to support OSlets
                              running different versions of
                              the OS.

                              Furthermore, the last time
                              that Paul suggested adding
                              instructions to the interrupt
                              path, several people politely
                              informed him that this would
                              require a nose punching. ;-)

                    c. If a bunch of OSlets simultaneously
                         decide to invoke their proxies on
                         a particular OSlet, that OSlet gets
                         lock contention corresponding to
                         the number of CPUs on the system
                         rather than to the number in a
                         single OSlet. Some approaches to
                         handle this:

                         i. Stripe -everything-, rely
                              on entropy to save you.
                              May still have problems with
                              hotspots (e.g., which of the
                              OSlets has the root of the
                              root filesystem?).

                         ii. Use some sort of queued lock
                               to limit the number of CPUs that
                              can be running proxy processes
                              in a given OSlet. This does
                              not really help scaling, but
                              would make the contention
                              less destructive to the
                              victim OSlet.

          o How to balance memory usage across the OSlets?

               1. Don't bother, let paging deal with it.
                    Paul's previous experience with this
                    philosophy was not encouraging. (You
                    can end up with one OSlet thrashing
                    due to the memory load placed on it by
                    other OSlets, which don't see any
                    memory pressure.)

               2. Use some global memory-pressure scheme
                    to even things out. Seems possible,
                    Paul is concerned about the complexity
                    of this approach. If this approach is
                    taken, make sure someone with some
                    control-theory experience is involved.
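
          The "door" approach from the pagein discussion above, in
          rough sketch form; switch_page_tables(), pagein_local() and
          OSLET_SELF are invented stand-ins for whatever the real
          mechanism would be:

               void switch_page_tables(int oslet);       /* stand-in */
               int pagein_local(int file, unsigned long offset,
                                unsigned long *phys_out);
               #define OSLET_SELF (-1)

               /* A CPU in OSlet A handles the fault by temporarily
                * becoming a member of OSlet B: switch to B's page
                * tables, run B's ordinary pagein path, switch back.
                * Interrupts stay disabled for the duration, i.e.
                * option (i) from the discussion above. */
               static int door_pagein(int owner, int file,
                                      unsigned long offset,
                                      unsigned long *phys_out)
               {
                       unsigned long flags;
                       int err;

                       local_irq_save(flags);
                       switch_page_tables(owner);       /* become B  */
                       err = pagein_local(file, offset, phys_out);
                       switch_page_tables(OSLET_SELF);  /* back to A */
                       local_irq_restore(flags);

                       return err;
               }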

     Manipulation of Processes Running on Some Other OSlet.

          The general idea here is to implement something similar
          to a vproc layer. This is common code, and thus requires
          someone to sacrifice their nose. There was some discussion
          of other things that this would be useful for, but I have
          lost them.

          Manipulations discussed included signals and job control.
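
          A vproc layer is essentially an operations vector between
          the generic code and the process being acted upon, with a
          local and a remote implementation behind it. A minimal
          sketch, with hypothetical names (this is not an existing
          Linux interface):

               struct vproc;           /* forward declarations        */
               struct proc_status;     /* whatever wait()/ps need     */

               /* Generic code calls through these; which flavour is
                * used depends on where the process really lives. */
               struct vproc_ops {
                       int (*send_signal)(struct vproc *vp, int sig);
                       int (*set_pgrp)(struct vproc *vp, int pgrp);
                       int (*get_status)(struct vproc *vp,
                                         struct proc_status *st);
               };

               struct vproc {
                       int                     pid;    /* global pid   */
                       int                     oslet;  /* where it is  */
                       const struct vproc_ops *ops;    /* local/remote */
               };

               static inline int vproc_kill(struct vproc *vp, int sig)
               {
                       return vp->ops->send_signal(vp, sig);
               }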

          Issues:

          o Should process information be replicated across
               the OSlets for performance reasons? If so, how
               much, and how to synchronize.

               1. No, just use doors. See above discussion.

               2. Yes. No discussion of synchronization
                    methods. (Hey, we had to leave -something-
                    for later!)

     Access to Devices Owned by Some Other OSlet

          Larry mentioned a /rdev, but if we discussed any details
          of this, I have lost them. Presumably, one would use some
          sort of IPC or doors to make this work.

     Reception of Network Packets Intended for Some Other OSlet.

          An OSlet receives a packet, and realizes that it is
          destined for a process running in some other OSlet.
          How is this handled without rewriting most of the
          networking stack?

          The general approach was to add a NAT-like layer that
          inspected the packet and determined which OSlet it was
          destined for. The packet was then forwarded to the
          correct OSlet, and subjected to full IP-stack processing.
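
          Roughly, the receiving OSlet looks only far enough into the
          headers to find the owner, much as a NAT box would, and then
          hands the packet over untouched; a sketch of that dispatch
          step, where oslet_for_flow() and oslet_forward_skb() are
          invented names:

               int  oslet_for_flow(struct sk_buff *skb);   /* stand-in */
               void oslet_forward_skb(int owner, struct sk_buff *skb);

               static void oslet_netif_rx(struct sk_buff *skb)
               {
                       /* Hypothetical lookup: destination address and
                        * port -> the OSlet owning that socket. */
                       int owner = oslet_for_flow(skb);

                       if (owner == OSLET_SELF) {
                               netif_rx(skb);   /* normal local stack */
                               return;
                       }

                       /* Copy into the per-(receiver, owner) buffer
                        * discussed in the issues below; the owner then
                        * runs its own full IP stack on the packet. */
                       oslet_forward_skb(owner, skb);
               }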

          Issues:

          o If the address map in the kernel is not to be
               manipulated on each packet reception, there
               needs to be a circular buffer in each OSlet for
               each of the other OSlets (again, N*(N-1) buffers).
               In order to prevent the buffer from needing to
               be exceedingly large, packets must be bcopy()ed
               into this buffer by the OSlet that received
               the packet, and then bcopy()ed out by the OSlet
                containing the target process. This could add
                a fair amount of overhead. (A sketch of such a
                buffer appears after this list of issues.)

               1. Just accept the overhead. Rely on this
                    being an uncommon case (see the next issue).

               2. Come up with some other approach, possibly
                    involving the user address space of the
                    proxy process. We could not articulate
                    such an approach, but it was late and we
                    were tired.

          o If there are two processes that share the FD
               on which the packet could be received, and these
               two processes are in two different OSlets, and
               neither is in the OSlet that received the packet,
               what the heck do you do???

               1. Prevent this from happening by refusing
                    to allow processes holding a TCP connection
                    open to move to another OSlet. This could
                    result in load-balance problems in some
                    workloads, though neither Paul nor Ted were
                    able to come up with a good example on the
                    spot (seeing as BAAN has not been doing really
                    well of late).

                    To indulge in l'esprit d'escalier... How
                    about a timesharing system that users
                    access from the network? A single user
                    would have to log on twice to run a job
                    that consumed more than one OSlet if each
                    process in the job might legitimately need
                    access to stdin.

               2. Do all protocol processing on the OSlet
                    on which the packet was received, and
                    straighten things out when delivering
                    the packet data to the receiving process.
                    This likely requires changes to common
                    code, hence someone to volunteer their nose.
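
          The circular buffer mentioned in the first issue above could
          be as simple as the following; the size and the names are
          invented, and there would be one such ring for each
          (receiver, owner) pair:

               #define OSLET_RING_BYTES (256 * 1024)  /* power of two */

               /* The receiving OSlet bcopy()s packet data in at
                * 'head'; the owning OSlet bcopy()s it out at 'tail'. */
               struct oslet_pkt_ring {
                       volatile unsigned int head;  /* receiver writes */
                       volatile unsigned int tail;  /* owner writes    */
                       unsigned char data[OSLET_RING_BYTES];
               };

               /* Bytes the receiver may still write without
                * overrunning data the owner has not yet copied out. */
               static inline unsigned int
               ring_space(const struct oslet_pkt_ring *r)
               {
                       return OSLET_RING_BYTES - 1 -
                              ((r->head - r->tail) % OSLET_RING_BYTES);
               }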

     SysV msgq and sema Operations

          We didn't discuss these. None of us seem to be SysV fans,
          but these must be made to work regardless.

          Larry says that shm should be implemented in terms of mmap(),
          so that this case reduces to page-mapping discussed above.
          Of course, one would need a filesystem large enough to handle
          the largest possible shmget. Paul supposes that one could
          dynamically create a memory filesystem to avoid problems here,
          but is in no way volunteering his nose to this cause.
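
          The reduction Larry has in mind is roughly the one below: the
          segment is just a file on a memory filesystem, and attaching
          it is an ordinary shared mapping, which the proxy machinery
          already covers. A user-level flavoured sketch (the path and
          function name are illustrative):

               #include <fcntl.h>
               #include <sys/mman.h>
               #include <unistd.h>

               /* A "segment" reduced to a tmpfs file plus mmap(). */
               static void *segment_attach(const char *path, size_t size)
               {
                       void *addr = MAP_FAILED;
                       int fd = open(path, O_RDWR | O_CREAT, 0600);

                       if (fd < 0)
                               return NULL;
                       if (ftruncate(fd, size) == 0)
                               addr = mmap(NULL, size,
                                           PROT_READ | PROT_WRITE,
                                           MAP_SHARED, fd, 0);
                       close(fd);      /* the mapping stays valid */
                       return addr == MAP_FAILED ? NULL : addr;
               }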

     Access to Filesystems Owned by Some Other OSlet.

          For the most part, this reduces to the mmap case. However,
          partitioning popular filesystems over the OSlets could be
          very helpful. Larry mentioned that this had been prototyped.
          Paul cannot remember if Larry promised to send papers or
          other documentation, but duly requests them after the fact.

          Larry suggests having a local /tmp, so that /tmp is in effect
          private to each OSlet. There would be a /gtmp that would
          be a globally visible /tmp equivalent. We went round and
          round on software compatibility, Paul suggesting a hashed
          filesystem as an alternative. Larry eventually pointed out
          that one could just issue different mount commands to get
          a global filesystem in /tmp, and create a per-OSlet /ltmp.
          This would allow people to determine their own level of
          risk/performance.
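
          In terms of the mount(2) call, the per-OSlet choice is as
          small as the following sketch; the decision to make /tmp
          private is simply whether this runs at boot on a given OSlet
          (path and return handling are illustrative):

               #include <sys/mount.h>

               /* Give this OSlet a fast, private /tmp; the globally
                * visible equivalent stays mounted at /gtmp. */
               static int make_tmp_local(void)
               {
                       return mount("none", "/tmp", "tmpfs", 0, NULL);
               }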

     Pipes Connecting Processes in Different OSlets.

          This was mentioned, but I have forgotten the details.
          My vague recollections lead me to believe that some
          nose-punching was required, but I must defer to Larry
          and Ted.

          Ditto for Unix-domain sockets.

     Creation of Processes on a Different OSlet Than Their Parent.

          There would be an inherited attribute that would prevent
          fork() or exec() from creating its child on a different
          OSlet. This attribute would be set by default to prevent
          too many surprises. Things like make(1) would clear
          this attribute to allow amazingly fast kernel builds.

          There would also be a system call that would cause the
          child to be placed on a specified OSlet (Paul suggested
          use of HP's "launch policy" concept to avoid adding yet
          another dimension to the exec() combinatorial explosion).
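
          In rough terms, the interface discussed amounts to the pair
          below; the names and the flag value are invented purely for
          illustration, and the real thing would presumably follow
          HP's launch-policy style rather than growing exec() itself:

               /* Per-process flag, inherited across fork()/exec();
                * set by default so children stay on the parent's
                * OSlet.  make(1) would clear it. */
               #define OSLET_F_PINNED 0x1

               int oslet_setflags(int flags);

               /* Explicit placement: the next fork()/exec() from this
                * process lands on the named OSlet. */
               int oslet_set_launch_policy(int oslet_id);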

          The discussion of packet reception led Larry to suggest
          that cross-OSlet process creation would be prohibited if
          the parent and child shared a socket. See above for the
          load-balancing concern and corresponding l'esprit d'escalier.

     Processing of exit()/wait() Pairs Crossing OSlet Boundaries

          We didn't discuss this. My guess is that vproc deals
          with it. Some care is required when optimizing for this.
          If one hands off to a remote parent that dies before
          doing a wait(), one would not want one of the init
          processes getting a nasty surprise.

          (Yes, there are separate init processes for each OSlet.
          We did not talk about implications of this, which might
          occur if one were to need to send a signal intended to
          be received by all the replicated processes.)

Other Desiderata:

1. Ability of surviving OSlets to continue running after one of their
     number fails.

     Paul was quite skeptical of this. Larry suggested that the
     "door" mechanism could use a dynamic-linking strategy. Paul
     remained skeptical. ;-)

2. Ability to run different versions of the OS on different OSlets.

     Some discussion of this above.

The Score.

     Paul agreed that SMP Clusters could be implemented. He was not
     sure that it could achieve good performance, but could not prove
     otherwise. Although he suspected that the complexity might be
     less than that of the proprietary highly parallel Unixes, he was
     not convinced that it would be less than that of Linux, given the
     Linux community's emphasis on simplicity in addition to performance.

-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 


