Re: proc fs and shared pids

Linus Torvalds (torvalds@cs.helsinki.fi)
Wed, 7 Aug 1996 20:04:47 +0300 (EET DST)


On Wed, 7 Aug 1996, Al Longyear wrote:
>
> Does the cloning go forward in time as it does in a traditional thread
> model?

Yes. When you share some part of the context of execution with another
task, the data really _is_ shared, 100%.

> Suppose you have a process. It has one thread. It opens a file by
> calling the open procedure. Ok. It now has the file open.
>
> It then calls "create_thread" or "clone(0)" or whatever to create a
> new thread.
>
> Both threads now have the same file open. That is understood by all
> models.
>
> Now, the second thread opens yet another file by calling open().
>
> Does the first thread get a new descriptor for the newly opened file?

Not if you do a "clone(0)". As I mentioned, "clone(0)" is exactly the
same as fork(), and essentially copies all of the context of execution so
that we have two totally separate tasks (*).

(*) Slight simplifications: clone() has more than just one argument, and
the bitmask argument also contains an "exit signal" mask, but those are
details, not really relevant to the basic ideas.

But the argument to clone() is just a bit-mask of which parts of the
execution context we want to share, so if you want to share the open files,
you do a clone(CLONE_FILES), and now you have created a new task ("context of
execution") that shares the files structure with the original one.

Thus, if the parent closes or opens a file, that action shows up in the
child too (and vice versa, of course).
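
A minimal sketch of that behaviour, assuming a Linux system with the glibc
clone() wrapper (which takes a function and a stack, unlike the raw system
call). The helper names and the use of /dev/null are illustrative only. The
child opens a file and reports the descriptor number through its exit status;
the parent then checks whether that number is valid in its own table.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Runs in the new task: open a file, hand the fd number back as exit status.
 * Assumes the descriptor number is small (below 255). */
static int child_opens_file(void *arg)
{
    (void)arg;
    int fd = open("/dev/null", O_RDONLY);
    return fd < 0 ? 255 : fd;
}

/* Returns 1 if a file opened by the child shows up in the parent,
 * 0 if it does not, -1 on error. "flags" is the clone() sharing mask. */
int fd_opened_in_child_is_visible(int flags)
{
    int status;
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;

    /* The stack grows down on common architectures, so pass its top. */
    pid_t pid = clone(child_opens_file, stack + STACK_SIZE,
                      flags | SIGCHLD, NULL);
    if (pid < 0 || waitpid(pid, &status, 0) != pid ||
        !WIFEXITED(status) || WEXITSTATUS(status) == 255) {
        free(stack);
        return -1;
    }

    int fd = WEXITSTATUS(status);
    int visible = fcntl(fd, F_GETFD) != -1;   /* valid in *our* table? */
    if (visible)
        close(fd);
    free(stack);
    return visible;
}
```

Called with CLONE_FILES the helper should report the descriptor as visible;
called with 0 (the fork()-like case) the child only changed its own copy.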

> And, if it is given to the first thread, what do we do about closing
> the file? If the file is stored as a process structure item for each
> thread then both threads must close the file as there would be two
> references to the opened file. Yet, only one thread really opened the
> file and only one thread should close it.

A close() will close the file in both (or "all" - it doesn't have to
be just two tasks) tasks that share the same files. You can open the file
in one task and close it in another if you want to (although I suspect
that the programmer _really_ has to know what he is doing in order to not
mess up if he starts doing stuff like that ;^)

> If you don't like files, then substitute signal processing procedures,
> memory allocation or semaphores or any other shared resource that you
> would want to allocate/reserve for the term 'file' above.

Sure, just use "CLONE_SIGHAND" for signal handlers that are shared (when one
task installs a signal handler, it shows up in the other tasks too), and
"CLONE_VM" when you want to share the virtual memory of two processes.
Similarly, CLONE_FS shares "generic filesystem" state (currently that just
means pwd/cwd).
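
The CLONE_FS case can be sketched the same way, again assuming the glibc
clone() wrapper. The function names are made up for illustration, and the
sketch assumes /tmp exists so that a move to "/" is observable.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* Runs in the new task: change directory to the root. */
static int child_goes_to_root(void *arg)
{
    (void)arg;
    return chdir("/") == 0 ? 0 : 1;
}

/* Returns 1 if the child's chdir("/") moved the parent as well,
 * 0 if the parent stayed put, -1 on error. */
int child_chdir_moves_parent(int flags)
{
    char cwd[4096];
    int status, moved = -1;
    int saved = open(".", O_RDONLY);       /* remember where we started */
    char *stack = malloc(STACK_SIZE);

    /* Start from /tmp (assumed to exist) so the change is visible. */
    if (saved >= 0 && stack && chdir("/tmp") == 0) {
        pid_t pid = clone(child_goes_to_root, stack + STACK_SIZE,
                          flags | SIGCHLD, NULL);
        if (pid > 0 && waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
            getcwd(cwd, sizeof cwd) != NULL)
            moved = strcmp(cwd, "/") == 0;
    }

    if (saved >= 0) {                      /* go back to the original cwd */
        if (fchdir(saved) != 0)
            moved = -1;
        close(saved);
    }
    free(stack);
    return moved;
}
```

With CLONE_FS the parent should find itself in "/" after the child exits;
without it the parent is still in /tmp.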

If you want pthreads behaviour, you probably want to use

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_PID)

so that you essentially share everything (except the CPU state: due to
hardware limitations two tasks cannot have the same register state, for
understandable reasons - that would be a truly SIMD "clone()" ;)
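
A rough sketch of such a "share everything" task, assuming the glibc clone()
wrapper. The names are illustrative. Note one deliberate difference from the
call above: CLONE_PID is left out, since as far as I know later kernels no
longer accept it (the clone(2) manual page lists it for Linux 2.0 to 2.5.15
only). The child writes through a pointer into the parent's memory, which only
works because the address space is shared.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

/* The sharing mask from the text, minus CLONE_PID (see the note above). */
#define THREAD_LIKE (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND)

/* Runs in the new task, in the parent's address space. */
static int store_answer(void *arg)
{
    *(volatile int *)arg = 42;
    return 0;
}

/* Returns the value the child stored, or -1 on error. */
int run_thread_like_task(void)
{
    volatile int slot = 0;
    char *stack = malloc(STACK_SIZE);   /* each task needs its own stack */
    if (!stack)
        return -1;

    pid_t pid = clone(store_answer, stack + STACK_SIZE,
                      THREAD_LIKE | SIGCHLD, (void *)&slot);
    if (pid < 0) {
        free(stack);
        return -1;
    }
    if (waitpid(pid, NULL, 0) != pid)
        return -1;                      /* child may still use the stack */

    free(stack);
    return slot;
}
```

The separate stack is the user-space side of the "CPU state is not shared"
point: everything else is common, but each task has its own registers and
therefore its own stack pointer.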

> The kernel dispatches THREADS, not PROCESSES. The threads contain only
> the context needed to dispatch the instructions. This would be the
> current registers, priority, state, wait list, etc.

You cannot dispatch a thread without a process. A "thread" doesn't exist
on its own. In my opinion a "thread" doesn't even make sense at all:
what is the CPU state if there is no MMU state? That's why Linux doesn't
dispatch threads: it dispatches the totality, the "task".

> A thread then has a pointer to the process context. The process
> context would contain things such as memory maps, open file lists,
> identification (owner and group), etc. It would have all of the
> information which was not in the TSS (and I don't consider the map
> table, pointed to by the TSS's cr3 to be "in the TSS") or needed
> to dispatch the unit of execution.

The "current" pointer is the pointer to the current "struct task_struct".
That is the task descriptor, and it has pointers to all of the state of
the task ("current->mm" points to the VM description of the task,
"current->sig" points to the signal handler state etc).

The "current" pointer is neither thread nor process. It is _both_. It is
complete in itself (it doesn't need any external "process" container).
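
As a heavily simplified illustration (these are hypothetical structures, not
the kernel's real definitions, and only "mm" and "sig" are names taken from
the text), the shape of such a task descriptor could be sketched like this:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-resource state. */
struct mm_state    { int users; };   /* VM description        */
struct files_state { int users; };   /* open file table       */
struct fs_state    { int users; };   /* cwd and friends       */
struct sig_state   { int users; };   /* signal handlers       */

/* Register state: always private to one task. */
struct cpu_state { unsigned long pc, sp; };

/* One descriptor per task. It carries the "thread" part (cpu) and
 * points at the "process" part (mm, files, fs, sig) at the same time. */
struct task {
    int pid;
    struct cpu_state    cpu;
    struct mm_state    *mm;
    struct files_state *files;
    struct fs_state    *fs;
    struct sig_state   *sig;
};

/* Two tasks behave like threads of one process when every pointer matches. */
bool shares_everything(const struct task *a, const struct task *b)
{
    return a->mm == b->mm && a->files == b->files &&
           a->fs == b->fs && a->sig == b->sig;
}

/* Two tasks behave like separate processes when no pointer matches. */
bool shares_nothing(const struct task *a, const struct task *b)
{
    return a->mm != b->mm && a->files != b->files &&
           a->fs != b->fs && a->sig != b->sig;
}
```

Anything in between (say, shared files but separate memory) is just another
combination of pointers, which is why there is no separate "process" object.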

> The value of 'current' would point to the thread storage, not the
> process storage.

It _does_ point to the thread storage, but it _also_ points to the
process storage. They aren't separate entities under Linux. You could try
to make up something that is the "process" part, and another part that is
the "thread" part, but that's not how the kernel actually uses it or how
it should be thought about.

> Now, if you wish to duplicate all of the process related information
> such as map tables, etc. on a per-thread basis, then that is OK as
> well ( for the time being, that is. :) )
>
> I am sure that whatever scheme that is devised will work. I only
> suggest that the more common problems of "well, threads are easy so we
> can do them" need to be considered.

Note that this all is not something that is being devised. It already
exists. It does work. People are actually using it for threads already,
and as such the basic approach has validated itself.

The thing being discussed is the "frills": the stuff to make it easier to
create a 100% pthreads compatible library efficiently. Stuff like hiding
an extra "thread ID" inside the pid, so that we have a good interface to
do thr_kill() that just directly maps on top of the native "kill()"
system call. The details that haven't been needed yet and thus haven't
crystallized completely because clone() is only now starting to get used
for real..

> If you duplicate the information then you need to be concerned about
> currency. (This is true even if you declare that you won't keep the
> copies current -- at least you did consider the problem at one point.)
> Again, there is no "one" solution. You just need to choose. Either
> solution is BOTH good AND bad.

Clone() doesn't duplicate any information: it uses shared pointers to shared
in-kernel data structures that have resource counters to make
allocation/deallocation work correctly. So when you use "CLONE_VM", what
happens is that instead of doing a COW of the page tables and creating a new
"struct mm_struct" etc, Linux just points the new task at the same old
"struct mm_struct", and increments the usage count. Same goes for
sharing file descriptors etc. (And for this reason a clone() that shares
everything is _really_ quick under Linux).
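
The share-or-copy decision can be sketched in a few lines of user-space C.
This is a toy model under stated assumptions, not kernel code: the structure
layout and every function name here are made up for illustration, and the
"copy" merely duplicates a small struct where the real thing would set up
copy-on-write page tables.

```c
#include <stdlib.h>

/* Toy stand-in for the VM description, with a usage count. */
struct mm {
    int count;      /* how many tasks point at this structure */
    int nr_pages;   /* placeholder for the real contents      */
};

struct task {
    struct mm *mm;
};

struct mm *mm_alloc(int nr_pages)
{
    struct mm *mm = malloc(sizeof *mm);
    if (mm) {
        mm->count = 1;
        mm->nr_pages = nr_pages;
    }
    return mm;
}

/* Drop one reference; free on the last one. Returns 1 if freed. */
int mm_put(struct mm *mm)
{
    if (--mm->count > 0)
        return 0;
    free(mm);
    return 1;
}

/* share_vm != 0: point the child at the same mm and bump the count.
 * share_vm == 0: give the child its own copy with a count of 1.
 * Returns 0 on success, -1 if the copy could not be allocated. */
int task_clone_mm(struct task *child, struct task *parent, int share_vm)
{
    if (share_vm) {
        parent->mm->count++;
        child->mm = parent->mm;
        return 0;
    }
    child->mm = mm_alloc(parent->mm->nr_pages);
    return child->mm ? 0 : -1;
}
```

The sharing path is one pointer assignment and one increment regardless of how
large the address space is, which is the reason the shared case is so cheap.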

Plan-9 has a similar "rfork()" interface, and SGI has another thing that
looks pretty much like the Linux clone(). So it's not a totally new way of
looking at this. The plan-9 rfork() has some serious design problems in the
MM department, though, that Linux avoided. I haven't looked into the details
of the SGI thing.

Linus