Some more on POSIX.1b

Ulrich Drepper (drepper@myware.rz.uni-karlsruhe.de)
23 Nov 1996 04:50:35 +0100


Hi,

I've collected some more information on the POSIX.1b implementation.
I've hopefully incorporated the results of earlier discussions.

The part about semaphores is quite complete but I mentioned
shared memory and message queues only in a few sentences.
The implementation should follow what is done for semaphores.

Comments welcome.

-- Uli
--------------. drepper@cygnus.com ,-. Rubensstrasse 5
Ulrich Drepper \ ,--------------------' \ 76149 Karlsruhe/Germany
Cygnus Support `--' drepper@gnu.ai.mit.edu `------------------------

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
POSIX.1b for Linux

Just a proposal, by Ulrich Drepper <drepper@cygnus.com>

POSIX.1b defines a number of new programming means which support
inter-process communication:

- semaphores
- shared memory
- message queues

It's up to the system to decide on the optimal implementation,
considering the situation of the kernel. The POSIX documents do not
specify the exact implementation but instead define the C language
API.

For Linux in particular we have to consider several points before making
any decisions:

- We have two kinds of processes: real processes and threads. The
real meaning of "thread" in the Linux kernel is left open since
the underlying kernel mechanism provides the user many options.
Here we understand "thread" in the sense of POSIX threads: all
threads in a process share all resources except their own stack.

Because of this last point we are able to reduce the whole problem
of implementing the POSIX interface significantly. See below.

- A left-over from the old days are the functions which provide the
very same services; they were defined in SysV systems and are now
standardized in the X/Open Programmers Guide. To reduce the kernel
code one should try to unify the implementations.

- Linux is (primarily) designed for modern systems which provide
hardware support for memory management. For the sake of speed and
simplicity this should be used.

Threads and Processes

As we said above, threads in Linux share everything but the stack. But
even the stack of each thread is accessible to the other threads.
The POSIX functions for the new entities explicitly distinguish
between global and local semaphore objects.

Local semaphores can be implemented very efficiently inside the
thread library. No system calls are needed to handle them
except for re-scheduling.
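
To illustrate the point about local semaphores, here is a minimal
sketch of a process-local counting semaphore built on C11 atomics.
The type local_sem_t and all local_sem_* names are hypothetical and
not part of POSIX or of any existing thread library. A real thread
library would presumably put the waiting thread to sleep; this sketch
merely yields the processor, which is the only system call involved.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical process-local counting semaphore; not the real sem_t.  */
typedef struct
{
  atomic_uint value;
} local_sem_t;

static void
local_sem_init (local_sem_t *sem, unsigned int value)
{
  assert (sem != NULL);
  atomic_init (&sem->value, value);
}

/* Try to decrement the counter.  Returns 0 on success, or -1 with
   errno set to EAGAIN if the semaphore is currently zero.  */
static int
local_sem_trywait (local_sem_t *sem)
{
  unsigned int old = atomic_load (&sem->value);

  while (old > 0)
    {
      /* On failure OLD is reloaded with the current value.  */
      if (atomic_compare_exchange_weak (&sem->value, &old, old - 1))
        return 0;
    }
  errno = EAGAIN;
  return -1;
}

/* Lock the semaphore.  The only system call is the re-scheduling
   request made while the semaphore is unavailable.  */
static void
local_sem_wait (local_sem_t *sem)
{
  while (local_sem_trywait (sem) != 0)
    sched_yield ();
}

/* Unlock the semaphore.  */
static void
local_sem_post (local_sem_t *sem)
{
  atomic_fetch_add (&sem->value, 1);
}

static unsigned int
local_sem_getvalue (local_sem_t *sem)
{
  return atomic_load (&sem->value);
}
```

All state lives in the address space shared by the threads, so the
kernel never has to know that the semaphore exists.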

So the rest of this proposal deals only with global objects.

The two Interfaces

The normal interface to all the new object types consists of an
open()-like function. This function takes name, flag and mode
arguments similar to open(). It is left undefined how exactly
the names are handled. More on this later.

In addition, the semaphore handling functions require a second
interface to be implemented: POSIX defines unnamed semaphores. I.e.,
the appropriate functions simply return a descriptor for a currently
unused semaphore. No name is required. Somehow this should fit into
the scheme by which named semaphores are handled.

What has to be implemented?

Each of the three objects requires a few operations to be defined on
it. Let's start with semaphores:

Semaphore Action. First we have the two already mentioned functions
to create/attach to a semaphore object.

int sem_init (sem_t *sem, int pshared, unsigned int value)
and
sem_t *sem_open (const char *name, int oflag, ...)

sem_init initializes the semaphore object pointed to by sem. The
pshared parameter is not of interest here since we always expect to
handle global objects in the kernel. value is the initial value of
this counting semaphore.

sem_open returns a pointer to a semaphore object. The name is used to
find out whether there is already an object with this name. This
makes it easy to attach to an existing semaphore object if this is
wanted. oflag and the optional arguments match the arguments of the
open() function.

Here we must note several points:

- It is left unspecified how exactly the name argument to sem_open is
handled. The standard allows the name to appear in the real
file system. The only restriction is that the name must follow the
rules which apply to other file names.

Besides this the standard specifies that the effect of sem_open for
names not starting with a slash character is implementation-defined.
This leaves some options open. First, the user-level sem_open can make
sure the kernel will never see names without a leading slash. This can
easily be achieved by the following code:

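if (name[0] != '/')
  {
    /* Relative name: prepend the current working directory.  */
    char cwd[PATH_MAX + 1], *tmp;
    if (getcwd (cwd, sizeof (cwd)) == NULL)
      return SEM_FAILED;
    tmp = alloca (strlen (cwd) + 1 + strlen (name) + 1);
    /* stpcpy returns a pointer to the end, so keep tmp as the start.  */
    stpcpy (stpcpy (stpcpy (tmp, cwd), "/"), name);
    name = tmp;
  }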

For the sem_init function we do not have to specify a name, so a
NULL pointer is unique (the user-level implementation of sem_open
should catch this error case). But this still leaves us another
possibility: names not starting with a slash. Here we should not
forget the old SysV interface. The function is defined as

int semget (key_t key, int nsems, int semflg)

where key_t is a numeric type. So all three kinds of semaphore
creation could be covered by one system call by using the
textual representation of the numeric value of key as the name,
without a leading slash to make it unique.

All this together suggests an entry code for the semaphore creation
in the kernel like this:

if (name == NULL)
  type = SEM_UNNAMED;
else
  type = name[0] == '/' ? SEM_NAMED : SEM_SYSV;

The other parameters needed for a unified interface are:
* initial value
* access mode and rights
The internal information in the kernel must also cover
* UID (for SysV semaphores)
* GID (for SysV semaphores)
* close-on-exec flag (for sem_open)

- The nature of the names for named semaphores allows a very nice
means of controlling the semaphores. Utilities are needed to list
available semaphores, remove them, etc., and these must somehow find
out which semaphores are available.

A possible interface would be to create a pseudo file system which
is made available as part of the /proc file system. One could imagine
implementing a layout like this

/proc/
/proc/semaphores/
/proc/semaphores/named/
/proc/semaphores/named/some-app/sema-1
/proc/semaphores/unnamed/
/proc/semaphores/unnamed/9A87F312
/proc/semaphores/sysv/
/proc/semaphores/sysv/98765432

The ipcs set of utilities for controlling the SysV objects could
simply operate on the /proc/semaphores/sysv/ directory. Executing

# cat /proc/semaphores/sysv/98765432

could produce the following output:

semid: 98765432
owner: 101
perms: 777
nsems: 1
value: 1

from which the control program can easily produce the needed output.

One could even imagine that the whole interface is handled by this
pseudo file system. I.e., the various open calls could be mapped
to real open() calls on these pseudo files and all other operations
could be handled by fcntl() and close() calls. The POSIX standard
allows this kind of implementation but the total number of available
file descriptors might be a problem.

One problem with the pseudo file system is that it might be more
complex to implement. But placing the names in a real file system
is not really an option since

- three more special file types would have to be introduced and
the underlying file system would have to know about them.

- when a process (or the machine) crashes objects are left behind
which must be handled manually.
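
To show how little a control program would need, here is a sketch
that parses the status text shown earlier for
/proc/semaphores/sysv/98765432. The field order and the octal
interpretation of perms are assumptions taken from that example
output; struct sem_status and parse_sem_status are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one status file.  */
struct sem_status
{
  unsigned long semid;
  unsigned int owner;
  unsigned int perms;   /* parsed as octal, like a file mode */
  unsigned int nsems;
  int value;
};

/* Parse the "key: value" lines in TEXT.  Returns 0 if all five
   fields were found in the expected order, -1 otherwise.  */
static int
parse_sem_status (const char *text, struct sem_status *st)
{
  assert (text != NULL && st != NULL);

  /* White space in the format also matches the newlines.  */
  int n = sscanf (text,
                  " semid: %lu owner: %u perms: %o nsems: %u value: %d",
                  &st->semid, &st->owner, &st->perms, &st->nsems,
                  &st->value);
  return n == 5 ? 0 : -1;
}
```

An ipcs-like tool would read each file below the sysv directory into
a buffer, call parse_sem_status on it and print one table row per
file.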

The next group of functions deals with closing and removing:

int sem_destroy (sem_t *sem)

Destroy an unnamed semaphore.

int sem_close (sem_t *sem)

Close descriptor for named semaphore but don't remove it.

int sem_unlink (const char *name)

Remove named semaphore.

How these functions work does not need much discussion.
It depends only on how the semaphores are chosen to be
implemented. They could simply be file operations like close() and
unlink() if the semaphores are implemented via a pseudo file system.

It is still not clear how the operations on the semaphores themselves
should work. We have to be able to:

- unlock the semaphore

- lock the semaphore

- try to lock the semaphore

- read the value of the semaphore
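
In the POSIX API these four operations are sem_post, sem_wait,
sem_trywait and sem_getvalue. The following sketch runs each of them
once. It uses an unnamed, process-local semaphore (pshared == 0) only
to stay self-contained; exercise_semaphore is a made-up helper name.

```c
#include <assert.h>
#include <semaphore.h>

/* Run the four operations once on an unnamed semaphore that starts
   at INITIAL (must be at least 1) and return its final value,
   or -1 on error.  */
static int
exercise_semaphore (unsigned int initial)
{
  sem_t sem;
  int val = -1;

  assert (initial >= 1);
  if (sem_init (&sem, 0, initial) != 0)
    return -1;

  sem_post (&sem);                /* unlock: value becomes initial + 1 */
  sem_wait (&sem);                /* lock: would block at zero */
  if (sem_trywait (&sem) != 0)    /* try to lock: EAGAIN at zero */
    val = -1;
  else
    sem_getvalue (&sem, &val);    /* read the current value */

  sem_destroy (&sem);
  return val;
}
```

Whatever kernel interface is chosen has to carry exactly these four
requests.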

Following the proposal of the pseudo file system one could implement
all these operations with the normal ioctl call, using a descriptor
for the pseudo file.

But the read operation might be used frequently and so it would
be good to optimize this operation. Another possibility for the
interface would be to make the value of the semaphore directly
available for reading by mapping a read-only page with the variables
into each process image.

I.e., a pointer to a page of memory is used as the pointer to an
array of structs which represent the semaphores in
the kernel. So the process could get the current value of a
semaphore by directly reading the struct in this memory. No system
call is necessary.

The other operations could also be implemented using this interface.
Since the memory page is read-only any attempt to write would result
in an error, caught by the kernel. The kernel could find out which
address the process tried to write to and determine the operation
from it. The semaphore structure could look like this:

struct
{
  int value;
  int post;
  int wait;
  int trywait;
};

The values for `post', `wait', `trywait' could be arbitrary, only
the address is important.
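
How the kernel might decode such a write can be sketched in plain C.
The struct tag ksem, the enum sem_op and the function decode_fault
are names invented for this example. The sketch assumes the
semaphores are laid out as a contiguous array starting at the
beginning of the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout from the text, given a tag so it can be named.  */
struct ksem
{
  int value;
  int post;
  int wait;
  int trywait;
};

enum sem_op { OP_INVALID, OP_POST, OP_WAIT, OP_TRYWAIT };

/* Given the start of the mapped array, the number of entries and the
   address of the faulting write, report which operation was
   requested and on which semaphore.  */
static enum sem_op
decode_fault (uintptr_t base, size_t count, uintptr_t addr, size_t *index)
{
  assert (index != NULL);
  if (addr < base || addr - base >= count * sizeof (struct ksem))
    return OP_INVALID;

  size_t off = addr - base;
  *index = off / sizeof (struct ksem);

  switch (off % sizeof (struct ksem))
    {
    case offsetof (struct ksem, post):
      return OP_POST;
    case offsetof (struct ksem, wait):
      return OP_WAIT;
    case offsetof (struct ksem, trywait):
      return OP_TRYWAIT;
    default:
      /* A write to `value' or a misaligned address.  */
      return OP_INVALID;
    }
}
```

The value that the process tried to store never enters the decision,
which matches the remark that only the address is important.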

The decision which interface is used should be based on
experience as to which is faster and/or easier to implement. The
reading interface can be made available in any case since inside the
kernel there must be an array of structs describing the semaphores
and it is easy to map them.

>>>>>>>>>>>>>>
Why is the reading operation important?
It was noted that sem_trywait() could be implemented
like this:

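int
sem_trywait (sem_t *sem)
{
  int val;

  if (sem_getvalue (sem, &val) != 0)
    return -1;                  /* errno already set */
  if (val <= 0)
    {
      /* Already locked: fail without entering the kernel.  */
      errno = EAGAIN;
      return -1;
    }
  /* At the time of reading the semaphore was ready for locking.
     __syscall_trywait stands for the real system call, assumed to
     return 0 on success and -1 with errno set otherwise.  */
  return __syscall_trywait (sem);
}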

This is not 100% correct but should show the possibilities.

It is questionable whether it is really possible to make the
information available in this form. It is certainly possible for
unnamed semaphores but named and SysV semaphores have access rights.
And implementing this feature only for unnamed semaphores is perhaps
not worth the work since unnamed semaphores are not really useful.
<<<<<<<<<<<<<<<<

Shared Memory.

The handling of shared memory is similar but simpler. The POSIX
functions only need to open and close the memory objects. The
resulting descriptors must later be handled using mmap.
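
For reference, this is roughly how the descriptor returned by the
POSIX shm_open call is used together with mmap. The helper name
map_shared and the mode 0600 are arbitrary choices for this sketch,
which assumes a system that already provides shm_open.

```c
#include <assert.h>
#include <fcntl.h>      /* O_CREAT, O_RDWR */
#include <stddef.h>
#include <sys/mman.h>   /* shm_open, mmap */
#include <unistd.h>     /* ftruncate, close */

/* Create (or attach to) the shared memory object NAME, make sure it
   is LEN bytes long and map it.  Returns MAP_FAILED on error.  */
static void *
map_shared (const char *name, size_t len)
{
  assert (name != NULL && len > 0);

  int fd = shm_open (name, O_CREAT | O_RDWR, 0600);
  if (fd == -1)
    return MAP_FAILED;

  void *p = MAP_FAILED;
  if (ftruncate (fd, (off_t) len) == 0)
    p = mmap (NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  /* The mapping stays valid after the descriptor is closed.  */
  close (fd);
  return p;
}
```

Two processes calling map_shared with the same name see the same
memory. The object itself stays around until shm_unlink is called on
the name.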

What remains true is that the name space must be handled somehow.
There has to be support for named POSIX and SysV shared memory
segments. So again the pseudo file system could help. Everything
said about this in the description of semaphores is also true here,
except that no information needs to be mapped. A possible layout
could look like this:

/proc/
/proc/shared-memory/
/proc/shared-memory/named/
/proc/shared-memory/named/my-app/shm-1
/proc/shared-memory/sysv/
/proc/shared-memory/sysv/76543210

I.e., it would prevent complications if the name space of shared
memory segments does not conflict with the other name spaces
(normal files, semaphores, or message queues).

Message Queues.

Message queues are again a bit more complicated to handle.
Creation again has to happen in a name space which should not
conflict with the others.

In addition to the operations we saw before, message queues
also have attributes which must be settable. The ioctl() interface
is still usable.

>>>>>>>>>>>>
An important restriction for implementing message queues is that
currently POSIX.1b signals are not implemented. But these are
necessary for the complete implementation of message queues.
<<<<<<<<<<<<

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Final words (for now):

- merging the old SysV definitions and the new POSIX definitions
should be possible

- the pseudo file system is not the easiest way to implement this
but it is perhaps the easiest for the user to control

- all operations could be mapped to normal open/read/write/ioctl/close
operations on the pseudo file system.

- It might be possible to implement the whole support for the IPC methods
as a new file system.