POSIX.4?

Dave Wreski (dwreski@ultrix.ramapo.edu)
Tue, 16 Apr 1996 19:10:27 -0400 (EDT)


Hi all. More than a year ago I caught this message on this channel. I
was wondering how much of it is true for the upcoming release of the
Linux kernel. Not too long ago I needed to do file descriptor passing
and was unable to do so. I was hoping to find a list of the available
function calls, i.e. updated man pages or some such that list the new
capabilities.

Thanks,
Dave

Date: Tue, 21 Mar 1995 17:08:32 +0100 (MET)
From: Markus Kuhn (CIP 90) <mskuhn@faui01.informatik.uni-erlangen.de>
To: Linux Kernel Mailing List <linux-kernel@vger.rutgers.edu>
Subject: A Vision for Linux 1.4: POSIX.4 Compatibility

A Vision for Linux 1.4 -- POSIX.4 Compatibility
-----------------------------------------------

Today, the Linux kernel and libc are quite compatible with the
POSIX.1 and POSIX.2 standards, which specify system calls, library
functions and shell commands for UNIX-style operating systems.
However, the POSIX.1 system calls and library functions define only
the minimum core functionality required by anything that looks like
UNIX. Many slightly more advanced facilities such as mmap(), fsync(),
timers, modifiable scheduling algorithms and IPC, which are
essential for many real-world applications (databases, real-time
applications, MPEG players, etc.), have not been standardized by
POSIX.1.

The new POSIX.4 standard (now officially called IEEE 1003.1b-1993,
ISBN 1-55937-375-X) corrects this, and I believe POSIX.4 contains a
large number of useful ideas for further development of Linux.

In the very short introduction below, I hope to raise your interest in
POSIX.4 and in real-time problems in general. Happy reading!

POSIX.4 defines in addition to POSIX.1 the following new concepts and
functions:

Improved Signals
----------------

POSIX.4 adds a new class of signals. These have the following new
features:

- There are many more user-defined signals now, not only SIGUSR1
and SIGUSR2.

- The additional POSIX.4 signals can carry a small amount of data (a
pointer or an integer value) that tells the signal handler why the
signal was raised.

- The new signals are queued, which means that if several signals of
the same type arrive before the signal handler is called, all of
them are delivered.

- POSIX.4 signals have a well-defined delivery order, i.e. you can
work with signal priorities.

- A new function sigwaitinfo() allows a process to wait for signals
and then continue quickly with program execution, without the
overhead of calling a signal handler first.

Most new extensions defined by POSIX.4 are optional. Only the extended
signals are mandatory, because this facility is used by many other
POSIX.4 facilities (asynchronous I/O, itimers, etc.). So if POSIX.4
compatibility is a design goal of Linux 1.4 (yes, please !!!), the
extended signals should get a high priority on the to-do list.

New functions for signals are:

sigwaitinfo(), sigtimedwait(), sigqueue().
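
As a rough illustration of the queued, data-carrying signals described
above, here is a minimal C sketch. It assumes a system that implements
the real-time signal extension and provides SIGRTMIN. The helper name
queue_and_collect() is made up for this example and is not part of
POSIX.4.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Queue n real-time signals to ourselves, each carrying one integer,
 * then collect them synchronously with sigwaitinfo() instead of a
 * signal handler. Hypothetical helper, for illustration only.
 * Returns n on success, -1 on error. */
int queue_and_collect(const int *values, int n, int *received)
{
    sigset_t set;
    int i;

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);

    /* Block the signal so it stays pending until we ask for it. */
    if (sigprocmask(SIG_BLOCK, &set, NULL) == -1)
        return -1;

    for (i = 0; i < n; i++) {
        union sigval v;
        v.sival_int = values[i];        /* the payload */
        if (sigqueue(getpid(), SIGRTMIN, v) == -1)
            return -1;
    }

    /* Every queued instance is delivered, none are merged. */
    for (i = 0; i < n; i++) {
        siginfo_t info;
        if (sigwaitinfo(&set, &info) == -1)
            return -1;
        received[i] = info.si_value.sival_int;
    }
    return n;
}
```

Calling it with the values 7, 8, 9 should hand back the same three
integers, which is the difference from classic signals, where several
pending instances of the same signal may collapse into one.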

Inter Process Communication (IPC) and memory mapped files
---------------------------------------------------------

POSIX.4 now defines shared memory, message queues and semaphores. The
functionality and design of these are much better than those of the
System V IPC mechanisms which we already have in Linux. I guess it
would be possible to remove the old SysV IPC from the kernel and
emulate it completely in libc by using only the new POSIX-style system
calls. The major extensions are:

- Strings (like filename paths) instead of integers are now used to
identify IPC resources. This makes it much easier to avoid IPC
resource collisions than in SysV.

- Semaphores come in two flavours: kernel-based semaphores (as in
System V, which requires a system call for each P/V operation) and
now also user-memory-based semaphores. Kernel-based semaphores are
sometimes necessary for security reasons, but they are a real
pain if you want to build a high-performance database. Suppose
there are 20 server processes operating on a single B-tree in a
memory-mapped database file. Inserting a node into a large B-tree
with minimal blocking of concurrent accesses by the other 19
processes can require around 100 semaphore operations, i.e.
currently 100 kernel calls :-(. With POSIX.4's user-memory-based
semaphores, you put all your semaphores in a piece of shared
memory and the library accesses them with highly efficient
test-and-set machine code. System calls are now only necessary in
the rare case of a blocking P operation. A database programmer's
dream and easy to implement!!!

- In POSIX.4, both memory-mapped files and shared memory are
set up with the mmap() system call.

The new functions for IPC are:

mmap(), munmap(), shm_open(), shm_unlink(), ftruncate(),
sem_init(), sem_destroy(), sem_open(), sem_close(), sem_unlink(),
sem_wait(), sem_trywait(), sem_post(), sem_getvalue(), mq_open(),
mq_close(), mq_unlink(), mq_send(), mq_receive(), mq_notify(),
mq_setattr(), mq_getattr(), mprotect().
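
To make the idea of a semaphore living in shared memory concrete, the
following C sketch creates a named shared memory object, maps it with
mmap(), and lets several child processes increment a counter under a
process-shared semaphore. It is only a sketch under assumptions: the
struct, the function name count_in_shared_memory() and the object name
passed in are invented for this example, and on older C libraries you
may have to link with -lrt and -pthread.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <semaphore.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Layout of the shared region (hypothetical, for illustration). */
struct shared_counter {
    sem_t lock;     /* semaphore stored in the shared memory itself */
    long  value;
};

/* Fork nproc children; each adds 1 to the counter iters times.
 * Returns the final counter value, or -1 on error. */
long count_in_shared_memory(const char *name, int nproc, int iters)
{
    struct shared_counter *c;
    long result;
    int started = 0, p, i;
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);

    if (fd == -1)
        return -1;
    shm_unlink(name);   /* the mapping keeps the object alive */
    if (ftruncate(fd, sizeof *c) == -1) {
        close(fd);
        return -1;
    }
    c = mmap(NULL, sizeof *c, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (c == MAP_FAILED)
        return -1;

    c->value = 0;
    if (sem_init(&c->lock, 1, 1) == -1) {   /* pshared = 1 */
        munmap(c, sizeof *c);
        return -1;
    }

    for (p = 0; p < nproc; p++) {
        pid_t pid = fork();
        if (pid == -1)
            break;
        if (pid == 0) {
            for (i = 0; i < iters; i++) {
                sem_wait(&c->lock);     /* P operation */
                c->value++;
                sem_post(&c->lock);     /* V operation */
            }
            _exit(0);
        }
        started++;
    }
    for (i = 0; i < started; i++)
        wait(NULL);

    result = (started == nproc) ? c->value : -1;
    sem_destroy(&c->lock);
    munmap(c, sizeof *c);
    return result;
}
```

With 4 processes and 1000 iterations each, the expected result is 4000.
Whether an uncontended sem_wait() really avoids a system call depends
on the C library implementation.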

Memory locking
--------------

Four new functions mlock(), munlock(), mlockall() and munlockall()
allow a process to disable paging either for specified memory regions
(mlock()) or for all pages (code, stack, data, shared memory, mapped
files, shared libraries) to which it has access (mlockall()). This
guarantees that e.g. small time-critical daemons stay in memory,
which helps to guarantee the response time of these daemons.
Under Linux, this (like most other real-time related features) should
of course only be allowed for root processes in order to avoid abuse
of this feature by normal users on large time-sharing systems.
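
A small C sketch of locking one buffer into memory follows. It assumes
the caller is allowed to lock memory at all (depending on the system,
this may need root or a sufficient locked-memory limit), so the lock
can fail and the code reports that. The helper names alloc_locked() and
free_locked() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round a length up to a whole number of pages. */
static size_t round_up_to_page(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}

/* Allocate a page-aligned buffer and lock it into RAM.
 * Returns NULL with errno set if allocation or locking fails. */
void *alloc_locked(size_t len)
{
    size_t size = round_up_to_page(len);
    void *buf = NULL;
    int rc = posix_memalign(&buf, (size_t)sysconf(_SC_PAGESIZE), size);

    if (rc != 0) {
        errno = rc;
        return NULL;
    }
    if (mlock(buf, size) == -1) {   /* may fail without privileges */
        int saved = errno;
        free(buf);
        errno = saved;
        return NULL;
    }
    return buf;
}

/* Unlock and release a buffer obtained from alloc_locked(). */
void free_locked(void *buf, size_t len)
{
    munlock(buf, round_up_to_page(len));
    free(buf);
}
```

A time-critical daemon would more likely call
mlockall(MCL_CURRENT | MCL_FUTURE) once at startup instead of locking
individual buffers.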

Synchronous I/O
---------------

Databases, e-mail systems, etc. need to be sure that a written
piece of data has actually reached the hard disk, because transaction
protocols require that a power failure after the write command cannot
harm the data. POSIX.4 defines the fsync() and O_SYNC mechanisms, which
Linux already has.

In addition, there is a very useful new function fdatasync(), which
requires that the data block is flushed to disk but does NOT require
that the inode with the latest access/modification time is also
flushed each time. With fdatasync(), the inode only has to be
written if the file length has changed. In database applications
with mostly constant file sizes, where you sometimes need an fsync()
after every few written blocks but don't care whether the access
times in the physical inodes are up to date, fdatasync() can easily
double the performance of your system.
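
The database case above could look roughly like this in C: a record is
overwritten in place, so the file length does not change, and
fdatasync() is used instead of fsync(). This is a sketch, not a
complete transaction protocol, and the function name
write_record_durably() is invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes at the given offset and return only once the data
 * (but not necessarily the inode timestamps) has been flushed.
 * Returns 0 on success, -1 on error. */
int write_record_durably(int fd, off_t offset, const void *rec, size_t len)
{
    const char *p = rec;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, offset);
        if (n == -1) {
            if (errno == EINTR)
                continue;       /* interrupted, try again */
            return -1;
        }
        p += n;
        offset += n;
        len -= (size_t)n;
    }
    return fdatasync(fd);       /* data only, inode only if needed */
}
```

Any speedup over fsync() depends on the file system and on whether the
file size actually stays constant.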

There is also a msync() function for flushing a range of pages from
memory mapped files to the disk.

Timers
------

- Instead of the old BSD-style gettimeofday()/settimeofday() calls,
POSIX.4 defines clock_gettime(), clock_settime() and
clock_getres(). They offer nanosecond resolution instead of
the microseconds of the old BSD calls (at least on Pentiums, it
is not difficult to implement a timer with a resolution much
better than a microsecond). In addition, you can now query the
actual resolution of the timer with clock_getres() (this might
e.g. be finer on a Pentium than on an i386 if the Pentium
clock count registers are utilized).

- A new function nanosleep() allows a process to sleep for less than
a second (the old sleep() had only second resolution). In addition,
nanosleep() won't interfere with SIGALRM, and in case of EINTR it
returns the time left, so you can easily continue in a while loop.

In order to implement this correctly with really high resolution
(i.e. better than 10 ms), the 100 Hz interrupt in
sched.c would have to check each time whether a nanosleep() is
scheduled to wake up during the next time slice, and it would have to
reprogram the interrupt timer to interrupt at precisely this time.
If done well, this could be implemented without performance
reduction for users of systems which are not using nanosleep() at
the moment, and it would bring Linux (together with the POSIX.4
scheduler extensions below) a lot closer to real-time capability.

- POSIX.4 also provides itimers, but now you can deal with
several timers (at least 32 per process) and you again have up to
nanosecond resolution. The old itimer functions can still easily
be implemented in libc for compatibility reasons using the new
POSIX-style itimer system calls.
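
The clock query and the "continue in a while loop" idiom for
nanosleep() mentioned above can be sketched in C as follows. The
function names clock_resolution_ns() and sleep_fully() are made up for
this example; the sketch assumes CLOCK_REALTIME is available, which
POSIX.4 requires.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Report the resolution of a clock in nanoseconds, or -1 on error. */
long long clock_resolution_ns(clockid_t clk)
{
    struct timespec res;

    if (clock_getres(clk, &res) == -1)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Sleep for the whole interval, even if signals interrupt us.
 * Returns 0 on success, -1 on any error other than EINTR. */
int sleep_fully(long nanoseconds)
{
    struct timespec req, rem;

    req.tv_sec = nanoseconds / 1000000000L;
    req.tv_nsec = nanoseconds % 1000000000L;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;      /* resume with the time that was left */
    }
    return 0;
}
```

For example, sleep_fully(1000000L) requests a 1 ms sleep. How close the
actual wake-up time comes to that depends entirely on the kernel's
timer implementation, which is the point of the discussion above.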

Scheduling
----------

Linux has so far been optimized a lot as a time-sharing system, where
several people run application programs like editors, compilers,
debuggers, X window servers, networking daemons, etc. and do word
processing, software development, etc.

However, there are a lot of applications for which Linux is currently
unusable and for which even hard-line Linux enthusiasts have to keep a
stand-alone DOS version on their disk. For >90% of these applications,
the fact that Linux is incapable of guaranteeing the response time of
an application is the major problem. Software for controlling e.g. an
EPROM programmer, a robot arm or an astronomical CCD camera is
currently not realizable under Linux if there is no dedicated
real-time controller present in the controlled device. A lot of
commercially available hardware has been designed with the real-time
capability of DOS in mind and has no microcontroller of its own for
time-critical actions, so this is a real-world problem. I have myself
spent a long, frustrating time trying to implement an interface to a
pay-TV decoder for Linux (which emulates a chip card and allows you to
watch pay-TV for free :-). In this application, you have to wait for
an incoming byte on the serial port, then you have to wait for around
0.7 to 2 ms (never shorter, never longer!) before returning an answer
byte. It is virtually impossible to implement a user process for this
task under Linux, while it is trivial to do this under DOS.

For these and similar real-time applications, POSIX.4 specifies three
different schedulers, each with static priorities:

SCHED_FIFO A preemptive, priority-based scheduler. Each process
managed under this scheduling policy possesses the
CPU as long as it doesn't block itself and no interrupt
occurs which puts another process into a higher
priority queue. There exists a FIFO queue for each
priority level, and every process which becomes runnable
again is inserted into the queue behind all other
processes. This is the most popular scheduler used
in typical real-time operating systems. The function
sched_yield() allows the process to go to the end
of the FIFO queue without blocking.

SCHED_RR A preemptive, priority-based round-robin scheduling
strategy with quanta. It is very similar to
SCHED_FIFO, but each process has a time quantum, and
the process is preempted and inserted at the
end of the FIFO for the same priority level if it
runs longer than the time quantum and other processes
of the same priority level are waiting in the queue.
As in SCHED_FIFO, processes of lower priority will
never get the CPU as long as a higher-priority process
is in a ready queue, and if a higher-priority process
becomes ready to run, it also gets the CPU immediately.

SCHED_OTHER This is any implementation-defined scheduler and would
for Linux obviously be the current time-sharing
scheduler with nice values, etc. For simplicity, I
suggest that under Linux 1.4, all SCHED_OTHER
processes should have the lowest static priority
level and that all SCHED_RR or SCHED_FIFO processes
can only have higher priorities. Inside this common
lowest SCHED_OTHER priority level, the classic Linux
scheduling algorithm would determine which process gets
the CPU next depending on nice levels, how long the
process has already had the CPU, etc., as is done
already now.

For security reasons, only root processes should be allowed under
Linux to get any static priority higher than the one for
SCHED_OTHER, because if these real-time scheduling mechanisms are
abused, the whole system can be blocked.

If one is developing a real-time application, it is a very good idea
to have a shell with a higher SCHED_FIFO priority open somewhere in
order to be able to kill the tested application in case something goes
wrong. If you use X11, not only the shell but also the X server, the
window manager and the xterm will require a higher SCHED_FIFO or
SCHED_RR priority in order to stop processes from blocking the rest of
the system.

With this POSIX.4 functionality, it would be possible to run real-time
software under Linux by giving it root permissions and assigning it a
SCHED_FIFO strategy and a higher static priority than all other
classic SCHED_OTHER Linux processes. In addition, this real-time
application would lock its pages into memory with mlockall() in
order to avoid being swapped out. This will guarantee that the
real-time application can react as soon as possible to any interrupts
and that the response time will not be influenced by the complicated
Linux time-sharing priority mechanism or by pages which have been
moved to the swap space. Then the only final piece missing towards a
full real-time OS like QNX or LynxOS would be a preemptable kernel
(BTW: does Windows NT have a preemptable kernel?). However, this is a
much more complicated task (as the kernel won't be a monitor any more)
and I have some doubts whether implementing this is possible without a
noticeable performance loss.

The new functions here are:

sched_setparam(), sched_getparam(), sched_setscheduler(),
sched_getscheduler(), sched_yield(), sched_get_priority_max(),
sched_get_priority_min(), sched_rr_get_interval().
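
As a sketch of how a process might request the SCHED_FIFO policy with
these calls, consider the following C fragment. It assumes the caller
has the privilege to change its scheduling policy (typically root), so
failure with EPERM is expected for normal users. The helper names
fifo_priority() and enter_fifo() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sched.h>

/* Pick a SCHED_FIFO priority `steps` levels above the minimum,
 * clamped to the range the system reports. Returns -1 on error. */
int fifo_priority(int steps)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    if (lo == -1 || hi == -1)
        return -1;
    if (steps < 0)
        steps = 0;
    return (steps > hi - lo) ? hi : lo + steps;
}

/* Put the calling process under SCHED_FIFO.
 * Returns 0 on success, otherwise the errno value of the failure. */
int enter_fifo(int steps)
{
    struct sched_param sp;
    int prio = fifo_priority(steps);

    if (prio == -1)
        return errno;
    sp.sched_priority = prio;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        return errno;   /* EPERM without sufficient privileges */
    return 0;
}
```

A real-time program following the recipe above would combine a
successful enter_fifo() with mlockall(MCL_CURRENT | MCL_FUTURE), and
would keep a higher-priority shell around during development.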

Ok, now the final new functionality:

Asynchronous I/O (aio)
----------------------

POSIX.4 defines a number of functions which allow a process to send a
long list of read/write requests at various seek positions in various
files to the kernel with one single lio_listio() system call. While the
process continues to execute the next instructions, the kernel will
asynchronously read or write the requested pages and will send signals
when the task has been completed (if this is desired).

This is e.g. very nice for a database which knows that it will require
a lot of different blocks scattered across a file. It will simply pass
a list of the blocks to the kernel, and the kernel can optimize the
disk head movement before sending the requests to the device. In
addition, this minimizes the number of kernel calls and allows the
database to do something else in the meantime (e.g. waiting for the
client process to send an abort instruction, in which case the database
server can cancel the async I/O requests with aio_cancel()).

Another important application of aio is multimedia systems (e.g. MPEG
players) which want to preload the next few seconds of the MPEG video
data stream from hard disk into locked memory, but also want to
continue showing the video on the screen at the same time.

POSIX.4 also defines priorities for asynchronous I/O, i.e. there is a
way to tell the kernel that the read request for the MPEG player is
more important than the read request of gcc. On a future real-time
Linux, you don't want to see any image distortions while watching MPEG
video and compiling a kernel at the same time if you gave the MPEG
player a higher static priority.

New functions in this area are:

aio_read(), aio_write(), lio_listio(), aio_suspend(), aio_cancel(),
aio_error(), aio_return(), aio_fsync().
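
To give a feel for the interface, here is a minimal C sketch that
submits a single asynchronous read with aio_read(), waits for it with
aio_suspend(), and collects the result with aio_error() and
aio_return(). A list of such control blocks could be handed to
lio_listio() in one call instead. The function name read_block_async()
is invented for this example, no completion signal is requested, and on
older C libraries you may need to link with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Read len bytes at the given offset of a file asynchronously.
 * Returns the number of bytes read, or -1 with errno set. */
ssize_t read_block_async(const char *path, off_t offset, void *buf, size_t len)
{
    struct aiocb cb;
    const struct aiocb *const pending[1] = { &cb };
    ssize_t got;
    int err;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_offset = offset;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal wanted */

    if (aio_read(&cb) == -1) {
        err = errno;
        close(fd);
        errno = err;
        return -1;
    }

    /* The request is now in flight; a real program would do other
     * useful work here before waiting. */

    while ((err = aio_error(&cb)) == EINPROGRESS)
        aio_suspend(pending, 1, NULL);

    got = aio_return(&cb);      /* must be called exactly once */
    close(fd);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return got;
}
```

Whether the request is served by the kernel or by helper threads inside
the C library is an implementation detail the interface hides.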

For those of you who have become interested in POSIX.4 (I certainly
hope so! :-), there exists a good book

Bill O. Gallmeister, POSIX.4 -- Programming for the Real World,
O'Reilly & Associates, 1995, ISBN 1-56592-074-0.

This book is not only a good introduction to POSIX.4, it is also an
easy-to-read way into the world of real-time operating systems
for those developers who have so far been very UNIX and time-sharing
oriented.

And you can order the POSIX.4 standard (IEEE 1003.1b) as well as the
other POSIX standards (IEEE 1003.1 (the classic one), .1a (symbolic
links, etc.), .1c (threads), .2 (shell) and .3 (testing)) directly
from IEEE:

phone: +1 908 981 1393 (TZ: Eastern Standard Time)
+1 800 678 4333 (from US+Canada only)
fax: +1 908 981 9667
e-mail: customer.services@ieee.org

Markus

--
Markus Kuhn, Computer Science student -- University of Erlangen,
Internet Mail: <mskuhn@cip.informatik.uni-erlangen.de> - Germany
WWW Home: <http://wwwcip.informatik.uni-erlangen.de/user/mskuhn>