[ANN] Userspace M-on-N threading model implementation. Alpha release.

From: Evgeniy Polyakov
Date: Mon Jan 29 2007 - 09:54:11 EST


Hello.

I'm pleased to announce initial userspace M-on-N threading model
implementation (for hackers) called NTL.
This is first alpha release, which indeed has bugs and limitations.

Userspace M-on-N threading model is based on the idea, that when signal
is delivered, kernel saves all information related to previous context
in stack, so it is possible to find it and replace.

M-on-N threading model compared to usual NPTL 1-on-1 model has following
advantages and disadvantages:

Benefits.

1. Fast scheduling.
There is no need to cross userspace/kernelspace boundary to schedule new
thread execution (just watch what happens with userspace network stack
compared to kernel's one when there are a lot of syscalls performed for
small packets receiving/sending).

2. Fast thread creation and destruction.
It just becomes an allocation of the structure in the userspace, no need
for full creation process which is performed in clone() syscall.

3. Smaller number of cache misses.
Since there is only one process instead of several threads, cache
locality is increased greatly with reduced number of misses.

Drawbacks.

1. Scheduling fairness.
Since kernel does not know about multiple threads behind given process,
it can not add it appropriate number of timeslices for execution.
Can be solved either by more tight collaboarion of the userspace and
kernelspace schedulers or simply by increasing process' nice value.

2. All communications are performed through one kevent pipe. (TODO)
All syscalls are going to be converted into non-blocking operations
(including nanosleep() and the like), and keep a track of what each
context performed. In practice glibc rewrite is not what I would like to
do, but instead some layer on top of it will be implemented, which will
convert syscalls into kevent operations, and become a rescheduling
point.

3. Complex code for good SMP scalability and userspace scheduler.
Not a problem. (TESTING)

SMP scalability in M-on-N threading model.

Since only kernel can schedule thread (actually not even thread or
process, but its own kernel's representation, so called kernel's virtual
process) to run on specified CPU, M-on-N threading model should have
several real threads (for example several current POSIX threads), its
number should be equal to number of real CPUs, and then library layer
will schedule execution of context of different real threads, each of
which in turn can run on separate CPU.

So, userspace will create new real threads when pthread_create() is
called until number of them is less than number of real CPUs, each real
thread in turn is a context in the global set of contexts, where fake
context will be added with all subsequent pthread_create() calls, and
userspace scheduler (backed by real threads) will pick up several
contexts from the tree and execute them on the real CPUs.

I would be possible to use existing Linux clone() syscall, but due to
complete absence of hte documentation (which is sometimes plain wrong)
and ery strong encryption of glibc sources it is quite complex task.

As NPTL, M-on-N threading library uses stack rlimit for thread stack
allocation.

Benchmarks.

I only ran simple benchmark of empty thread creation (its function just
exits).
After I started to use atomic locks ("lock" prefix on x86) instead of
semaphores, thread start/empty exec/stop was reduced down to 0.3
microseconds compared to 14 microsecods for POSIX NPTL case.

But there are problems.
First one is that I perform initial context setup through signal
invokation, which is at least two syscalls. They are slow.
Another one is that thread is really started only after rescheduling,
which is another signal, so another two syscalls.
Third on is that there must exist different locking primitives - for
signal context and for process context, which must block signals, which
in turn adds additional overhead of sigprocmask() syscall.

After I fixed all above issues (actually not fixed, but confirmed that
they must exist), performance reduced to 9 microseconds compared to 14
microsecods for POSIX NPTL case for empty thread creation/destruction.

(Test machine is Core Duo 2.4 Ghz (run at 3.7) with 2 GB of ram).

This can be fixed, if I would have created arch-specific
getcontext()-like calls, which would be mutually transformable into
signal context information (existing getcontext() and friends produces
different data than signal context has at least on x86). But I can not
right now, since I do not know enough x86 ABI (I learned a lot for past
several days, as you can notice from this blog, but it is still even
remotely not enough).

Currently M-on-N threading model uses ugly arch-specific hacks to start
new threads, which actually are something remotely similar to
makecontext().
So, the solution, which will rock M-on-N threading implementation is to
convert or create getcontext() and friends calls which can be used with
signal context information.

Another limitations are:
* x86 only (I do not have different test boxes to learn different asm)
* does not work if compiled with position-independent code support
* does not work if some functions are inlined (so -fno-inline flag)
* no support for run-time syscall substitution (to make it rescheduling
point) yet
* looks like a real hack

Advantages:
* it is faster (noticebly faster)
* it is simpler
* code is not encrypted like in glibc sources

Sources and developement comments can be downloaded from NTL homepage at
http://tservice.net.ru/~s0mbre/old/?section=projects&item=threading

Archive is also attached for interested reader.

Thank you.

Signed-off-by: Evgeniy Polyakov <johnpol@xxxxxxxxxxx>

--
Evgeniy Polyakov

Attachment: threading-2007_01_29.tar.gz
Description: application/tar-gz