[RFC/PATCH] FUSYN 1/10: documentation files

From: inaky . perez-gonzalez
Date: Wed Jan 14 2004 - 17:59:14 EST


fusyn.txt | 559 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 559 insertions(+)

--- /dev/null Wed Jan 14 14:39:30 2004
+++ linux/Documentation/fusyn.txt Mon Jan 5 10:22:44 2004
@@ -0,0 +1,559 @@
+
+FUSYN - Fast User SYNChronization primitives
+
+http://developer.osdl.org/dev/robustmutexes/
+
+I am calling these things FUSYNs to distinguish them from the original
+futexes they base on, as they behave kind of different (the naming
+sucks, you are welcome to suggest new names).
+
+This is my second big time attempt to implement real-time locking in
+the Linux kernel to solve the short comings of a locking system based
+on futexes, being mainly:
+
+ - no robustness support without kernel cooperation
+
+ - no priority inheritance/protection
+
+ - no real-time wake up (priority based, no priority inversion
+ holes, etc)
+
+The main objects to implement are:
+
+ - fuqueues: Priority-sorted wait queues -- other than that, mostly
+ equal to futexes. Usable from kernel and user space.
+
+ - fulock:
+
+ This is a full blown "mutex" implementation that can be used from
+ kernel and user space (with user-space fast [un]locking on
+ non-contended situations), robustness (if owner dies, ownership is
+ passed on), priority inheritance and (FUTURE) priority protection.
+
+ They are just a fuqueue that supports the ownership concept, to
+ allow for robustness and priority inheritace/protection.
+
+ It also supports serialized and non-serialized unlocks [see FAQ].
+
+All the non-blocking calls (wake() and unlock() can be used from
+interrupt context; the data structures are protected by IRQ safe spin
+locks). This is heavier weight, but is needed to properly support
+priority inheritance and to avoid cache line bouncing of the spin
+locks that protect the structures.
+
+Released files:
+
+http://developer.osdl.org/dev/robustmutexes/
+
+fusyn-2.6.0XXX-2.X.patch Patch file against Linus' tree
+fusyn-test-2.X.tar.gz Sample test library and code
+
+
+Contents:
+---------
+
+- vlocators
+- fuqueues
+- fulocks
+- issues/future work
+- FAQ [some definitions here, try wild search]
+
+
+
+
+VLOCATORS
+---------
+
+This is an structure (struct vlocator) and the associated code to map
+a user space address to a kernel object that is kept in a hash
+table. As well, it provides and uses a reference count for the object,
+as well as a hash table cleanup function.
+
+It uses the 'struct futex_key' as done by Jamie Lokier; the code is in
+include/linux/vlocator.h and kernel/vlocator.c.
+
+Two very simple operations: find an object in 'current's space by
+address and find-or-allocate (vl_find() and vl_locate()).
+
+The cleanup function (or garbage collector) runs periodically and
+releases items with a reference count of zero. This allows the get/put
+operations to be lockless.
+
+
+FUQUEUES
+--------
+
+Fuqueues are just wait-queues, like futexes; the differences are in
+the wake up process, as it is done not in a FIFO order but by
+priority. As well, if the task changes its priority while waiting, its
+position in the list is updated. The code is in include/linux/fuqueue.h
+and kernel/fuqueue.c.
+
+They consist of a 'struct fuqueue' which has a priority-sorted wait
+list and a lock to protect access to it. They can be used in
+kernel-space as a wait queue.
+
+Entry points:
+
+fuqueue_wait() -- wait on a fuqueue
+fuqueue_wake() -- wake N waiter from a fuqueue with code C.
+
+The code is split in various functions to allow fulocks to use the
+fuqueue stuff for integration. The wlist*_t thing is a reminder of
+something that will go away; it stands for 'wait list' and is setup
+with a #define based redirection to support different types of sorted
+wait list implementations (for example, one that is O(1) using a
+priority array -- that is huge). That is going to be deprecated in
+favor of a O(1) priority sorted list that is not as big (see FUTURE
+WORK).
+
+'struct ufuqueue' is a fuqueue plus the stuff to link it to a possibly
+shared user space address (a vfuqueue) (the vlocator), so that is the
+real futex equivalent. The code is in kernel/ufuqueue.c and just
+consists on the syscall wrappers to associate the proper ufuqueue to
+the vfuqueue and then call the fuqueue layer.
+
+
+FULOCKS
+-------
+
+The mother of the whole thing. Fulocks are a full mutex
+implementation; it is basically the concept of an owner and a list of
+tasks waiting to own the mutex (implemented with a 'struct fuqueue').
+
+The 'struct fulock' holds everthing. To support the different modes
+of operation (priority inheritance, priority protection, deadlock
+check and sun-mode robustness [see FUTURE WORK]) there is a 'flags'
+member.
+
+As well, there is an ownership list node, where all the fulocks that a
+task currently owns are linked to the task (task->fulock_olist).
+
+As well, a fulock has the concept of 'state': healthy, dead-owner or
+not-recoverable (see the FAQ for the definitions).
+
+The entry points are [kernel/fulock.c, include/linux/fulock.h]:
+
+fulock_lock()
+fulock_unlock()
+fulock_consistency() [for manipulating the state]
+
+A user level fulock (struct ufulock) is a fulock that can be used from
+the user space--it is represented by a (potentially shared) memory
+address (a vfulock). A vlocator is used to track it. Implemented in
+kernel/ufulock.c.
+
+The vfulock may have different values that server to define the state
+of a lock:
+
+0 Unlocked [may be fast-locked]
+PID (< VFULOCK_KCO) Fast-locked by PID, no waiters in the
+ kernel. [May be fast-unlocked].
+VFULOCK_KCO Locked by someone, kernel knows who, waiters
+ in the kernel.
+VFULOCK_DEAD Previous owner died (maybe unlocked or
+ locked), the kernel keeps the status.
+VFULOCK_NR Not recoverable.
+
+Now, when user space goes to lock a ufulock with a fast operation, it
+issues an atomic compare and swap of its PID against 0; if it
+succeeds, its the owner, done; if not, it goes to the kernel
+(sys_ufulock_lock()), who will put it to wait [see
+test/src/include/kernel-lock.h:vfulock_lock() in the fusyn-test
+package].
+
+Unlock is fairly similar: if the value is VFULOCK_{KCO,DEAD}, go to
+the kernel, sys_ufulock_unlock(); if VFULOCK_NR, return error; if not,
+it is a PID and need to do an atomic compare and exchange of zero
+(unlock) against the PID [again, check vfulock_unlock()].
+
+Now, how it works is fairly simple: the kernel will always maintain
+the vfulock and the corresponding fulock in the 'struct ufulock' in
+sync [vfulock_sync() in kernel/ufulock.c], and will do that everytime
+we enter it through one of the fulock system calls
+(sys_ufulock_{[un]lock,consistency}().
+
+The kernel will use the PID set by the fast-locker to match who is the
+owner when he doesn't know about it [afterwards it will be registered
+in the kernel)--check __fulock_id_owner() for ideas on how to avoid
+collision due to PID reuse].
+
+Once that is done, what is left is a 'fulock' that can be handled by
+the fulock layer.
+
+Now [uv]fulocks support:
+
+ - Real time: the unlock procedure is realtime in the sense that it
+ is O(1) and the next owner is the highest priority one; as well,
+ the fulock (actually, the vfulock) is never unlocked in the
+ meantime, the ownership is transferred instead of unlocking the
+ lock, waking up the first waiter and waiting for it to acquire
+ it. This avoids priority inversions by lower priority threads
+ sneaking in from other processors at the worst time.
+
+ - Deadlock checking: complex dead lock scenarios where a
+ ownership/wait chain [see definition in FAQ] is involved are
+ catched if FULOCK_FL_ERROR_CHK is set.
+
+ - Priority change support: when the priority of the waiting task
+ is changed, it's position in the list is updated. See below for
+ effects on priority inheritance.
+
+ - Robustness: when a task who is a fulock owner dies and the
+ kernel knew about it (ie: it was registered in the
+ task->fulock_list), then the fulock is made dead-owner, unlocked
+ and the next waiter gets ownership, with a -EDEADOWNER return
+ code.
+
+ This is always enabled; user space can emulate the
+ hangs+timeouts that would happen if this were not detected.
+
+ If the kernel knew nothing about it (ie: it was fast-locked),
+ then __fulock_id_owner() will fail to map the PID in the vfulock
+ to an existing task; then the current claimer would be
+ considered the owner after marking the fulock dead-owner.
+
+ Note the comments in __fulock_id_owner() for ideas on how to
+ avoid collisions due to PID reuse.
+
+ - Priority protection: not implemented yet
+
+ - Priority inheritance: when a waiter queues for a fulock that has
+ the FULOCK_FL_PI bit set and its priority is higher than that of
+ the owner, it will boost the owner's priority to its own; this
+ will propagate in an ownership/wait chain (if the owner was
+ waiting on for a fulock, etc). As well, priority changes will
+ also be propagated.
+
+ This is done with __fulock_pi_boost(), fuqueue_chprio() and
+ __fulock_chprio() [through the fulock's 'struct
+ fuqueue_ops']. The unboosting is done in __fulock_unlock() or
+ __fulock_wait_cancel() [again, through the fulock's 'struct
+ fuqueue_ops'].
+
+
+FUTURE WORK
+-----------
+
+ - fucond: conditional variables; although they can be implemented
+ in user space + fuqueues, doing it in the kernel helps a lot in
+ atomicity issues (and the performance should be much better).
+
+ We tried doing that (see releases up to 1.12 in the website) and
+ generally it sucked because of the code bloat in the kernel, so
+ we decided to extirpate it.
+
+ - rw lock: only the first locker can do the fast operation; the
+ others go to the kernel to sign up. This way ownership is
+ supported. If a reader dies, nothing happens (after all, it is
+ supposed to be read-only access), but we need to keep track of
+ readers dying so they don't hold writers off. If a writer dies,
+ next locker (reader or writer) gets dead-owner.
+
+ These guys could also get, like this, PI and PP, as they would be
+ very similar to fulocks, but with two waiting lists. One for
+ writers, one for readers, and they allow many ownerships at the
+ same time (when there are readers).
+
+ Maybe different operation modes to primer writers over readers?
+ FIXME, need to explore.
+
+ - Spinlocks: they could be implemented as a trylock() on a fulock
+ for N loops, and after it'd degenerate into a mutex wait. This
+ wait they'd automagically support robustness, PI and PP.
+
+ - Barriers: futexes offer enough functionality for implementing
+ them, however wake up should be real-time (priority based). Not a
+ real issue though, as in barriers everybody is woken up. It can be
+ done also with fuqueues.
+
+ - Getting rid of the vlocator hash table and doing direct mapping
+ [so that we avoid the O(N) lookup] by storing in user space some
+ short of pointer to a in-kernel data struct. The pointer has to be
+ "validated", so that user space cannot have the kernel point to
+ some random or pontentially dangerous space.
+
+ A way would be to store two values, the pointer itself plus a
+ kernel-crypted copy that the can be used to verify.
+
+ Need more research into this.
+
+ - O(1) priority list: current plist is not O(1) in addition, because
+ it has to locate the proper position in the list where to add. I
+ plan to modify the plist code to be O(N) where N is the number of
+ priority levels, and as it is fixed at compilation time, it is
+ effectively O(1).
+
+ The idea is to have something similar to a priority array, but
+ instead of having N list heads, we have only the first node of
+ each priority being the list head, and the rest of the guys in
+ that prio hanging from him.
+
+ - Sun-mode robustness. Solaris implements robustness in a slightly
+ more restrictive way. We want to add an small compatibility layer
+ so both models can be used.
+
+ - Support for architectures that don't have atomic compare and
+ swap. Once priority protection is finished (that will involve that
+ a pure KCO method is developed), it can be solved using this;
+ however, they will loose the fast-lock path--maybe not an issue?
+ how many arches do not implement atomic compare and exchange?.
+
+ - That page pinning when waiting for a fulock...
+
+
+FAQ
+---
+
+This set of Q&A is what I use myself to track my ideas and concepts
+(and not to forget why did I decide anything).
+
+
+Q: What is PI?
+
+Priority Inheritance: when task A holds resource R and task B claims
+it, and prio (B) > prio (A), then B can force A to take its priority
+so it finishes sooner and B can take the resource ownership. The
+priority boost ends when A releases R.
+
+
+Q: What is PP?
+
+Priority Protection: resources have an associated priority ceiling;
+any task that acquires a resource will have its prio raised to that
+prioirty ceiling while holding it.
+
+
+Q: What is RM?
+
+Robust Mutex, or robustness, for short: when the owner of a resource
+dies, we want the next owner to know that somebody died while holding
+the resource, so s/he is able to determine if a cleanup is needed.
+
+
+Q: What is a healthy fulock?
+
+This is a fulock in its normal state, that is: initialized and not in
+dead-owner or not-recoverable states.
+
+
+Q: What is a dead-owner fulock?
+
+A fulock is in dead-owner state when it was locked (some task owned
+it) and the task died without unlocking it.
+
+
+Q: What is a not-recoverable fulock?
+
+A fulock is in not-recoverable state when it went into dead-owner
+state and some task that acquired it in dead-owner state decided that
+it had to be made not-recoverable.
+
+The rationale behind this is that normally you have some lock
+protecting access to some data. When the lock goes dead-owner, the
+task that owned it and died could have died in the middle of updating
+the data, and thus it can be inconsistent. Subsequent owners of the
+lock get it with the dead-owner state, so that they are aware of the
+situation. If any of them can fix it, it can move the lock back to
+healthy state and continue operating, but if there is no way to fix
+the data, it is moved to not-recoverable state.
+
+When moved, all the pending waiters are given an error code (EBADR)
+indicating the new state, so that they can bail out and report up to
+their managers for what to do. As well, new contenders that try to
+acquire the lock will get also the EBADR error code.
+
+The only way to make the fulock healthy again is to reinitialized it.
+
+
+Q: What is a dead-owner dead-lock?
+
+When some task that has to unlock a locked fulock dies and others are
+waiting for it to release the fulock.
+
+
+Q: What is a dead-owner recovery?
+
+When a lock owner dies, the next waiter or next guy who locks and gets
+ownership gets it with an special code that indicates that some
+previous owner died and that the state of the lock is "dead-owner",
+that recovery on the data structures protected by the lock must be
+done in order to ensure consistency.
+
+Once a fulock is in dead-owner state, it can be moved back to
+normal/healthy or made inconsistent (so only an initialization returns
+it to normal).
+
+
+Q: Why does the kernel have to set the value of the fulock?
+ Why cannot the value of the fulock after unlock be set by user
+ space?
+
+There is a risk of overwritten values and missed waiters.
+
+For example, task B claims fulock F (locked by task A) so it goes to
+the kernel to wait; now the fulock value is VFULOCK_KCO (kernel
+controlled ownership). Before it reaches the kernel, task C releases
+the fulock for task A; as there are no waiters, it returns UNLOCKED
+and task C has to set it to UNLOCKED, thus overwriting VFULOCK_KCO; as
+KCO is overwritten, task B is going to be dead-locked in the kernel,
+waiting.
+
+Furthermore, it helps guaranteeing robustness. If the just-woken up
+task that has to set the value the kernel passes dies before being
+able to do it, you hit a dead-owner dead-lock because nobody wakes up
+the waiters until somebody tries to lock and realizes the fulock is
+dead.
+
+
+Q: What are the two kinds of unlock?
+
+The two kinds of unlock are serialized and non-serialized.
+
+You need both because the serialized one can be slower, as it might
+force a context switch.
+
+I thought initially that this would show only in synthetic benchmarks
+(very tight loop acquiring and releasing the lock against some other
+threads doing the same thing), but I was wrong. Max Hailperin pointed
+to me that what I was seeing was the "Convoy Phenomenon", documented
+in by Mike Blasgen, Jim Gray, Mike Mitoma and Tom Price, in 1977
+[http://portal.acm.org/citation.cfm?id=850659&jmp=cit&dl=GUIDE&dl=ACM].
+
+After thinking about it, I concluded it would mostly apply in the
+following conditions:
+
+- user space only [in kernel space the lock itself is the spinlock
+ that protects the 'struct fulock'; as well, there is no preemption
+ and we spin]
+
+- non real-time environments/processes (where preemption is likely)
+
+- real-time SMP environments where two taks might compete for a lock
+ at the _same_ time (so to avoid it in this case, it might be
+ interesting to spin a wee bit before blocking).
+
+Now, in order to gain robustness you need serialization, so a
+userspace user is recommended to use serialized wake ups only when:
+
+- need *full* and complete robustness guarantee
+
+- needs real-time priority-inversion protection guarantee (in SMP, not
+ needed for UP).
+
+
+Q: What is a non-serialized unlock?
+
+A non-serialized unlock works by setting the fulock to unlocked and
+waking up as many waiters as desired. The waiters then re-contend for
+the fulock, the winner owns it and the others go back to wait on it.
+
+Main problem: can't find a way to guarantee robustness
+
+- Say we have a fulock with M guys waiting and we wake up N (1 < N <
+ M), a non-serialized wakeup. Thus, there are M - N guys still
+ waiting in the kernel.
+
+ In order for the unlock to be non-serialized, the waker first sets
+ the futex to UNLOCKED.
+
+ Now, how do the woken up processes know that there are still
+ processes in the kernel?
+
+ A solution is to set the lock not to UNLOCKED, but to
+ UNLOCKED+WP (ie: maintain the WP); this way, whowever tries to
+ acquire will see UNLOCKED+WP and will go down to the kernel to do
+ the lock operation. Also, the lock operation can return an special
+ code that says "contend, but set the lock to pid|WP" to indicate
+ waiters present.
+
+ However, it still does not solve the fact that when setting to
+ UNLOCKED+WP and waking N, if those N die before locking, the
+ waiters go into dead-owner dead-lock.
+
+ When somebody tries to lock that, the kernel should be able to
+ notice that there are waiters and it is unlocked and thus give
+ way to the first locker with dead-owner recover --it might be too late.
+
+ Another option might be to tag the woken-up processes before they
+ exit the kernel, so that if they die, do_exit can trap it (but there
+ are many side issues to this, like how do I make sure that the N who
+ I woke up have gone through it, one has locked, the other N-1 have
+ gone to sleep, how do I clean it up and stuff like that--makes it
+ pretty ugly, not to talk about how many resources it'd need to tag it).
+
+ Gosh, it is a mess -- I would say that robust mutexes have to
+ require serialized unlocks. Period.
+
+ Not even talk about adding a short timer to verify that the thing
+ was locked...shrivers
+
+ RESEARCH: tentative ownership: when we wake up some guys who are
+ supposed to go and try to lock again, tie the fulock they should
+ lock to their task_struct and on exit(), check they don't have it
+ there ... [many details need to be worked out].
+
+- It could also be solved by waking up _all_ the waiters; this way no
+ dead-owner dead-lock could ever happen; however, it is a sure recipe
+ for an scheduling storm some day.
+
+
+Q: What is a serialized unlock?
+
+A serialized unlock transfers the ownership of the fulock to the first
+waiter in the kernel.
+
+- Only one waiter can be woken up at the same time with this method.
+
+- It prevents priority inversion (as the fulock stays locked during
+ the whole operation no low priority thread can acquire it in the
+ meantime).
+
+- dead-owner dead-lock is not possible, because the owner is always
+ known during the operation. As well, if the new owner dies on it's
+ way up to user space, its ownership is also known.
+
+Slower (still not proved seriously--postulated and proven in some
+very synthetic benchmarks) because it forces a context switch.
+
+
+Q: What is an vfulock?
+
+It is the address in user space associated to a fulock in kernel
+space.
+
+
+Q: What is owner identification?
+
+Owner identification is a property that the fulocks have: basically,
+they can identify who is the owner based on the vfulock (in user
+space) or the internal kernel data structures that refer to it (if the
+vfulock is VFULOCK_KCO, that means that the kernel tracks the
+ownership); if vfulock < VFULOCK_KCO, it means that the ownership is
+tracked only in user space, and the vfulock is the PID of the owner.
+
+
+Q: What is a kernel-controlled-ownership fulock? (kco)
+
+A vfulock whose value is VFULOCK_KCO or VFULOCK_DEAD, indicating that
+the ownership for the fulock is tracked by the kernel [FIXME: also KCO
+fulock are those called with the FULOCK_FL_KCO flag--not implemented].
+
+This happens when:
+
+ - The fulock is locked and there are waiters on the kernel
+ - The fulock is dead (and the ownership keeps track for it)
+ - The fulock is a priority protected fulock
+
+Basically it is a way to indicate that the fastpath for
+locking/unlocking cannot be taken.
+
+
+Q: What is an ownership/wait chain?
+
+The structure that is formed when task A owns lock F and is waiting
+for lock G, owned by task B that is waiting for lock H, that is owned
+be task C that is waiting for lock I ... etc.
+
+When this chain is circular (eg: lock I is owned by A) then there is a
+deadlock.
\ No newline at end of file
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/