Re: [PATCH v2 2/2] doc: detailed documentation for completion

From: Jonathan Corbet
Date: Wed Dec 31 2014 - 17:50:52 EST


On Tue, 23 Dec 2014 20:41:40 +0100
Nicholas Mc Guire <der.herr@xxxxxxx> wrote:

> diff --git a/Documentation/scheduler/completion-design.txt b/Documentation/scheduler/completion-design.txt
> new file mode 100644
> index 0000000..ec3e320
> --- /dev/null
> +++ b/Documentation/scheduler/completion-design.txt
> @@ -0,0 +1,498 @@
> +completion - wait for completion handler
> +========================================
> +
> +Origin: Linus Torvalds, kernel 2.4.7, 2001 [1]
> +Location: kernel/sched/completion.c [11]
> + include/linux/completion.h
> +Users: all subsystems - mostly wait_for_completion and
> + wait_for_completion_timeout is in use.

Again, I'm not sure we need this bit of stuff.

> +This document was originally written based on 3.18.0 (linux-next)
> +For general constraints and usage of completion see completion.txt

completion*s*

> +
> +Design Goal:
> +------------
> +
> + - Replace the semaphores used as "code"-locks - primarily for driver
> + synchronization.
> + - Provide a primitive that is optimized for the contended case and
> + not for the uncontended case (as semaphores are).
> + - Allow multiple threads to wait for the same condition to 'complete'.
> + - Provide a code-lock (similar to pthread_barriers or conditional
> + variables)
> + just a bit simpler.
> + - Have an API that makes the intent of the code clear (which the use
> + of locks did not).
> + - Provide a proper way to yield the CPU for waiting on events which
> + yield(); does not do [7]
> +
> +Design:
> +-------
> +
> + Locking, from the threads perspective, is intended to be used for
> + exclusion which raises issues of priority inversion, deadlocks etc.
> + while waiting for a specific state change to occur aka "completion"
> + is a thread synchronization point - so the exact opposite of an
> + exclusion point.
> +
> + Locks have been traditionally in (mis)use for code-locking while they
> + are intended for data locking only.
> +
> + Completion provides a simple counter to indicate how many waiting

"Completions provide..."

> + threads can continue. If a single thread should be allowed to
> + continue, the counter is incremented by 1 if ALL should be allowed to

semicolon after 1

> + continue it is simply incremented by UINT_MAX/2 which ensures that
> + ALL will be woken. Note that this behavior implies a one-shot nature
> + of completion - that means it is intended to signal one event only

completion*s*...I won't point this out everywhere.

> + and if a new event/condition is to be signaled then it needs to be
> + reinitialized. Completion has no notion of unlock/lock rather it is
> + initialized in a "not done"/locked state and can be unlocked once.
> +
> + The tasks that wait_for_completion are put on a wait-queue and when the
> + condition they are waiting for is signaled by a thread calling complete()
> + on the appropriate struct completion, the tasks waiting (one or all) is
> + woken by a call to try_to_wake_up().
> +
> + As the completion structure does not have an/ associated lock of its

Extraneous "/"?

> + own the wait-queue lock is used in cases where locking is needed e.g.

"wait queue"

> + complete(),complete_all(),try_wait_for_completion() and
> + completion_done().
> +
> + Completion replaced the (mis)used semaphores in synchronizing execution
> + paths - notably when there were multiple waiters.
> +
> +<snip>
> + The basic summary is that we had this (fairly common) way of
> + waiting for certain events by having a locked semaphore on the
> + stack of the waiter, and then having the waiter do a "down()"
> + which caused it to block until the thing it was waiting for did
> + an "up()".
> +<snip> [1]

A quote like this could use some context. "As Linus put it in 2001:" or
some such?

> + The design and also the core API implementation has been stable
> + since 2004.
> +
> + The _timeout/_interruptible API extension was added in 2004 [10]
> + to allow converting the racy sleep_on() users to the sound
> + completion API.
> +
> + An extension initially for the XFS object flushing was added in

s/the//

> + 2008 extending the API by the try_wait_for_completion() and

here too

> + completion_done() allowing safe use with held locks.
> +
> + The last extension was to add the _io variants in 2013 [14]. These
> + were added to correct iowait time accounting when the completion
> + is used for waiting on IO (e.g. completion of a bio request in
> + the block layer). Currently this is only used in the block layer.
> +
> +
> +Design rational for a new primitive:

rationale

> +------------------------------------
> +
> + So why not extend semaphores rather than add a new primitive ?
> +
> + - Semaphores are optimized for the non-contention case. The "wait
> + for completion" usage has the opposite default case.
> + - The optimization of semaphores is architecture-specific, making
> + it hard to change this and probably would have not been possible
> + without a penalty. There as at least an attempt to provide
> + completion as a wrapper to semaphores [3] but this path was not
> + further followed.
> + - Problem with semaphores that are declared on the local stack
> + doing down(); return; [1] or:
> + down(); kfree() on the semaphore [2].

This is not a complete sentence, and it's not clear what the point is.

> + - The intent of the code is not made clear when using locks
> +
> +
> +Implementation:
> +===============
> +
> + Completion is built on top of the generic kernel event infrastructure,
> + it can be seen as a specialized front-end to events.
> +
> + Basically there are three parts to the API, the initialization
> + of the completion, the waiting part through a call to a variant
> + of wait_to_completion and the signaling side through a call to
> + complete(). The structure used for handling of completion is:
> +
> + struct completion {
> + unsigned int done;
> + wait_queue_head_t wait;
> + };
> +
> + completions primitives come in two big groups, blocking through
> + enqueuing the task on the wait-queue and non-blocking which will never
> + queue the task. The non-blocking variants are prefixed with a
> + try_ (e.g. try_wait_for_completion()).
> +
> +
> +init_completion()/reinit_completion:
> +
> + As the default state is "not available" all that needs to be done is
> + to initialize the counter to 0 and provide an initialized wait queue
> + for potential waiters to enqueue on. For struct completion *x this is
> + thus two lines:
> +
> + x->done = 0;
> + init_waitqueue_head(&x->wait);
> +
> + The re-initialization simply resets the done variable to "not done"
> + thus 0. So reinit_completion is one line:
> +
> + x->done = 0;
> +
> +
> +complete()/complete_all():
> +
> + A call to complete signals completion by incrementing the wait
> + condition, x->done++ for signaling only one waiter, respectively
> + adding UNIT_MAX/2 to signal all waiters. Both complete() and
> + complete_all() serialize by holding the associated wait-queue lock.
> +
> + The call chain for complete() to wake up waiters for one task is:
> +
> + wake_up_locked
> + -> __wake_up_locked((x), TASK_NORMAL, 1)
> + -> __wake_up_common(q, mode, nr, 0, NULL); nr == 1
> + -> __wake_up_common
> + will call the first waiter in the queue
> + by calling:
> + -> autoremove_wake_function
> + -> default_wake_function
> + -> try_to_wake_up
> +

I'm not sure this is useful to understand completions. It's also the kind
of thing that goes out of date quickly.

> + for ALL tasks at once:
> +
> + wake_up_all_locked
> + -> __wake_up_locked((x), TASK_NORMAL, 0)
> + -> __wake_up_common(q, mode, nr, 0, NULL); nr == 0 == ALL
> + -> __wake_up_common
> + will iterate over the wait-queue and
> + wake all waiting threads by calling:
> + -> autoremove_wake_function
> + -> default_wake_function
> + -> try_to_wake_up
> +
> + The autoremove_wake_function was registered through a call in
> + prepare_to_wait_event (wait->func = autoremove_wake_function;)
> +
> + Note that __wake_up_common is not specific to completion and would
> + allow waking a defined number of waiters, but the completion interface
> + only provides one or all.
> +
> +
> +wait_for_completion():
> +
> + The default behavior is to wait without a timeout and mark the task
> + as uninterruptible (ignoring all signals until woken).
> +
> + wait_for_completion default uninterruptiple and MAX_SCHEDULE_TIMEOUT
> + -> wait_for_common
> + -> __wait_for_common
> + -> do_wait_for_common
> + -> __add_wait_queue_tail_exclusive
> + which will conditionally add the task to the
> + wait queue (if done == 0) and also checks
> + signals before setting the tasks state to
> + TASK_UNINTERRUPTIBLE via __set_current_state.
> +
> + The timeout and interruptible variants use the same call chain but
> + start at wait_for_completion_timeout() passing in a timeout rather than
> + MAX_SCHEDULE_TIMEOUT and wait_for_completion_interruptible() passes
> + TASK_INTERRUPTIBLE as the tasks designated state rather than

task's

> + TASK_UNINTERRUPTIBLE. The combined type is then
> + wait_for_completion_interruptible_timeout() which passes both a timeout
> + and TASK_INTERRUPTIBLE. Finally a last type is _killable which passes
> + TASK_KILLABLE, that is equivalent to (TASK_WAKEKILL |
> + TASK_UNINTERRUPTIBLE) as the designated tasks state and thus does not
> + need to be explicitly woken up to receive a terminating signal. It will
> + return a -ERESTARTSYS if interrupted or else 0 (completion achieved)
> + [16][17].
> +
> + The _io variants of wait_for_completion behave the same except for
> + setting the tasks current->in_iowait indicating that the thread is
> + waiting on IO which is used to account waiting times for the thread
> + via iowait_count and iowait_sum [9].
> +
> +
> +try_wait_for_completion()/completion_done():
> +
> + The try_wait_for_completion will not put the thread on the wait-queue
> + but rather returns 0 if it would need to enqueue (block) the thread,

Again, it's a bool; it returns FALSE

> + else it consumes any posted completion (x->done--) and returns.
> + try_wait_for_completion is primarily used to consume possibly missed

try_wait_for_completion()

> + events or clear any pending completions from failed actions
> + (e.g. see sound/soc/codecs/wm8996.c:wm8996_set_fll).
> +
> + Finally to check state of a completion without changing it in any way
> + is provided by completion_done();
> +
> +
> +Declaration and initialization:
> +-------------------------------
> +
> +Note on naming:
> +
> + completions should be named to convey the intent of the waiter
> + e.g.:
> + wait_for_completion(&early_console_added);
> + ...
> + complete(&early_console_added);
> +
> + is easier to understand than a more or less meaningless variable
> + naming like:
> + wait_for_completion(&Completion);
> + ...
> + complete(&Completion);
> + unfortunately the later being not that uncommon.
> +
> +
> +dynamic declaration and initialization:
> +
> + struct completion setup_done;
> +
> + init_completion(&setup_done)
> +
> + Simply initialize "done" to 0 (calls to wait_for_completion() will
> + thus block) as well as initializing the wait queue.
> +
> +
> + reinit_completion(&setup_done)
> +
> + This reinitializes "done" to 0 and leaves the wait-queue untouched.
> +
> +
> +static declaration and initialization:
> +
> + static DECLARE_COMPLETION(setup_done)
> +
> + File scope static initialization.
> +
> + DECLARE_COMPLETION_ONSTACK(setup_done) [15]
> +
> + This declares and initializes a completion structure on the kernel
> + stack. This is for initializing local/automatic completion variables
> + on the kernel stack. Under the hood this calls the dynamic
> + init_completion on the struct completion passed.
> +
> +
> +Usage:
> +======
> +
> + The basic usage is simply described by the initial post [1]
> + describing a replacement for the misused semaphores.

This is fine, but why not just reference that nice usage document you
posted as part 1?

> +<snip>
> + So instead, I introduced the notion of "wait for completion":
> +
> + struct completion event;
> +
> + init_completion(&event);
> + ... pass of event pointer to waker ..
> + wait_for_completion(&event);
> +
> + where the thing we're waiting for just does "complete(event)"
> + and we're all done.
> +<snip>
> +
> + The API has been expanded a bit since the initial release to handle
> + timeouts and interruptible and killable completions as well as
> + non-blocking try_ variant.
> +
> + the most common usage is wait_for_completion and the timeout variant:
> +
> + wait_for_completion 562
> + wait_for_completion_timeout 428
> + wait_for_completion_interruptible_timeout 73
> + wait_for_completion_interruptible 68
> + try_wait_for_completion 8
> + wait_for_completion_io 6
> + wait_for_completion_io_timeout 1
> +
> + [number of call sites as of 3.18.0-rc7]

Not sure this is useful - and it will immediately go out of date.

> + Note that the number of call sites need to reflect the usage as some
> + calls sites are in wrapper functions - this should just show the
> + relative distribution and the rough overall usage.
> +
> +
> +complete()/complete_all():
> +
> +
> + complete(struct completion *x)
> +
> + Increment the "done" field (x->done++) allowing one waiter to proceed.

It also does a wakeup - seems worth a mention somehow.

> + complete_all(struct completion *x)
> +
> + Increment the counter (x->done += UNIT_MAX/2) to effectively a large
> + enough number to ensure that all waiters will be woken. Note that
> + the counter is not zeroed, so the value of x->done found there after
> + a call to complete_all() is more or less meaningless. Calling
> + complete_all() multiple times will role-over but always be "large"

"roll over" (or, better, "overflow").

> + never the less calling complete_all() multiple times on the same

"nevertheless"

> + completion object is most likely a bug.

If we're going to give that advice, it seems that mixing complete() and
complete_all() is also a path to little joy...

> +
> +wait_for_completion*():
> +
> + wait_for_completion(struct completion *x)
> +
> + for blocking wait on an event like a startup/shutdown or a sent
> + command to complete and (properly) yield the CPU for productive
> + use in the meantime.
> +
> +
> + wait_for_completion_interruptible(struct completion *x)
> +
> + Basically a blocking call but it may return immediately if a
> + restart has been received already by the time call. In most

Not sure what "restart has been received already by the time call" means.
It returns if a signal arrives...?

> + cases were _interruptible variants are used, the interruption
> + indicates an error condition, so the _interruptible versions
> + should always be checking the return value.

And that return value is, exactly...? :)

> + wait_for_completion_io_timeout(struct completion *x,
> + unsigned long timeout)
> +
> + Blocking call with a timeout rather than waiting for ever, the return

"forever"

> + value is the remaining wait time if completion occurred (at least 1
> + though to ensure the conditions can be differentiated) and 0 if
> + timeout occurred this somewhat strange handling of the timeout value

"occurred. This..."

> + is due to a race [4] where system load could lead to the condition
> + being achieved and the timeout occurring anyway..
> +
> +
> + wait_for_completion_interruptible_timeout(struct completion *x,
> + unsigned long timeout)
> +
> +
> + Blocking call combining the above two cases with the return value
> + differentiating the cases as follows:
> + -ERESTARTSYS if interrupted,
> + 0 if timed out,
> + >0 (1 or number of jiffies left till timeout) if completed.

s/till/until/

> +
> +
> + wait_for_completion_killable(struct completion *x)
> +
> + Blocking call like wait_for_completion_interruptible but will only
> + terminate wait if a SIGKILL (__fatal_signal_pending) is received

There are a number of fatal signals normally, not just SIGKILL

> + rather than on any signal as wait_for_completion_interruptible. Note
> + the return value is also -ERESTARTSYS in case a SIGKILL was received
> + and 0 if completion occurred.
> +
> +
> + wait_for_completion_killable_timeout(struct completion *x,
> + unsigned long timeout)
> +
> + Blocking call like wait_for_completion_interruptible_timeout just
> + that receiving -ERESTARTSYS indicates that SIGKILL was received.
> +
> +
> +_io() variants:
> +
> + wait_for_completion_io(struct completion *);
> + wait_for_completion_io_timeout(struct completion *x,
> + unsigned long timeout);
> +
> + These calls are essentially the same as there non-_io variants
> + except that they call io_schedule_timeout() rather than
> + schedule_timeout(). Essentially the difference is that the first
> + sets current->in_iowait = 1 indicating that the task is waiting for
> + IO and not just sleeping.
> +
> +
> +_try() variant:
> +
> + try_wait_for_completion(struct completion *)
> +
> + The non-blocking try_ variant was introduced to allow
> + use of completions to begin safely while holding locks. If it
> + would block the thread (x->done is 0) it returns 0 else 1

Again, it's bool

> + indicating that completion was achieved at the time of call.
> + Note, that it can block on the wait-queue lock if complete() or
> + complete_all() is in progress - the non-blocking refers only to
> + not enqueuing the calling thread on the wait-queue.

I'm not sure I'd use "block" in that context; it might wait for a spinlock
but that wait will be brief. We hope.

> + completion_done(struct completion *)
> +
> + This is a non-blocking check for completion in progress, returning
> + 0 if there are waiters and 1 otherwise. In some rare cases (e.g. in

returns a bool

> + kernel/stop_machine.c:stop_machine_from_inactive_cpu)
> + completion_done issued in a busy-wait loop like so:

s/issued/is called/

> + while (!completion_done(&done.completion))
> + cpu_relax();

...but do we want to encourage people to code that type of loop? It
shouldn't be needed in too many places...

> +ARM specific: per_cpu completion in ARM (Nicolas Pitre)
> +
> + This currently ARM specific extension provides basic IPI triggered
> + completion support through:
> +
> + register_ipi_completion - to register a per_cpu completion
> + ipi_complete - to signal completion
> +
> + currently this is used only on ARM (see: arch/arm/kernel/smp.c)
> +
> +
> +Notes on RT:
> +============

Again, I wonder if this should be in the -rt tree instead.

> +
> + * in PREEMPT_RT the wait-queue used in queuing tasks is changed to a
> + simple wait-queue to minimize the lock contention of the queue
> + related lock [6].
> + * PREEMPT_RT only changes the completion usage related to stop_machine
> + [13]
> +
> +
> +Notes on history of completion:

"Notes on the history of completions"

> +
> + After the initial API proposal focused on resolving the semaphore
> + misuse for code synchronization [1] the API was relatively quickly
> + extended by the _timeout and _interruptible variants (2004) [10].
> + The completion API was extended by the try_wait_for_completion()/
> + completion_done() to accommodate for XFS object flushing needs which
> + did not quite fit the available completion API [12]. The _killable
> + variant was added in 2007 [17] together with a whole set of event
> + subsystem primitives allowing asynchronous kill without wakeup [16]
> + to take care of unkillable tasks. _killable_timeout was later added
> + for Ceph waiting on MDS requests [16]. The final API extension was
> + for provision of proper scheduling accounting in the block layer which
> + added _io versions [14].
> + As completion is a "code-lock" it resided in the core scheduling
> + code (kernel/sched.c) for a long time, it was not until 2013 [11]
> + that it was moved into a separate kernel/sched/completion.c file.
> +
> +References:
> +===========
> +
> +Link: 1 http://lkml.iu.edu/hypermail/linux/kernel/0107.3/0674.html
> +tink: 2 http://lkml.iu.edu/hypermail/linux/kernel/0107.3/0733.html
> +Link: 3 http://lwn.net/Articles/277621/
> +Link: 4 /lkml.iu.edu/hypermail/linux/kernel/0806.2/1659.html
> +Link: 5 https://lkml.org/lkml/2014/7/5/229
> +Link: 6 https://www.kernel.org/pub/linux/kernel/projects/rt/
> + Thomas Gleixner completion-Use-simple-wait-queues.patch
> + part of the rt patch series
> +File: 7 see the notes in kernel/sched/core.c:yield why not to use it
> +Link: 8 LWN Porting Drivers to 2.6 series
> + http://lwn.net/Articles/23993/
> +File: 9 see kernel/sched/fair.c:enqueue_sleeper
> +Link: 10 http://lkml.org/lkml/2004/10/20/461
> +Link: 11 http://lkml.org/lkml/2013/11/6/166
> +Link: 12 http://oss.sgi.com/archives/xfs/2008-06/msg01320.html
> +Link: 13 https://www.kernel.org/pub/linux/kernel/projects/rt/
> + Thomas Gleixner stomp-machine-raw-lock.patch.patch
> + part of the rt patch series
> +Link: 14 http://lkml.org/lkml/2013/2/14/209
> +Link: 15 http://lkml.org/lkml/2006/9/28/83
> +Link: 16 http://lkml.org/lkml/2010/5/24/198
> +Link: 17 http://lwn.net/Articles/288056/
> +Link: 18 https://www.mail-archive.com/git-commits-head@xxxxxxxxxxxxxxx/msg37284.html

Again, thanks for doing this.

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/