Re: [PATCH 1/5] locking: Introduce range reader/writer lock
From: Laurent Dufour
Date: Thu Mar 30 2017 - 10:56:21 EST
On 07/03/2017 06:03, Davidlohr Bueso wrote:
> This implements a sleepable range rwlock, based on an interval tree, serializing
> conflicting/intersecting/overlapping ranges within the tree. The largest range
> is given by [0, ~0 - 1] (inclusive). Unlike traditional locks, range locking
> involves dealing with both the tree itself and the range to be locked, which is
> normally stack allocated and always explicitly prepared/initialized by the user
> as a sorted [a0, a1] range with a0 <= a1, before actually taking the lock.
>
> We allow exclusive locking of arbitrary ranges. We guarantee that each
> range is locked only after all conflicting range locks requested previously
> have been unlocked. Thus we achieve fairness and avoid livelocks.
>
> When a new range lock is requested, we add its interval to the tree and store the
> number of intervals intersecting it in 'blocking_ranges'. When a range is
> unlocked, we again walk the intervals that intersect with the unlocked one and
> decrement their 'blocking_ranges'. We wake up the owner of any range lock whose
> 'blocking_ranges' drops to 0.
>
> For the shared case, the 'blocking_ranges' is only incremented if the
> intersecting range is not marked as a reader. In order to mitigate some of
> the tree walk overhead for non-intersecting ranges, the tree's min/max values
> are maintained and consulted in O(1) in the fastpath.
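
For reference, a minimal usage sketch based on the API this patch introduces
(the tree and function names below are purely illustrative, not from the patch):

	DEFINE_RANGE_RWLOCK_TREE(mytree);	/* one tree per protected object */

	static void frob_range(unsigned long first, unsigned long last)
	{
		struct range_rwlock range;	/* typically lives on the stack */

		/* the range must be sorted: first <= last */
		range_rwlock_init(&range, first, last);

		range_write_lock(&mytree, &range);
		/* ... exclusive access to [first, last] vs. overlapping ranges ... */
		range_write_unlock(&mytree, &range);
	}
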
>
> How much does it cost:
> ----------------------
>
> The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> the total number of ranges and R_int is the number of ranges intersecting the
> new range to be added.
>
> Due to their sharable nature, full range locks can be compared with rw-semaphores,
> which also serves as a comparison from a mutex standpoint, since writer-only
> situations are pretty similar nowadays.
>
> The first difference is the memory footprint: rwsems are larger than the tree lock,
> 40 vs 24 bytes, but the latter requires an additional 64 bytes of stack for the
> range structure.
>
> Secondly, because every range call is serialized by the tree->lock, any lock()
> fastpath will at least incur an interval_tree_insert() plus a spinlock lock+unlock
> overhead, compared to a single atomic instruction in the case of rwsems. The same
> obviously applies to the unlock() case.
>
> The torture module was used to measure 1-1 differences in lock acquisition with
> increasing core counts over a period of 10 minutes. Readers and writers are
> interleaved, with a slight advantage to writers, as the writer is the first kthread
> that is created. The following shows the avg ops/minute with various thread setups
> on boxes with small and large core counts.
>
> ** 4-core AMD Opteron **
> (write-only)
> rwsem-2thr: 4198.5, stddev: 7.77
> range-2thr: 4199.1, stddev: 0.73
>
> rwsem-4thr: 6036.8, stddev: 50.91
> range-4thr: 6004.9, stddev: 126.57
>
> rwsem-8thr: 6245.6, stddev: 59.39
> range-8thr: 6229.3, stddev: 10.60
>
> (read-only)
> rwsem-2thr: 5930.7, stddev: 21.92
> range-2thr: 5917.3, stddev: 25.45
>
> rwsem-4thr: 9881.6, stddev: 0.70
> range-4thr: 9540.2, stddev: 98.28
>
> rwsem-8thr: 11633.2, stddev: 7.72
> range-8thr: 11314.7, stddev: 62.22
>
> For the read-only and write-only cases, there is very little difference between the
> range lock and rwsems, with at most a ~3% hit, which could very well be considered
> within the noise range.
>
> (read-write)
> rwsem-write-1thr: 1744.8, stddev: 11.59
> rwsem-read-1thr: 1043.1, stddev: 3.97
> range-write-1thr: 1740.2, stddev: 5.99
> range-read-1thr: 1022.5, stddev: 6.41
>
> rwsem-write-2thr: 1662.5, stddev: 0.70
> rwsem-read-2thr: 1278.0, stddev: 25.45
> range-write-2thr: 1321.5, stddev: 51.61
> range-read-2thr: 1243.5, stddev: 30.40
>
> rwsem-write-4thr: 1761.0, stddev: 11.31
> rwsem-read-4thr: 1426.0, stddev: 7.07
> range-write-4thr: 1417.0, stddev: 29.69
> range-read-4thr: 1398.0, stddev: 56.56
>
> While single reader and writer threads do not show much difference, increasing
> core counts shows that in reader/writer workloads, writer threads can take a hit in
> raw performance of up to ~20%, while reader throughput is quite similar
> between both locks.
>
> ** 240-core (ht) IvyBridge **
> (write-only)
> rwsem-120thr: 6844.5, stddev: 82.73
> range-120thr: 6070.5, stddev: 85.55
>
> rwsem-240thr: 6292.5, stddev: 146.3
> range-240thr: 6099.0, stddev: 15.55
>
> rwsem-480thr: 6164.8, stddev: 33.94
> range-480thr: 6062.3, stddev: 19.79
>
> (read-only)
> rwsem-120thr: 136860.4, stddev: 2539.92
> range-120thr: 138052.2, stddev: 327.39
>
> rwsem-240thr: 235297.5, stddev: 2220.50
> range-240thr: 232099.1, stddev: 3614.72
>
> rwsem-480thr: 272683.0, stddev: 3924.32
> range-480thr: 256539.2, stddev: 9541.69
>
> Similar to the small box, larger machines show that range locks take only a minor
> (up to ~6% for 480 threads) hit even in completely exclusive or shared scenarios.
>
> (read-write)
> rwsem-write-60thr: 4658.1, stddev: 1303.19
> rwsem-read-60thr: 1108.7, stddev: 718.42
> range-write-60thr: 3203.6, stddev: 139.30
> range-read-60thr: 1852.8, stddev: 147.5
>
> rwsem-write-120thr: 3971.3, stddev: 1413.0
> rwsem-read-120thr: 1038.8, stddev: 353.51
> range-write-120thr: 2282.1, stddev: 207.18
> range-read-120thr: 1856.5, stddev: 198.69
>
> rwsem-write-240thr: 4112.7, stddev: 2448.1
> rwsem-read-240thr: 1277.4, stddev: 430.30
> range-write-240thr: 2353.1, stddev: 502.04
> range-read-240thr: 1551.5, stddev: 361.33
>
> When mixing readers and writers, writer throughput can take a hit of up to ~40%,
> similar to the 4-core machine; however, reader threads can increase the number of
> acquisitions by up to ~80%. In any case, the overall writer+reader throughput will
> always be higher for rwsems. A huge factor in this behavior is that range locks
> do not have the writer spin-on-owner feature.
>
> On both machines, when actually testing threads acquiring different ranges,
> throughput will always outperform the rwsem, due to the increased
> parallelism; which is no surprise either. As such, microbenchmarks that merely
> pound on a lock will pretty much always suffer upon direct lock conversions,
> but not enough to matter in the overall picture.
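
As an illustration of that last point, two writers on non-intersecting ranges of
the same tree can hold their locks at the same time (the names below are only
illustrative):

	DEFINE_RANGE_RWLOCK_TREE(tree);
	DEFINE_RANGE_RWLOCK(a, 0, 4095);	/* thread 1's range */
	DEFINE_RANGE_RWLOCK(b, 8192, 12287);	/* thread 2's range */

	/* thread 1 */
	range_write_lock(&tree, &a);

	/* thread 2: does not block, [8192, 12287] does not intersect [0, 4095] */
	range_write_lock(&tree, &b);
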
>
> Signed-off-by: Davidlohr Bueso <dbueso@xxxxxxx>
> ---
> drivers/gpu/drm/Kconfig | 2 -
> drivers/gpu/drm/i915/Kconfig | 1 -
> include/linux/range_rwlock.h | 96 +++++++++
> kernel/locking/Makefile | 2 +-
> kernel/locking/range_rwlock.c | 462 ++++++++++++++++++++++++++++++++++++++++++
> lib/Kconfig | 14 --
> lib/Kconfig.debug | 1 -
> lib/Makefile | 2 +-
> 8 files changed, 560 insertions(+), 20 deletions(-)
> create mode 100644 include/linux/range_rwlock.h
> create mode 100644 kernel/locking/range_rwlock.c
>
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index 88e01e08e279..e4d9eadd2c47 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -154,7 +154,6 @@ config DRM_RADEON
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have an ATI Radeon graphics card. There
> are both PCI and AGP versions. You don't need to choose this to
> @@ -174,7 +173,6 @@ config DRM_AMDGPU
> select HWMON
> select BACKLIGHT_CLASS_DEVICE
> select BACKLIGHT_LCD_SUPPORT
> - select INTERVAL_TREE
> help
> Choose this option if you have a recent AMD Radeon graphics card.
>
> diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
> index 183f5dc1c3f2..8a9154550f46 100644
> --- a/drivers/gpu/drm/i915/Kconfig
> +++ b/drivers/gpu/drm/i915/Kconfig
> @@ -3,7 +3,6 @@ config DRM_I915
> depends on DRM
> depends on X86 && PCI
> select INTEL_GTT
> - select INTERVAL_TREE
> # we need shmfs for the swappable backing store, and in particular
> # the shmem_readpage() which depends upon tmpfs
> select SHMEM
> diff --git a/include/linux/range_rwlock.h b/include/linux/range_rwlock.h
> new file mode 100644
> index 000000000000..174e13668f67
> --- /dev/null
> +++ b/include/linux/range_rwlock.h
> @@ -0,0 +1,96 @@
> +/*
> + * Range locking
> + *
> + * We allow exclusive locking of arbitrary ranges. We guarantee that each
> + * range is locked only after all conflicting range locks requested previously
> + * have been unlocked. Thus we achieve fairness and avoid livelocks.
> + *
> + * The cost of lock and unlock of a range is O(log(R_all)+R_int) where R_all is
> + * total number of ranges and R_int is the number of ranges intersecting the
> + * operated range.
> + */
> +#ifndef _LINUX_RANGE_RWLOCK_H
> +#define _LINUX_RANGE_RWLOCK_H
> +
> +#include <linux/rbtree.h>
> +#include <linux/interval_tree.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +/*
> + * The largest range will span [0,RANGE_RWLOCK_INFINITY].
> + */
> +#define RANGE_RWLOCK_INFINITY (~0UL - 1)
> +
> +struct range_rwlock {
> + struct interval_tree_node node;
> + struct task_struct *task;
> + /* Number of ranges which are blocking acquisition of the lock */
> + unsigned int blocking_ranges;
> + bool reader;
> +};
> +
> +struct range_rwlock_tree {
> + struct rb_root root;
> + spinlock_t lock;
> + struct interval_tree_node *leftmost; /* compute smallest 'start' */
> +};
> +
> +#define __RANGE_RWLOCK_TREE_INITIALIZER(name) \
> + { .leftmost = NULL \
> + , .root = RB_ROOT \
> + , .lock = __SPIN_LOCK_UNLOCKED(name.lock) }
> +
> +#define DEFINE_RANGE_RWLOCK_TREE(name) \
> + struct range_rwlock_tree name = __RANGE_RWLOCK_TREE_INITIALIZER(name)
> +
> +#define __RANGE_RWLOCK_INITIALIZER(start, last) { \
> + .node = { \
> + .start = (start) \
> + ,.last = (last) \
> + } \
> + }
> +
> +#define DEFINE_RANGE_RWLOCK(name, start, last) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER((start), (last))
> +
> +#define DEFINE_RANGE_RWLOCK_INF(name) \
> + struct range_rwlock name = __RANGE_RWLOCK_INITIALIZER(0, RANGE_RWLOCK_INFINITY)
> +
> +static inline void range_rwlock_tree_init(struct range_rwlock_tree *tree)
> +{
> + tree->root = RB_ROOT;
> + spin_lock_init(&tree->lock);
> + tree->leftmost = NULL;
> +}
> +
> +void range_rwlock_init(struct range_rwlock *lock,
> + unsigned long start, unsigned long last);
> +void range_rwlock_init_inf(struct range_rwlock *lock);
> +
> +/*
> + * lock for reading
> + */
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +/*
> + * lock for writing
> + */
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock);
> +
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock);
> +
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index 760158d9d98d..bf054d0af82b 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -2,7 +2,7 @@
> # and is generally not a function of system call inputs.
> KCOV_INSTRUMENT := n
>
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o range_rwlock.o
>
> ifdef CONFIG_FUNCTION_TRACER
> CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
> diff --git a/kernel/locking/range_rwlock.c b/kernel/locking/range_rwlock.c
> new file mode 100644
> index 000000000000..4d92baa9bd80
> --- /dev/null
> +++ b/kernel/locking/range_rwlock.c
> @@ -0,0 +1,462 @@
> +/*
> + * Implementation of read/write range locks.
> + *
> + * We keep an interval tree of locked and to-be-locked ranges. When a new range
> + * lock is requested, we add its interval to the tree and store the number of
> + * intervals intersecting it in 'blocking_ranges'.
> + *
> + * When a range is unlocked, we again walk the intervals that intersect with the
> + * unlocked one and decrement their 'blocking_ranges'. We wake up the owner of any
> + * range lock whose 'blocking_ranges' drops to 0. For the shared case,
> + * 'blocking_ranges' is only incremented if the intersecting range is not marked
> + * as a reader. In order to mitigate some of the tree walk overhead for
> + * non-intersecting ranges, the tree's min/max values are maintained and consulted
> + * in O(1) in the fastpath.
> + */
> +#include <linux/rbtree.h>
> +#include <linux/spinlock.h>
> +#include <linux/range_rwlock.h>
> +#include <linux/sched/signal.h>
> +#include <linux/sched/debug.h>
> +#include <linux/sched/wake_q.h>
> +#include <linux/sched.h>
> +#include <linux/export.h>
> +
> +
> +#define range_entry(ptr, type, member) container_of(ptr, type, member)
> +
> +/*
> + * Range intersection/overlap between A: [a0, a1] and B: [b0, b1]
> + * is given by:
> + *
> + * a0 <= b1 && b0 <= a1
> + *
> + * ... where A holds the lock range and B holds the smallest 'start' and
> + * largest 'last' in the tree. For the latter, we rely on the root node,
> + * which, by the augmented interval tree property, holds the largest value in
> + * its last-in-subtree.
> + */
> +static inline bool
> +__range_intersects_intree(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *root;
> +
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + return false;
> +
> + root = range_entry(tree->root.rb_node, struct interval_tree_node, rb);
> +
> + return lock->node.start <= root->__subtree_last &&
> + tree->leftmost->start <= lock->node.last;
> +}
> +
> +static inline void
> +__range_tree_insert(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (unlikely(RB_EMPTY_ROOT(&tree->root)))
> + tree->leftmost = &lock->node;
> + else if (lock->node.start < tree->leftmost->start)
> + tree->leftmost = &lock->node;
> +
> + interval_tree_insert(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_tree_remove(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + if (tree->leftmost == &lock->node) {
> + struct rb_node *next = rb_next(&tree->leftmost->rb);
> + tree->leftmost = range_entry(next, struct interval_tree_node, rb);
> + }
> +
> + interval_tree_remove(&lock->node, &tree->root);
> +}
> +
> +static inline void
> +__range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + WARN_ON(start > last);
> +
> + lock->node.start = start;
> + lock->node.last = last;
> + RB_CLEAR_NODE(&lock->node.rb);
> + lock->blocking_ranges = 0;
> + lock->reader = false;
> +}
> +
> +/**
> + * range_rwlock_init - initialize the range lock
> + * @lock: the range lock to be initialized
> + * @start: start of the interval (inclusive)
> + * @last: last location in the interval (inclusive)
> + *
> + * Initialize the range [start, last] such that it can
> + * later be locked. The user is expected to enter a sorted
> + * range, such that @start <= @last.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init(struct range_rwlock *lock, unsigned long start,
> + unsigned long last)
> +{
> + __range_rwlock_init(lock, start, last);
> +}
> +EXPORT_SYMBOL(range_rwlock_init);
> +
> +/**
> + * range_rwlock_init_inf - initialize the range lock to infinity
> + * @lock: the range lock to be initialized
> + *
> + * Initialize the full range.
> + *
> + * It is not allowed to initialize an already locked range.
> + */
> +void range_rwlock_init_inf(struct range_rwlock *lock)
> +{
> + __range_rwlock_init(lock, 0, RANGE_RWLOCK_INFINITY);
> +}
> +EXPORT_SYMBOL(range_rwlock_init_inf);
> +
> +static inline void
> +range_rwlock_unblock(struct range_rwlock *lock, struct wake_q_head *wake_q)
> +{
> + if (!--lock->blocking_ranges)
> + wake_q_add(wake_q, lock->task);
> +}
> +
> +static inline int wait_for_ranges(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + int ret = 0;
> +
> + while (true) {
> + set_current_state(state);
> +
> + if (unlikely(signal_pending_state(state, current))) {
> + unsigned long flags;
> +
> + ret = -EINTR;
> + /*
> + * We're not taking the lock after all, cleanup
> + * after ourselves.
> + */
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> + spin_unlock_irqrestore(&tree->lock, flags);
> + break;
> + }
> +
> + /* do we need to go to sleep? */
> + if (!lock->blocking_ranges)
> + break;
Hi Davidlohr,

While building a kernel on top of a patched kernel using the full range lock
in place of mmap_sem, I found that fork() sometimes failed, returning ENOMEM.

It happens that if fork() gets called at the time a signal is sent to the
calling process, the call to range_write_lock_interruptible() fails even
though there is no contention on the lock. This is because we check for a
pending signal before checking for lock contention in wait_for_ranges().

The loop here should rather be:
	while (true) {
		/* do we need to go to sleep? */
		if (!lock->blocking_ranges)
			break;

		if (unlikely(signal_pending_state(state, current))) {
			unsigned long flags;

			ret = -EINTR;
			/*
			 * We're not taking the lock after all, cleanup
			 * after ourselves.
			 */
			spin_lock_irqsave(&tree->lock, flags);
			lock->reader = false;
			__range_tree_remove(tree, lock);
			spin_unlock_irqrestore(&tree->lock, flags);
			break;
		}

		set_current_state(state);
		schedule();
	}
I also moved the call to set_current_state() to just before the call to
schedule(); I'm not sure it has to be done this way, but my system seems to
work fine like this.
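
That said, the usual wait-loop pattern sets the task state before re-checking
the condition, so that a wakeup arriving between the check and schedule() is
not lost (try_to_wake_up() does nothing for a task that is still TASK_RUNNING).
A sketch keeping that ordering together with the reordered checks (untested,
just for illustration):

	while (true) {
		set_current_state(state);

		/* do we need to go to sleep? */
		if (!lock->blocking_ranges)
			break;

		if (unlikely(signal_pending_state(state, current))) {
			unsigned long flags;

			ret = -EINTR;
			/* not taking the lock after all, clean up */
			spin_lock_irqsave(&tree->lock, flags);
			lock->reader = false;
			__range_tree_remove(tree, lock);
			spin_unlock_irqrestore(&tree->lock, flags);
			break;
		}

		schedule();
	}
	__set_current_state(TASK_RUNNING);
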
Laurent.
> +
> + schedule();
> + }
> +
> + __set_current_state(TASK_RUNNING);
> + return ret;
> +}
> +
> +static __always_inline int
> +__range_read_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = true;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_read_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_read_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock);
> +
> +int range_read_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_interruptible);
> +
> +int range_read_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_read_lock_common(tree, lock, TASK_KILLABLE);
> +
> +}
> +EXPORT_SYMBOL(range_read_lock_killable);
> +
> +/**
> + * range_read_trylock - Trylock for reading
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_read_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int ret = true;
> + unsigned long flags;
> + struct interval_tree_node *node;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + /*
> + * We have overlapping ranges in the tree, ensure that we can
> + * in fact share the lock.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader) {
> + ret = false;
> + goto unlock;
> + }
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + lock->reader = true;
> + __range_tree_insert(tree, lock);
> +unlock:
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(range_read_trylock);
> +
> +void range_read_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (!blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_read_unlock);
> +
> +static inline int __sched
> +__range_write_lock_common(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock, long state)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + lock->reader = false;
> +
> + if (!__range_intersects_intree(tree, lock))
> + goto insert;
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + /*
> + * As a writer, we always conflict with any existing node; we
> + * need to block: either the intersecting node is another
> + * writer, or it is a reader that needs to finish.
> + */
> + lock->blocking_ranges++;
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +insert:
> + __range_tree_insert(tree, lock);
> +
> + lock->task = current;
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return wait_for_ranges(tree, lock, state);
> +}
> +
> +void range_write_lock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + might_sleep();
> + __range_write_lock_common(tree, lock, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock);
> +
> +int range_write_lock_interruptible(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_INTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_interruptible);
> +
> +int range_write_lock_killable(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + might_sleep();
> + return __range_write_lock_common(tree, lock, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(range_write_lock_killable);
> +
> +/**
> + * range_write_trylock - Trylock for writing
> + * @tree: interval tree
> + * @lock: the range lock to be trylocked
> + *
> + * The trylock is against the range itself, not the @tree->lock.
> + *
> + * Returns 1 if successful, 0 if contention (must block to acquire).
> + */
> +int range_write_trylock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + int intersects;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&tree->lock, flags);
> + intersects = __range_intersects_intree(tree, lock);
> +
> + if (!intersects) {
> + lock->reader = false;
> + __range_tree_insert(tree, lock);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> +
> + return !intersects;
> +}
> +EXPORT_SYMBOL(range_write_trylock);
> +
> +void range_write_unlock(struct range_rwlock_tree *tree, struct range_rwlock *lock)
> +{
> + struct interval_tree_node *node;
> + unsigned long flags;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + lock->reader = false;
> + __range_tree_remove(tree, lock);
> +
> + if (!__range_intersects_intree(tree, lock)) {
> + /* nobody to wakeup, we're done */
> + spin_unlock_irqrestore(&tree->lock, flags);
> + return;
> + }
> +
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + range_rwlock_unblock(blocked_lock, &wake_q);
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_write_unlock);
> +
> +/**
> + * range_downgrade_write - Downgrade write range lock to read lock
> + * @tree: interval tree
> + * @lock: the range lock to be downgraded
> + *
> + * Wakes any blocked readers, when @lock is the only conflicting range.
> + *
> + * It is not allowed to downgrade an unacquired write lock.
> + */
> +void range_downgrade_write(struct range_rwlock_tree *tree,
> + struct range_rwlock *lock)
> +{
> + unsigned long flags;
> + struct interval_tree_node *node;
> + DEFINE_WAKE_Q(wake_q);
> +
> + spin_lock_irqsave(&tree->lock, flags);
> +
> + WARN_ON(lock->reader);
> +
> + /*
> + * Unaccount for any blocked reader lock. Wakeup if possible.
> + */
> + node = interval_tree_iter_first(&tree->root, lock->node.start,
> + lock->node.last);
> + while (node) {
> + struct range_rwlock *blocked_lock;
> + blocked_lock = range_entry(node, struct range_rwlock, node);
> +
> + if (blocked_lock->reader)
> + range_rwlock_unblock(blocked_lock, &wake_q);
> +
> + node = interval_tree_iter_next(node, lock->node.start,
> + lock->node.last);
> + }
> +
> + lock->reader = true;
> + spin_unlock_irqrestore(&tree->lock, flags);
> + wake_up_q(&wake_q);
> +}
> +EXPORT_SYMBOL(range_downgrade_write);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index 0c8b78a9ae2e..52ca6668a16d 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -347,20 +347,6 @@ config TEXTSEARCH_FSM
> config BTREE
> bool
>
> -config INTERVAL_TREE
> - bool
> - help
> - Simple, embeddable, interval-tree. Can find the start of an
> - overlapping range in log(n) time and then iterate over all
> - overlapping nodes. The algorithm is implemented as an
> - augmented rbtree.
> -
> - See:
> -
> - Documentation/rbtree.txt
> -
> - for more information.
> -
> config RADIX_TREE_MULTIORDER
> bool
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 97d62c2da6c2..a8a64eb54d3a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1771,7 +1771,6 @@ config RBTREE_TEST
> config INTERVAL_TREE_TEST
> tristate "Interval tree test"
> depends on m && DEBUG_KERNEL
> - select INTERVAL_TREE
> help
> A benchmark measuring the performance of the interval tree library
>
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..d4ff3057450a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -80,7 +80,7 @@ obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
> obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
>
> obj-$(CONFIG_BTREE) += btree.o
> -obj-$(CONFIG_INTERVAL_TREE) += interval_tree.o
> +obj-y += interval_tree.o
> obj-$(CONFIG_ASSOCIATIVE_ARRAY) += assoc_array.o
> obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
> obj-$(CONFIG_DEBUG_LIST) += list_debug.o
>