Re: [PATCH kernel] rcu: Define lockless version of list_for_each_entry_rcu

From: Paul E. McKenney
Date: Wed Nov 18 2015 - 14:13:19 EST

Next message: Remi Pommarel: "Re: [PATCH v2 1/3] clk: bcm2835: Always round up clock divisor"
Previous message: Andy Shevchenko: "Re: enable dynamic debug - doubt"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, Nov 06, 2015 at 01:17:17PM +1100, Alexey Kardashevskiy wrote:
> On 11/04/2015 01:39 AM, Steven Rostedt wrote:
> >On Tue, 3 Nov 2015 17:57:05 +1100
> >Alexey Kardashevskiy <aik@xxxxxxxxx> wrote:
> >
> >>This defines list_for_each_entry_lockless. This allows safe list
> >>traversing in cases when lockdep() invocation is unwanted like
> >>real mode (MMU is off).
> >>
> >>Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
> >>---
> >>
> >>This is for VFIO acceleration in POWERKVM for pSeries guests.
> >>There is a KVM instance. There also can be some VFIO (PCI passthru)
> >>devices attached to a KVM guest.
> >>
> >>To perform DMA, a pSeries guest registers DMA memory by calling some
> >>hypercalls explicitely at the rate close to one-two hcalls per
> >>a network packet, i.e. very often. When a guest does a hypercall
> >>(which is just an assembly instruction), the host kernel receives it
> >>in the real mode (MMU is off). When real mode fails to handle it,
> >>it enables MMU and tries handling a hcall in virtual mode.
> >>
> >>A logical bus ID (LIOBN) is a tagret id for these hypecalls.
> >>
> >>Each VFIO device belongs to an IOMMU group. Each group has an address
> >>translation table. It is allowed to have multiple IOMMU groups (i.e.
> >>multiple tables) under the same LIOBN.
> >>
> >>So effectively every DMA hcall has to update one or more TCE tables
> >>attached to the same LIOBN. RCU is used to update/traverse this list
> >>safely.
> >>
> >>Using RCU as is in virtual mode is fine. Lockdep works, etc.
> >>list_add_rcu() is used to populate the list;
> >>list_del_rcu() + call_rcu() used to remove groups from a list.
> >>These operations can happen in runtim as a result of PCI hotplug/unplug
> >>in guests.
> >>
> >>Using RCU as is in real mode is not fine as some RCU checks can lock up
> >>the system and in real mode we won't even have a chance to see any
> >>debug. This is why rcu_read_lock() and rcu_read_unlock() are NOT used.
> >>
> >>Previous version of this used to define list_for_each_entry_rcu_notrace()
> >>but it was proposed to use list_entry_lockless() instead. However
> >>the comment for lockless_dereference() suggests this is a good idea
> >>if "lifetime is managed by something other than RCU" but it is in my case.
> >>
> >>So what would be the correct approach here? Thanks.
> >
> >If the only use case for this so far is in POWERKVM, perhaps it should
> >be defined specifically (and in arch/powerpc) and not confuse others
> >about using this.
> >
> >Or, if you do imagine that this can be used in other scenarios, then a
> >much deeper comment must be made in the code in the kerneldoc section.
> >list_for_each_entry_rcu() should really be used in 99.99% of the time
> >in the kernel. This looks to be an extreme exception. I hate to add a
> >generic helper for something that will only be used in one location.
>
> No, I cannot imagine this really and I can move it somewhere in arch/powerpc.
>
> Still, is my approach correct? What does the comment for
> lockless_dereference() actally mean - it won't work together with
> RCU at all or this is to force people not to use it as
> "list_for_each_entry_rcu() should really be used in 99.99% of the
> time"? :)

Well, it depends...

The key difference between lockless_dereference() and rcu_dereference()
is that lockless_dereference() won't complain if used outside of
an RCU read-side critical section. When there is no RCU read-side
critical section, lockless_dereference() cannot rely on RCU's normal
action of keeping the data item around. Therefore, when you are using
lockless_dereference(), you have to have some mechanism other than RCU
to keep the data item around. The usual approach is for the data item
to never be freed, for example, if data is only ever added to the list
and never removed. Other approaches include reference counting, hazard
pointers, hashed arrays of locks, garbage collectors, transactional
memory, and so on.

Of these other approaches:

1. Reference counting is quite hard to use.

2. Hazard pointers are easier to use than reference counting, but
still requires that races with deletion force the traversal
to be retried from the top level.

3. Hashed arrays of locks pose interesting deadlock issues, and
also poor locality of reference, which in turn results in
not-so-good performance and scalability.

4. As far as I know, garbage collectors are not available in the
Linux kernel.

5. Full-blown hardware transactional memory is only available as s390's
constrained transactions. (Yes, I know about x86 and powerpc's
transactional memory, and in particular, I know that it needs a
software fallback, which needs to be some other item on this list.)
The software transactional memory implementations that I am aware
of suffer from not-so-good performance and scalability.

So, in the Linux kernel, it is common to use RCU.

In any case, the docbook comment "This list-traversal primitive may
safely run concurrently" does need some help. Maybe something like
the following?

* This list-traversal primitive is similar to list_for_each_entry_rcu(),
* but is intended for lists whose elements are never removed.

All well and good, but the other thing you might mean is that elements
-do- get removed, but in some environment that cannot be preempted.

In this case, you could use synchronize_rcu_sched() or call_rcu_sched()
to wait for all readers. If you also had readers running in one
of the normal kernel environments, they could be protected by
rcu_read_lock_sched() and rcu_read_unlock_sched() or any pair of
primitives that disable then re-enable preemption (e.g., local_irq_save()
and local_irq_restore()). Accesses from one of the normal kernel
environments would use rcu_dereference_sched(), but this primitive's
calls to lockdep might not work so well in your special kernel
environment.

In this case, the docbook comment might be as follows:

* This list-traversal primitive is similar to list_for_each_entry_rcu(),
* but is intended for lists whose elements are never removed on the
* one hand and for non-preemptible special environments where lockdep
* cannot be invoked on the other. In the case of the special environments,
* use synchronize_rcu_sched() and call_rcu_sched() to wait for readers.

Or are you doing something else entirely?

Thanx, Paul

> >-- Steve
> >
> >
> >>---
> >> include/linux/rculist.h | 16 ++++++++++++++++
> >> 1 file changed, 16 insertions(+)
> >>
> >>diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> >>index 17c6b1f..a83a924 100644
> >>--- a/include/linux/rculist.h
> >>+++ b/include/linux/rculist.h
> >>@@ -308,6 +308,22 @@ static inline void list_splice_init_rcu(struct list_head *list,
> >> pos = list_entry_rcu(pos->member.next, typeof(*pos), member))
> >>
> >> /**
> >>+ * list_for_each_entry_lockless - iterate over rcu list of given type
> >>+ * @pos: the type * to use as a loop cursor.
> >>+ * @head: the head for your list.
> >>+ * @member: the name of the list_struct within the struct.
> >>+ *
> >>+ * This list-traversal primitive may safely run concurrently
> >>+ */
> >>+#define list_entry_lockless(ptr, type, member) \
> >>+ container_of((typeof(ptr))lockless_dereference(ptr), type, member)
> >>+
> >>+#define list_for_each_entry_lockless(pos, head, member) \
> >>+ for (pos = list_entry_lockless((head)->next, typeof(*pos), member); \
> >>+ &pos->member != (head); \
> >>+ pos = list_entry_lockless(pos->member.next, typeof(*pos), member))
> >>+
> >>+/**
> >> * list_for_each_entry_continue_rcu - continue iteration over list of given type
> >> * @pos: the type * to use as a loop cursor.
> >> * @head: the head for your list.
> >
>
>
> --
> Alexey
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Remi Pommarel: "Re: [PATCH v2 1/3] clk: bcm2835: Always round up clock divisor"
Previous message: Andy Shevchenko: "Re: enable dynamic debug - doubt"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]