Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()

From: Roman Penyaev
Date: Mon May 21 2018 - 08:56:09 EST


On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney
<paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney
>> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> Function is going to be used in transport over RDMA module
>> >> in subsequent patches.
>> >>
>> >> Function returns next element in round-robin fashion,
>> >> i.e. head will be skipped. NULL will be returned if list
>> >> is observed as empty.
>> >>
>> >> Signed-off-by: Roman Pen <roman.penyaev@xxxxxxxxxxxxxxxx>
>> >> Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
>> >> Cc: linux-kernel@xxxxxxxxxxxxxxx
>> >> ---
>> >> include/linux/rculist.h | 19 +++++++++++++++++++
>> >> 1 file changed, 19 insertions(+)
>> >>
>> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> index 127f534fec94..b0840d5ab25a 100644
>> >> --- a/include/linux/rculist.h
>> >> +++ b/include/linux/rculist.h
>> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >> })
>> >>
>> >> /**
>> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> + * @head: the head for the list.
>> >> + * @ptr: the list head to take the next element from.
>> >> + * @type: the type of the struct this is embedded in.
>> >> + * @memb: the name of the list_head within the struct.
>> >> + *
>> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> + * but if list is observed as empty, NULL will be returned.
>> >> + *
>> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >
>> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> > are round-robining a given list must all be under the same RCU read-side
>> > critical section. For example, the following will break badly:
>> >
>> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> > {
>> > struct foo *ret;
>> >
>> > rcu_read_lock();
>> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> > rcu_read_unlock(); /* BUG */
>> > return ret;
>> > }
>> >
>> > You need a big fat comment stating this, at the very least. The resulting
>> > bug can be very hard to trigger and even harder to debug.
>> >
>> > And yes, I know that the same restriction applies to list_next_rcu()
>> > and friends. The difference is that if you try to invoke those in an
>> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> > the list header. Without that knuckle-rapping, RCU CPU stall warnings
>> > might tempt people to do something broken like take_rr_step() above.
>>
>> Hi Paul,
>>
>> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> connection from the list in order to send a request, i.e. my code is
>> something like the following:
>>
>> static struct conn *get_and_set_next_conn(void)
>> {
>> struct conn *conn;
>>
>> conn = rcu_dereferece(rcu_conn);
>> if (unlikely(!conn))
>> return conn;
>
> Wait. Don't you need to restart from the beginning of the list in
> this case? Or does the list never have anything added to it and is
> rcu_conn initially the first element in the list?

Hi Paul,

No, I continue from the pointer, which I assigned on the previous IO
in order to send IO fairly and keep load balanced.

Initially @rcu_conn points to the first element, but elements can
be deleted from the list and list can become empty.

The deletion code is below.

>
>> conn = list_next_or_null_rr_rcu(&conn_list,
>> &conn->entry,
>> typeof(*conn),
>> entry);
>> rcu_assign_pointer(rcu_conn, conn);
>
> Linus is correct to doubt this code. You assign a pointer to the current
> element to rcu_conn, which is presumably a per-CPU or global variable.
> So far, so good ...

I use per-CPU, in the first example I did not show that not to overcomplicate
the code.

>
>> return conn;
>> }
>>
>> rcu_read_lock();
>> conn = get_and_set_next_conn();
>> if (unlikely(!conn)) {
>> /* ... */
>> }
>> err = rdma_io(conn, request);
>> rcu_read_unlock();
>
> ... except that some other CPU might well remove the entry referenced by
> rcu_conn at this point. It would have to wait for a grace period (e.g.,
> synchronize_rcu()), but the current CPU has exited its RCU read-side
> critical section, and therefore is not blocking the grace period.
> Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> might well be referencing the freelist, or, even worse, some other type
> of structure.
>
> What is your code doing to prevent this from happening? (There are ways,
> but I want to know what you were doing in this case.)

Probably I should have shown the way of removal at the very beginning,
my fault. So deletion looks as the following (a bit changed and
simplified for the sake of clearness):

static void remove_connection(conn)
{
bool need_to_wait = false;
int cpu;

/* Do not let RCU list add/delete happen in parallel */
mutex_lock(&conn_lock);

list_del_rcu(&conn->entry);

/* Make sure everybody observes element removal */
synchronize_rcu();

/*
* At this point nobody sees @conn in the list, but
still we have
* dangling pointer @rcu_conn which _can_ point to @conn. Since
* nobody can observe @conn in the list, we guarantee
that IO path
* will not assign @conn to @rcu_conn, i.e. @rcu_conn
can be equal
* to @conn, but can never again become @conn.
*/

/*
* Get @next connection from current @conn which is going to be
* removed.
*/
next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
typeof(*next), entry);

/*
* Here @rcu_conn can be changed by reader side, so use @cmpxchg
* in order to keep fairness in load-balancing and do not touch
* the pointer which can be already changed by the IO path.
*
* Current path can be faster than IO path and the
following race
* exists:
*
* CPU0 CPU1
* ---- ----
* conn = rcu_dereferece(rcu_conn);
* next = list_next_or_null_rr_rcu(conn)
*
* conn ==
cmpxchg(rcu_conn, conn, next);
* synchronize_rcu();
*
* rcu_assign_pointer(rcu_conn, next);
* ^^^^^^^^^^^^^^^^^^
*
* Here @rcu_conn is already equal to @next (done by
@cmpxchg),
* so assignment to the same pointer is harmless.
*
*/
for_each_possible_cpu(cpu) {
struct conn **rcu_conn;

rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
if (*rcu_conn != conn)
/*
* This @cpu will never again pick up @conn,
* so it is safe just to choose next CPU.
*/
continue;

if (conn == cmpxchg(rcu_conn, conn, next))
/*
* @rcu_conn was successfully replaced
with @next,
* that means that someone can also hold a @conn
* and dereferencing it, so wait for a
grace period
* is required.
*/
need_to_wait = true;
}
if (need_to_wait)
synchronize_rcu();

mutex_unlock(&conn_lock);

kfree(conn);
}


>
>> i.e. usage of the @next pointer is under an RCU critical section.
>>
>> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
>> > macro that makes it more obvious that the whole thing need to be under
>> > a single RCU read-side critical section? Such a macro would of course be
>> > an infinite loop if the list never went empty, so presumably there would
>> > be a break or return statement in there somewhere.
>>
>> The difference is that I do not need a loop, I take the @next conn pointer,
>> save it for the following IO request and do IO for current IO request.
>>
>> It seems list_for_each_entry_rcu()-like with immediate "break" in the body
>> of the loop does not look nice, I personally do not like it, i.e.:
>>
>>
>> static struct conn *get_and_set_next_conn(void)
>> {
>> struct conn *conn;
>>
>> conn = rcu_dereferece(rcu_conn);
>> if (unlikely(!conn))
>> return conn;
>> list_for_each_entry_rr_rcu(conn, &conn_list,
>> entry) {
>> break;
>> }
>> rcu_assign_pointer(rcu_conn, conn);
>> return conn;
>> }
>>
>>
>> or maybe I did not fully get your idea?
>
> That would not help at all because you are still leaking the pointer out
> of the RCU read-side critical section. That is completely and utterly
> broken unless you are somehow cleaning up rcu_conn when you remove
> the element. And getting that cleanup right is -extremely- tricky.
> Unless you have some sort of proof of correctness, you will get a NACK
> from me.

I understand all the consequences of the leaking pointer, and of course
wrapped loop with RCU lock/unlock is simpler, but in order to keep
load-balancing and IO fairness avoiding any locks on IO path I've come
up with these RCU tricks and list_next_or_null_rr_rcu() macro.

> More like this:
>
> list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
> do_something_with(conn);
> if (done_for_now())
> break;
> }
>
>> >> + */
>> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> >> +({ \
>> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>> >
>> > Are there any uses for this outside of RDMA? If not, I am with Linus.
>> > Define this within RDMA, where a smaller number of people can more
>> > easily be kept aware of the restrictions on use. If it turns out to be
>> > more generally useful, we can take a look at exactly what makes sense
>> > more globally.
>>
>> The only one list_for_each_entry_rcu()-like macro I am aware of is used in
>> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr():
>>
>> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>>
>> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing
>> my list_next_or_null_rr_rcu() variant?
>
> Let's start with the basics: It absolutely does not make sense to leak
> pointers across rcu_read_unlock() unless you have arranged something else
> to protect the pointed-to data in the meantime. There are a number of ways
> of implementing this protection. Again, what protection are you using?
>
> Your code at the above URL looks plausible to me at first glance: You
> do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> rcu_read_unlock(). But at second glance, it looks like htcx->queue
> might have the same vulnerability as rcu_conn in your earlier code.

I am not the author of the code at the URL I specified. I provided the
link answering the question to show other possible users of round-robin
semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()'
case I can't use a loop, I leak the pointer and indeed have to be very
careful. But perhaps we can come up with some generic solution to cover
both cases: -rr loop and -rr next.

--
Roman