Re: [PATCH net-next] rhashtable: further improve stability of rhashtable_walk

From: NeilBrown
Date: Wed Dec 12 2018 - 22:49:15 EST

On Thu, Dec 13 2018, Herbert Xu wrote:

> On Wed, Dec 12, 2018 at 07:49:08PM +1100, NeilBrown wrote:
>> On Wed, Dec 12 2018, Herbert Xu wrote:
>> > On Wed, Dec 12, 2018 at 05:41:29PM +1100, NeilBrown wrote:
>> >>
>> >> So you would substantially slow down the rhashtable_walk_start() step.
>> >
>> > This whole thing is meant for uses such as /proc and netlink
>> > enumeration. Speed is definitely not a prerogative of these
>> > iterators.
>> An RCU grace period is typically 10s of milliseconds. It doesn't take
>> very many of those to start being noticed.
> Why would you even need an RCU grace period for deleting and adding
> the walker object to the bucket list? You just allocate a new one
> each time and free the old one after an RCU grace period. I don't
> see where the latency is coming from.

Yes, you could rcu_free the old one and allocate a new one. Then you
would have to be ready to deal with memory allocation failure which
complicates usage (I already don't like that rhashtable_insert() can
report -ENOMEM!).

Another problem with inserting a marker object on rhashtable_walk_stop()
is that it might not be clear where to insert it.
You would have to take the spinlock, and then you might find that the
location that you want to mark has been deleted from the list. I think
you can probably still manage a reliable placement, but it could get

>> > For that matter, if speed was an issue then surely you would
>> > not drop out of the RCU read lock at all while iterating.
>> tipc_nl_sk_walk() drops out of RCU for every object.
>> I don't know what it is used for, but I doubt that making it thousands
>> of times slower would be a good thing.
> Now you're conflating two different things. Dropping the RCU
> isn't necessarily slow. We were talking about waiting for an
> RCU grace period which would only come into play if you were
> suspending the walk indefinitely. Actually as I said above even
> there you don't really need to wait.

How would rhashtable_walk_stop() know if it was indefinite or not?

Yes, if you allocate a new marker on each rhashtable_walk_start()
(passing in a gfp_t and coping with errors) then you don't need to wait
a grace period. If you reuse one to avoid the errors, you do.

>> > It sounds to me like you're trying to use this interface for
>> > something that it's simply not designed to do.
>> I'm just trying to make sure we have a reliable data structure which I
>> can use without having to be worried that I might accidentally hit some
>> unsupported behaviour.
> Now that is something I agree with.
>> I don't really see why you think my patch is all that complex.
>> The only conceptual change is that hash chains are now ordered, and
>> ordered chains are easy to understand.
>> The biggest part of the code change is just moving some code from
>> rhashtable_walk_start() to rhashtable_walk_next().
>> Why don't you like it?
> My main beef is the fact that it doesn't work with rhlist. So down
> the track either we'll have to add more complexity for rhlist or
> we will have to rip this out again do something completely different.

You have previously said that functionality which isn't needed by current
code should not be a concern. No current code ever stops and restarts
an rhlist walk, so this shouldn't be relevant. Solve that problem
when/if we come to it.

If this is really the main beef, then let's turn the question around.
Suppose this patch had landed before rhltables had landed. How would we
implement rhltables without breaking this functionality?

Keeping all the objects with the same key in a linked list is clearly a
good idea - make this a circular list as there is no obvious "first".

*Not* keeping them all in the hash chain is ideal, but not essential.
I see three costs with this.
One is that we would compare the same key multiple times for lookup.
How much of a problem is that? A failing compare is usually quite quick,
and most rhltable uses have inline memcmp for comparison (admittedly not

The second cost is tracking the chain length against elasticity.
We could flag one object with each key as a 'master' (low bit of the
'next' pointer) and only count the masters. When lookup raced with
remove this might get a slightly incorrect count, but I don't think that

Finally, there is more pointer chasing as the chains are longer.

Was performance part of the reason for adding rhltables? It isn't
mentioned in the commit message, but it is easy to miss things.


> Cheers,
> --
> Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
> Home Page:
> PGP Key:

Attachment: signature.asc
Description: PGP signature