Re: [RFC][PATCH] module: Optimize __module_address() using a latched RB-tree

From: Peter Zijlstra
Date: Thu Feb 26 2015 - 14:14:22 EST


On Thu, Feb 26, 2015 at 10:28:17AM -0800, Paul E. McKenney wrote:


> Color me confused, both by the existing code and the modifications.
>
> It appears that you are using seqlock to force readers to retry when
> a concurrent update occurs, but I don't see what is ensuring that the
> readers see good data when racing with an insertion or a deletion.

Yes, I suppose that needs a wee bit more work (and comments).

> Yes,
> the reader will be retried, courtesy of the seqlock, but that won't help
> if the reader segfaults before it gets to the read_seqcount_retry().

Right, I don't think that can happen, but see below.

> > +static void __tree_insert(struct latch_tree_root *mod_tree, struct module *mod, int idx)
> > +{
> > +	struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > +	struct module_node *mn = &mod->tree_node[idx];
> > +	struct rb_node **link = &root->rb_node;
> > +	struct rb_node *parent = NULL;
> > +	unsigned long mod_val, m_val;
> > +	struct module *m;
> > +	int i;
> > +
> > +	mn->mod = mod;
> > +	mod_val = mod_value(mod, idx);
> > +
> > +	while (*link) {
> > +		parent = *link;
> > +		m = mod_entry(parent);
> > +		i = mod_node_idx(m, parent);
> > +		m_val = mod_value(m, i);
> > +
> > +		if (mod_val < m_val)
> > +			link = &parent->rb_left;
> > +		else
> > +			link = &parent->rb_right;
> > +	}
> > +
> > +	rb_link_node(&mn->node, parent, link);
>
> This makes the new module visible to readers, if I understand correctly.

You do.

> There needs to be a memory barrier between initialization and this call
> to rb_link_node(), otherwise, both the CPU (well, for Alpha) and the
> compiler (for everyone) can reorder, which could result in some hapless
> reader seeing pre-initialized data.
>
> Or did I miss the memory barrier?

I was going with the one in list_add_rcu() and the two in
raw_write_seqcount_latch(); however, now that you've made me look at it,
I do need one in here to ensure mn->mod is observed.

So yes. The best solution might be a mod_tree_init() which initializes
those back pointers before this insertion.
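
Something like so, as a rough sketch; mod_tree_init() is the
hypothetical helper named above, and I'm assuming the tree_node[]
array sits in struct module like in this patch:

static void mod_tree_init(struct module *mod)
{
	int i;

	/*
	 * Set the back pointers before the nodes are ever linked into
	 * either tree; a reader can then never observe mn->mod
	 * uninitialized, and no barrier is needed at insertion time.
	 */
	for (i = 0; i < ARRAY_SIZE(mod->tree_node); i++)
		mod->tree_node[i].mod = mod;
}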

> > +	rb_insert_color(&mn->node, root);
>
> This -might- be OK -- the rotations do not make new nodes visible,
> instead simply rearranging nodes that are already visible.
>
> For both rb_link_node() and rb_insert_color(), given that we are updating
> pointers while readers are traversing them, READ_ONCE() and WRITE_ONCE()
> would be good. Yes, the compiler -probably- doesn't mess you up, but it
> would be completely within its rights to do so. :-/
>
> Alpha would like either rcu_dereference() or lockless_dereference()
> instead of READ_ONCE(), of course!

I _think_ we should be OK here, since, as you say, we only reorder. If
we observe nodes in the wrong order, our traversal gets confused and
might return the wrong node, or no node at all.

That is fine.

Of course, as you say, the compiler is within its rights to split
stores and do other creative nonsense without the volatile treatment.
But I don't think it will actually do so for native word-sized things.

If we're really paranoid, I suppose I can go make copies of those
functions with READ_ONCE()/WRITE_ONCE() in them.
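
For the insert side that would be something like the below;
rb_link_node_once() is a made-up name. Only the final store publishes
the node; the earlier stores hit a node nobody can see yet:

static inline void rb_link_node_once(struct rb_node *node,
				     struct rb_node *parent,
				     struct rb_node **rb_link)
{
	/* not yet visible; plain stores are fine */
	node->__rb_parent_color = (unsigned long)parent;
	node->rb_left = node->rb_right = NULL;

	/* the publishing store; WRITE_ONCE() avoids store tearing */
	WRITE_ONCE(*rb_link, node);
}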

As to Alpha, I think we're good there: the split cache would make us
see stale data, which is perfectly fine; we'll traverse a crap tree, no
problem. That holds as long as native-sized, naturally-aligned
loads/stores aren't split, which I'm pretty sure doesn't happen
(because of the split cache).

> > +}
> > +
> > +static void __tree_remove(struct latch_tree_root *mod_tree, struct module *mod, int idx)
> > +{
> > +	struct rb_root *root = &mod_tree->tree[idx & latch_bit];
> > +	struct module_node *mn = &mod->tree_node[idx];
> > +
> > +	rb_erase(&mn->node, root);
>
> This is just removing and not freeing, so OK from a read-side viewpoint.
> WRITE_ONCE() would be good.

Yah, that would need a copy of lib/rbtree.c again :/

> > +static struct module *__tree_find(struct rb_root *r, unsigned long addr)
> > +{
> > +	struct rb_node *n = r->rb_node;
> > +	unsigned long m_val, m_size;
> > +
> > +	while (n) {
> > +		struct module *m = mod_entry(n);
> > +		int idx = mod_node_idx(m, n);
> > +
> > +		m_val = mod_value(m, idx);
> > +		m_size = mod_size(m, idx);
> > +
> > +		if (addr < m_val)
> > +			n = n->rb_left;
>
> We need a rcu_dereference() or lockless_dereference() here, I think.
>
> > +		else if (addr >= m_val + m_size)
> > +			n = n->rb_right;
>
> And here.

As per the above argument: even without doing the whole
READ_ONCE()/WRITE_ONCE() treatment for the rbtree primitives, I think
we're fine. I don't think we actually need the read_barrier_depends()
either.

I think this because raw_read_seqcount() will issue an rmb, which should
be enough for the 'stable' tree. For the unstable one we don't care as
long as we don't see split loads/stores.
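
That is, the read side pairs up about like so (a sketch; mod_find()
and mod_tree_seq only approximate the actual names in the patch):

static struct module *mod_find(unsigned long addr)
{
	struct module *m;
	unsigned int seq;

	do {
		/* raw_read_seqcount() implies the rmb mentioned above */
		seq = raw_read_seqcount(&mod_tree_seq);
		/* even/odd seq selects the currently stable tree */
		m = __tree_find(&mod_tree.tree[seq & latch_bit], addr);
	} while (read_seqcount_retry(&mod_tree_seq, seq));

	return m;
}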

> > +		else
> > +			return m;
> > +	}
> > +
> > +	return NULL;
>
> I did finally find the RCU read-side critical section, supplied by
> is_module_text_address()'s preempt_disable() and preempt_enable().

Indeed.

> The matching synchronize_sched() is supplied by do_init_module() and
> load_module(). I expected a synchronize_sched() in free_module() as well,
> but I only see a synchronize_rcu(). Is the required synchronize_sched()
> on this code path hiding somewhere?
>
> Or did I miss an rcu_read_lock()/rcu_read_unlock() pair somewhere?

No, I think someone is 'hoping' sync_rcu() implies sync_sched(). But
yes, I should look harder at this. I was assuming the existing code was
OK here.
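
For reference, the pairing we need is something like this (a sketch;
mod_tree_remove() stands in for whatever the actual removal path is):

	/* reader side, as in is_module_text_address(): */
	preempt_disable();		/* sched-RCU read-side section */
	mod = __module_address(addr);
	/* ... use mod ... */
	preempt_enable();

	/* update side, in free_module(), before freeing anything: */
	mod_tree_remove(mod);		/* unlink from both latch trees */
	synchronize_sched();		/* waits out preempt_disable()
					 * readers; synchronize_rcu()
					 * alone does not guarantee that */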