Re: [PATCH] ipc: Convert ipcs_idr to XArray

From: Matthew Wilcox
Date: Mon Apr 20 2020 - 13:06:06 EST


On Mon, Apr 20, 2020 at 05:35:20PM +0200, Manfred Spraul wrote:
> > - max_idx = max(ids->in_use*3/2, ipc_min_cycle);
> > - max_idx = min(max_idx, ipc_mni);
> > -
> > - /* allocate the idx, with a NULL struct kern_ipc_perm */
> > - idx = idr_alloc_cyclic(&ids->ipcs_idr, NULL, 0, max_idx,
> > - GFP_NOWAIT);
> > -
> > - if (idx >= 0) {
> > - /*
> > - * idx got allocated successfully.
> > - * Now calculate the sequence number and set the
> > - * pointer for real.
> > - */
> > - if (idx <= ids->last_idx) {
> > + min_idx = ids->next_idx;
> > + new->seq = ids->seq;
> > +
> > + /* Modified version of __xa_alloc */
> > + do {
> > + xas.xa_index = min_idx;
> > + xas_find_marked(&xas, max_idx, XA_FREE_MARK);
> > + if (xas.xa_node == XAS_RESTART && min_idx > 0) {
> > ids->seq++;
> > if (ids->seq >= ipcid_seq_max())
> > ids->seq = 0;
> > + new->seq = ids->seq;
> > + xas.xa_index = 0;
> > + min_idx = 0;
> > + xas_find_marked(&xas, max_idx, XA_FREE_MARK);
> > }
>
> Is it necessary to have that many details of xarray in ipc/util?
>
> This function is not performance critical.
>
> The core requirement is that ipc_obtain_object_check() must scale.
>
> Would it be possible to use something like
>
>     xa_alloc(,entry=NULL,)
>
>     new->seq = ...
>
>     xa_store(,entry=new,);

Yes, that would absolutely be possible, and some users do exactly that.
I thought that creating a new message queue / semaphore set / shared
memory segment would be performance sensitive.
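
Roughly, Manfred's two-step version would look like this (just a
sketch: it ignores the cyclic allocation and seq-wraparound handling
in the patch, and the limits/GFP flags here are illustrative rather
than taken from it):

	u32 idx;
	int err;

	/*
	 * Reserve the index with a NULL entry.  In an allocating
	 * XArray the slot stays allocated; lookups simply see NULL
	 * until the real pointer is stored.
	 */
	err = xa_alloc(&ids->ipcs, &idx, NULL, XA_LIMIT(0, max_idx - 1),
		       GFP_KERNEL);
	if (err < 0)
		return err;

	new->seq = ids->seq;
	new->id = (new->seq << ipcmni_seq_shift()) + idx;

	/* Publish the fully initialised object for real. */
	xa_store(&ids->ipcs, idx, new, GFP_KERNEL);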

> > - ids->last_idx = idx;
> > -
> > - new->seq = ids->seq;
> > - /* no need for smp_wmb(), this is done
> > - * inside idr_replace, as part of
> > - * rcu_assign_pointer
> > - */
>
> Could you leave the memory barrier comments in the code?
>
> The rcu_assign_pointer() is the first hand-off from semget() or msgget().
>
> Before the rcu_assign_pointer, e.g. semop() calls would return -EINVAL.
>
> After the rcu_assign_pointer, semwhatever() must work - and e.g. the
> permission checks are lockless.

How about simply:
/* xa_store contains a memory barrier */
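
In context, that would read (sketch, using the plain xa_store() form
rather than the xas_* loop):

	/*
	 * xa_store() contains a memory barrier (the rcu_assign_pointer()
	 * buried inside it), so the final value of new->seq is visible
	 * to lockless readers before the pointer itself is.
	 */
	xa_store(&ids->ipcs, idx, new, GFP_KERNEL);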

> > - idr_replace(&ids->ipcs_idr, new, idx);
> > - }
> > + if (xas.xa_node == XAS_RESTART)
> > + xas_set_err(&xas, -ENOSPC);
> > + else
> > + new->id = (new->seq << ipcmni_seq_shift()) +
> > + xas.xa_index;
>
> Setting new->id should remain at the end, outside any locking:
>
> The variable has no special protection; access is only allowed after proper
> locking, so there is no need to have the initialization in the middle.
>
> What is crucial is that the final value of new->seq is visible to all CPUs
> before storing the pointer.

The IPC locking is weird. Most users spin_lock the xarray/idr/radix
tree for modifications, and on the read side use RCU to protect the
lookup plus a refcount on the object looked up (after which, RCU
is unlocked). IPC seems to hold the RCU lock much longer, and it
has a per-object spinlock rather than refcount. And it has an rwsem.
It feels like it could be much simpler, but I haven't had time to dig
into it and figure out why it's so different from all the other users.
Maybe it's just older code.
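
For reference, the read-side pattern most other users follow looks
roughly like this (an illustrative sketch, not the current ipc code):

	struct kern_ipc_perm *obj;

	rcu_read_lock();
	obj = xa_load(&ids->ipcs, idx);
	/* Pin the object with a refcount instead of staying under RCU. */
	if (obj && !refcount_inc_not_zero(&obj->refcount))
		obj = NULL;	/* raced with removal */
	rcu_read_unlock();
	/* If non-NULL, obj is now pinned by its refcount, not by RCU. */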

> > + xas_store(&xas, new);
> > + xas_clear_mark(&xas, XA_FREE_MARK);
> > + } while (__xas_nomem(&xas, GFP_KERNEL));
> > +
>
> Just out of curiosity:
>
> If the xas is in an invalid state, then xas_store() will not store anything.
> Thus the loop will not store "new" multiple times; it will be stored only
> once.

Correct, although we're going to delete this loop entirely.

> > @@ -472,7 +487,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm *ipcp)
> > idx--;
> > if (idx == -1)
> > break;
> > - } while (!idr_find(&ids->ipcs_idr, idx));
> > + } while (!xa_load(&ids->ipcs, idx));
> > ids->max_idx = idx;
> > }
> > }
>
> Is there an xa_find_last() function?
>
> It is outside of any hot path; I have a patch that does a binary search with
> idr_get_next().
>
> If there is no xa_find_last(), then I would rebase that patch.

There is not currently an xa_find_last(). It shouldn't be too hard
to add; start at the top of the tree and walk backwards in each node
until finding a non-NULL entry. Of course, it'll need documentation
and test cases.
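
In the meantime, since this is outside any hot path, a naive O(n) scan
over the public API would do (a hypothetical helper, not part of the
patch):

	static int max_used_idx(struct xarray *xa)
	{
		unsigned long index;
		void *entry;
		int max = -1;

		xa_for_each(xa, index, entry)
			max = index;
		return max;
	}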