Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

From: Roman Penyaev
Date: Tue Dec 04 2018 - 06:51:04 EST


On 2018-12-03 18:34, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 3:03 AM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>
>> Also I'm not quite sure where to put very special lockless variant
>> of adding element to the list (list_add_tail_lockless() in this
>> patch). Seems keeping it locally is safer.
>
> That function is scary, and can be mis-used so easily that I
> definitely don't want to see it anywhere else.
>
> Afaik, it's *really* important that only "add_tail" operations can be
> done in parallel.

True, adding elements either only to the head or only to the tail can
work in parallel; any mix will corrupt the list. I tried to reflect this
in the comment of list_add_tail_lockless(), although I'm not sure
whether it has become clear enough to a reader.
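
To illustrate the "tail only" rule, here is a minimal userspace sketch
of the idea. It is not the code from the patch: it uses C11
atomic_exchange() in place of the kernel's xchg(), and all names
(lnode, add_tail_lockless_sketch, demo_tail_add) are hypothetical. It
also assumes the list is walked only when no adder is running.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct list_head; only ->prev is swapped atomically. */
struct lnode {
	struct lnode *next;
	_Atomic(struct lnode *) prev;
};

static void lnode_init(struct lnode *n)
{
	n->next = n;
	atomic_init(&n->prev, n);
}

/*
 * Tail-only add. item->next is set first, then the tail pointer
 * (head->prev) is swapped atomically, so each concurrent adder gets a
 * distinct predecessor and is the only one to write its ->next.
 * Mixing this with head insertion would break that assumption.
 */
static void add_tail_lockless_sketch(struct lnode *item, struct lnode *head)
{
	struct lnode *prev;

	item->next = head;
	prev = atomic_exchange(&head->prev, item); /* full ordering, like xchg() */
	prev->next = item;
	atomic_store(&item->prev, prev);
}

/* Walk the list forward; assumes no adders are running. */
static size_t count_forward(struct lnode *head)
{
	size_t n = 0;

	for (struct lnode *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Single-threaded demo: three tail adds keep insertion order. */
static size_t demo_tail_add(void)
{
	struct lnode head, a, b, c;

	lnode_init(&head);
	lnode_init(&a);
	lnode_init(&b);
	lnode_init(&c);

	add_tail_lockless_sketch(&a, &head);
	add_tail_lockless_sketch(&b, &head);
	add_tail_lockless_sketch(&c, &head);

	assert(head.next == &a && a.next == &b && b.next == &c && c.next == &head);
	return count_forward(&head);
}
```

The demo is single-threaded on purpose; it only shows the pointer
updates, not the concurrency itself.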


> This also ends up making the memory ordering of "xchg()" very very
> important. Yes, we've documented it as being an ordering op, but I'm
> not sure we've relied on it this directly before.

It seems exit_mm() does exactly the same, in the following chunk:

	up_read(&mm->mmap_sem);

	self.task = current;
	self.next = xchg(&core_state->dumper.next, &self);

At least the code pattern looks similar.
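
The pattern in that chunk can be modelled in userspace roughly as
follows. This is only a sketch under assumptions: C11 atomic_exchange()
stands in for xchg(), the struct and function names are hypothetical,
and the list is assumed to be walked only after all pushers have
finished (a node is published before its own ->next is filled in).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the dumper list element. */
struct dumper_sketch {
	int id;
	_Atomic(struct dumper_sketch *) next;
};

/*
 * Insert self right after first: atomically publish self as the new
 * successor and take over the old successor as self->next.
 */
static void push_self(struct dumper_sketch *first, struct dumper_sketch *self)
{
	struct dumper_sketch *old = atomic_exchange(&first->next, self);

	atomic_store(&self->next, old);
}

/* Push ids 1 and 2, then encode the walk order as decimal digits. */
static int demo_push_order(void)
{
	struct dumper_sketch first, a, b;
	int order = 0;

	first.id = 0;
	a.id = 1;
	b.id = 2;
	atomic_init(&first.next, NULL);
	atomic_init(&a.next, NULL);
	atomic_init(&b.next, NULL);

	push_self(&first, &a);
	push_self(&first, &b);

	for (struct dumper_sketch *p = atomic_load(&first.next); p;
	     p = atomic_load(&p->next))
		order = order * 10 + p->id;

	return order; /* last pushed element comes first */
}
```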


> I also note that now we do more/different locking in the waitqueue
> handling, because the code now takes both that rwlock _and_ the
> waitqueue spinlock for wakeup. That also makes me worried that the
> "waitqueue_active()" games are no longer reliable. I think they're
> fine (looks like they are only done under the write-lock, so it's
> effectively the same serialization anyway),


The only difference in waking up is that the same epollitem waitqueue
can be observed as active from different CPUs. The real wake up happens
only once (wake_up() takes wq.lock, so it should be fine to call it
multiple times), but 1 is returned for all callers of ep_poll_callback()
that have seen the wq as active.

If the epollitem is created with the EPOLLEXCLUSIVE flag, then the 1
returned from ep_poll_callback() indicates "break the loop, exclusive
wake up has happened" (the loop is in __wake_up_common). But even if we
consider this exclusive wake up case, it seems totally fine, because
wake up events are not lost: epollitem will scan all ready fds and will
eventually observe all of the callers (those that returned 1 from
ep_poll_callback()) as ready. I hope I did not miss anything.
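
For readers not familiar with that loop, below is a simplified
userspace model of how a non-zero return value from the callback ends
an exclusive wake up. It is a sketch and not the kernel code: the
types, the flag value and the function names are hypothetical, and only
the "break on exclusive wake up" logic is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_FLAG_EXCLUSIVE 0x01 /* hypothetical flag value */

/* Hypothetical, reduced model of a wait queue entry. */
struct wait_entry_sketch {
	unsigned int flags;
	int (*func)(struct wait_entry_sketch *self);
	int woken;
};

/*
 * Call each entry's callback in order. A non-zero return from an
 * exclusive entry consumes one of the nr_exclusive wake ups; when
 * none are left the loop stops. Returns the number of callbacks called.
 */
static int wake_common_sketch(struct wait_entry_sketch *entries, size_t n,
			      int nr_exclusive)
{
	int calls = 0;

	for (size_t i = 0; i < n; i++) {
		struct wait_entry_sketch *curr = &entries[i];
		int ret = curr->func(curr);

		calls++;
		if (ret < 0)
			break;
		if (ret && (curr->flags & SKETCH_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
	return calls;
}

/* Models a callback that saw an active waiter and reports a wake up. */
static int callback_active(struct wait_entry_sketch *self)
{
	self->woken = 1;
	return 1;
}

/* Models a callback that found nobody to wake. */
static int callback_idle(struct wait_entry_sketch *self)
{
	(void)self;
	return 0;
}

/* One idle and two active exclusive entries, one exclusive wake up. */
static int demo_exclusive_wake(void)
{
	struct wait_entry_sketch e[3] = {
		{ SKETCH_FLAG_EXCLUSIVE, callback_idle, 0 },
		{ SKETCH_FLAG_EXCLUSIVE, callback_active, 0 },
		{ SKETCH_FLAG_EXCLUSIVE, callback_active, 0 },
	};
	int calls = wake_common_sketch(e, 3, 1);

	/* The loop stopped at the second entry; the third was never called. */
	assert(e[1].woken == 1 && e[2].woken == 0);
	return calls;
}
```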


> but the upshot of all of
> this is that I *really* want others to look at this patch too. A lot
> of small subtle things here.

It would be great if someone could look through it; eventpoll.c looks
a bit abandoned.

--
Roman