Re: [RFC v3 1/2] epoll: avoid spinlock contention with wfcqueue

From: Arve Hjønnevåg
Date: Fri Mar 22 2013 - 18:27:42 EST


On Fri, Mar 22, 2013 at 12:24 PM, Eric Wong <normalperson@xxxxxxxx> wrote:
...
> Perhaps just using epitem->ws and removing ep->ws can work.
>
> I think the following change to keep wakeup_source in sync with
> epi->state is sufficient to prevent suspend.
>
> But I'm not familiar with suspend. Is it possible to suspend while
> a) spinning on a lock?
> b) holding a spinlock?
>

If you code cannot be preempted suspend will not complete. However, if
you drop the last wakeup_source that was active, you may trigger a new
suspend attempt, so it is best to avoid this.

> Since we avoid spinlocks in the main ep_poll_callback path, maybe the
> chance of entering suspend is reduced anyways since we may activate
> the ws sooner.
>
> What do you think?
>

I think you should make sure ep_poll_callback does not return without
an active wakeup_source if EPOLLWAKEUP is set.

> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 1e04175..531ad46 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -214,9 +214,6 @@ struct eventpoll {
> /* RB tree root used to store monitored fd structs */
> struct rb_root rbr;
>
> - /* wakeup_source used when ep_send_events is running */
> - struct wakeup_source *ws;
> -
> /* The user that created the eventpoll descriptor */
> struct user_struct *user;
>
> @@ -718,7 +715,6 @@ static void ep_free(struct eventpoll *ep)
> mutex_unlock(&epmutex);
> mutex_destroy(&ep->mtx);
> free_uid(ep->user);
> - wakeup_source_unregister(ep->ws);
> kfree(ep);
> }
>
> @@ -1137,12 +1133,6 @@ static int ep_create_wakeup_source(struct epitem *epi)
> const char *name;
> struct wakeup_source *ws;
>
> - if (!epi->ep->ws) {
> - epi->ep->ws = wakeup_source_register("eventpoll");
> - if (!epi->ep->ws)
> - return -ENOMEM;
> - }
> -
> name = epi->ffd.file->f_path.dentry->d_name.name;
> ws = wakeup_source_register(name);
>
> @@ -1390,22 +1380,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
> WARN_ON(state != EP_STATE_READY);
> wfcq_node_init(&epi->rdllink);
>
> - /*
> - * Activate ep->ws before deactivating epi->ws to prevent
> - * triggering auto-suspend here (in case we reactive epi->ws
> - * below).
> - *
> - * This could be rearranged to delay the deactivation of epi->ws
> - * instead, but then epi->ws would temporarily be out of sync
> - * with epi->state.
> - */
> - ws = ep_wakeup_source(epi);
> - if (ws) {
> - if (ws->active)
> - __pm_stay_awake(ep->ws);
> - __pm_relax(ws);
> - }
> -
> revents = ep_item_poll(epi, &pt);
>
> /*
> @@ -1419,7 +1393,6 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
> __put_user(epi->event.data, &uevent->data)) {
> wfcq_enqueue_local(&ep->txlhead, &ep->txltail,
> &epi->rdllink);
> - ep_pm_stay_awake(epi);
> if (!eventcnt)
> eventcnt = -EFAULT;
> break;
> @@ -1441,13 +1414,34 @@ static int ep_send_events(struct eventpoll *ep, bool *eavail,
> */
> wfcq_enqueue_local(&lthead, &lttail,
> &epi->rdllink);
> - ep_pm_stay_awake(epi);
> continue;
> }
> }
>
> /*
> - * reset item state for EPOLLONESHOT and EPOLLET
> + * Deactivate the wakeup source before marking it idle.
> + * The barrier implied by the spinlock in __pm_relax ensures
> + * any ep_poll_callback callers running will see the
> + * deactivated ws before epi->state == EP_STATE_IDLE.
> + *
> + * For EPOLLET, the event may still be merged into the one
> + * that is currently on its way into userspace, but it has
> + * always been the responsibility of userspace to trigger
> + * EAGAIN on the file before it expects the item to appear
> + * again in epoll_wait.
> + *
> + * Level Trigger never gets here, so the ws remains active.

I don't think this statement is true.

> + *
> + * EPOLLONESHOT will either be dropped by ep_poll_callback
> + * or dropped the next time ep_send_events is called, so the
> + * ws is irrelevant until it is hit by ep_modify
> + */
> + ws = ep_wakeup_source(epi);
> + if (ws)
> + __pm_relax(ws);
> +
> + /*
> + * reset item state for EPOLLONESHOT and EPOLLET.
> * no barrier here, rely on ep->mtx release for write barrier
> */
> epi->state = EP_STATE_IDLE;
>
> --
> Eric Wong



--
Arve Hjønnevåg
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/