Re: [PATCH] fs/epoll: make nesting accounting safe for -rt kernel

From: Andrew Morton
Date: Mon Feb 24 2020 - 19:38:40 EST


On Fri, 17 Jan 2020 14:16:47 -0500 Jason Baron <jbaron@xxxxxxxxxx> wrote:

> Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set
> ep_poll_safewake() can take several non-raw spinlocks after disabling
> pre-emption which is no no for the -rt kernel. So let's re-work how we
> determine the nesting level such that it plays nicely with -rt kernel.

"no no" isn't terribly informative, and knowledge of -rt's requirements
isn't widespread. Can we please spell this requirement out in full
detail, if only to spread the -rt education a bit?

> Let's introduce a 'nests' field in struct eventpoll that records the
> current nesting level during ep_poll_callback(). Then, if we nest again we
> can find the previous struct eventpoll that we were called from and
> increase our count by 1. The 'nests' field is protected by
> ep->poll_wait.lock.
>
> I've also moved napi_id field into a hole in struct eventpoll as part of
> introduing the nests field. This change reduces the struct eventpoll from
> 184 bytes to 176 bytes on x86_64 for the !CONFIG_DEBUG_LOCK_ALLOC
> production config.
>
> ...
>
> @@ -551,30 +557,43 @@ static int ep_call_nested(struct nested_calls *ncalls,
> */
> #ifdef CONFIG_DEBUG_LOCK_ALLOC
>
> -static DEFINE_PER_CPU(int, wakeup_nest);
> -
> -static void ep_poll_safewake(wait_queue_head_t *wq)
> +static void ep_poll_safewake(struct eventpoll *ep, struct epitem *epi)
> {
> + struct eventpoll *ep_src;
> unsigned long flags;
> - int subclass;
> + u8 nests = 0;
>
> - local_irq_save(flags);
> - preempt_disable();
> - subclass = __this_cpu_read(wakeup_nest);
> - spin_lock_nested(&wq->lock, subclass + 1);
> - __this_cpu_inc(wakeup_nest);
> - wake_up_locked_poll(wq, POLLIN);
> - __this_cpu_dec(wakeup_nest);
> - spin_unlock(&wq->lock);
> - local_irq_restore(flags);
> - preempt_enable();
> + /*
> + * If we are not being call from ep_poll_callback(), epi is
> + * NULL and we are at the first level of nesting, 0. Otherwise,
> + * we are being called from ep_poll_callback() and if a previous
> + * wakeup source is not an epoll file itself, we are at depth
> + * 1 since the wakeup source is depth 0. If the wakeup source
> + * is a previous epoll file in the wakeup chain then we use its
> + * nests value and record ours as nests + 1. The previous epoll
> + * file nests value is stable since its already holding its
> + * own poll_wait.lock.
> + */

Similarly, it would be helpful if this comment were to explain that
this code exists for -rt's requirements, and to briefly describe what
that requirement is.

> + if (epi) {
> + if ((is_file_epoll(epi->ffd.file))) {
> + ep_src = epi->ffd.file->private_data;
> + nests = ep_src->nests;
> + } else {
> + nests = 1;
> + }
> + }
> + spin_lock_irqsave_nested(&ep->poll_wait.lock, flags, nests);
> + ep->nests = nests + 1;
> + wake_up_locked_poll(&ep->poll_wait, EPOLLIN);
> + ep->nests = 0;
> + spin_unlock_irqrestore(&ep->poll_wait.lock, flags);
> }