Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write

From: Manfred Spraul
Date: Fri Dec 27 2024 - 13:39:28 EST


Hi,
(I had to remove many Ccs; my mail server refused to send the mail otherwise.)

On 12/26/24 9:11 PM, Oleg Nesterov wrote:
> I mostly agree, see my reply to this patch, but...
>
> On 12/26, Linus Torvalds wrote:
>> If the optimization is correct, there is no point to having a config option.
>>
>> If the optimization is incorrect, there is no point to having the code.
>>
>> Either way, there's no way we'd ever have a config option for this.
>
> Agreed,
>
>>> +	if (was_full && pipe_check_wq_has_sleeper(&pipe->wr_wait))
>>>  		wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
>>
>> End result: you need to explain why the race cannot exist.
>
> Agreed,
>
>> And I think your patch is simply buggy
>
> Agreed again, but probably my reasoning was wrong,
>
>> But now waiters use "wait_event_interruptible_exclusive()" explicitly
>> outside the pipe mutex, so the waiters and wakers aren't actually
>> serialized.
>
> This is what I don't understand... Could you spell ?
>
> I _think_ that
>
> 	wait_event_whatever(WQ, CONDITION);
>
> vs
>
> 	CONDITION = 1;
> 	if (wq_has_sleeper(WQ))
> 		wake_up_xxx(WQ, ...);
>
> is fine.

This pattern is documented in wait.h:

https://elixir.bootlin.com/linux/v6.12.6/source/include/linux/wait.h#L96

Thus, if there is an issue, then the documentation should be updated.
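
For reference, this is roughly the pairing that comment describes, as I
read it (WQ and CONDITION are just placeholders here, not the actual
pipe fields):

	/* waiter side: the open-coded equivalent of wait_event_whatever(WQ, CONDITION) */
	DEFINE_WAIT(wait);

	for (;;) {
		prepare_to_wait(&WQ, &wait, TASK_INTERRUPTIBLE);
		/* smp_mb() from set_current_state() inside prepare_to_wait() */
		if (CONDITION)
			break;
		schedule();
	}
	finish_wait(&WQ, &wait);

	/* waker side */
	CONDITION = 1;
	if (wq_has_sleeper(&WQ))	/* smp_mb() + waitqueue_active() */
		wake_up(&WQ);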

> Both wq_has_sleeper() and wait_event_whatever()->prepare_to_wait_event()
> have the necessary barriers to serialize the waiters and wakers.
>
> Damn I am sure I missed something ;)

Actually:

Does this work universally? I.e. can we add the optimization into __wake_up()? (The attached patch below is an attempt at that.)


But I do not understand this comment (from Linux 2.6.0):

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/kernel/fork.c?h=v2.6.0&id=e220fdf7a39b54a758f4102bdd9d0d5706aa32a7

/*
 * Note: we use "set_current_state()" _after_ the wait-queue add,
 * because we need a memory barrier there on SMP, so that any
 * wake-function that tests for the wait-queue being active
 * will be guaranteed to see waitqueue addition _or_ subsequent
 * tests in this thread will see the wakeup having taken place.
 *
 * The spin_unlock() itself is semi-permeable and only protects
 * one way (it only protects stuff inside the critical region and
 * stops them from bleeding out - it would still allow subsequent
 * loads to move into the the critical region).
 */
void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
	unsigned long flags;

	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
	spin_lock_irqsave(&q->lock, flags);
	if (list_empty(&wait->task_list))
		__add_wait_queue(q, wait);
	set_current_state(state);
	spin_unlock_irqrestore(&q->lock, flags);
}
set_current_state() now uses smp_store_mb(), which is a memory barrier _after_ the store. Thus I do not see what enforces that the store happens before the store for the __add_wait_queue().
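
For completeness: ignoring the debug hooks, set_current_state(state) is
smp_store_mb(current->__state, state), which on most architectures is a
WRITE_ONCE() followed by smp_mb() (x86 implements it with an xchg). So,
as I read it, the waiter side boils down to roughly:

	/* prepare_to_wait(), with set_current_state() expanded */
	spin_lock_irqsave(&wq_head->lock, flags);
	if (list_empty(&wq_entry->entry))
		__add_wait_queue(wq_head, wq_entry);	/* plain stores to the list */
	WRITE_ONCE(current->__state, state);		/* the store ... */
	smp_mb();					/* ... then the barrier */
	spin_unlock_irqrestore(&wq_head->lock, flags);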

--

    Manfred
From 7cb8f06ab022e7fc36bacbe65f654822147ec1aa Mon Sep 17 00:00:00 2001
From: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Date: Fri, 27 Dec 2024 18:21:41 +0100
Subject: [PATCH] __wake_up_common_lock: Optimize for empty wake queues

wake_up() must wake up every task that is in the wait queue.
But: if there is concurrency between wake_up() and add_wait_queue(),
then (as long as we are not violating the memory ordering rules) we
can just claim that wake_up() happened "first" and add_wait_queue()
happened afterwards.

From a memory ordering perspective:

- If the wait queue is empty, then wake_up() just does
	spin_lock();
	list_empty();
	spin_unlock();
- add_wait_queue()/prepare_to_wait() do all kinds of operations,
  but they may become visible only when the spin_unlock_irqrestore()
  is done.

Thus, instead of pairing the memory barrier with the spinlock, and
thus writing to a potentially shared cacheline, we load-acquire
the next pointer from the list.

Risks and side effects:

- The guaranteed memory barrier of wake_up() is reduced to a load-acquire.
  Previously, there was always a spin_lock()/spin_unlock() pair.
- prepare_to_wait() actually does two operations under the spinlock:
  it adds current to the wait queue, and it calls set_current_state().
  The comment above prepare_to_wait() is not clear to me, thus there
  might be further side effects.

Only lightly tested.
No benchmarks run.
---
kernel/sched/wait.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..10d02f652ab8 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -124,6 +124,23 @@ static int __wake_up_common_lock(struct wait_queue_head *wq_head, unsigned int m
 int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
 	      int nr_exclusive, void *key)
 {
+	if (list_empty(&wq_head->head)) {
+		struct list_head *pn;
+
+		/*
+		 * pairs with spin_unlock_irqrestore(&wq_head->lock);
+		 * We actually do not need to acquire wq_head->lock, we just
+		 * need to be sure that there is no prepare_to_wait() that
+		 * completed on any CPU before __wake_up was called.
+		 * Thus instead of load_acquiring the spinlock and dropping
+		 * it again, we load_acquire the next list entry and check
+		 * that the list is not empty.
+		 */
+		pn = smp_load_acquire(&wq_head->head.next);
+
+		if (pn == &wq_head->head)
+			return 0;
+	}
 	return __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);
 }
 EXPORT_SYMBOL(__wake_up);
--
2.47.1