Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full

From: K Prateek Nayak
Date: Tue Mar 04 2025 - 00:36:28 EST

Next message: Gavin Shan: "Re: [PATCH v7 24/45] KVM: arm64: Handle Realm PSCI requests"
Previous message: Greg KH: "Re: [PATCH V2] usb: xhci: Add debugfs support for xHCI port bandwidth"
In reply to: Hillf Danton: "Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full"
Next in thread: Hillf Danton: "Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello Hillf,

On 3/4/2025 10:36 AM, Hillf Danton wrote:

On Mon, 3 Mar 2025 15:16:34 +0530 "Sapkal, Swapnil" <swapnil.sapkal@xxxxxxx>

On 2/28/2025 10:03 PM, Oleg Nesterov wrote:

And... I know, I know you already hate me ;)

Not at all :)

but if you have time, could you check if this patch (with or without the
previous debugging patch) makes any difference? Just to be sure.

Sure, I will give this a try.

But in the meanwhile me and Prateek tried some of the experiments in the weekend.
We were able to reproduce this issue on a third generation EPYC system as well as
on an Intel Emerald Rapids (2 X INTEL(R) XEON(R) PLATINUM 8592+).

We tried heavy hammered tracing approach over the weekend on top of your debug patch.
I have attached the debug patch below. With tracing we found the following case for
pipe_writable():

hackbench-118768 [206] ..... 1029.550601: pipe_write: 000000005eea28ff: 0: 37 38 16: 1

Here,

head = 37
tail = 38
max_usage = 16
pipe_full() returns 1.

Between reading of head and later the tail, the tail seems to have moved ahead of the
head leading to wraparound. Applying the following changes I have not yet run into a
hang on the original machine where I first saw it:

diff --git a/fs/pipe.c b/fs/pipe.c
index ce1af7592780..a1931c817822 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -417,9 +417,19 @@ static inline int is_packetized(struct file *file)
/* Done while waiting without holding the pipe lock - thus the READ_ONCE() */
static inline bool pipe_writable(const struct pipe_inode_info *pipe)
{
- unsigned int head = READ_ONCE(pipe->head);
- unsigned int tail = READ_ONCE(pipe->tail);
unsigned int max_usage = READ_ONCE(pipe->max_usage);
+ unsigned int head, tail;
+
+ tail = READ_ONCE(pipe->tail);
+ /*
+ * Since the unsigned arithmetic in this lockless preemptible context
+ * relies on the fact that the tail can never be ahead of head, read
+ * the head after the tail to ensure we've not missed any updates to
+ * the head. Reordering the reads can cause wraparounds and give the
+ * illusion that the pipe is full.
+ */
+ smp_rmb();
+ head = READ_ONCE(pipe->head);
return !pipe_full(head, tail, max_usage) ||
!READ_ONCE(pipe->readers);
---

smp_rmb() on x86 is a nop and even without the barrier we were not able to
reproduce the hang even after 10000 iterations.

My $.02 that changes the wait condition.
Not sure it makes sense for you.

--- x/fs/pipe.c
+++ y/fs/pipe.c
@@ -430,7 +430,7 @@ pipe_write(struct kiocb *iocb, struct io
{
struct file *filp = iocb->ki_filp;
struct pipe_inode_info *pipe = filp->private_data;
- unsigned int head;
+ unsigned int head, tail;
ssize_t ret = 0;
size_t total_len = iov_iter_count(from);
ssize_t chars;
@@ -573,11 +573,13 @@ pipe_write(struct kiocb *iocb, struct io
* after waiting we need to re-check whether the pipe
* become empty while we dropped the lock.
*/
+ tail = pipe->tail;
mutex_unlock(&pipe->mutex);
if (was_empty)
wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
- wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
+ wait_event_interruptible_exclusive(pipe->wr_wait,
+ !READ_ONCE(pipe->readers) || tail != READ_ONCE(pipe->tail));

That could work too for the case highlighted but in case the head too
has moved by the time the writer wakes up, it'll lead to an extra
wakeup.

Linus' diff seems cleaner and seems to cover all racy scenarios.

mutex_lock(&pipe->mutex);
was_empty = pipe_empty(pipe->head, pipe->tail);
wake_next_writer = true;
--

--
Thanks and Regards,
Prateek

Next message: Gavin Shan: "Re: [PATCH v7 24/45] KVM: arm64: Handle Realm PSCI requests"
Previous message: Greg KH: "Re: [PATCH V2] usb: xhci: Add debugfs support for xHCI port bandwidth"
In reply to: Hillf Danton: "Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full"
Next in thread: Hillf Danton: "Re: [PATCH] pipe_read: don't wake up the writer if the pipe is still full"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]