Re: [RESEND PATCH] fs/pipe: Introduce a check to skip sleeping processes during pipe read/write

From: Mateusz Guzik
Date: Wed Dec 25 2024 - 12:23:17 EST


On Wed, Dec 25, 2024 at 5:32 PM Kent Overstreet
<kent.overstreet@xxxxxxxxx> wrote:
>
> On Wed, Dec 25, 2024 at 05:04:46PM +0100, Mateusz Guzik wrote:
> > On Wed, Dec 25, 2024 at 08:53:05AM -0500, Kent Overstreet wrote:
> > > On Wed, Dec 25, 2024 at 03:30:05PM +0200, Andy Shevchenko wrote:
> > > > Don't you think the Cc list is a bit overloaded?
> > >
> > > Indeed, my mail server doesn't let me reply-all.
> > >
> > > > On Wed, Dec 25, 2024 at 05:42:02PM +0800, WangYuli wrote:
> > > > > +config PIPE_SKIP_SLEEPER
> > > > > + bool "Skip sleeping processes during pipe read/write"
> > > > > + default n
> > > >
> > > > 'n' is the default 'default', no need to have this line.
> > >
> > > Actually, I'd say to skip the kconfig option for this. Kconfig options
> > > that affect the behaviour of core code increase our testing burden, and
> > > are another variable to account for when chasing down bugs, and the
> > > potential overhead looks negligable.
> > >
> >
> > I agree the behavior should not be guarded by an option. However,
> > because of how wq_has_sleeper is implemented (see below) I would argue
> > this needs to show how often locking can be avoided in real workloads.
> >
> > The commit message does state this comes with a slowdown for cases which
> > can't avoid wakeups, but as is I thought the submitter just meant an
> > extra branch.
> >
> > > Also, did you look at adding this optimization to wake_up()? No-op
> > > wakeups are very common, I think this has wider applicability.
> >
> > I was going to suggest it myself, but then:
> >
> > static inline bool wq_has_sleeper(struct wait_queue_head *wq_head)
> > {
> > /*
> > * We need to be sure we are in sync with the
> > * add_wait_queue modifications to the wait queue.
> > *
> > * This memory barrier should be paired with one on the
> > * waiting side.
> > */
> > smp_mb();
> > return waitqueue_active(wq_head);
> > }
> >
> > Which means this is in fact quite expensive.
> >
> > Since wakeup is a lock + an interrupt trip, it would still be
> > cheaper single-threaded to "merely" suffer a full fence and for cases
> > where the queue is empty often enough this is definitely the right thing
> > to do.
>
> We're comparing against no-op wakeup. A real wakeup does an IPI, which
> completely dwarfs the cost of a barrier.
>
> And note that wake_up() is spin_lock_irqsave(), not spin_lock(). I
> assume it's gotten better, but back when I was looking at waitqueues
> nested pushf/popf was horrifically expensive.
>
> But perhaps can we do this with just a release barrier? Similar to how
> list_empty_careful() works.
>
> > On the other hand this executing when the queue is mostly *not* empty
> > would combat the point.
> >
> > So unfortunately embedding this in wake_up is a no-go.
>
> You definitely can't say that without knowing how often no-op
> wake_up()s occur. It wouldn't be hard to gather that (write a patch to
> add a pair of percpu counters, throw it on a few machines running random
> workloads) and I think the results might surprise you.

There is some talking past each other here.

I explicitly noted one needs to check what happens in real workloads.

I very much expect there will be consumers where there are no waiters
almost every time and consumers which almost always do have them.

My claim is that this should be handled on a case-by-case basis.

So i whipped out a bpftrace one liner do take a look at the kernel
build, details at the end.

In terms of the total (0 == no waiters, 1 == waiters):
[0, 1) 600191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...) 457956 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |

There is some funzies in the vfs layer which I'm going to sort out myself.

The kernel is tags/next-20241220

As far as pipes go:

@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_read+385
]:
[0, 1) 10629 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

So this guy literally never had any waiters when wakeup was issued.
faddr2line claims line 405, which I presume is off by one:

401 if (was_full)
402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
403 if (wake_next_reader)
404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);
405 kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);

I'm guessing the real empty queue is rd_wait. Definitely a candidate
depending on other workloads, personally I would just patch it as is.

@[
wakeprobe+5
__wake_up_common+63
__wake_up+54
pipe_release+92
]:
[0, 1) 12540 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1, ...) 5330 |@@@@@@@@@@@@@@@@@@@@@@ |

a wash, would not touch that no matter what

@[
wakeprobe+5
__wake_up_common+63
__wake_up+54
pipe_release+110
]:
[0, 1) 17870 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

again no waiters, line claimed is 737, again off by one:
733 /* Was that the last reader or writer, but not the other side? */
734 if (!pipe->readers != !pipe->writers) {
735 │ wake_up_interruptible_all(&pipe->rd_wait);
736 │ wake_up_interruptible_all(&pipe->wr_wait);
737 │ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
738 │ kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);

so I presume wr_wait? same comment as the entry at claimed line 405

@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_write+773
]:
[0, 1) 22237 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

again no waiters, claimed line 606
604 if (wake_next_writer)
605 │ wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
606 if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {

again would be inclined to patch as is

@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_read+943
]:
[0, 1) 9488 |@@@@@@@@@@@@@ |
[1, ...) 35765 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

majority of the time there were waiters, would not touch regardless of
other workloads, line 403

401 if (was_full)
402 wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);
403 if (wake_next_reader)
404 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);

the wr_wait thing

@[
wakeprobe+5
__wake_up_common+63
__wake_up_sync_key+59
pipe_write+729
]:
[0, 1) 199929 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1, ...) 376586 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

ditto concerning not touching, resolved to line 603

601 if (was_empty || pipe->poll_usage)
602 │ wake_up_interruptible_sync_poll(&pipe->rd_wait,
EPOLLIN | EPOLLRDNORM);
603 kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
604 if (wake_next_writer)
605 wake_up_interruptible_sync_poll(&pipe->wr_wait,
EPOLLOUT | EPOLLWRNORM);

That is to say as far as this workload goes the submitted patch does
avoid some of the lock + irq trips by covering cases where there no
waiters seen in this workload, but also adds the smp_mb thing when it
does not help -- I would remove those spots from the submission.

While I agree a full service treatment would require a bunch of
different workloads, personally I would be inclined justify the change
merely based on the kernel build + leaving bpftrace running over some
a real-world box running random crap.

As for obtaining such info, I failed to convince bpftrace to check if
the waiter list is empty. Instead I resorted to adding a dedicated
func which grabs the bit and probing on that. The func is in a
different file because gcc decided to omit the call otherwise

one-liner:
bpftrace -e 'kprobe:wakeprobe { @[kstack(4)] = lhist(arg0, 0, 1, 1); }'

hack:
diff --git a/fs/file.c b/fs/file.c
index d868cdb95d1e..00d6a34eb174 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1442,3 +1442,11 @@ int iterate_fd(struct files_struct *files, unsigned n,
return res;
}
EXPORT_SYMBOL(iterate_fd);
+
+
+void wakeprobe(int count);
+
+void wakeprobe(int count)
+{
+}
+EXPORT_SYMBOL(wakeprobe);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 51e38f5f4701..8db7b0daf04b 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -57,6 +57,8 @@ void remove_wait_queue(struct wait_queue_head
*wq_head, struct wait_queue_entry
}
EXPORT_SYMBOL(remove_wait_queue);

+void wakeprobe(int count);
+
/*
* The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
* wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -77,6 +79,8 @@ static int __wake_up_common(struct wait_queue_head
*wq_head, unsigned int mode,

lockdep_assert_held(&wq_head->lock);

+ wakeprobe(wq_has_sleeper(wq_head));
+
curr = list_first_entry(&wq_head->head, wait_queue_entry_t, entry);

if (&curr->entry == &wq_head->head)


--
Mateusz Guzik <mjguzik gmail.com>