Re: [PATCH v2 1/2] fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write

From: Breno Leitao

Date: Sun May 24 2026 - 12:47:47 EST

Hello Mateusz,

On Sun, May 24, 2026 at 04:48:14PM +0200, Mateusz Guzik wrote:
> On Sun, May 24, 2026 at 4:30 PM Breno Leitao <leitao@xxxxxxxxxx> wrote:
> >
> > On Sat, May 23, 2026 at 06:26:27PM +0200, Oleg Nesterov wrote:
> > > > @@ -566,7 +661,9 @@ anon_pipe_write(struct kiocb *iocb, struct iov_iter *from)
> > > > * after waiting we need to re-check whether the pipe
> > > > * become empty while we dropped the lock.
> > > > */
> > > > + anon_pipe_refill_tmp_pages(pipe, &prealloc);
> > > > mutex_unlock(&pipe->mutex);
> > > > + anon_pipe_free_pages(&prealloc);
> > >
> > > Do we really want to call anon_pipe_free_pages() at this point?
> > >
> > > The main loop will continue when pipe_writable() becomes true again...
> >
> > I went back and forth on this. The argument for freeing was that
> > wait_event_interruptible_exclusive() can sleep arbitrarily long (slow or
> > stopped reader), and holding up the prealloc pages felt antisocial --
> > especially under the memory pressure this series targets, where those pages are
> > more useful on the freelists than parked on a sleeping task.
> >
> > On the other side, on wakeup the loop is guaranteed to want pages again, and
> > re-entering the allocator under the mutex puts us back in the contended state
> > the patch removes. For any write() large enough to wait mid-syscall (which is
> > the workload patch 2/2 measures), keeping them strictly wins on throughput /
> > p99.
> >
>
> You can still prealloc after wakeup for whatever remainder you got
> though, but I can agree dropping these frees is a sensible way out and
> it is easier and I'm not going to insist on one way or the other.

Ack. I've sent a v3 with anon_pipe_free_pages() and
anon_pipe_refill_tmp_pages() dropped.

> However, I think it would be prudent to add a tracepoint to some
> machines on your fleet to find out how often they allocate pages under
> the mutex (and for what i/o size). Initial alloc for the first write <
> PAGE_SIZE definitely happens under the mutex which is probably not a
> problem, but for anything later?

> The tracepoint can have a trivial
> indicator if this is the first write if that matters. One can

Isn't this what I reported earlier?

https://lore.kernel.org/all/ag3Ty3T24wjn1aFw@xxxxxxxxx/

Adding a tracepoint is harder than usual here, since a kernel rollout takes
ages.

But I hacked up a bpftrace script and ran it on a random sample of fleet hosts
(5 minutes each).

As reported earlier, multi-page pipe writes are not uncommon: on one
host a single long-running process produced 196,476 under-mutex alloc_page()
calls in 5 minutes, with allocs-per-write distributions reaching 16+ -- exactly
the pattern this patch removes.
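
For anyone who wants to poke at the multi-page case locally, here is a
minimal userspace sketch of one such write. It is not the fleet workload
and not part of the series; pipe_write_pages() is a hypothetical helper
name, and the assumption that each page of payload costs one page allocation
inside the kernel write path is mine, so treat it as illustrative only.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Illustrative only: issue a single write() of npages pages into a fresh
 * pipe. Returns the number of bytes the pipe accepted, or -1 on error.
 */
static ssize_t pipe_write_pages(size_t npages)
{
	int fds[2];
	long page_size = sysconf(_SC_PAGESIZE);
	ssize_t ret = -1;
	char *buf;

	if (page_size <= 0 || npages == 0)
		return -1;
	if (pipe(fds) < 0)
		return -1;

	/* Non-blocking, so an over-sized write returns short instead of sleeping. */
	if (fcntl(fds[1], F_SETFL, O_NONBLOCK) < 0)
		goto out;

	buf = calloc(npages, (size_t)page_size);
	if (!buf)
		goto out;

	/* One syscall carrying npages pages of payload. */
	ret = write(fds[1], buf, npages * (size_t)page_size);
	free(buf);
out:
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

Calling pipe_write_pages(16) while tracing page allocations in the pipe write
path should show whether the allocations for that one syscall land under the
mutex on a given kernel.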

Most hosts sit at the boring ~20-30 allocs/sec dominated by one-page
first-writes that the patch's `total_len <= PAGE_SIZE` early-return skips
anyway, so the win is concentrated on the workloads that actually need it.
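
To make the split concrete, here is a rough sketch of the size check. Only
the `total_len <= PAGE_SIZE` early return comes from the patch; the helper
name prealloc_pages_wanted() and the round-up rule for larger writes are
assumptions for illustration, and the real code may cap or count differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the patch: estimate how many pages a
 * write of total_len bytes would want preallocated outside pipe->mutex.
 */
static size_t prealloc_pages_wanted(size_t total_len, size_t page_size)
{
	assert(page_size > 0);

	/* Mirrors the early return: writes of at most one page skip prealloc. */
	if (total_len <= page_size)
		return 0;

	/* Assumption: one page per page_size chunk, rounded up. */
	return total_len / page_size + (total_len % page_size != 0);
}
```

With a 4096-byte page, a 4096-byte write asks for nothing, while a 64 KiB
write asks for 16 pages, which is the allocs-per-write range that showed up
on the busy host.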

None of the allocs hit reclaim during the trace I ran, but under memory
pressure I would expect direct reclaim to happen with the lock held.

Thanks for the review and direction,
--breno