Re: [PATCH v2 3/5] mm/shmem: introduce copy_zero_to_iter() for large zeroing

From: David Laight

Date: Mon Jun 01 2026 - 18:03:36 EST

On Mon, 1 Jun 2026 17:02:01 +0200
Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:

> On Mon, Jun 01, 2026 at 02:22:04PM +0100, Matthew Wilcox wrote:
> > On Mon, Jun 01, 2026 at 01:57:02PM +0800, Chi Zhiling wrote:
> > > Currently, holes larger than PAGE_SIZE cannot be handled because
> > > ZERO_PAGE is limited to a single page. Add copy_zero_to_iter() as a
> > > wrapper to support copying larger zero ranges to the iterator.
> >
> > I think Hugh put this optimisation in the wrong place, and you're
> > perpetuating that ;-)
> >
> > So perhaps we can start by moving this optimisation to lib/iov_iter.c?
> > And then you can redo your optimisation on top of that.
> >
> > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > index 243662af1af7..06c54d719fcd 100644
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -451,7 +451,23 @@ static __always_inline
> > size_t zero_to_user_iter(void __user *iter_to, size_t progress,
> > size_t len, void *priv, void *priv2)
> > {
> > - return clear_user(iter_to, len);
> > + /*
> > + * it is noticeably faster to copy the zero page instead of
> > + * calling clear_user(). Shame.
> > + */
>
>
> This is a rather suspicious claim. If clear_user is indeed so terrible
> that it is faster to copy, the routine needs to get unfucked instead of
> the problem being worked around.
>
> I can't speak for arm64 or other non-amd64 archs, maybe these are
> horrendously broken.
>
> On amd64 some archeology shows the following:
> 1. 0db7058e8e23e6bb ("x86/clear_user: Make it faster")
>
> 2022 vintage, replaces thoroughly terrible 8-byte per-iteration write
> with rep stos usage
>
> 2. 8c9b6a88b7e2f33c ("x86: improve on the non-rep 'clear_user' function")
>
> inlines rep stosb at the callsite if the CPU has FSRS, otherwise
> falls back to a new routine which does 64-byte writes per loop iteration.
>
> FSRS is reasonably popular by now and chances are decent the test jig
> used by Chi has it.
>
> For a size like 4096 bytes, the 64-byte loop will be slower than rep
> movsb and even rep stosq. This needs to be patched and maybe I'll get
> around to doing the needful(tm) in a few days (it's not hard to write, but
> some care with testing is needed).

I think Intel CPUs from Sandy Bridge onwards execute 'rep stosb' just
as fast as 'rep stosq'.
(I'm sure I've done the measurements for 'rep movs', and 'rep stos'
ought to be similar.)
I suspect you get the same big gain (twice as fast) from an aligned
destination (IIRC 64 bytes on later CPUs).
(But I doubt it is worth the cost of aligning the destination.)
The source alignment (for 'rep movs') makes no difference at all.

The 'elephant in the room' is older Zen CPUs.
Some of those are nowhere near as fast as you might expect.
(Look at the issues using 'rep movsb' for all copies.)

I can test a range of old Intel CPUs, but not AMD ones.
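
For anyone wanting to repeat this kind of measurement, a minimal
userspace sketch might look like the following. It is an illustration
only: it is not the kernel's clear_user() and not code from this
thread. The helper names (zero_stosb, zero_stosq, avg_ns) are made up,
and non-x86-64 builds simply fall back to memset().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: zero len bytes with 'rep stosb'. */
static void zero_stosb(void *dst, size_t len)
{
#if defined(__x86_64__)
	/* rdi = destination, rcx = byte count, al = fill value */
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(len)
			 : "a"(0)
			 : "memory");
#else
	memset(dst, 0, len);	/* fallback, not a 'rep stos' measurement */
#endif
}

/* Hypothetical helper: zero len bytes with 'rep stosq' (len % 8 == 0). */
static void zero_stosq(void *dst, size_t len)
{
	assert(len % 8 == 0);
#if defined(__x86_64__)
	size_t qwords = len / 8;

	/* rdi = destination, rcx = qword count, rax = fill value */
	__asm__ volatile("rep stosq"
			 : "+D"(dst), "+c"(qwords)
			 : "a"(0UL)
			 : "memory");
#else
	memset(dst, 0, len);	/* fallback, not a 'rep stos' measurement */
#endif
}

/* Average wall-clock nanoseconds per call of fn(dst, len). */
static double avg_ns(void (*fn)(void *, size_t), void *dst, size_t len,
		     int iters)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++)
		fn(dst, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
		(double)(t1.tv_nsec - t0.tv_nsec)) / iters;
}
```

Calling avg_ns() for both helpers on a 4096-byte buffer, once with a
64-byte-aligned destination and once with a deliberately misaligned
one, should give the numbers to compare. Wall-clock timing like this is
noisy, so cycle counters and a pinned CPU would be preferable for
anything that ends up in a commit message.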

-- David

>
> I could not be bothered to check how the workaround showed up, but it
> definitely needs to be removed as opposed to being perpetuated.
>