Re: [PATCH v2 3/5] mm/shmem: introduce copy_zero_to_iter() for large zeroing

From: Mateusz Guzik

Date: Mon Jun 01 2026 - 11:22:24 EST

On Mon, Jun 01, 2026 at 05:02:01PM +0200, Mateusz Guzik wrote:
> On Mon, Jun 01, 2026 at 02:22:04PM +0100, Matthew Wilcox wrote:
> > On Mon, Jun 01, 2026 at 01:57:02PM +0800, Chi Zhiling wrote:
> > > Currently, holes larger than PAGE_SIZE cannot be handled because
> > > ZERO_PAGE is limited to a single page. Add copy_zero_to_iter() as a
> > > wrapper to support copying larger zero ranges to the iterator.
> >
> > I think Hugh put this optimisation in the wrong place, and you're
> > perpetuating that ;-)
> >
> > So perhaps we can start by moving this optimisation to lib/iov_iter.c?
> > And then you can redo your optimisation on top of that.
> >
> > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > index 243662af1af7..06c54d719fcd 100644
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -451,7 +451,23 @@ static __always_inline
> > size_t zero_to_user_iter(void __user *iter_to, size_t progress,
> > size_t len, void *priv, void *priv2)
> > {
> > - return clear_user(iter_to, len);
> > + /*
> > + * it is noticeably faster to copy the zero page instead of
> > + * calling clear_user(). Shame.
> > + */
>
>
> This is a rather suspicious claim. If clear_user is indeed so terrible
> that it is faster to copy, the routine needs to get unfucked instead of
> the problem being worked around.
>
> I can't speak for arm64 or other non-amd64 archs, maybe these are
> horrendously broken.
>
> On amd64 some archeology shows the following:
> 1. 0db7058e8e23e6bb ("x86/clear_user: Make it faster")
>
> 2022 vintage, replaces thoroughly terrible 8-byte per-iteration write
> with rep stos usage
>
> 2. 8c9b6a88b7e2f33c ("x86: improve on the non-rep 'clear_user' function")
>
> inlines rep stosb at the callsite if the CPU has FSRS, otherwise
> falls back to a new routine which does 64-byte writes per loop iteration.
>
> FSRS is reasonably popular by now and chances are decent the test jig
> used by Chi has it.
>
> For a size like 4096 bytes, the 64-byte loop will be slower than rep
> movsb and even rep stosq. This needs to be patched and maybe I'll get
> around to doing the needful(tm) in a few days (it's not hard to write, but
> some care with testing is needed).
>
> I could not be bothered to check how the workaround showed up, but it
> definitely needs to be removed as opposed to being perpetuated.

Maybe I was not clear enough, so here it is stated differently:

The routine was incredibly bad on amd64 prior to 2022, and I presume the
claim of bad performance predates the fixup. The current implementation
for FSRS-enabled CPUs is fine and I guarantee it won't be slower than
copying. For CPUs without FSRS the fallback is slower than it needs to
be for sizes like 4096 bytes, which I'm going to fix soon(tm).

No matter what, there is no justification for issuing a copy from the
zero page just to avoid zeroing in place.
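
To make the comparison concrete, here is a rough userspace sketch of the
two approaches. It is only an analogy: memset() stands in for
clear_user(), memcpy() from a static zeroed buffer stands in for copying
from the zero page, and every sketch_* name plus the 4096-byte page size
is made up for illustration. It shows nothing about actual kernel
performance, only that the copy variant has to read a source page on top
of doing the same stores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size, hypothetical name */

/* Userspace stand-in for the zero page: one page worth of zeroes. */
static const unsigned char sketch_zero_page[SKETCH_PAGE_SIZE];

/* Zero in place: stores only, no source to read. Returns bytes zeroed. */
static size_t sketch_zero_in_place(unsigned char *dst, size_t len)
{
	assert(dst != NULL);
	memset(dst, 0, len);
	return len;
}

/*
 * The workaround pattern: copy from the zero page, at most one page
 * per iteration, because the source is only a single page long.
 */
static size_t sketch_zero_by_copy(unsigned char *dst, size_t len)
{
	size_t done = 0;

	assert(dst != NULL);
	while (done < len) {
		size_t chunk = len - done;

		if (chunk > SKETCH_PAGE_SIZE)
			chunk = SKETCH_PAGE_SIZE;
		memcpy(dst + done, sketch_zero_page, chunk);
		done += chunk;
	}
	return done;
}

/* Returns 1 if every byte in buf[0..len) is zero, 0 otherwise. */
static int sketch_all_zero(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i])
			return 0;
	return 1;
}
```

Both helpers leave the destination in the same state, so the only
question is which one gets there cheaper, and that has to be measured
against the real clear_user() on the CPU in question rather than
guessed.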