Re: [PATCH v2 3/5] mm/shmem: introduce copy_zero_to_iter() for large zeroing

From: Mateusz Guzik

Date: Mon Jun 01 2026 - 11:05:41 EST

On Mon, Jun 01, 2026 at 02:22:04PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 01, 2026 at 01:57:02PM +0800, Chi Zhiling wrote:
> > Currently, holes larger than PAGE_SIZE cannot be handled because
> > ZERO_PAGE is limited to a single page. Add copy_zero_to_iter() as a
> > wrapper to support copying larger zero ranges to the iterator.
>
> I think Hugh put this optimisation in the wrong place, and you're
> perpetuating that ;-)
>
> So perhaps we can start by moving this optimisation to lib/iov_iter.c?
> And then you can redo your optimisation on top of that.
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 243662af1af7..06c54d719fcd 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -451,7 +451,23 @@ static __always_inline
> size_t zero_to_user_iter(void __user *iter_to, size_t progress,
> size_t len, void *priv, void *priv2)
> {
> - return clear_user(iter_to, len);
> + /*
> + * it is noticeably faster to copy the zero page instead of
> + * calling clear_user(). Shame.
> + */

This is a rather suspicious claim. If clear_user is indeed so terrible
that it is faster to copy, the routine needs to get unfucked instead of
the problem being worked around.

I can't speak for arm64 or other non-amd64 archs, maybe these are
horrendously broken.

On amd64 some archeology shows the following:
1. 0db7058e8e23e6bb ("x86/clear_user: Make it faster")

2022 vintage, replaces a thoroughly terrible 8-bytes-per-iteration write
loop with rep stos usage

2. 8c9b6a88b7e2f33c ("x86: improve on the non-rep 'clear_user' function")

inlines rep stosb at the callsite if the CPU has FSRS, otherwise
falls back to a new routine which does 64-byte writes per loop iteration.
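
To make the shape of that fallback concrete, here is a rough userspace
sketch in C. It is an illustration only, not the kernel routine: the real
thing is hand-written assembly with exception-table fixups for faulting
user addresses, and the name zero_64_per_iter() is made up for this mail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace illustration only. Fault handling on user
 * addresses, which the real fallback has to do, is not modelled here.
 */
static size_t zero_64_per_iter(void *dst, size_t len)
{
	unsigned char *p = dst;
	const uint64_t zero = 0;

	/* Main loop: eight 8-byte stores, i.e. 64 bytes per iteration. */
	while (len >= 64) {
		for (int i = 0; i < 8; i++)
			memcpy(p + 8 * i, &zero, sizeof(zero));
		p += 64;
		len -= 64;
	}

	/* Tail: remaining 8-byte words, then single bytes. */
	while (len >= 8) {
		memcpy(p, &zero, sizeof(zero));
		p += 8;
		len -= 8;
	}
	while (len > 0) {
		*p++ = 0;
		len--;
	}

	/* Same convention as clear_user(): bytes NOT cleared, 0 on success. */
	return 0;
}
```

For a 4096-byte buffer the main loop in this sketch runs 64 times, whereas
the FSRS path is a single rep stosb.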

FSRS is reasonably popular by now and chances are decent the test jig
used by Chi has it.

For a size like 4096 bytes, the 64-byte loop will be slower than rep
movsb and even rep stosq. This needs to be patched and maybe I'll get
around to doing the needful(tm) in a few days (it's not hard to write, but
some care with testing is needed).

I could not be bothered to check how the workaround showed up, but it
definitely needs to be removed as opposed to being perpetuated.