Re: [PATCH] x86-64: fix memset() to support sizes of 4Gb and above
From: Ingo Molnar
Date: Fri Jan 06 2012 - 06:07:22 EST
* Jan Beulich <JBeulich@xxxxxxxx> wrote:
> While currently there doesn't appear to be any reachable
> in-tree case where such large memory blocks may be passed to
> memset() (alloc_bootmem() being the primary non-reachable one,
> as it gets called with suitably large sizes in FLATMEM
> configurations), we have recently hit the problem a second
> time in our Xen kernels. Rather than working around it a
> second time, prevent others from falling into the same trap by
> fixing this long-standing limitation.
>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
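
For context, the limitation being fixed: the length was carried in
32-bit registers (%edx/%ecx in the old code quoted below), so the
upper half of a >= 4Gb size was silently dropped. A minimal C sketch
of the effect - hypothetical, not code from the patch:

  #include <stdint.h>
  #include <string.h>

  /*
   * Hypothetical illustration of the old behaviour: the count was
   * effectively truncated to its low 32 bits.
   */
  void old_memset_effect(void *dst, int c, uint64_t len)
  {
          /* len = 0x100000010 (4Gb + 16) cleared only 16 bytes */
          memset(dst, c, (uint32_t)len);
  }
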
Have you checked the before/after size of the hotpath?
The patch suggests that it got shorter by 3 instructions:
> - movl %edx,%r8d
> - andl $7,%r8d
> - movl %edx,%ecx
> - shrl $3,%ecx
> + movq %rdx,%rcx
> + andl $7,%edx
> + shrq $3,%rcx
[...]
> movq %rdi,%r10
> - movq %rdx,%r11
>
[...]
> - movl %r11d,%ecx
> - andl $7,%ecx
> + andl $7,%edx
Is that quick impression correct? (The first hunk replaces four
instructions with three, the second drops the %r11 save, and the
third replaces two instructions with one - hence the net -3.) I have
not tried building or measuring it.
Also, note that we have some cool instrumentation tech upstream:
lately we've added a way to measure the *kernel*'s memcpy routine's
performance in user-space, using perf bench:
$ perf bench mem memcpy -r x86
# Running mem/memcpy benchmark...
Unknown routine:x86
Available routines...
default ... Default memcpy() provided by glibc
x86-64-unrolled ... unrolled memcpy() in arch/x86/lib/memcpy_64.S
$ perf bench mem memcpy -r x86-64-unrolled
# Running mem/memcpy benchmark...
# Copying 1MB Bytes ...
1.888902 GB/Sec
15.024038 GB/Sec (with prefault)
This builds the routine in arch/x86/lib/memcpy_64.S and measures its
bandwidth in both the cache-cold and the cache-hot case.
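
The idea behind the cold vs. prefaulted measurement can be sketched
in plain C - this is just the concept, not perf's actual code,
applied here to memset since that's the routine in question:

  /* build: cc -O2 bench.c (add -lrt on older glibc) */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>

  /* volatile fn pointer keeps the compiler from eliding the calls */
  static void *(*volatile do_memset)(void *, int, size_t) = memset;

  static double now(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec + ts.tv_nsec / 1e9;
  }

  int main(void)
  {
          size_t len = 1 << 20;           /* 1MB, as above */
          char *buf = malloc(len);
          double t;

          t = now();
          do_memset(buf, 0, len);         /* cold: pages fault in here */
          t = now() - t;
          printf("%f GB/Sec\n", len / t / 1e9);

          do_memset(buf, 0, len);         /* prefault pass */
          t = now();
          do_memset(buf, 0, len);         /* hot: caches/TLB warm */
          t = now() - t;
          printf("%f GB/Sec (with prefault)\n", len / t / 1e9);

          free(buf);
          return 0;
  }
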
It would be nice to add support for arch/x86/lib/memset_64.S as
well, and to look at its before/after performance. In user-space we
can make much more accurate measurements of this kind:
$ perf stat --repeat 1000 -e instructions -e cycles perf bench mem memcpy -r x86-64-unrolled >/dev/null
Performance counter stats for 'perf bench mem memcpy -r x86-64-unrolled' (1000 runs):
4,924,378 instructions # 0.84 insns per cycle ( +- 0.03% )
5,892,603 cycles # 0.000 GHz ( +- 0.06% )
0.002208000 seconds time elapsed ( +- 0.10% )
Note how the confidence interval of the measurement is in the 0.03%
range; thus even a single-cycle change in overhead caused by your
patch should be measurable. Note that you'll need to rebuild perf
with the different memset routines.
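
The rebuild boils down to adding entries to a table that maps -r
names to functions. A rough, hypothetical sketch of the shape - the
actual perf sources organize this differently:

  #include <stddef.h>
  #include <string.h>

  /* naive reference implementation, purely for illustration */
  static void *memset_byteloop(void *s, int c, size_t n)
  {
          char *p = s;

          while (n--)
                  *p++ = (char)c;
          return s;
  }

  struct memset_routine {
          const char *name;       /* matched against the -r argument */
          const char *desc;
          void *(*fn)(void *s, int c, size_t n);
  };

  static const struct memset_routine routines[] = {
          { "default",  "memset() provided by glibc", memset },
          { "byteloop", "byte-at-a-time reference",   memset_byteloop },
  };
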
For example, the kernel's memcpy routine is slightly faster than
glibc's:
$ perf stat --repeat 1000 -e instructions -e cycles perf bench mem memcpy -r default >/dev/null
Performance counter stats for 'perf bench mem memcpy -r default' (1000 runs):
4,927,173 instructions # 0.83 insns per cycle ( +- 0.03% )
5,928,168 cycles # 0.000 GHz ( +- 0.06% )
0.002157349 seconds time elapsed ( +- 0.10% )
If such measurements all suggest equal or better performance, and if
there's no erratum in current CPUs that would make 4G string copies
dangerous [which your research suggests should be fine], I have no
objection in principle to this patch.
I'd not be surprised (at all) if other OSes did larger-than-4GB
memsets in certain circumstances.
Thanks,
Ingo