x86: Static optimisations for copy_user

From: Chris Wilson
Date: Thu Jun 01 2017 - 02:59:20 EST


I was looking at the overhead of drmIoctl() in a microbenchmark that
repeatedly did a copy_from_user(.size=8) followed by a
copy_to_user(.size=8) as part of DRM_IOCTL_I915_GEM_BUSY. I found that
if I force-inlined the get_user/put_user instead, the walltime of the
ioctl improved by about 20%. If copy_user_generic_unrolled was used
instead of copy_user_enhanced_fast_string, performance of the
microbenchmark improved by 10%. Benchmarking on a few machines:

(Broadwell)
benchmark_copy_user(hot):
  size  unrolled  string  fast-string
     1       158      77           79
     2       306     154          158
     4       614     308          317
     6       926     462          476
     8      1344     298          635
    12      1773     482          952
    16      2797     602         1269
    24      4020     903         1906
    32      5055    1204         2540
    48      6150    1806         3810
    64      9564    2409         5082
    96     13583    3612         6483
   128     18108    4815         8434

(Broxton)
benchmark_copy_user(hot):
  size  unrolled  string  fast-string
     1       270      52           53
     2       364     106          109
     4       460     213          218
     6       486     305          312
     8      1250     253          437
    12      1009     332          625
    16      2059     514          897
    24      2624     672         1071
    32      3043    1014         1750
    48      3620    1499         2561
    64      7777    1971         3333
    96      7499    2876         4772
   128      9999    3733         6088

which says that for this cache-hot case the rep mov microcode
noticeably underperforms. Though once we pass a few cachelines, and
definitely after exceeding the L1 cache, rep mov is the clear winner.
From cold, there is no difference in the timings.
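
For concreteness, the hot path being measured amounts to nothing more
than an ioctl ferrying an 8-byte struct in and out of the kernel. A
minimal standalone reproduction would look roughly like this (a sketch
only: error handling is omitted, and fd/handle are assumed to come
from the usual DRM device and GEM object setup):

#include <stdint.h>
#include <xf86drm.h>
#include <drm/i915_drm.h>

/* struct drm_i915_gem_busy is 8 bytes: copy_from_user() on entry,
 * copy_to_user() on exit, on every call.
 */
static void busy_loop(int fd, uint32_t handle, unsigned long reps)
{
	struct drm_i915_gem_busy busy = { .handle = handle };

	while (reps--)
		drmIoctl(fd, DRM_IOCTL_I915_GEM_BUSY, &busy);
}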

I can improve the microbenchmark either by force-inlining the
raw_copy_*_user switches, or by switching to copy_user_generic_unrolled.
Both leave a sour taste. The switch is too big to be inlined, and if
called out-of-line the function call overhead negates its benefits.
Switching between fast-string and unrolled makes a presumption about
behaviour that holds here (cache-hot, tiny copies) but not in general.
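
For reference, the kernel currently makes that choice once, globally:
copy_user_generic() is patched at boot to call one of the three
variants according to CPU features. Roughly, from
arch/x86/include/asm/uaccess_64.h (quoted from memory, so details may
vary between kernel versions):

static __always_inline __must_check unsigned long
copy_user_generic(void *to, const void *from, unsigned len)
{
	unsigned ret;

	/*
	 * If the CPU has ERMS, use copy_user_enhanced_fast_string;
	 * otherwise, if it has REP_GOOD, use copy_user_generic_string;
	 * otherwise fall back to copy_user_generic_unrolled.
	 */
	alternative_call_2(copy_user_generic_unrolled,
			   copy_user_generic_string, X86_FEATURE_REP_GOOD,
			   copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
			   ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
				       "=d" (len)),
			   "1" (to), "2" (from), "3" (len)
			   : "memory", "rcx", "r8", "r9", "r10", "r11");
	return ret;
}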

In the end, I limited this series to just adding a few extra
translations for statically known copy_*_user() sizes; a rough sketch
of the shape is below.
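
raw_copy_from_user() already turns a few constant sizes into direct
moves via its __builtin_constant_p() switch; the series extends that
switch with additional small sizes. Hand-waving the exact macros (this
shows the shape, not the patches verbatim):

static __always_inline __must_check unsigned long
raw_copy_from_user(void *dst, const void __user *src, unsigned long size)
{
	int ret = 0;

	if (!__builtin_constant_p(size))
		return copy_user_generic(dst, (__force void *)src, size);

	switch (size) {
	case 1:
		__uaccess_begin();
		__get_user_asm_nozero(*(u8 *)dst, (u8 __user *)src,
				      ret, "b", "b", "=q", 1);
		__uaccess_end();
		return ret;
	case 8:
		__uaccess_begin();
		__get_user_asm_nozero(*(u64 *)dst, (u64 __user *)src,
				      ret, "q", "", "=r", 8);
		__uaccess_end();
		return ret;
	/*
	 * ... the series adds a few more statically known sizes here,
	 * so that they too compile down to direct moves rather than a
	 * call to copy_user_generic().
	 */
	default:
		return copy_user_generic(dst, (__force void *)src, size);
	}
}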
-Chris