Re: [PATCH] x86/uaccess: use unrolled string copy for short strings

From: Paolo Abeni
Date: Thu Jun 22 2017 - 13:02:57 EST


On Thu, 2017-06-22 at 10:47 +0200, Ingo Molnar wrote:
> * Paolo Abeni <pabeni@xxxxxxxxxx> wrote:
>
> > The 'rep' prefix carries a significant setup cost; as a result,
> > for short strings, copies with unrolled loops are faster than even
> > the optimized string copy using the 'rep' variant.
> >
> > This change updates __copy_user_generic() to use the unrolled
> > version for short strings. The threshold length for short
> > strings, 64 bytes, was selected empirically as the
> > largest value that still ensures a measurable gain.
> >
> > A micro-benchmark of __copy_from_user() with different lengths shows
> > the following:
> >
> > string len  vanilla  patched  delta
> > bytes       ticks    ticks    ticks(%)
> >
> > 0           58       26       32(55%)
> > 1           49       29       20(40%)
> > 2           49       31       18(36%)
> > 3           49       32       17(34%)
> > 4           50       34       16(32%)
> > 5           49       35       14(28%)
> > 6           49       36       13(26%)
> > 7           49       38       11(22%)
> > 8           50       31       19(38%)
> > 9           51       33       18(35%)
> > 10          52       36       16(30%)
> > 11          52       37       15(28%)
> > 12          52       38       14(26%)
> > 13          52       40       12(23%)
> > 14          52       41       11(21%)
> > 15          52       42       10(19%)
> > 16          51       34       17(33%)
> > 17          51       35       16(31%)
> > 18          52       37       15(28%)
> > 19          51       38       13(25%)
> > 20          52       39       13(25%)
> > 21          52       40       12(23%)
> > 22          51       42       9(17%)
> > 23          51       46       5(9%)
> > 24          52       35       17(32%)
> > 25          52       37       15(28%)
> > 26          52       38       14(26%)
> > 27          52       39       13(25%)
> > 28          52       40       12(23%)
> > 29          53       42       11(20%)
> > 30          52       43       9(17%)
> > 31          52       44       8(15%)
> > 32          51       36       15(29%)
> > 33          51       38       13(25%)
> > 34          51       39       12(23%)
> > 35          51       41       10(19%)
> > 36          52       41       11(21%)
> > 37          52       43       9(17%)
> > 38          51       44       7(13%)
> > 39          52       46       6(11%)
> > 40          51       37       14(27%)
> > 41          50       38       12(24%)
> > 42          50       39       11(22%)
> > 43          50       40       10(20%)
> > 44          50       42       8(16%)
> > 45          50       43       7(14%)
> > 46          50       43       7(14%)
> > 47          50       45       5(10%)
> > 48          50       37       13(26%)
> > 49          49       38       11(22%)
> > 50          50       40       10(20%)
> > 51          50       42       8(16%)
> > 52          50       42       8(16%)
> > 53          49       46       3(6%)
> > 54          50       46       4(8%)
> > 55          49       48       1(2%)
> > 56          50       39       11(22%)
> > 57          50       40       10(20%)
> > 58          49       42       7(14%)
> > 59          50       42       8(16%)
> > 60          50       46       4(8%)
> > 61          50       47       3(6%)
> > 62          50       48       2(4%)
> > 63          50       48       2(4%)
> > 64          51       38       13(25%)
> >
> > Above 64 bytes the gain fades away.
> >
> > Very similar values are collected for __copy_to_user().
> > UDP receive performance under flood with small packets using recvfrom()
> > increases by ~5%.
>
> What CPU model(s) were used for the performance testing and was it performance
> tested on several different types of CPUs?
>
> Please add a comment here:
>
> + if (len <= 64)
> + return copy_user_generic_unrolled(to, from, len);
> +
>
> ... because it's not obvious at all that this is a performance optimization, not a
> correctness issue. Also explain that '64' is a number that we got from performance
> measurements.
>
> But in general I like the change - as long as it was measured on reasonably modern
> x86 CPUs. I.e. it should not regress on modern Intel or AMD CPUs.

Thank you for reviewing this.

I'll add a (hopefully) descriptive comment in v2.
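
As a rough user-space illustration of the kind of comment being asked
for (this is not the v2 patch), the sketch below puts the explanation
next to the length check. The names copy_user_sketch(),
copy_unrolled_stub(), copy_rep_stub() and SHORT_COPY_THRESHOLD are
hypothetical stand-ins; only the "len <= 64" test and the reasoning in
the comment come from this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name for the threshold quoted in the patch (64 bytes). */
enum { SHORT_COPY_THRESHOLD = 64 };

/*
 * Hypothetical stand-in for copy_user_generic_unrolled(): a plain byte
 * loop. Returns the number of bytes NOT copied, which is always 0 here.
 */
static unsigned long copy_unrolled_stub(void *to, const void *from,
                                        unsigned long len)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    while (len--)
        *d++ = *s++;
    return 0;
}

/* Hypothetical stand-in for the 'rep'-based variants; uses memcpy(). */
static unsigned long copy_rep_stub(void *to, const void *from,
                                   unsigned long len)
{
    memcpy(to, from, len);
    return 0;
}

unsigned long copy_user_sketch(void *to, const void *from, unsigned long len)
{
    /*
     * Performance optimization only, not a correctness fix: the 'rep'
     * prefix has a setup cost that dominates for short copies, so use
     * the unrolled loop there. 64 is an empirical value, the largest
     * length that still showed a measurable gain in the benchmarks
     * posted with the patch.
     */
    if (len <= SHORT_COPY_THRESHOLD)
        return copy_unrolled_stub(to, from, len);

    return copy_rep_stub(to, from, len);
}
```

Both paths copy the same bytes; the branch only selects which routine
does the work, which is the point the comment is meant to make explicit.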

The above figures are for an Intel Xeon E5-2690 v4.
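
For reading the quoted table: the delta column looks like vanilla minus
patched ticks, with the percentage taken relative to vanilla and
truncated to an integer. That reading is inferred from the figures
(for example 58 and 26 giving 32(55%)), it is not stated in the patch.
A small sketch under that assumption, with the hypothetical names
struct copy_delta and compute_delta():

```c
/* Hypothetical helper type: one "delta" cell of the benchmark table. */
struct copy_delta {
    int ticks;    /* vanilla - patched */
    int percent;  /* gain relative to vanilla, truncated toward zero */
};

/*
 * Assumed derivation of the delta column; inferred from the posted
 * numbers, not taken from the benchmark source.
 */
static struct copy_delta compute_delta(int vanilla, int patched)
{
    struct copy_delta d;

    d.ticks = vanilla - patched;
    d.percent = vanilla > 0 ? d.ticks * 100 / vanilla : 0;
    return d;
}
```

With this, compute_delta(49, 29) gives 20 ticks and 40%, matching the
row for length 1.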

I see similar data points with an i7-6500U CPU, while an i7-4810MQ
shows slightly larger improvements.

I'm in the process of collecting more figures for AMD processors, which
I don't have readily at hand - it may take some time.

Thanks,

Paolo