Re: [x86] d55564cfc2: will-it-scale.per_thread_ops -5.8% regression

From: Linus Torvalds
Date: Thu Jan 07 2021 - 12:45:03 EST


On Thu, Jan 7, 2021 at 5:32 AM kernel test robot <oliver.sang@xxxxxxxxx> wrote:
>
> FYI, we noticed a -5.8% regression of will-it-scale.per_thread_ops due to commit:

Ok, that's noticeable.

And:

> commit: d55564cfc222326e944893eff0c4118353e349ec ("x86: Make __put_user() generate an out-of-line call")

Yeah, that wasn't supposed to cause any performance regressions. No
core code should use __put_user() so much.

But:

> | testcase: change | will-it-scale: will-it-scale.per_process_ops -7.3% regression |
> | test machine | 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=process |
> | | nr_task=100% |
> | | test=poll2 |
> | | ucode=0x16 |

Ok, it's poll(), and it's definitely the __put_user() there:

> 0.00 +1.8 1.77 ą 3% perf-profile.children.cycles-pp.__put_user_nocheck_2
> 0.00 +1.6 1.64 ą 3% perf-profile.self.cycles-pp.__put_user_nocheck_2

And in fact, it's that final "write back the 16-bit revents field" at the end.

Which must have sucked before too, because it used to do a "stac/clac"
for every word - but now it does it out of line.

The fix is to convert that loop to use "unsafe_put_user()" with the
necessary accoutrements around it, and that should speed things up
quite nicely. The (double) loop itself is actually just 14
instructions, it's ridiculous how bad the code used to be, and how
much better it is with the nice unsafe_put_user(). The whole double
loop ends up being just

lea 0x68(%rsp),%rsi
mov %rcx,%rax
1: mov 0x8(%rsi),%ecx
lea 0xc(%rsi),%rdx
test %ecx,%ecx
je 3f
lea (%rax,%rcx,8),%rdi
2: movzwl 0x6(%rdx),%ecx
mov %cx,0x6(%rax)
add $0x8,%rax
add $0x8,%rdx
cmp %rdi,%rax
jne 2b
3: mov (%rsi),%rsi
test %rsi,%rsi
jne 1b

with the attached patch.

Before, it would do the whole CLAC/STAC dance inside that loop for
every entry (and with that commit d55564cfc22 it would be a function
call, of course).

Can you verify that this fixes the regression (and in fact I'd expect
it to improve that test-case)?

NOTE! The patch is entirely untested. I verified that the code
generation now looks sane, and it all looks ObviouslyCorrect(tm) to
me, but mistakes happen and maybe I missed some detail..

Linus

Attachment: patch
Description: Binary data