RE: [PATCH 09/31] x86/entry/32: Leave the kernel via trampoline stack

From: David Laight
Date: Sat Feb 10 2018 - 10:26:40 EST


From: Denys Vlasenko
> Sent: 09 February 2018 17:17
> On 02/09/2018 06:05 PM, Linus Torvalds wrote:
> > On Fri, Feb 9, 2018 at 1:25 AM, Joerg Roedel <joro@xxxxxxxxxx> wrote:
> >> +
> >> + /* Copy over the stack-frame */
> >> + cld
> >> + rep movsb
> >
> > Ugh. This is going to be horrendous. Maybe not noticeable on modern
> > CPU's, but the whole 32-bit code is kind of pointless on a modern CPU.
> >
> > At least use "rep movsl". If the kernel stack isn't 4-byte aligned,
> > you have issues.

The alignment doesn't matter; 'rep movsl' will still work.

> Indeed, "rep movs" has some setup overhead that makes it undesirable
> for small sizes. In my testing, moving less than 128 bytes with "rep movs"
> is a loss.

It very much depends on the cpu.

Recent (Haswell?) Intel cpus have hardware support for optimising 'rep movsb'
for cached memory locations, so that it is fast regardless of alignment.
The setup cost is fairly small.
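
For reference, that support is advertised via CPUID as the ERMS ("enhanced
rep movsb/stosb") feature bit, which the kernel tracks as X86_FEATURE_ERMS.
A minimal user-space sketch of checking for it (the helper name is mine,
not anything from the patch), using gcc's <cpuid.h>:

#include <cpuid.h>
#include <stdbool.h>

/* ERMS is CPUID.(EAX=7,ECX=0):EBX bit 9. */
static bool cpu_has_erms(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (ebx >> 9) & 1;
}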

The previous generation had an optimisation for 'rep movsb' copies of fewer
than 7 bytes, but for larger counts the setup cost was significantly higher.
On those cpus you needed to use 'rep movsd' (or 'rep movsq' where 64-bit
operands are available, which is best) for the bulk of a copy.

Actually, instead of using 'rep movsb' to copy the odd trailing bytes, for
memcpy() you can copy the last (possibly misaligned) 8 bytes first and then
use 'rep movsd' for the bulk of the copy.
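
Roughly like this (a sketch only - my own helper name, assuming a
non-overlapping copy of at least 8 bytes and gcc-style inline asm):

#include <stddef.h>
#include <string.h>

static void copy_tail_first(void *dst, const void *src, size_t len)
{
	size_t dwords = len / 4;

	/* Copy the last (possibly misaligned) 8 bytes up front. */
	memcpy((char *)dst + len - 8, (const char *)src + len - 8, 8);

	/*
	 * Bulk copy in 4-byte units.  The dword count rounds down, so
	 * the 0-3 trailing bytes it misses were already written above.
	 */
	asm volatile("cld\n\t"
		     "rep movsl"
		     : "+D" (dst), "+S" (src), "+c" (dwords)
		     : : "memory");
}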

On the Netburst P4 the setup cost for any 'rep movs' was something like 45 clocks,
so you really didn't want to use them for short copies.
(A C compiler from a well-known OS supplier will 'optimise' any copy loop
into 'rep movsb' - not entirely the best of optimisations!)

I also managed to match the per-cycle cost of 'rep movsl' with a copy
loop on my Athlon-700 (though not the setup cost; on a P4 I might have
beaten the setup cost as well).

David