Re: [PATCH 4/5] x86: entry_64.S: always allocate complete "struct pt_regs"

From: Andy Lutomirski
Date: Mon Aug 04 2014 - 17:03:50 EST


On Mon, Aug 4, 2014 at 11:28 PM, Denys Vlasenko
<vda.linux@xxxxxxxxxxxxxx> wrote:
> On Fri, Aug 1, 2014 at 7:04 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Fri, Aug 1, 2014 at 7:48 AM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>>> 64-bit code was using six stack slots fewer by not saving/restoring
>>> registers which are callee-preserved according to the C ABI,
>>> and by not allocating space for them.
>>
>> This is great.
>>
>> Next up: remove FIXUP/RESTORE_TOP_OF_STACK? :) Maybe I'll give that a shot.
>
> I'm still at the "what does that stuff do anyway?" stage, and at
> "why do we need the percpu old_rsp thingy?" in particular.

On x86_64, the syscall instruction has no effect on rsp. That means
that the entry point starts out with no kernel stack: rsp still holds
the user's value. There are no free registers whatsoever at the entry
point.

That means that the entry code needs to do swapgs, stash rsp somewhere
relative to gs, and then load the kernel's rsp. old_rsp is the spot
used for this.
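
As a rough illustration, here is a toy C model of those three steps.
It is not kernel code: the struct and function names are made up for
the sketch, and gs is modeled as a plain pointer to a per-CPU area.

```c
#include <stdint.h>

/* Toy model only: all names here are illustrative, not kernel symbols. */
struct percpu_area {
    uint64_t old_rsp;       /* scratch slot for the user stack pointer */
    uint64_t kernel_stack;  /* top of this CPU's kernel stack */
};

struct cpu_model {
    uint64_t rsp;                  /* current stack pointer */
    struct percpu_area *gs;        /* active gs base */
    struct percpu_area *shadow_gs; /* inactive gs base, exchanged by swapgs */
};

/* Models swapgs: exchange the active and inactive gs bases. */
static void model_swapgs(struct cpu_model *cpu)
{
    struct percpu_area *tmp = cpu->gs;
    cpu->gs = cpu->shadow_gs;
    cpu->shadow_gs = tmp;
}

/* Models the first steps of syscall entry as described above. */
static void model_syscall_entry(struct cpu_model *cpu)
{
    model_swapgs(cpu);                /* per-CPU data becomes reachable via gs */
    cpu->gs->old_rsp = cpu->rsp;      /* stash the user rsp relative to gs */
    cpu->rsp = cpu->gs->kernel_stack; /* switch to the kernel stack */
}
```

Note that the model needs no temporary register for the stash and the
load, which is the whole point of having a gs-relative slot.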

Now the kernel does an optimization that is, I think, very much not
worth it. The kernel doesn't bother sticking the old rsp value into
pt_regs (saving two instructions on fast path entries) and doesn't
initialize the SS, CS, RCX, and EFLAGS fields in pt_regs, saving four
more instructions.

To make this optimization work, the whole FIXUP/RESTORE_TOP_OF_STACK
dance is needed, and there's the usersp crap in the context switch
code, and current_user_stack_pointer(), and probably even more crap
that I haven't noticed. And I sure hope that nothing in the *compat*
syscall path touches current_user_stack_pointer(), because the compat
code doesn't seem to use old_rsp.
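
To make the cost concrete, here is a toy C model of what a fixup has
to do before anything can trust the full frame. It is a sketch under
assumptions, not the real macro: the struct is not the real pt_regs
layout, the selector constants are placeholders, the model assumes the
fast path already stored rip and r11, and it fills the rcx slot with
the return address only because syscall leaves the user rip in rcx.
The real code may store something else there.

```c
#include <stdint.h>

/* Placeholder user selectors; assumed values for illustration only. */
enum { TOY_USER_CS = 0x33, TOY_USER_DS = 0x2b };

/* Toy frame with only the fields discussed above. */
struct toy_regs {
    uint64_t r11;     /* assumed stored on entry; syscall put user rflags here */
    uint64_t rip;     /* assumed stored on entry; user return address */
    uint64_t rcx;     /* skipped on the fast path */
    uint64_t cs;      /* skipped on the fast path */
    uint64_t eflags;  /* skipped on the fast path */
    uint64_t rsp;     /* skipped on the fast path */
    uint64_t ss;      /* skipped on the fast path */
};

/* Fill in every field the fast path skipped. */
static void toy_fixup_top_of_stack(struct toy_regs *regs, uint64_t old_rsp)
{
    regs->rsp    = old_rsp;      /* user rsp lives only in the per-CPU slot */
    regs->ss     = TOY_USER_DS;
    regs->cs     = TOY_USER_CS;
    regs->rcx    = regs->rip;    /* modeling choice, see the note above */
    regs->eflags = regs->r11;    /* syscall saved rflags in r11 */
}
```

Every slow path that looks at the frame has to remember to run
something like this first, and that is where the fragility comes from.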

I think this should all be ripped out. The only real difficulty will
be that the sysret code needs to restore rsp itself, so the sysret
path will end up needing two more instructions. Removing all of the
TOP_OF_STACK stuff will add ten instructions to fast path syscalls,
and I wouldn't be surprised if this adds considerably fewer than ten
cycles on any modern chip.
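
A toy C model of the alternative, with made-up names and a single
variable standing in for the per-CPU slot: copy the user rsp into the
task's own frame on entry, and have the sysret path reload it from
there. This only sketches the idea and is not a patch.

```c
#include <stdint.h>

/* Stands in for the rsp slot of a task's saved register frame. */
struct toy_task {
    uint64_t saved_user_rsp;
};

/* Stands in for the per-CPU scratch slot; shared by all tasks on a CPU. */
static uint64_t toy_old_rsp;

/* Entry: stash rsp as before, then copy it into the task's frame. */
static void toy_entry(struct toy_task *task, uint64_t user_rsp)
{
    toy_old_rsp = user_rsp;               /* still needed: no free register */
    task->saved_user_rsp = toy_old_rsp;   /* the extra work on entry */
}

/* Exit: the sysret path takes rsp from the frame, not from the slot. */
static uint64_t toy_sysret_rsp(const struct toy_task *task)
{
    return task->saved_user_rsp;          /* the extra work on exit */
}
```

In this model a second task entering on the same CPU overwrites
toy_old_rsp, and the first task's return value is unaffected, so the
context switch has nothing to save or restore for it.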

(It's too bad that there's no unlocked xchg; this could be faster if
we had one. It's also too bad that the syscall ABI didn't choose some
register to unconditionally set to zero, which would have given us the
single scratch register we'd need to avoid this whole mess in the
first place.)

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC