Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss

From: Andy Lutomirski
Date: Thu Apr 23 2015 - 04:50:24 EST


On Thu, Apr 23, 2015 at 12:37 AM, Brian Gerst <brgerst@xxxxxxxxx> wrote:
> On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
> <tipbot@xxxxxxxxx> wrote:
>> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Author: Denys Vlasenko <dvlasenk@xxxxxxxxxx>
>> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
>> Committer: Ingo Molnar <mingo@xxxxxxxxxx>
>> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>>
>> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>>
>> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>>
>> On 64-bit kernels, the data segment is the same for 32-bit and
>> 64-bit userspace, and the SYSRET instruction loads %ss with its
>> selector.
>>
>> So there's no need to repeat it by hand. Segment loads are somewhat
>> expensive: tens of cycles.
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@xxxxxxxxxx>
>> [ Removed unnecessary comment. ]
>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>> Cc: Alexei Starovoitov <ast@xxxxxxxxxxxx>
>> Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>> Cc: Borislav Petkov <bp@xxxxxxxxx>
>> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
>> Cc: H. Peter Anvin <hpa@xxxxxxxxx>
>> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
>> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
>> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
>> Cc: Will Drewry <wad@xxxxxxxxxxxx>
>> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@xxxxxxxxxx
>> Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
>> ---
>> arch/x86/vdso/vdso32/syscall.S | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
>> index 5415b56..6b286bb 100644
>> --- a/arch/x86/vdso/vdso32/syscall.S
>> +++ b/arch/x86/vdso/vdso32/syscall.S
>> @@ -19,8 +19,6 @@ __kernel_vsyscall:
>> .Lpush_ebp:
>> movl %ecx, %ebp
>> syscall
>> - movl $__USER32_DS, %ecx
>> - movl %ecx, %ss
>> movl %ebp, %ecx
>> popl %ebp
>> .Lpop_ebp:
>
> This patch unfortunately is causing Wine to break on some applications:
>
> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
> Register dump:
> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
> ESI:00000040 EDI:7ffd4000
> Stack dump:
> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
> Backtrace:
> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
> 1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
> 2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
> in ntdll (0x00aed648)
> 3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
> 4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
> NumberOfThreadsReleased=0x0(nil))
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
> in ntdll (0x00aed7c8)
> 5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
> in kernel32 (0x00aed7e8)
> 6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
> ...
>
> __kernel_vsyscall+0x7 points to "pop %ebp".
>
> This is on an AMD Phenom(tm) II X6 1055T Processor.
>
> It appears that there are some subtle differences in how sysretl works
> on AMD vs. Intel. According to the Intel docs, the SS selector and
> descriptor cache are completely reset by sysret to fixed values. The
> AMD docs however are concerning:

My understanding is that, in long mode, the segment attributes are
ignored, and that there is no such thing as a "64-bit stack". So...

>
> AMD's syscall:
> SS.sel = MSR_STAR.SYSCALL_CS + 8
> SS.attr = 64-bit stack,dpl0

I don't really believe that.

> SS.base = 0x00000000
> SS.limit = 0xFFFFFFFF
>
> AMD's sysret:
> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
> // SS base, limit, attributes unchanged.
>

I'm pretty sure that this is at least a little bit wrong. It makes no
sense to me for syscall to set SS.DPL=0 and for sysret to leave
SS.DPL=0. It had better at least change DPL to 3. (Except... don't
they mean RPL? Why is the DPL cached at all? But RPL is clearly
changed, since it's part of the selector.)

> Not changing base or limit is no big deal, but not changing attributes
> could be the problem. It might be leaving the "64-bit stack"
> attribute set, for whatever that means.

Hmm. I don't know if I believe that explanation. For one thing, the
APM says "Executing SYSRET in non-64-bit mode or with a 16- or 32-bit
operand size returns to 32-bit mode with a 32-bit stack pointer."

We can revert this patch or fix it, but I'd like to at least try to
understand what's wrong first. Borislav, any ideas?

I'm curious whether we can somehow end up in the kernel without a
sensible SS. What happens if we have SS = 0?

Try this on for size:

1. Wine process does syscall
2. Context switch to any other task
3. Interrupt (software or hardware), which loads SS with ss0, which is
0 on x86_64.
4. Context switch back to Wine.
5. sysretl

Would fixing this be as simple as changing this code in
arch/x86/kernel/process.c:

__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
	.x86_tss = {
		.sp0 = TOP_OF_INIT_STACK,
#ifdef CONFIG_X86_32
		.ss0 = __KERNEL_DS,

by moving the ifdef down a line? Even if that fixed it, it would be
extremely fragile, but IMO it would be a good change to make
regardless (i.e. the kernel's SS would be more predictable).

In any event, if we really need the SS reload, I'd rather reload it in
kernel space before sysret than in user space after sysret, so we
never run user code with whatever screwed up SS hidden part we have
here. Unless, of course, Xen would corrupt it for us.

>
> Reloading SS from the GDT would obviously reset any bad state left by
> sysretl. Unfortunately we may have to put it back in, and then NOP it
> out on Intel.

At least this is easy now that alternatives work in the vdso.

--Andy