Re: [tip:x86/vdso] x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
From: Denys Vlasenko
Date: Thu Apr 23 2015 - 05:21:16 EST
On 04/23/2015 09:37 AM, Brian Gerst wrote:
> On Tue, Mar 31, 2015 at 8:38 AM, tip-bot for Denys Vlasenko
> <tipbot@xxxxxxxxx> wrote:
>> Commit-ID: e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Gitweb: http://git.kernel.org/tip/e7d6eefaaa443130079d73cd05039d90b3db7a4a
>> Author: Denys Vlasenko <dvlasenk@xxxxxxxxxx>
>> AuthorDate: Fri, 27 Mar 2015 11:48:17 -0700
>> Committer: Ingo Molnar <mingo@xxxxxxxxxx>
>> CommitDate: Tue, 31 Mar 2015 10:45:15 +0200
>>
>> x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
>>
>> This vDSO code only gets used by 64-bit kernels, not 32-bit ones.
>>
>> On 64-bit kernels, the data segment is the same for 32-bit and
>> 64-bit userspace, and the SYSRET instruction loads %ss with its
>> selector.
>>
>> So there's no need to repeat it by hand. Segment loads are somewhat
>> expensive: tens of cycles.
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@xxxxxxxxxx>
>> [ Removed unnecessary comment. ]
>> Signed-off-by: Andy Lutomirski <luto@xxxxxxxxxx>
>> Cc: Alexei Starovoitov <ast@xxxxxxxxxxxx>
>> Cc: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>> Cc: Borislav Petkov <bp@xxxxxxxxx>
>> Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
>> Cc: H. Peter Anvin <hpa@xxxxxxxxx>
>> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
>> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
>> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
>> Cc: Will Drewry <wad@xxxxxxxxxxxx>
>> Link: http://lkml.kernel.org/r/63da6d778f69fd0f1345d9287f6764d58be519fa.1427482099.git.luto@xxxxxxxxxx
>> Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
>> ---
>> arch/x86/vdso/vdso32/syscall.S | 2 --
>> 1 file changed, 2 deletions(-)
>>
>> diff --git a/arch/x86/vdso/vdso32/syscall.S b/arch/x86/vdso/vdso32/syscall.S
>> index 5415b56..6b286bb 100644
>> --- a/arch/x86/vdso/vdso32/syscall.S
>> +++ b/arch/x86/vdso/vdso32/syscall.S
>> @@ -19,8 +19,6 @@ __kernel_vsyscall:
>> .Lpush_ebp:
>> movl %ecx, %ebp
>> syscall
>> - movl $__USER32_DS, %ecx
>> - movl %ecx, %ss
>> movl %ebp, %ecx
>> popl %ebp
>> .Lpop_ebp:
>
> This patch unfortunately is causing Wine to break on some applications:
>
> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
> Register dump:
> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
> ESI:00000040 EDI:7ffd4000
> Stack dump:
> 0x00aed60c: 00aed648 f7575e5b 7bcc8000 00000000
> 0x00aed61c: 7bc7bc09 00000010 00aed750 00000040
> 0x00aed62c: 00aed750 00aed650 7bcc8000 7bc7bbdd
> 0x00aed63c: 7bcc8000 00aed6a0 00aed750 00aed738
> 0x00aed64c: 7bc7cfa9 00000011 00aed750 00000040
> 0x00aed65c: 00000020 00000000 00000000 7bc4f141
> Backtrace:
> =>0 0xf779bc07 __kernel_vsyscall+0x7() in [vdso].so (0x00aed750)
> 1 0xf7575e5b __libc_read+0x4a() in libpthread.so.0 (0x00aed648)
> 2 0x7bc7bc09 read_reply_data+0x38(buffer=0xaed750, size=0x40)
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/server.c:239]
> in ntdll (0x00aed648)
> 3 0x7bc7cfa9 wine_server_call+0x178() in ntdll (0x00aed738)
> 4 0x7bc840ec NtSetEvent+0x4b(handle=0x80,
> NumberOfThreadsReleased=0x0(nil))
> [/home/bgerst/src/wine/wine32/dlls/ntdll/../../../dlls/ntdll/sync.c:361]
> in ntdll (0x00aed7c8)
> 5 0x7b874afa SetEvent+0x24(handle=<couldn't compute location>)
> [/home/bgerst/src/wine/wine32/dlls/kernel32/../../../dlls/kernel32/sync.c:572]
> in kernel32 (0x00aed7e8)
> 6 0x0044e31a in battle.net launcher (+0x4e319) (0x00aed818)
> ...
>
> __kernel_vsyscall+0x7 points to "pop %ebp".
>
> This is on an AMD Phenom(tm) II X6 1055T Processor.
>
> It appears that there are some subtle differences in how sysretl works
> on AMD vs. Intel. According to the Intel docs, the SS selector and
> descriptor cache is completely reset by sysret to fixed values. The
> AMD docs however are concerning:
>
> AMD's syscall:
> SS.sel = MSR_STAR.SYSCALL_CS + 8
> SS.attr = 64-bit stack,dpl0
> SS.base = 0x00000000
> SS.limit = 0xFFFFFFFF
>
> AMD's sysret:
> SS.sel = MSR_STAR.SYSRET_CS + 8 // SS selector is changed,
> // SS base, limit, attributes unchanged.
> Not changing base or limit is no big deal, but not changing attributes
> could be the problem. It might be leaving the "64-bit stack"
> attribute set, for whatever that means.
I am not aware of any officially existing "64-bit" stack or
data segment attribute. x86 data segment descriptors
don't have any such bits, the 64-bitness of stack operations
in long mode is hardwired. (Unlike code segment descriptors,
which _do_ have a bit which controls 64-bitness).
This is not to say that CPU internally is prohibited from having
something along those lines.
However, if AMD CPUs would have a bug where after sysretl %ss
descriptor cache is left in a bad state causing stack ops to be
done in 64-bit fashion, *any* 32-bit userspace would immediately explode.
This is not the case.
What Wine could do differently from a typical Linux executable?
It may use nonzero %ss base, it may use a non-4Gb limit,
it may use 16-bit stack segment, it may use an expand-down stack segment.
(I know very little about Windows/Wine internals, so I just listed
all possibilities which came to mind).
Looking at the error message:
> Unhandled exception: stack overflow in 32-bit code (0xf779bc07).
> Register dump:
> CS:0023 SS:002b DS:002b ES:002b FS:0063 GS:006b
> EIP:f779bc07 ESP:00aed60c EBP:00aed750 EFLAGS:00010216( R- -- I -A-P- )
> EAX:00000040 EBX:00000010 ECX:00aed750 EDX:00000040
> ESI:00000040 EDI:7ffd4000
it is not coming from Wine itself, looks like it's from Windows code,
and I'd guess it just tells us that they got exception 12,
without further information on the cause.
I don't have any solid theories here, so I'd just brainstorm.
* what if %ss before syscall was NOT the usual value of 0x2b, but some
other segment, not the typical 0-base, 0xffffffff limit 32-bit expand-up one?
Not restoring proper %ss would not go well.
[but then, Intel CPUs work, and old code worked....]
* what if there is indeed a hidden "64-bit mode" bit in %ss descriptor
cache which makes stack ops use %rsp, not %esp; and %rsp somehow
has nonzero high half? Then, in compat mode, CPU checks segment limit
on every memory reference including POP (which it doesn't do in 64-bit mode)
and we got exception 12. [But we do:
movl RSP(%rsp),%esp
USERGS_SYSRET32
which ensures that %rsp never has upper bits set...]
* what if in kernel (64-bmode) we load %ss descriptor cache with a segment
whose limit is too small or with outright invalid segment? In 64-bit mode,
%ss is not actually checked by CPU. For example, nested exceptions/interrupts
even auto-load %ss with zero selector - and stack ops continue to work fine
in 64-bit mode. But as soon as we return to non-64-bit world, CPU starts
checking data sergemnt limits and validity - and if cached %ss descriptor
was made invalid while in 64-bit mode and was not changed by SYSRETL,
POP will fail.
> Reloading SS from the GDT would obviously reset any bad state left by
> sysretl. Unfortunately we may have to put it back in, and then NOP it
> out on Intel.
Sure we can, but let's try to get a better understanding
what's going on here.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/