Re: [PATCH 17/17] x86: simplify iret stack handling on SYSCALL64 fastpath

From: Denys Vlasenko
Date: Tue Aug 12 2014 - 05:22:06 EST


On 08/11/2014 10:06 PM, Andy Lutomirski wrote:
> On Mon, Aug 11, 2014 at 5:24 AM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>> On 08/11/2014 12:42 AM, Andy Lutomirski wrote:
>>> On Mon, Aug 11, 2014 at 12:00 AM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>>>> On 08/09/2014 12:59 AM, Andy Lutomirski wrote:
>>>>>> + * When returning through fast path, userspace sees rcx = return address,
>>>>>> + * r11 = rflags. When returning through iret (e.g. if audit is active),
>>>>>> + * these registers may contain garbage.
>>>>>> + * For ptrace we manage to avoid that: when we hit slow path on entry,
>>>>>> + * we do save rcx and r11 in pt_regs, so ptrace on exit also sees them.
>>>>>> + * If slow path is entered only on exit, there will be garbage.
>>>>>
>>>>> I don't like this. At least the current code puts something
>>>>> deterministic in there (-1) for the slow path, even though it's wrong
>>>>> and makes the slow path behave visibly differently from the fast path.
>>>>>
>>>>> Leaking uninitialized data here is extra bad, though. Keep in mind
>>>>> that, when a syscall entry is interrupted before fully setting up
>>>>> pt_regs, the interrupt frame overlaps task_pt_regs, so it's possible,
>>>>> depending on the stack slot ordering, for a kernel secret
>>>>> (kernel_stack?) to end up somewhere in task_pt_regs.
>>>>
>>>> It's easy to fix. We jump off fast path to slow path here:
>>>>
>>>> movl TI_flags+THREAD_INFO(%rsp,SIZEOF_PTREGS),%edx
>>>> andl %edi,%edx
>>>> jnz sysret_careful
>>>>
>>>> This is the only use of "sysret_careful" label.
>>>> Therefore, there we don't need to think about any other scenarios
>>>> besides "we are returning from syscall here".
>>>
>>> I may be missing something here (on vacation, not really testing
>>> things, no big monitor, etc), but how is this compatible with things
>>> like rt_sigreturn? rt_sigreturn is call from the fastpath, via a
>>> stub, and it returns through int_ret_from_syscall. The C part needs
>>> to modify all the regs, and those regs need to stick, so fixing up rcx
>>> and r11 after rt_sigreturn can't work.
>>
>> Code at "sysret_careful" label is only reachable
>> on fast path return.
>> We don't go down this code path after rt_sigreturn.
>> stub_rt_sigreturn manually steers into iret code path instead:
>>
>> ENTRY(stub_rt_sigreturn)
>> CFI_STARTPROC
>> addq $8, %rsp
>> DEFAULT_FRAME 0
>> SAVE_EXTRA_REGS
>> STORE_IRET_FRAME_CS_SS
>> call sys_rt_sigreturn
>> movq %rax,RAX(%rsp)
>> RESTORE_EXTRA_REGS
>> jmp int_ret_from_sys_call <==== NOTE THIS
>>
>> So, we don't do any rcx/r11 fixups after sys_rt_sigreturn.
>
> Oh, right. rt_sigreturn overwrites all regs, so it doesn't need a
> fixup in advance.
>
> That still leaves fork and everything that calls ptrace_event, though.

I think I have it covered:

[v]fork and clone have fully populated pt_regs.

Syscall entry/exit ptrace stops are on slow path and therefore
also have fully populated pt_regs.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/