Re: Re: [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while output user regs

From: Andy Lutomirski
Date: Thu Dec 25 2014 - 11:23:04 EST


On Thu, Dec 25, 2014 at 7:48 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Thu, Dec 25, 2014 at 4:13 AM, Chenggang Qin <chenggang.qcg@xxxxxxxxxxxxxxx> wrote:
>> The context is NMI (PMU) or IRQ (hrtimer). It is a bit complex. The process
>> we want to sample is current, so it is always running.
>> We need to distinguish between the IRQ, syscall, and user contexts that
>> the NMI interrupted.
>
> Oh. So you really are trying to get the user regs from NMI context
> even if you interrupted the kernel. This will be an unpleasant thing
> to do correctly.
>
>> Syscall: sp = current->thread.usersp;
>
> Nope. For x86_64 at least, sp is in old_rsp, not usersp, *if* the
> syscall you interrupted was syscall64 instead of one of the ia32
> entries, which are differently strange. If you've context switched
> out and back, then usersp will match it. If not, then you're looking
> at some random stale sp value.
>
> Keep in mind that there is absolutely no guarantee that TIF_IA32
> matches the syscall entry type.
>
>> old_rsp always points to current->thread.usersp. Maybe we
>> shouldn't use current_user_stack_pointer();
>> User: sp = task_pt_regs(task)->sp;
>> current's pt_regs are stored on the kernel stack when the NMI or IRQ
>> occurred.
>
> This is the only easy case.
>
>> IRQ: sp = task_pt_regs(task)->sp;
>> current's pt_regs are stored on the kernel stack when the IRQ that was
>> interrupted occurred.
>
> Sort of. It's true by the time you actually execute the IRQ handler.
>
> I think that trying to do this is doomed to either failure or extreme
> complexity. You're in an NMI, so you could be part-way through a
> context switch or you could be in the very first instruction of the
> syscall handler.
>
> On a quick look, there are plenty of other bugs in there besides just
> the stack pointer issue. The ABI check that uses TIF_IA32 in the perf
> core is completely wrong. TIF_IA32 may be equal to the actual
> userspace bitness by luck, but, if so, that's more or less just luck.
> And there's a user_mode test that should be user_mode_vm.
>
> Also, it's not just sp that's wrong. There are various places that
> you can interrupt in which many of the registers have confusing
> locations. You could try using the cfi unwind data, but that's
> unlikely to work for regs like cs and ss, and, during context switch,
> this has very little chance of working.

Even the unwinder won't be able to get rbx, rbp, r12, r13, r14, and
r15 right -- good luck handling FORK_LIKE, PTREGSCALL, etc.

--Andy

>
> What's the point of this feature? Honestly, my suggestion would be to
> delete it instead of trying to fix it. It's also not clear to me that
> there aren't serious security problems here -- it's entirely possible
> for sensitive *kernel* values to end up in task_pt_regs at certain
> times, and if you run during context switch and there's no code to
> suppress this dump during context switch, then you could be showing
> regs that belong to the wrong task.
>
> --Andy
>
>>
>> Regards
>> Chenggang
>>
>> ------------------------------------------------------------------
>> From: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>> Sent: 2014-12-23 (Tuesday) 16:30
>> To: root <chenggang.qin@xxxxxxxxx>, linux-kernel
>> <linux-kernel@xxxxxxxxxxxxxxx>
>> Cc: Chenggang Qin <chenggang.qcg@xxxxxxxxxx>, Andrew Morton
>> <akpm@xxxxxxxxxxxxxxxxxxxx>, Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>, David
>> Ahern <dsahern@xxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Mike Galbraith
>> <efault@xxxxxx>, Namhyung Kim <namhyung@xxxxxxxxx>, Paul Mackerras
>> <paulus@xxxxxxxxx>, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>, Wu Fengguang
>> <fengguang.wu@xxxxxxxxx>, Yanmin Zhang <yanmin.zhang@xxxxxxxxx>
>> Subject: Re: [PATCH] perf core: Use KSTK_ESP() instead of pt_regs->sp while
>> output user regs
>>
>> On 12/22/2014 10:22 PM, root wrote:
>>> From: Chenggang Qin <chenggang.qcg@xxxxxxxxxx>
>>>
>>> For x86_64, the exact value of the user stack pointer should be obtained
>>> with KSTK_ESP(current).
>>> current->thread.usersp is copied from the PDA on entry to ring 0.
>>> Currently we output the value of sp from pt_regs, but pt_regs->sp has
>>> changed before it was pushed onto the kernel stack.
>>>
>>> So we cannot get the correct callchain when unwinding some user stacks.
>>> For example, if the stack contains __lll_unlock_wake()/__lll_lock_wait(),
>>> the callchain will sometimes break with the latest version of libunwind.
>>> The root cause is that the sp used by libunwind may be wrong.
>>>
>>> If we use KSTK_ESP(current), the correct callchain is obtained every time.
>>> Other architectures also have a KSTK_ESP() macro.
>>>
>>> Signed-off-by: Chenggang Qin <chenggang.qcg@xxxxxxxxxx>
>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>> Cc: Arjan van de Ven <arjan@xxxxxxxxxxxxxxx>
>>> Cc: David Ahern <dsahern@xxxxxxxxx>
>>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>>> Cc: Mike Galbraith <efault@xxxxxx>
>>> Cc: Namhyung Kim <namhyung@xxxxxxxxx>
>>> Cc: Paul Mackerras <paulus@xxxxxxxxx>
>>> Cc: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
>>> Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>>> Cc: Yanmin Zhang <yanmin.zhang@xxxxxxxxx>
>>>
>>> ---
>>> arch/x86/kernel/perf_regs.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
>>> index e309cc5..5da8df8 100644
>>> --- a/arch/x86/kernel/perf_regs.c
>>> +++ b/arch/x86/kernel/perf_regs.c
>>> @@ -60,6 +60,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>>>       if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
>>>               return 0;
>>>
>>> +     if (idx == PERF_REG_X86_SP)
>>> +             return KSTK_ESP(current);
>>> +
>>
>> This patch is probably fine, but KSTK_ESP seems to be bogus:
>>
>> unsigned long KSTK_ESP(struct task_struct *task)
>> {
>>         return (test_tsk_thread_flag(task, TIF_IA32)) ?
>>                 (task_pt_regs(task)->sp) : ((task)->thread.usersp);
>> }
>>
>> I swear that every time I've looked at anything that references TIF_IA32
>> in the last two weeks, it's been wrong. This should be something like:
>>
>> if (task_thread_info(task)->status & TS_COMPAT)
>>         return task_pt_regs(task)->sp;
>> else if (task == current && task is in a syscall)
>>         return current_user_stack_pointer();
>> else if (task is not running && task is in a syscall)
>>         return task->thread.usersp;
>> else if (task is not in a syscall)
>>         return task_pt_regs(task)->sp;
>> else
>>         we're confused; give up.
>>
>> What context are you using KSTK_ESP in?
>>
>> --Andy
>>
>>>       return regs_get_register(regs, pt_regs_offset[idx]);
>>> }
>>>
>>>
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC



--
Andy Lutomirski
AMA Capital Management, LLC
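
The stack-pointer selection order quoted above (TS_COMPAT first, then current
in a syscall, then a non-running task in a syscall, then the non-syscall case)
can be modeled in plain user-space C to make the fall-through cases easier to
see. This is only an illustrative sketch: struct task_model, its fields, and
model_user_sp() are hypothetical names invented here, not kernel APIs, and the
model says nothing about the NMI hazards discussed in the thread (a sample
taken part-way through a context switch or in the first instruction of the
syscall handler).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of the state the quoted pseudocode looks at.
 * None of these fields are real kernel structures; each comment names the
 * kernel expression the field stands in for.
 */
struct task_model {
	bool ts_compat;        /* task_thread_info(task)->status & TS_COMPAT */
	bool is_current;       /* task == current */
	bool running;          /* task is running on some CPU */
	bool in_syscall;       /* task is in a syscall */
	unsigned long regs_sp; /* task_pt_regs(task)->sp */
	unsigned long old_rsp; /* current_user_stack_pointer() */
	unsigned long usersp;  /* task->thread.usersp */
};

/*
 * Returns 0 and stores the selected value in *sp, or returns -1 for the
 * "we're confused; give up" case, leaving *sp untouched.
 */
static int model_user_sp(const struct task_model *t, unsigned long *sp)
{
	assert(t != NULL && sp != NULL);

	if (t->ts_compat)
		*sp = t->regs_sp;   /* compat entry: sp lives in pt_regs */
	else if (t->is_current && t->in_syscall)
		*sp = t->old_rsp;   /* 64-bit syscall on this CPU */
	else if (!t->running && t->in_syscall)
		*sp = t->usersp;    /* switched out while in a syscall */
	else if (!t->in_syscall)
		*sp = t->regs_sp;   /* interrupted outside a syscall */
	else
		return -1;          /* e.g. in a syscall on another CPU */

	return 0;
}
```

With is_current and in_syscall both set, the model selects old_rsp. A task
that is running on another CPU inside a syscall reaches the final branch and
gets -1, which corresponds to the "we're confused; give up" case.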