Re: [PATCH v5 24/25] ptrace: add PTRACE_GET_SYSCALL_INFO request
From: Kees Cook
Date: Mon Dec 10 2018 - 12:44:53 EST
On Sun, Dec 9, 2018 at 8:31 PM Dmitry V. Levin <ldv@xxxxxxxxxxxx> wrote:
>
> From: Elvira Khabirova <lineprinter@xxxxxxxxxxxx>
>
> PTRACE_GET_SYSCALL_INFO is a generic ptrace API that lets ptracer obtain
> details of the syscall the tracee is blocked in.
>
> There are two reasons for a special syscall-related ptrace request.
>
> Firstly, with the current ptrace API there are cases when ptracer cannot
> retrieve necessary information about syscalls. Some examples include:
> * The notorious int-0x80-from-64-bit-task issue. See [1] for details.
> In short, if a 64-bit task performs a syscall through int 0x80, its tracer
> has no reliable means to find out that the syscall was, in fact,
> a compat syscall, and misidentifies it.
> * Syscall-enter-stop and syscall-exit-stop look the same for the tracer.
> Common practice is to keep track of the sequence of ptrace-stops in order
> not to mix the two syscall-stops up. But it is not as simple as it looks;
> for example, strace had a (just recently fixed) long-standing bug where
> attaching strace to a tracee that is performing the execve system call
> led to the tracer identifying the following syscall-exit-stop as
> syscall-enter-stop, which messed up all the state tracking.
> * Since the introduction of commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3
> ("ptrace: Don't allow accessing an undumpable mm"), both PTRACE_PEEKDATA
> and process_vm_readv become unavailable when the process dumpable flag
> is cleared. On such architectures as ia64 this results in all syscall
> arguments being unavailable for the tracer.
>
> Secondly, ptracers also have to support a lot of arch-specific code for
> obtaining information about the tracee. For some architectures, this
> requires a ptrace(PTRACE_PEEKUSER, ...) invocation for every syscall
> argument and return value.
>
> ptrace(2) man page:
>
> long ptrace(enum __ptrace_request request, pid_t pid,
> void *addr, void *data);
> ...
> PTRACE_GET_SYSCALL_INFO
> Retrieve information about the syscall that caused the stop.
> The information is placed into the buffer pointed by "data"
> argument, which should be a pointer to a buffer of type
> "struct ptrace_syscall_info".
> The "addr" argument contains the size of the buffer pointed to
> by "data" argument (i.e., sizeof(struct ptrace_syscall_info)).
> The return value contains the number of bytes available
> to be written by the kernel.
> If the size of data to be written by the kernel exceeds the size
> specified by "addr" argument, the output is truncated.
>
> Co-authored-by: Dmitry V. Levin <ldv@xxxxxxxxxxxx>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Eugene Syromyatnikov <esyr@xxxxxxxxxx>
> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> Cc: Jann Horn <jannh@xxxxxxxxxx>
> Cc: linux-api@xxxxxxxxxxxxxxx
> Cc: strace-devel@xxxxxxxxxxxxxxx
> Signed-off-by: Elvira Khabirova <lineprinter@xxxxxxxxxxxx>
> Signed-off-by: Dmitry V. Levin <ldv@xxxxxxxxxxxx>
Reviewed-by: Kees Cook <keescook@xxxxxxxxxxxx>
-Kees
> ---
>
> Notes:
> v5:
> * Change PTRACE_EVENTMSG_SYSCALL_{ENTRY,EXIT} values as requested by Oleg.
> * Change struct ptrace_syscall_info: generalize instruction_pointer,
> stack_pointer, and frame_pointer fields by moving them from
> ptrace_syscall_info.{entry,seccomp} substructures to ptrace_syscall_info
> and initializing them for all stops.
> * Add PTRACE_SYSCALL_INFO_NONE, set it when not in a syscall stop,
> so e.g. "strace -i" could use PTRACE_SYSCALL_INFO_SECCOMP to obtain
> instruction_pointer when the tracee is in a signal stop.
> * Make available for all architectures: do not conditionalize on
> CONFIG_HAVE_ARCH_TRACEHOOK since all syscall_get_* functions
> are implemented on all architectures.
>
> v4:
> * Do not introduce task_struct.ptrace_event,
> use child->last_siginfo->si_code instead.
> * Implement PTRACE_SYSCALL_INFO_SECCOMP and ptrace_syscall_info.seccomp
> support along with PTRACE_SYSCALL_INFO_{ENTRY,EXIT} and
> ptrace_syscall_info.{entry,exit}.
>
> v3:
> * Change struct ptrace_syscall_info.
> * Support PTRACE_EVENT_SECCOMP by adding ptrace_event to task_struct.
> * Add proper defines for ptrace_syscall_info.op values.
> * Rename PT_SYSCALL_IS_ENTERING and PT_SYSCALL_IS_EXITING to
> PTRACE_EVENTMSG_SYSCALL_ENTRY and PTRACE_EVENTMSG_SYSCALL_EXIT
> * and move them to uapi.
>
> v2:
> * Do not use task->ptrace.
> * Replace entry_info.is_compat with entry_info.arch, use syscall_get_arch().
> * Use addr argument of sys_ptrace to get expected size of the struct;
> return full size of the struct.
>
> include/linux/tracehook.h | 9 ++--
> include/uapi/linux/ptrace.h | 39 +++++++++++++++
> kernel/ptrace.c | 99 ++++++++++++++++++++++++++++++++++++-
> 3 files changed, 143 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index df20f8bdbfa3..6bc7a3d58e2f 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -57,13 +57,15 @@ struct linux_binprm;
> /*
> * ptrace report for syscall entry and exit looks identical.
> */
> -static inline int ptrace_report_syscall(struct pt_regs *regs)
> +static inline int ptrace_report_syscall(struct pt_regs *regs,
> + unsigned long message)
> {
> int ptrace = current->ptrace;
>
> if (!(ptrace & PT_PTRACED))
> return 0;
>
> + current->ptrace_message = message;
> ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> /*
> @@ -76,6 +78,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
> current->exit_code = 0;
> }
>
> + current->ptrace_message = 0;
> return fatal_signal_pending(current);
> }
>
> @@ -101,7 +104,7 @@ static inline int ptrace_report_syscall(struct pt_regs *regs)
> static inline __must_check int tracehook_report_syscall_entry(
> struct pt_regs *regs)
> {
> - return ptrace_report_syscall(regs);
> + return ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_ENTRY);
> }
>
> /**
> @@ -126,7 +129,7 @@ static inline void tracehook_report_syscall_exit(struct pt_regs *regs, int step)
> if (step)
> user_single_step_report(regs);
> else
> - ptrace_report_syscall(regs);
> + ptrace_report_syscall(regs, PTRACE_EVENTMSG_SYSCALL_EXIT);
> }
>
> /**
> diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
> index d5a1b8a492b9..f0af09fe4e17 100644
> --- a/include/uapi/linux/ptrace.h
> +++ b/include/uapi/linux/ptrace.h
> @@ -73,6 +73,45 @@ struct seccomp_metadata {
> __u64 flags; /* Output: filter's flags */
> };
>
> +#define PTRACE_GET_SYSCALL_INFO 0x420e
> +#define PTRACE_SYSCALL_INFO_NONE 0
> +#define PTRACE_SYSCALL_INFO_ENTRY 1
> +#define PTRACE_SYSCALL_INFO_EXIT 2
> +#define PTRACE_SYSCALL_INFO_SECCOMP 3
> +
> +struct ptrace_syscall_info {
> + __u8 op; /* PTRACE_SYSCALL_INFO_* */
> + __u8 __pad0[3];
> + __u32 arch;
> + __u64 instruction_pointer;
> + __u64 stack_pointer;
> + __u64 frame_pointer;
> + union {
> + struct {
> + __u64 nr;
> + __u64 args[6];
> + } entry;
> + struct {
> + __s64 rval;
> + __u8 is_error;
> + __u8 __pad1[7];
> + } exit;
> + struct {
> + __u64 nr;
> + __u64 args[6];
> + __u32 ret_data;
> + __u8 __pad2[4];
> + } seccomp;
> + };
> +};
> +
> +/*
> + * These values are stored in task->ptrace_message
> + * by tracehook_report_syscall_* to describe the current syscall-stop.
> + */
> +#define PTRACE_EVENTMSG_SYSCALL_ENTRY 1
> +#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
> +
> /* Read signals from a shared (process wide) queue */
> #define PTRACE_PEEKSIGINFO_SHARED (1 << 0)
>
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index c2cee9db5204..4562b2cb1087 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -30,6 +30,8 @@
> #include <linux/cn_proc.h>
> #include <linux/compat.h>
>
> +#include <asm/syscall.h> /* For syscall_get_* */
> +
> /*
> * Access another process' address space via ptrace.
> * Source/target buffer must be kernel space,
> @@ -878,7 +880,98 @@ static int ptrace_regset(struct task_struct *task, int req, unsigned int type,
> * to ensure no machine forgets it.
> */
> EXPORT_SYMBOL_GPL(task_user_regset_view);
> -#endif
> +#endif /* CONFIG_HAVE_ARCH_TRACEHOOK */
> +
> +static unsigned long
> +ptrace_get_syscall_info_entry(struct task_struct *child, struct pt_regs *regs,
> + struct ptrace_syscall_info *info)
> +{
> + unsigned long args[ARRAY_SIZE(info->entry.args)];
> + int i;
> +
> + info->op = PTRACE_SYSCALL_INFO_ENTRY;
> + info->entry.nr = syscall_get_nr(child, regs);
> + syscall_get_arguments(child, regs, 0, ARRAY_SIZE(args), args);
> + for (i = 0; i < ARRAY_SIZE(args); i++)
> + info->entry.args[i] = args[i];
> +
> + return offsetofend(struct ptrace_syscall_info, entry);
> +}
> +
> +static unsigned long
> +ptrace_get_syscall_info_seccomp(struct task_struct *child, struct pt_regs *regs,
> + struct ptrace_syscall_info *info)
> +{
> + /*
> + * As struct ptrace_syscall_info.entry is currently a subset
> + * of struct ptrace_syscall_info.seccomp, it makes sense to
> + * initialize that subset using ptrace_get_syscall_info_entry().
> + * This can be reconsidered in the future if these structures
> + * diverge significantly enough.
> + */
> + ptrace_get_syscall_info_entry(child, regs, info);
> + info->op = PTRACE_SYSCALL_INFO_SECCOMP;
> + info->seccomp.ret_data = child->ptrace_message;
> +
> + return offsetofend(struct ptrace_syscall_info, seccomp);
> +}
> +
> +static unsigned long
> +ptrace_get_syscall_info_exit(struct task_struct *child, struct pt_regs *regs,
> + struct ptrace_syscall_info *info)
> +{
> + info->op = PTRACE_SYSCALL_INFO_EXIT;
> + info->exit.rval = syscall_get_error(child, regs);
> + info->exit.is_error = !!info->exit.rval;
> + if (!info->exit.is_error)
> + info->exit.rval = syscall_get_return_value(child, regs);
> +
> + return offsetofend(struct ptrace_syscall_info, exit);
> +}
> +
> +static int
> +ptrace_get_syscall_info(struct task_struct *child, unsigned long user_size,
> + void __user *datavp)
> +{
> + struct pt_regs *regs = task_pt_regs(child);
> + struct ptrace_syscall_info info = {
> + .op = PTRACE_SYSCALL_INFO_NONE,
> + .arch = syscall_get_arch(child),
> + .instruction_pointer = instruction_pointer(regs),
> + .stack_pointer = user_stack_pointer(regs),
> + .frame_pointer = frame_pointer(regs)
> + };
> + unsigned long actual_size = offsetof(struct ptrace_syscall_info, entry);
> + unsigned long write_size;
> +
> + /*
> + * This does not need lock_task_sighand() to access
> + * child->last_siginfo because ptrace_freeze_traced()
> + * called earlier by ptrace_check_attach() ensures that
> + * the tracee cannot go away and clear its last_siginfo.
> + */
> + switch (child->last_siginfo ? child->last_siginfo->si_code : 0) {
> + case SIGTRAP | 0x80:
> + switch (child->ptrace_message) {
> + case PTRACE_EVENTMSG_SYSCALL_ENTRY:
> + actual_size = ptrace_get_syscall_info_entry(child, regs,
> + &info);
> + break;
> + case PTRACE_EVENTMSG_SYSCALL_EXIT:
> + actual_size = ptrace_get_syscall_info_exit(child, regs,
> + &info);
> + break;
> + }
> + break;
> + case SIGTRAP | (PTRACE_EVENT_SECCOMP << 8):
> + actual_size = ptrace_get_syscall_info_seccomp(child, regs,
> + &info);
> + break;
> + }
> +
> + write_size = min(actual_size, user_size);
> + return copy_to_user(datavp, &info, write_size) ? -EFAULT : actual_size;
> +}
>
> int ptrace_request(struct task_struct *child, long request,
> unsigned long addr, unsigned long data)
> @@ -1095,6 +1188,10 @@ int ptrace_request(struct task_struct *child, long request,
> ret = seccomp_get_metadata(child, addr, datavp);
> break;
>
> + case PTRACE_GET_SYSCALL_INFO:
> + ret = ptrace_get_syscall_info(child, addr, datavp);
> + break;
> +
> default:
> break;
> }
> --
> ldv
--
Kees Cook