Re: SYSCALL, ptrace and syscall restart breakages (Re: [RFC] weirdcrap with vdso on uml/i386)

From: Andrew Lutomirski
Date: Sun Aug 21 2011 - 20:44:42 EST


On Sun, Aug 21, 2011 at 12:41 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Sun, Aug 21, 2011 at 03:43:52PM +0100, Al Viro wrote:
>
>> We do not lie to ptrace and iret.  At all.  We do just what you have
>> described.  And fuck up when restart returns us to the SYSCALL / SYSENTER
>> instruction again, which expects the different calling conventions,
>> so the values arranged in registers in the way int 0x80 would expect
>> do us no good.
>
> FWIW, what really happens (for 32bit task on amd64) is this:

I think I believe your analysis...

>        * Both codepaths start with arranging the same thing on the kernel
> stack frame; one 64bit int 0x80 would create.  For the good and simple
> reason: they all have to be able to leave via IRET.  Stack layout is the
> same, but we need to fill it accordingly to calling conventions we are
> stuck with.  I.e. ->cx should be initialized with arg2 and ->bp with
> arg6, wherever those currently are on given codepath.  _That_ is what
> "lying to ptrace" is about - we store there registers according to how
> they were when we entered __kernel_vsyscall(), not as they are at the
> moment of actual SYSCALL insn.  Which is precisely the right thing to do,
> since if we *are* ptraced, the tracer expects to find the syscall argument
> in the same places, whichever variant of syscall tracee happens to be using.

This is, IMO, gross -- if the values in pt_regs matched what they were
when sysenter / syscall was issued, then we'd be fine -- we could
restart the syscall and everything would work. Apparently ptrace
users have a problem with that, so we're stuck with the "lie" (i.e.
reporting values as of __kernel_vsyscall, not as of the actual kernel
entry).
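
To make the "lie" concrete, here is a rough sketch of the translation
for the 32-bit SYSCALL path. It assumes the convention as I understand
it: SYSCALL overwrites ecx with the return address, so
__kernel_vsyscall passes arg2 in ebp and leaves arg6 at 0(%esp). The
struct and function names are made up for illustration; this is not
the real pt_regs or the real entry code.

```c
#include <stdint.h>

/* Hypothetical: user registers as they are when SYSCALL executes. */
struct live_regs {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp, esp;
};

/* Hypothetical stand-in for the saved frame, laid out the way
 * int 0x80 would have filled it (cx = arg2, bp = arg6). */
struct frame {
    uint32_t ax, bx, cx, dx, si, di, bp, ip;
};

/* Build the int 0x80-style frame from the SYSCALL-time registers.
 * arg6_from_user_stack models the value the kernel would read from
 * 0(%esp); the user-memory access itself is not modeled here. */
struct frame fill_frame_from_syscall(const struct live_regs *r,
                                     uint32_t arg6_from_user_stack)
{
    struct frame f;

    f.ax = r->eax;               /* syscall number */
    f.bx = r->ebx;               /* arg1 */
    f.cx = r->ebp;               /* arg2 travels in ebp on this path */
    f.dx = r->edx;               /* arg3 */
    f.si = r->esi;               /* arg4 */
    f.di = r->edi;               /* arg5 */
    f.bp = arg6_from_user_stack; /* arg6 */
    f.ip = r->ecx;               /* return address left by SYSCALL */
    return f;
}
```

If a restart reloads the registers from this frame and lands on the
SYSCALL instruction again, ebp now holds arg6, and the same translation
would report arg6 as arg2. That is the breakage described above.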

>        * If there *was* a syscall restart to be done, we are guaranteed to
> have left via IRET path.  In all cases the syscall arguments end up in
> registers, in the same way int 0x80 expected them.  What happens afterwards
> depends on how we entered, though.
>                + int 0x80: all registers are restored (with ptrace
> manipulations, if any, having left their effect) as they'd been the last
> time around.  In we go and that's it.

Which suggests an easy-ish fix: if sysenter is used, or if syscall is
entered from the EIP it is supposed to be entered from, then just
change ip in the saved registers to point to the int 0x80 instruction.
This might also require tweaking the userspace stack. That way,
restart would hit int 0x80 instead of syscall/sysenter and the
registers would be exactly as expected.

Getting this right in the case where ptrace attaches during the
syscall might be tricky, though.
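
A minimal sketch of the ip redirection suggested above, written as
plain C so the decision logic is easy to see. Everything here is
hypothetical: the enum, the vdso_layout struct and restart_ip() are
names invented for this illustration, and the sketch assumes the usual
restart rewinds ip by the 2-byte length of the entry instruction. The
userspace stack tweak and the ptrace-attach case are not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: how the 32-bit task entered the kernel. */
enum entry_kind { ENTRY_INT80, ENTRY_SYSENTER, ENTRY_SYSCALL };

/* Hypothetical: addresses inside the task's vdso. */
struct vdso_layout {
    uint32_t fast_insn_end; /* address just past SYSCALL/SYSENTER in
                               __kernel_vsyscall */
    uint32_t int80_insn;    /* address of an int $0x80 in the vdso */
};

/* int $0x80, sysenter and syscall are each 2 bytes long. */
#define ENTRY_INSN_LEN 2u

/* Pick the ip to resume at when the syscall must be restarted.
 * For sysenter, or for syscall issued from the expected vdso address,
 * point at int $0x80 so the int 0x80-style register layout is what
 * the re-executed instruction expects. Otherwise rewind over the
 * instruction that was actually used. */
uint32_t restart_ip(enum entry_kind kind, uint32_t saved_ip,
                    const struct vdso_layout *v)
{
    bool from_vdso = (saved_ip == v->fast_insn_end);

    if (kind == ENTRY_SYSENTER || (kind == ENTRY_SYSCALL && from_vdso))
        return v->int80_insn;

    return saved_ip - ENTRY_INSN_LEN;
}
```

A syscall instruction issued from anywhere other than the vdso is left
alone in this sketch, since nothing is known about what follows it.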

--Andy