Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr

From: Andy Lutomirski
Date: Wed Jun 22 2016 - 11:21:27 EST


On Jun 22, 2016 5:00 AM, "Pedro Alves" <palves@xxxxxxxxxx> wrote:
>
> On 06/21/2016 05:32 PM, Andy Lutomirski wrote:
> > On Jun 21, 2016 5:40 AM, "Pedro Alves" <palves@xxxxxxxxxx> wrote:
>
> > I didn't try that particular experiment. But, from that email:
> >
> >> After that, GDB can control the stopped inferior. To call function "func1()" of inferior, GDB need: Step 1, save current values of registers ($rax 0xfffffffffffffe00(64 bits -512) is cut to 0xfffffe00(32 bits -512) because inferior is a 32 bits program).
> >
> > That sounds like it may be a gdb bug. Why does gdb truncate the register?
>
> Because when debugging a 32-bit program, gdb's register cache only
> stores 32-bit-wide registers ($eax, $eip, etc., not $rax, etc.)
>
> Let me turn this around:
>
> Why does the kernel care about the upper 32 bits of $orig_rax when
> the task is in 32-bit mode in the first place?
>
> The 32-bit syscall entry points already only care about $eax, not $rax,
> since $rax doesn't exist in real 32-bit CPUs.

Two reasons. First, a long jump to a 32-bit code segment followed by
a long jump back to 64-bit preserves all the high register bits. If
there's a context switch or similar in between, the high bits should
still be preserved. Ideally
gdb would preserve them, too -- this would make debugging weird
mode-switching programs much easier.

Second, when this save-and-restore dance happens, the kernel *doesn't
know* what the syscall bitness is, because it's not in a syscall
anymore. The kernel can *guess* by checking CS or decoding the
syscall instruction, but it's just a guess -- decoding the syscall
instruction isn't absolutely guaranteed to work (UML, for example,
could unmap it temporarily), and the syscall could be int $0x80
(32-bit) even in a 64-bit code segment.
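
For concreteness, the CS-based guess from a tracer's point of view would
look something like this (a sketch, not what the kernel actually does;
0x33 and 0x23 are the usual __USER_CS / __USER32_CS selectors on x86_64):

#include <sys/user.h>

/* Heuristic only: int $0x80 issued from a 64-bit code segment still
 * takes the 32-bit syscall path, so CS alone can be wrong. */
static const char *guess_syscall_arch(const struct user_regs_struct *regs)
{
	if (regs->cs == 0x33)		/* __USER_CS on x86_64 */
		return "64-bit (probably)";
	if (regs->cs == 0x23)		/* __USER32_CS */
		return "32-bit (probably)";
	return "no idea";
}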

Part of the problem here is that there's a critical piece of state
that isn't visible directly through ptrace at all: where in the
syscall process the tracee is. When PTRACE_SYSCALL fires the first
time, the tracee is in syscall entry. No matter what regs are set,
the tracee will still be in syscall entry on resume. Depending on
orig_ax, it may abort the syscall. On the second event, it's in
syscall exit. This is when you might read out an ERESTART code. In
both of these syscall states, the kernel has an associated concept of
the syscall arch, and ptrace can neither read nor write it. (The fact
that ptrace can't read it is why strace screws up completely if a
64-bit program uses int $0x80.) On a signal event or a single-step
event, the tracee isn't in a syscall any more, so, when gdb shoves
syscall-like registers back into the tracee, the kernel has no direct
knowledge of whether the tracee is in a syscall or what arch it is, so
the kernel has to either assume that all regs are 64-bit (my preferred
approach) or make a guess.
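
For reference, the two-stop dance looks roughly like this from the
tracer's side -- a bare-bones x86_64 sketch with no error handling, just
to illustrate that "entry vs. exit" is only implied by event ordering and
is never directly readable as state:

#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		ptrace(PTRACE_TRACEME, 0, 0, 0);
		raise(SIGSTOP);
		getpid();		/* any syscall will do */
		_exit(0);
	}

	int status;
	waitpid(pid, &status, 0);		/* the initial SIGSTOP */

	while (1) {
		struct user_regs_struct regs;

		ptrace(PTRACE_SYSCALL, pid, 0, 0);	/* next stop: syscall entry */
		waitpid(pid, &status, 0);
		if (WIFEXITED(status))
			break;
		ptrace(PTRACE_GETREGS, pid, 0, &regs);
		printf("entry: orig_rax=%lld\n", (long long)regs.orig_rax);

		ptrace(PTRACE_SYSCALL, pid, 0, 0);	/* next stop: syscall exit */
		waitpid(pid, &status, 0);
		if (WIFEXITED(status))
			break;
		ptrace(PTRACE_GETREGS, pid, 0, &regs);
		printf("exit:  rax=%lld\n", (long long)regs.rax); /* may be -ERESTART* */
	}
	return 0;
}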

>
> Looking at arch/x86/entry/entry_64_compat.S, all three 64-bit x 32-bit syscall
> entry points zero-extend $eax already, and then push that as pt_regs->orig_ax.
>
> So if the kernel is giving significance to the higher 32-bits of orig_ax at
> 32-bit syscall restart time, that very much looks like a kernel bug to me.
>

It's not doing this. When gdb does this reg restore, the kernel's not
doing 32-bit syscall restart -- it's doing a generic resume, and it
just happens that syscall restart has been (incorrectly!) run at this
stage in the resume process even when the task isn't returning from a
syscall. Arguably the fact that orig_ax doesn't encode the syscall
bitness is a bug, but it's way too late to change it.

Could 64-bit gdb sign-extend eax and orig_eax when loading them if
it's forgotten their original high bits?
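
i.e., something along these lines (a sketch of what gdb could do, not a
claim about its internals; widen_reg32 is a hypothetical helper name):

#include <stdint.h>

/* Hypothetical helper: widen a 32-bit register value from the regcache
 * back to the 64-bit layout the kernel expects, treating it as signed
 * so -512 (0xfffffe00) becomes 0xfffffffffffffe00 again. */
static uint64_t widen_reg32(uint32_t reg32)
{
	return (uint64_t)(int64_t)(int32_t)reg32;
}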

> >
> > I haven't played with it recently, but, in my experience, gdb seems to
> > work quite poorly in mixed-mode situations. For example, if you
> > attach 64-bit gdb to qemu-system-x86_64's gdbserver, boot a 64-bit
> > guest, and breakpoint in early 32-bit code, gdb tends to explode
> > pretty badly.
>
> Right, but that's a bit of a red herring, and not entirely gdb's fault. The case you
> mention happens because qemu does exactly the opposite of what you're suggesting below.
> It's qemu that changes the remote protocol's register layout (and thus size) in the
> remote protocol register read/write packets when the kernel changes mode, behind
> gdb's back, and gdb errors out because obviously it isn't expecting that. All gdb
> knows is that qemu is now sending bogus register read replies, different from what
> the target description qemu reported on initial remote connection described.
>

Yuck.

> I say "entirely", because gdb has its own share of fault for the remote protocol
> not including some kind of standard mechanism to inform gdb of mode changes.

Is this something QEMU could improve? I could try pestering them to fix it.

>
> However, the usual scenario where the program _doesn't_ change mode
> during execution, is supported.
>
> >
> > On x86_64, I think gdb should treat CPU state as 64-bit no matter
> > what. The fact that a 32-bit tracee code segment is in use shouldn't
> > change much.
>
> It's not as clear or easy as you make it sound, unfortunately.
>
> For normal userspace programs, the current design across
> gdb/remote protocol/ptrace/kernel/core dump format/elf formats/ is that
> what matters is the program's architecture, not whatever the tracer's arch
> is.
>
> Should core dumping dump 64-bit CPU state as well for 32-bit programs?
> The current core dump format dumps a 32-bit elf with notes that contain
> 32-bit registers. And I think it'd be a bit odd for a 32-bit program to
> dump different core files depending on the bitness of the kernel.

I actually think it should, but only if the binary format would work
without breaking existing core file readers.

>
> Should a gdb connected to a 64-bit gdbserver that is debugging a 32-bit
> program see different registers compared to a gdb that
> is connected to a 32-bit gdbserver that is debugging a 32-bit program?
> Currently, it doesn't. The architecture of gdbserver doesn't matter here,
> only the tracee's.

Yes, absolutely, at least if the user opts in. This would make
debugging the kernel or other strange programs that use long jumps
much, much easier.

>
> > Admittedly the kernel doesn't really help. There is some questionable
> > code involving which regsets to show to ptrace.
>
> I don't know what code you're looking at, but I consider this mandatory reading:
>
> Roland McGrath on PTRACE_GETREGSET design:
> https://sourceware.org/ml/archer/2010-q3/msg00193.html
>
> "A caveat about those requests for bi-arch systems. Unlike other
> ptrace requests, these access the native formats of the tracee
> process, rather than the native formats of the debugger process.
> So, a 64-bit debugger process using PTRACE_GETREGSET on a 32-bit
> tracee process will see the 32-bit layouts (i.e. what would appear
> in an ELF core file if that process dumped one)."
>
>
> gdb currently uses PTRACE_SETREGS for the general registers, which
> means it currently writes those as 64-bit registers. However, if gdb or any
> ptracer restores/writes $eax/$orig_eax using PTRACE_SETREGSET, it's only
> going to pass down a 32-bit value, and again it must be the kernel that sign
> extends $orig_eax if it wants to interpret it as signed 64-bit internally.
>

Egads! Do you know whether this was intentional on the kernel's part?

I'm inclined to add a PTRACE_GETREGSET_CALLER_ABI or similar that
returns the regsets according to the tracer's view, thus allowing this
to actually work sanely. (With the caveat that a 32-bit tracer may
have serious problems if the tracee is 64-bit.)
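
For context, today's semantics from a 64-bit tracer look roughly like
this (a sketch; the kernel writes back the tracee-native layout and
shrinks iov_len to match, which is about the only arch hint ptrace
currently gives you):

#include <elf.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/user.h>

/* Ask for the tracee's general regs; a 32-bit tracee gets us the
 * 32-bit user_regs_struct even though we're a 64-bit tracer. */
static void show_regset_layout(pid_t pid)
{
	struct user_regs_struct regs;	/* big enough for either layout */
	struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };

	if (ptrace(PTRACE_GETREGSET, pid, (void *)NT_PRSTATUS, &iov) == 0)
		printf("got %zu bytes: %s layout\n", iov.iov_len,
		       iov.iov_len == sizeof(regs) ? "64-bit" : "32-bit");
}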

>
> But actually looking at your patch 3, I'm confused, because it seems
> to be doing what I'm suggesting?
>

The case that's still broken is syscall restart in general (the
decision to backtrack regs->ip), not the invocation of
restart_syscall. The latter only happens for a small number of
syscalls.
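
For anyone not staring at arch/x86/kernel/signal.c right now, the step in
question looks roughly like this -- paraphrased, not the literal code.
The first three cases are the general restart decision; only
ERESTART_RESTARTBLOCK goes through the restart_syscall nr this patch
changes:

	/* Paraphrase of the no-handler restart step: back the task up so
	 * the syscall instruction re-executes once it resumes. */
	switch (syscall_get_error(current, regs)) {
	case -ERESTARTNOHAND:
	case -ERESTARTSYS:
	case -ERESTARTNOINTR:
		regs->ax = regs->orig_ax;	/* restore the syscall nr */
		regs->ip -= 2;			/* re-execute the syscall insn */
		break;
	case -ERESTART_RESTARTBLOCK:
		regs->ax = get_nr_restart_syscall(regs);
		regs->ip -= 2;
		break;
	}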

--Andy