Re: [PATCH v2] arm64:kgdb: Fix kernel single-stepping

From: Will Deacon
Date: Thu Feb 20 2020 - 09:22:22 EST

On Wed, Feb 19, 2020 at 09:24:03AM -0600, minyard@xxxxxxx wrote:
> From: Corey Minyard <cminyard@xxxxxxxxxx>
> I was working on a single-step bug on kgdb on an ARM64 system, and I saw
> this scenario:
> * A single step is setup to return to el1
> * The ERET return to el1
> * An interrupt is pending and runs before the instruction
> * As soon as PSTATE.D (the debug disable bit) is cleared, the single
> step happens in that location, not where it should have.
> This appears to be due to PSTATE.SS not being cleared when the exception
> happens. Per section D.2.12.5 of the ARMv8 reference manual, that
> appears to be incorrect, it says "As part of exception entry, the PE
> does all of the following: ... Sets PSTATE.SS to 0."

Sorry, but I don't follow you here. If PSTATE.SS is not cleared, why would
you take the step exception?

> However, I appear to not be the first person who has noticed this. In
> the el0-only portion of the kernel_entry macro in entry.S, I found the
> following comment: "Ensure MDSCR_EL1.SS is clear, since we can unmask
> debug exceptions when scheduling." Exactly the same scenario, except
> coming from a userland single step, not a kernel one.

No, I think you might be conflating PSTATE.SS and MDSCR_EL1.SS.

> As I was studying this, though, I realized that the following scenario
> had an issue:
> * Kernel enables MDSCR.SS, MDSCR.KDE, MDSCR.MDE (unnecessary), and
> PSTATE.SS to enable a single step in el1, for kgdb or kprobes,
> on the current CPU's MDSCR register and the process' PSTATE.SS
> register.
> * Kernel returns from the exception with ERET.
> * An interrupt or page fault happens on the instruction, causing the
> instruction to not be run, but the exception handler runs.
> * The exception causes the task to migrate to a new core.
> * The return from the exception runs on a different processor now,
> where the MDSCR values are not set up for a single step.
> * The single step fails to happen.
> This is bad for kgdb, of course, but it seems really bad for kprobes if
> this happens.

I don't see how this can happen for kprobes. Have you managed to reproduce
the failure?

> To fix both these problems, rework the handling of single steps to clear
> things out upon entry to the kernel from el1, and then to set up single
> step when returning to el1, and not do the setup in debug-monitors.c.
> This means that single stepping does not use
> enable/disable_debug_monitors(); it is no longer necessary to track
> those flags for single stepping. This is much like single stepping is
> handled for el0. A new flag is added in pt_regs to enable single
> stepping from el1. Unfortunately, the old value of PSTATE.SS cannot be
> used for this because of the hardware bug mentioned earlier.

I don't think there's a hardware bug.

It sound like you're trying to make kernel debugging per-task instead
of per-cpu, but I don't think that's the right thing to do. What if I /want/
to debug an interrupt handler? For example, I might have a watchpoint on
something accessed by timer interrupt.

> As part of this, there is an interaction between single stepping and the
> other users of debug monitors with the MDSCR.KDE bit. That bit has to
> be set for both hardware breakpoints at el1 and single stepping at el1.
> A new variable was created to store the cpu-wide value of MDSCR.KDE; the
> single stepping code makes sure not to clear that bit on kernel entry if
> it's set in the per-cpu variable.
> After fixing this and doing some more testing, I ran into another issue:
> * Kernel enables the pt_regs single step
> * Kernel returns from the exception with ERET.
> * An interrupt or page fault happens on the instruction, causing the
> instruction to not be run, but the exception handler runs.

This sounds like you've broken debug; we should take the step exception
in the exception handler. That's the way this is supposed to work.

> There's no easy way to find the pt_regs that has the single step flag
> set. So a thread info flag was added so that the single step could be
> disabled in this case. Both that flag and the flag in pt_regs must be
> set to enable a single step.

Honestly, I get the feeling that you don't really understand the code
you're changing here and it's a tonne of effort to try to untangle what
you're doing. That's not necessarily your fault because the debug
architecture is a nightmare to comprehend, but I'm not keen to change it
unless we have a really good justification. I'm sure kgdb is riddled with
bugs but, as I said before, the fixes should be in kgdb, not by tearing
up the low-level debug code (which has the potential to break other users).

Maybe it would be easier if you tried to fix one problem per patch,
preferably with a way to reproduce the issue you're seeing each time?