Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy

From: Jeff Merkey
Date: Thu Dec 10 2015 - 16:16:46 EST

Next message: Jeff Merkey: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
Previous message: Thomas Gleixner: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
In reply to: Jeff Merkey: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
Next in thread: Ingo Molnar: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I trigger it by writing to the dr7 and dr1, 2, 3 or four register and
set an execute breakpoint without going through
arch_install_hw_breakpoint. When the breakpoint fires, the system
crashes and hangs on the processor stuck in an endless loop inside the
int1 handler in hw_breakpoint.c --

static int hw_breakpoint_handler(struct die_args *args)
{
int i, cpu, rc = NOTIFY_STOP;
struct perf_event *bp;
unsigned long dr7, dr6;
unsigned long *dr6_p;

/* The DR6 value is pointed by args->err */
dr6_p = (unsigned long *)ERR_PTR(args->err);
dr6 = *dr6_p;

/* If it's a single step, TRAP bits are random */
if (dr6 & DR_STEP)
return NOTIFY_DONE;

/* Do an early return if no trap bits are set in DR6 */
if ((dr6 & DR_TRAP_BITS) == 0)
return NOTIFY_DONE;

get_debugreg(dr7, 7);
/* Disable breakpoints during exception handling */
set_debugreg(0UL, 7);
/*
* Assert that local interrupts are disabled
* Reset the DRn bits in the virtualized register value.
* The ptrace trigger routine will add in whatever is needed.
*/
current->thread.debugreg6 &= ~DR_TRAP_BITS;
cpu = get_cpu();

/* Handle all the breakpoints that were triggered */
for (i = 0; i < HBP_NUM; ++i) {
if (likely(!(dr6 & (DR_TRAP0 << i))))
continue;

/*
* The counter may be concurrently released but that can only
* occur from a call_rcu() path. We can then safely fetch
* the breakpoint, use its callback, touch its counter
* while we are in an rcu_read_lock() path.
*/
rcu_read_lock();

bp = per_cpu(bp_per_reg[i], cpu);
/*
* Reset the 'i'th TRAP bit in dr6 to denote completion of
* exception handling
*/
(*dr6_p) &= ~(DR_TRAP0 << i);
/*
* bp can be NULL due to lazy debug register switching
* or due to concurrent perf counter removing.
*/
if (!bp) {
rcu_read_unlock();
break;
}

perf_bp_event(bp, args->regs);

/*
* Set up resume flag to avoid breakpoint recursion when
* returning back to origin.
*/
if (bp->hw.info.type == X86_BREAKPOINT_EXECUTE)
args->regs->flags |= X86_EFLAGS_RF;

rcu_read_unlock();
}
/*
* Further processing in do_debug() is needed for a) user-space
* breakpoints (to generate signals) and b) when the system has
* taken exception due to multiple causes
*/
if ((current->thread.debugreg6 & DR_TRAP_BITS) ||
(dr6 & (~DR_TRAP_BITS)))
rc = NOTIFY_DONE;

set_debugreg(dr7, 7);
put_cpu();

return rc;
}

The check for en execute breakpoint to set the resume flag relies on
whether someone has registered a pevent bp event and should rely on
what the hell is in the dr7 register, otherwise, the resume flag never
gets set and the system just keeps getting the same int1 breakpoint
firing off from the hardware over and over again at the same address
and it result in a hard hang.

On 12/10/15, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Thu, Dec 10, 2015 at 12:49 PM, Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> wrote:
>> On Thu, 10 Dec 2015, Andy Lutomirski wrote:
>>> On brief inspection, this smells like a microcode bug. Can you send
>>> /proc/cpuinfo output?
>>>
>>> If this is the issue, I'm not sure we want to be in the business of
>>> working around localized microcode bugs and, if we do, then I think we
>>> should explicitly detect the bug and log about it.
>>
>> I think we should handle such stuff gracefully. Yes, we should log it
>> and we also should check what the contents of the debug registers are.
>>
>> If dr7 has a break point enabled w/o perf having one installed then we
>> know that someone did a horrible hackery ....
>
> I mis-read this. I don't think this is the microcode bug.
>
> Do we know what the actual problem is? Jeff, how did you trigger this?
>
> If it's lazy DR switching (which I haven't looked at the details of),
> then it seems that this could just be triggered by some unfortunate
> combination of perf config and context switching). But, if so, then I
> think that the proposed fix is wrong -- shouldn't we fix dr7 rather
> than fudging RF to work around this instance of the problem? After
> all, if we're hitting this condition in a tight userspace loop, we're
> going to destroy performance unless we fix dr7.
>
> --Andy
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Jeff Merkey: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
Previous message: Thomas Gleixner: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
In reply to: Jeff Merkey: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
Next in thread: Ingo Molnar: "Re: [PATCH 1/1] Fix int1 recursion when no perf_bp_event is registeredy"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]