Re: [GIT PULL v4.5] Fix INT1 recursion with unregistered breakpoints
From: Jeff Merkey
Date: Mon Jan 11 2016 - 21:50:33 EST
On 1/11/16, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Mon, Jan 11, 2016 at 6:26 PM, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
>> On 1/11/16, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> On Mon, Jan 11, 2016 at 6:07 PM, Jeff Merkey <linux.mdb@xxxxxxxxx>
>>> wrote:
>>>> On 1/11/16, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>> On Mon, Jan 11, 2016 at 5:30 PM, Jeff Merkey <linux.mdb@xxxxxxxxx>
>>>>> wrote:
>>>>>> On 1/11/16, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>>> On Mon, Jan 11, 2016 at 4:44 PM, Jeff Merkey <linux.mdb@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>>> Hi Thomas,
>>>>>>>>
>>>>>>>> I agree with #2, we should clear the breakpoint. As for #1, if
>>>>>>>> there's an execute breakpoint it MUST be cleared or it will just
>>>>>>>> fire off again when it sees the iretd from the int1 exception
>>>>>>>> handler. I do use the breakpoint API Thomas, this showed up while
>>>>>>>> debugging and testing the API with "lazy debug register switching".
>>>>>>>>
>>>>>>>> So do you want me to expand the patch and clear the breakpoint?
>>>>>>>> Just give the word and I'll get busy and GIT -R- DONE.
>>>>>>>
>>>>>>> It seems to me that you're papering over some issue instead of
>>>>>>> fixing the root cause. If you're using the API, then either you're
>>>>>>> doing it wrong or the API is broken. Can you figure out which and
>>>>>>> fix it?
>>>>>>>
>>>>>>> --Andy
>>>>>>>
>>>>>>
>>>>>> Andy,
>>>>>>
>>>>>> Linux should not crash because someone triggered a breakpoint or one
>>>>>> got triggered due to a program leaving some bits lying in a read-only
>>>>>> register (DR6), which for some strange reason someone in the Linux
>>>>>> world decided could be used as local storage and to pass arguments
>>>>>> between subsystems - a register Intel designed to be read from for
>>>>>> status. I did not design what's in that API, I have to live with it.
>>>>>
>>>>> The API appears to work, though. Are you *sure* you're using it
>>>>> correctly? Are you telling the code in kernel/hw_breakpoint.c about
>>>>> your breakpoint?
>>>>>
>>>>>> So all I am asking is that we fix this issue. It does not matter
>>>>>> to my debugger if this is fixed or not in Linux, since I carry the
>>>>>> fix in my patch, but it does matter to the overall robustness of Linux.
>>>>>
>>>>> Robust against what, exactly? What's the bug?
>>>>>
>>>>> I will grant that the comments about lazy dr7 switching are
>>>>> mystifying, and cleaning them up might be nice. But there's no
>>>>> adequate explanation of what the failure mode is, how to trigger it,
>>>>> or why your patch is a reasonable fix. As it stands, you're
>>>>> duplicating code.
>>>>>
>>>>> --Andy
>>>>
>>>> Andy,
>>>>
>>>> Couple of things:
>>>>
>>>> Would you like a copy of the test harness that creates this bug to
>>>> test for yourself? I previously posted it on the list. If you don't
>>>> have it, I'll provide it.
>>>
>>> If you can send a short, buildable thing that triggers it, I'll read it.
>>>
>>>>
>>>> Since the dr6 bits get shifted around, it doesn't matter if the
>>>> breakpoint was registered or not in the API because the broken handler
>>>> will call NULL bp structures and crash whether it's registered or not.
>>>>
>>>
>>> And what exactly does this have to do with anything? Your patch is
>>> all about spurious breakpoints triggered by dr7 and should have
>>> nothing much to do with the value in dr6. Unless dr6 is missing a bit
>>> due to some issue, but you never suggested any problem like that.
>>>
>>
>> It's about setting the resume flag when an execute breakpoint occurs, no
>> matter what caused the breakpoint. If it is not set, the system will hang
>> with that processor hung on the same execution address. You cannot have an
>> int1 exception path that does not set the resume flag, which is the case
>> here -- there should be no path where this flag does not get set on an
>> execute breakpoint.
>
> There are many, many ways that one can corrupt kernel state to break
> things. You could screw up IST state basically anywhere and crash.
> You could screw up GSBASE. You could poke bad values into pt_regs in
> a fast syscall and hit the infamous SYSRET failure. You can write a
> buggy .fault handler that returns success and doesn't actually do
> anything. And yes, you can set a bit in dr7 without telling the
> hw_breakpoint code about it and thus infinite loop.
>
> Meanwhile, you keep claiming that the kernel has a bug and that the bug
> can't be triggered without out-of-tree code. In my book, that's not a
> bug.
>
The handler that fails to set the resume flag is in-tree code.

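To make the resume-flag point concrete, here is a minimal user-space sketch
in C. It is not the in-tree handler. It assumes only the documented x86
layout: RF is EFLAGS bit 16, DR6 bits 0-3 report which breakpoint slot fired,
and DR7 bits 16+4n hold the R/W type for slot n, with 00 meaning an execute
breakpoint. The macro and function names are hypothetical, chosen for
illustration only.

```c
/* Illustrative sketch only: names below are hypothetical, not kernel symbols. */
#define SKETCH_EFLAGS_RF   (1UL << 16)  /* Resume Flag, EFLAGS bit 16 */
#define SKETCH_NUM_SLOTS   4            /* DR0..DR3 */
#define SKETCH_RW_EXECUTE  0x0UL        /* DR7 R/Wn == 00: break on execution */

/* Return the 2-bit R/W type that DR7 holds for the given breakpoint slot. */
static unsigned long sketch_dr7_rw_type(unsigned long dr7, int slot)
{
    return (dr7 >> (16 + slot * 4)) & 0x3UL;
}

/*
 * Given the saved flags and the DR6/DR7 values seen on a debug trap,
 * return the flags to resume with. If any slot that DR6 reports as hit
 * is an execute breakpoint, RF is set so that the instruction does not
 * trap again at the same address when the handler returns.
 */
static unsigned long sketch_flags_after_debug_trap(unsigned long flags,
                                                   unsigned long dr6,
                                                   unsigned long dr7)
{
    int slot;

    for (slot = 0; slot < SKETCH_NUM_SLOTS; slot++) {
        if (!(dr6 & (1UL << slot)))
            continue;  /* this slot did not fire */
        if (sketch_dr7_rw_type(dr7, slot) == SKETCH_RW_EXECUTE)
            flags |= SKETCH_EFLAGS_RF;
    }
    return flags;
}
```

The sketch deliberately sets RF without consulting any table of registered
breakpoints, which is the behavior I am arguing for. Whether that belongs in
do_debug or in hw_breakpoint_handler is the open question in this thread.
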
> If you want to submit a nice clean patch to hw_breakpoint_handler to
> change the behavior on an unmatched breakpoint, then submit such a
> patch and justify why (a) the new behavior is better and (b) why it
> doesn't break any actual in-tree code.
>

At last, a compromise -- accepted. In the meantime, put this patch in
to get rid of the crash. I'll code up another series and you can help me by
reviewing it and keeping me on my toes.

> Yes, hw_breakpoint_notify is a piece of shit. In particular, the
> hilariously indirect way in which it's invoked makes no sense
> whatsoever. Fixing that (as a separate patch) would be fantastic IMO.
> But putting extra workarounds into do_debug is not okay, IMO.

Totally agree......

I'll get to work. In the meantime, put this one in for me, boss. I
take Sunday afternoon off, BTW.

:-)

Jeff