Re: [GIT PULL v4.5] Fix INT1 recursion with unregistered breakpoints
From: Andy Lutomirski
Date: Mon Jan 11 2016 - 21:40:58 EST
On Mon, Jan 11, 2016 at 6:26 PM, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
> On 1/11/16, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Mon, Jan 11, 2016 at 6:07 PM, Jeff Merkey <linux.mdb@xxxxxxxxx> wrote:
>>> On 1/11/16, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Mon, Jan 11, 2016 at 5:30 PM, Jeff Merkey <linux.mdb@xxxxxxxxx>
>>>> wrote:
>>>>> On 1/11/16, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>> On Mon, Jan 11, 2016 at 4:44 PM, Jeff Merkey <linux.mdb@xxxxxxxxx>
>>>>>> wrote:
>>>>>>> Hi Thomas,
>>>>>>>
>>>>>>> I agree with #2, we should clear the breakpoint. As for #1, if
>>>>>>> there's an execute breakpoint it MUST be cleared or it will just fire
>>>>>>> off again when it sees the iretd from the int1 exception handler. I
>>>>>>> do use the breakpoint API Thomas, this showed up while debugging and
>>>>>>> testing the API with "lazy debug register switching".
>>>>>>>
>>>>>>> So do you want me to expand the patch and clear the breakpoint? Just
>>>>>>> give the word and I'll get busy and GIT -R- DONE.
>>>>>>
>>>>>> It seems to me that you're papering over some issue instead of fixing
>>>>>> the root cause. If you're using the API, then either you're doing it
>>>>>> wrong or the API is broken. Can you figure out which and fix it?
>>>>>>
>>>>>> --Andy
>>>>>>
>>>>>
>>>>> Andy,
>>>>>
>>>>> Linux should not crash because someone triggered a breakpoint or one
>>>>> got triggered due to a program leaving some bits lying in a read only
>>>>> register (DR6) which for some strange reason someone in the linux
>>>>> world decided could be used as local storage and to pass arguments
>>>>> between subsystems - a register intel designed to be read from for
>>>>> status. I did not design what's in that API, I have to live with
>>>>> it.
>>>>
>>>> The API appears to work, though. Are you *sure* you're using it
>>>> correctly? Are you telling the code in kernel/hw_breakpoint.c about
>>>> your breakpoint?
>>>>
>>>>> So all I am asking is that we fix this issue. It does not matter
>>>>> to my debugger if this is fixed or not in Linux, since I carry the fix
>>>>> in my patch, but it does matter to the overall robustness of Linux.
>>>>
>>>> Robust against what, exactly? What's the bug?
>>>>
>>>> I will grant that the comments about lazy dr7 switching are
>>>> mystifying, and cleaning them up might be nice. But there's no
>>>> adequate explanation of what the failure mode is, how to trigger it,
>>>> or why your patch is a reasonable fix. As it stands, you're
>>>> duplicating code.
>>>>
>>>> --Andy
>>>
>>> Andy,
>>>
>>> Couple of things:
>>>
>>> Would you like a copy of the test harness that creates this bug to
>>> test for yourself? I previously posted it on the list. If you don't
>>> have it, I'll provide it.
>>
>> If you can send a short, buildable thing that triggers it, I'll read it.
>>
>>>
>>> Since the dr6 bits get shifted around, it doesn't matter if the
>>> breakpoint was registered or not in the API because the broken handler
>>> will call NULL bp structures and crash whether it's registered or not.
>>>
>>
>> And what exactly does this have to do with anything? Your patch is
>> all about spurious breakpoints triggered by dr7 and should have
>> nothing much to do with the value in dr6. Unless dr6 is missing a bit
>> due to some issue, but you never suggested any problem like that.
>>
>
> It's about setting the resume flag when an execute breakpoint occurs, no
> matter what caused the breakpoint. If it is not set, the system will hang
> with that processor hung on the same execution address. You cannot have an
> int1 exception path that does not set the resume flag, which is the case
> here -- there should be no path where this flag does not get set on an
> execute breakpoint.

There are many, many ways that one can corrupt kernel state to break
things. You could screw up IST state basically anywhere and crash.
You could screw up GSBASE. You could poke bad values into pt_regs in
a fast syscall and hit the infamous SYSRET failure. You can write a
buggy .fault handler that returns success and doesn't actually do
anything. And yes, you can set a bit in dr7 without telling the
hw_breakpoint code about it and thus infinite loop.
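
To make the infinite-loop case concrete, here is a toy userspace model, not
kernel code. It assumes a handler that sets the resume flag (RF, EFLAGS bit
16) only when the breakpoint slot is known to its bookkeeping, which is a
simplification of the behavior being argued about. The names toy_cpu,
toy_debug_handler, toy_run_one_insn and MAX_FAULTS are made up for
illustration; only the bit positions are real.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define X86_EFLAGS_RF (1u << 16)  /* resume flag, EFLAGS bit 16 */
#define DR6_B0        (1u << 0)   /* DR6.B0: breakpoint 0 condition detected */
#define MAX_FAULTS    8           /* hypothetical cap so the model terminates */

/* Hypothetical, heavily reduced CPU state: one execute breakpoint slot. */
struct toy_cpu {
    uint32_t eflags;
    uint32_t dr6;
    uint64_t ip;
    uint64_t dr0;     /* address of the execute breakpoint */
    bool     dr7_l0;  /* models DR7.L0: slot 0 enabled */
};

/*
 * Toy #DB handler. bp_registered models whether the breakpoint code was
 * told about slot 0. ASSUMPTION: RF is set only on the registered path.
 */
static void toy_debug_handler(struct toy_cpu *cpu, bool bp_registered)
{
    assert(cpu != NULL);
    if ((cpu->dr6 & DR6_B0) && bp_registered)
        cpu->eflags |= X86_EFLAGS_RF;
    cpu->dr6 &= ~DR6_B0;
}

/*
 * Try to execute the instruction at cpu->ip. Returns how many debug faults
 * were taken before it completed, or MAX_FAULTS if it never completed.
 */
static int toy_run_one_insn(struct toy_cpu *cpu, bool bp_registered)
{
    int faults = 0;

    assert(cpu != NULL);
    while (faults < MAX_FAULTS) {
        bool hit = cpu->dr7_l0 && cpu->ip == cpu->dr0 &&
                   !(cpu->eflags & X86_EFLAGS_RF);
        if (!hit) {
            /* The instruction completes; RF is cleared afterwards. */
            cpu->eflags &= ~X86_EFLAGS_RF;
            cpu->ip += 1;
            break;
        }
        /* Execute breakpoints are faults: ip still points at the insn. */
        faults++;
        cpu->dr6 |= DR6_B0;
        toy_debug_handler(cpu, bp_registered);
    }
    return faults;
}
```

With bp_registered true the instruction faults once and then completes. With
bp_registered false it keeps faulting at the same address until the cap is
hit, which is the hang described above.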

Meanwhile, you keep claiming that kernel has a bug and that the bug
can't be triggered without out-of-tree code. In my book, that's not a
bug.

If you want to submit a nice clean patch to hw_breakpoint_handler to
change the behavior on an unmatched breakpoint, then submit such a
patch and justify why (a) the new behavior is better and (b) why it
doesn't break any actual in-tree code.

Yes, hw_breakpoint_notify is a piece of shit. In particular, the
hilariously indirect way in which it's invoked makes no sense
whatsoever. Fixing that (as a separate patch) would be fantastic IMO.
But putting extra workarounds into do_debug is not okay, IMO.

--Andy