Re: [RFC PATCH v2 4/8] arm64: Detect an EL1 exception frame and mark a stack trace unreliable

From: Madhavan T. Venkataraman
Date: Tue Mar 23 2021 - 11:23:32 EST




On 3/23/21 9:33 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 08:31:50AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 8:04 AM, Mark Rutland wrote:
>>> On Tue, Mar 23, 2021 at 07:46:10AM -0500, Madhavan T. Venkataraman wrote:
>>>> On 3/23/21 5:42 AM, Mark Rutland wrote:
>>>>> On Mon, Mar 15, 2021 at 11:57:56AM -0500, madvenka@xxxxxxxxxxxxxxxxxxx wrote:
>>>>>> From: "Madhavan T. Venkataraman" <madvenka@xxxxxxxxxxxxxxxxxxx>
>>>>>>
>>>>>> EL1 exceptions can happen on any instruction including instructions in
>>>>>> the frame pointer prolog or epilog. Depending on where exactly they happen,
>>>>>> they could render the stack trace unreliable.
>>>>>>
>>>>>> If an EL1 exception frame is found on the stack, mark the stack trace as
>>>>>> unreliable.
>>>>>>
>>>>>> Now, the EL1 exception frame is not at any well-known offset on the stack.
>>>>>> It can be anywhere on the stack. In order to properly detect an EL1
>>>>>> exception frame the following checks must be done:
>>>>>>
>>>>>> - The frame type must be EL1_FRAME.
>>>>>>
>>>>>> - When the register state is saved in the EL1 pt_regs, the frame
>>>>>> pointer x29 is saved in pt_regs->regs[29] and the return PC
>>>>>> is saved in pt_regs->pc. These must match with the current
>>>>>> frame.
>>>>>
>>>>> Before you can do this, you need to reliably identify that you have a
>>>>> pt_regs on the stack, but this patch uses a heuristic, which is not
>>>>> reliable.
>>>>>
>>>>> However, instead you can identify whether you're trying to unwind
>>>>> through one of the EL1 entry functions, which tells you the same thing
>>>>> without even having to look at the pt_regs.
>>>>>
>>>>> We can do that based on the entry functions all being in .entry.text,
>>>>> which we could further sub-divide to split the EL0 and EL1 entry
>>>>> functions.
>>>>
>>>> Yes. I will check the entry functions. But I still think that we should
>>>> not rely on just one check. The additional checks will make it robust.
>>>> So, I suggest that the return address be checked first. If that passes,
>>>> then we can be reasonably sure that there are pt_regs. Then, check
>>>> the other things in pt_regs.
>>>
>>> What do you think this will catch?
>>
>> I am not sure that I have an exact example to mention here. But I will attempt
>> one. Let us say that a task has called arch_stack_walk() in the recent past.
>> The unwinder may have copied a stack frame onto some location in the stack
>> with one of the return addresses we check. Let us assume that there is some
>> stack corruption that makes a frame pointer point to that exact record. Now,
>> we will get a match on the return address on the next unwind.
>
> I don't see how this is material to the pt_regs case, as either:
>
> * When the unwinder considers this frame, it appears to be in the middle
> of an EL1 entry function, and the unwinder must mark the unwinding as
> unreliable regardless of the contents of any regs (so there's no need
> to look at the regs).
>
> * When the unwinder considers this frame, it does not appear to be in
> the middle of an EL1 entry function, so the unwinder does not think
> there are any regs to consider, and so we cannot detect this case.
>
> ... unless I've misunderstood the example?
>
> There's a general problem that it's possible to corrupt any portion of
> the chain to skip records, e.g.
>
> A -> B -> C -> D -> E -> F -> G -> H -> [final]
>
> ... could get corrupted to:
>
> A -> B -> D -> H -> [final]
>
> ... regardless of whether C/E/F/G had associated pt_regs. AFAICT there's
> no good way to catch this generally unless we have additional metadata
> to check the unwinding against.
>
> The likelihood of this happening without triggering other checks is
> vanishingly low, and as we don't have a reliable mechanism for detecting
> this, I don't think it's worthwhile attempting to do so.
>
> If and when we try to unwind across EL1 exception boundaries, the
> potential mismatch between the frame record and regs will be more
> significant, and I agree at that point thisd will need more thought.
>
>> Pardon me if the example is somewhat crude. My point is that it is
>> highly unlikely but not impossible for the return address to be on the
>> stack and for the unwinder to get an unfortunate match.
>
> I agree that this is possible in theory, but as above I don't think this
> is a practical concern.
>

OK. What you say makes sense.

Thanks.

Madhavan