Re: [PATCH V7 04/10] arm64: exception: handle Synchronous External Abort
From: James Morse
Date: Thu Jan 19 2017 - 13:10:13 EST
Hi Tyler,
On 18/01/17 23:26, Baicar, Tyler wrote:
> On 1/17/2017 3:31 AM, James Morse wrote:
>> On 12/01/17 18:15, Tyler Baicar wrote:
>>> SEA exceptions are often caused by an uncorrected hardware
>>> error, and are handled when data abort and instruction abort
>>> exception classes have specific values for their Fault Status
>>> Code.
>>> When SEA occurs, before killing the process, go through
>>> the handlers registered in the notification list.
>>> Update fault_info[] with specific SEA faults so that the
>>> new SEA handler is used.
>>> @@ -480,6 +496,28 @@ static int do_bad(unsigned long addr, unsigned int esr,
>>> struct pt_regs *regs)
>>> return 1;
>>> }
>>> +/*
>>> + * This abort handler deals with Synchronous External Abort.
>>> + * It calls notifiers, and then returns "fault".
>>> + */
>>> +static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>>> +{
>>> + struct siginfo info;
>>> +
>>> + atomic_notifier_call_chain(&sea_handler_chain, 0, NULL);
>>> +
>>> + pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>>> + fault_name(esr), esr, addr);
>>> +
>>> + info.si_signo = SIGBUS;
>>> + info.si_errno = 0;
>>> + info.si_code = 0;
>> Half of the other do_*() functions in this file read the signo and code from the
>> fault_info table.
>>
>>
>>> + info.si_addr = (void __user *)addr;
>> addr here was read from FAR_EL1, but for some of the classes of exception you
>> have listed below this register isn't updated with the faulting address.
>>
>> The ARM-ARM version 'k' in D1.10.5 "Summary of registers on faults taken to an
>> Exception level that is using Aarch64" has:
>>> The architecture permits that the FAR_ELx is UNKNOWN for Synchronous External
>>> Aborts other than Synchronous External Aborts on Translation Table Walks. In
>>> this case, the ISS.FnV bit returned in ESR_ELx indicates whether FAR_ELx is
>>> valid.
>> This is a problem if we get 'synchronous external abort' or 'synchronous parity
>> error' while a user space process was running.
> It looks like this would just cause an incorrect address to be printed in the
> above pr_err.
> Unless I'm missing something, I don't see arm64_notify_die or anything that gets
> called from
> there using the info.si_addr variable.
I may be misreading something here...
This patch has:
> info.si_addr = (void __user *)addr;
> arm64_notify_die("", regs, &info, esr);
>From arch/arm64/kernel/traps.c:arm64_notify_die():
> if (user_mode(regs)) {
> current->thread.fault_address = 0;
> current->thread.fault_code = err;
> force_sig_info(info->si_signo, info, current);
> }
So if the SEA interrupted userspace, we put maybe-unknown addr into
force_sig_info() to deliver a signal to user space. User-space then gets a copy
of the info struct containing the maybe-unknown addr.
I think this is an existing bug, but if we are separating the synchronous
external aborts from the generic do_bad handler, we should probably check the
FnV bit. (I think we should still print it out)
> What do you suggest I do here? The firmware should be reporting the physical and
> virtual
> address information if it is available in the HEST entry that the kernel will
> parse.
Its not just firmware that may trigger this, other SoCs may use it for parity or
ECC errors, and they may not always have a valid address in FAR_EL1.
I think we should check the FnV bit in the esr variable and set info.si_addr to
0 if the addr we have isn't valid:
'For some implementations, the value of si_addr may be inaccurate.' [0]
Thanks,
James
[0] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html