Re: [PATCH] arm32: fix flushcache syscall with device address

From: James Morse
Date: Tue Apr 21 2020 - 12:50:31 EST

Next message: Mathieu Poirier: "Re: [PATCH] perf: cs-etm: Update to build with latest opencsd version."
Previous message: shuah: "Re: [PATCH 5.6 00/71] 5.6.6-rc1 review"
In reply to: Jonathan Cameron: "Re: [PATCH] arm32: fix flushcache syscall with device address"
Next in thread: Russell King - ARM Linux admin: "Re: [PATCH] arm32: fix flushcache syscall with device address"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi guys,

(Subject Nit: arm64, as that is what your patch modifies)

On 21/04/2020 12:16, Jonathan Cameron wrote:
> On Tue, 21 Apr 2020 09:12:39 +0100
> Will Deacon <will@xxxxxxxxxx> wrote:
>
>> On Tue, Apr 21, 2020 at 04:08:34PM +0800, Tian Tao wrote:
>>> An issue has been observed on our Kungpeng916 systems when using a PCI
>>> express GPU. This occurs when a 32 bit application running on a 64 bit
>>> kernel issues a cache flush operation to a memory address that is in
>>> a PCI BAR of the GPU.The results in an illegal operation and
>>> subsequent crash.
>>
>> A kernel crash? If so, please can you include the log here?
>
> Deploying my finest copy typing from the image Tian Tao sent out
>
> KERNEL: /root/vmlinux-4.19.36-3patch-00228-debuginfo
> DUMPFILE: vmcore [PARTIAL DUMP]
> CPUS: 64
> DATE: Fri Mar 20 06:59:56 2020
> UPTIME: 07:01:01
> LOAD AVERAGE: 33.76, 35.45, 35.79
> TASKS: 59447
> NODENAME: cpus-new-ondemand-0509
> RELEASE: 4.19.36-3patch-0228
> VERSION: #4 SMP Fri Feb 28 15:18:51 UTC 2020
> MACHINE: aarch64 (unknown MHz)
> MEMORY: 255.7 GB
> PANIC: "kernel panic - not syncing: Asynchronous SError Interrupt"
> PID: 175108
> COMMAND: "UnityMain"
> TASK: ffff80a96999dd00 [THREAD_INFO: ffff80a96999dd00]
> CPU: 62
> STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 175108 TASK: ffff80a96999dd00 CPU: 62 COMMAND: "UnityMain"
> #0 [ffff000194e1b920] machine_kexec at ffff0000080a265c
> #1 [ffff000194e1b980] __crash_kexec at ffff0000081b3ba8
> #2 [ffff000194e1bb10] panic at ffff0000080ecc98
> #3 [ffff000194e1bbf0] nmi_panic at ffff0000080ec7f4
> #4 [ffff000194e1bc10] arm64_serror_panic at fff00000809019c
> #5 [ffff000194e1bc30] do_serror at ffff00000809039c
> #6 [ffff000194e1bd90] el1_error at ffff000008083e50
> #7 [ffff000194e1bda0] __flush_icache_range at ffff0000080a9ec4
> #8 [ffff000194e1be60] el0_svc_common at fff0000080977d8
> #9 [ffff000194e1bea0] el0_svc_compat_handler at ffff0000080979b4
> #10 [ffff000194e1bff0] el0_svc_compat at ffff0000008083874
>
> PC: c90fe7f8 LR: c90ff09c SP: d2afa8e0 PSTATE: 800b0010
> X12: c56e96e4 X11: d2afaa48 X10: d0ff1000 X9: d2afab68
> x8: 000000d6 X7: 000f0002 X6: d3c61840 X5: d3c61001
> X4: d3c03000 X3: 0004d54a x2: 00000000 x1: d3c61040
> X0: d3c61000
>
>
> New advanced test for Mavis Beacon teaches typing.

Thanks for doing that!

> In summary this is all we have to hand...

Jumping back to the patch:
On 21/04/2020 09:08, Tian Tao wrote:> diff --git a/arch/arm64/kernel/sys_compat.c
b/arch/arm64/kernel/sys_compat.c
> index 3c18c24..6c07944 100644
> --- a/arch/arm64/kernel/sys_compat.c
> +++ b/arch/arm64/kernel/sys_compat.c
> @@ -32,6 +94,13 @@ __do_compat_cache_op(unsigned long start, unsigned long end)
> if (fatal_signal_pending(current))
> return 0;
>
> + /* do not flush page table is non-cacheable */
> + if (!__check_pt_cacheable(start)) {
> + cond_resched();
> + start += chunk;
> + continue;
> + }

The Arm-arm expects this to work, so we can't just skip it!

D4.4.8's "Effects of instructions that operate by VA to the PoC" has:
| For Device memory and Normal memory that is Inner Non-cacheable, Outer Non-cacheable,
| these instructions must affect the caches of all PEs in the Outer Shareable shareability
| domain of the PE on which the instruction is operating.

What does aarch64 user-space do in the same situation? Surely that also goes down in
flames too!?

I think the real problem here is you've given this kind of mapping to user-space. If the
device behind the mapping can respond like this, you must trust user-space not to poke it
inappropriately.

If its not got all the properties of memory: please don't pretend its memory.
If its a device, it probably needs to be carefully managed by a dedicated driver.

Do you know where the abort comes from and why? (The interconnect, PCIE-root-complex, or
from the GPU itself?)
Is it the wrong attributes? Too large an eviction from the cache? Wrong alignment for this
BAR? It can't handle cache-maintenance?

Thanks,

James

Next message: Mathieu Poirier: "Re: [PATCH] perf: cs-etm: Update to build with latest opencsd version."
Previous message: shuah: "Re: [PATCH 5.6 00/71] 5.6.6-rc1 review"
In reply to: Jonathan Cameron: "Re: [PATCH] arm32: fix flushcache syscall with device address"
Next in thread: Russell King - ARM Linux admin: "Re: [PATCH] arm32: fix flushcache syscall with device address"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]