Bisection came down to 1733a2ad3674 ("drm/nouveau/device/pci: set as
non-CPU-coherent on ARM64"), and sure enough reverting that removes the
crash.
Thanks for taking the time to bisect this. And apologies as it seems my
commit is the reason for your troubles.
The CPU coherency flag is used for two things: explicitly syncing buffer
pages when required, and allocating buffers that are not explicitly
synced (like fences or pushbuffers) using the DMA API. For this latter
use, it also accesses the buffer's content using the mapping provided by
dma_alloc_coherent() instead of creating a new one. All nouveau_bos are
supposed to be accessed through nouveau_bo_rd32(), and this function
handles the case of a DMA-API-allocated object by detecting that the
result of ttm_kmap_obj_virtual() is NULL.
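
To make the fallback concrete, here is a minimal userspace sketch of the
idea. It is not the driver code: struct mock_bo and mock_bo_rd32() are
hypothetical stand-ins, and the two pointer fields only model what
ttm_kmap_obj_virtual() and dma_alloc_coherent() would provide.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer object; not the real struct nouveau_bo. */
struct mock_bo {
	uint32_t *kmap_virtual; /* models ttm_kmap_obj_virtual(); NULL for DMA-API buffers */
	uint32_t *dma_virtual;  /* models the mapping returned by dma_alloc_coherent() */
};

/* Read the 32-bit word at `index`, falling back to the DMA mapping
 * when no kmap mapping exists. No bounds checking in this sketch. */
static uint32_t mock_bo_rd32(const struct mock_bo *bo, size_t index)
{
	const uint32_t *mem = bo->kmap_virtual;

	if (mem == NULL)
		mem = bo->dma_virtual;

	return mem[index];
}
```

Any caller that goes through such an accessor never needs to know which
mapping backs the buffer.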
But as it turns out, OUT_RINGp() also calls ttm_kmap_obj_virtual() in
order to perform a memcpy and uses its result directly - which means we
are doing memcpy on a NULL pointer. We never caught this because we
typically do not use Nouveau's fbcon on ARM setups.
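
As an illustration only (this is neither the attached patch nor the real
OUT_RINGp()), a bulk copy can resolve its destination defensively
instead of trusting the kmap result. struct mock_pushbuf and
mock_out_ringp() are hypothetical names, and the pointer fields again
just model the two possible mappings.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a pushbuffer; not the real nouveau structures. */
struct mock_pushbuf {
	uint32_t *kmap_virtual; /* models ttm_kmap_obj_virtual(); NULL for DMA-API buffers */
	uint32_t *dma_virtual;  /* models the dma_alloc_coherent() mapping */
	size_t cur;             /* index of the next free word */
};

/* Copy nr_words words into the pushbuffer.
 * Returns 0 on success, -1 if no CPU mapping is available.
 * No bounds checking in this sketch. */
static int mock_out_ringp(struct mock_pushbuf *pb, const uint32_t *data,
			  size_t nr_words)
{
	uint32_t *mem = pb->kmap_virtual;

	/* The crashing path corresponds to calling memcpy() on `mem`
	 * right here, without the two checks below. */
	if (mem == NULL)
		mem = pb->dma_virtual;
	if (mem == NULL)
		return -1;

	memcpy(mem + pb->cur, data, nr_words * sizeof(*data));
	pb->cur += nr_words;
	return 0;
}
```

With only the DMA mapping set, the copy lands in that mapping; with
neither set, the function reports an error instead of dereferencing
NULL.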
I don't really like this special access for coherent objects, and
actually had a patch in my tree to attempt to remove it (attached).
Although it is not the whole solution (see below), the issue should at
least not be visible with it applied - could you confirm?
Hi Robin, could you confirm whether the attached patch in my previous
mail helps with your problem?