Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

From: Christian König
Date: Mon Jun 14 2021 - 07:19:57 EST




On 11.06.21 at 20:23, Ondrej Zary wrote:
On Friday 11 June 2021 14:38:18 Christian König wrote:
On 10.06.21 at 19:59, Christian König wrote:
On 10.06.21 at 19:50, Ondrej Zary wrote:
[SNIP]
I can't see how this is called from the nouveau code, only possibility I
see is that it is maybe called through the AGP code somehow.
Yes, you're right:
[   13.192663] Call Trace:
[   13.192678]  dump_stack+0x54/0x68
[   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
[   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
[   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
[   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
[   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
[   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
[   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
[   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
[   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
[   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
[   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
[   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
[   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
[   13.193686]  ? kfree+0x9e/0x11a
[   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
[   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
[   13.193924]  ? pci_enable_device_flags+0x1e/0xac
[   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
[   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
[   13.194195]  pci_device_probe+0x89/0xe9
[   13.194205]  really_probe+0x127/0x2a7
[   13.194212]  driver_probe_device+0x5b/0x87
[   13.194219]  device_driver_attach+0x2e/0x41
[   13.194226]  __driver_attach+0x7c/0x83
[   13.194232]  bus_for_each_dev+0x4c/0x66
[   13.194238]  driver_attach+0x14/0x16
[   13.194244]  ? device_driver_attach+0x41/0x41
[   13.194251]  bus_add_driver+0xc5/0x16c
[   13.194258]  driver_register+0x87/0xb9
[   13.194265]  __pci_register_driver+0x38/0x3b
[   13.194271]  ? 0xf0c0d000
[   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]

How is ttm_dma_tt->dma_address allocated?
Mhm, I need to double check how AGP is supposed to work.

Since barely anybody is using it these days it is something which
breaks from time to time.
I have no idea how that ever worked in the first place since AGP isn't
supposed to sync between CPU/GPU. Everything is coherent for that case.

Anyway, here is a patch which adds a check to those functions for whether
the dma_address array is allocated in the first place. Please test it.
Thanks, the patch fixes the problem and nouveau now works!
Should be applied to 5.12-stable too (5.11 is affected too but EOL).

I will just add a CC stable tag before pushing.


It's weird that it worked before.
Looks like dma_address was used uninitialized - it contained some random
crap:
[ 12.293304] nouveau_bo_sync_for_device: ttm_dma->dma_address=3e055971 ttm_dma->ttm.num_pages=18
[ 12.293321] ttm_dma->dma_address[0]=0x0
[ 12.293341] ttm_dma->dma_address[1]=0x0
[ 12.293360] ttm_dma->dma_address[2]=0xee728980
[ 12.293379] ttm_dma->dma_address[3]=0xed1cb120
[ 12.293397] ttm_dma->dma_address[4]=0x12
[ 12.293416] ttm_dma->dma_address[5]=0x0
[ 12.293434] ttm_dma->dma_address[6]=0x1
[ 12.293453] ttm_dma->dma_address[7]=0x0
[ 12.293471] ttm_dma->dma_address[8]=0x10000
[ 12.293490] ttm_dma->dma_address[9]=0x0
[ 12.293510] ttm_dma->dma_address[10]=0x101
[ 12.293528] ttm_dma->dma_address[11]=0xee7289ec
[ 12.293546] ttm_dma->dma_address[12]=0xee7289ec
[ 12.293564] ttm_dma->dma_address[13]=0x0
[ 12.293581] ttm_dma->dma_address[14]=0x0
[ 12.293599] ttm_dma->dma_address[15]=0x0
[ 12.293616] ttm_dma->dma_address[16]=0x0
[ 12.293634] ttm_dma->dma_address[17]=0x0
But it did not matter as dma_sync_single_for_device is a no-op here.
When dma_address is properly initialized to NULL, it crashes...

Ok, that explains things, but it essentially means that this only worked by coincidence.
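
For readers following along, the kind of check discussed above can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the actual patch: struct mock_dma_tt, mock_sync_page() and mock_bo_sync_for_device() are hypothetical stand-ins, and the real kernel structures and DMA API calls are deliberately not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the TTM DMA page table; not the kernel struct. */
struct mock_dma_tt {
	uint64_t *dma_address;	/* NULL when no per-page DMA array exists (e.g. the AGP path) */
	size_t num_pages;
};

/* Hypothetical per-page sync; it only counts how often it was called. */
static size_t mock_sync_calls;

static void mock_sync_page(uint64_t addr)
{
	(void)addr;
	mock_sync_calls++;
}

/*
 * Sync every page for the device, but bail out early when the
 * dma_address array was never allocated. Returns the number of
 * pages that were synced (0 when the sync was skipped).
 */
static size_t mock_bo_sync_for_device(const struct mock_dma_tt *tt)
{
	size_t i;

	if (!tt || !tt->dma_address)
		return 0;

	for (i = 0; i < tt->num_pages; i++)
		mock_sync_page(tt->dma_address[i]);

	return tt->num_pages;
}
```

With dma_address set to NULL and num_pages set to 18, as in the log above, the function returns 0 without touching the array; without the early return, the loop would dereference the NULL pointer on its first iteration.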

I just sent out the patch to Ben, the list and you once more. Please reply with a Reviewed-by, Acked-by and/or Tested-by so that I can push it ASAP.

Thanks,
Christian.


Thanks,
Christian.

Thanks for the backtrace,
Christian.

I cannot find any assignment executed (in the working code):

$ git grep dma_address\ = drivers/gpu/
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: sg->sgl->dma_address = addr;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = &dma->dma_address[offset >> PAGE_SHIFT];
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = (mm_node->start << PAGE_SHIFT) + offset;
drivers/gpu/drm/i915/gvt/scheduler.c:   sg->dma_address = addr;
drivers/gpu/drm/i915/i915_gpu_error.c:  sg->dma_address = it;
drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages);
drivers/gpu/drm/ttm/ttm_tt.c:   ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages,
drivers/gpu/drm/ttm/ttm_tt.c:   ttm_dma->dma_address = NULL;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_phys_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_dma_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_sg_addr;

The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
ttm_sg_tt_alloc_page_directory().
Confirmed by adding printk()s that they're NOT called.
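
To make the three ttm_tt.c assignments in the grep output easier to compare, here is a small userspace sketch of the allocation patterns they suggest. It is an assumption-laden illustration only: struct mock_tt and the mock_alloc_*() / mock_free() helpers are hypothetical names, calloc() stands in for the kernel allocators, and the mapping of each pattern to a particular kernel function is not claimed beyond what is stated above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in; not the kernel's struct ttm_tt. */
struct mock_tt {
	void **pages;
	uintptr_t *dma_address;
	size_t num_pages;
};

/* Pattern 1: one allocation, DMA addresses stored right after the page pointers. */
static int mock_alloc_pages_and_dma(struct mock_tt *tt, size_t n)
{
	tt->pages = calloc(n, sizeof(*tt->pages) + sizeof(*tt->dma_address));
	if (!tt->pages)
		return -1;
	tt->dma_address = (void *)(tt->pages + n);
	tt->num_pages = n;
	return 0;
}

/* Pattern 2: only the DMA address array is allocated. */
static int mock_alloc_dma_only(struct mock_tt *tt, size_t n)
{
	tt->pages = NULL;
	tt->dma_address = calloc(n, sizeof(*tt->dma_address));
	if (!tt->dma_address)
		return -1;
	tt->num_pages = n;
	return 0;
}

/* Pattern 3: only page pointers; dma_address is explicitly left NULL. */
static int mock_alloc_pages_only(struct mock_tt *tt, size_t n)
{
	tt->pages = calloc(n, sizeof(*tt->pages));
	if (!tt->pages)
		return -1;
	tt->dma_address = NULL;
	tt->num_pages = n;
	return 0;
}

/* In pattern 1 the DMA array lives inside the pages allocation. */
static void mock_free(struct mock_tt *tt)
{
	if (tt->pages)
		free(tt->pages);
	else
		free(tt->dma_address);
	tt->pages = NULL;
	tt->dma_address = NULL;
	tt->num_pages = 0;
}
```

Only the first two patterns leave dma_address pointing at usable memory. A caller that goes through the third one, which is what the printk() result suggests for the AGP case, has to treat dma_address as absent.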