Re: [PATCH] drm/nouveau: Fix bug in buffer relocs for Nouveau

From: Christian König
Date: Mon Mar 27 2023 - 15:53:54 EST


Am 27.03.23 um 10:42 schrieb John Ogness:
On 2023-01-19, Tanmay Bhushan <007047221b@xxxxxxxxx> wrote:
dma_resv_wait_timeout returns greater than zero on success
as opposed to ttm_bo_wait_ctx. As a result of that relocs
will fail and give failure even when it was a success.
Today I switched my workstation from 6.2 to 6.3-rc3 and started seeing
lots of new kernel messages:

[ 642.138313][ T1751] nouveau 0000:f0:10.0: X[1751]: reloc wait_idle failed: 1500
[ 642.138389][ T1751] nouveau 0000:f0:10.0: X[1751]: reloc apply: 1500
[ 646.123490][ T1751] nouveau 0000:f0:10.0: X[1751]: reloc wait_idle failed: 1500
[ 646.123573][ T1751] nouveau 0000:f0:10.0: X[1751]: reloc apply: 1500

The graphics seemed to go slower or hang a bit when these messages would
appear. I then found your patch! However, I have some comments about it.

First, it should include a fixes tag:

Fixes: 41d351f29528 ("drm/nouveau: stop using ttm_bo_wait")

Signed-off-by: Tanmay Bhushan <007047221b@xxxxxxxxx>
---
drivers/gpu/drm/nouveau/nouveau_gem.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index f77e44958037..0e3690459144 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -706,9 +706,8 @@ nouveau_gem_pushbuf_reloc_apply(struct nouveau_cli *cli,
ret = dma_resv_wait_timeout(nvbo->bo.base.resv,
DMA_RESV_USAGE_BOOKKEEP,
false, 15 * HZ);
- if (ret == 0)
+ if (ret <= 0) {
ret = -EBUSY;
This is incorrect for 2 reasons:

* it treats restarts as timeouts

* this function now returns >0 on success

- if (ret) {
NV_PRINTK(err, cli, "reloc wait_idle failed: %ld\n",
ret);
break;
I rearranged things to basically correctly translate the return code of
dma_resv_wait_timeout() to match the previous ttm_bo_wait():

ret = dma_resv_wait_timeout(nvbo->bo.base.resv,
DMA_RESV_USAGE_BOOKKEEP,
false, 15 * HZ);
if (ret == 0)
ret = -EBUSY;
if (ret > 0)
ret = 0;
if (ret) {
NV_PRINTK(err, cli, "reloc wait_idle failed: %ld\n",
ret);
break;
}

So the patch just becomes:

@@ -708,6 +708,8 @@ nouveau_gem_pushbuf_reloc_apply(struct n
false, 15 * HZ);
if (ret == 0)
ret = -EBUSY;
+ if (ret > 0)
+ ret = 0;
if (ret) {
NV_PRINTK(err, cli, "reloc wait_idle failed: %ld\n",
ret);

With this variant, everything runs correctly on my workstation again.

It probably deserves a comment about why @ret is being translated. Or
perhaps a new variable should be introduced to separate the return value
of dma_resv_wait_timeout() from the return value of this function.

I'm going to take a look tomorrow, but your code already looks pretty correct to me.

And sorry for the noise, missed the different in the conversion.

Thanks,
Christian.


Either way, this is an important fix for 6.3-rc!

John Ogness