[PATCH 0/1] drm/qxl: fixes qxl_fence_wait

From: Alex Constantino
Date: Thu Mar 07 2024 - 20:12:28 EST


Hi,
As initially reported by Timo in the QXL driver will crash given enough
workload:
https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@xxxxxxxxxxxxx/
I initially came across this problem when migrating Debian VMs from Bullseye
to Bookworm. This bug will somewhat randomly but consistently happen, even
just by using neovim with plugins or playing a video. This exception would
then cascade and make Xorg crash too.

The error log from dmesg would have `[TTM] Buffer eviction failed` followed
by either a `failed to allocate VRAM BO` or `failed to allocate GEM object`.
And the error log from Xorg would have `qxl(0): error doing QXL_ALLOC`
followed by a backtrace and segmentation fault.

I can confirm the problem still exists in latest kernel versions:
https://gitlab.freedesktop.org/drm/kernel @ c6d6a82d8a9f
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git @ 1870cdc0e8de

When I was investigating this issue I ended up creating a script which
triggers the issue in just a couple of minutes when executed under uxterm.
YMMV according to your system, for example when using urxvt crashes were
not as consistent, likely due to it being more efficient and having less
video memory allocations.
For me this is the fastest way to trigger the bug. Here follows:
```
#!/bin/bash
print_gradient_with_awk() {
local arg="$1"
if [[ -n $arg ]]; then
arg=" ($arg)"
fi
awk -v arg="$arg" 'BEGIN{
s="/\\/\\/\\/\\/\\"; s=s s s s s s s s;
for (colnum = 0; colnum<77; colnum++) {
r = 255-(colnum*255/76);
g = (colnum*510/76);
b = (colnum*255/76);
if (g>255) g = 510-g;
printf "\033[48;2;%d;%d;%dm", r,g,b;
printf "\033[38;2;%d;%d;%dm", 255-r,255-g,255-b;
printf "%s\033[0m", substr(s,colnum+1,1);
}
printf "%s\n", arg;
}'
}
for i in {1..10000}; do
print_gradient_with_awk $i
done
```

Timo initially reported:
commit 5f6c871fe919 ("drm/qxl: properly free qxl releases") as working fine
commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") introducing the bug

The bug occurs whenever a timeout is reached in wait_event_timeout.
To fix this issue I updated the code to include a busy wait logic, which
was how the last working version operated. That fixes this bug while still
keeping the code simple (which I suspect was the motivation for the
5a838e5d5825 commit in the first place), as opposed to just reverting to
the last working version at 5f6c871fe919
The choice for the use of HZ as a scaling factor for the loop was that it
is also used by ttm_bo_wait_ctx which is one of the indirect callers of
qxl_fence_wait, with the other being ttm_bo_delayed_delete

To confirm the problem no longer manifests I have:
- executed my own test case pasted above
- executed Timo's test case pasted below
- played a video stream in mplayer for 3h (no audio stream because
apparently pulseaudio and/or alsa have memory leaks that make the
system run out of memory)

For quick reference here is Timo's script:
```
#!/bin/bash
chvt 3
for j in $(seq 80); do
echo "$(date) starting round $j"
if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
echo "bug was reproduced after $j tries"
exit 1
fi
for i in $(seq 100); do
dmesg > /dev/tty3
done
done
echo "bug could not be reproduced"
exit 0
```

>From what I could find online it seems that users that have been affected
by this problem just tend to move from QXL to VirtIO, that is why this bug
has been hidding for over 3 years now.
This issue was initially reported by Timo 4 months ago but the discussion
seems to have stalled.
It would be great if this could be addressed and avoid it falling through
the cracks.

Thank you for your time.


---

Alex Constantino (1):
drm/qxl: fixes qxl_fence_wait

drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)


base-commit: 1870cdc0e8dee32e3c221704a2977898ba4c10e8
--
2.39.2