Re: [PATCH] drm/virtio: warn when virtqueue has no free space for too long

From: Dmitry Osipenko

Date: Mon Jun 29 2026 - 08:11:33 EST

Hi,

On 6/18/26 10:18, Ryosuke Yasuoka wrote:
> virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() wait for
> virtqueue space using wait_event() with vqs_released as the only abort
> condition. This covers the device removal path, where
> virtio_gpu_release_vqs() sets the flag, but does not help when the host
> simply stops processing the virtqueue while the device remains present.
>
> In that case, the virtqueue fills up and subsequent command submissions
> block indefinitely in D state with no diagnostic output, making the root
> cause difficult to identify.
>
> Replace the bare wait_event() with a wait_event_timeout() loop that
> prints a warning every 5 seconds while the virtqueue remains full. The
> wait still blocks indefinitely so driver behavior is unchanged. The
> warnings help identify an unresponsive host device during
> troubleshooting.
>
> Signed-off-by: Ryosuke Yasuoka <ryasuoka@xxxxxxxxxx>
> ---
> When the host stops processing the virtio-gpu virtqueue without
> triggering device removal, the bare wait_event() in
> virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() blocks
> indefinitely with no diagnostic output. A DRM atomic commit worker
> blocks in virtio_gpu_queue_fenced_ctrl_buffer() while holding the
> modeset lock. During graceful shutdown, systemd (PID 1) needs the same
> lock — either by writing to the console via fbcon, or by closing a DRM
> file descriptor that triggers framebuffer cleanup — and blocks as well,
> making the VM unrecoverable without a forced power-off.
>
> PID: 553 COMMAND: "kworker/u4:3"
> #0 __schedule
> #1 schedule
> #2 virtio_gpu_queue_fenced_ctrl_buffer [virtio_gpu]
> #3 virtio_gpu_primary_plane_update [virtio_gpu]
> ...
>
> PID: 1 COMMAND: "systemd" (console write path)
> #0 __schedule
> #1 schedule
> #2 schedule_preempt_disabled
> #3 __ww_mutex_lock
> #4 drm_modeset_lock [drm]
> #5 drm_atomic_get_plane_state [drm]
> #6 drm_client_modeset_commit_atomic [drm]
> #7 drm_client_modeset_commit_locked [drm]
> #8 drm_fb_helper_pan_display [drm_kms_helper]
> #9 fb_pan_display
> #10 bit_update_start
> #11 fbcon_switch
> #12 redraw_screen
> ...
>
> Reproduction steps:
> 1. Build QEMU with the fault injection patch [1] that adds an
> x-ctrl-queue-broken property to virtio-gpu.
> 2. Boot the VM and trigger the fault injection from the host.
> 3. Fill the ctrlq (e.g., move the mouse on the guest's display).
> The process gets stuck in virtio_gpu_queue_fenced_ctrl_buffer()
> in D state.
> 4. Run a graceful shutdown command (shutdown now or reboot).
> 5. The shutdown process hangs.
>
> My earlier patch a46991b334f6 ("drm/virtio: abort virtqueue wait on
> device removal to avoid hung task") covers the case where the shutdown
> process reaches the device_shutdown() call path, which sets vqs_released
> to unblock the wait. However, during graceful shutdown, systemd (PID 1)
> gets stuck on the modeset lock before ever reaching device_shutdown(),
> so vqs_released is never set and the wait is never unblocked.
>
> I initially considered adding a module parameter to abort the wait with
> -ENODEV on timeout:
>
> +static unsigned int virtio_gpu_vq_timeout;
> +MODULE_PARM_DESC(vq_timeout,
> + "Timeout in seconds for virtqueue wait (0 = no timeout, default)");
> +module_param_named(vq_timeout, virtio_gpu_vq_timeout, uint, 0444);
> ...
> + if (virtio_gpu_vq_timeout) {
> + if (!wait_event_timeout(vgdev->ctrlq.ack_queue,
> + vq->num_free >= elemcnt ||
> + vgdev->vqs_released,
> + secs_to_jiffies(virtio_gpu_vq_timeout))) {
> + if (fence && vbuf->objs)
> + virtio_gpu_array_unlock_resv(vbuf->objs);
> + free_vbuf(vgdev, vbuf);
> + drm_dev_exit(idx);
> + return -ENODEV;
> + }
> + } else {
> + wait_event(vgdev->ctrlq.ack_queue,
> + vq->num_free >= elemcnt ||
> + vgdev->vqs_released);
> + }
>
> This approach aborts the wait and allows the graceful shutdown process
> to eventually proceed, albeit with a delay.
>
> But that approach has drawbacks: it allows users to set arbitrarily
> short timeouts that could destabilize the driver, and aborting commands
> mid-flight is a rough recovery path. An unconditional timeout was also
> discussed previously [2] but is not appropriate without virtio
> specification support.
>
> This patch takes a safer approach: replace the bare wait_event() with
> wait_event_timeout() and print a warning every 5 seconds while the
> virtqueue remains full. The wait still blocks indefinitely and no
> commands are aborted, so driver behavior is unchanged. The warnings
> help identify an unresponsive host device during troubleshooting.
> Once the user notices the warning, they can work around the hang by
> unbinding the VT from fbcon, removing the device, or forcing a shutdown
> via SysRq.
>
> [1] https://gist.github.com/YsuOS/fbcd181752594af35f954953a1d260b8
> [2] https://lore.kernel.org/all/8a986c52-964f-42a5-b063-fbe2b242ca36@xxxxxxxxxxxxx/
> ---
> drivers/gpu/drm/virtio/virtgpu_vq.c | 18 ++++++++++++++----
> 1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
> index 68d097ad9d1d..a546130d3b6a 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_vq.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
> @@ -410,8 +410,13 @@ static int virtio_gpu_queue_ctrl_sgs(struct virtio_gpu_device *vgdev,
> if (vq->num_free < elemcnt) {
> spin_unlock(&vgdev->ctrlq.qlock);
> virtio_gpu_notify(vgdev);
> - wait_event(vgdev->ctrlq.ack_queue,
> - vq->num_free >= elemcnt || vgdev->vqs_released);
> + while (!wait_event_timeout(vgdev->ctrlq.ack_queue,
> + vq->num_free >= elemcnt ||
> + vgdev->vqs_released,
> + 5 * HZ) && !vgdev->vqs_released)
> + DRM_WARN("ctrlq waiting for host: no free space for %d secs\n",
> + 5);
> +
> /*
> * Set by virtio_gpu_release_vqs() to unblock
> * synchronize_srcu() wait in drm_dev_unplug().
> @@ -592,8 +597,13 @@ static void virtio_gpu_queue_cursor(struct virtio_gpu_device *vgdev,
> ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
> if (ret == -ENOSPC) {
> spin_unlock(&vgdev->cursorq.qlock);
> - wait_event(vgdev->cursorq.ack_queue,
> - vq->num_free >= outcnt || vgdev->vqs_released);
> + while (!wait_event_timeout(vgdev->cursorq.ack_queue,
> + vq->num_free >= outcnt ||
> + vgdev->vqs_released,
> + 5 * HZ) && !vgdev->vqs_released)
> + DRM_WARN("cursorq waiting for host: no free space for %d secs\n",
> + 5);
> +
> /* See comment in virtio_gpu_queue_ctrl_sgs(). */
> if (vgdev->vqs_released) {
> free_vbuf(vgdev, vbuf);

The [1] has a valid point about the hangcheck warning. Is
wait_event_timeout() worth the change then?

[1]
https://sashiko.dev/#/patchset/20260618-virtiogpu_add_timeout-v1-1-dc36cef609d9%40redhat.com

--
Best regards,
Dmitry