Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

From: Jason Wang
Date: Thu May 16 2024 - 05:53:12 EST


On Thu, May 16, 2024 at 5:46 PM Catherine Redfield
<catherine.redfield@xxxxxxxxxxxxx> wrote:
>
> Feng,
>
> Thank you for providing your debugging steps; I used them on a GCE image locally and was not able to replicate the issue. I also attempted to replicate it in qemu/virsh, using qemu-guest-agent to enable the S3 suspend state, also without success (that is, S3 suspend worked without any problems). I have brought this back to the cloud for further debugging of their config and guest agent to try to determine what the issue is.
>
> Thank you very much for all your help on this issue and time looking into it!
> Catherine

Does this fix the issue? I guess the reason is that GCE is using legacy virtio.

https://lore.kernel.org/kvm/CACGkMEth_9Baewekq862YgZwuozwG96Z3G6oYqHzyCj2JPUZ3g@xxxxxxxxxxxxxx/T/
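
In case it helps to check that guess against the feature strings quoted
below, here is a rough Python sketch. It assumes the sysfs "features" file
prints one character per bit with bit 0 first, and that VIRTIO_F_VERSION_1
is bit 32 and VIRTIO_F_ADMIN_VQ is bit 41 (as in
include/uapi/linux/virtio_config.h). The helper names are made up for
illustration only.

```python
from typing import List

# Feature bit numbers (assumed from include/uapi/linux/virtio_config.h).
VIRTIO_F_VERSION_1 = 32
VIRTIO_F_ADMIN_VQ = 41


def set_bits(features: str) -> List[int]:
    """Return the indices of the '1' characters; sysfs is assumed to print bit 0 first."""
    return [i for i, c in enumerate(features.strip()) if c == "1"]


def is_legacy(features: str) -> bool:
    """A device that does not offer VIRTIO_F_VERSION_1 is treated as legacy here."""
    return VIRTIO_F_VERSION_1 not in set_bits(features)


def has_admin_vq(features: str) -> bool:
    """True if the VIRTIO_F_ADMIN_VQ bit is set in the feature string."""
    return VIRTIO_F_ADMIN_VQ in set_bits(features)


# Feature strings copied from the GCE report quoted further down in this thread.
GCE_FEATURES = {
    "virtio0": "0110000000000000000000000000010000000000000000000000000000000000",
    "virtio1": "1110010110011001110000100000010000000000000000000000000000000000",
    "virtio2": "1110000000000000000000000000000000000000000000000000000000000000",
    "virtio3": "0000000000000000000000000000000000000000000000000000000000000000",
}

if __name__ == "__main__":
    for name, feats in GCE_FEATURES.items():
        kind = "legacy" if is_legacy(feats) else "modern"
        print(name, set_bits(feats), kind, "admin_vq" if has_admin_vq(feats) else "no admin_vq")
```

If bit 32 turns out to be clear for every device, that would be consistent
with the legacy guess, though it does not by itself explain the -16 error.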

Thanks

>
> On Thu, May 9, 2024 at 5:03 AM Feng Liu <feliu@xxxxxxxxxx> wrote:
>>
>>
>> On 2024-05-08 a.m.7:18, Catherine Redfield wrote:
>> >
>> > On a VM with the GCP kernel (where we first identified the problem), I see:
>> >
>> > 1. The full kernel log from `journalctl --system > kernlog` attached.
>> > The specific suspend section is here:
>> >
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Reached target sleep.target - Sleep.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd[1]: Starting systemd-suspend.service - System Suspend...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > systemd-sleep[1413]: Performing sleep operation 'suspend'...
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (deep)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Filesystems sync: 0.008 seconds
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing user space processes completed (elapsed 0.001 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer disabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Freezing remaining freezable tasks completed (elapsed 0.000 seconds)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: printk: Suspending console(s) (use no_console_suspend to debug)
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: dpm_run_callback():
>> > pm_runtime_force_suspend+0x0/0x130 returns -16
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: port 00:03:0.0: PM: failed to suspend: error -16
>>
>> Thanks to Joseph and Catherine for their help.
>>
>> Hi,
>>
>> I have already synced up with the Canonical folks offline about this issue.
>>
>> I can run suspend/resume successfully on my local server and VM.
>> Also, "PM: failed to suspend: error -16" does not look like it is caused
>> by my previous virtio patch (fd27ef6b44be ("virtio-pci: Introduce admin
>> virtqueue")), which only modified virtio_device_freeze() in the suspend
>> path.
>>
>> So I have provided my steps and debug patch to Joseph and Catherine.
>> I will also share the information here, as follows:
>>
>> I read the QEMU code and found a way to trigger suspend/resume on
>> my setup, and added some debug messages to the latest kernel.
>>
>> My steps are:
>> 1. Add the following to the QEMU command line:
>> ....
>> -global PIIX4_PM.disable_s3=0 \
>> -global PIIX4_PM.disable_s4=1 \
>> ....
>> -netdev type=tap,ifname=tap0,id=hostnet0,script=no,downscript=no \
>> -device virtio-net-pci,netdev=hostnet0,id=net0,mac=$SSH_MAC,bus=pci.0,addr=0x3 \
>> ......
>>
>> 2. In the VM, run "systemctl suspend" to suspend the VM to memory
>> 3. In the QEMU HMP shell, run "system_wakeup" to resume the VM
>>
>> My VM configuration:
>> NIC: 1 virtio nic emulated by QEMU
>> OS: Ubuntu 22.04.4 LTS
>> kernel: latest kernel, 6.9-rc7: ee5b455b0ada (kernel2/net-next-virito,
>> kernel2/master, master) Merge tag 'slab-for-6.9-rc7-fixes' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)
>>
>>
>> I added some debug messages to the latest kernel and followed the steps
>> above to trigger suspend/resume. Everything in the VM is OK; the VM
>> suspends and resumes successfully.
>> The following is the kernel log:
>> ----------------------------------------------------------------------------
>> ........
>> May 6 15:59:52 feliu-vm kernel: [ 43.446737] PM: suspend entry (deep)
>> May 6 16:00:04 feliu-vm kernel: [ 43.467640] Filesystems sync: 0.020
>> seconds
>> May 6 16:00:04 feliu-vm kernel: [ 43.467923] Freezing user space
>> processes
>> May 6 16:00:04 feliu-vm kernel: [ 43.470294] Freezing user space
>> processes completed (elapsed 0.002 seconds)
>> May 6 16:00:04 feliu-vm kernel: [ 43.470299] OOM killer disabled.
>> May 6 16:00:04 feliu-vm kernel: [ 43.470301] Freezing remaining
>> freezable tasks
>> May 6 16:00:04 feliu-vm kernel: [ 43.471482] Freezing remaining
>> freezable tasks completed (elapsed 0.001 seconds)
>> May 6 16:00:04 feliu-vm kernel: [ 43.471495] printk: Suspending
>> console(s) (use no_console_suspend to debug)
>> May 6 16:00:04 feliu-vm kernel: [ 43.474034] virtio_net virtio0:
>> godeng virtio device freeze
>> May 6 16:00:04 feliu-vm kernel: [ 43.475714] virtio_net virtio0 ens3:
>> godfeng virtnet_freeze done
>> May 6 16:00:04 feliu-vm kernel: [ 43.475717] virtio_net virtio0:
>> godfeng VIRTIO_F_ADMIN_VQ not enabled
>> May 6 16:00:04 feliu-vm kernel: [ 43.475719] virtio_net virtio0:
>> godeng virtio device freeze done
>> ........
>> May 6 16:00:04 feliu-vm kernel: [ 43.535382] smpboot: CPU 1 is now
>> offline
>> May 6 16:00:04 feliu-vm kernel: [ 43.537283] IRQ fixup: irq 1 move in
>> progress, old vector 32
>> May 6 16:00:04 feliu-vm kernel: [ 43.538504] smpboot: CPU 2 is now
>> offline
>> May 6 16:00:04 feliu-vm kernel: [ 43.541392] smpboot: CPU 3 is now
>> offline
>>
>> ......
>>
>> May 6 16:00:04 feliu-vm kernel: [ 54.973285] smpboot: Booting Node 0
>> Processor 15 APIC 0xf
>> May 6 16:00:04 feliu-vm kernel: [ 54.975190] CPU15 is up
>> May 6 16:00:04 feliu-vm kernel: [ 54.976011] ACPI: PM: Waking up from
>> system sleep state S3
>> May 6 16:00:04 feliu-vm kernel: [ 54.986071] virtio_net virtio0:
>> godeng virtio device restore
>> May 6 16:00:04 feliu-vm kernel: [ 54.987563] virtio_net virtio0 ens3:
>> godfeng virtnet_restore done
>> May 6 16:00:04 feliu-vm kernel: [ 54.987635] virtio_net virtio0:
>> godfeng: virtio device restore done
>> ......
>> May 6 16:00:04 feliu-vm kernel: [ 55.307221] ata8: SATA link down
>> (SStatus 0 SControl 300)
>> May 6 16:00:04 feliu-vm kernel: [ 55.442048] OOM killer enabled.
>> May 6 16:00:04 feliu-vm kernel: [ 55.442051] Restarting tasks ... done.
>> May 6 16:00:04 feliu-vm kernel: [ 55.443576] random: crng reseeded on
>> system resumption
>> May 6 16:00:04 feliu-vm kernel: [ 55.443582] PM: suspend exit
>>
>> ----------------------------------------------------------------------------
>>
>> The full kernel log is attached. I think it may be a configuration
>> error.
>>
>>
>> Thanks
>> Feng
>>
>>
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: sd 0:0:1:0: [sda] Synchronizing SCSI cache
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: Some devices failed to suspend, or early wake event detected
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer enabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Restarting tasks ... done.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: random: crng reseeded on system resumption
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend exit
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (s2idle)
>> > -- Boot 61828bc938b44fc68a8aeedc16a23a9d --
>> > May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp
>> > (buildd@lcy02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu
>> > 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42)
>> > #7-Ubuntu SMP Sat Apr 20 00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
>> > May 08 11:09:03 localhost kernel: Command line:
>> > BOOT_IMAGE=/vmlinuz-6.8.0-1007-gcp
>> > root=PARTUUID=7a949935-6bf2-4cae-b404-803c95163572 ro
>> > console=ttyS0,115200 panic=-1
>> >
>> > 2. The features the devices have:
>> >
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio0/features
>> > 0110000000000000000000000000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio1/features
>> > 1110010110011001110000100000010000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio2/features
>> > 1110000000000000000000000000000000000000000000000000000000000000
>> > catred@kernel-test-202405080702:~$ cat
>> > /sys/bus/virtio/devices/virtio3/features
>> > 0000000000000000000000000000000000000000000000000000000000000000
>> >
>> > Catherine
>> >
>> > On Tue, May 7, 2024 at 11:34 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
>> >
>> > On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
>> > <joseph.salisbury@xxxxxxxxxxxxx> wrote:
>> > >
>> > > Hi Feng,
>> > >
>> > > During testing, a kernel bug was identified with the suspend/resume
>> > > functionality on instances running in a public cloud [0]. This
>> > bug is a
>> > > regression introduced in v6.8-rc1. After a kernel bisect, the
>> > following
>> > > commit was identified as the cause of the regression:
>> > >
>> > > fd27ef6b44be ("virtio-pci: Introduce admin virtqueue")
>> >
>> > From a quick glance at the patch, it seems it should not break
>> > freeze/restore, as it should behave as it did in the past.
>> >
>> > But I found something interesting:
>> >
>> > 1) it assumes 1 admin vq, which is not what the spec says
>> > 2) there is a special function for the admin virtqueue during
>> > freeze/restore, but it doesn't do anything more than del_vq()
>> > 3) it lacks real users, but I guess e.g. destroy_avq() needs to be
>> > synchronized with whoever is using the admin virtqueue
>> >
>> > >
>> > > I was hoping to get your feedback, since you are the patch author. Do
>> > > you think gathering any additional data will help diagnose this
>> > issue?
>> >
>> > Yes, please show us
>> >
>> > 1) the kernel log here.
>> > 2) the features that the device has like
>> > /sys/bus/virtio/devices/virtio0/features
>> >
>> > > This commit is depended upon by other virtio commits, so a revert
>> > test
>> > > is not really straightforward without reverting all the
>> > dependencies.
>> > > Any ideas you have would be greatly appreciated.
>> >
>> > Thanks
>> >
>> > >
>> > >
>> > > Thanks,
>> > >
>> > > Joe
>> > >
>> > > [0] http://pad.lv/2063315
>> > >
>> >