Hello,
Luca is reporting that cgroups which have kvm instances inside never
complete freezing. This can be trivially reproduced:
root@test ~# mkdir /sys/fs/cgroup/test
root@test ~# echo $fish_pid > /sys/fs/cgroup/test/cgroup.procs
root@test ~# qemu-system-x86_64 --nographic -enable-kvm
and in another terminal:
root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
root@test ~# cat /sys/fs/cgroup/test/cgroup.events
populated 1
frozen 0
root@test ~# for i in (cat /sys/fs/cgroup/test/cgroup.threads); echo $i; cat /proc/$i/stack; end
2070
[<0>] do_freezer_trap+0x42/0x70
[<0>] get_signal+0x4da/0x870
[<0>] arch_do_signal_or_restart+0x1a/0x1c0
[<0>] syscall_exit_to_user_mode+0x73/0x120
[<0>] do_syscall_64+0x87/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2159
[<0>] do_freezer_trap+0x42/0x70
[<0>] get_signal+0x4da/0x870
[<0>] arch_do_signal_or_restart+0x1a/0x1c0
[<0>] syscall_exit_to_user_mode+0x73/0x120
[<0>] do_syscall_64+0x87/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2160
[<0>] do_freezer_trap+0x42/0x70
[<0>] get_signal+0x4da/0x870
[<0>] arch_do_signal_or_restart+0x1a/0x1c0
[<0>] syscall_exit_to_user_mode+0x73/0x120
[<0>] do_syscall_64+0x87/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
2161
[<0>] kvm_nx_huge_page_recovery_worker+0xea/0x680
[<0>] kvm_vm_worker_thread+0x8f/0x2b0
[<0>] kthread+0xe8/0x110
[<0>] ret_from_fork+0x33/0x40
[<0>] ret_from_fork_asm+0x1a/0x30
2164
[<0>] do_freezer_trap+0x42/0x70
[<0>] get_signal+0x4da/0x870
[<0>] arch_do_signal_or_restart+0x1a/0x1c0
[<0>] syscall_exit_to_user_mode+0x73/0x120
[<0>] do_syscall_64+0x87/0x140
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
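For a cgroup with many threads, the same check can be scripted. The following is only a rough sketch: the helper names (top_frame, not_trapped, read_stacks) are made up for illustration, and it assumes that a thread whose topmost frame is do_freezer_trap has been trapped by the freezer, as the dump above suggests. Reading /proc/<tid>/stack requires root.

```python
from pathlib import Path


def top_frame(stack_text):
    """Return the symbol of the first frame in /proc/<tid>/stack text, or None."""
    for line in stack_text.splitlines():
        parts = line.split()
        # Frame lines look like: [<0>] do_freezer_trap+0x42/0x70
        if len(parts) >= 2 and parts[0].startswith("[<"):
            return parts[1].split("+", 1)[0]
    return None


def not_trapped(stacks):
    """Given {tid: stack_text}, return the sorted tids not sitting in do_freezer_trap."""
    return sorted(
        tid for tid, text in stacks.items()
        if top_frame(text) != "do_freezer_trap"
    )


def read_stacks(cgroup_dir):
    """Collect {tid: stack_text} for every thread listed in cgroup.threads (needs root)."""
    stacks = {}
    for tid in Path(cgroup_dir, "cgroup.threads").read_text().split():
        try:
            stacks[int(tid)] = Path("/proc", tid, "stack").read_text()
        except OSError:
            # The thread may have exited between the two reads.
            continue
    return stacks
```

Running not_trapped(read_stacks("/sys/fs/cgroup/test")) on the setup above would be expected to single out the kvm worker thread.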
Cgroup freezing happens in the signal delivery path, but
kvm_vm_worker_thread() threads never call into the signal delivery path even
though they join non-root cgroups, so they never get frozen. Because the cgroup
freezer determines whether a given cgroup is frozen by comparing the number
of frozen threads to the total number of threads in the cgroup, the cgroup
never becomes frozen and users waiting for the state transition may hang
indefinitely.
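Until this is fixed, a userspace waiter can at least avoid hanging by bounding its wait. The following is a minimal sketch, not a recommended interface: parse_cgroup_events and wait_frozen are hypothetical helper names, and the only thing taken from the report above is the "key value" layout of cgroup.events.

```python
import time


def parse_cgroup_events(text):
    """Parse cgroup.events content ("populated 1\\nfrozen 0") into a dict of ints."""
    events = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if key and value.strip().isdigit():
            events[key] = int(value)
    return events


def wait_frozen(read_events, timeout=5.0, interval=0.1):
    """Poll read_events() until it reports "frozen 1" or the timeout expires.

    read_events is any callable returning the current cgroup.events text.
    Returns True if the cgroup became frozen, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if parse_cgroup_events(read_events()).get("frozen") == 1:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the reproduction above, wait_frozen(pathlib.Path("/sys/fs/cgroup/test/cgroup.events").read_text) would be expected to return False after the timeout instead of blocking forever. Polling is used here only to keep the sketch short.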
There are two paths that we can take:
1. Make kvm_vm_worker_thread() call into the signal delivery path.
io_wq_worker() is in a similar boat: it handles signal delivery and can
be frozen and trapped like regular threads.