[RFC 0/3] Improve memory statistics for virtio balloon

From: zhenwei pi
Date: Mon Apr 15 2024 - 04:41:36 EST


Hi,

When the guest runs under critial memory pressure, the guest becomss
too slow, even sshd turns D state(uninterruptible) on memory
allocation. We can't login this VM to do any work on trouble shooting.

Guest kernel log via virtual TTY(on host side) only provides a few
necessary log after OOM. More detail memory statistics are required,
then we can know explicit memory events and estimate the pressure.

I'm going to introduce several VM counters for virtio balloon:
- oom-kill
- alloc-stall
- scan-async
- scan-direct
- reclaim-async
- reclaim-direct

Then we have a metric to analyze the memory performance:
[also describe this metric in patch
'virtio_balloon: introduce memory scan/reclaim info']

y: counter increases
n: counter does not changes
h: the rate of counter change is high
l: the rate of counter change is low

OOM: VIRTIO_BALLOON_S_OOM_KILL
STALL: VIRTIO_BALLOON_S_ALLOC_STALL
ASCAN: VIRTIO_BALLOON_S_SCAN_ASYNC
DSCAN: VIRTIO_BALLOON_S_SCAN_DIRECT
ARCLM: VIRTIO_BALLOON_S_RECLAIM_ASYNC
DRCLM: VIRTIO_BALLOON_S_RECLAIM_DIRECT

- OOM[y], STALL[*], ASCAN[*], DSCAN[*], ARCLM[*], DRCLM[*]:
the guest runs under really critial memory pressure

- OOM[n], STALL[h], ASCAN[*], DSCAN[l], ARCLM[*], DRCLM[l]:
the memory allocation stalls due to cgroup, not the global memory
pressure.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[h]:
the memory allocation stalls due to global memory pressure. The
performance gets hurt a lot. A high ratio between DRCLM/DSCAN shows
quite effective memory reclaiming.

- OOM[n], STALL[h], ASCAN[*], DSCAN[h], ARCLM[*], DRCLM[l]:
the memory allocation stalls due to global memory pressure.
the ratio between DRCLM/DSCAN gets low, the guest OS is thrashing
heavily, the serious case leads poor performance and difficult
trouble shooting. Ex, sshd may block on memory allocation when
accepting new connections, a user can't login a VM by ssh command.

- OOM[n], STALL[n], ASCAN[h], DSCAN[n], ARCLM[l], DRCLM[n]:
the low ratio between ARCLM/ASCAN shows that the guest tries to
reclaim more memory, but it can't. Once more memory is required in
future, it will struggle to reclaim memory.

zhenwei pi (3):
virtio_balloon: introduce oom-kill invocations
virtio_balloon: introduce memory allocation stall counter
virtio_balloon: introduce memory scan/reclaim info

drivers/virtio/virtio_balloon.c | 30 ++++++++++++++++++++++++++++-
include/uapi/linux/virtio_balloon.h | 16 +++++++++++++--
2 files changed, 43 insertions(+), 3 deletions(-)

--
2.34.1