[RFC] PM: suspend: Upstreaming wakeup reason capture support

From: Kelly Rossmoyer
Date: Mon Jan 10 2022 - 13:49:56 EST


# Introduction

To aid optimization, troubleshooting, and attribution of battery life, the
Android kernel currently includes a set of patches which provide enhanced
visibility into kernel suspend/resume/abort behaviors. The capabilities
and implementation of this feature have evolved significantly since an
unsuccessful attempt to upstream the original code
(https://lkml.org/lkml/2014/3/10/716), and we would like to (re)start a
conversation about upstreaming, starting with the central question: is
there support for upstreaming this set of features?

# Motivation

Of the many factors influencing battery life on Linux-powered mobile
devices, kernel suspend tends to be amongst the most impactful. Maximizing
time spent in suspend and minimizing the frequency of net-negative suspend
cycles are both important contributors to battery life optimization. But
enabling that optimization - and troubleshooting when things go wrong -
requires more observability of suspend/resume/abort behavior than Linux
currently provides. While mechanisms like `/sys/power/pm_wakeup_irq` and
wakeup_source stats are useful, they are incomplete and scattered. The
Android kernel wakeup reason patches implement significant improvements in
that area.
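
For context, the existing mainline data can be gathered from user space roughly as follows. This is a minimal sketch, not part of the Android patches: it assumes a kernel that exposes `/sys/power/pm_wakeup_irq` (which needs CONFIG_PM_SLEEP_DEBUG) and per-source statistics under `/sys/class/wakeup/`. The function names `read_wakeup_irq` and `read_wakeup_sources` are hypothetical helpers chosen for illustration.

```python
from pathlib import Path


def read_wakeup_irq(power_dir="/sys/power"):
    """Return the IRQ number recorded in pm_wakeup_irq, or None if unavailable."""
    try:
        text = Path(power_dir, "pm_wakeup_irq").read_text().strip()
    except OSError:
        # Reading fails with ENODATA when no wakeup IRQ has been recorded,
        # and the file is absent on kernels built without the attribute.
        return None
    return int(text) if text else None


def read_wakeup_sources(class_dir="/sys/class/wakeup"):
    """Collect a few per-source counters from the wakeup class directory."""
    fields = ("name", "active_count", "event_count", "wakeup_count")
    root = Path(class_dir)
    if not root.is_dir():
        return []
    sources = []
    for entry in sorted(root.iterdir()):
        record = {}
        for field in fields:
            try:
                record[field] = (entry / field).read_text().strip()
            except OSError:
                # Attribute missing or unreadable: keep the key, mark unknown.
                record[field] = None
        sources.append(record)
    return sources
```

Calling `read_wakeup_irq()` yields at most one IRQ number, and `read_wakeup_sources()` yields counters with no link to a specific suspend cycle. That illustrates the "incomplete and scattered" point: correlating the two is left to the caller.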

# Features

As of today, the active set of patches surfaces the following
suspend-related data:

* wakeup IRQs, including:
  * multiple IRQs if more than one is pending during resume flow
  * unmapped HW IRQs (wakeup-capable in HW) that should not be
    occurring
  * misconfigured IRQs (e.g. both enable_irq_wake() and
    IRQF_NO_SUSPEND)
  * threaded IRQs (not just the parent chip's IRQ)

* non-IRQ wakeups, including:
  * wakeups caused by an IRQ that was consumed by lower-level SW
  * wakeups from SOC architecture that don't manifest as IRQs

* abort reasons, including:
  * wakeup_source activity
  * failure to freeze userspace
  * failure to suspend devices
  * failed syscore_suspend callback

* durations from the most recent cycle, including:
  * time spent doing suspend/resume work
  * time spent in suspend

In addition to battery life optimization and troubleshooting, some of these
capabilities also lay the groundwork for efforts around improving
attribution of wakeups/aborts (e.g. to specific processes, device features,
external devices, etc.).
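
To illustrate why the two per-cycle durations matter, the following sketch classifies a suspend cycle as net-negative when it costs more energy than staying awake and idle for the same wall time. It is a toy model under stated assumptions, not the patches' logic: the power figures (900 mW during transitions, 150 mW idle, 10 mW suspended) and the function names are hypothetical.

```python
def cycle_energy_delta_uj(transition_ms, suspended_ms,
                          transition_mw, idle_mw, suspend_mw):
    """Energy saved (positive) or lost (negative) by one suspend cycle.

    Inputs are milliseconds and milliwatts, so the result is in microjoules.
    The baseline is staying awake and idle for the same total wall time.
    """
    awake_cost = idle_mw * (transition_ms + suspended_ms)
    cycle_cost = transition_mw * transition_ms + suspend_mw * suspended_ms
    return awake_cost - cycle_cost


def is_net_negative(transition_ms, suspended_ms,
                    transition_mw=900, idle_mw=150, suspend_mw=10):
    """True if the cycle used more energy than idling would have.

    The default power figures are made-up illustrative values.
    """
    delta = cycle_energy_delta_uj(transition_ms, suspended_ms,
                                  transition_mw, idle_mw, suspend_mw)
    return delta < 0
```

With these assumed figures, a cycle that spends 400 ms on suspend/resume work but only 1 s suspended is net-negative, while the same transition cost followed by 60 s of suspend is a clear win.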

# Shortcomings

While the core implementation (see below) is relatively straightforward and
localized, calls into that core are somewhat widely spread in order to
capture the breadth of events of interest. The pervasiveness of those
hooks is clearly an area where improvement would be beneficial, especially
if a cleaner solution preserved equivalent capabilities.

# Existing Code

As a reference for how Android currently implements the core code for these
features (which would need a bit of work before submission even if all
features were included), see the following link:

https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/kernel/power/wakeup_reason.c


--

Kelly Rossmoyer | Software Engineer | krossmo@xxxxxxxxxx