Re: [RFC v2 1/3] PM: Add a sysfs files to represent sleep duration

From: Limonciello, Mario
Date: Tue Nov 15 2022 - 09:13:22 EST


On 11/15/2022 04:32, Hans de Goede wrote:
Hi Mario,

On 11/14/22 20:12, Limonciello, Mario wrote:
[Public]

Thanks! Appreciate the comments.
At least conceptually is there agreement to this idea for the two sysfs files
and userspace can use them to do this comparison?

First of all let me say that I think that having some generic mechanism
which allows userspace to check if deep enough sleep-state were reached
is a good idea. And thank you for working on this!


Sure!

I wonder though if it would not be better to have some mechanism
where a list of sleep states + time spend in each time is printed ?

E.g. I know that on Intel Bay Trail and Cherry Trail devices (just an
example I'm familiar with) there are S0i0 - S0i3 and we really want
to reach S0i3 during suspend.

Sometimes on S0i1 or S0i2 is reached due to some part of the hw
not getting suspended properly.

So then we have reached "a hardware sleep state over s2idle"
but no the one we want.

At least the way it's built right now it's tracking the s0ix counter for Intel and the s0i3 counter for AMD.

BTW - when I did all the cleanups suggested in RFC v2 I notice I was taking the raw number for Intel, and I have that fixed for the next version.

I don't know if other counters exist for Intel for various hardware states.
On the current AMD silicon this is the interesting metric.


OTOH I can image that if we start adding support for functionality
like standby-connect under Linux that then we may not always
reach the deepest hw sleep-state.

Can you elaborate what you mean by standby connect? WoWLAN?
At least on the current AMD platforms WoWLAN can happen while the silicon is in the deepest hardware sleep state.


So I'm a bit worried that having just a single number for
last_hw_state_residency is not enough.

I think that it might be better to have a mechanism to set
a set of names for hw-states (once) and then set the residency
per state (*) after resume and have the sysfs file print
the entire list. >
This list could then also always include the total suspend time,
also avoiding the need for a second sysfs file and we could also
use the same format for non s2idle suspend having it print
only the total suspend time when no hw-state names are set.

So is your thought is to have a single sysfs file something like /sys/power/suspend_stats/s2idle_stats that would show this?

state \t % \t duration (us)
s0i3 \t 99.5% \t 1000

For AMD that would be a single line and I don't think it's worth the extra code. I would like to know if it actually makes sense for Intel though.

We also need to think about what will be actionable with this information by consumers of it because I'm certain it will be leading to bug reports.

Let's think about a hypothetical bug report:
"Intel System only spent 20% of time in deepest hardware state".
They attach to the bug report s2idle_stats that looks like this:

state \t % \t duration (us)
s0i2 \t 80.0% \t 1000000
s0i3 \t 20.0% \t 100000

Is that any more actionable than
/sys/power/last_hw_state_residency showing 100000
and
/sys/power/suspend_total showing 500000

I think in either case the next action is more debugging will be needed, such as turning on dynamic debug or some module parameters.

"Practically" I expect software like systemd or powerd to be reading these sysfs files.


Regards,

Hans


*) Using an array, so up to MAX_HW_RESIDENCY_STATES



A few nested replies below, but I'll clean it up for
RFC v3 or submit as PATCH v1 if there is conceptual alignment before then.

On Thu, Nov 10 2022 at 00:47, Mario Limonciello wrote:

'Add a sysfs files'?

Can you please decide whether that's 'a file' or 'multiple files'?

Yup thanks; bad find and replace in the commit message when I added
the second file.


Both AMD and Intel SoCs have a concept of reporting whether the
hardware
reached a hardware sleep state over s2idle as well as how much
time was spent in such a state.

Nice, but ...

This information is valuable to both chip designers and system designers
as it helps to identify when there are problems with power consumption
over an s2idle cycle.

To make the information discoverable, create a new sysfs file and a symbol
that drivers from supported manufacturers can use to advertise this
information. This file will only be exported when the system supports low
power idle in the ACPI table.

In order to effectively use this information you will ideally want to
compare against the total duration of sleep, so export a second sysfs file
that will show total time. This file will be exported on all systems and
used both for s2idle and s3.

The above is incomprehensible word salad. Can you come up with some
coherent explanation of what you are trying to achieve please?

+void pm_set_hw_state_residency(u64 duration)
+{
+ suspend_stats.last_hw_state_residency = duration;
+}
+EXPORT_SYMBOL_GPL(pm_set_hw_state_residency);
+
+void pm_account_suspend_type(const struct timespec64 *t)
+{
+ suspend_stats.last_suspend_total += (s64)t->tv_sec *
USEC_PER_SEC +
+ t->tv_nsec /
NSEC_PER_USEC;

Conversion functions for timespecs to scalar nanoseconds exist for a
reason. Why does this need special treatment and open code it?

Will fixup to use conversion functions.


+}
+EXPORT_SYMBOL_GPL(pm_account_suspend_type);

So none of these functions has any kind of documentation. kernel-doc
exists for a reason especially for exported functions.

That said, what's the justification to export any of these functions at
all? AFAICT pm_account_suspend_type() is only used by builtin code...

I think you're right; they shouldn't export; will fix.


+static umode_t suspend_attr_is_visible(struct kobject *kobj, struct
attribute *attr, int idx)
+{
+ if (attr != &last_hw_state_residency.attr)
+ return 0444;
+#ifdef CONFIG_ACPI
+ if (acpi_gbl_FADT.flags & ACPI_FADT_LOW_POWER_S0)
+ return 0444;
+#endif
+ return 0;
+}
+
static const struct attribute_group suspend_attr_group = {
.name = "suspend_stats",
.attrs = suspend_attrs,
+ .is_visible = suspend_attr_is_visible,

How is this change related to the changelog above? We are not hiding
subtle changes to the existing code in some conglomorate patch. See
Documentation/process/...

It was from feedback from RFC v1 from David Box that this file should only
be visible when s2idle is supported on the hardware. Will adjust commit
message to make it clearer.


--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -24,6 +24,7 @@
#include <linux/compiler.h>
#include <linux/audit.h>
#include <linux/random.h>
+#include <linux/suspend.h>

#include "tick-internal.h"
#include "ntp_internal.h"
@@ -1698,6 +1699,7 @@ static void
__timekeeping_inject_sleeptime(struct timekeeper *tk,
tk_set_wall_to_mono(tk, timespec64_sub(tk->wall_to_monotonic,
*delta));
tk_update_sleep_time(tk, timespec64_to_ktime(*delta));
tk_debug_account_sleep_time(delta);
+ pm_account_suspend_type(delta);

That function name is really self explaining - NOT !

pm_account_suspend_type(delta);

So this will account a suspend type depending on the time spent in
suspend, right?

It's totally obvious that the suspend type (whatever it is) depends on
the time delta argument... especially when the function at hand has
absolutely nothing to do with a type:


I fat fingered this. In my mind I thought I wrote pm_account_suspend_time()
Will fix.

+void pm_account_suspend_type(const struct timespec64 *t)
+{
+ suspend_stats.last_suspend_total += (s64)t->tv_sec *
USEC_PER_SEC +
+ t->tv_nsec /
NSEC_PER_USEC;
+}

Sigh....

Thanks,

tglx