Suspend and hibernation status report

From: Rafael J. Wysocki
Date: Fri Jul 27 2007 - 04:48:52 EST


Hi,

Below is a document describing the current state of development of the suspend
and hibernation infrastructure: how it works, what known problems there are in
it and what the future development plans are (at least as far as I am
concerned).

[It's almost exactly one yaer after I released the previous swsusp status
report and that's mostly because in the Summer I have more time to write such
things. Thus, probably, the next report will be released next Summer, but
since the present one is quite long, the next one is going to be incremental.
;-)]

As usual, comments, suggestions, opinions etc are welcome.

Greetings,
Rafael


---
Hibernation and Suspend Status Report

I. Introduction

One year ago I wrote a report documenting the status of development of swsusp
(ie. software suspend, or hibernation, subsystem) that can be found at
http://lkml.org/lkml/2006/7/25/105 . Although I thought I would be able to
release an updated version of the report within 3-4 months, this turned out to
be very difficult due to several substantial changes made to swsusp since then,
causing it to be a moving target from the documentation-writing perspective.
Moreover, in the meantime I started to work on the core suspend code used,
among other things, for transitioning the system into the ACPI S3 sleep state,
known as the suspend to RAM, which currently has some things in common with
swsusp. For this reason, I thought it would be a good idea to document these
two subsystems together, but that increased the number of things to cover and
added to the delay. Finally, however, I have had some time to complete the
present document.

In analogy with the previous report, this document is intended as an
introductory presentation of the current (ie. as in the 2.6.23-rc1 kernel)
design of the suspend (ie. suspend-to-RAM and standby) and hibernation code,
the status of it, known problems with it and the future development plans.
Thus, I will first explain how this code works and identify all of the distinct
parts of it. Next, I will describe each of these parts in more detail and
discuss the known problems related to them. Finally, I will outline the
possible directions of future development related to suspend and hibernation.

II. Terminology

Before I start to talk about technical details, some terms that will be used
throughout of the rest of this document need to be defined. They are the
following:
* system working state - any state, in which the system's processors can carry
out useful computations
* system sleep state - state, in which no useful work can be done by the
system's processors, but its main memory is powered and, consequently, the
contents of memory are preserved, so that the computations carried out when
the system was last in a working state can be continued after transitioning
the system back to the working state
* system hibernation state - state, in which the system's processors are off and
its main memory is not powered, but the information necessary for continuing
the computations carried out when the system was last in a working state is
preserved in a storage space, such as a disk
* ACPI S4 state - system hibernation state, in which some information is
preserved by the ACPI platform, in accordance with the ACPI specification
* system suspend - operation, in which the system leaves a working state and
enters a sleep state
* system resume - operation, in which the system leaves a sleep state and enters
a working state
* system hibernation - operation, in which the system leaves a working state and
enters a hibernation state
* system restore - operation, in which the system leaves a hibernation state and
enters a working state
* device full power state - state of a device, in which it is fully operational
and draws maximum power
* device low power state - state of a device, in which it draws less power than
in the full power state and may not be fully operational
* device quiescent state - state of a device, in which it does not generate
interrupts and/or it will not take part in any DMA transfers
* device off state - state of a device, in which it draws minimal power and is
not regarded as operational
* device suspend - operation, in which the device is put into a low power state
compatible with the system sleep state that is going to be entered
* device wake up - operation, in which the device is put into the full power
state or to a low power state compatible with the system working state that is
going to be entered

III. System suspend outline

System suspend support is included in the kernel if CONFIG_PM is set in the
.config . Then, there is the file /sys/power/state, by reading which one can
check what suspend states are available on given system.

At present, two different suspend states can generally be supported, "standby"
and "mem", but some platforms support only one of them and many platforms do not
support any sleep states at all. If both are supported, "standby" is the state
in which the system draws more power, but can be switched to a working state
faster than from the "mem" sleep state.

A transition to a system sleep state can be started by writing the name of a
system sleep state supported by the platform ("mem" or "standby") to
/sys/power/state (there is another method to do that, with the help of the
hibernation userland interface, but it should only be used as a part of the
suspend-to-both functionality described later). If that happens, the kernel
performs the following actions:

(1) power management notifiers are executed with PM_SUSPEND_PREPARE
(2) tasks are frozen
(3) target system sleep state is announced to the platform-handling code
(4) devices are suspended
(5) platform-specific global suspend preparation methods are executed
(6) non-boot CPUs are taken off-line
(7) interrupts are disabled on the remaining (main) CPU
(8) late suspend of devices is carried out
(9) platform-specific global methods are invoked to put the system to sleep

Of course all of this happens if there are no errors in the way. However, for
example, if one of the devices refuses to suspend, we need to wake up all of the
devices that have already been suspended, inform the platform that the
transition to the low power state will not occur, enable the non-boot CPUs and
thaw tasks. Finally, we have to execute the power management notifiers to
inform their owners that the transition has been canceled.

A resume starts when the platform notices a wake-up event, such as the opening
of a laptop's lid or pressing the power button. Then, the platform prepares
itself and the main processor for entering a system working state and returns
the control to the kernel. Next, the following actions are performed:

(10) the main CPU is switched to the appropriate mode, if necessary
(11) early resume of devices is carried out
(12) interrupts are enabled on the main CPU
(13) non-boot CPUs are enabled
(14) platform-specific global resume preparation methods are invoked
(15) devices are woken up
(16) tasks are thawed
(17) power management notifiers are executed with PM_POST_SUSPEND

For each of steps (1)-(17) above there is a separate part of the suspend code
responsible for its completion.

IV. System hibernation outline

System hibernation support is included in the kernel if CONFIG_SOFTWARE_SUSPEND
is set in the .config . Then, the hibernation state called "disk" is listed
in the /sys/power/state file.

Currently there are two possible ways of carrying out a system hibernation. The
first of them is entirely kernel-driven and the second one requires a userland
task that will drive the hibernation procedure calling the kernel to perform
specific, more or less atomic, actions. Only the first method is covered in
this part of the report, because it is generally simpler and the actions of the
kernel are pretty much the same in both cases. The other method will be
described later.

The kernel-driven hibernation procedure is started by writing "disk" to
/sys/power/state. Then, the kernel performs the following actions:

(1) power management notifiers are executed with PM_HIBERNATION_PREPARE
(2) tasks are frozen
(3) some memory is released, if necessary
(4) (optional, on ACPI systems) target system sleep state (S4) is announced to
the platform-handling code
(5) devices are suspended for hibernation
(6) (optional) platform-specific global hibernation preparation methods are
invoked
(7) non-boot CPUs are taken off-line
(8) interrupts are disabled on the main CPU
(9) late suspend of devices for hibernation is carried out
(10) atomic copy of the system memory (aka hibernation image) is created
(11) early resume of devices is carried out
(12) interrupts are enabled on the main CPU
(13) non-boot CPUs are enabled
(14) (optional, but necessary if (6) is performed) platform-specific global
hibernation-related methods are invoked
(15) devices are woken up
(16) hibernation image is saved in a storage space
(17) devices are put into the off state
(18) the system is powered off _or_ (optionally, on ACPI systems)
platform-specific global methods are invoked to put the system into the S4
sleep state

In analogy with the system suspend described in Section III, if any of the
operations listed above fails, the operations that have already been performed
need to be reverted, so that the system can flawlessly continue operating in the
working state. In particular, the power management notifiers need to be called
to inform their owners that the system state transition has been canceled.

System restore is started by booting the kernel with the "resume=<partition>"
command line parameter, where <partition> is the one the hibernation image has
been written to in step (16). This partition may be a swap partition or a
partition containing the swap file with the hibernation image, in which case the
additional kernel command line parameter "resume_offset=<offset>" is needed,
where <offset> points to the location of the swap file's header (see
Documentation/power/swsusp-and-swap-files.txt in the kernel tree for details).

The kernel booted with the "resume=<partition>" and (optionally)
"resume_offset=<offset>" command line parameters, often referred to as the boot
kernel, is responsible for loading the hibernation image into memory and passing
control to the kernel contained in the hibernation image, that from now on will
be referred to as the target kernel. The following operations are performed by
it:

(19) hibernation image is loaded into RAM
(20) tasks are frozen
(21) devices are suspended for jumping to the target kernel
(22) (optional, but necessary if (6) was done during the hibernation)
platform-specific global restore preparation functions are executed
(23) non-boot CPUs are taken off-line
(24) interrupts are disabled on the remaining CPU
(25) late suspend of devices (for jumping to the target kernel) is carried out
(26) control is passed to the target kernel

If any of steps (19)-(23) fails, the boot kernel continues running as in the
case of a normal non-restore boot. Otherwise, the target kernel gets the
control and the following operations are performed by it:

(27) early resume of devices is carried out
(28) interrupts are enabled on the main CPU
(29) non-boot CPUs are enabled
(30) (optional, but necessary if (6) is performed) finish of the system state
transition is announced to the platform
(31) devices are woken up
(32) tasks are thawed
(33) power management notifiers are executed with PM_POST_HIBERNATION

Again, for each of steps (1)-(33) there is a part of the hibernation code
responsible for completing it and some of these parts are shared with the
suspend code outlined in Section III.

V. Power management notifiers

This is a new feature, introduced very recently in order to allow subsystems
that need to know if a system state transition is going to happen to register
notifiers called right before and right after any such transition. The
parameter passed to the notifiers determines if the transition in question is a
suspend or a hibernation.

This mechanism is described in detail in Documentation/power/notifiers.txt .
At present, it is only used to disable user mode helpers before the freezing of
tasks.

VI. Freezing and thawing tasks

Steps (2) and (16) of the suspend-resume cycle described in Section III as well
as steps (2) and (32) of the hibernation-restore cycle outlined in Section IV
are done by a special code called the freezer. Generally speaking, it requests
tasks to "park" themselves in a safe place, called "the refrigerator", in which
they do not hold any locks, to not start any new I/O operations, do not allocate
memory and do not do anything else that might destructively interfere with the
suspend or hibernation procedure. Userland processes are made enter the
refrigerator by the kernel's signal-handling code, but kernel threads should
enter the refrigerator voluntarily, by calling the function try_to_freeze(),
where it is appropriate from their point of view. Moreover, kernel threads that
want to receive freeze requests from the freezer have to explicitly mark
themselves as freezable and they are responsible for entering the refrigerator
relatively quickly after receiving a freeze request. The freezable kernel
threads are only asked to enter the refrigerator after userland processes have
been frozen and sys_sync() is called before sending any freeze requests to
kernel threads. A frozen task is only allowed to exit the refrigerator at the
freezer's request. Detailed description of this mechanism is available in
Documentation/power/freezing-of-tasks.txt .

The freezing of tasks generally works, although there are some known problems
with it. First of all, uninterruptible tasks cannot be frozen, so if there are
any such tasks in the system, except for the tasks waiting for vfork()
completions handled in a special way, it is impossible to suspend or hibernate
it. This is a strong limitation stemming from the fact that uninterruptible
tasks can hold locks that might be necessary for suspending devices later during
the suspend or hibernation procedure. Unfortunately, it also leads to problems
in the situations, in which one userland task may wait in the
TASK_UNINTERRUPTIBLE state for another userland task. Namely, in such cases the
task that is being waited for may be frozen before the task that waits for it
and the freezing of tasks will fail as a result.

Another known issue related to the freezer is that some system calls, such as
sys_poll(), may be interrupted by fake signals sent by it to userland tasks.

VII. Freeing memory

Step (3) of the hibernation procedure outlined in Section IV is completed by
calling the same functions that are normally used by kswapd, but in a slightly
different way. The part of code responsible for that is referred to as the
memory shrinker (it may sometimes be called by the suspend code as well, so it
can be treated as a shared piece of code). It generally works well, but it
seems to be inefficient if there are lots of slab objects to free.

VIII. Platform support

On ACPI systems there are parts of the platform that should only be accessed
by the kernel through the execution of so-called ACPI control methods encoded
in the AML language. These control methods are executed with the help of the
AML interpreter included in the kernel's ACPI subsystem.

Since the platform is responsible for registering and acting upon events
supposed to wake up the system being in a sleep state, as well as for passing
control back to the kernel after such an event, it requires special handling
during every suspend. Also, during a resume the platform has to be put into a
state that is compatible with the system working state being entered.

The handling of an ACPI platform related to suspend and resume is done on two
levels. First, some global ACPI control methods need to be executed, which is
done in steps (5), (9) and (14) of the procedure outlined in Section III, with
the help of the information passed to the platform-handling code in step (3).
Second, some device-specific ACPI control methods are executed while devices are
being suspended. The ordering of execution of different ACPI control methods
involved in suspend and resume operations is strictly defined by the ACPI
specification and it currently is reflected by the ordering of the kernel's
suspend and resume code.

As far as system hibernation is concerned, in principle the platform support
is optional. However, some ACPI platforms do not work correctly after a restore
if the appropriate ACPI control methods are not executed during transitions to
and from the hibernation state. For this reason, the platform support in the
hibernation code is enabled by default, but the users can request that it be
disabled by writing "shutdown" to the /sys/power/disk control file before
the hibernation. By reading this file one can see if the platform support will
be used during subsequent hibernations (the active setting is shown inside the
square braces and "[platform]" means that the platform hibernation support is
enabled).

During a hibernation-restore cycle global ACPI control methods are executed in
steps (6), (14), (18) and (30) listed in Section IV. Additionally, the
platform-handling code is informed of the target system sleep state (ACPI S4) in
step (4) and the ACPI general purpose events (GPEs) are disabled in step (22)
(if the restore fails, they are enabled during the subsequent clean-up
procedure). The restore code in the boot kernel uses the platform support
routines if special flag in the image header is set by the hibernation code.
Still, the current hibernation and restore code does not exactly follow the ACPI
specification. Namely, the specification requires that the ACPI subsystem be
not enabled during a restore until the image is loaded into memory and the
control is passed to the target kernel, but in our current implementation the
ACPI subsystem is already enabled in the boot kernel before loading the image.

Apart from this, in step (14) of the hibernation procedure we inform the
platform that the system will not enter the sleep state, which is not what is
going to happen. We do that in order to be able to resume devices needed for
saving the image and in step (18) the platform is prepared for entering the S4
sleep state from the start.

IX. Handling of devices

Steps (4), (8), (11), and (15) of the suspend-resume cycle outlined in Section
III, as well as steps (5), (9), (11), (15), (21), (25), (27), and (31) of the
hibernation-restore cycle described in Section IV are completed in a large part
by device drivers. Namely, each device driver supporting the suspend and/or
resume of devices handled by it is required to define the .suspend() and
.resume() callbacks and register them with the driver model, as described in
Documentation/power/devices.txt . These callbacks are used by the power
management core to suspend the driver's devices in step (4) of the
suspend-resume cycle and in steps (5) and (21) of the hibernation-restore cycle.

At present, the same callbacks are used for both suspend and hibernation. In
the case of a suspend they are called with the second parameter equal to
PMSG_SUSPEND, whereas for a hibernation the second parameter passed to each of
them is equal to PMSG_FREEZE. Moreover, the drivers' .suspend() callbacks are
also executed in step (21) of the hibernation-restore cycle, in order to prepare
devices for passing control to the target kernel, in which case the second
parameter passed to them is equal to PMSG_PRETHAW. Thus, theoretically, the
drivers can use the second parameter of their .suspend() callbacks to
distinguish between suspend, hibernation and restore operations, although only a
few drivers actually do that.

Similarly, the same .resume() callbacks are used for waking up devices in step
(11) of the suspend-resume cycle, as well as in steps (15) and (31) of the
hibernation-restore cycle. Since these callbacks take only one parameter,
being a pointer to the device object associated with given device, the drivers
have no means to distinguish between different reasons for which the devices may
be woken up and they need to perform basically the same actions in each of these
cases.

In order to suspend devices the power management core walks the dpm_active list
in the reverse order. This list is set up during the kernel initialization and
devices are put on it in the order in which they are registered with the driver
model. Thus, the devices that have been registered last, are suspended first
and so on, which guarantees that basic dependencies between devices will not be
violated (ie. parent devices are always suspended after the devices that depend
on them). For each device the core checks if:
* the device's class has defined a .suspend() callback, in which case this
callback is executed,
* the device's type has defined a .suspend() callback, in which case this
callback is executed,
* the device's bus type has defined a .suspend() callback, in which case this
callback is executed.
All of the .suspend() callbacks defined by device classes, types and bus types
are always executed as long as none of them returns an error. This means that,
for example, if a device class has defined the .suspend() callback and a bus
type has done that too, then both of these callbacks will be executed for each
device belonging to this class and associated with this bus type and it is up to
the class, bus type and driver code to cope with that correctly. If any of the
.suspend() callbacks listed above returns an error, the suspending of devices is
immediately terminated and the devices that have already been suspended are
woken up. The .suspend() callbacks defined by device drivers are executed by
the device class, device type and bus type .suspend() callbacks.

The suspended devices are moved from the dpm_active list to the dpm_off list in
the order in which they have been suspended (note that a device may be regarded
as suspended even if no .suspend() callbacks have been executed for it, for
instance, when there are no such callbacks defined for it). This list is used
by the power management core for waking up devices. Namely, for each device on
it the power management core checks if:
* the device's bus type has defined a .resume() callback, in which case this
callback is executed,
* the device's type has defined a .resume() callback, in which case this
callback is executed,
* the device's class has defined a .resume() callback, in which case this
callback is executed.
Again, all bus type, device type and device class .resume() callbacks that have
been defined are always executed for each device that they fit to. Moreover,
any errors returned by them are discarded. All devices for which they have been
executed are unconditionally moved from the dpm_off to the dpm_active list, in
such a way that the original ordering of the dpm_active list is eventually
restored.

Apart from "ordinary" devices, the suspending and resuming of which is described
above, there are special devices that need some handling in steps (8) and (11)
of the suspend-resume cycle and in steps (9), (11), (25), and (27) of the
hibernation-restore cycle. There are two kinds of such devices:
* devices the bus types and drivers of which define .suspend_late() and/or
.resume_early() callbacks,
* system devices (aka sysdevs)

The devices handled with the help of .suspend_late() callbacks are moved from
the dpm_off list to the dpm_off_irq list, which is used later to check if the
.resume_early() callbacks have been defined for them and to execute these
callbacks if that is the case. All devices on the dpm_off_irq list are moved
from there back to the dpm_off list before the "ordinary" waking up of devices
described above. It should be noted that the right ordering of devices is
always preserved by all of these operations. Moreover, the .suspend() and
.resume() callbacks may be defined for a device for which .suspend_late() and
.resume_early() are also defined and all of these callbacks will always be
executed in the right order.

System devices are handled in a special way, independent of the above general
framework. Specifically, system device classes and drivers can define
.suspend() and .resume() callbacks that are used to handle their devices.
However, these callbacks are only executed when one CPU is on-line and with
interrupts disabled by it. Thus, if any of such devices needs to be handled
with interrupts enabled too, it is necessary to create a separate device object
for it that will be treated in the ordinary way. For this reason, from the
power management point of view, system devices are rather inflexible and the use
of them is no longer recommended. The existing ones are expected to be
gradually phased out or replaced with device objects corresponding to the
"platform" bus type.

The main problem with the current approach to the handling of devices is that
the same callbacks are used for both suspend and hibernation, which leads to
confusion and introduces unnecessary limitations. For example, it generally is
not necessary, and may even be harmful, to put devices into low power states
before step (10) of the hibernation procedure. In fact, it should be sufficient
to put devices into quiescent states in step (5) of it and to put them back into
the full power state (or into the low power states in which they were before the
hibernation procedure has been started) in step (15). Then, the execution of
platform-specific functions in steps (6) and (14) should not be necessary and
the entire hibernation procedure might be simplified. It also is generally
unnecessary to put devices into low power states in step (21), during a restore.
Moreover, the boot kernel need not handle the same set of devices as the target
kernel, which means that the callbacks used by the target kernel to "wake up"
devices must be prepared to deal with the situation in which their devices have
not been initialized or, worse yet, have been initialized by the platform
firmware in an inappropriate way. Generally, they need not be in the same
states in which they were left in step (5). Yet, this obviously is not the case
during a resume, since the states of devices generally need not change between
steps (4) and (15) of the suspend-resume cycle. Thus, by requesting that all of
the .resume() callbacks need to be able to deal with uninitialized devices, we
impose an unnecessary limitation on the suspend code, which should be avoided.

The next major limitation is related to the handling of removable storage
devices. Namely, if some filesystems are mounted out of removable devices, such
as USB storage devices or memory cards, before a suspend or hibernation, they
will not be accessible after the corresponding resume or restore and the users
may lose data as a result of this. The problem is that for removable, or rather
"hotpluggable", devices the suspend operation usually causes the device to
disconnect, as though it were physically disconnected from the system. There is
the kernel configuration parameter CONFIG_USB_PERSIST which allows one to work
around this behavior, but it generally is dangerous and the use of it is not
recommended, unless the user knows exactly what she is doing.

The third major problem with the handling of devices is related to graphics
adapters that often are not touched by the platform after it has registered a
wake-up event and before it passes control back to the kernel during a resume.
Usually, the kernel also does not know how to bring the graphics adapter back to
the pre-suspend state and that may lead to various undesirable effects, from the
image corruption up to and including a crash of the resuming system, depending
on the type of the graphics card, platform firmware and its version and other
similar factors. A workaround of this that seems to work in the majority of
cases is to use a userland tool able to put the graphics adapter into the right
state after a resume, given some simple instructions how to do it, such as s2ram
(http://en.opensuse.org/s2ram).

At present, the majority of reported and tracked bugs related to suspend and
hibernation are associated with the platform support, described in Section VIII,
and with the handling of devices. Unfortunately, these bugs are usually
reproducible only on a limited number of machines and hard to debug.

X. Handling of non-boot CPUs

Steps (6) and (13) of the suspend-resume cycle, as well as steps (7), (13),
(23), and (29) of the hibernation-restore cycle are completed with the help of
the CPU hotplug infrastructure, which basically is external with respect to the
suspend and hibernation code. There were some problems with this mechanism in
the past, but currently it is generally reported to work, even on 4-way
machines.

XI. Snapshotting memory and restoring its state

The snapshotting of memory, step (10) of the hibernation procedure, is completed
by making a copy of each memory page that needs to be saved. For this reason,
the hibernation code needs as much as 50% of free RAM to create the image. This
is a serious limitation, as it generally affects the system responsiveness after
a restore and sometimes requires quite a lot of memory to be freed in step (3).
Still, usually there are many saveable pages in the system that will not be
accessed when userland processes are frozen, and in principle these pages could
be included in the hibernation image without copying. Unfortunately, however,
no efficient method of identifying them pages has been proposed yet. If you
have any ideas and/or hints, please help.

The code that restores the memory state from the hibernation image in steps (19)
and (26) of the hibernation-restore cycle is able to handle images much greater
than 50% of RAM. It practically is only limited by the amount of memory
occupied by the boot kernel and its data structures. Thus, it would be possible
to use hibernation images as big as 80% or even 90% of RAM if the "snapshotting"
code could create them.

Apart from the above limitation, there are no any known problems with this part
of the hibernation code. Also, it uses data structures that are completely
independent of the rest of the kernel's memory management subsystem and are
allocated on demand, during the hibernation and restore.

XII. Saving and loading the hibernation image

The hibernation image is saved in a swap partition or in a swap file in step
(16) and loaded from it in step (19) of the hibernation-restore cycle, with the
help of standard block I/O callbacks and/or functions designed for accessing
swap devices and/or swap files. This code has not been changed for a long time.

There are almost no problems with this part of the hibernation code.
Practically, there have not been any bugs found in it for the last year. Yet,
it is quite limited, since it does not support image compression that may
substantially increase the speed of saving and loading the image. It also is
only capable of using swap space (ie. swap partitions or swap files) for saving
hibernation images and only one swap partition or swap file can be used at a
time.

XIII. Userland hibernation interface

Some users of the hibernation subsystem want it to be able to perform certain
transformations of the hibernation image, such as encryption and/or compression,
before saving it. Moreover, some of them would like the hibernation and restore
code to use splash screens and display graphical progress meters. Still, the
idea of implementing all these things in the kernel space is questionable, so it
has been made possible to export the hibernation image out of the kernel, in
order for some userland tools to be able to carry out the desired operations and
save the image afterwards. This is the basic role of the userland hibernation
interface, which also allows userland processes to drive the entire hibernation
and restore procedure.

The userland hibernation interface has been implemented as a special software
character device with appropriate file operations and some special ioctls. It
is described quite thoroughly in Documentation/power/userland-swsusp.txt, so
please refer to this document for details. A reference implementation of the
userland tools that use this interface is available at http://suspend.sf.net .
At present, this method of driving the hibernation and restore procedures is
used by default in OpenSUSE and is optionally available for the users of some
other major distributions.

One of the features provided by the userland hibernation interface is the
possibility to create and save a hibernation image and suspend to RAM right
after that. Then, the system can be resumed with the help of the platform,
if there is still enough battery power, or the state of it can be restored on
the basis of the hibernation image. This often is referred to as the
suspend-to-both capability. To make it possible, the hibernation userland
interface includes a special ioctl allowing one to make the system enter the
"mem" sleep state if some additional conditions are met. However, it is
strongly recommended to use this ioctl only as a part of the suspend-to-both
functionality.

XIV. Debugging

Problems related to suspend and hibernation are usually difficult to debug,
since most often they are only reproducible on a limited number of systems and
it generally is difficult to obtain any diagnostic information from a system
after or during a failing resume or restore. Nevertheless, there are some
facilities that can be used to debug suspend and hibernation issues.

First, some standard debugging techniques that can be used in such cases are
described in Documentation/power/basic-pm-debugging.txt and
Documentation/power/drivers-testing.txt . There also is the suspend-resume
events tracing functionality, available when CONFIG_PM_TRACE is set in .config
(in addition to CONFIG_PM being set), described in
Documentation/power/s2ram.txt .

Recently, we have added a feature allowing the user to make the kernel beep
in the early phase of resume, right after it has received control from the
platform, which may help confirm that the control is really passed from the
platform to the kernel. This feature can by activated by executing the
following command:

# r=`cat /proc/sys/kernel/acpi_video_flags` && r=`expr $r + 4` && \
> echo $r > /proc/sys/kernel/acpi_video_flags

XV. Reporting bugs and problems

If you find a bug in the suspend/hibernation code or have a problem related to
it, please report it, preferably to linux-pm@xxxxxxxxxxxxxxxxxxxxxxxxxx . You
can also use the kernel bugzilla (http://bugzilla.kernel.org/) for this purpose,
in which case please file the report with the "Hibernation/Suspend" component
of "Power Management" and add the e-mail address rjwysocki@xxxxxxx to its Cc
list.

The list of bugs related to suspend and hibernation being tracked at the moment
can be found at http://bugzilla.kernel.org/show_bug.cgi?id=7216 .

XVI. Future development plans

As you have certainly realized, there are some known problems and limitations
related to the suspend and hibernation code, so I do not consider these
subsystems as finished work. Therefore I intend to work on improving them or
even redesigning them to a reasonable extent, if that is desirable. However,
in my opinion that should be done in an organized way, so that we do not
introduce regressions and do not end up with a solution worse than the current
one.

In my opinion, the part of the suspend and hibernation code that should be
taken care of first is the handling of devices. Namely, I think that we should
first separate the hibernation-related handling of devices from the
suspend-related handling of them in order to overcome limitations mentioned in
Section IX. This also will be necessary if we want to try some new approaches
to hibernation, such as the kexec-based one recently discussed on the LKML.
For this reason, I think that it will be necessary to introduce some
hibernation-related callbacks to be used in steps (5), (9), (11), (15), (21),
(25), (27), and (31) of the hibernation-restore cycle instead of the existing
.suspend(), .resume(), .suspend_late() and .resume_early() callbacks which
should only be used during suspend and resume. We have discussed this issue
for a couple of times on the linux-pm list (linux-pm@xxxxxxxxxxxxxxxxxxxxxxxxxx)
and it generally seems to be known how the hibernation-specific callbacks should
work.

The next thing that seems reasonable to do is to eliminate the freezing of
tasks, described in Section VI, from the suspend and resume code, since the
limitations related to it are regarded by many people as too restrictive.
Still, for this purpose we will need to make device drivers be able to block
userland tasks on I/O after their .suspend() callbacks have been executed.
Currently, there are only a few drivers which can do that and there are drivers
which openly assume the userland tasks to be frozen in the initial phase of
suspend. Thus, quite a lot of work needs to be done on the drivers before we
can drop the freezing of tasks from the suspend code path.

When drivers are able to block userland tasks on I/O after executing their
.suspend() callbacks, or analogous hibernation-specific callbacks (to be
introduced), we may also be able to eliminate the freezing of tasks from the
hibernation code path or leave only a much simplified and less intrusive form
of it. In theory, that can be achieved by using a kexec-based hibernation
framework, but I think that there also are other possibilities worthy of
considering. Apart from this, I think that we have not yet explored all
possibilities to improve the current framework, including the freezing of tasks,
so as long as the freezer is in use, I am going to improve it and fix reported
problems related to it.

There also is the alternative hibernation framework TuxOnIce maintained by Nigel
Cunningham, which is more feature-rich than the current in-kernel hibernation
code. It therefore seems reasonable to incorporate at least some of the more
advanced TuxOnIce features into the in-kernel code. I believe that by combining
TuxOnIce with the current in-kernel hibernation implementation we can obtain a
relatively simple, but powerful and solid hibernation framework, so I am going
to work in this direction, after the separation of the suspend-specific and
hibernation-specific device handling is done at the core and device class/device
type/bus type level.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/