swsusp status report

From: Rafael J. Wysocki
Date: Tue Jul 25 2006 - 07:23:47 EST


Hi,

The following document describes the current state of the development of
swsusp: how it works, what known problems there are in it, which of them
are being worked on and where some help is needed.

If there are no disasters, an updated report will be released in 3-4 months.

If you have any questions, comments or suggestions, please let me know.

Greetings,
Rafael


--
swsusp Status Report

I. Introduction

As you probably know, swsusp is the part of the kernel that deals with
suspend to disk. In other words, it is what gets compiled if you set
CONFIG_SOFTWARE_SUSPEND=y in .config. However, the name 'software suspend'
does not mean it is independent of devices and, most importantly, device
drivers. Moreover, swsusp is not an entirely autonomous subsystem, as it
shares some code with the other parts of the kernel.

This document is intended as an introductory presentation of the swsusp
design, the current (i.e. as in the 2.6.18-rc2 kernel) state of the code,
the known problems with it etc. For this reason I will first explain
how swsusp works and identify all of the distinct parts of it. Then I will
describe each of these parts in detail and discuss the problems related to
it.

II. Outline

Currently there are two possible ways of carrying out a suspend to disk.
The first of them is entirely kernel-driven and the second one requires
a userland task that drives the suspend procedure by calling the kernel
to perform specific, more or less atomic, actions. In this document
I will only cover the first method, because it is generally simpler
and the actions of the kernel are pretty much the same in both cases.

The kernel-driven method of suspending to disk is initiated by writing
'disk' to /sys/power/state. Then, the kernel performs the following
actions:

(1) non-boot CPUs are taken off-line
(2) tasks are frozen
(3) some memory is released, if necessary
(4) devices are frozen
(5) atomic copy of the memory (aka suspend image) is created
(6) devices are woken up
(7) the suspend image is written to a swap partition
(8) the system is powered off

Of course, all of this happens only if no errors occur along the way. If, for
example, one of the devices refuses to freeze, we need to wake up all of the
devices that have already been frozen, thaw processes, and enable non-boot
CPUs.
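
The error handling described above can be sketched as a toy model. This is
not kernel code: run_suspend(), StepFailed and the step names are
hypothetical, and only the first four steps are modelled. It merely shows
that a failure in one step causes the steps already completed to be undone
in reverse order.

```python
# Toy model of the suspend sequence with rollback. All names are
# hypothetical; the real logic lives in the kernel.

class StepFailed(Exception):
    """Raised by a step that cannot complete (e.g. a device refuses to freeze)."""


def run_suspend(steps):
    """Run (name, do, undo) steps in order and return a log of what happened.

    If a step raises StepFailed, the steps completed so far are undone in
    reverse order. Steps with undo set to None have nothing to undo.
    """
    log = []
    undo_stack = []
    for name, do, undo in steps:
        try:
            do()
        except StepFailed:
            log.append("FAILED: " + name)
            while undo_stack:
                undo_name, undo_fn = undo_stack.pop()
                undo_fn()
                log.append("undo: " + undo_name)
            return log
        log.append("done: " + name)
        if undo is not None:
            undo_stack.append((name, undo))
    return log


def noop():
    pass


def refuse():
    raise StepFailed("device refused to freeze")


def example_steps(freeze_devices=noop):
    # Waking up the devices that were frozen before the failure is assumed
    # to happen inside the failing step itself.
    return [
        ("take non-boot CPUs off-line", noop, noop),
        ("freeze tasks", noop, noop),
        ("release memory", noop, None),
        ("freeze devices", freeze_devices, noop),
    ]
```

Calling run_suspend(example_steps(freeze_devices=refuse)) returns a log in
which tasks are thawed first and non-boot CPUs are enabled last.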

The kernel-driven resume procedure may be started by booting the kernel with
the 'resume=<swap_partition>' command line parameter, where <swap_partition>
is the one the suspend image has been written to in step (7). Then, the
following actions are performed:

(9) the suspend image is read into RAM
(10) devices are prepared to resume
(11) system memory state is restored from the suspend image
(12) devices are woken up
(13) tasks are thawed
(14) non-boot CPUs are enabled

Almost every one of the steps (1)-(14) above is carried out by a separate part
of swsusp.
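
When debugging a failed resume it is often useful to check which
'resume=' parameter the kernel was actually booted with. The helper below
is a minimal sketch that extracts it from a command line string such as the
contents of /proc/cmdline. The function name is hypothetical, and the sketch
assumes that the last occurrence of the parameter is the one that counts.

```python
def resume_device(cmdline):
    """Return the value of the 'resume=' parameter, or None if it is absent.

    Assumption: if the parameter is given more than once, the last
    occurrence wins.
    """
    device = None
    for token in cmdline.split():
        if token.startswith("resume="):
            device = token[len("resume="):]
    return device
```

For example, resume_device(open("/proc/cmdline").read()) can be used on a
running system, and the result compared with the swap partition that was
active during suspend.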

III. Handling of non-boot CPUs

Steps (1) and (14) above are completed with the help of the CPU hotplug
infrastructure which basically is external with respect to swsusp. There were
some problems with this mechanism in the past, but currently it is generally
reported to work, even on 4-way machines. Of course, it has not been tested
very much yet, as the number of SMP notebooks is quite limited, but this
is going to change shortly. Anyway, if you have a problem with swsusp
that only appears for SMP kernels, please report it.

IV. Freezing and thawing tasks

Steps (2) and (13) are done by the code which is shared with the
suspend-to-RAM infrastructure on the majority of architectures that support
it (ppc is the only exception known to me) and which is called 'the freezer'.
It 'freezes' tasks by sending them fake 'freeze' signals, in reaction to which
they should enter a special function called 'the refrigerator', where they do
nothing in the TASK_UNINTERRUPTIBLE state, waiting for the freezer to let them
run again. Userland processes are made to enter the refrigerator by the
kernel's signal-handling code, but kernel threads should enter the
refrigerator voluntarily, by calling the function try_to_freeze() where
it is appropriate. Moreover, kernel threads are only 'asked' to enter the
refrigerator after all of the userland processes have been frozen, and sync()
is called before freezing any kernel threads. A 'frozen' task is allowed
to return from the refrigerator when the freezer resets the PF_FROZEN flag for
it.
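
The ordering described above (userland first, then sync(), then kernel
threads) can be illustrated with a toy model. This is not kernel code: the
Task class and the freeze_tasks() and thaw_tasks() functions are
hypothetical, the 'frozen' attribute merely stands in for PF_FROZEN, and the
model gives up immediately on a task that cannot be frozen instead of
waiting for it.

```python
# Toy model of the freezer; names are hypothetical.

class Task:
    def __init__(self, name, kernel_thread=False, uninterruptible=False):
        self.name = name
        self.kernel_thread = kernel_thread
        self.uninterruptible = uninterruptible
        self.frozen = False  # stands in for the PF_FROZEN flag

    def try_to_freeze(self):
        """Enter the 'refrigerator' if possible; return True on success."""
        if self.uninterruptible:
            return False  # the task does not react to the freeze request
        self.frozen = True
        return True


def thaw_tasks(tasks):
    """Reset the frozen flag, which lets tasks leave the refrigerator."""
    for task in tasks:
        task.frozen = False


def freeze_tasks(tasks, sync):
    """Freeze userland tasks, call sync(), then freeze kernel threads.

    Returns True on success. On failure every task frozen so far is
    thawed again and False is returned.
    """
    user = [t for t in tasks if not t.kernel_thread]
    kthreads = [t for t in tasks if t.kernel_thread]
    for group, after in ((user, sync), (kthreads, None)):
        for task in group:
            if not task.try_to_freeze():
                thaw_tasks(tasks)
                return False
        if after is not None:
            after()
    return True
```

In this model a single uninterruptible task is enough to make freeze_tasks()
fail, which is the limitation discussed in the next paragraph.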

It follows from the above description that tasks which are already
uninterruptible cannot be frozen, because they do not react to the fake
signals. Consequently, it is impossible to suspend, either to disk or to RAM,
if there are any uninterruptible tasks in the system (for this reason the
freezer has to wait for all of the pending vfork completions to finish).

This mechanism generally works, although there are some known problems with
it. First, there is an issue related to cifsd, which refuses to freeze if it
has lost the connection to the server before suspend (e.g. the network cable
has been disconnected). This is currently being worked on (please refer to
http://bugzilla.kernel.org/show_bug.cgi?id=6811 for details). There also is a
problem with the freezing of traced tasks that are waiting on breakpoints, but
this seems to have been nailed down already (please see
http://bugzilla.kernel.org/show_bug.cgi?id=6787). Finally, it is reported
that calling sync() after userland processes have been frozen is not enough
to prevent some filesystems from writing data afterwards (apparently XFS does
this). This currently is a pending issue.

If you know of any other problems with the freezer, please report them.

V. Freeing memory

Step (3) of the suspend procedure is completed by calling the same
functions that are normally used by kswapd, but in a slightly different way.
The part of swsusp responsible for that is referred to as 'the memory
shrinker' and it may sometimes be called by the suspend-to-RAM code as well,
so it should be treated as a shared piece of code. It generally works (there
have been some bug fixes regarding it merged after 2.6.17), but it seems to be
inefficient if there are lots of slab objects to free. Currently I do not know
how to fix this, so if you have any ideas, please help.

VI. Handling of devices

Steps (4), (6), (10), and (12) of the suspend-resume cycle are completed
in a large part by device drivers. Thus as far as fixing problems related to
these steps is concerned, we have to rely on driver authors and maintainers.

Unfortunately the vast majority of reported problems with swsusp are related
to the freezing and/or waking up of devices. Problems of this type
are also quite difficult to debug and fix, particularly because they are
often almost impossible to take care of without access to the hardware on
which they appear. Worse yet, sometimes they only appear in specific hardware
configurations. Therefore, if you report a problem related to the freezing or
waking up of devices, or preparing them to resume, please always make sure that
the report goes to the appropriate driver maintainer and/or author.

The 'core' code responsible for the completion of steps (4), (6), (10),
and (12) is going to change. Namely, step (10), i.e. the preparation of
devices for resume, is currently done in the same way as step (4), the
freezing of devices. However, David Brownell noticed that in fact this was
not exactly correct, because it introduced an additional operation between
the freezing of devices in step (4) (before suspend) and waking them up
in step (12) (after resume). For this reason he proposed to treat step (10)
in a different way and submitted patches that implement his idea. These
patches are currently in the -mm tree. Fortunately, the new approach will not
require any changes to the vast majority of drivers, because they do not need
to differentiate step (4) from step (10) anyway. Still, there appear to be
some drivers that do need this and they will have to be modified in accordance
with the new core code.

The code that performs steps (4), (6), (10), and (12) of the suspend-resume
cycle is generally shared between swsusp and the suspend-to-RAM
infrastructure, but the suspend-to-RAM calls to the device drivers' suspend
routines are made with a different parameter value (PMSG_SUSPEND instead of
PMSG_FREEZE). Also, since step (10) is not necessary for the suspend to RAM,
there generally are some swsusp-specific pieces of code in the device
drivers.

I must admit that there are many suspend-related or resume-related problems
with device drivers. On almost every box I have recently tested, I have
had such problems with at least one device driver. Still, we cannot do
very much about it without the help of the drivers' authors, unless the
drivers in question are very simple.

There also is one major limitation related to the code that freezes devices.
Namely, if some filesystems are mounted from removable devices before
suspend, they will not be accessible after resume and the users may lose
data. The problem is that for removable, or rather 'hotpluggable', devices
the 'freezing' or 'suspending' operation causes the device to disconnect,
as though it were physically disconnected from the system. This currently is
a pending issue.

VII. Snapshotting memory and restoring its state

The snapshotting of memory, step (5), is completed by making a copy of each
memory page that needs to be saved. For this purpose swsusp uses the
kernel identity mapping, which is a limitation on i386, because high
memory cannot be accessed this way. Moreover, on i386 swsusp either
releases the highmem pages or copies their contents to the normal zone before
creating the suspend image. This is inefficient, because for one saved
highmem page swsusp needs two pages in the normal zone, but I am going
to change this shortly.

Since each saveable page has to be copied, swsusp needs as much as 50% of
RAM, or of the normal zone on i386, to be free in order to create the image.
This also is a limitation, as it generally affects the system responsiveness
after resume and sometimes requires swsusp to free quite a lot of memory in
step (3).
Still, there are many saveable pages in the system that will not be accessed
when processes are frozen, and in principle these pages could be included in
the suspend image without copying. Unfortunately, however, I do not know
how to identify these pages in a reliable and fast way, so if you have any
ideas and/or hints, please help.
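
The 50% figure follows from simple arithmetic, which the sketch below spells
out. It is a simplified model, not the kernel's accounting: every page is
assumed to be either saveable or free, releasing a saveable page is assumed
to turn it into a free one, and the pages needed for image metadata are
ignored. The function name is hypothetical.

```python
def pages_to_free(total_pages, saveable_pages):
    """Minimum number of saveable pages that must be released so that every
    remaining saveable page can be copied into a free page.

    With x pages released there are (saveable_pages - x) pages to copy and
    (total_pages - saveable_pages + x) free pages to copy them into, so the
    image can be at most total_pages // 2 pages big.
    """
    if saveable_pages > total_pages:
        raise ValueError("more saveable pages than pages in total")
    limit = total_pages // 2
    return max(0, saveable_pages - limit)
```

For instance, with 1000 pages of which 700 are saveable, 200 pages have to
be released in step (3) before the snapshot can be taken.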

The code that restores the memory state from the suspend image in step
(11) also uses the kernel identity mapping to address memory, so it cannot
access highmem pages on i386, but it practically has no other limitations as
far as the image size is concerned. In other words, it would be possible to
restore suspend images as big as 80% or even 90% of RAM, or the normal zone
on i386, if the 'snapshotting' code were able to create them.

The code that performs steps (5) and (11) of the suspend-resume cycle is
quite robust and there is only one known problem with it, which seems to
be x86_64-specific. Namely, on x86_64 machines with more than 2 GB of RAM
there are memory gaps and/or reserved memory areas between the 2nd and 3rd
Gbyte of physical memory and swsusp tries to save these areas as though
they were RAM, which leads to oopses. This issue is now being worked on.
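
To illustrate the nature of the problem, the sketch below filters page frame
numbers against a memory map so that gaps and reserved areas are never
treated as saveable. It is only an illustration under assumed data
structures (a list of (start_pfn, end_pfn, kind) tuples with an exclusive
end) and hypothetical function names, and it is not the fix that is being
worked on.

```python
# Hypothetical memory map entries: (start_pfn, end_pfn, kind), end exclusive.

def is_ram(pfn, memory_map):
    """Return True if the page frame lies inside a range marked as RAM."""
    return any(
        start <= pfn < end and kind == "ram"
        for start, end, kind in memory_map
    )


def saveable_pfns(memory_map):
    """Yield only page frame numbers that belong to RAM ranges.

    Gaps between ranges and ranges of any other kind are skipped.
    """
    for start, end, kind in memory_map:
        if kind == "ram":
            yield from range(start, end)
```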

VIII. Saving and loading the suspend image

The suspend image is saved to a swap partition in step (7) of the
suspend-resume cycle and loaded from it in step (9) with the help of standard
block IO callbacks and/or functions designed for accessing swap devices and/or
swap files. This code has not changed for a long time, but recently Andrew
Morton has made it use asynchronous IO (the patches are waiting in the -mm
tree now).

There are almost no problems with this part of swsusp. Only a couple of
minor bugs have been found in it, and fixed, in the last 6 months. Yet,
it has one major limitation which is that it can only use swap partitions
for saving the suspend image and only one swap partition can be used at a
time. To overcome this limitation I am considering the addition of support
for swap files to this part of swsusp.

IX. Userland interface

Some users of the suspend-to-disk subsystem want it to be able to perform
certain transformations of the suspend image, like encryption and/or
compression, before it is saved. Moreover, some of them would like the
suspend and resume code to use splash screens and display graphical
progress meters. Still, the idea of implementing these operations in
kernel space is questionable, so it has been made possible to export the
suspend image out of the kernel. This is the basic role of the swsusp
userland interface, which also allows a userland process to drive the
entire suspend and resume procedure.

The swsusp userland interface has been implemented as a special software
character device with appropriate file operations and some special ioctls.
It is described quite thoroughly in Documentation/power/userland-swsusp.txt,
so please refer to this document for details. A reference implementation
of the userland tools that use this interface is available from
http://suspend.sf.net.

X. Reporting bugs and problems

If you find a bug in swsusp or have a problem related to it, please report
it, preferably on LKML with a Cc to suspend-devel@xxxxxxxxxxxxxxxx You can
also use the kernel bugzilla in which case please add my e-mail address,
rjwysocki@xxxxxxx, to the Cc list of your bug report.