APM Suspend/resume problems: analysis and proposal

Richard Gooch (rgooch@atnf.csiro.au)
Wed, 25 Nov 1998 16:23:42 +1100


Hi, all. I've been investigating the suspend/resume problems I've
had with a Dell Inspiron 3200. After a day or so of poking and
prodding, I've exposed what are either some broken assumptions in the
Linux APM code or a broken Dell APM BIOS. Even if it's mainly the
BIOS's fault, there is still a problem with clock drift. In any case,
I've come up with a solution that is robust, and which I'd like to
pass by others who are more intimate with the APM code than I.

If people are happy with my proposed solutions, I'll put together a
shiny new patch and post it to the list.

History:
I first tried the system suspend feature on the Inspiron 3200
(i.e. close the lid and then open it) with a non-APM kernel. The
system state was faithfully saved, except for:

1) system time was behind: not updated properly

2) XFree86 showed a blank screen. ctrl-alt-plus and ctrl-alt-minus
fixed that.

Problem (2) I'm ignoring for now. The "obvious" solution to (1) was to
build an APM kernel, which I did (2.1.129). When I used system
suspend/resume with an APM kernel, the system state was saved OK, but
when I resumed, I got a *second* suspend event a few seconds later! To
recover from that, I needed to press the power button. This was one
problem.

Another problem was that the system sometimes lost a few seconds to 15
seconds after a resume. I also noticed that it sometimes took 15 to 20
seconds after I closed the lid before the system actually suspended.

Analysis:
I eventually traced the source of the second suspend event to the APM
BIOS, presumably some brokenness. While the kernel has the
CONFIG_APM_IGNORE_MULTIPLE_SUSPEND option, this only helps for
sequences thus:
suspend, suspend, suspend, resume.

What I'm seeing is a sequence like:
suspend, resume, suspend, resume.

I noticed that the time between the first resume and the second
suspend is up to and including 2 jiffies. The fix I implemented for
this is to ignore a suspend event less than HZ jiffies (i.e. 1 second)
after a resume event.
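
A minimal sketch of that check, written as stand-alone C rather than
as the actual patch. The struct and function names are hypothetical,
HZ is assumed to be 100, and the tick count is passed in as a plain
unsigned long instead of reading the kernel's jiffies:

```c
#include <stdbool.h>

/* Assumed tick rate (100 Hz, as on i386 at the time). */
#define HZ 100UL

/* Hypothetical state: tick count recorded at the most recent resume. */
struct suspend_filter {
    unsigned long last_resume;  /* tick count at the last resume event */
    bool resumed_once;          /* false until the first resume is seen */
};

/* Record that a resume event arrived at tick `now`. */
void filter_note_resume(struct suspend_filter *f, unsigned long now)
{
    f->last_resume = now;
    f->resumed_once = true;
}

/* Return true if a suspend event arriving at tick `now` should be ignored. */
bool filter_ignore_suspend(const struct suspend_filter *f, unsigned long now)
{
    if (!f->resumed_once)
        return false;
    /* Unsigned subtraction stays correct if the tick counter wraps. */
    return (now - f->last_resume) < HZ;
}
```

With a resume noted at tick 1000, a suspend at tick 1002 is dropped
and one at tick 1100 is honoured.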

****PROPOSAL 1: add this to the kernel, available as a config option.

The other problem of lost time turned out to be quite subtle. I
eventually tracked it down to an excessively long time taken for
get_cmos_time() in arch/i386/kernel/time.c. In fact, sometimes the
first iteration loop (the one that does 1000000 iterations waiting for
the UIP to rise) times out! I've seen this thing take 1400 jiffies
(that's 14 seconds, assuming that jiffies are still at 100 Hz). It was
only ever supposed to take 1 second! This incredibly slow reading of
the RTC seems to result in inaccurate time values being returned.

The cause of the slow reading of the RTC appears to be a slowing down
of the CPU (and possibly the bus) after the lid is closed, *prior* to
apm_set_power_state(APM_STATE_SUSPEND) being called. I added
benchmarking code to measure this effect, and found that mdelay(100)
was taking 303 ms (using TSC-based timing) or 18 jiffies.

Note: the RTC needs to be read before activating the suspend in order
to determine the offset between the RTC and the system clock. If
everyone used GMT in their RTC, there would be no problem, but because
of windoze some people are better off with localtime in their RTC. The
observant will note there is another current thread on linux-kernel
which notes that the kernel has no knowledge of the timezone.

Seeing these effects made me realise that the current algorithm for
restoring the time is flawed. Even if one discounts the effects I've
measured on the Inspiron (I can well accept that the advance slowing
down of the system is a Dell-ism), there is still a problem. When
reading the RTC, get_cmos_time() returns "immediately" after the RTC
ticks over to a new second. However, suspend() in
arch/i386/kernel/apm.c then subtracts CURRENT_TIME from the returned
value. The problem here is that CURRENT_TIME only has a resolution of
1 second, and hence the clock_cmos_diff value which is used to
"calibrate" the system time versus the RTC may be up to 1 second in
error. So each time the system goes through a suspend/resume cycle we
can lose up to 1 second. This effect is cumulative. Not good.
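
To make the truncation concrete, here is a small stand-alone model of
a single suspend/resume cycle. It is a sketch under assumptions, not
the code in apm.c: the function names are hypothetical, time is kept
in microseconds, the suspend is taken to have zero length, and the
restore step (system time = RTC minus the calibrated difference, with
the sub-second part set to zero) is my assumption about how the
difference gets applied:

```c
#include <assert.h>

#define USEC_PER_SEC 1000000LL

/* Whole-second view of a microsecond timestamp (stand-in for CURRENT_TIME). */
long long whole_seconds(long long usec)
{
    assert(usec >= 0);
    return usec / USEC_PER_SEC;
}

/*
 * Time lost, in microseconds, over one cycle of this model.
 *   rtc_sec:  RTC value at the instant it ticks over to a new second
 *   sys_usec: true system time at that same instant
 */
long long cycle_loss_usec(long long rtc_sec, long long sys_usec)
{
    /* Calibration as described: RTC value minus whole-second system time. */
    long long diff_sec = rtc_sec - whole_seconds(sys_usec);

    /* Assumed restore: the RTC reads rtc_sec again (zero-length suspend)
       and the system clock is set to a whole number of seconds. */
    long long restored_usec = (rtc_sec - diff_sec) * USEC_PER_SEC;

    return sys_usec - restored_usec;
}
```

In this model the loss is exactly the sub-second part of the system
time at the moment the RTC ticks over: anywhere from 0 to 999999
microseconds, whatever the whole-second offset between the two clocks.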

So my quick fix was to abolish the calibration of the clock_cmos_diff
variable and fix it to 0. Since I run GMT in my RTC, I never have
to worry about DST changes :-) However, this is not a general
solution. With the fix, though, the problems I was experiencing went
away. No more clock drift, and it didn't take 20 seconds for the
system to suspend after closing the lid.

****PROPOSAL 2: add an ioctl() to /proc/apm so that the kernel can be
told the offset between the system clock and the RTC.

Regards,

Richard....
