Re: [SUSPECTED SPAM] Re: [linux-pm] Proposal for a new algorithm for reading & writing a hibernation image.

From: Nigel Cunningham
Date: Fri Jun 04 2010 - 20:47:23 EST


Hi.

On 05/06/10 10:36, Maxim Levitsky wrote:
On Sat, 2010-06-05 at 09:58 +1000, Nigel Cunningham wrote:
On 05/06/10 09:39, Maxim Levitsky wrote:
On Thu, 2010-06-03 at 16:50 +0200, Pavel Machek wrote:
"Nigel Cunningham"<ncunningham@xxxxxxxxxxx> wrote:
On 30/05/10 15:25, Pavel Machek wrote:
Hi!

2. Prior to writing any of the image, also set up new 4k page tables
such that an attempt to make a change to any of the pages we're about to
write to disk will result in a page fault, giving us an opportunity to
flag the page as needing an atomic copy later. Once this is done, write
protection for the page can be disabled and the write that caused the
fault allowed to proceed.

Tricky.

page faulting code touches memory, too...

Yeah. I realise we'd need to make the pages that are used to record the
faults be unprotected themselves. I'm imagining a bitmap for that.
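Something as simple as one bit per page frame ought to do, provided the
bitmap's own pages stay writable so that flagging a fault never faults
in turn. A rough sketch in plain C (the names and the 4 GiB sizing are
only for illustration, nothing that exists today):

#include <limits.h>
#include <stdbool.h>

/* One bit per 4 KiB page frame.  These array pages must themselves stay
 * write-enabled, otherwise recording a fault would fault recursively.  */
#define MAX_PFN        (1UL << 20)    /* assumption: 4 GiB of 4 KiB pages */
#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

static unsigned long needs_atomic_copy[MAX_PFN / BITS_PER_LONG];

static inline void flag_pfn(unsigned long pfn)
{
	needs_atomic_copy[pfn / BITS_PER_LONG] |= 1UL << (pfn % BITS_PER_LONG);
}

static inline bool pfn_flagged(unsigned long pfn)
{
	return (needs_atomic_copy[pfn / BITS_PER_LONG] >>
		(pfn % BITS_PER_LONG)) & 1UL;
}

The fault handler would call flag_pfn() and nothing else that could
touch protected memory.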

Do you see any reason that it could be inherently impossible? That's
what I really want to know before (potentially) wasting time trying it.

I'm not sure it is impossible, but it certainly seems way too complex to be
practical.

2MB pages will probably present a problem, as will BAT mappings on PowerPC.


Some time ago, after TuxOnIce caused moderate filesystem corruption twice
on my root filesystem (the superblock gone, for example), I was also
thinking about how to make it safe to save the whole of memory.

I'd be asking why you got the corruption. On the odd occasion where it
has been reported, it's usually been because the person didn't set up
their initramfs correctly (resumed after mounting filesystems). Is there
any chance that you did that?

Your TuxOnIce is so fast that it resembles suspend to RAM.

That depends on hard drive speed and CPU speed. I've just gotten a new
SSD drive, and can understand your statement now, but I wouldn't have
said the same beforehand.
Nope, I have a slow laptop drive.

Oh, okay. Not much RAM then? I would have thought that in most cases - and especially with a slow laptop drive - suspend to RAM would be waaay faster. Ah well, there is a huge variation in specs.

I have a radically different proposal.


Let's create a kind of self-contained, very small operating system that
knows how to do just one thing: write the memory to disk.
From now on I am calling this OS a suspend module.
Physically, its code can be contained in the Linux kernel, or loaded as a
module.


Let's see how things will work first:

1. Linux loads the suspend module into memory (if it is inside the
kernel image, that becomes unnecessary).

At that point, it's even possible to add some user plug-ins to that
module, for example to draw a splash screen. Of course, all such
plug-ins must be root-approved.


2. Linux turns off all devices but the hard disk.
Drivers for hard drives will register for this exception.


3. Linux creates a list of memory areas to save (or to exclude from the
save; it doesn't matter which).

4. Linux creates a list of hard disk sectors that will contain the
image.
This ensures support for swap partitions and swap files as well.

5. Linux allocates a small 'scratch space'.
Of course, if memory is very tight, some swapping can happen, but that
isn't significant.


6. Linux creates new page tables that map the suspend module, both of
the above lists, the scratch space and (optionally) the framebuffer
read-write, and the rest of memory read-only.

7. Linux switches to the new page tables and passes control to the module.
Even if the module wanted to, it wouldn't be able to change system memory.
It won't even know how to do so.

8. The module optionally encrypts and/or compresses the data (saving the
result to the scratch pages).

9. The module uses very simplified disk drivers to write the memory to
disk.
These drivers can even omit using interrupts, because there is nothing
else to do.
It can also draw a progress bar on the framebuffer using an optional
plug-in.

10. The module passes control back to Linux, which just shuts the
system off.
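To make that concrete, the hand-off described in steps 3-6 could be
nothing more than a single descriptor that Linux fills in before it
switches page tables and jumps into the module. This is only a sketch of
what I have in mind; none of these structure or type names exist
anywhere:

#include <stdint.h>

/* Hypothetical descriptor Linux builds (steps 3-6) and hands to the
 * suspend module together with the new, mostly read-only page tables. */

struct mem_extent {                  /* a run of physical pages to save  */
	uint64_t start_pfn;
	uint64_t nr_pages;
};

struct disk_extent {                 /* pre-allocated image sectors      */
	uint64_t first_sector;
	uint64_t nr_sectors;
};

struct suspend_handoff {
	const struct mem_extent  *to_save;        /* step 3: what to write */
	uint64_t                  nr_mem_extents;

	const struct disk_extent *image_sectors;  /* step 4: where it goes */
	uint64_t                  nr_disk_extents;

	void                     *scratch;        /* step 5: only RW area  */
	uint64_t                  scratch_bytes;

	void                     *framebuffer;    /* optional progress bar */
	uint64_t                  fb_bytes;
};

/* The module's sole entry point: walk the extents, optionally compress
 * or encrypt into the scratch area, write to the listed sectors, return. */
typedef int (*suspend_module_entry)(const struct suspend_handoff *h);

The point of keeping the interface that small is that the module never
needs to understand Linux's own memory management at all.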

Sounds a lot like the kexec-based hibernation that was suggested a year
or two back. Have you thought about resuming, too? That's the trickier
part of the process.
Why is it tricky?

We can just reserve, say, 25 MB of memory and make the resuming kernel
use only that for all its needs.

Well, I suppose in this scenario you can do it all atomically. I was thinking of the case where we do a two-part restore (still trying to maximise image size, but without a separate kernel).

Now what code will be in the module:

1. Optional compression & encryption - easy
2. Drawing plug-ins, also optional and easy


3. New disk drivers.
This is the hard part, but if we cover libata and AHCI, we will cover
the common case.
Other cases can be handled by the existing code that saves only half of
RAM.

To my mind, supporting only some hardware isn't an option.



4. Arch-specific code. Since it doesn't deal with interrupts or memory
management, it won't be a lot of code.
Again, standard swsusp can be used for arches the module hasn't been
ported to.

Perhaps I'm being a pessimist, but it sounds to me like this is going to
be a way bigger project than you're allowing for.
I also think so. This is just an idea.


To add a comment on your idea.

I think it is possible to use page faults to see which memory regions
have changed. Actually, it is a very interesting idea.

You just need to install your own page fault handler, and make sure it
doesn't touch any memory.

If the memory it writes to isn't protected, there'll be no recursive page fault and no problem, right? I'm imagining this page fault handler will only set a flag to record that the page needs to be atomically copied, copy the original contents to a page previously prepared for the purpose, remove the write protection for the page and allow the write to continue. That should be okay, right?
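As a userspace analogue of exactly that sequence - flag the page, save
the original contents to a page prepared in advance, drop the write
protection and let the faulting write retry - here's a small sketch
using mprotect() and SIGSEGV. The in-kernel version would obviously poke
the page tables directly rather than go through the normal fault path,
so treat this as an illustration of the idea only:

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_PAGES 16

static long page_size;
static unsigned char *region;   /* the write-protected "image" memory    */
static unsigned char *copies;   /* pages prepared in advance for copies  */
static unsigned char needs_copy[REGION_PAGES];  /* 1 = copy atomically   */

static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig; (void)ctx;
	unsigned char *page = (unsigned char *)
		((uintptr_t)si->si_addr & ~((uintptr_t)page_size - 1));
	long idx = (page - region) / page_size;

	if (idx < 0 || idx >= REGION_PAGES)
		_exit(1);                    /* a real crash, not our doing */

	needs_copy[idx] = 1;                               /* flag the page */
	memcpy(copies + idx * page_size, page, page_size); /* save original */
	mprotect(page, page_size, PROT_READ | PROT_WRITE); /* unprotect     */
	/* returning retries the faulting write, which now succeeds */
}

int main(void)
{
	page_size = sysconf(_SC_PAGESIZE);
	region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	copies = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED || copies == MAP_FAILED)
		return 1;

	struct sigaction sa;
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = fault_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);

	memset(region, 0xAA, REGION_PAGES * page_size);   /* "image" data   */
	mprotect(region, REGION_PAGES * page_size, PROT_READ); /* protect   */

	region[3 * page_size + 5] = 0x55;   /* faults once, then proceeds   */

	printf("page 3 flagged: %d, saved original byte: 0x%02x\n",
	       needs_copy[3], copies[3 * page_size + 5]);
	return 0;
}

(The handler is kept deliberately minimal; in the hibernation case it
must not allocate or touch anything that is itself write-protected.)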

Of course the sucky part will be how to edit the page tables.
You might need to write your own code to do so to be sure.
And this has to be arch-specific.

Yeah. I wondered whether the code that's already used for creating page tables for the atomic restore could be reused, at least in part.

Since userspace is frozen, you can be sure that faults can only be
caused by writes to write-protected memory, or by kernel bugs.

Userspace helpers or uswsusp shouldn't be forgotten.

Regards,

Nigel