Re: Suspend2 merge preparation: Rationale behind the freezer changes.
From: Nigel Cunningham
Date: Fri May 21 2004 - 07:34:45 EST
Hi.
Pavel Machek wrote:
Thanks for this. (Btw you might want to Cc me on such mails, I read
both list and personal mails, but you'll get way better response time.)
Humble apologies! :>
First of all, let me explain that although swsusp and suspend2 work at a
very fundamental level in the same way, there are also some important
differences. Of particular relevance to this conversation is the fact
that swsusp makes what is as close to an atomic copy of the entire image
to be saved as we can get and then saves it. In contrast, suspend2 saves
one portion of the memory (lru pages), makes an atomic copy of the rest
and then saves the atomic copy of the second part.
Hmm, I did not realize this difference. Doing these hacks with LRU
seems pretty crazy to me...
No... this is what you already know, just described differently. You
mentioned in your documentation that suspend2 overcomes the half of
memory limitation by saving the image in two parts: the second part is
LRU (unless I have my terminology confused: I'm talking about pages on
the inactive and active lists). By the time this is done, all other
processes are stopped, so there's no danger of corruption anyway.
Suspend2 itself doesn't affect LRU because I switched from using the
'normal' swap read/write calls ages ago, as part of adding swapfile
support. For 2.6 we're directly using BIOs.
Would it be possible to stop processes when they try to manipulate
LRU, instead?
They're already stopped.
Secondly, we have a more basic problem with the existing freezer
implementation. A fundamental assumption made by it is that the order in
which processes are signalled does not matter; that there will be no
deadlocks caused by freezing one process before another. This simply
isn't true.
It better should be. If it is not true, then kill -STOP -1 does not
work, and that would be a kernel bug, right?
We already discussed the example of trying to do an ls on an NFS share
and the NFS threads being frozen first. I can come up with more examples
if you'd like. I guess the simplest one (off the top of my head) would
be freezing kjournald while processes are submitting and waiting on I/O.
When user thread is stopped, it should better not hold any lock,
because otherwise we have problem anyway.
Yes, but we're not just talking about user threads. We could
differentiate kernel threads and user threads (presumably using another
PF_ flag?) and attempt to freeze the user threads first.
Kernel threads are different, and each must be handled separately,
maybe even with some ordering. But there's relatively small number of
kernel threads...
Yes, but what order? I played with that problem for ages. Perhaps I just
didn't find the right combination.
Thirdly, the existing implementation does not allow us to quickly stop
activity. Under heavy load, particularly heavy I/O (assuming the freezer
does work), it make take quite a while for processes to respond to the
pseudo-signal and enter the refrigerator. New processes may also be
spawned, further complicating matters. The busier the system is, the
more hit-and-miss freezing becomes.
I agree it can take longer, but modulo bugs, it should be always possible.
Should... I'll find some time to roll a freezer implementation that does
what you're suggesting (try user space threads first, seek an order for
kernel space threads). If we can do it that way, it will be less
invasive. I'll see...
The implementation of the freezer that I have developed addresses these
concerns by adding an atomic count of the number of procesess in
critical paths. The first part of the freezer simply waits for the
number of processes in critical paths to reach zero.
Exactly, you slowed down critical paths of kernel... This makes patch
big, ugly, and is bad idea.
Maybe I wasn't clear enough. When we're not suspending, all that is
added to the paths that are modified is:
- 9 tests, possibly resulting in refrigerator entry or immediately
dropping through, setting the PF_FRIDGE_WAIT flag and incrementing the
atomic_t at the start of a busy path.
- 2 tests, possibly resetting the flag & decrementing the counter at the
end.
- 3 tests, setting a local variable, restting the FRIDGE_WAIT flag and
decrementing the atomic_t when dropping locks and sleeping in kernel.
- 10 tests, possibly resulting in refrigerator entry or immediately
dropping through, restoring the PF_FRIDGE_WAIT flag and reincrementing
the atomic_t after such sleeps.
I've been using this approach for months, and my Celeron 933 doesn't
feel slow at all. I've had no complains from users either.
We really need to ask how critical these paths really are: some of them
are certainly more commonly used, such as sys_read & sys_write. The vast
majority, however, are less commonly used. I wonder if it's worth
getting a benchmarking program. I'll try your suggestion above first.
These four macros play a further role. When we begin to wait for the
activity counter to reach zero, a flag is set to record this fact. Macro
calls check this flag, and a process reaching a START or RESTARTING
activity macro while the flag is set will be refrigerated at that point
until after the suspend cycle is completed. This helps us quiesce the
system more quickly.
Adding hooks to "fast" stuff like read()/write()/open is no-no. Adding
small number of hooks to slower stuff like exec()/exit() might be
acceptable. Could you get away with that?
No. Reading and writing is exactly what we want to be able to pause.
Otherwise we get processes stuck waiting on pages.
Summary:
- I'll try your user space first, kernel space afterwards suggestion.
- I'll also look into benchmarking the system with and without suspend2
compiled in (ie with and without the hooks, since they compile away to
nothing without CONFIG_SOFTWARE_SUSPEND2
Regards,
Nigel
--
Nigel & Michelle Cunningham
C/- Westminster Presbyterian Church Belconnen
61 Templeton Street, Cook, ACT 2614.
+61 (2) 6251 7727(wk); +61 (2) 6254 0216 (home)
Evolution (n): A hypothetical process whereby infinitely improbable
events occur
with alarming frequency, order arises from chaos, and no one is given
credit.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/