Re: Swapping algorithm

David S. Miller (davem@caip.rutgers.edu)
Tue, 30 Apr 1996 00:16:48 -0400


Date: Mon, 29 Apr 1996 21:46:43 +0100 (BST)
From: "J.J. Burgess" <92jjb@eng.cam.ac.uk>

Here is a more considered view of how this might be done; it's the
most efficient and easiest way I can think of.

Swapping handles 3 different things:

1. Clear out unused pages - init code etc.
This is done already (poorly); it can be done slowly
and results in page-out (not swap) of clean
code segments.

Speculative unpaging is great for this:
every time memory starts getting short,
or load gets very low, we should look for
these. When we actually need to swap, we
look first at pages that were unmapped ages
ago and haven't been accessed since.

Maybe make a separate kernel thread from kswapd that does this; maybe
call it "vmscrubd" ;-)

2. Get rid of most infrequently used pages.

Ok, you're getting to the point where the information we keep around
now (which is basically limited history LRU) is not really enough for
a really robust and smart swapper.

3. Sequentially accessed data, too much to hold at once

I think we could easily modify our current stuff to detect whether
sequential or basically random accesses are causing swaps to happen.
Something like:

vmscan()
{
	/* Adjacent swap-in addresses mean a sequential pattern. */
	if (current->swapin_address - PAGE_SIZE ==
	    current->last_swapin_address)
		current->sequential_swap = 1;
	else
		current->sequential_swap = 0;
	current->last_swapin_address = current->swapin_address;
}

You get the idea.

Speculative unmapping copes with all this, and since
heavy paging currently results in much idle time
waiting on disc access, hopefully we'll have a lot
of spare time to do this.

One of the main points in the idea Larry McVoy is trying to push, and
I am as well, is that yes, you're sitting there waiting for a page to
come in, but in the _same_ amount of time you could bring in an entire
sector of swap pages. We eat an entire rotational delay, worst case,
for each swap-in of a page; we should be using that "idle" time to our
advantage. 56KB swap clustering does this. It would require a bit of a
rework of the swapping code right now to get it doing what it should,
but added to the sequential vs. random swap detection it would be
nice (i.e. we bypass this clustering crap if we see random access
patterns).
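
To make that concrete, here's roughly how the sequential_swap flag
from the snippet above could gate the clustering, again with made-up
names and a 56KB cluster assumed:

#define PAGE_SIZE	4096UL
#define SWAP_CLUSTER	(56 * 1024 / PAGE_SIZE)	/* 14 pages per 56KB */

struct task {
	int sequential_swap;	/* set by the vmscan() check above */
};

/* How many pages to pull in for one fault: if this task has been
 * faulting sequentially, read a whole 56KB cluster while we're
 * already eating the rotational delay; otherwise read just the one
 * page and bypass the clustering. */
static unsigned long swapin_count(const struct task *t)
{
	return t->sequential_swap ? SWAP_CLUSTER : 1;
}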

If you want something truly wicked, for the random case we can put
them all into similar sectors (i.e. away from sectors used for
clustered swaps), and many times we'll end up being able to ask for a
huge chunk of that random swap sector because random swap page
requests have piled up in the right way. Here is where we can get
some cool performance numbers no one else can.
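
A made-up slot allocator to illustrate the layout - sequential
swap-outs take slots from the front of the swap area, random ones pack
together at the back (the constant and names are pure invention):

#define SWAP_SLOTS	32768L

static long next_seq_slot;			/* grows upward   */
static long next_rand_slot = SWAP_SLOTS - 1;	/* grows downward */

/* Hand out a swap slot; random-pattern pages land adjacent to each
 * other at the back, so later swap-ins can sometimes grab a big
 * chunk of them in one disc request. */
static long alloc_swap_slot(int sequential)
{
	if (next_seq_slot > next_rand_slot)
		return -1;		/* regions met: swap area full */
	return sequential ? next_seq_slot++ : next_rand_slot--;
}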

Swapping policy: swap the pages that have been unmapped the longest,
by including with every page a time reference, in jiffies, which
together with the status (normal, unmapped, swapped) tells us when
the page was swapped out, swapped in, last accessed, or unmapped -
i.e. the last time it changed status.
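
In made-up code, the per-page record and the victim scan under that
policy might look something like this:

#include <stddef.h>

enum page_status { PG_NORMAL, PG_UNMAPPED, PG_SWAPPED };

struct upage {
	enum page_status status;
	unsigned long stamp;	/* jiffies at the last status change */
};

/* Victim selection: among unmapped pages, pick the one whose status
 * changed longest ago, i.e. the page that has been unmapped the
 * longest without being faulted back in. */
static long pick_victim(const struct upage *pages, size_t n,
			unsigned long now)
{
	long victim = -1;
	unsigned long best_age = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (pages[i].status != PG_UNMAPPED)
			continue;
		if (now - pages[i].stamp >= best_age) {
			best_age = now - pages[i].stamp;
			victim = (long)i;
		}
	}
	return victim;
}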

How do we do the unmapping?
- Surely we can use most of the current swapping code to unmap, but
not swap out, a process's access to a page of memory, generating a
page fault on access, whereupon we give the page back and update
our last_accessed time.
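
Continuing the same sketch (same made-up struct as above), the fault
side of that is trivial - no disc I/O, just remap and refresh the
timestamp so the unmap sweep sees the page as recently used:

/* Fault path for a page that was unmapped but never written to swap:
 * give it straight back and update its timestamp. */
static void fault_in(struct upage *pg, unsigned long now)
{
	if (pg->status == PG_UNMAPPED) {
		pg->status = PG_NORMAL;	/* page was still in core */
		pg->stamp = now;
	}
	/* PG_SWAPPED would fall through to a real swap-in here. */
}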

How do we choose what to unmap?
- ?
We don't want to unmap pages too frequently, although the time penalty
of unmapping a page is surely a lot less than that of swapping.

Unmap is cheap, so do it whenever it seems reasonable; however, avoid
the case where a page has been around a long time but is still in the
process's working set (i.e. a loop that runs forever - you never want
to swap something like that out).

How often do we do the unmapping?
- ?
Perhaps all the time we're idle; the only penalty is an increase in process
execution time as it slowly gets its pages back. However, if this happened
for every page the CPU accessed, it wouldn't be good.
Implementing a mechanism to back off from unmapping frequently used pages is hard.

You can modify the clock hand of the vmscan algorithm to have some
kind of knowledge of where "dry" areas are in the free page pool;
pages older than X jiffies plus !(modified || referenced) are
considered "dry".

If we unmap all memory with a (guess) penalty of 1ms overhead per page:
16M / 4k = 4096 pages, i.e. about 4 seconds for a whole sweep. This is a
long time, so we must do it in an orderly fashion, a bit at a time.

Cluster the unmaps and you get 16M / 56k == .3 seconds. But 56KB is
our max clustering size and we can't cluster unmaps some of the time,
so it's probably more like 2 seconds with clustering.

We can use the 'last time known accessed' field to determine what we unmap.
Maybe we sweep the page tables and unmap these pages; they will then
have their timers updated if and when they are accessed again.

See my "dry" pages idea above.

In addition to the above swapping policy, if we have a whole load of
sequential pages, all unused for a similarly long period of time, we swap
them all rather than just individually. How do we decide how close is
close? Who knows? This will have to be tested and tuned. - Maybe we could
unmap them all simultaneously; unmap an entire process address space at
once, maybe. Seems a bit rough on the process, though not quite as bad as
swapping it all out.

Certainly tested and tuned. The swap tester of yours can be modified
to do two different kinds of runs: one sequential (like you do now)
and one completely random (using a seed for random() from
/dev/urandom).
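
Something along these lines would do as the tester (totally untuned;
MEGS needs to be bigger than physical memory to actually force
swapping):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEGS		32		/* > physical RAM to force swap */
#define PAGE_SIZE	4096UL
#define NPAGES		(MEGS * 1024UL * 1024UL / PAGE_SIZE)

/* Toy swap tester: touch one byte per page, either in order or in a
 * random order seeded from /dev/urandom, and let time/vmstat show how
 * the swapper copes with each pattern. */
int main(int argc, char **argv)
{
	char *buf = malloc(NPAGES * PAGE_SIZE);
	int random_mode = argc > 1 && !strcmp(argv[1], "random");
	unsigned long i;

	if (!buf)
		return 1;
	if (random_mode) {
		unsigned int seed;
		FILE *f = fopen("/dev/urandom", "r");

		if (!f)
			return 1;
		if (fread(&seed, sizeof(seed), 1, f) != 1)
			return 1;
		fclose(f);
		srandom(seed);
	}
	for (i = 0; i < NPAGES; i++) {
		unsigned long pg = random_mode
			? (unsigned long)random() % NPAGES : i;

		buf[pg * PAGE_SIZE] = 1;	/* fault the page in */
	}
	return 0;
}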

Doing all the above will guarantee we don't swap out frequently used
pages, although I don't have a heavily swapping system myself (16MB RAM),
so I don't have much experience of what people do in very bad swap
situations.

Do some make -j's; even on 24MB machines this can create bad swap
situations and really cool races/deadlocks in your swapping
algorithm.

Later,
David S. Miller
davem@caip.rutgers.edu