Re: Swapping algorithm

Linus Torvalds (torvalds@cs.helsinki.fi)
Thu, 18 Apr 1996 09:09:26 +0300 (EET DST)


On Wed, 17 Apr 1996, David S. Miller wrote:
>
> I believe that an improvement could be made to the swapping algorithm if
> it looks for large continuous areas of memory to throw out, rather than
> single pages, the reason for this is that the swapping drive will be much
> slower at gathering multiple single pages (i.e. 1 seek (10ms) 1 page
> transfer (4k / 4MB/s = 1ms)) is much slower than if the pages swapped
> were continuous in the process space. It would be good if this was in
> addition to LRU.
>
> Wheee... sounds a lot like what Larry McVoy suggested to Linus and
> Stephen many moons ago. I like it myself too, however Linus and
> Stephen have some reservations about it. mea culpa

I don't like the approach, no.

However, that doesn't mean that Linux has to be stupid about swapping out
larger blocks of memory. The current setup is quite able to swap out
things as one _huge_ write request - both the memory management and the
device drivers support that kind of page-out.
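
To put rough numbers on the quoted argument, here is a back-of-the-envelope
model. The 10 ms seek, 4 KB page and 4 MB/s figures come from the quoted
message; the function names are made up for illustration, and the model
ignores rotational latency, queueing and everything else a real disk does.

```c
#include <assert.h>

/* Figures from the quoted message; everything else is an assumption. */
#define SEEK_MS        10.0   /* one seek */
#define PAGE_KB        4.0    /* page size */
#define DISK_KB_PER_MS 4.0    /* 4 MB/s, rounded to 4 KB/ms as the message does */

/* Time to transfer one page: 4 KB / (4 KB/ms) = 1 ms. */
static double page_transfer_ms(void)
{
	return PAGE_KB / DISK_KB_PER_MS;
}

/* n pages scattered over the swap area: one seek per page. */
double scattered_ms(int n)
{
	assert(n >= 0);
	return n * (SEEK_MS + page_transfer_ms());
}

/* n pages written as one request: one seek, then n transfers. */
double clustered_ms(int n)
{
	assert(n >= 0);
	return n > 0 ? SEEK_MS + n * page_transfer_ms() : 0.0;
}
```

With these numbers, 16 scattered pages cost about 176 ms, while the same
16 pages written as one request cost about 26 ms.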

The reason you don't see large swap-outs is twofold:
 - the "get_swap_page()" function, which allocates swap pages, tries to
   cluster the pages, but it also does a kind of striping across disks,
   so the clustering partly gets scattered anyway if you have multiple
   swap devices.
 - much more importantly, try_to_swap_out() will try to swap out only one
   page at a time. It should not be very hard to make try_to_swap_out()
   do a small loop so that it tries to swap out a few contiguous
   (ie _virtually_ contiguous) pages if "p->swap_cnt" is large.
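
The "small loop" idea in the second point could look roughly like the toy
model below. This is only a sketch of the control flow, not kernel code:
struct toy_page, swap_out_cluster() and the max_cluster parameter are all
invented here, and a real version would presumably derive the limit from
p->swap_cnt and queue actual writes.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page table entry; NOT the kernel's pte_t. */
struct toy_page {
	int present;   /* page is in memory */
	int swapped;   /* page has been written to swap */
};

/*
 * Hypothetical helper: starting at virtual page index `start`, swap out
 * up to `max_cluster` pages that are adjacent in the virtual address
 * space.  Stops at the first page that is not present, so the pages
 * written form one contiguous run.  Returns the number of pages
 * swapped out.
 */
size_t swap_out_cluster(struct toy_page *pages, size_t npages,
			size_t start, size_t max_cluster)
{
	size_t done = 0;

	assert(pages != NULL);
	while (done < max_cluster && start + done < npages) {
		struct toy_page *pg = &pages[start + done];

		if (!pg->present)
			break;		/* hole: the run ends here */
		pg->present = 0;	/* real code would queue a write */
		pg->swapped = 1;
		done++;
	}
	return done;
}
```

Stopping at the first hole keeps the run virtually contiguous, which is
what makes it a candidate for a single large write.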

Note that I'd be more than happy if somebody would like to look into
this, but you need to be aware that it's not one of the "obvious" parts
of the kernel, to say the least. Most importantly, try_to_swap_out() must
not sleep at inopportune times, because the process that it tries to swap
out might just disappear from under it if try_to_swap_out() sleeps ;-)
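
As a toy illustration of that hazard (not how the kernel handles it), the
sketch below remembers the task by slot and pid, calls something that may
sleep, and then looks the task up again instead of trusting the old
pointer. Every name here (toy_task, toy_sleep, toy_exit_slot,
swap_one_page_may_sleep) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_TASKS 4

/* Toy stand-ins; the kernel's task_struct and task[] are far richer. */
struct toy_task {
	int pid;
	int rss;	/* resident pages */
};

static struct toy_task *toy_task[TOY_NR_TASKS];
static int toy_exit_slot = -1;	/* slot that "exits" during the next sleep */

/* Stand-in for any operation that may sleep, e.g. waiting for I/O. */
static void toy_sleep(void)
{
	if (toy_exit_slot >= 0) {
		toy_task[toy_exit_slot] = NULL;	/* process went away */
		toy_exit_slot = -1;
	}
}

/*
 * Returns 1 if a page was accounted as swapped out, 0 if the task
 * disappeared (or had nothing resident) by the time we woke up.
 */
int swap_one_page_may_sleep(int slot)
{
	struct toy_task *p;
	int pid;

	assert(slot >= 0 && slot < TOY_NR_TASKS);
	p = toy_task[slot];
	if (p == NULL || p->rss == 0)
		return 0;
	pid = p->pid;

	toy_sleep();			/* anything may happen in here */

	p = toy_task[slot];		/* re-validate after the sleep */
	if (p == NULL || p->pid != pid || p->rss == 0)
		return 0;
	p->rss--;
	return 1;
}
```

The point is only that any pointer held across a sleep has to be treated
as stale; the simpler rule is not to sleep there at all.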

Umm.. If somebody wants to test, there is a rather simple way to do this,
and I'm appending patches which may or may not work. Totally untested, as
usual.. There are probably other issues here too, but somebody who knows
something about the mm layer may find this approach a good starting point.

Essentially, instead of returning immediately when we find something to
swap out, we try to swap out multiple pages from the same process. The
"wait" behaviour is not very good, though (if wait is true, we should do
the wait at the very end of the swap-out, not for each page).
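
The difference between waiting per page and waiting once at the end can
be shown with a small accounting model. It is a sketch only: struct toy_io
and the toy_* and swap_out_wait_* functions are invented for this example
and do not correspond to any kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/* Toy I/O accounting; none of this is a kernel interface. */
struct toy_io {
	size_t queued;	/* writes submitted but not yet waited for */
	size_t waits;	/* number of times we blocked */
	size_t written;	/* writes known to be complete */
};

static void toy_submit_write(struct toy_io *io)
{
	io->queued++;
}

static void toy_wait_all(struct toy_io *io)
{
	io->written += io->queued;
	io->queued = 0;
	io->waits++;
}

/* Block after every page: one wait per page written. */
void swap_out_wait_each(struct toy_io *io, size_t npages)
{
	size_t i;

	assert(io != NULL);
	for (i = 0; i < npages; i++) {
		toy_submit_write(io);
		toy_wait_all(io);
	}
}

/* Submit everything first, then wait once if the caller asked to wait. */
void swap_out_wait_once(struct toy_io *io, size_t npages, int wait)
{
	size_t i;

	assert(io != NULL);
	for (i = 0; i < npages; i++)
		toy_submit_write(io);
	if (wait && npages > 0)
		toy_wait_all(io);
}
```

For 8 pages the first variant blocks 8 times and the second blocks once,
with the same number of pages written in both cases.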

NOTE NOTE NOTE! Not only is this untested, it really isn't the "right"
way to fix the problem (this one does it in swap_out() instead of
try_to_swap_out()). This patch is _not_ going into my current kernel.
It's meant solely as a starting point if somebody wants to play around
with things.

Linus

-----
--- /usr/src/linux/mm/vmscan.c Fri Apr 12 09:49:48 1996
+++ experimental/mm/vmscan.c Thu Apr 18 09:00:37 1996
@@ -277,6 +277,7 @@
 	static int swap_task;
 	int loop, counter;
 	struct task_struct *p;
+	int success = 0;

 	counter = ((PAGEOUT_WEIGHT * nr_tasks) >> 10) >> priority;
 	for(; counter >= 0; counter--) {
@@ -288,9 +289,9 @@
 		while(1) {
 			if (swap_task >= NR_TASKS) {
 				swap_task = 1;
-				if (loop)
+				if (loop || success)
 					/* all processes are unswappable or already swapped out */
-					return 0;
+					return success;
 				loop = 1;
 			}

@@ -299,6 +300,8 @@
 				break;

 			swap_task++;
+			if (success)
+				return success;
 		}

 		/*
@@ -309,20 +312,27 @@
 			   multiplying by (RSS / 1MB) */
 			p->swap_cnt = AGE_CLUSTER_SIZE(p->mm->rss);
 		}
-		if (!--p->swap_cnt)
+		if (!--p->swap_cnt) {
 			swap_task++;
+			if (success)
+				break;
+		}
 		switch (swap_out_process(p, dma, wait)) {
 			case 0:
-				if (p->swap_cnt)
+				if (p->swap_cnt) {
 					swap_task++;
+					if (success)
+						return success;
+				}
 				break;
 			case 1:
-				return 1;
+				success = 1;
+				/* fall through */
 			default:
 				break;
 		}
 	}
-	return 0;
+	return success;
 }

 /*
-----