Re: [PATCH] mm: implement swap prefetching

From: Andrew Morton
Date: Mon Feb 06 2006 - 19:34:50 EST


Con Kolivas <kernel@xxxxxxxxxxx> wrote:
>
> Andrew et al
>
> I'm resubmitting the swap prefetching patch for inclusion in -mm and hopefully
> mainline.

Resubmitting is good, thanks.

> After you removed it from -mm there were some people who described
> the benefits it afforded their workloads. -mm being ever so slightly quieter
> at the moment, please reconsider.
>

I wish I could convince myself this is sufficiently beneficial..

I've been running 2.6.15-rc2-mm2 on my main workstation (x86, 2G) since
whenever. (Am lazy, haven't gotten around to upgrading that machine). It
has swap prefetch.

I can't say I noticed any difference, although I did turn it off in /proc a
few reboots ago because it was irritating me for some reason which I forget
(sorry).

One thing about 2.6.15-rc2-mm2 is that the `so' and `si' columns in
`vmstat' always read zero. I don't know whether that bug is due to the
prefetch patch or not.

>
> The amount prefetched in each group is configurable via the tunable in
> /proc/sys/vm/swap_prefetch. This is set to a value based on memory size. When
> laptop_mode is enabled it prefetches in ten times larger blocks to minimise
> the time spent reading.

That's incomprehensible, sorry.

I think it'd be much clearer if the thing was called swap_prefetch_kbytes
or swap_prefetch_mbytes or (worse) swap_prefetch_pages - putting the units in the
name really helps clarify things.

And if such a change is made, the internal variable should also be renamed.
Right now it's "swap_prefetch", which sounds like a boolean.

> +swap_prefetch
> +
> +This is the amount of data prefetched per prefetching interval when
> +swap prefetching is compiled in. The value means multiples of 128K,
> +except when laptop_mode is enabled and then it is ten times larger.
> +Setting it to 0 disables prefetching entirely.

What does "ten times larger" mean? If laptop_mode, this thing is in units
of 1280 kbytes and if !laptop_mode it's in units of 128 kbytes?

If so (or if not), this tunable is quite obscure and hard-to-understand.
Can you find a way to make this more user-friendly?
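
Something along these lines would at least be self-describing
(swap_prefetch_kbytes and the helper below are only illustrative, not
from the patch):

	/* hypothetical: the tunable with its unit in the name */
	int swap_prefetch_kbytes __read_mostly;

	/*
	 * Pages to prefetch per interval.  The unit conversion and the
	 * laptop_mode multiplier then live in one obvious place instead
	 * of being implied by the documentation.
	 */
	static unsigned long swap_prefetch_pages(void)
	{
		unsigned long kbytes = swap_prefetch_kbytes;

		if (laptop_mode)
			kbytes *= 10;	/* larger reads while the disk is up */

		return kbytes >> (PAGE_SHIFT - 10);	/* kbytes -> pages */
	}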

> +/* only used by prefetch externally */
> +/* mm/swap_prefetch.c */
> +extern void prepare_prefetch(void);
> +extern void add_to_swapped_list(unsigned long index);
> +extern void remove_from_swapped_list(unsigned long index);
> +extern void delay_prefetch(void);

I'd suggest that "prefetch" is too generic a term. We prefetch lots of
things in the kernel. Please rename all globally-visible identifiers with
s/prefetch/swap_prefetch/.
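
i.e. the declarations above would become something like (the two
*_swapped_list functions don't contain "prefetch", so they'd stay as
they are):

	/* only used by swap prefetch externally */
	/* mm/swap_prefetch.c */
	extern void prepare_swap_prefetch(void);
	extern void add_to_swapped_list(unsigned long index);
	extern void remove_from_swapped_list(unsigned long index);
	extern void delay_swap_prefetch(void);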


> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.16-rc2-ck1/mm/swap_prefetch.c 2006-02-04 18:38:24.000000000 +1100
> @@ -0,0 +1,431 @@
> +/*
> + * linux/mm/swap_prefetch.c
> + *
> + * Copyright (C) 2005 Con Kolivas
> + *
> + * Written by Con Kolivas <kernel@xxxxxxxxxxx>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/swap.h>
> +#include <linux/ioprio.h>
> +#include <linux/kthread.h>
> +#include <linux/pagemap.h>
> +#include <linux/syscalls.h>
> +#include <linux/writeback.h>
> +
> +/* Time to delay prefetching if vm is busy or prefetching unsuccessful */
> +#define PREFETCH_DELAY (HZ * 5)
> +/* Time between attempting prefetching when vm is idle */
> +#define PREFETCH_INTERVAL (HZ)
> +
> +/* sysctl - how many SWAP_CLUSTER_MAX pages to prefetch at a time */
> +int swap_prefetch __read_mostly;
> +
> +struct swapped_root {
> + unsigned long busy; /* vm busy */
> + spinlock_t lock; /* protects all data */
> + struct list_head list; /* MRU list of swapped pages */
> + struct radix_tree_root swap_tree; /* Lookup tree of pages */
> + unsigned int count; /* Number of entries */
> + unsigned int maxcount; /* Maximum entries allowed */
> + kmem_cache_t *cache; /* Of struct swapped_entry */
> +};

> +struct swapped_entry {
> + swp_entry_t swp_entry;
> + struct list_head swapped_list;
> +};
> +
> +static struct swapped_root swapped = {
> + .busy = 0,
> + .lock = SPIN_LOCK_UNLOCKED,
> + .list = LIST_HEAD_INIT(swapped.list),
> + .swap_tree = RADIX_TREE_INIT(GFP_ATOMIC),
> + .count = 0,
> +};

Description of `busy' and `count'?
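
Guessing from how they're used further down, something along these
lines perhaps:

	unsigned long busy;	/* the VM is swapping, prefetch backs off */
	unsigned int count;	/* entries currently on the list and tree */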

> +static task_t *kprefetchd_task;
> +
> +/* Max mapped we will prefetch to */
> +static unsigned long mapped_limit __read_mostly;
> +/* Last total free pages */
> +static unsigned long last_free = 0;
> +static unsigned long temp_free = 0;

Unneeded initialisation - statics are zeroed anyway.

> +
> +/*
> + * Accounting is sloppy on purpose. As adding and removing entries from the
> + * list happens during swapping in and out we don't want to be spinning on
> + * locks. It is cheaper to just miss adding an entry since having a reference
> + * to every entry is not critical.
> + */
> +void add_to_swapped_list(unsigned long index)
> +{
> + struct swapped_entry *entry;
> + int error;
> +
> + if (unlikely(!spin_trylock(&swapped.lock)))
> + goto out;

hm, spin_trylock() should internally do unlikely(), but it doesn't. (It's
a bit of a mess, too).

> + if (swapped.count >= swapped.maxcount) {

/*
* <comment about LRU>
*/
> + entry = list_entry(swapped.list.next,
> + struct swapped_entry, swapped_list);
> + radix_tree_delete(&swapped.swap_tree, entry->swp_entry.val);
> + list_del(&entry->swapped_list);
> + swapped.count--;
> + } else {
> + entry = kmem_cache_alloc(swapped.cache, GFP_ATOMIC);
> + if (unlikely(!entry))
> + /* bad, can't allocate more mem */
> + goto out_locked;
> + }
> +
> + entry->swp_entry.val = index;
> +
> + error = radix_tree_preload(GFP_ATOMIC);

I suspect we don't need to preload here. We can handle radix_tree_insert()
failure, so just go ahead and do it.
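
Something like this, roughly (sketch only - the tree root is initialised
with GFP_ATOMIC, so the insert itself allocates atomically and can
simply fail):

	error = radix_tree_insert(&swapped.swap_tree, index, entry);
	if (unlikely(error)) {
		/* losing the odd entry is fine per the "sloppy" comment */
		kmem_cache_free(swapped.cache, entry);
		goto out_locked;
	}
	/* add at the tail so the head stays the oldest entry */
	list_add_tail(&entry->swapped_list, &swapped.list);
	swapped.count++;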

> +static inline int high_zone(struct zone *zone)
> +{
> + if (zone == NULL)
> + return 0;
> + return is_highmem(zone);
> +}
> +
> +/*
> + * Find the zone with the most free pages, recheck the watermarks and
> + * then directly allocate the ram. We don't want prefetch to use
> + * __alloc_pages and go calling on reclaim.
> + */
> +static struct page *prefetch_get_page(void)
> +{
> + struct zone *zone = NULL, *z;
> + struct page *page = NULL;
> + struct zonelist *zonelist;
> + long most_free = 0;
> +
> + for_each_zone(z) {
> + long free;
> +
> + if (z->present_pages == 0)
> + continue;
> +
> + /* We don't prefetch into DMA */
> + if (zone_idx(z) == ZONE_DMA)
> + continue;
> +
> + free = z->free_pages;
> + /* Select the zone with the most free ram preferring high */
> + if ((free > most_free && (!high_zone(zone) || high_zone(z))) ||
> + (!high_zone(zone) && high_zone(z))) {
> + most_free = free;
> + zone = z;
> + }
> + }

<stares at the above expression for three minutes>

I think it'll always select ZONE_HIGHMEM no matter what - the second
(!high_zone(zone) && high_zone(z)) clause switches to a highmem zone
regardless of how much memory it has free. Users of 1G x86 boxes, with
their tiny highmem zone, not happy.

> +/*
> + * How many pages to prefetch at a time. We prefetch SWAP_CLUSTER_MAX *
> + * swap_prefetch per PREFETCH_INTERVAL, but prefetch ten times as much at a
> + * time in laptop_mode to minimise the time we keep the disk spinning.
> + */
> +static inline unsigned long prefetch_pages(void)
> +{
> + return (SWAP_CLUSTER_MAX * swap_prefetch * (1 + 9 * !!laptop_mode));
> +}

I don't think this should be done in-kernel. There's a nice script to
start and stop laptop mode. We can make this decision in that script.

> +/*
> + * We want to be absolutely certain it's ok to start prefetching.
> + */
> +static int prefetch_suitable(void)
> +{
> + struct page_state ps;
> + unsigned long limit;
> + struct zone *z;
> + int ret = 0;
> +
> + /* Purposefully racy and might return false positive which is ok */
> + if (__test_and_clear_bit(0, &swapped.busy))
> + goto out;
> +
> + temp_free = 0;
> + /*
> + * Have some hysteresis between where page reclaiming and prefetching
> + * will occur to prevent ping-ponging between them.
> + */
> + for_each_zone(z) {
> + unsigned long free;
> +
> + if (z->present_pages == 0)
> + continue;
> + free = z->free_pages;
> + if (z->pages_high * 3 > free)
> + goto out;
> + temp_free += free;
> + }
> +
> + /*
> + * We check to see that pages are not being allocated elsewhere
> + * at any significant rate implying any degree of memory pressure
> + * (eg during file reads)
> + */
> + if (last_free) {
> + if (temp_free + SWAP_CLUSTER_MAX < last_free) {
> + last_free = temp_free;
> + goto out;
> + }
> + } else
> + last_free = temp_free;

What is the actual threshold rate here? SWAP_CLUSTER_MAX pages (128k
with 4k pages) per however long it's been since `last_free' was last
sampled, which depends on scheduling and on how fast your CPU is?
Seems a bit vague?

> + get_page_state(&ps);

get_page_state() can be super-expensive - it sums the per-cpu counters
across every CPU. How frequently is this called?

> +
> +static int kprefetchd(void *__unused)
> +{
> + set_user_nice(current, 19);
> + /* Set ioprio to lowest if supported by i/o scheduler */
> + sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
> +
> + do {
> + enum trickle_return prefetched;
> +
> + try_to_freeze();
> +
> + /*
> + * TRICKLE_FAILED implies no entries left - we do not schedule
> + * a wakeup, and further delay the next one.
> + */
> + prefetched = trickle_swap();
> + switch (prefetched) {
> + case TRICKLE_SUCCESS:
> + last_free = temp_free;

This `last_free' thing is really confusing. It's central to the algorithms
yet its name is largely meaningless. last_free *what*? It seems to mean
"total number of free pages on the last prefetching pass", yes? Wanna
think of a better name and a better comment for it?
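
Perhaps something like this (name and wording only a suggestion):

	/*
	 * Total free pages as seen at the end of the last prefetch pass.
	 * If the current total has dropped by more than SWAP_CLUSTER_MAX
	 * since then, something else is allocating and prefetching backs
	 * off.
	 */
	static unsigned long last_pass_free_pages;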


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/