Re: [PATCH] mm: Implement Swap Prefetching v23

From: Con Kolivas
Date: Thu Feb 09 2006 - 22:56:34 EST


On Friday 10 February 2006 14:25, Andrew Morton wrote:
> Con Kolivas <kernel@xxxxxxxxxxx> wrote:
> > Here's a respin with Nick's suggestions and a modification to not cost us
> > extra slab on non-numa.
>
> v23? I'm sure we can do better than that.

:D

> > This patch implements swap prefetching when the vm is relatively idle and
> > there is free ram available.
>
> I think "free ram available" is the critical thing here. If it doesn't
> evict anything else then OK, it basically uses unutilised disk bandwidth for
> free.
>
> But where does it put the pages? If it was really "free", they'd go onto
> the tail of the inactive list.

It puts them in swapcache. That seems to work nicely as a nowhere-land where
they don't have much effect on anything until we either need them or need more
ram. This has worked well, but I'm open to other suggestions.
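
Roughly, prefetching one entry boils down to pulling it into swapcache through
the stock swapin path and then dropping our own reference, something like this
(only a sketch of the shape of the operation, not necessarily what the patch
does internally):

        /*
         * Bring one swapped-out entry into swapcache; the page ends up in
         * swapcache and on the LRU, costing nothing further until it is
         * either faulted in or reclaimed under memory pressure.
         */
        struct page *page = read_swap_cache_async(entry, NULL, 0);
        if (page)
                page_cache_release(page);  /* keep only the swapcache ref */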

> And what about all those unpaged text pages which the app will need to
> fault back in?

Tracking those would make the perceived wakeup time much, much faster, but it
would also make the list a heck of a lot larger... and, more to the point, I
have no idea how to do it in a way that gets what we really want onto the list.

> > Once pages have been added to the swapped list, a timer is started,
> > testing for conditions suitable to prefetch swap pages every 5 seconds.
> > Suitable conditions are defined as lack of swapping out or in any pages,
> > and no watermark tests failing. Significant amounts of dirtied ram and
> > changes in free ram representing disk writes or reads also prevent
> > prefetching.
>
> OK. The has-the-disk-been-used-recently test still isn't there?

No, but for this indirect method to miss it, you would have to free roughly as
many pages (to within SWAP_SCAN_MAX) as you fill when reading from disk. It's
easy enough to see that it's effective: you can sit and watch vmstat for many
minutes after, say, an oom-kill, hoping to see things go quiet, before prefetch
does anything. After letting 'tail -f /dev/zero' get oom-killed in a full gui
environment it usually takes more than 5 minutes before swap prefetch starts
prefetching.
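
For reference, that indirect test is shaped roughly like the following sketch
(2.6.16-era interfaces; last_pswpin, last_pswpout and last_free are
hypothetical statics refreshed on every poll, SWAP_SCAN_MAX is used as the
tolerance, and the real test also covers the watermark and dirty-ram
conditions described above):

        static unsigned long last_pswpin, last_pswpout, last_free;

        static int vm_is_idle(void)
        {
                struct page_state ps;
                unsigned long free = nr_free_pages();

                get_page_state(&ps);
                if (ps.pswpin != last_pswpin || ps.pswpout != last_pswpout)
                        return 0;       /* pages swapped in or out recently */
                if (free > last_free + SWAP_SCAN_MAX ||
                    free + SWAP_SCAN_MAX < last_free)
                        return 0;       /* disk activity changed free ram */
                return 1;
        }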

> > @@ -844,6 +844,7 @@ int zone_watermark_ok(struct zone *z, in
> >                 if (free_pages <= min)
> >                         return 0;
> >         }
> > +
> >         return 1;
> > }
>
> ?

Legacy of v1->v22..

>
> > + read_lock(&swapper_space.tree_lock);
>
> That's interesting. We conventionally do read_lock_irq() on an
> address_space.tree_lock. Because end_page_writeback()->
> rotate_reclaimable_page()->test_clear_page_writeback() needs to take a
> write_lock from interrupt context. end_swap_bio_write() calls
> end_page_writeback() so I don't immediately see why we don't have a
> deadlock here.

Wasn't sure of the semantics. I was lucky I guess.
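
For the record, the irq-safe variant Andrew describes would look like this
(a sketch only, surrounding context assumed):

        struct page *page;

        /*
         * The tree_lock is also taken for writing from interrupt context
         * (end_swap_bio_write -> end_page_writeback ->
         * test_clear_page_writeback), so a reader that leaves interrupts
         * enabled can deadlock against a writer interrupting its own cpu.
         * Hence the _irq variants:
         */
        read_lock_irq(&swapper_space.tree_lock);
        page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
        read_unlock_irq(&swapper_space.tree_lock);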

> Ordinarily we'd just use find_get_page(), but
>
> > +       /* Entry may already exist */
> > +       page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
> > +       read_unlock(&swapper_space.tree_lock);
> > +       if (page) {
> > +               remove_from_swapped_list(entry.val);
> > +               goto out;
> > +       }
>
> you don't take a ref on the page.
>
> Which makes one wonder what happens here if `page' got whipped out of
> swapcache between the lookup and the remove_from_swapped_list(). Probably
> nothing much.
>
> Did you consider borrowing swapper_space.tree_lock to provide the list and
> radix-tree locking throughout here?

I did. I was concerned that, since most of these functions are called as we
swap pages in or out, we'd end up with more contention on that lock.
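
If it helps, the conventional refcounted lookup Andrew mentions would be along
these lines (a sketch; remove_from_swapped_list() is the patch's own helper,
the rest is the stock pagecache API):

        struct page *page;

        /*
         * find_get_page() does the tree_lock dance itself (irq safely) and
         * returns the page with its count elevated, so it cannot be freed
         * from swapcache between the lookup and the list removal.
         */
        page = find_get_page(&swapper_space, entry.val);
        if (page) {
                remove_from_swapped_list(entry.val);
                page_cache_release(page);
                goto out;
        }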

> > +out:
> > +       if (mru) {
> > +               spin_lock(&swapped.lock);
> > +               swapped.list.next = mru;
> > +               spin_unlock(&swapped.lock);
> > +       }
>
> That looks strange. What happens to swapped.list.prev?

Left dangling, ready for a future null dereference... It would only have been
hit on numa.
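
For v24, one way to rotate the head without leaving ->prev stale is to move it
with the list helpers (a sketch, assuming mru is a struct list_head * pointing
into the swapped list):

        if (mru) {
                spin_lock(&swapped.lock);
                /*
                 * Re-insert the head just before the mru entry instead of
                 * poking ->next by hand; list_move() keeps both ->next and
                 * ->prev consistent, so the next scan still starts at mru.
                 * If mru is already the first entry there is nothing to do.
                 */
                if (swapped.list.next != mru)
                        list_move(&swapped.list, mru->prev);
                spin_unlock(&swapped.lock);
        }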

> > + schedule_timeout_interruptible(MAX_SCHEDULE_TIMEOUT);
>
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();

Ack.
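
In context that would become something like the following (a sketch of the
prefetch thread's main loop; the two helpers are stand-ins, not the patch's
actual function names):

        while (!kthread_should_stop()) {
                try_to_freeze();

                if (vm_is_idle())               /* stand-in for the real test */
                        prefetch_some_swap();   /* stand-in for the real work */

                /* Sleep until the 5 second timer (or anything else) wakes
                 * us, rather than using a bounded schedule_timeout(). */
                set_current_state(TASK_INTERRUPTIBLE);
                schedule();
        }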

Thanks! I guess I will be working on v24 in the near future...

Cheers,
Con