Re: [PATCH 5/5] PM/Hibernate: Do not release preallocated memoryunnecessarily (rev. 2)

From: Wu Fengguang
Date: Wed May 06 2009 - 09:52:20 EST


On Wed, May 06, 2009 at 07:05:09AM +0800, Rafael J. Wysocki wrote:
> On Tuesday 05 May 2009, Wu Fengguang wrote:
> > On Mon, May 04, 2009 at 08:22:38AM +0800, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rjw@xxxxxxx>
> > >
> > > Since the hibernation code is now going to use allocations of memory
> > > to create enough room for the image, it can also use the page frames
> > > allocated at this stage as image page frames. The low-level
> > > hibernation code needs to be rearranged for this purpose, but it
> > > allows us to avoid freeing a great number of pages and allocating
> > > these same pages once again later, so it generally is worth doing.
> > >
> > > [rev. 2: Change the strategy of preallocating memory to allocate as
> > > many pages as needed to get the right image size in one shot (the
> > > excessive allocated pages are released afterwards).]
> >
> > Rafael, I tried out your patches and found doubled memory shrink speed!
> >
> > [ 579.641781] PM: Preallocating image memory ... done (allocated 383900 pages, 128000 image pages kept)
> > [ 583.087875] PM: Allocated 1535600 kbytes in 3.43 seconds (447.69 MB/s)
>
> Unfortunately, I'm observing a regression and a huge one.
>
> On my Atom-based test box with 1 GB of RAM after a fresh boot and starting X
> with KDE 4 there are ~256 MB free. To create an image we need to free ~300 MB
> and that takes ~2 s with the old code and ~15 s with the new one.
>
> It helps to call shrink_all_memory() once with a sufficiently large argument
> before the preallocation.

Yes there are some strange behaviors. I tried to populate the page
cache with 1/30 mapped file pages and others normal file pages, all
referenced once. I get this on "echo disk > /sys/power/state":

[ 462.820098] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 462.827161] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 462.834249] PM: Basic memory bitmaps created
[ 462.838631] PM: Syncing filesystems ... done.
[ 463.167805] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 463.175738] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 463.183834] PM: Preallocating image memory ... done (allocated 383898 pages, 128000 image pages kept)
[ 469.605741] PM: Allocated 1535592 kbytes in 6.41 seconds (239.56 MB/s)
[ 469.612325]
[ 469.768796] Restarting tasks ... done.
[ 469.775044] PM: Basic memory bitmaps freed

Immediately after that, I copied a big sparse file into memory, and get this:

[ 508.097913] PM: Marking nosave pages: 0000000000001000 - 0000000000006000
[ 508.104799] PM: Marking nosave pages: 000000000009f000 - 0000000000100000
[ 508.111702] PM: Basic memory bitmaps created
[ 508.116073] PM: Syncing filesystems ... done.
[ 509.208608] Freezing user space processes ... (elapsed 0.00 seconds) done.
[ 509.216692] Freezing remaining freezable tasks ... (elapsed 0.00 seconds) done.
[ 509.224708] PM: Preallocating image memory ... done (allocated 383872 pages, 128000 image pages kept)
[ 520.951882] PM: Allocated 1535488 kbytes in 11.71 seconds (131.12 MB/s)

It's much worse.

Your patches are really interesting exercises for the vmscan code ;-)

> > + error = memory_bm_create(&orig_bm, GFP_IMAGE, PG_ANY);
> > + if (error)
> > + goto err_out;
> > +
> > + error = memory_bm_create(&copy_bm, GFP_IMAGE, PG_ANY);
> > + if (error)
> > + goto err_out;
> >
> > memory_bm_create() is called a number of times, each time it will
> > call create_mem_extents()/memory_bm_free(). Can they be optimized to
> > be called only once?
>
> Possibly, but not right now if you please? This is just moving code BTW.

OK.

>
> > A side note: there are somehow duplicated *_extent_*() logics in the
> > filesystems, is it possible that we abstract out some of the common code?
>
> I think we can do it, but it really is low priority to me at the moment.

OK. Just was a wild thought.

>
> > + for_each_populated_zone(zone) {
> > + size += snapshot_additional_pages(zone);
> > + count += zone_page_state(zone, NR_FREE_PAGES);
> > + if (!is_highmem(zone))
> > + count -= zone->lowmem_reserve[ZONE_NORMAL];
> > + }
> >
> > Why [ZONE_NORMAL] instead of [zone]? ZONE_NORMAL may not always be the largest zone,
> > for example, My 4GB laptop has a tiny ZONE_NORMAL and a large ZONE_DMA32.
>
> Ah, this is a leftover and it should be changed or even dropped. Can you
> please remind me how exactly lowmem_reserve[] is supposed to work?

totalreserve_pages could be better. When free memory drops below that
threshold(it actually works per zone), kswapd will wake up trying to
reclaim pages. If the total reclaimable+free pages are as low as
totalreserve_pages, that would drive kswapd mad - scanning the whole
zones, trying to squeeze the last pages out of them. Sure kswapd will
stop somewhere, but the resulting scan:reclaim ratio would be pretty
high and therefore hurt performance.

So we shall stop preallocation when reclaimable pages go down to
something like (5*totalreserve_pages). The vmscan mad may come earlier
because of unbalanced distributions of reclaimable pages among the zones.

> > At last, I'd express my major concern about the transition to preallocate
> > based memory shrinking: will it lead to more random swapping IOs?
>
> Hmm. I don't see immediately why would it. Maybe the regression I'm seeing
> is related to that ...

OK. Anyway a preallocate based shrinking policy could be far from optimal.
I'd suggest to switch to user space directed shrinking via fadvise(DONTNEED),
and leave the kernel one a fail safe path. The user space tool could
gather page information from the filecache interface which I've been
maintaining out of tree, and to drop inactive/active pages from large
files first. That should be a better policy at least for rotational disks.

Thanks,
Fengguang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/