Re: [PATCH v2] mm/zswap: Check all pool pages instead of one pool pages

From: Cai Liu
Date: Tue Jan 21 2014 - 08:52:33 EST

Next message: Linus Walleij: "[PATCH] ARM: u300: move timer driver to clocksource"
Previous message: Grant Likely: "[PATCH 1/3] scripts/dtc: Update shipped files to build with bison 2.7.12"
In reply to: Minchan Kim: "Re: [PATCH v2] mm/zswap: Check all pool pages instead of one poolpages"
Next in thread: Minchan Kim: "Re: [PATCH v2] mm/zswap: Check all pool pages instead of one poolpages"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello Minchan

2014/1/21 Minchan Kim <minchan@xxxxxxxxxx>:
> Hello,
>
> On Tue, Jan 21, 2014 at 02:35:07PM +0800, Cai Liu wrote:
>> 2014/1/21 Minchan Kim <minchan@xxxxxxxxxx>:
>> > Please check your MUA and don't break thread.
>> >
>> > On Tue, Jan 21, 2014 at 11:07:42AM +0800, Cai Liu wrote:
>> >> Thanks for your review.
>> >>
>> >> 2014/1/21 Minchan Kim <minchan@xxxxxxxxxx>:
>> >> > Hello Cai,
>> >> >
>> >> > On Mon, Jan 20, 2014 at 03:50:18PM +0800, Cai Liu wrote:
>> >> >> zswap can support multiple swapfiles. So we need to check
>> >> >> all zbud pool pages in zswap.
>> >> >>
>> >> >> Version 2:
>> >> >> * add *total_zbud_pages* in zbud to record all the pages in pools
>> >> >> * move the updating of pool pages statistics to
>> >> >> alloc_zbud_page/free_zbud_page to hide the details
>> >> >>
>> >> >> Signed-off-by: Cai Liu <cai.liu@xxxxxxxxxxx>
>> >> >> ---
>> >> >> include/linux/zbud.h | 2 +-
>> >> >> mm/zbud.c | 44 ++++++++++++++++++++++++++++++++------------
>> >> >> mm/zswap.c | 4 ++--
>> >> >> 3 files changed, 35 insertions(+), 15 deletions(-)
>> >> >>
>> >> >> diff --git a/include/linux/zbud.h b/include/linux/zbud.h
>> >> >> index 2571a5c..1dbc13e 100644
>> >> >> --- a/include/linux/zbud.h
>> >> >> +++ b/include/linux/zbud.h
>> >> >> @@ -17,6 +17,6 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle);
>> >> >> int zbud_reclaim_page(struct zbud_pool *pool, unsigned int retries);
>> >> >> void *zbud_map(struct zbud_pool *pool, unsigned long handle);
>> >> >> void zbud_unmap(struct zbud_pool *pool, unsigned long handle);
>> >> >> -u64 zbud_get_pool_size(struct zbud_pool *pool);
>> >> >> +u64 zbud_get_pool_size(void);
>> >> >>
>> >> >> #endif /* _ZBUD_H_ */
>> >> >> diff --git a/mm/zbud.c b/mm/zbud.c
>> >> >> index 9451361..711aaf4 100644
>> >> >> --- a/mm/zbud.c
>> >> >> +++ b/mm/zbud.c
>> >> >> @@ -52,6 +52,13 @@
>> >> >> #include <linux/spinlock.h>
>> >> >> #include <linux/zbud.h>
>> >> >>
>> >> >> +/*********************************
>> >> >> +* statistics
>> >> >> +**********************************/
>> >> >> +
>> >> >> +/* zbud pages in all pools */
>> >> >> +static u64 total_zbud_pages;
>> >> >> +
>> >> >> /*****************
>> >> >> * Structures
>> >> >> *****************/
>> >> >> @@ -142,10 +149,28 @@ static struct zbud_header *init_zbud_page(struct page *page)
>> >> >> return zhdr;
>> >> >> }
>> >> >>
>> >> >> +static struct page *alloc_zbud_page(struct zbud_pool *pool, gfp_t gfp)
>> >> >> +{
>> >> >> + struct page *page;
>> >> >> +
>> >> >> + page = alloc_page(gfp);
>> >> >> +
>> >> >> + if (page) {
>> >> >> + pool->pages_nr++;
>> >> >> + total_zbud_pages++;
>> >> >
>> >> > Who protect race?
>> >>
>> >> Yes, here the pool->pages_nr and also the total_zbud_pages are not protected.
>> >> I will re-do it.
>> >>
>> >> I will change *total_zbud_pages* to atomic type.
>> >
>> > Wait, it doesn't make sense. Now, you assume zbud allocator would be used
>> > for only zswap. It's true until now but we couldn't make sure it in future.
>> > If other user start to use zbud allocator, total_zbud_pages would be pointless.
>>
>> Yes, you are right. ZBUD is a common module. So in this patch calculate the
>> zswap pool size in zbud is not suitable.
>>
>> >
>> > Another concern is that what's your scenario for above two swap?
>> > How often we need to call zbud_get_pool_size?
>> > In previous your patch, you reduced the number of call so IIRC,
>> > we only called it in zswap_is_full and for debugfs.
>>
>> zbud_get_pool_size() is called frequently when adding/freeing zswap
>> entry happen in zswap . This is why in this patch I added a counter in zbud,
>> and then in zswap the iteration of zswap_list to calculate the pool size will
>> not be needed.
>
> We can remove updating zswap_pool_pages in zswap_frontswap_store and
> zswap_free_entry as I said. So zswap_is_full is only hot spot.
> Do you think it's still big overhead? Why? Maybe locking to prevent
> destroying? Then, we can use RCU to minimize the overhead as I mentioned.

I get your point. Yes, In my previous patch, zswap_is_full() was the
only path to call
zbud_get_pool_size(). And your suggestion on patch v1 to remove the unnecessary
iteration will reduce the overhead further.

So adding the calculating of all the pool size in zswap.c is better.

>>
>> > Of course, it would need some lock or refcount to prevent destroy
>> > of zswap_tree in parallel with zswap_frontswap_invalidate_area but
>> > zswap_is_full doesn't need to be exact so RCU would be good fit.
>> >
>> > Most important point is that now zswap doesn't consider multiple swap.
>> > For example, Let's assume you uses two swap A and B with different priority
>> > and A already has charged 19% long time ago and let's assume that A swap is
>> > full now so VM start to use B so that B has charged 1% recently.
>> > It menas zswap charged (19% + 1%)i is full by default.
>> >
>> > Then, if VM want to swap out more pages into B, zbud_reclaim_page
>> > would be evict one of pages in B's pool and it would be repeated
>> > continuously. It's totally LRU reverse problem and swap thrashing in B
>> > would happen.
>> >
>>
>> The scenario is below:
>> There are 2 swap A, B in system. If pool size of A reach 19% of ram
>> size and swap A
>> is also full. Then swap B will be used. Pool size of B will be
>> increased until it hit
>> the 20% of the ram size. By now zswap pool size is about 39% of ram size.
>> If there are more than 2 swap file/device, zswap pool will expand out
>> of control
>> and there may be no swapout happened.
>
> I know.
>
>>
>> I think the original intention of zswap designer is to keep the total
>> zswap pools size below
>> 20% of RAM size.
>
> My point is your patch still doesn't solve the example I mentioned.

Hmm. My patch only make sure all the zswap pools use maximum 20% of
RAM size. It is a new problem in your example. The zbud_reclaim_page would
not swap out the oldest zbud page when multiple swaps are used.

Maybe the new problem can be resolved in another patch.

Thanks.

>
>>
>> Thanks.
>>
>> > Please say your usecase scenario and if it's really problem,
>> > we need more surgery.
>> >
>> > Thanks.
>> >
>> >> For *pool->pages_nr*, one way is to use pool->lock to protect. But I
>> >> think it is too heavy.
>> >> So does it ok to change pages_nr to atomic type too?
>> >>
>> >>
>> >> >
>> >> >> + }
>> >> >> +
>> >> >> + return page;
>> >> >> +}
>> >> >> +
>> >> >> +
>> >> >> /* Resets the struct page fields and frees the page */
>> >> >> -static void free_zbud_page(struct zbud_header *zhdr)
>> >> >> +static void free_zbud_page(struct zbud_pool *pool, struct zbud_header *zhdr)
>> >> >> {
>> >> >> __free_page(virt_to_page(zhdr));
>> >> >> +
>> >> >> + pool->pages_nr--;
>> >> >> + total_zbud_pages--;
>> >> >> }
>> >> >>
>> >> >> /*
>> >> >> @@ -279,11 +304,10 @@ int zbud_alloc(struct zbud_pool *pool, int size, gfp_t gfp,
>> >> >>
>> >> >> /* Couldn't find unbuddied zbud page, create new one */
>> >> >> spin_unlock(&pool->lock);
>> >> >> - page = alloc_page(gfp);
>> >> >> + page = alloc_zbud_page(pool, gfp);
>> >> >> if (!page)
>> >> >> return -ENOMEM;
>> >> >> spin_lock(&pool->lock);
>> >> >> - pool->pages_nr++;
>> >> >> zhdr = init_zbud_page(page);
>> >> >> bud = FIRST;
>> >> >>
>> >> >> @@ -349,8 +373,7 @@ void zbud_free(struct zbud_pool *pool, unsigned long handle)
>> >> >> if (zhdr->first_chunks == 0 && zhdr->last_chunks == 0) {
>> >> >> /* zbud page is empty, free */
>> >> >> list_del(&zhdr->lru);
>> >> >> - free_zbud_page(zhdr);
>> >> >> - pool->pages_nr--;
>> >> >> + free_zbud_page(pool, zhdr);
>> >> >> } else {
>> >> >> /* Add to unbuddied list */
>> >> >> freechunks = num_free_chunks(zhdr);
>> >> >> @@ -447,8 +470,7 @@ next:
>> >> >> * Both buddies are now free, free the zbud page and
>> >> >> * return success.
>> >> >> */
>> >> >> - free_zbud_page(zhdr);
>> >> >> - pool->pages_nr--;
>> >> >> + free_zbud_page(pool, zhdr);
>> >> >> spin_unlock(&pool->lock);
>> >> >> return 0;
>> >> >> } else if (zhdr->first_chunks == 0 ||
>> >> >> @@ -496,14 +518,12 @@ void zbud_unmap(struct zbud_pool *pool, unsigned long handle)
>> >> >>
>> >> >> /**
>> >> >> * zbud_get_pool_size() - gets the zbud pool size in pages
>> >> >> - * @pool: pool whose size is being queried
>> >> >> *
>> >> >> - * Returns: size in pages of the given pool. The pool lock need not be
>> >> >> - * taken to access pages_nr.
>> >> >> + * Returns: size in pages of all the zbud pools.
>> >> >> */
>> >> >> -u64 zbud_get_pool_size(struct zbud_pool *pool)
>> >> >> +u64 zbud_get_pool_size(void)
>> >> >> {
>> >> >> - return pool->pages_nr;
>> >> >> + return total_zbud_pages;
>> >> >> }
>> >> >>
>> >> >> static int __init init_zbud(void)
>> >> >> diff --git a/mm/zswap.c b/mm/zswap.c
>> >> >> index 5a63f78..ef44d9d 100644
>> >> >> --- a/mm/zswap.c
>> >> >> +++ b/mm/zswap.c
>> >> >> @@ -291,7 +291,7 @@ static void zswap_free_entry(struct zswap_tree *tree,
>> >> >> zbud_free(tree->pool, entry->handle);
>> >> >> zswap_entry_cache_free(entry);
>> >> >> atomic_dec(&zswap_stored_pages);
>> >> >> - zswap_pool_pages = zbud_get_pool_size(tree->pool);
>> >> >> + zswap_pool_pages = zbud_get_pool_size();
>> >> >> }
>> >> >>
>> >> >> /* caller must hold the tree lock */
>> >> >> @@ -716,7 +716,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>> >> >>
>> >> >> /* update stats */
>> >> >> atomic_inc(&zswap_stored_pages);
>> >> >> - zswap_pool_pages = zbud_get_pool_size(tree->pool);
>> >> >> + zswap_pool_pages = zbud_get_pool_size();
>> >> >>
>> >> >> return 0;
>> >> >>
>> >> >> --
>> >> >> 1.7.10.4
>> >> >>
>> >> >> --
>> >> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> >> >> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
>> >> >> see: http://www.linux-mm.org/ .
>> >> >> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>> >> >
>> >> > --
>> >> > Kind regards,
>> >> > Minchan Kim
>> >>
>> >> --
>> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> >> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
>> >> see: http://www.linux-mm.org/ .
>> >> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>
>
> --
> Kind regards,
> Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Linus Walleij: "[PATCH] ARM: u300: move timer driver to clocksource"
Previous message: Grant Likely: "[PATCH 1/3] scripts/dtc: Update shipped files to build with bison 2.7.12"
In reply to: Minchan Kim: "Re: [PATCH v2] mm/zswap: Check all pool pages instead of one poolpages"
Next in thread: Minchan Kim: "Re: [PATCH v2] mm/zswap: Check all pool pages instead of one poolpages"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]