Re: [PATCH 08/27] mm, vmscan: Simplify the logic deciding whether kswapd sleeps
From: Mel Gorman
Date: Thu Jun 16 2016 - 04:30:45 EST
On Wed, Jun 15, 2016 at 05:18:00PM +0200, Vlastimil Babka wrote:
> >@@ -1209,9 +1209,10 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
> >
> > arch_refresh_nodedata(nid, pgdat);
> > } else {
> >- /* Reset the nr_zones and classzone_idx to 0 before reuse */
> >+ /* Reset the nr_zones, order and classzone_idx before reuse */
> > pgdat->nr_zones = 0;
> >- pgdat->classzone_idx = 0;
> >+ pgdat->kswapd_order = 0;
> >+ pgdat->kswapd_classzone_idx = -1;
> > }
> >
> > /* we can use NODE_DATA(nid) from here */
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 4ce578b969da..d8cb483d5cad 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -6036,7 +6036,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
> > unsigned long end_pfn = 0;
> >
> > /* pg_data_t should be reset to zero when it's allocated */
> >- WARN_ON(pgdat->nr_zones || pgdat->classzone_idx);
> >+ WARN_ON(pgdat->nr_zones || pgdat->kswapd_classzone_idx);
>
> Above you changed the reset value of kswapd_classzone_idx from 0 to -1, so
> won't this trigger? Also should we check kswapd_order that's newly reset
> too?
>
Good spot. The memory initialisation paths are ok but node memory hotplug
was broken.
> >
> > reset_deferred_meminit(pgdat);
> > pgdat->node_id = nid;
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index 96bf841f9352..14b34eebedff 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -2727,7 +2727,7 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> >
> > /* kswapd must be awake if processes are being throttled */
> > if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> >- pgdat->classzone_idx = min(pgdat->classzone_idx,
> >+ pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
> > (enum zone_type)ZONE_NORMAL);
> > wake_up_interruptible(&pgdat->kswapd_wait);
> > }
> >@@ -3211,6 +3211,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
> >
> > prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >
> >+ /* If kswapd has not been woken recently, then full sleep */
> >+ if (classzone_idx == -1) {
> >+ classzone_idx = balanced_classzone_idx = MAX_NR_ZONES - 1;
> >+ goto full_sleep;
>
> This will skip the wakeup_kcompactd() part.
>
I wrestled with this one. I decided to leave it alone on the grounds
that if kswapd has not been woken recently then compaction efforts also
have not failed and kcompactd is not required.
> >@@ -3311,38 +3316,25 @@ static int kswapd(void *p)
> > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> > set_freezable();
> >
> >- order = new_order = 0;
> >- classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> >- balanced_classzone_idx = classzone_idx;
> >+ pgdat->kswapd_order = order = 0;
> >+ pgdat->kswapd_classzone_idx = classzone_idx = -1;
> > for ( ; ; ) {
> > bool ret;
> >
> >+kswapd_try_sleep:
> >+ kswapd_try_to_sleep(pgdat, order, classzone_idx, classzone_idx);
>
> The last two parameters are now the same, remove one?
>
Yes. A few more basic simplifications are then possible.
> >@@ -3352,12 +3344,19 @@ static int kswapd(void *p)
> > * We can speed up thawing tasks if we don't call balance_pgdat
> > * after returning from the refrigerator
> > */
> >- if (!ret) {
> >- trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >+ if (ret)
> >+ continue;
> >
> >- /* return value ignored until next patch */
> >- balance_pgdat(pgdat, order, classzone_idx);
> >- }
> >+ /*
> >+ * Try reclaim the requested order but if that fails
> >+ * then try sleeping on the basis of the order reclaimed.
>
> Is the last word really meant to be "reclaimed", or "requested"?
>
No, I really meant reclaimed.
> >+ */
> >+ trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> >+ if (balance_pgdat(pgdat, order, classzone_idx) < order)
> >+ goto kswapd_try_sleep;
>
> AFAICS now kswapd_try_to_sleep() will use the "requested" order. That's
> needed for proper wakeup_kcompactd(), but won't it prevent kswapd from
> actually going to sleep, because zone_balanced() in prepare-sleep will be
> false? So I think you need to give it both orders to do the right thing?
>
You're right. There is a risk that kswapd stays awake longer in high
fragmentation scenarios.
Should be fixed now by passing in both orders.
--
Mel Gorman
SUSE Labs