Re: [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOCreserves are low and swap is backed by network storage

From: Mel Gorman
Date: Wed May 02 2012 - 12:24:21 EST

Next message: Arnaldo Carvalho de Melo: "[GIT PULL 0/2] perf/urgent fixes"
Previous message: Mel Gorman: "Re: [PATCH 05/16] mm: allow PF_MEMALLOC from softirq context"
In reply to: Andrew Morton: "Re: [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOCreserves are low and swap is backed by network storage"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, May 01, 2012 at 03:24:37PM -0700, Andrew Morton wrote:
> On Mon, 16 Apr 2012 13:17:02 +0100
> Mel Gorman <mgorman@xxxxxxx> wrote:
>
> > If swap is backed by network storage such as NBD, there is a risk
> > that a large number of reclaimers can hang the system by consuming
> > all PF_MEMALLOC reserves. To avoid these hangs, the administrator
> > must tune min_free_kbytes in advance which is a bit fragile.
> >
> > This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
> > are in use. If the system is routinely getting throttled the system
> > administrator can increase min_free_kbytes so degradation is smoother
> > but the system will keep running.
> >
> >
> > ...
> >
> > +static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
> > +{
> > + struct zone *zone;
> > + unsigned long pfmemalloc_reserve = 0;
> > + unsigned long free_pages = 0;
> > + int i;
> > + bool wmark_ok;
> > +
> > + for (i = 0; i <= ZONE_NORMAL; i++) {
> > + zone = &pgdat->node_zones[i];
> > + pfmemalloc_reserve += min_wmark_pages(zone);
> > + free_pages += zone_page_state(zone, NR_FREE_PAGES);
> > + }
> > +
> > + wmark_ok = (free_pages > pfmemalloc_reserve / 2) ? true : false;
>
> wmark_ok = free_pages > pfmemalloc_reserve / 2;
>

Of course, I don't know what I was on when I wrote that particular line.

> > +
> > + /* kswapd must be awake if processes are being throttled */
> > + if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> > + pgdat->classzone_idx = min(pgdat->classzone_idx,
> > + (enum zone_type)ZONE_NORMAL);
> > + wake_up_interruptible(&pgdat->kswapd_wait);
> > + }
> > +
> > + return wmark_ok;
> > +}
> > +
> > +/*
> > + * Throttle direct reclaimers if backing storage is backed by the network
> > + * and the PFMEMALLOC reserve for the preferred node is getting dangerously
> > + * depleted. kswapd will continue to make progress and wake the processes
> > + * when the low watermark is reached
> > + */
> > +static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> > + nodemask_t *nodemask)
> > +{
> > + struct zone *zone;
> > + int high_zoneidx = gfp_zone(gfp_mask);
> > + pg_data_t *pgdat;
> > +
> > + /* Kernel threads such as kjournald should not be throttled */
>
> The comment should explain "why", not "what". Particularly when the
> "what" was bleedin obvious ;)
>
> Also... why?
>

/*
* Kernel threads should not be throttled as they may be indirectly
* responsible for cleaning pages necessary for reclaim to make forward
* progress. kjournald for example may enter direct reclaim while
* committing a transaction where throttling it could forcing other
* processes to block on log_wait_commit()
*/

Does that help?

> > + if (current->flags & PF_KTHREAD)
> > + return;
> > +
> > + /* Check if the pfmemalloc reserves are ok */
> > + first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
> > + pgdat = zone->zone_pgdat;
> > + if (pfmemalloc_watermark_ok(pgdat))
> > + return;
> > +
> > + /*
> > + * If the caller cannot enter the filesystem, it's possible that it
> > + * is processing a journal transaction. In this case, it is not safe
> > + * to block on pfmemalloc_wait as kswapd could also be blocked waiting
> > + * to start a transaction. Instead, throttle for up to a second before
> > + * the reclaim must continue.
> > + */
>
> I suppose this applies to fs locks in general, not just to
> journal_start()?
>

Yes. I updated the comment to reflect that.

/*
* If the caller cannot enter the filesystem, it's possible that it
* is due to the caller holding an FS lock or performing a journal
* transaction in the case of a filesystem like ext[3|4]. In this case,
* it is not safe to block on pfmemalloc_wait as kswapd could be
* blocked waiting on the same lock. Instead, throttle for up to a
* second before continuing.
*/

> > + if (!(gfp_mask & __GFP_FS)) {
> > + wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
> > + pfmemalloc_watermark_ok(pgdat), HZ);
> > + return;
> > + }
> > +
> > + /* Throttle until kswapd wakes the process */
> > + wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
> > + pfmemalloc_watermark_ok(pgdat));
> > +}
> > +
> > unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > gfp_t gfp_mask, nodemask_t *nodemask)
> > {
> >
> > ...
> >
> > @@ -2610,6 +2686,20 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > if (remaining)
> > return true;
> >
> > + /*
> > + * There is a potential race between when kswapd checks it watermarks
>
> "its"
>

Fixed.

> > + * and a process gets throttled. There is also a potential race if
> > + * processes get throttled, kswapd wakes, a large process exits therby
> > + * balancing the zones that causes kswapd to miss a wakeup. If kswapd
> > + * is going to sleep, no process should be sleeping on pfmemalloc_wait
> > + * so wake them now if necessary. If necessary, processes will wake
> > + * kswapd and get throttled again
> > + */
>
> Yes, the possibility for missed wakeups here worried me. There's no
> synchronization and it would be easy to leave holes.
>
> It's good that there is no timeout on the throttling - a timeout would
> cover up rare races most nastily.
>

Yes and I wanted to avoid that. If there is a lost wakup, sysrq+t should
show processes stuck in throttle_direct_reclaim() while kswapd is asleep.

> > + if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
> > + wake_up(&pgdat->pfmemalloc_wait);
> > + return true;
> > + }
>
> A bool-returning function called "sleeping_prematurely" should have no
> side-effects. But it now performs wakeups. Wanna see if there is a
> way of making this nicer?
>

Minimally, the two instances of "There is a potential race" was a
merging mistake so I deleted the one in kswapd_try_to_sleep().

I looked at moving this wake_up outside sleeping_prematurely() but it
looked worse really. What I did instead was rename
sleeping_prematurely() to prepare_kswapd_sleep() and and commented it
like this

/*
* Prepare kswapd for sleeping. This verifies that there are no processes
* waiting in throttle_direct_reclaim() and that watermarks have been
* met.
*
* Returns true if kswapd is ready to sleep
*/
static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
int classzone_idx)

It's cheating a bit but a name like "prepare" implies that it may have
side-effects.

> > /* Check the watermark levels */
> > for (i = 0; i <= classzone_idx; i++) {
> > struct zone *zone = pgdat->node_zones + i;
> > @@ -2871,6 +2961,12 @@ loop_again:
> > }
> >
> > }
> > +
> > + /* Wake throttled direct reclaimers if low watermark is met */
>
> s/"what"/"why"/ !
>

/*
* If the low watermark is met there is no need for processes
* to be throttled on pfmemalloc_wait as they should not be
* able to safely make forward progress. Wake them
*/

?

Here is how the patch currently stands

---8<---
mm: Throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage

If swap is backed by network storage such as NBD, there is a risk
that a large number of reclaimers can hang the system by consuming
all PF_MEMALLOC reserves. To avoid these hangs, the administrator
must tune min_free_kbytes in advance which is a bit fragile.

This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
are in use. If the system is routinely getting throttled the system
administrator can increase min_free_kbytes so degradation is smoother
but the system will keep running.

Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 128 +++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dff7115..e6b733d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -663,6 +663,7 @@ typedef struct pglist_data {
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
+ wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e225a7c..b9eb64a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4326,6 +4326,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
pgdat_resize_init(pgdat);
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
+ init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33c332b..6f322e8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2431,6 +2431,80 @@ out:
return 0;
}

+static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
+{
+ struct zone *zone;
+ unsigned long pfmemalloc_reserve = 0;
+ unsigned long free_pages = 0;
+ int i;
+ bool wmark_ok;
+
+ for (i = 0; i <= ZONE_NORMAL; i++) {
+ zone = &pgdat->node_zones[i];
+ pfmemalloc_reserve += min_wmark_pages(zone);
+ free_pages += zone_page_state(zone, NR_FREE_PAGES);
+ }
+
+ wmark_ok = free_pages > pfmemalloc_reserve / 2;
+
+ /* kswapd must be awake if processes are being throttled */
+ if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+ pgdat->classzone_idx = min(pgdat->classzone_idx,
+ (enum zone_type)ZONE_NORMAL);
+ wake_up_interruptible(&pgdat->kswapd_wait);
+ }
+
+ return wmark_ok;
+}
+
+/*
+ * Throttle direct reclaimers if backing storage is backed by the network
+ * and the PFMEMALLOC reserve for the preferred node is getting dangerously
+ * depleted. kswapd will continue to make progress and wake the processes
+ * when the low watermark is reached
+ */
+static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+ nodemask_t *nodemask)
+{
+ struct zone *zone;
+ int high_zoneidx = gfp_zone(gfp_mask);
+ pg_data_t *pgdat;
+
+ /*
+ * Kernel threads should not be throttled as they may be indirectly
+ * responsible for cleaning pages necessary for reclaim to make forward
+ * progress. kjournald for example may enter direct reclaim while
+ * committing a transaction where throttling it could forcing other
+ * processes to block on log_wait_commit().
+ */
+ if (current->flags & PF_KTHREAD)
+ return;
+
+ /* Check if the pfmemalloc reserves are ok */
+ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
+ pgdat = zone->zone_pgdat;
+ if (pfmemalloc_watermark_ok(pgdat))
+ return;
+
+ /*
+ * If the caller cannot enter the filesystem, it's possible that it
+ * is due to the caller holding an FS lock or performing a journal
+ * transaction in the case of a filesystem like ext[3|4]. In this case,
+ * it is not safe to block on pfmemalloc_wait as kswapd could be
+ * blocked waiting on the same lock. Instead, throttle for up to a
+ * second before continuing.
+ */
+ if (!(gfp_mask & __GFP_FS)) {
+ wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat), HZ);
+ return;
+ }
+
+ /* Throttle until kswapd wakes the process */
+ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
+ pfmemalloc_watermark_ok(pgdat));
+}
+
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
@@ -2449,6 +2523,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.gfp_mask = sc.gfp_mask,
};

+ throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
+
+ /*
+ * Do not enter reclaim if fatal signal is pending. 1 is returned so
+ * that the page allocator does not consider triggering OOM
+ */
+ if (fatal_signal_pending(current))
+ return 1;
+
trace_mm_vmscan_direct_reclaim_begin(order,
sc.may_writepage,
gfp_mask);
@@ -2598,8 +2681,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
return balanced_pages >= (present_pages >> 2);
}

-/* is kswapd sleeping prematurely? */
-static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
+/*
+ * Prepare kswapd for sleeping. This verifies that there are no processes
+ * waiting in throttle_direct_reclaim() and that watermarks have been met.
+ *
+ * Returns true if kswapd is ready to sleep
+ */
+static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
int classzone_idx)
{
int i;
@@ -2608,7 +2696,21 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,

/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
- return true;
+ return false;
+
+ /*
+ * There is a potential race between when kswapd checks its watermarks
+ * and a process gets throttled. There is also a potential race if
+ * processes get throttled, kswapd wakes, a large process exits therby
+ * balancing the zones that causes kswapd to miss a wakeup. If kswapd
+ * is going to sleep, no process should be sleeping on pfmemalloc_wait
+ * so wake them now if necessary. If necessary, processes will wake
+ * kswapd and get throttled again
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
+ wake_up(&pgdat->pfmemalloc_wait);
+ return false;
+ }

/* Check the watermark levels */
for (i = 0; i <= classzone_idx; i++) {
@@ -2641,9 +2743,9 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
* must be balanced
*/
if (order)
- return !pgdat_balanced(pgdat, balanced, classzone_idx);
+ return pgdat_balanced(pgdat, balanced, classzone_idx);
else
- return !all_zones_ok;
+ return all_zones_ok;
}

/*
@@ -2871,6 +2973,16 @@ loop_again:
}

}
+
+ /*
+ * If the low watermark is met there is no need for processes
+ * to be throttled on pfmemalloc_wait as they should not be
+ * able to safely make forward progress. Wake them
+ */
+ if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
+ pfmemalloc_watermark_ok(pgdat))
+ wake_up(&pgdat->pfmemalloc_wait);
+
if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
break; /* kswapd: all done */
/*
@@ -2971,7 +3083,7 @@ out:
}

/*
- * Return the order we were reclaiming at so sleeping_prematurely()
+ * Return the order we were reclaiming at so prepare_kswapd_sleep()
* makes a decision on the order we were last reclaiming at. However,
* if another caller entered the allocator slow path while kswapd
* was awake, order will remain at the higher level
@@ -2991,7 +3103,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

/* Try to sleep for a short interval */
- if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+ if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
remaining = schedule_timeout(HZ/10);
finish_wait(&pgdat->kswapd_wait, &wait);
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3001,7 +3113,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
* After a short sleep, check if it was a premature sleep. If not, then
* go fully to sleep until explicitly woken up.
*/
- if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx)) {
+ if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Arnaldo Carvalho de Melo: "[GIT PULL 0/2] perf/urgent fixes"
Previous message: Mel Gorman: "Re: [PATCH 05/16] mm: allow PF_MEMALLOC from softirq context"
In reply to: Andrew Morton: "Re: [PATCH 15/16] mm: Throttle direct reclaimers if PF_MEMALLOCreserves are low and swap is backed by network storage"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]