[PATCH 5/5] cpuset: memory reclaim rate meter

From: Paul Jackson
Date: Fri Nov 04 2005 - 00:32:47 EST


Provide a simple per-cpuset metric of memory stress, tracking the
-rate- that the tasks in a cpuset call try_to_free_pages(), the
synchronous (direct) memory reclaim code.

This enables batch managers that monitor jobs running in dedicated
cpusets to efficiently detect the level of memory stress each job
is encountering.

This is useful both on tightly managed systems running a wide mix
of submitted jobs, which may choose to terminate or reprioritize
jobs that are trying to use more memory than allowed on the nodes
assigned them, and with tightly coupled, long running, massively
parallel scientific computing jobs that will dramatically fail to
meet required performance goals if they start to swap.

This patch just provides a very economical way for the batch manager
to monitor a cpuset for signs of memory distress. It's up to the
batch manager or other user code to decide what to do about it and
take action.

A simple per-cpuset digital filter (a spinlock plus 3 words of data
per cpuset) is kept, and is updated by any task attached to that
cpuset whenever the task enters the synchronous (direct) page
reclaim code.
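
For readers who want to experiment with the filter outside the kernel,
here is a minimal user-space sketch of the same arithmetic used by
fmeter_update() in this patch. The names fmeter_sim, sim_update(),
sim_markevent() and sim_getrate() are hypothetical and exist only for
this illustration. The clock is passed in explicitly instead of calling
get_seconds(), and the spinlock is left out, so this is not the kernel
code itself.

```c
#include <stdio.h>

#define FM_COEF     933      /* coefficient for half-life of 10 secs */
#define FM_MAXTICKS 99       /* no point computing more ticks than this */
#define FM_MAXCNT   1000000  /* limit cnt to avoid overflow */
#define FM_SCALE    1000     /* faux fixed point scale */

/* Hypothetical user-space stand-in for struct fmeter (no spinlock). */
struct fmeter_sim {
	int cnt;    /* unprocessed events count, scaled by FM_SCALE */
	int val;    /* most recent output value */
	long time;  /* clock (secs) when val was computed */
};

/* Decay val once per elapsed second, then fold in pending events. */
static void sim_update(struct fmeter_sim *f, long now)
{
	long ticks = now - f->time;

	if (ticks == 0)
		return;
	if (ticks > FM_MAXTICKS)
		ticks = FM_MAXTICKS;
	while (ticks-- > 0)
		f->val = (FM_COEF * f->val) / FM_SCALE;
	f->time = now;

	f->val += ((FM_SCALE - FM_COEF) * f->cnt) / FM_SCALE;
	f->cnt = 0;
}

/* Record one event at time 'now' (in seconds). */
static void sim_markevent(struct fmeter_sim *f, long now)
{
	sim_update(f, now);
	f->cnt += FM_SCALE;
	if (f->cnt > FM_MAXCNT)
		f->cnt = FM_MAXCNT;
}

/* Return the filtered rate (events per second, times 1000) at 'now'. */
static int sim_getrate(struct fmeter_sim *f, long now)
{
	sim_update(f, now);
	return f->val;
}
```

Marking 5 events in each of a few hundred consecutive seconds should
leave sim_getrate() at or just under 5000 (integer truncation keeps it
from overshooting), and asking for the rate 10 seconds after the last
update, with no new events, should return roughly half the previous
value.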

A read-only per-cpuset file, memory_reclaim_rate, provides an integer
representing the recent (half-life of 10 seconds) rate of direct page
reclaims caused by the tasks in the cpuset, in units of reclaims
attempted per second, times 1000.
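
As a rough illustration of how a batch manager might consume this
value, the sketch below reads the per-cpuset memory_reclaim_rate file
added by this patch and converts it to reclaims per second. The helper
names reclaim_rate_parse() and reclaim_rate_read() are hypothetical,
and the caller must supply the path; something like
/dev/cpuset/<job>/memory_reclaim_rate is only an assumption about
where the cpuset file system is mounted and how the job's cpuset is
named.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Convert the text read from memory_reclaim_rate (reclaims per
 * second, times 1000) into reclaims per second.
 * Returns -1.0 if the text is not a non-negative integer.
 */
static double reclaim_rate_parse(const char *text)
{
	char *end;
	long scaled;

	errno = 0;
	scaled = strtol(text, &end, 10);
	if (end == text || errno != 0 || scaled < 0)
		return -1.0;
	return (double)scaled / 1000.0;
}

/*
 * Read the rate file at 'path' (assumed to be a cpuset's
 * memory_reclaim_rate file) and return reclaims per second,
 * or -1.0 if the file cannot be opened, read, or parsed.
 */
static double reclaim_rate_read(const char *path)
{
	char buf[32];
	FILE *fp = fopen(path, "r");

	if (fp == NULL)
		return -1.0;
	if (fgets(buf, sizeof(buf), fp) == NULL) {
		fclose(fp);
		return -1.0;
	}
	fclose(fp);
	return reclaim_rate_parse(buf);
}
```

A monitor could call reclaim_rate_read() every few seconds and compare
the result against a site-chosen threshold. What to do when the
threshold is crossed remains a policy decision for the batch manager.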

Signed-off-by: Paul Jackson <pj@xxxxxxx>

---

An earlier patch that tried to address this same problem
is in the thread:

http://lkml.org/lkml/2005/3/19/148
Date Sat, 19 Mar 2005 17:48:46 -0800 (PST)
Subject [Patch] cpusets policy kill no swap

It was rejected, as it hardwired policy, mechanism and
detection together, in a manner that would only have
been useful to very specialized customer needs.

include/linux/cpuset.h | 3
kernel/cpuset.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++-
mm/page_alloc.c | 1
3 files changed, 161 insertions(+), 1 deletion(-)

--- 2.6.14-rc5-mm1-cpuset-patches.orig/include/linux/cpuset.h 2005-11-03 19:07:00.026535184 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/include/linux/cpuset.h 2005-11-03 19:07:00.130051971 -0800
@@ -26,6 +26,7 @@ void cpuset_update_current_mems_allowed(
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
extern int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask);
extern int cpuset_excl_nodes_overlap(const struct task_struct *p);
+extern void cpuset_synchronous_page_reclaim_bump(void);
extern struct file_operations proc_cpuset_operations;
extern char *cpuset_task_status_allowed(struct task_struct *task, char *buffer);

@@ -60,6 +61,8 @@ static inline int cpuset_excl_nodes_over
return 1;
}

+static inline void cpuset_synchronous_page_reclaim_bump(void) {}
+
static inline char *cpuset_task_status_allowed(struct task_struct *task,
char *buffer)
{
--- 2.6.14-rc5-mm1-cpuset-patches.orig/kernel/cpuset.c 2005-11-03 19:07:00.050949521 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/kernel/cpuset.c 2005-11-03 19:28:12.507607967 -0800
@@ -56,6 +56,15 @@

#define CPUSET_SUPER_MAGIC 0x27e0eb

+/* See "Frequency meter" comments, below. */
+
+struct fmeter {
+ int cnt; /* unprocessed events count */
+ int val; /* most recent output value */
+ time_t time; /* clock (secs) when val computed */
+ spinlock_t lock; /* guards read or write of above */
+};
+
struct cpuset {
unsigned long flags; /* "unsigned long" so bitops work */
cpumask_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
@@ -81,7 +90,9 @@ struct cpuset {
* Copy of global cpuset_mems_generation as of the most
* recent time this cpuset changed its mems_allowed.
*/
- int mems_generation;
+ int mems_generation;
+
+ struct fmeter fmeter; /* memory_reclaim_rate filter */
};

/* bits in struct cpuset flags field */
@@ -174,6 +185,9 @@ static struct cpuset top_cpuset = {
.cpus_allowed = CPU_MASK_ALL,
.mems_allowed = NODE_MASK_ALL,
.marker_pid = 0,
+ .fmeter.cnt = 0,
+ .fmeter.val = 0,
+ .fmeter.time = 0,
.count = ATOMIC_INIT(0),
.sibling = LIST_HEAD_INIT(top_cpuset.sibling),
.children = LIST_HEAD_INIT(top_cpuset.children),
@@ -887,6 +901,104 @@ static int update_flag(cpuset_flagbits_t
}

/*
+ * Frequency meter - How fast is some event occurring?
+ *
+ * These routines manage a digitally filtered, constant time based,
+ * event frequency meter. There are four routines:
+ * fmeter_init() - initialize a frequency meter.
+ * fmeter_markevent() - called each time the event happens.
+ * fmeter_getrate() - returns the recent rate of such events.
+ * fmeter_update() - internal routine used to update fmeter.
+ *
+ * A common data structure is passed to each of these routines,
+ * which is used to keep track of the state required to manage the
+ * frequency meter and its digital filter.
+ *
+ * The filter works on the number of events marked per unit time.
+ * The filter is single-pole low-pass recursive (IIR). The time unit
+ * is 1 second. Arithmetic is done using 32-bit integers scaled to
+ * simulate 3 decimal digits of precision (multiplied by 1000).
+ *
+ * With an FM_COEF of 933, and a time base of 1 second, the filter
+ * has a half-life of 10 seconds, meaning that if the events quit
+ * happening, then the rate returned from the fmeter_getrate()
+ * will be cut in half each 10 seconds, until it converges to zero.
+ *
+ * It is not worth doing a real infinitely recursive filter. If more
+ * than FM_MAXTICKS ticks have elapsed since the last filter event,
+ * just compute FM_MAXTICKS ticks worth, by which point the level
+ * will be stable.
+ *
+ * Limit the count of unprocessed events to FM_MAXCNT, so as to avoid
+ * arithmetic overflow in the fmeter_update() routine.
+ *
+ * Given the simple 32 bit integer arithmetic used, this meter works
+ * best for reporting rates between one per millisecond (msec) and
+ * one per 32 (approx) seconds. At constant rates faster than one
+ * per msec it maxes out at values just under 1,000,000. At constant
+ * rates between one per msec, and one per second it will stabilize
+ * to a value N*1000, where N is the rate of events per second.
+ * At constant rates between one per second and one per 32 seconds,
+ * it will be choppy, moving up on the seconds that have an event,
+ * and then decaying until the next event. At rates slower than
+ * about one in 32 seconds, it decays all the way back to zero between
+ * each event.
+ */
+
+#define FM_COEF 933 /* coefficient for half-life of 10 secs */
+#define FM_MAXTICKS ((time_t)99) /* useless computing more ticks than this */
+#define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */
+#define FM_SCALE 1000 /* faux fixed point scale */
+
+/* Initialize a frequency meter */
+static void fmeter_init(struct fmeter *fmp)
+{
+ fmp->cnt = 0;
+ fmp->val = 0;
+ fmp->time = 0;
+ spin_lock_init(&fmp->lock);
+}
+
+/* Internal meter update - process cnt events and update value */
+static void fmeter_update(struct fmeter *fmp)
+{
+ time_t now = get_seconds();
+ time_t ticks = now - fmp->time;
+
+ if (ticks == 0)
+ return;
+
+ ticks = min(FM_MAXTICKS, ticks);
+ while (ticks-- > 0)
+ fmp->val = (FM_COEF * fmp->val) / FM_SCALE;
+ fmp->time = now;
+
+ fmp->val += ((FM_SCALE - FM_COEF) * fmp->cnt) / FM_SCALE;
+ fmp->cnt = 0;
+}
+
+/* Process any previous ticks, then bump cnt by one (times scale). */
+static void fmeter_markevent(struct fmeter *fmp)
+{
+ spin_lock(&fmp->lock);
+ fmeter_update(fmp);
+ fmp->cnt = min(FM_MAXCNT, fmp->cnt + FM_SCALE);
+ spin_unlock(&fmp->lock);
+}
+
+/* Process any previous ticks, then return current value. */
+static int fmeter_getrate(struct fmeter *fmp)
+{
+ int val;
+
+ spin_lock(&fmp->lock);
+ fmeter_update(fmp);
+ val = fmp->val;
+ spin_unlock(&fmp->lock);
+ return val;
+}
+
+/*
* Attack task specified by pid in 'pidbuf' to cpuset 'cs', possibly
* writing the path of the old cpuset in 'ppathbuf' if it needs to be
* notified on release.
@@ -964,6 +1076,7 @@ typedef enum {
FILE_MEM_EXCLUSIVE,
FILE_NOTIFY_ON_RELEASE,
FILE_MARKER_PID,
+ FILE_RECLAIM_RATE,
FILE_TASKLIST,
} cpuset_filetype_t;

@@ -1021,6 +1134,9 @@ static ssize_t cpuset_common_file_write(
case FILE_MARKER_PID:
retval = update_marker_pid(cs, buffer);
break;
+ case FILE_RECLAIM_RATE:
+ retval = -EACCES;
+ break;
case FILE_TASKLIST:
retval = attach_task(cs, buffer, &pathbuf);
break;
@@ -1127,6 +1243,9 @@ static ssize_t cpuset_common_file_read(s
case FILE_MARKER_PID:
s += sprintf(s, "%d", cs->marker_pid);
break;
+ case FILE_RECLAIM_RATE:
+ s += sprintf(s, "%d", fmeter_getrate(&cs->fmeter));
+ break;
default:
retval = -EINVAL;
goto out;
@@ -1480,6 +1599,11 @@ static struct cftype cft_marker_pid = {
.private = FILE_MARKER_PID,
};

+static struct cftype cft_reclaim_rate = {
+ .name = "memory_reclaim_rate",
+ .private = FILE_RECLAIM_RATE,
+};
+
static int cpuset_populate_dir(struct dentry *cs_dentry)
{
int err;
@@ -1496,6 +1620,8 @@ static int cpuset_populate_dir(struct de
return err;
if ((err = cpuset_add_file(cs_dentry, &cft_marker_pid)) < 0)
return err;
+ if ((err = cpuset_add_file(cs_dentry, &cft_reclaim_rate)) < 0)
+ return err;
if ((err = cpuset_add_file(cs_dentry, &cft_tasks)) < 0)
return err;
return 0;
@@ -1531,6 +1657,7 @@ static long cpuset_create(struct cpuset
INIT_LIST_HEAD(&cs->children);
atomic_inc(&cpuset_mems_generation);
cs->mems_generation = atomic_read(&cpuset_mems_generation);
+ fmeter_init(&cs->fmeter);

cs->parent = parent;

@@ -1620,6 +1747,7 @@ int __init cpuset_init(void)
top_cpuset.cpus_allowed = CPU_MASK_ALL;
top_cpuset.mems_allowed = NODE_MASK_ALL;

+ fmeter_init(&top_cpuset.fmeter);
atomic_inc(&cpuset_mems_generation);
top_cpuset.mems_generation = atomic_read(&cpuset_mems_generation);

@@ -1929,6 +2057,34 @@ done:
return overlap;
}

+/**
+ * cpuset_synchronous_page_reclaim_bump - keep stats of per-cpuset reclaims.
+ *
+ * Keep a running average of the rate of synchronous (direct)
+ * page reclaim efforts initiated by tasks in each cpuset.
+ *
+ * This represents the rate at which some task in the cpuset
+ * ran low on memory on all nodes it was allowed to use, and
+ * had to enter the kernel's page reclaim code in an effort to
+ * create more free memory by tossing clean pages or swapping
+ * or writing dirty pages.
+ *
+ * Display to user space in the per-cpuset read-only file
+ * "memory_reclaim_rate". Value displayed is an integer
+ * representing the recent rate of entry into the synchronous
+ * (direct) page reclaim by any task attached to the cpuset.
+ **/
+
+void cpuset_synchronous_page_reclaim_bump(void)
+{
+ struct cpuset *cs;
+
+ task_lock(current);
+ cs = current->cpuset;
+ fmeter_markevent(&cs->fmeter);
+ task_unlock(current);
+}
+
/*
* proc_cpuset_show()
* - Print tasks cpuset path into seq_file.
--- 2.6.14-rc5-mm1-cpuset-patches.orig/mm/page_alloc.c 2005-11-03 19:06:53.301850334 -0800
+++ 2.6.14-rc5-mm1-cpuset-patches/mm/page_alloc.c 2005-11-03 19:26:02.267868104 -0800
@@ -976,6 +976,7 @@ rebalance:
cond_resched();

/* We now go into synchronous reclaim */
+ cpuset_synchronous_page_reclaim_bump();
p->flags |= PF_MEMALLOC;
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.650.933.1373