Re: [patch v3] swap: virtual swap readahead

From: Johannes Weiner
Date: Wed Jun 10 2009 - 03:47:43 EST

Next message: Frederic Weisbecker: "Re: [PATCH 08/11] tracing: add func and symfunc to tag format"
Previous message: Gui Jianfeng: "Re: [PATCH 17/19] io-controller: Support per cgroup per device weightsand io class"
In reply to: Wu Fengguang: "Re: [patch v3] swap: virtual swap readahead"
Next in thread: Wu Fengguang: "Re: [patch v3] swap: virtual swap readahead"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Fengguang,

On Wed, Jun 10, 2009 at 01:03:42PM +0800, Wu Fengguang wrote:
> On Wed, Jun 10, 2009 at 03:37:02AM +0800, Johannes Weiner wrote:
> > On Tue, Jun 09, 2009 at 09:01:28PM +0200, Johannes Weiner wrote:
> > > [resend with lists cc'd, sorry]
> >
> > [and fixed Hugh's email. crap]
> >
> > > Hi,
> > >
> > > here is a new iteration of the virtual swap readahead. Per Hugh's
> > > suggestion, I moved the pte collecting to the callsite and thus out
> > > ouf swap code. Unfortunately, I had to bound page_cluster due to an
> > > array of that many swap entries on the stack, but I think it is better
> > > to limit the cluster size to a sane maximum than using dynamic
> > > allocation for this purpose.
>
> Hi Johannes,
>
> When stress testing your patch, I found it triggered many OOM kills.
> Around the time of last OOMs, the memory usage is:
>
> total used free shared buffers cached
> Mem: 474 468 5 0 0 239
> -/+ buffers/cache: 229 244
> Swap: 1023 221 802

Wow, that really confused me for a second as we shouldn't read more
pages ahead than without the patch, probably even less under stress.

So the problem has to be a runaway reading. And indeed, severe
stupidity here:

+ window = cluster << PAGE_SHIFT;
+ min = addr & ~(window - 1);
+ max = min + cluster;
+ /*
+ * To keep the locking/highpte mapping simple, stay
+ * within the PTE range of one PMD entry.
+ */
+ limit = addr & PMD_MASK;
+ if (limit > min)
+ min = limit;
+ limit = pmd_addr_end(addr, max);
+ if (limit < max)
+ max = limit;
+ limit = max - min;

The mistake is at the initial calculation of max. It should be

max = min + window;

The resulting problem is that min could get bigger than max when
cluster is bigger than PMD_SHIFT. Did you use page_cluster == 5?

The initial min is aligned to a value below the PMD boundary and max
based on it with a too small offset, staying below the PMD boundary as
well. When min is rounded up, this becomes a bit large:

limit = max - min;

So if my brain is already functioning, fixing the initial max should
be enough because either

o window is smaller than PMD_SIZE, than we won't round down
below a PMD boundary in the first place or

o window is bigger than PMD_SIZE, than we can round down below
a PMD boundary but adding window to that is garuanteed to
cross the boundary again

and thus max is always bigger than min.

Fengguang, does this make sense? If so, the patch below should fix
it.

Thank you,

Hannes

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2467,7 +2467,7 @@ static int swap_readahead_ptes(struct mm

window = cluster << PAGE_SHIFT;
min = addr & ~(window - 1);
- max = min + cluster;
+ max = min + window;
/*
* To keep the locking/highpte mapping simple, stay
* within the PTE range of one PMD entry.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Frederic Weisbecker: "Re: [PATCH 08/11] tracing: add func and symfunc to tag format"
Previous message: Gui Jianfeng: "Re: [PATCH 17/19] io-controller: Support per cgroup per device weightsand io class"
In reply to: Wu Fengguang: "Re: [patch v3] swap: virtual swap readahead"
Next in thread: Wu Fengguang: "Re: [patch v3] swap: virtual swap readahead"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]