Re: [RFC PATCH 0/3] Change how we determine when to hand out THPs
From: Andrew Morton
Date: Sat Dec 14 2013 - 00:43:38 EST
On Thu, 12 Dec 2013 12:00:37 -0600 Alex Thorlton <athorlton@xxxxxxx> wrote:
> This patch changes the way we decide whether or not to give out THPs to
> processes when they fault in pages.
Please cc Andrea on this.
> The way things are right now,
> touching one byte in a 2M chunk where no pages have been faulted in
> results in a process being handed a 2M hugepage, which, in some cases,
> is undesirable. The most common issue seems to arise when a process
> uses many cores to work on small portions of an allocated chunk of
> memory.
>
> Here are some results from a test that I wrote, which allocates memory
> in a way that doesn't benefit from the use of THPs:
>
> # echo always > /sys/kernel/mm/transparent_hugepage/enabled
> # perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g
>
> Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):
>
> 93.534078104 seconds time elapsed
> ...
>
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # perf stat -a -r 5 ./thp_pthread -C 0 -m 0 -c 64 -b 128g
>
> Performance counter stats for './thp_pthread -C 0 -m 0 -c 64 -b 128g' (5 runs):
>
> ...
> 76.467835263 seconds time elapsed
> ...
>
> As you can see there's a significant performance increase when running
> this test with THP off.
yup.
> My proposed solution to the problem is to allow users to set a
> threshold at which THPs will be handed out. The idea here is that, when
> a user faults in a page in an area where they would usually be handed a
> THP, we pull 512 pages off the free list, as we would with a regular
> THP, but we only fault in single pages from that chunk, until the user
> has faulted in enough pages to pass the threshold we've set. Once they
> pass the threshold, we do the necessary work to turn our 512 page chunk
> into a proper THP. As it stands now, if the user tries to fault in
> pages from different nodes, we completely give up on ever turning a
> particular chunk into a THP, and just fault in the 4K pages as they're
> requested. We may want to make this tunable in the future (e.g. allow
> them to fault in from at most 2 different nodes).
OK. But all 512 pages reside on the same node, yes? Whereas with THP
disabled those 512 pages would have resided closer to the CPUs which
instantiated them. So the expected result will be somewhere in between
the 93 secs and the 76 secs?
That being said, I don't see a downside to the idea, apart from some
additional setup cost in kernel code.
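
To make the proposed behaviour concrete, here is a rough userspace model of
the per-chunk bookkeeping as I read the description above. It is a sketch
only, not code from the patch: struct thp_chunk, chunk_init(), chunk_fault()
and the state names are all hypothetical, and I am assuming promotion happens
as soon as the count of faulted pages reaches the threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGES_PER_CHUNK 512	/* 512 x 4K pages = one 2M huge page */
#define NO_NODE (-1)

/* Hypothetical states for one reserved 2M chunk. */
enum chunk_state { CHUNK_SMALL, CHUNK_HUGE, CHUNK_GAVE_UP };

/* Hypothetical bookkeeping; the real patch may track this differently. */
struct thp_chunk {
	bool mapped[PAGES_PER_CHUNK];	/* which 4K pages have been faulted in */
	int nr_mapped;			/* number of distinct pages faulted in */
	int node;			/* node of the first fault, or NO_NODE */
	enum chunk_state state;
};

static void chunk_init(struct thp_chunk *c)
{
	memset(c, 0, sizeof(*c));
	c->node = NO_NODE;
	c->state = CHUNK_SMALL;
}

/*
 * Model a fault on page_idx taken by a CPU on the given node.
 * Returns the chunk state after the fault has been handled.
 */
static enum chunk_state chunk_fault(struct thp_chunk *c, int page_idx,
				    int node, int threshold)
{
	int i;

	assert(page_idx >= 0 && page_idx < PAGES_PER_CHUNK);

	if (c->state == CHUNK_HUGE)
		return c->state;	/* whole chunk is already mapped */

	/* A fault from a second node means we never promote this chunk. */
	if (c->node == NO_NODE)
		c->node = node;
	else if (node != c->node)
		c->state = CHUNK_GAVE_UP;

	/* Map just the one 4K page that was touched. */
	if (!c->mapped[page_idx]) {
		c->mapped[page_idx] = true;
		c->nr_mapped++;
	}

	/* Assumption: promote once the count reaches the threshold. */
	if (c->state == CHUNK_SMALL && c->nr_mapped >= threshold) {
		for (i = 0; i < PAGES_PER_CHUNK; i++)
			c->mapped[i] = true;
		c->nr_mapped = PAGES_PER_CHUNK;
		c->state = CHUNK_HUGE;
	}

	return c->state;
}
```

With a threshold of 3, three faults from node 0 on distinct pages leave the
chunk in CHUNK_HUGE, while a second fault arriving from node 1 before the
threshold is reached leaves it in CHUNK_GAVE_UP for good. Note that the model
says nothing about which node the 512 reserved pages come from, which is the
placement question raised above.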