[PATCH 0/2] Fix migration races in rmap_walk() V5

From: Mel Gorman
Date: Wed May 05 2010 - 09:14:50 EST


So, V4 wasn't to everyone's liking, so here is another variation of the fix
needed for migration-related races on VMA adjustments. Relative to V4, patch 1
takes a different approach whereas patch 2 is the same except that a minor bug
in anon_vma ordering is fixed. If these can be agreed upon, it'd be nice to
get a fix in for 2.6.34.

Changelog since V4
o Switch back anon_vma locking to put bulk of locking in rmap_walk
o Fix anon_vma lock ordering in exec vs migration race

Changelog since V3
o Rediff against the latest upstream tree
o Improve the patch changelog a little (thanks Peterz)

Changelog since V2
o Drop fork changes
o Avoid pages in temporary stacks during exec instead of migration pte
  lazy cleanup
o Drop locking-related patch and replace with Rik's

Changelog since V1
o Handle the execve race
o Be sure that rmap_walk() releases the correct VMA lock
o Hold the anon_vma lock for the address lookup and the page remap
o Add reviewed-bys

Broadly speaking, migration works by locking a page, unmapping it, putting
a migration PTE in place that looks like a swap entry, copying the page, and
then remapping it, which removes the migration PTE, before unlocking the page.
If a fault occurs, the faulting process waits until migration completes.

The problem is that there are some races that allow migration PTEs to be
left behind. Migration still completes and the page is unlocked, but a later
fault will call migration_entry_to_page() and BUG() because the page is not
locked. It's not possible to just clean up the migration PTE because the
page it points to has potentially been freed and reused. This series aims
to close the races.
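
As a rough illustration of the sequence above, and of how a stale migration
PTE turns into a BUG(), here is a minimal userspace toy model. It is not
kernel code: every toy_* name is hypothetical, the page lock is a plain flag,
and passing NULL to toy_migrate_finish() stands in for an rmap walk that
failed to find the PTE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct toy_page { bool locked; int data; };

enum toy_pte_kind { TOY_PTE_PRESENT, TOY_PTE_MIGRATION };

struct toy_pte { enum toy_pte_kind kind; struct toy_page *page; };

enum toy_fault_result { TOY_FAULT_OK, TOY_FAULT_WAIT, TOY_FAULT_BUG };

/* Lock the page and replace its mapping with a migration entry. */
void toy_migrate_begin(struct toy_pte *pte)
{
	pte->page->locked = true;
	pte->kind = TOY_PTE_MIGRATION;
}

/*
 * Copy the page, remap the PTE the walk found (NULL models a walk that
 * missed it), then unlock the old page.
 */
void toy_migrate_finish(struct toy_pte *found, struct toy_page *oldp,
			struct toy_page *newp)
{
	assert(oldp->locked);
	newp->data = oldp->data;
	if (found) {
		found->kind = TOY_PTE_PRESENT;
		found->page = newp;
	}
	oldp->locked = false;
}

/* A fault on a migration entry expects the page to still be locked. */
enum toy_fault_result toy_fault(const struct toy_pte *pte)
{
	if (pte->kind != TOY_PTE_MIGRATION)
		return TOY_FAULT_OK;
	return pte->page->locked ? TOY_FAULT_WAIT : TOY_FAULT_BUG;
}
```

In the toy, a fault during migration reports TOY_FAULT_WAIT, while a fault on
an entry that was missed at remap time reports TOY_FAULT_BUG, which is the
situation this series is trying to make impossible.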

Patch 1 of this series is about the locking of anon_vma in migration
versus vma_adjust. While I am not aware of any reproduction cases, it is
potentially racy. This patch is an alternative to Rik's more comprehensive
locking approach posted at http://lkml.org/lkml/2010/5/3/155 and uses
trylock-and-retry logic in rmap_walk until it can lock all the anon_vmas
without contention. In vma_adjust, the anon_vma locks are acquired under
similar conditions to 2.6.33. The rmap_walk changes potentially slow
down migration and aspects of page reclaim a little but those are the less
important paths.
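
The trylock-and-retry idea described for patch 1 can be sketched in userspace
with pthread mutexes. This is only a sketch under stated assumptions, not the
code from the patch: struct toy_anon_vma and the toy_* helpers are
hypothetical, and the real rmap_walk() deals with kernel anon_vma locks
rather than pthread mutexes.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for an anon_vma: only the lock is modelled. */
struct toy_anon_vma {
	pthread_mutex_t lock;
};

/*
 * Try to take every lock in the set without blocking. On contention, drop
 * whatever is already held and return 0 so the caller can retry.
 */
int toy_trylock_all(struct toy_anon_vma **set, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (pthread_mutex_trylock(&set[i]->lock) != 0) {
			while (i-- > 0)
				pthread_mutex_unlock(&set[i]->lock);
			return 0;
		}
	}
	return 1;
}

/* Release the whole set in reverse order. */
void toy_unlock_all(struct toy_anon_vma **set, size_t n)
{
	while (n-- > 0)
		pthread_mutex_unlock(&set[n]->lock);
}

/* Example visitor: counts how often the walk body ran. */
void toy_count_visit(void *arg)
{
	(*(int *)arg)++;
}

/*
 * Retry until the whole set is held, run the walk body, then unlock.
 * Returns the number of retries that were needed.
 */
unsigned long toy_walk_locked(struct toy_anon_vma **set, size_t n,
			      void (*visit)(void *), void *arg)
{
	unsigned long retries = 0;

	while (!toy_trylock_all(set, n)) {
		retries++;
		sched_yield();
	}
	visit(arg);
	toy_unlock_all(set, n);
	return retries;
}
```

Backing off completely on contention means the walker never blocks while
holding part of the set, at the cost of possibly looping for a while, which
matches the expectation that the walk side may get a little slower.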

Patch 2 of this series addresses the reported swapops bug, which is a race
between migration and execve where pages get migrated from the temporary
stack before it is moved. To avoid migration PTEs being left behind,
a temporary VMA is put in place so that a migration PTE in either the
temporary stack or the relocated stack can be found.

The reproduction case for the races was as follows:

1. Run kernel compilation in a loop
2. Start four processes, each of which creates one mapping. The four stress
   different aspects of the problem. The operations they undertake are:
	a) Forks a hundred children, each of which faults the mapping
	   Purpose: stress test migration pte removal
	b) Forks a hundred children, each of which punches a hole in the
	   mapping and faults what remains
	   Purpose: stress test VMA manipulations during migration
	c) Forks a hundred children, each of which execs and calls echo
	   Purpose: stress test the execve race
	d) Size the mapping to be 1.5 times physical memory. Constantly
	   memset it
	   Purpose: stress swapping
3. Constantly compact memory using /proc/sys/vm/compact_memory so migration
   is active all the time. In theory, you could also force this using
   sys_move_pages or memory hot-remove but it'd be nowhere near as easy
   to test.
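
Steps 2a and 3 could be approximated with something like the following. It
is a sketch, not the actual test program, which is not included in this mail:
fault_in_children() and trigger_compaction() are hypothetical helpers, the
private anonymous mapping is an assumption, and writing to
/proc/sys/vm/compact_memory needs root and a kernel with compaction built in.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Step 3: write "1" to the given file. Returns 0 on success, -1 on error. */
int trigger_compaction(const char *path)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs("1\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Step 2a: create one mapping, fork nr_children children and have each
 * child touch every page of it. Returns the number of children that
 * exited cleanly, or -1 if the mapping could not be created.
 */
int fault_in_children(size_t len, int nr_children)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *map;
	int forked = 0, clean = 0;

	assert(page > 0);
	/* Assumption: a private anonymous mapping, as anon_vma is involved. */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	for (int i = 0; i < nr_children; i++) {
		pid_t pid = fork();

		if (pid < 0)
			break;
		if (pid == 0) {
			for (size_t off = 0; off < len; off += (size_t)page)
				map[off] = 1;
			_exit(0);
		}
		forked++;
	}

	for (int i = 0; i < forked; i++) {
		int status;

		if (wait(&status) > 0 && WIFEXITED(status) &&
		    WEXITSTATUS(status) == 0)
			clean++;
	}

	munmap(map, len);
	return clean;
}
```

A driver would call fault_in_children(len, 100) in a loop in one process
while another process loops on
trigger_compaction("/proc/sys/vm/compact_memory").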

Compaction, which is not going to be in 2.6.34, is the easiest way to
trigger these bugs, but in theory the problem also affects memory hot-remove.

There were some concerns with patch 2 that performance would be impacted. To
check if this was the case I ran kernbench, aim9 and sysbench. AIM9 in
particular was of interest as it has an exec microbenchmark.

                kernbench-vanilla     fixraces-v5r1
Elapsed mean     103.40 ( 0.00%)    103.35 ( 0.05%)
Elapsed stddev     0.09 ( 0.00%)      0.13 (-55.72%)
User    mean     313.50 ( 0.00%)    313.15 ( 0.11%)
User    stddev     0.61 ( 0.00%)      0.20 (66.70%)
System  mean      55.50 ( 0.00%)     55.85 (-0.64%)
System  stddev     0.48 ( 0.00%)      0.15 (68.98%)
CPU     mean     356.25 ( 0.00%)    356.50 (-0.07%)
CPU     stddev     0.43 ( 0.00%)      0.50 (-15.47%)

Nothing special there and kernbench is fork+exec heavy. The patched kernel
is slightly faster on wall time but it's well within the noise. System time
is slightly slower but again, it's within the noise.

AIM9
                 aim9-vanilla          fixraces-v5r1
creat-clo     116813.86 ( 0.00%)    117980.34 ( 0.99%)
page_test     270923.33 ( 0.00%)    268668.56 (-0.84%)
brk_test     2551558.07 ( 0.00%)   2649450.00 ( 3.69%)
signal_test   279866.67 ( 0.00%)    279533.33 (-0.12%)
exec_test        226.67 ( 0.00%)       232.67 ( 2.58%)
fork_test       4261.91 ( 0.00%)      4110.98 (-3.67%)
link_test      53534.78 ( 0.00%)     54076.49 ( 1.00%)

So, here exec and fork aren't showing up major worries. exec is faster but
these tests can be so sensitive to starting conditions that I tend not to
read much into them unless there are major differences.
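
How the percentage column is computed is not stated in this mail. The AIM9
figures printed above are consistent with a gain measured relative to the
patched result, (1 - vanilla/patched) * 100; that formula is an inference
from the numbers, not a description of the reporting tool, and gain_percent()
below is a hypothetical helper written only to show the arithmetic.

```c
#include <assert.h>

/*
 * Assumed formula for higher-is-better results: gain relative to the
 * patched figure, in percent. Negative values mean the patched kernel
 * scored lower than the baseline.
 */
double gain_percent(double baseline, double patched)
{
	assert(patched != 0.0);
	return (1.0 - baseline / patched) * 100.0;
}
```

For example, gain_percent(226.67, 232.67) is roughly 2.58 and
gain_percent(4261.91, 4110.98) is roughly -3.67, matching the exec_test and
fork_test rows.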

SYSBENCH
      sysbench-vanilla       fixraces-v5r1
1    14177.73 ( 0.00%)    14218.41 ( 0.29%)
2    27647.23 ( 0.00%)    27774.14 ( 0.46%)
3    31395.69 ( 0.00%)    31499.95 ( 0.33%)
4    49866.54 ( 0.00%)    49713.49 (-0.31%)
5    49919.58 ( 0.00%)    49524.21 (-0.80%)
6    49532.97 ( 0.00%)    49397.60 (-0.27%)
7    49465.79 ( 0.00%)    49384.14 (-0.17%)
8    49483.33 ( 0.00%)    49186.49 (-0.60%)

These figures also show no differences worth talking about.

While the extra allocation in patch 2 would appear to slow down exec somewhat,
it's not by any amount that matters. As it is in exec, it means that anon_vmas
have likely been freed very recently so the allocation will be cache-hot and
cpu-local. It is possible to special-case migration to avoid migrating pages
in the temporary stack, but fixing it in exec is a more maintainable approach.

fs/exec.c | 37 +++++++++++++++++++++++++++++++++----
mm/ksm.c | 22 ++++++++++++++++++++--
mm/mmap.c | 9 +++++++++
mm/rmap.c | 28 +++++++++++++++++++++++-----
4 files changed, 85 insertions(+), 11 deletions(-)
