Re: [patch][rfc] 2.6.23-rc1 mm: NUMA replicated pagecache

From: Nick Piggin
Date: Mon Aug 13 2007 - 08:54:03 EST


On Fri, Aug 10, 2007 at 05:08:18PM -0400, Lee Schermerhorn wrote:
> On Wed, 2007-08-08 at 16:25 -0400, Lee Schermerhorn wrote:
> > On Fri, 2007-07-27 at 10:42 +0200, Nick Piggin wrote:
> > > Hi,
> > >
> > > Just got a bit of time to take another look at the replicated pagecache
> > > patch. The nopage vs invalidate race and clear_page_dirty_for_io fixes
> > > give me more confidence in the locking now; the new ->fault API makes
> > > MAP_SHARED write faults much more efficient; and a few bugs were found
> > > and fixed.
> > >
> > > More stats were added: *repl* in /proc/vmstat. Survives some kbuilding
> > > tests...
> > >
> >
> > Sending this out to give Nick an update and to give the list a
> > heads up on what I've found so far with the replication patch.
> >
> > I have rebased Nick's recent pagecache replication patch against
> > 2.6.23-rc1-mm2, atop my memory policy and auto/lazy migration
> > patch sets. These include:
> >
> > + shared policy
> > + migrate-on-fault a.k.a. lazy migration
> > + auto-migration - trigger lazy migration on inter-node
> >   task migration
> > + migration cache - pseudo-swap cache for parking unmapped
> > anon pages awaiting migrate-on-fault
> >
> > I added a couple of patches to fix up the interaction of replication
> > with migration [discussed more below] and a per-cpuset control to
> > enable/disable replication. The latter allowed me to boot successfully
> > and to survive any bugs encountered, by restricting their effects to
> > tasks in the test cpuset with replication enabled. That was the
> > theory, anyway :-). Mostly worked...
>
> After I sent out the last update, I ran a usex job mix overnight for ~19.5 hours.
> When I came in the next morning, the console window was full of soft lockups
> on various cpus with various stack traces. /var/log/messages showed 142 in
> all.
>
> I've placed the soft lockup reports from /var/log/messages in the Replication
> directory on free.linux:
>
> http://free.linux.hp.com/~lts/Patches/Replication.
>
> The lockups appeared in several places in the traces I looked at. Here are a
> couple of examples:
>
> + unlink_file_vma() from free_pgtables() during task exit:
> mapping->i_mmap_lock ???
>
> + smp_call_function() from ia64_global_tlb_purge().
> Maybe the 'call_lock' in arch/ia64/kernel/smp.c?
> Traces show us getting here in one of two ways:
>
> 1) try_to_unmap* during auto task migration [migrate_pages_unmap_only()...]
>
> 2) from zap_page_range() when __unreplicate_pcache() calls unmap_mapping_range().
>
> + get_page_from_freelist -> zone_lru_lock?
>
> An interesting point: all of the soft lockup messages said that the cpu was
> locked for 11s. Ring any bells?

Hi Lee,

I've been sick with the flu for the past few days, so I haven't done much
more work here, but I'll just add some (not very useful) comments...

The get_page_from_freelist hang is quite strange. It would be zone->lock,
which shouldn't have too much contention...

Replication may be putting more stress on some locks. It will cause more
tlb flushing that cannot be batched well, which could make the call_lock
hotter. And i_mmap_lock is held over tlb flushing, so it will inherit the
latency from call_lock. (If this is the case, we could potentially extend
the tlb flushing API slightly to cope better with unmapping of pages from
multiple mm's, but that comes way down the track, when/if replication
proves itself!)
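
Just to make sure we're talking about the same path, here is roughly the
nesting I mean for your case (2) (from memory, not the exact call chain):

	unmap_mapping_range()		[the __unreplicate_pcache() case]
	  spin_lock(&mapping->i_mmap_lock)
	    zap_page_range()
	      ... tlb flush ...
	        ia64_global_tlb_purge()
	          smp_call_function()	[takes call_lock, IPIs other cpus]
	  spin_unlock(&mapping->i_mmap_lock)

So any latency on call_lock / IPI delivery also gets charged to whoever
is spinning on that i_mmap_lock.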

tlb flushing AFAIKS should not do the IPI unless it is dealing with a
multithreaded mm... does usex use threads?


> I should note that I was trying to unmap all mappings to the file-backed pages
> on inter-node task migration, instead of just the current task's ptes. However,
> I was only attempting this on pages with mapcount <= 4. So, I don't think I
> was looping trying to unmap pages with mapcounts of several 10s--such as I see
> on some page cache pages in my traces.

Replication teardown would still have to unmap all... but that shouldn't
particularly be any worse than, say, page reclaim (except I guess that it
could occur more often).


> Today, after rebasing to 23-rc2-mm2, I added a patch to unmap only the current
> task's ptes for ALL !anon pages, regardless of mapcount. I've started the test
> again and will let it run over the weekend--or as long as it stays up,
> whichever is shorter :-).

Ah, so it does eventually die? Any hints of why?

>
> I put a tarball with the rebased series in the Replication directory linked
> above, in case you're interested. I haven't added the patch description for
> the new patch yet, but it's pretty simple. Maybe even correct.
>
> ----
>
> Unrelated to the lockups [I think]:
>
> I forgot to look before I rebooted, but earlier the previous evening, I checked
> the vmstats and at that point [~1.5 hours into the test] we had done ~4.88 million
> replications and ~4.8 million "zaps" [collapse of replicated page]. That's around
> 98% zaps. Do we need some filter in the fault path to reduce the "thrashing"--if
> that's what I'm seeing?

Yep. The current replication patch is very much only infrastructure at
this stage (and is good for stress testing). I feel sure that heuristics
and perhaps tunables would be needed to make the most of it.
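
Just thinking out loud, one cheap filter would be a per-mapping cooldown
after a collapse, so a write-then-read pattern doesn't bounce a file in
and out of the replicated state on every fault. Totally untested sketch;
the names (repl_cooldown_ms, last_zap) are made up and not in the patch:

	/* Hypothetical tunable, e.g. a sysctl or per-cpuset knob. */
	static unsigned int repl_cooldown_ms = 100;

	/*
	 * Hypothetical anti-thrash filter: after a replica is collapsed
	 * ("zapped"), refuse to replicate the same mapping again until a
	 * cooldown expires.  last_zap would be a new jiffies timestamp in
	 * struct address_space, stamped where the zap is done.
	 */
	static int replication_allowed(struct address_space *mapping)
	{
		unsigned long cooldown = msecs_to_jiffies(repl_cooldown_ms);

		return time_after(jiffies, mapping->last_zap + cooldown);
	}

Whether that (or something smarter, like requiring N read faults before
replicating) is the right heuristic is another question, but it would be
cheap to experiment with.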


> A while back I took a look at the Virtual Iron page replication patch. They had
> set VM_DENY_WRITE when mapping shared executable segments, and only replicated pages
> in those VMAs. Maybe 'DENY_WRITE isn't exactly what we want. Possibly set another
> flag for shared executables, if we can detect them, and for any shared mapping that
> has no writable mappings?

mapping_writably_mapped would be a good one to try. That may be too
broad in some corner cases where we do want occasionally-written files
or even parts of files to be replicated, but if we were ever to enable
CONFIG_REPLICATION by default, I imagine mapping_writably_mapped would
be the default heuristic.
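
Roughly what I have in mind for the fault path (illustrative only; the
may_replicate() name is made up, nothing like it is in the patch yet):

	/*
	 * Only consider replicating pages of files that nobody currently
	 * has mapped shared-writable.  mapping_writably_mapped() is the
	 * existing helper in linux/fs.h; it just tests
	 * mapping->i_mmap_writable.
	 */
	static inline int may_replicate(struct address_space *mapping)
	{
		return !mapping_writably_mapped(mapping);
	}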

Still, I appreciate the testing of the "thrashing" case, because with
the mapping_writably_mapped heuristic, it is likely that bugs could
remain lurking even in production workloads on huge systems (because
they will hardly ever get unreplicated).


> I'll try to remember to check the replication statistics after the currently
> running test. If the system stays up, that is. A quick look < 10 minutes into
> the test shows that zaps are now ~84% of replications. Also, ~47K replicated pages
> out of ~287K file pages.

Yeah I guess it can be a little misleading: as time approaches infinity,
zaps will probably approach replications. But that doesn't tell you how
long a replica stayed around and usefully fed CPUs with local memory...