Re: nonlinear swapping w/o pte_chains [Re: VMA_MERGING_FIXUP and patch]

From: Andrea Arcangeli
Date: Wed Mar 24 2004 - 09:39:17 EST

On Wed, Mar 24, 2004 at 10:12:58AM +0000, Hugh Dickins wrote:
> On Tue, 23 Mar 2004, Andrea Arcangeli wrote:
> >
> > I don't think I can use the tlb gather because I've to set the pte back
> > immediately, or can I? The IPI flood and huge pagetable walk with total
> > destruction of the address space with huge mappings will be very bad in
> > terms of usability during swapping of huge nonlinear vmas, but hey, if
> > you want to swap smoothly, you should use the vmas.
> Thanks a lot for the preview (or would have been a preview if I'd been
> awake - and now I've found it easiest to look at 2.6.5-rc1 patched with
> the 2.6.5-rc1-aa2 objrmap and anon_vma you pointed Martin to in other
> mail, which includes your latest fixes).
> I think you're being too harsh on the nonlinear vmas! I know you're
> not keen on them, but punishing them this hard! If I read it right,
> page_referenced will never (unless PageReferenced, or mapped into
> a nonlinear also) report a page from a nonlinear vma as referenced
> (I do agree with that part). So they'll soon reach try_to_unmap,
> and each one which gets there will cause every page in every nonlinear
> vma of that inode to be unmapped from the nonlinears right then?
> Yes, that'll teach 'em to use sys_remap_file_pages without VM_LOCKED.

Yep ;)

> For mine I'll try to carry on with the less draconian approach I
> started yesterday, scanning just a range each time (rather 2.4 style).

That will DoS real-life workloads, which is why I had to be draconian.
After you've finished I'll send you a testcase to try; it is a real-life
testcase, not an exploit. The only way to bound the complexity of a
pagetable scan is to do what 2.4 does: drop every pte we find in our
way, so the vm stops calling try_to_unmap. We must avoid walking the vma
more than once to swap it out. This will cause a minor-fault flood, but
that's ok, it doesn't need to be fast at swapping.
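
To illustrate the point, here is a minimal userspace sketch of the
"drop everything in one walk" idea. It is a toy model, not kernel code:
the flat array standing in for a page table, TOY_PTE_PRESENT and
toy_unmap_all() are all hypothetical names made up for this example.

```c
/* Toy userspace model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define TOY_PTE_PRESENT 0x1u	/* assumed "mapped" bit of a toy pte */

/*
 * Walk the whole toy page table once and clear every present entry,
 * so that no later pass has anything left to unmap.
 * Returns the number of entries that were dropped.
 */
size_t toy_unmap_all(unsigned int *ptes, size_t nr)
{
	size_t dropped = 0;
	size_t i;

	assert(ptes != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (ptes[i] & TOY_PTE_PRESENT) {
			ptes[i] = 0;	/* next access takes a (minor) fault */
			dropped++;
		}
	}
	return dropped;
}
```

After one call every entry is gone, so a second call over the same array
finds nothing to do; in this model that is the property that keeps the
walk from being repeated.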

> At the very least, I think your unmap (and mine) needs to
> ptep_test_and_clear_young just before unmap_pte_page, and back out if
> the page is young (referenced). I was going to recommend that anyway:
> at last got around to considering that issue of whether the failed
> trylocks should report referenced or not (return 1 or 0). Looking at
> how shrink_list goes, even before 2.6.5-rc1, I'd expect it to behave
> better your way (proceed to try_to_unmap, which will rightly say
> SWAP_AGAIN if it fails the same trylock) than how it was before in
> objrmap; but that will behave better with a ptep_test_and_clear_young
> check first too.

cute, I agree we should recheck the young bit inside.
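
A rough userspace sketch of that recheck, assuming a toy pte with a
"present" and a "young" bit. This is not the kernel code and not either
patch: toy_test_and_clear_young(), toy_unmap_one() and the TOY_* values
are hypothetical names that only mirror the order of operations
discussed above (test and clear young first, back out if it was set).

```c
/* Toy userspace model, NOT kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PRESENT 0x1u	/* assumed "mapped" bit */
#define TOY_YOUNG   0x2u	/* assumed "recently referenced" bit */

enum toy_result { TOY_UNMAPPED, TOY_REFERENCED, TOY_NOT_MAPPED };

/* Clear the young bit and report whether it was set. */
bool toy_test_and_clear_young(unsigned int *pte)
{
	bool young = (*pte & TOY_YOUNG) != 0;

	*pte &= ~TOY_YOUNG;
	return young;
}

/* Unmap one toy pte, backing out if it was referenced since last time. */
enum toy_result toy_unmap_one(unsigned int *pte)
{
	assert(pte != NULL);

	if (!(*pte & TOY_PRESENT))
		return TOY_NOT_MAPPED;
	if (toy_test_and_clear_young(pte))
		return TOY_REFERENCED;	/* back out: mapping stays */
	*pte = 0;
	return TOY_UNMAPPED;
}
```

In this model a referenced entry survives the first attempt with its
young bit cleared, and is only dropped on a later attempt if nothing
touched it in between.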

> Sorry to see the #if VMA_MERGING_FIXUPs are still there. I've a
> growing feeling that it won't make enough difference when they're
> gone. But maybe you have a cunning plan to merge all the anon_vmas
> which would result from an mmap next page, write data in, mprotect ro,
> mmap next page, write data in, mprotect ro, ..... workload.

The problem is that mprotect (and mremap) merging is low prio compared to
nonlinear==mlock and the i_mmap{shared} complexity, so I'll address it
only after I have scalable swapping for huge i_mmap{shared} lists too,
which is a prerequisite for merging. mprotect merging doesn't sound like
a prerequisite, though I certainly agree we should fix it up soon (and
after we fix it, it'll work for files too, something that has never
worked to date; I feel it'll be as important for files as it has been so
far for anon ram, and nobody has complained yet that it's not enabled
for files ;).