Re: [PATCH] drop_caches: Allow unmapping pages

From: Dave Hansen
Date: Mon Jan 07 2019 - 14:25:19 EST

On 1/7/19 6:15 AM, Matthew Wilcox wrote:
> You're going to get data corruption doing this. try_to_unmap_one()
> does:
> /* Move the dirty bit to the page. Now the pte is gone. */
> if (pte_dirty(pteval))
> set_page_dirty(page);
> so PageDirty() can be false above, but made true by calling
> try_to_unmap().

I don't think that PageDirty() check is _required_ for correctness. You
can always safely try_to_unmap() no matter the state of the PTE. We
can't lock out the hardware from setting the Dirty bit, ever.

It's also just fine to unmap PageDirty() pages, as long as when the PTE
is created, we move the dirty bit from the PTE into the 'struct page'
(which try_to_unmap() does, as you noticed).

> I also think the way you've done this is expedient at the cost of
> efficiency and layering violations. I think you should first tear
> down the mappings of userspace processes (which will reclaim a lot
> of pages allocated to page tables), then you won't need to touch the
> invalidate_inode_pages paths at all.

By "tear down the mappings", do you mean something analogous to munmap()
where the VMA goes away? Or madvise(MADV_DONTNEED) where the PTE is
destroyed but the VMA remains?

Last time I checked, we only did free_pgtables() when tearing down VMAs,
but not for pure unmappings like reclaim or MADV_DONTNEED. I've thought
it might be fun to make a shrinker that scanned page tables looking for
zero'd pages, but I've never run into a system where empty page table
pages were actually causing a noticeable problem.