Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

From: Andy Lutomirski
Date: Tue Nov 03 2015 - 22:42:01 EST


On Nov 3, 2015 5:30 PM, "Minchan Kim" <minchan@xxxxxxxxxx> wrote:
>
> Linux lacks the ability to free pages lazily, while other OSes already
> support it under the name madvise(MADV_FREE).
>
> The gain is clear: under memory pressure, the kernel can discard freed
> pages rather than swapping them out or triggering OOM.
>
> Without memory pressure, freed pages can be reused by userspace without
> any additional overhead (e.g., page fault + allocation + zeroing).
>

[...]

>
> How it works:
>
> When the madvise syscall is called, the VM clears the dirty bit of the ptes
> in the range. If memory pressure happens, the VM checks the dirty bit in the
> page table; if it is still "clean", the page is a "lazyfree" page, so the VM
> can discard it instead of swapping it out. If there was a store to the page
> before the VM picks it for reclaim, the dirty bit is set, so the VM swaps
> the page out instead of discarding it.

What happens if you MADV_FREE something that's MAP_SHARED or isn't
ordinary anonymous memory? There's a long history of MADV_DONTNEED on
such mappings causing exploitable problems, and I think it would be
nice if MADV_FREE were obviously safe.

Does this set the write protect bit?

What happens on architectures without hardware dirty tracking? For
that matter, even on architectures with hardware dirty tracking, what
happens in multithreaded processes that have the dirty TLB state
cached in a different CPU's TLB?

Using the dirty bit for these semantics scares me. This API creates a
page that can have visible nonzero contents and then can
asynchronously and magically zero itself thereafter. That makes me
nervous. Could we use the accessed bit instead? Then the observable
semantics would be equivalent to having MADV_FREE either zero the page
or do nothing, except that it doesn't make up its mind until the next
read.
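
To make the observable behavior concrete, here is a minimal userspace
sketch. probe_lazyfree() and the LF_* names are hypothetical, and the
fallback value 8 for MADV_FREE is an assumption for headers that do not
define it. It dirties one private anonymous page, calls MADV_FREE, and
reads the page back. A single immediate read will usually see the old
contents; the worry above is that a later read may see zeros with no
intervening write.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: value used by mainline Linux headers */
#endif

enum lazyfree_result {
    LF_ERROR = -1,      /* mmap/madvise failed unexpectedly */
    LF_UNSUPPORTED = 0, /* kernel rejected MADV_FREE with EINVAL */
    LF_KEPT = 1,        /* old contents still visible after the call */
    LF_ZEROED = 2       /* page already reads back as zero */
};

/* Hypothetical helper: report what one read after MADV_FREE observes. */
static int probe_lazyfree(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    assert(psz > 0);
    size_t len = (size_t)psz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return LF_ERROR;

    memset(p, 0xAA, len); /* make the page dirty with nonzero data */

    if (madvise(p, len, MADV_FREE) != 0) {
        int r = (errno == EINVAL) ? LF_UNSUPPORTED : LF_ERROR;
        munmap(p, len);
        return r;
    }

    /* volatile read so the compiler cannot reuse the value it stored */
    unsigned char first = *(volatile unsigned char *)p;
    int r = (first == 0xAA) ? LF_KEPT
          : (first == 0x00) ? LF_ZEROED
          : LF_ERROR;

    munmap(p, len);
    return r;
}
```

Either LF_KEPT or LF_ZEROED is a legal outcome under the proposed
semantics, which is exactly why the result is hard to reason about.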

> + ptent = pte_mkold(ptent);
> + ptent = pte_mkclean(ptent);
> + set_pte_at(mm, addr, pte, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);

It looks like you are flushing the TLB. In a multithreaded program,
that's rather expensive. Potentially silly question: would it be
better to just zero the page immediately in a multithreaded program
and then, when swapping out, check whether the page is zeroed and, if
so, skip swapping it out? That could be done without forcing an IPI.
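
The check I have in mind is cheap. As a rough userspace sketch
(buf_is_zeroed() is a hypothetical name, not an existing kernel
function; in the kernel I'd expect something like memchr_inv() to do
this job):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if every byte of buf[0..len) is zero. */
static bool buf_is_zeroed(const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    const unsigned char *p = buf;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0)
            return false; /* first nonzero byte ends the scan */
    }
    return true;
}
```

A reclaim path that ran this on a page-sized buffer and got true could
drop the page instead of writing it to swap. The cost is one pass over
the page at reclaim time, plus the zeroing at madvise time.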

> +static int madvise_free_single_vma(struct vm_area_struct *vma,
> + unsigned long start_addr, unsigned long end_addr)
> +{
> + unsigned long start, end;
> + struct mm_struct *mm = vma->vm_mm;
> + struct mmu_gather tlb;
> +
> + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> + return -EINVAL;
> +
> + /* MADV_FREE works for only anon vma at the moment */
> + if (!vma_is_anonymous(vma))
> + return -EINVAL;

Does anything weird happen if it's shared?
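
A quick way to probe this from userspace on a patched kernel, as a
sketch only (try_madv_free() is a hypothetical helper, and 8 as the
fallback value of MADV_FREE is an assumption). My guess is that a
MAP_SHARED | MAP_ANONYMOUS mapping is shmem-backed and therefore fails
the vma_is_anonymous() check above with EINVAL, but I have not verified
that.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: value used by mainline Linux headers */
#endif

/*
 * Hypothetical helper: map one anonymous page with the given sharing
 * flag (MAP_PRIVATE or MAP_SHARED), dirty it, and call MADV_FREE.
 * Returns 0 if madvise() succeeded, otherwise the errno that was set.
 */
static int try_madv_free(int sharing_flag)
{
    long psz = sysconf(_SC_PAGESIZE);
    assert(psz > 0);
    size_t len = (size_t)psz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            sharing_flag | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    memset(p, 0x55, len); /* touch the page so it is populated */

    int err = 0;
    if (madvise(p, len, MADV_FREE) != 0)
        err = errno;

    munmap(p, len);
    return err;
}
```

Comparing try_madv_free(MAP_PRIVATE) with try_madv_free(MAP_SHARED)
would show whether the shared case is rejected outright or silently
accepted.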

> + if (!PageDirty(page) && (flags & TTU_FREE)) {
> + /* It's a freeable page by MADV_FREE */
> + dec_mm_counter(mm, MM_ANONPAGES);
> + goto discard;
> + }

Does something clear TTU_FREE the next time the page gets marked clean?

--Andy