Re: [RFC v7 00/11] Support vrange for anonymous page
From: Minchan Kim
Date: Sun Apr 14 2013 - 03:42:21 EST
Hi KOSAKI,
On Thu, Apr 11, 2013 at 11:01:11AM -0400, KOSAKI Motohiro wrote:
> >>>> and adding new syscall invokation is unwelcome.
> >>>
> >>> Sure. But one more system call could be cheaper than page-granuarity
> >>> operation on purged range.
> >>
> >> I don't think vrange(VOLATILE) cost is the related of this discusstion.
> >> Whether sending SIGBUS or just nuke pte, purge should be done on vmscan,
> >> not vrange() syscall.
> >
> > Again, please see the MADV_FREE. http://lwn.net/Articles/230799/
> > It does changes pte and page flags on all pages of the range through
> > zap_pte_range. So it would make vrange(VOLASTILE) expensive and
> > the bigger cost is, the bigger range is.
>
> This haven't been crossed my mind. now try_to_discard_one() insert vrange
> for making SIGBUS. then, we can insert pte_none() as the same cost too. Am
> I missing something?
For your requirement, we need some tracking model to detect some page is
using by the process currently before VM discards it *if* we don't give
vrange(NOVOLATILE) pair system call(Look at below). So the tracking model
should be formed in vrange(VOLATILE) system call context.
>
> I couldn't imazine why pte should be zapping on vrange(VOLATILE).
Sorry, my explanation was too bad to understand.
I will try again.
First of all, thing you want is almost like MADV_FREE.
So let's look at it firstly.
If you call madvise(range, MADV_FREE), VM should investigate all of
pages mapped at page table for range(start, start + len) so we need
page table lookup for the range and mark a flag to all page descriptor
(ex,PG_lazyfree) to give hint to kernel for discarding the page instead of
swappint out when reclaim happens. Another thing we need is to clear out
a dirty bit from PTE to detect the pages is dirtied or not, since we call
madvise(range, MADV_FREE) because we can't discard them, which are using by
some process since he called madvise. So if VM find the page has PG_lazyfree
but the page is dirtied recenlty by peeking PTE, VM can't discard the page.
So madivse system call's overhead is folloinwg as in madvise(MADV_FREE)
1. look up all pages from page table for the range.
2. mark some bit(PG_lazyfree) for page descriptors of pages mapped at range
3. clear dirty bit and TLB flush
So, madvise(MADV_FREE) would be better than madvise(DONTNEED) because it can
avoid page fault if memory pressure doesn't happen but system call overhead
could be still huge and expecially the overhead is increased proportionally
by range size.
Let's talk about vrange(range, VOLATILE)
The overhead of it is very small, which is just mark a flag into a
structure which represents the range (ie, struct vrange). When VM want to reclaim
some pages, VM find a page is mapped at VOLATILE area, so it could discard it
instead of swapping out. It moves the ovehead from system call itself to
VM reclaim path which is very slow path in the system and I think it's desirable
design(And that's why we have rmap).
But the problem is remained. VM can't detect page using by process after he calls
vrange(range, VOLATILE) because we didn't do anything in vrange(VOLATILE) so
VM might discard the page under the process. It didn't happen in madvise(MADV_FREE)
because it cleared out dirty bit of PTE to detect the page is used or not
since madvise is called.
Solution in vrange is to make new vrange(range, NOVOLATILE) system call, which give
the hint to kernel for preventing descarding pages in the range any more.
The cost of vrange(range, NOVOLATILE) is very small, too.
It just clear out the flags from a struct vrange which represents a range.
So I think calling of pair system call about volatile would be cheaper than a
only madvise(MADV_FREE).
I hope it helps your understanding but not sure because I am writing this
in airport which are very hard to focus my work. :(
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
--
Kind regards,
Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/