Re: Fwd: Control page reclaim granularity
From: Minchan Kim
Date: Mon Mar 12 2012 - 01:19:41 EST
On Mon, Mar 12, 2012 at 10:06:09AM +0800, Zheng Liu wrote:
> On Mon, Mar 12, 2012 at 09:29:34AM +0900, Minchan Kim wrote:
> > I forgot to Ccing you.
> > Sorry.
> >
> > ---------- Forwarded message ----------
> > From: Minchan Kim <minchan@xxxxxxxxxx>
> > Date: Mon, Mar 12, 2012 at 9:28 AM
> > Subject: Re: Control page reclaim granularity
> > To: Minchan Kim <minchan@xxxxxxxxxx>, linux-mm <linux-mm@xxxxxxxxx>,
> > linux-kernel <linux-kernel@xxxxxxxxxxxxxxx>, Konstantin Khlebnikov <
> > khlebnikov@xxxxxxxxxx>, riel@xxxxxxxxxx, kosaki.motohiro@xxxxxxxxxxxxxx
> >
> >
> > On Fri, Mar 09, 2012 at 12:54:03AM +0800, Zheng Liu wrote:
> > > Hi Minchan,
> > >
> > > Sorry, I forgot to say that I don't subscribe linux-mm and linux-kernel
> > > mailing list. So please Cc me.
> > >
> > > IMHO, maybe we should re-think about how does user use mmap(2). I
> > > describe the cases I known in our product system. They can be
> > > categorized into two cases. One is mmaped all data files into memory
> > > and sometime it uses write(2) to append some data, and another uses
> > > mmap(2)/munmap(2) and read(2)/write(2) to manipulate the files. In the
> > > second case, the application wants to keep mmaped page into memory and
> > > let file pages to be reclaimed firstly. So, IMO, when application uses
> > > mmap(2) to manipulate files, it is possible to imply that it wants keep
> > > these mmaped pages into memory and do not be reclaimed. At least these
> > > pages do not be reclaimed early than file pages. I think that maybe we
> > > can recover that routine and provide a sysctl parameter to let the user
> > > to set this ratio between mmaped pages and file pages.
> >
> > I am not convinced why we should handle mapped page specially.
> > Sometimem, someone may use mmap by reducing buffer copy compared to read
> > system call.
> > So I think we can't make sure mmaped pages are always win.
> >
> > My suggestion is that it would be better to declare by user explicitly.
> > I think we can implement it by madvise and fadvise's WILLNEED option.
> > Current implementation is just readahead if there isn't a page in memory
> > but I think
> > we can promote from inactive to active if there is already a page in
> > memory.
> >
> > It's more clear and it couldn't be affected by kernel page reclaim
> > algorithm change
> > like this.
>
> Thank you for your advice. But I still have question about this
> solution. If we improve the madvise(2) and fadvise(2)'s WILLNEED
> option, it will cause an inconsistently status for pages that be
> manipulated by madvise(2) and/or fadvise(2). For example, when I call
> madvise with WILLNEED flag, some pages will be moved into active list if
> they already have been in memory, and other pages will be read into
> memory and be saved in inactive list if they don't be in memory. Then
> pages that are in inactive list are possible to be reclaim. So from the
> view of users, it is inconsistent because some pages are in memory and
> some pages are reclaimed. But actually the user hopes that all of pages
> can be kept in memory. IMHO, this inconsistency is weird and makes users
> puzzled.
Now problem is that
1. User want to keep pages which are used once in a while in memory.
2. Kernel want to reclaim them because they are surely reclaim target
pages in point of view by LRU.
The most desriable approach is that user should use mlock to guarantee
them in memory. But mlock is too big overhead and user doesn't want to keep
memory all pages all at once.(Ie, he want demand paging when he need the page)
Right?
madvise, it's a just hint for kernel and kernel doesn't need to make sure madvise's behavior.
In point of view, such inconsistency might not be a big problem.
Big problem I think now is that user should use madvise(WILLNEED) periodically because such
activation happens once when user calls madvise. If user doesn't use page frequently after
user calls it, it ends up moving into inactive list and even could be reclaimed.
It's not good. :-(
Okay. How about adding new VM_WORKINGSET?
And reclaimer would give one more round trip in active/inactive list when reclaim happens
if the page is referenced.
Sigh. We have no room for new VM_FLAG in 32 bit.
>
> Regards,
> Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/