Re: userspace pagecache management tool
From: Andrew Morton
Date: Sat Mar 03 2007 - 18:46:25 EST
On Sun, 4 Mar 2007 00:01:55 +0100 bert hubert <bert.hubert@xxxxxxxxxxxxx> wrote:
> On Sat, Mar 03, 2007 at 02:26:09PM -0800, Andrew Morton wrote:
> > > > It is *not* a global instruction. It uses setenv, so the user's policy
> > > > affects only the target process and its forked children.
> > >
> > > ... and all other processes accessing the same file(s)!
> > >
> > > Your library and the system calls may be limited to one process,
> > > but the consequences are global.
> >
> > Yes. So what? If the user wants to go and evict libc.so from pagecache
> > then he can do so - the kernel has provided syscalls with which this can be
> > done for at least seven years. Bad user, shouldn't do that.
>
> While I agree with your sentiments that userspace can have a good idea on
> how to deal with the page cache, your program does more than it claims to
> do - because of how linux implements posix_fadvise.
>
> I don't think anybody expects or desires your program to actually *evict*
> the stuff from the cache you are trying to access, which happens in case the
> data was in the cache prior to starting your program.
>
> What people expect is that a solution such as you wrote it simply won't
> *add* anything to the cache. They don't expect it will actually globally
> *remove* stuff from the cache.
>
> Making a backup this way would hurt even worse than usual with your
> pagecache management tool if the file being backed up was still being read.
>
> This is not your fault, but in practice, it makes your program less useful
> than it could be.
yup. As I said, it's a proof-of-concept. It's a project. And I have about one
free femtosecond per fortnight :(
> One could conceivably fix that up using mincore and simply not fadvise if a
> page was in core already.
Yes. Let's flesh out the backup program policy some more:
- Unconditionally invalidate output files
- on entry to read(), probe pagecache, record which pages in the range are present
- on entry to next read(), shoot down those pages from the previous read
  which weren't in pagecache.
- But we can do better! LRU the file's pages up to a certain number of pages.
- Once that point is exceeded, we need to reclaim some pages. Which
  ones? Well, we've been observing all reads, so we can record which pages
  were referenced once, and which ones were referenced multiple times so we
  can do arbitrarily complex page aging in there.
- On close(), nuke all pages which weren't in core during open(), even if
  this app referenced them multiple times.
- If the backup program decided to read its input files with mmap we're
  rather screwed. We can't intercept pagefaults so the best we can do is
  to restore the file's pagecache to its previous state on close().
  Or if it's really a problem, get control in there somehow and
  periodically poll the pagecache occupancy via mincore(), use madvise()
  then fadvise() to trim it back.
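A rough sketch of the probe-then-shoot-down steps for read(), in C. This is
only an illustration under assumptions, not what the tool does today: the
helper names probe_residency() and drop_new_pages() are made up, residency is
sampled by mapping the range and calling mincore(), and pages are dropped one
at a time with posix_fadvise(POSIX_FADV_DONTNEED) rather than in coalesced
runs. Dirty pages would need writing back first, which is not handled here.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: record which pages of [off, off+len) were in
 * pagecache.  Returns a malloc'd vector, one byte per page (bit 0 set
 * means resident), and stores the page count in *npages.  NULL on error. */
static unsigned char *probe_residency(int fd, off_t off, size_t len,
                                      size_t *npages)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    off_t start = off - (off % (off_t)psz);   /* mmap wants an aligned offset */
    size_t span = (size_t)(off - start) + len;
    size_t n = (span + psz - 1) / psz;
    unsigned char *vec;
    void *map;

    if (len == 0)
        return NULL;
    vec = malloc(n);
    if (!vec)
        return NULL;

    /* Mapping alone does not fault anything in, so it should not
     * disturb the state we are trying to observe. */
    map = mmap(NULL, span, PROT_READ, MAP_SHARED, fd, start);
    if (map == MAP_FAILED) {
        free(vec);
        return NULL;
    }
    if (mincore(map, span, vec) != 0) {
        munmap(map, span);
        free(vec);
        return NULL;
    }
    munmap(map, span);
    *npages = n;
    return vec;
}

/* Hypothetical helper: after the read, drop only those pages which were
 * NOT resident when we probed.  Returns 0, or the last fadvise error. */
static int drop_new_pages(int fd, off_t off, const unsigned char *vec,
                          size_t npages)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    off_t start = off - (off % (off_t)psz);
    int rc = 0;

    for (size_t i = 0; i < npages; i++) {
        if (vec[i] & 1)
            continue;           /* someone else had it cached: leave it */
        int err = posix_fadvise(fd, start + (off_t)(i * psz), (off_t)psz,
                                POSIX_FADV_DONTNEED);
        if (err)
            rc = err;
    }
    return rc;
}
```

The interposed read() would call probe_residency() on entry, stash the vector
per-fd, and hand it to drop_new_pages() on entry to the next read(). The same
pair, run over the whole file at open() and close(), covers the close() rule.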
That all sounds reasonably doable. It'd be pretty complex to do it
in-kernel but we could do it there too. Problem is of course that the
above strategy is explicitly optimised for the backup program and if it's
in-kernel it becomes applicable to all other workloads.