Re: MADV_FREE functionality

From: Jakub Jelinek
Date: Tue May 01 2007 - 03:15:13 EST


On Mon, Apr 30, 2007 at 06:18:39PM -0700, Andrew Morton wrote:
> > In short:
> > - both MADV_FREE and MADV_DONTNEED only unmap file pages
> > - after MADV_DONTNEED the application will always get back
> > fresh zero filled anonymous pages when accessing the
> > memory
> > - after MADV_FREE the application can either get back the
> > original data (without a page fault) or zero filled
> > anonymous memory
> >
> > The Linux MADV_DONTNEED behavior is not POSIX compliant.
> > POSIX says that with MADV_DONTNEED the application's data
> > will be preserved.
> >
> > Currently glibc simply ignores POSIX_MADV_DONTNEED requests
> > from applications on Linux. Changing the behaviour which
> > some Linux applications may rely on might not be the best
> > idea.
>
> OK, thanks. I stuck that in the changelog.

FYI, Solaris man page on MADV_FREE says:

MADV_FREE
Tells the kernel that contents in the specified
address range are no longer important and the range
will be overwritten. When there is demand for memory,
the system will free pages associated with the speci-
fied address range. In this instance, the next time a
page in the address range is referenced, it will con-
tail all zeroes. Otherwise, it will contain the data
that was there prior to the MADV_FREE call. References
made to the address range will not make the system
read from backing store (swap space) until the page is
modified again.

This value cannot be used on mappings that have under-
lying file objects.

The last paragraph seems to be just about the operation being
undefined, madvise MADV_FREE on MAP_SHARED file mapping returns 0
rather than flagging an error.

FreeBSD man page:

MADV_FREE Gives the VM system the freedom to free pages, and tells
the system that information in the specified page range
is no longer important. This is an efficient way of
allowing malloc(3) to free pages anywhere in the address
space, while keeping the address space valid. The next
time that the page is referenced, the page might be
demand zeroed, or might contain the data that was there
before the MADV_FREE call. References made to that
address space range will not make the VM system page the
information back in from backing store until the page is
modified again.

> Also, where did we end up with the Solaris compatibility?
>
> The patch I have at present retains MADV_FREE=0x05 for sparc and sparc64
> which should be good.
>
> Did we decide that the Solaris and Linux implementations of MADV_FREE are
> compatible?

SPARC Solaris binary compatibility in Linux is in really bad shape, madvise
in Solaris is implemented using memcntl syscall (at least according to truss(1))
and that syscall is
systbl.S: .word solaris_unimplemented /* memcntl 131 */
When/if anyone decides to put more effort into the Solaris binary compatibility
(I'm quite doubtful anyone will), codes which don't match can be simply translated into
other codes, ignored etc., we can't use sys_madvise to implement memcntl
syscall anyway. While Solaris MADV_FREE is the same as Linux MADV_FREE proposed
by Rik (except perhaps the documented undefined behavior with file mappings,
on
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

int
main (void)
{
getpid ();
int fd = open ("test", O_RDWR);
void *p = mmap (0, 8192, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
memset (p, ' ', 8192);
madvise (p, 8192, MADV_FREE);
return 0;
}
on Solaris the spaces actually made it into the file), MADV_DONTNEED is not,
but that doesn't really matter except for arch/sparc*/solaris/ layer if anyone
cares. We certainly can't change current MADV_DONTNEED behavior, all we
can do is implement a new MADV_* code with a different behavior and let glibc
translate POSIX_MADV_* codes on posix_madvise to the Linux specific MADV_*.

Jakub
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/