Re: [PATCH] mm: fix page_mkclean_one

From: Linus Torvalds
Date: Thu Dec 28 2006 - 15:15:32 EST

On Thu, 28 Dec 2006, Andrew Morton wrote:
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I get corruption - but the whole point is that it's very much pdflush that
should be writing these pages out.

Andrew - give my test-program a try. It can run in about 1 minute if you
have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems
to very consistently cause corruption.

What I do is:

# Make sure we write back aggressively
echo 5 > /proc/sys/vm/dirty_ratio

as root, and then just run the thing. Tons of corruption. But the
corruption goes away if I just leave the default dirty ratio alone (but
then I can increse the file size to trigger it, of course - but that
also makes the test run a lot slower).

Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as
easily (at least with the "fsync()"), but that's less to do with anything
new, and probably just because then you simply won't have any pdflushing
going on - since the kernel won't even notice that you have tons of dirty
pages ;)

It might also depend on the speed of your disk drive - the machine I test
this on has a slow 4200 rpm laptop drive in it, and that probably makes
things go south more easily. That's _especially_ true if this is related
to any "bdi_write_congested()" logic.

Now, it could also be related to various code snippets like

if (wbc->sync_mode != WB_SYNC_NONE)

if (PageWriteback(page) ||
!clear_page_dirty_for_io(page)) {

where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do
the writeback at all (but it also won't clear the dirty bit, so it's
certainly not an *OBVIOUS* bug).

We also have code like this ("pageout()"):

if (clear_page_dirty_for_io(page)) {
int res;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
res = mapping->a_ops->writepage(page, &wbc);

and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call
won't do anything at all because of congestion, then that would be a _bad_
thing, and would certainly explain how something didn't get written out.

But that particular path should only trigger for the "shrink_page_list()"
case, and it's not the case I seem to be testing with my "low dirty_ratio"

Linus#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)

static void fillmem(void *start, int nr)
memset(start, nr, CHUNKSIZE);

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
int i;
if (nr < 0 || nr >= NRCHUNKS)
return -1;
for (i = 0; i < NRCHUNKS; i++)
if (chunkorder[i] == nr)
return i;
return -2;

static void checkmem(void *buf, int nr)
unsigned int start = ~0u, end = 0;
unsigned char c = nr, *p = buf, differs = 0;
int i;
for (i = 0; i < CHUNKSIZE; i++) {
unsigned char got = *p++;
if (got != c) {
if (i < start)
start = i;
if (i > end)
end = i;
differs = got;
if (start < end) {
printf("Chunk %d corrupted (%u-%u) (%x-%x) \n", nr, start, end,
page_offset(buf, start), page_offset(buf, end));
printf("Expected %u, got %u\n", c, differs);
printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));

static char *remap(int fd, char *mapping)
if (mapping) {
munmap(mapping, SIZE);
// fsync(fd);
posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);

int main(int argc, char **argv)
int fd, i;

* Make some random ordering of writing the chunks to the
* memory map..
* Start with fully ordered..
for (i = 0; i < NRCHUNKS; i++)
chunkorder[i] = i;

/* ..and then mix it up randomly */
for (i = 0; i < NRCHUNKS; i++) {
int index = (unsigned int) random() % NRCHUNKS;
int nr = chunkorder[index];
chunkorder[index] = chunkorder[i];
chunkorder[i] = nr;

fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
if (fd < 0)
return -1;
if (ftruncate(fd, SIZE) < 0)
return -1;
mapping = remap(fd, NULL);
if (-1 == (int)(long)mapping)
return -1;

for (i = 0; i < NRCHUNKS; i++) {
int chunk = chunkorder[i];
printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
fillmem(mapping + chunk * CHUNKSIZE, chunk);

/* Unmap, drop, and remap.. */
mapping = remap(fd, mapping);

/* .. and check */
for (i = 0; i < NRCHUNKS; i++) {
int chunk = i;
printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
checkmem(mapping + chunk * CHUNKSIZE, chunk);

/* Clean up for next time */
munmap(mapping, SIZE);

return 0;