block allocator issue with ext4+DAX

From: Ross Zwisler
Date: Wed Mar 30 2016 - 18:02:16 EST


I've hit an issue in my testing which I believe to be related to the ext4
block allocator when using the DAX mount option. I originally found this
issue with the generic/102 xfstest, but have reduced it to the minimal
reproducer at the bottom of this email. I've been able to reproduce this both
with BRD and with PMEM as the underlying block device.

For this test we're running in a very small filesystem, only 512 MiB. We
fallocate() 400 MiB of that space, unlink the file, then try to rewrite that
400 MiB file one 1 MiB chunk at a time.

What actually happens is that partway through the rewrite we run out of
space: the DAX call to get_block() in dax_io() fails with -ENOSPC and the
write comes back short.
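
For reference, dax_io() resolves file offsets to blocks through the
filesystem's get_block callback, whose type in include/linux/fs.h is:

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
		struct buffer_head *bh_result, int create);

My reading - and it's only a guess - is that ext4's callback returns
-ENOSPC from the block allocator when asked to allocate blocks for the
rewrite, and that's what shows up as the short write below.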

Here are the steps to reproduce this issue:

# fdisk -l /dev/ram0
Disk /dev/ram0: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

# mkfs.ext4 /dev/ram0 512M

# mount /dev/ram0 /mnt

# gcc -o test test.c

# ./test # success!

# umount /mnt

# mount -o dax /dev/ram0 /mnt # requires CONFIG_BLK_DEV_RAM_DAX

# ./test # failure
Partial write - only 577536 written

This test succeeds with xfs, with ext2, and with ext4 without the DAX mount
option. I've also tried it with O_DIRECT, and that shows the same behavior:
we succeed without DAX and fail with DAX.
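
An O_DIRECT variant of the test only needs two small tweaks, since O_DIRECT
wants an aligned buffer: allocate with posix_memalign() instead of malloc()
and add O_DIRECT to the open() flags. Something like the following, where
the 4096-byte alignment is just a guess based on the sector size fdisk
reports above:

	if (posix_memalign(&buffer, 4096, MB(1))) {
		perror("posix_memalign");
		return 1;
	}

	fd = open("/mnt/file", O_RDWR|O_CREAT|O_DIRECT, S_IRUSR|S_IWUSR);

The rest of the program at the bottom of this email is unchanged.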

Another clue is that a sync() call in the middle of the test, between the
unlink and the subsequent writes, clears up the issue.

Something that might be related is the output in
/proc/fs/ext4/ram0/mb_groups. Here is that output when we're in a good
state, and the writes will succeed:

#group: free   frags first  [ 2^0  2^1  2^2  2^3  2^4  2^5  2^6  2^7  2^8  2^9  2^10 2^11 2^12 2^13 ]
#0    : 30673  1     2095   [ 1    0    0    0    1    0    1    1    1    1    1    0    1    3    ]
#1    : 32735  1     33     [ 1    1    1    1    1    0    1    1    1    1    1    1    1    3    ]
#2    : 28672  1     4096   [ 0    0    0    0    0    0    0    0    0    0    0    0    1    3    ]
#3    : 32735  1     33     [ 1    1    1    1    1    0    1    1    1    1    1    1    1    3    ]

Here is the output in that file when we're in a bad state, and our writes are
about to fail:

#group: free   frags first  [ 2^0  2^1  2^2  2^3  2^4  2^5  2^6  2^7  2^8  2^9  2^10 2^11 2^12 2^13 ]
#0    : 18385  1     14383  [ 1    0    0    0    1    0    1    1    1    1    1    0    0    2    ]
#1    : 2015   1     33     [ 1    1    1    1    1    0    1    1    1    1    1    0    0    0    ]
#2    : 0      0     32768  [ 0    0    0    0    0    0    0    0    0    0    0    0    0    0    ]
#3    : 2015   1     33     [ 1    1    1    1    1    0    1    1    1    1    1    0    0    0    ]

It appears as though we've exhausted group #2. Interestingly, if I run sync()
at this point it takes us from the bad output back to the good, which leads me
to believe that the blocks from the newly unlinked file are only then being
freed back into group #2 for reallocation, or something along those lines.
(I've clearly reached the limits of my ext4-fu. :) )
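
In case it helps, here's a quick throwaway helper to watch those free counts
while the test runs - it just sums the "free" column of mb_groups. The path
and line format are guessed from the output above, so treat it as a sketch:

#include <stdio.h>

int main(int argc, char *argv[])
{
	const char *path = argc > 1 ? argv[1] :
			"/proc/fs/ext4/ram0/mb_groups";
	char line[512];
	unsigned long group, free_blocks, total = 0;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		/* data lines look like "#2    : 28672  0     32768 [ ... ]" */
		if (sscanf(line, "#%lu : %lu", &group, &free_blocks) != 2)
			continue;
		printf("group %lu: %lu free blocks\n", group, free_blocks);
		total += free_blocks;
	}
	fclose(f);

	printf("total free: %lu blocks\n", total);
	return 0;
}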

I'm happy to help test proposed fixes.

Thanks,
- Ross

---
#define _GNU_SOURCE
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MB(a) ((a)*1024ULL*1024)

int main(int argc, char *argv[])
{
	int i, fd, ret;
	ssize_t nw;
	void *buffer;

	buffer = malloc(MB(1));
	if (!buffer) {
		perror("malloc");
		return 1;
	}

	/* preallocate 400 MiB of the 512 MiB filesystem */
	fd = open("/mnt/file", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0) {
		perror("fd");
		return 1;
	}

	ret = fallocate(fd, 0, 0, MB(400));
	if (ret) {
		perror("fallocate");
		return 1;
	}
	close(fd);

	unlink("/mnt/file");

	/* a sync() call here makes the DAX case of this test pass */
	// sync();

	/* recreate the file and rewrite the same 400 MiB, 1 MiB at a time */
	fd = open("/mnt/file", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0) {
		perror("fd");
		return 1;
	}

	for (i = 0; i < 400; i++) {
		nw = write(fd, buffer, MB(1));

		if (nw < 0) {
			perror("write");
			return 1;
		} else if (nw != (ssize_t)MB(1)) {
			fprintf(stderr, "Partial write - only %zd written\n",
				nw);
			return 1;
		}
	}

	close(fd);
	free(buffer);
	return 0;
}