Re: [PATCH v3] ext4: fix indirect punch hole corruption

From: Omar Sandoval
Date: Mon Feb 09 2015 - 16:28:19 EST


On Mon, Feb 09, 2015 at 03:03:56PM -0600, Chris J Arges wrote:
> On 02/09/2015 12:21 PM, Chris J Arges wrote:
> > On 02/08/2015 06:15 AM, Omar Sandoval wrote:
> >> Commit 4f579ae7de56 (ext4: fix punch hole on files with indirect
> >> mapping) rewrote FALLOC_FL_PUNCH_HOLE for ext4 files with indirect
> >> mapping. However, there are a bugs in a few cases.
> >>
> >> In the case where the punch happens within one level of indirection, we
> >> expect the start and end shared branches to converge on an indirect
> >> block. However, because the branches returned from ext4_find_shared do
> >> not necessarily start at the same level (e.g., the partial2 chain will
> >> be shallower if the last block occurs at the beginning of an indirect
> >> group), the walk of the two chains can end up "missing" each other and
> >> freeing a bunch of extra blocks in the process. This mismatch can be
> >> handled by first making sure that the chains are at the same level, then
> >> walking them together until they converge.
> >>
> >> In the case that a punch spans different levels of indirection, the
> >> original code skips freeing the intermediate indirect trees if the last
> >> block is the first triply-indirected block because it returns instead of
> >> jumping to do_indirects. Additionally, a non-zero nr2 does not mean that
> >> there's nothing else to free at the level of partial2: consider the case
> >> where the all_zeroes in ext4_find_shared backed up the shared branch.
> >>
> >> Signed-off-by: Omar Sandoval <osandov@xxxxxxxxxxx>
> >
> > Omar,
> > With this patch I no longer seem to be getting the original corruption I
> > detected with my test case; however eventually I do get errors when
> > trying to delete qcow2 snapshots. After getting these errors if I run
> > 'qemu-img check <image>' I see the following errors:
> >
> > ERROR OFLAG_COPIED data cluster: l2_entry=800000018f7f0000 refcount=0
> > ERROR OFLAG_COPIED data cluster: l2_entry=800000018f800000 refcount=0
> > ERROR OFLAG_COPIED data cluster: l2_entry=800000018f810000 refcount=0
> >
> > 16941 errors were found on the image.
> > Data may be corrupted, or further writes to the image may corrupt it.
> >
> > 60459 leaked clusters were found on the image.
> > This means waste of disk space, but no harm to data.
> > 88629/262144 = 33.81% allocated, 9.57% fragmented, 0.00% compressed clusters
> > Image end offset: 10438180864
> >
> > So this patch seems to have moved the problem. I can collect additional
> > logs if necessary.
> >
> > Thanks,
> > --chris j arges
> >
>
> After ignoring snapshot deletion errors, I've hit the original
> corruption problem with your patch still. I'll continue debugging this.
> --chris j arges
>
Chris,

Thanks for testing, and sorry about this game of whack-a-mole. Looks
like there's at least one more bug in the n == n2 case (if nr != 0, we
sometimes, but not always, need to free the subtree referred to by it.)
I'll send you another patch as soon as I have the proper fix.

P.S. The fpunch xfstests don't cover these bugs. I'm attaching the
slightly convoluted script I've been using for testing. It's a Python
script that generates a shell script which I then run on the test
machine. The interesting stuff is in known_bugs(), the first three of
which are fixed by the last patch I sent out.

--
Omar
#!/usr/bin/env python3

import hashlib

NDIR = 12
BLOCK_SIZE = 1024
ADDRS = (BLOCK_SIZE // 4)

ALLOCED_BLOCK = b'\xcd' * BLOCK_SIZE
HOLE_BLOCK = b'\x00' * BLOCK_SIZE

TEST_CASE = 0

def test_case(file_blocks, *args):
global TEST_CASE
test_file = '/mnt/test/test.%d' % TEST_CASE
TEST_CASE += 1

print('rm -f %s' % test_file)
print("xfs_io -f -c 'pwrite 0 %d' %s > /dev/null" % (file_blocks * BLOCK_SIZE, test_file))

blocks = [True] * file_blocks
for arg in args:
punch_start, punch_end = arg
offset = punch_start * BLOCK_SIZE
length = punch_end * BLOCK_SIZE - offset
print("xfs_io -c 'fpunch %d %d' %s" % (offset, length, test_file))
for i in range(punch_start, punch_end):
blocks[i] = False

print('umount /mnt/test')
print('mount /mnt/test')

print('diff - <(_fiemap %s) << "EOF"' % test_file)

m = hashlib.md5()
num = 0
extent_start = None
extent_type = None
for i, alloced in enumerate(blocks):
m.update(ALLOCED_BLOCK if alloced else HOLE_BLOCK)
if alloced != extent_type:
if extent_type is not None:
start = extent_start * (BLOCK_SIZE // 512)
end = i * (BLOCK_SIZE // 512) - 1
print('%d: [%d..%d]: %s' % (num, start, end, 'extent' if extent_type else 'hole'))
num += 1
extent_start = i
extent_type = alloced
if extent_type is not None:
if extent_start > 0 or extent_type:
start = extent_start * (BLOCK_SIZE // 512)
end = len(blocks) * (BLOCK_SIZE // 512) - 1
print('%d: [%d..%d]: %s' % (num, start, end, 'extent' if extent_type else 'hole'))
print('EOF')

print('diff - <(md5sum %s) << "EOF"' % test_file)
print('%s %s' % (m.hexdigest(), test_file))
print('EOF')
print()

def simple_tests():
# Punch within direct level.
print("echo 'n = 1, n2 = 1'")
test_case(NDIR, (2, NDIR - 2))
test_case(NDIR, (0, NDIR // 2))

# Punch from direct level to starts of indirect levels.
print("echo 'n2 > n'")
print("echo ' n = 1'")
test_case(NDIR + 1, (0, NDIR)) # Start of direct.
test_case(NDIR + ADDRS + 1, (0, NDIR + ADDRS))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 1, (0, NDIR + ADDRS + ADDRS * ADDRS))
test_case(NDIR + 1, (2, NDIR)) # Middle of direct.
test_case(NDIR + ADDRS + 1, (2, NDIR + ADDRS))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 1, (2, NDIR + ADDRS + ADDRS * ADDRS))

# Punch from direct level into indirect levels.
test_case(NDIR + 2, (0, NDIR + 1)) # Start of direct.
test_case(NDIR + ADDRS + 2, (0, NDIR + ADDRS + 1))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 2, (0, NDIR + ADDRS + ADDRS * ADDRS + 1))
test_case(NDIR + 2, (2, NDIR + 1)) # Middle of direct
test_case(NDIR + ADDRS + 2, (2, NDIR + ADDRS + 1))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 2, (2, NDIR + ADDRS + ADDRS * ADDRS + 1))

# Punch from indirect level into another indirect level.
print("echo ' n = 2'")
# IND -> DIND, IND -> TIND, start of end.
test_case(NDIR + ADDRS + 1, (NDIR, NDIR + ADDRS)) # Start of indirect.
test_case(NDIR + ADDRS + ADDRS * ADDRS + 1, (NDIR, NDIR + ADDRS + ADDRS * ADDRS))
test_case(NDIR + ADDRS + 1, (NDIR + 2, NDIR + ADDRS)) # Middle of indirect.
test_case(NDIR + ADDRS + ADDRS * ADDRS + 1, (NDIR + 2, NDIR + ADDRS + ADDRS * ADDRS))

# Middle of end.
test_case(NDIR + ADDRS + 2, (NDIR, NDIR + ADDRS + 1)) # Start of indirect.
test_case(NDIR + ADDRS + ADDRS * ADDRS + 2, (NDIR, NDIR + ADDRS + ADDRS * ADDRS + 1))
test_case(NDIR + ADDRS + 2, (NDIR + 2, NDIR + ADDRS + 1)) # Middle of indirect.
test_case(NDIR + ADDRS + ADDRS * ADDRS + 2, (NDIR + 2, NDIR + ADDRS + ADDRS * ADDRS + 1))


# DIND -> TIND
print("echo ' n = 3'")
test_case(NDIR + ADDRS + ADDRS * ADDRS + 1, (NDIR + ADDRS, NDIR + ADDRS + ADDRS * ADDRS)) # Start of indirect.
test_case(NDIR + ADDRS + ADDRS * ADDRS + 1, (NDIR + ADDRS + 2, NDIR + ADDRS + ADDRS * ADDRS))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 2, (NDIR + ADDRS, NDIR + ADDRS + ADDRS * ADDRS + 1)) # Middle of indirect.
test_case(NDIR + ADDRS + ADDRS * ADDRS + 2, (NDIR + ADDRS + 2, NDIR + ADDRS + ADDRS * ADDRS + 1))

# Punch within the same indirect level.
print("echo 'n = 2, n2 = 2'")
# IND
test_case(NDIR + ADDRS, (NDIR, NDIR + ADDRS - 1))
test_case(NDIR + ADDRS, (NDIR + 4, NDIR + ADDRS - 1))

# DIND
print("echo 'n = 3, n2 = 3'")
test_case(NDIR + ADDRS + ADDRS * ADDRS, (NDIR + ADDRS, NDIR + ADDRS + ADDRS * ADDRS - 1))
test_case(NDIR + ADDRS + ADDRS * ADDRS, (NDIR + ADDRS + 4, NDIR + ADDRS + ADDRS * ADDRS - 1))

# TIND
print("echo 'n = 4, n2 = 4'")
test_case(NDIR + ADDRS + ADDRS * ADDRS + 20, (NDIR + ADDRS + ADDRS * ADDRS, NDIR + ADDRS + ADDRS * ADDRS + 20))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 20, (NDIR + ADDRS + ADDRS * ADDRS + 4, NDIR + ADDRS + ADDRS * ADDRS + 20))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 20, (NDIR + ADDRS + ADDRS * ADDRS, NDIR + ADDRS + ADDRS * ADDRS + 19))
test_case(NDIR + ADDRS + ADDRS * ADDRS + 20, (NDIR + ADDRS + ADDRS * ADDRS + 4, NDIR + ADDRS + ADDRS * ADDRS + 19))

def known_bugs():
print("echo 'n == n2, uneven partials'")
test_case(NDIR + ADDRS + 2 * ADDRS, (NDIR + ADDRS + ADDRS // 2, NDIR + ADDRS + ADDRS))

print("echo 'n <= 2, n2 == 4, end is first triply-indirect block'")
test_case(NDIR + ADDRS + ADDRS * ADDRS + 4, (0, NDIR + ADDRS + ADDRS * ADDRS))

print("echo 'n < n2, nr2 != 0 because all_zeroes'")
test_case(NDIR + ADDRS + 2 * ADDRS + 5, (NDIR + ADDRS + 2 * ADDRS, NDIR + ADDRS + 2 * ADDRS + 4),
(0, NDIR + ADDRS + 2 * ADDRS + 4))

print("echo 'n == n2, nr != 0 because all_zeroes'")
test_case(NDIR + ADDRS + 4 * ADDRS, (NDIR + ADDRS + ADDRS, NDIR + ADDRS + ADDRS + 1),
(NDIR + ADDRS + ADDRS + 1, NDIR + ADDRS + 3 * ADDRS))

if __name__ == '__main__':
print('#!/bin/bash')

print(r"""
_coalesce_extents () {
awk -F: '
{
range = $2;
type = $3;

split(range, bounds, "[\\[ \\.\\]]");
low = bounds[3];
high = bounds[5];

if (type != prev_type) {
if (prev_type != "")
printf("%u]:%s\n", low - 1, prev_type);
printf("%u: [%u..", out_count++, low);
prev_type = type;
}
}
END {
if (prev_type != "")
printf("%u]:%s\n", high, prev_type);
}'
}

_filter_hole_fiemap () {
awk --posix '
$3 ~ /hole/ {
print $1, $2, $3;
next;
}
$5 ~ /0x[[:xdigit:]]+/ {
print $1, $2, "extent";
}' |
_coalesce_extents
}

_fiemap () {
xfs_io -c 'fiemap -v' "$1" | _filter_hole_fiemap
}
""")

# simple_tests()
known_bugs()