Re: writev data loss bug in (at least) 2.6.31 and 2.6.32pre8 x86-64

From: James Y Knight
Date: Wed Dec 02 2009 - 16:24:20 EST


On Dec 1, 2009, at 11:03 AM, Jan Kara wrote:
> On Tue 01-12-09 15:35:59, Jan Kara wrote:
>> On Tue 01-12-09 12:42:45, Mike Galbraith wrote:
>>> I bisected it this morning. Bisected cleanly to...
>>>
>>> 9eaaa2d5759837402ec5eee13b2a97921808c3eb is the first bad commit
>> OK, I've debugged it. This commit is really at fault. The problem is the
>> following:
>> When using writev, the page we copy from is not paged in (whereas when we
>> use an ordinary write, it is paged in). This difference might be worth
>> investigating on its own (as it is likely to heavily impact the performance
>> of writev) but is irrelevant for us now - we should handle this without data
>> corruption anyway. Because the source page is not available, we pass 0 as
>> the number of copied bytes to write_end and thus ext3_write_end decides to
>> truncate the file to its original size. This is perfectly fine. The problem is
>> that we do this via ext3_truncate(), which just frees the corresponding block
>> but does not unmap buffers. So we leave mapped buffers beyond i_size (they
>> were never actually inside i_size) but the blocks they are mapped to are
>> already free. The write is then retried (after mapping the page),
>> block_write_begin() sees the buffer is mapped (although it is beyond
>> i_size) and thus it does not call ext3_get_block() anymore. So as a result,
>> data is written to a block that is no longer allocated to the file. Bummer
>> - welcome filesystem corruption.
>> Ext4 also has this problem but delayed allocation mitigates the effect to
>> an error in accounting of blocks reserved for delayed allocation and thus
>> under normal circumstances nothing bad happens.
>> The question is how to solve this in the cleanest way. We can call
>> vmtruncate() instead of ext3_truncate() as we used to do but Nick wants to
>> get rid of that (that's why I originally changed the code to what it is
>> now). So probably we could just manually call truncate_pagecache() instead.
>> Nick, I think your truncate calling sequence patch set needs a similar fix
>> for all filesystems as well.
> The patch below fixes the issue for me...
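
For anyone trying to follow the sequence Jan describes, here is a rough
userspace toy model of it. This is not kernel code: every name in it
(toy_fs, toy_buffer, toy_write_begin and so on) is made up for
illustration, and the unmap_on_truncate flag only stands in for whatever
the real fix does to drop the stale buffer mapping. It assumes a single
one-block write past i_size.

```c
/* Toy userspace model of the failure sequence; NOT kernel code.
 * All names are hypothetical and exist only for illustration. */
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8

struct toy_fs {
    bool block_in_use[NBLOCKS]; /* allocation bitmap */
    long i_size;                /* file size, counted in blocks */
};

struct toy_buffer {
    bool mapped;  /* buffer believes it has a disk block */
    int  blocknr; /* which block; meaningful only when mapped */
};

/* Stand-in for a get_block call: allocate the first free block. */
static int toy_get_block(struct toy_fs *fs)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (!fs->block_in_use[i]) {
            fs->block_in_use[i] = true;
            return i;
        }
    }
    return -1;
}

/* Only allocates when the buffer is not already mapped. */
static void toy_write_begin(struct toy_fs *fs, struct toy_buffer *bh)
{
    if (!bh->mapped) {
        int nr = toy_get_block(fs);
        if (nr >= 0) {
            bh->blocknr = nr;
            bh->mapped = true;
        }
    }
}

/* copied == 0 models "source page not available": truncate back. */
static void toy_write_end(struct toy_fs *fs, struct toy_buffer *bh,
                          int copied, bool unmap_on_truncate)
{
    if (copied == 0) {
        if (bh->mapped)
            fs->block_in_use[bh->blocknr] = false; /* block is freed */
        if (unmap_on_truncate)
            bh->mapped = false; /* fixed behaviour: forget the mapping */
    } else {
        fs->i_size += 1;
    }
}

/* True if the retried write lands in a block still owned by the file. */
static bool toy_retried_write_is_safe(bool unmap_on_truncate)
{
    struct toy_fs fs = {0};
    struct toy_buffer bh = {0};

    toy_write_begin(&fs, &bh);
    assert(bh.mapped);
    toy_write_end(&fs, &bh, 0, unmap_on_truncate); /* first try copies 0 */
    toy_write_begin(&fs, &bh);                     /* retry */
    toy_write_end(&fs, &bh, 1, unmap_on_truncate);
    return fs.block_in_use[bh.blocknr];
}
```

With unmap_on_truncate false, the retry skips allocation because the
buffer still looks mapped, so the data ends up in a block the bitmap
considers free. With it true, the retry allocates again and the block is
owned by the file.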

Thank you! I can confirm that the patch fixes the issue in my real application as well.
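
In case it is useful to others, below is a minimal sketch of the kind of
write pattern involved: a writev() whose second segment is a freshly
mmap'd file region that has not been touched yet. It is not my real
application and not the original test case; the function name
writev_roundtrip_check, the file name templates and the 64 KiB length are
all arbitrary choices for illustration. Note that it only checks the round
trip as seen through the page cache, so it may well not detect the
on-disk problem described above without a remount or cache drop.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if the data read back matches,
 * 1 on mismatch, -1 on any setup or I/O error (including short I/O). */
static int writev_roundtrip_check(const char *dir)
{
    enum { LEN = 1 << 16, HDR_LEN = 4 };
    static const char hdr[HDR_LEN + 1] = "hdr:";
    char src_path[4096], dst_path[4096];
    char *pat = NULL, *back = NULL;
    void *src = MAP_FAILED;
    struct iovec iov[2];
    int sfd = -1, dfd = -1, rc = -1;

    assert(dir != NULL);
    snprintf(src_path, sizeof src_path, "%s/wv-src-XXXXXX", dir);
    snprintf(dst_path, sizeof dst_path, "%s/wv-dst-XXXXXX", dir);

    pat = malloc(LEN);
    back = malloc(HDR_LEN + LEN);
    if (!pat || !back)
        goto out;
    for (int i = 0; i < LEN; i++)
        pat[i] = (char)('A' + i % 26);

    /* Source file holding a known pattern. */
    sfd = mkstemp(src_path);
    if (sfd < 0 || write(sfd, pat, LEN) != LEN)
        goto out;

    /* Map it but do not touch it, so the pages are not yet faulted
     * into this process when writev() runs. */
    src = mmap(NULL, LEN, PROT_READ, MAP_PRIVATE, sfd, 0);
    if (src == MAP_FAILED)
        goto out;

    dfd = mkstemp(dst_path);
    if (dfd < 0)
        goto out;

    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = HDR_LEN;
    iov[1].iov_base = src;
    iov[1].iov_len = LEN;
    if (writev(dfd, iov, 2) != HDR_LEN + LEN)
        goto out;

    /* Read the destination back and compare against the pattern. */
    if (pread(dfd, back, HDR_LEN + LEN, 0) != HDR_LEN + LEN)
        goto out;
    rc = (memcmp(back, hdr, HDR_LEN) == 0 &&
          memcmp(back + HDR_LEN, pat, LEN) == 0) ? 0 : 1;

out:
    if (src != MAP_FAILED)
        munmap(src, LEN);
    if (dfd >= 0) { close(dfd); unlink(dst_path); }
    if (sfd >= 0) { close(sfd); unlink(src_path); }
    free(back);
    free(pat);
    return rc;
}
```

Call it with a directory on the filesystem under test, for example
writev_roundtrip_check("/tmp"), and treat anything other than 0 as worth a
closer look.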

James