Daniel McNeil <daniel@xxxxxxxx> wrote:
I have found (finally) the problem causing DIO reads racing with
buffered writes to see uninitialized data on ext3 file systems (which is what I have been testing on).
What kernel? If -mm, is this the only remaining buffered-vs-direct
problem?
The problem is caused by the changes to __block_write_full_page()
and a race with journaling:
journal_commit_transaction() -> ll_rw_block() -> submit_bh()
ll_rw_block() locks the buffer, clears buffer dirty and calls
submit_bh()
A racing __block_write_full_page() (from ext3_ordered_writepage())
would see that buffer_dirty() is not set because the i/o
is still in flight, so it would not do a submit_bh()
It would SetPageWriteback() and unlock_page() and then
see that no i/o was submitted and call end_page_writeback()
(with the i/o still in flight).
This would allow the DIO code to issue the DIO read while buffer writes
are still in flight. The i/o can be reordered by i/o scheduling and
the DIO can complete BEFORE the writebacks complete. Thus the DIO
sees the old uninitialized data.
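
The sequence above can be modelled in a few lines of userspace C. This is only a toy sketch of the ordering, not kernel code: every name here (toy_buffer, toy_page, toy_journal_commit, toy_writepage, toy_io_complete, toy_dio_would_read_stale, toy_race_demo) is hypothetical, and the assumption that the DIO read is gated only on the page writeback flag is a simplification of the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for a buffer_head and its page; NOT the kernel structures. */
struct toy_buffer {
    bool dirty;         /* models buffer_dirty() */
    bool io_in_flight;  /* models a submitted write that has not completed */
};

struct toy_page {
    bool writeback;     /* models PageWriteback() */
    struct toy_buffer bh;
};

/* Journal commit path: clear dirty and start the write on the buffer. */
static void toy_journal_commit(struct toy_page *pg)
{
    if (pg->bh.dirty) {
        pg->bh.dirty = false;
        pg->bh.io_in_flight = true;
    }
}

/* Writepage path: only dirty buffers are submitted. If nothing was
 * submitted, writeback is ended immediately. Returns the submit count. */
static int toy_writepage(struct toy_page *pg)
{
    int submitted = 0;

    pg->writeback = true;
    if (pg->bh.dirty) {
        pg->bh.dirty = false;
        pg->bh.io_in_flight = true;
        submitted++;
    }
    if (submitted == 0)
        pg->writeback = false;  /* ended with the journal's write still in flight */
    return submitted;
}

/* Completion of the write: the buffer is on disk and writeback is over. */
static void toy_io_complete(struct toy_page *pg)
{
    pg->bh.io_in_flight = false;
    pg->writeback = false;
}

/* Assumption: the DIO read only waits on the page writeback flag, so it can
 * read stale disk contents when a write is in flight but the flag is clear. */
static bool toy_dio_would_read_stale(const struct toy_page *pg)
{
    return !pg->writeback && pg->bh.io_in_flight;
}

/* Replays the racy ordering: journal commit first, then writepage. */
static bool toy_race_demo(void)
{
    struct toy_page pg = {
        .writeback = false,
        .bh = { .dirty = true, .io_in_flight = false },
    };
    int submitted;

    toy_journal_commit(&pg);
    submitted = toy_writepage(&pg);
    assert(submitted == 0);  /* writepage found nothing dirty to submit */
    return toy_dio_would_read_stale(&pg);
}
```

In this model toy_race_demo() returns true, the window described above. If toy_writepage() runs on a still-dirty buffer instead, the writeback flag stays set until toy_io_complete(), and the window does not open.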
ow. How'd you work this out?
Here is a quick hack that fixes it, but I am not sure if this is the
proper long term fix.
The problem is that ext3 and the VFS are using different paradigms. VFS is
all page-based, but ext3 is all block-based. One or the other needs to do
something nasty.
One approach would be to change the JBD write_out_data_locked loop to use
block_write_full_page(bh->b_page) instead of buffer_head operations. But
that could get hairy with blocksize < PAGE_SIZE.
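
To make the blocksize < PAGE_SIZE concern concrete, here is a small userspace sketch. It is one possible reading of the difficulty, not a statement about the JBD code: it assumes a 4096-byte page (TOY_PAGE_SIZE) and assumes a page-level write submits every dirty buffer on the page, not just the one the caller cared about. Both helper names are hypothetical.

```c
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Number of block-sized buffers that share one page. */
static unsigned toy_buffers_per_page(unsigned blocksize)
{
    return TOY_PAGE_SIZE / blocksize;
}

/* Count the dirty buffers, other than `target`, that a whole-page write
 * would also submit under the assumption stated above. */
static unsigned toy_extra_buffers_written(const bool *dirty, unsigned nbuf,
                                          unsigned target)
{
    unsigned extra = 0;

    for (unsigned i = 0; i < nbuf; i++) {
        if (i != target && dirty[i])
            extra++;
    }
    return extra;
}
```

With 1024-byte blocks there are four buffers per page, so writing the page on behalf of one buffer may drag its dirty neighbours along. With blocksize equal to the page size there is one buffer per page and the two views coincide.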
Thanks for working this out. Let me ponder it for a bit.