On Thu 07-12-23 22:15:55, Baokun Li wrote:
Thank you very much for confirming this issue!

Baokun Li wrote:
The problem occurs with a one-master, two-slave MySQL database on three
physical machines; with sysbench stress testing running on each of the
three machines, the problem occurs about once every two to three hours.

The problem is with the relay log file: when it occurs, a few dozen bytes
in the middle of the file are read back as all zeros, while the data on
disk is not zeroed. This is a journal-like file: a writer process gets the
data from the master node and writes it locally, and another replay process
reads the file and performs the corresponding replay operations (some SQL
statements). The problem is that, when replaying, the replay process finds
that the data it read is corrupted, not valid SQL data, while the data on
disk is normal.

It's not confirmed that buffered reads racing with direct IO writes is
actually causing this issue, but this is the only scenario that we can
reproduce with our local simplified scripts. Also, after merging in patch
1, the MySQL stress-test scenario has now been running for 5 days without
reproducing the problem.

I'll double-check the problem scenario, although buffered reads racing with
buffered writes don't seem to have this problem.

On 2023/12/4 20:11, Jan Kara wrote:
Yeah, from what you write it seems that the replay code is using buffered
reads on the journal file. I guess you could confirm that with a bit of
kernel tracing, but the symptoms look pretty convincing. Did you try
talking to the MySQL guys about why they are doing this?

On Mon 04-12-23 21:50:18, Baokun Li wrote:
The operations performed on the relay log file are buffered reads and
writes, which I confirmed with the following bpftrace script:

```
#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:generic_file_buffered_read
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
	printf("read path: %s\n",
	       str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}

kprobe:ext4_buffered_write_iter
/!strncmp(str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/
{
	printf("write path: %s\n",
	       str(((struct kiocb *)arg0)->ki_filp->f_path.dentry->d_name.name));
}
```

I suspected there were DIO writes causing the problem, but I haven't caught
any DIO writes to such files via bpftrace.

On 2023/12/4 22:41, Jan Kara wrote:
Interesting. Not sure how your partially zeroed-out buffers could happen
with fully buffered IO.

On Tue 05-12-23 20:50:30, Baokun Li wrote:
After looking at the code again and again, the following concurrency seems
to bypass the memory barrier:

write thread:                         read thread:

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)  // write isize 2048
    unlock_page(page)

                                      ext4_file_read_iter
                                       generic_file_read_iter
                                        filemap_read
                                         filemap_get_pages
                                          filemap_get_read_batch
                                          folio_test_uptodate(folio)
                                           ret = test_bit(PG_uptodate, folio_flags(folio, 0));
                                           if (ret)
                                            smp_rmb();
                                            // The read barrier here ensures
                                            // that data 0-2048 in the folio
                                            // is synchronized.

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)  // write isize 4096
    unlock_page(page)

                                         isize = i_size_read(inode)
                                         // read isize 4096
                                         // But there is no read barrier here,
                                         // so the data in the 2048-4096 range
                                         // may not be synchronized yet !!!
                                         copy_page_to_iter()
                                         // copyout 4096

In the concurrency above, we read the updated i_size, but there is no read
barrier to ensure that the data in the page matches that i_size at this
point. Therefore, we may copy the unsynchronized page out. Is it normal for
us to read zero-filled data in this case?

On 2023/12/7 3:37, Jan Kara wrote:
Indeed, I have checked, and filemap_read() (but this dates back even to the
old do_generic_file_read() code) does copy data only after checking the
uptodate flag and then sampling i_size, so we may be copying state from the
middle of the racing write, and indeed there is nothing which would prevent
prefetching the page data before fetching the inode size. I agree this is
kind of nasty, so I think adding a read barrier between i_size_read() and
copy_page_to_iter() makes sense. Does it fix your issue with MySQL?
Honza
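For illustration, the read barrier Jan suggests would sit in filemap_read()
roughly as in the following pseudocode sketch (this is a sketch of the
proposal as discussed in the thread, not an actual patch; surrounding lines
are abbreviated from the kernel source):

```
isize = i_size_read(inode);
if (unlikely(iocb->ki_pos >= isize))
        goto put_folios;
/*
 * Pair with the smp_wmb() in folio_mark_uptodate() on the write side:
 * ensure the folio data reads done during the copy below cannot be
 * satisfied from before the i_size sample above.
 */
smp_rmb();
...
copy_page_to_iter(...);
```

The barrier makes the "sample i_size, then copy" sequence on the read side
mirror the "write data, then update i_size" order on the write side.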