Re: [PATCH] fs: fix a data race in i_size_write/i_size_read

From: David Sterba
Date: Wed Feb 19 2020 - 07:18:26 EST

On Wed, Feb 19, 2020 at 10:21:46AM +0100, Marco Elver wrote:
> Right. In reality, for mainstream architectures, it appears quite unlikely.
> There may be other valid reasons, such as documenting the fact the
> write can happen concurrently with loads.
> Let's assume the WRITE_ONCE can be dropped.
> The load is a different story. While load tearing may not be an issue,
> it's more likely that other optimizations can break the code. For
> example load fusing can break code that expects repeated loads in a
> loop. E.g. I found these uses of i_size_read in loops:
> git grep -E '(for|while) \(.*i_size_read'
> fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c: for (i = 0; i < i_size_read(inode) &&
> i < offset; ) {
> fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)) {
> fs/ocfs2/dir.c: while (ctx->pos < i_size_read(inode)
> fs/squashfs/dir.c: while (length < i_size_read(inode)) {
> fs/squashfs/namei.c: while (length < i_size_read(dir)) {
> Can i_size writes happen concurrently, and if so will these break if
> the compiler decides to just do i_size_read's load once, and keep the
> result in a register?

It depends on the semantics and the behaviour when the value is not
cached in a register might be the wrong one. A concrete example with
assembly and analysis can be found in d98da49977f6 ("btrfs: save i_size
to avoid double evaluation of i_size_read in compress_file_range"),
which is the workardound mentioned in the my other mail.

actual_end = min_t(u64, i_size_read(inode), end + 1);


mov 0x20(%rsp),%rax
cmp %rax,0x48(%r15) # read
movl $0x0,0x18(%rsp)
mov %rax,%r12
mov %r14,%rax
cmovbe 0x48(%r15),%r12 # eval

Where r15 is inode and 0x48 is offset of i_size.

The original fix was to revert 62b37622718c that would do an
intermediate assignment and this would also avoid the doulble
evaluation but is not future-proof, should the compiler merge the
stores and call i_size_read anyway.

There's a patch adding READ_ONCE to i_size_read but that's not being
applied at the moment and we need to fix the bug. Instead, emulate
READ_ONCE by two barrier()s that's what effectively happens. The
assembly confirms single evaluation:

mov 0x48(%rbp),%rax # read once
mov 0x20(%rsp),%rcx
mov $0x20,%edx
cmp %rax,%rcx
cmovbe %rcx,%rax
mov %rax,(%rsp)
mov %rax,%rcx
mov %r14,%rax

Where 0x48(%rbp) is inode->i_size stored to %eax.