Re: [PATCH v2] buffer: Fix I/O error due to ARM read-after-read hazard
From: Catalin Marinas
Date: Tue Nov 12 2019 - 11:09:03 EST
On Tue, Nov 12, 2019 at 02:02:44PM +0100, Vincent Whitchurch wrote:
> On my dual-core ARM Cortex-A9, reading from squashfs (over
> dm-verity/ubi/mtd) in a loop for hundreds of hours invariably results in
> a read failure in squashfs_read_data(). The errors occur because the
> buffer_uptodate() check fails after wait_on_buffer(). Further debugging
> shows that the bh was in fact uptodate and that there is no actual I/O
> error in the lower layers.
>
> The problem is caused by the read-after-read hazards in the ARM
> Cortex-A9 MPCore (erratum #761319, see [1]). The code generated by the
> compiler for the combination of the wait_on_buffer() and
> buffer_uptodate() calls reads the flags value twice from memory (see the
> excerpt of the assembly below). The new value of the BH_Lock flag is
> seen but the new value of BH_Uptodate is not even though both the bits
> are read from the same memory location.
>
> 27c: 9d08 ldr r5, [sp, #32]
> 27e: 2400 movs r4, #0
> 280: e006 b.n 290 <squashfs_read_data+0x290>
> 282: 6803 ldr r3, [r0, #0]
> 284: 07da lsls r2, r3, #31
> 286: f140 810d bpl.w 4a4 <squashfs_read_data+0x4a4>
> 28a: 3401 adds r4, #1
> 28c: 42bc cmp r4, r7
> 28e: da08 bge.n 2a2 <squashfs_read_data+0x2a2>
> 290: f855 0f04 ldr.w r0, [r5, #4]!
> 294: 6803 ldr r3, [r0, #0]
> 296: 0759 lsls r1, r3, #29
> 298: d5f3 bpl.n 282 <squashfs_read_data+0x282>
> 29a: f7ff fffe bl 0 <__wait_on_buffer>
>
> Work around this problem by adding a DMB between the two reads of
> bh->flags, as recommended in the ARM document. With this barrier, no
> failures have been seen in more than 5000 hours of the same test.
>
> [1] http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf
I thought we were going to fix the compiler. I found an old thread here:
https://gcc.gnu.org/ml/gcc-patches/2014-06/msg00714.html
Also cc'ing Richard Earnshaw as he may have been involved in the gcc
discussion at the time.
While you can add some barrier here, there may be other cases where this
can go wrong.
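For illustration only, here is a minimal userspace sketch of the access
pattern being discussed. It is neither the kernel code nor the proposed
patch. The bit numbers follow the shifts in the quoted assembly (bit 2 for
BH_Lock, bit 0 for BH_Uptodate); struct fake_bh, enum bh_result and both
function names are hypothetical. The first variant keeps two loads of the
flags word and puts a full fence between them, which GCC or Clang would be
expected to emit as a DMB on ARMv7. The second variant takes one snapshot
and tests both bits from it, so there is no second load to go stale.

```c
/* Bit positions as implied by the quoted assembly. */
#define BH_UPTODATE 0
#define BH_LOCK     2

/* Hypothetical stand-in for the flags word of a buffer head. */
struct fake_bh {
	volatile unsigned long state;
};

/* Hypothetical result codes for this sketch. */
enum bh_result {
	BH_RES_LOCKED,        /* caller would still have to wait */
	BH_RES_UPTODATE,      /* read completed successfully */
	BH_RES_NOT_UPTODATE   /* would be reported as an I/O error */
};

/*
 * Variant 1: two loads of the same location with a barrier in between.
 * Without the fence, the second load is the one that could observe a
 * stale value on an affected core.
 */
static enum bh_result check_two_reads(struct fake_bh *bh)
{
	unsigned long first = bh->state;            /* load #1: lock bit */

	if (first & (1UL << BH_LOCK))
		return BH_RES_LOCKED;

	/* Full fence; assumed to become a DMB on ARMv7 toolchains. */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);

	unsigned long second = bh->state;           /* load #2: uptodate bit */

	return (second & (1UL << BH_UPTODATE)) ? BH_RES_UPTODATE
					       : BH_RES_NOT_UPTODATE;
}

/*
 * Variant 2: a single load, both bits tested from the same snapshot.
 * This relies on the writer setting the uptodate bit before clearing
 * the lock bit, so any value with the lock bit clear is complete.
 */
static enum bh_result check_snapshot(struct fake_bh *bh)
{
	unsigned long snap = bh->state;             /* the only load */

	if (snap & (1UL << BH_LOCK))
		return BH_RES_LOCKED;

	return (snap & (1UL << BH_UPTODATE)) ? BH_RES_UPTODATE
					     : BH_RES_NOT_UPTODATE;
}
```

On a machine without the erratum both variants return the same result for
any flags value. The sketch shows only the shape of the two loads and does
not reproduce the failure.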
--
Catalin