Re: [PATCH] hwmon: (nzxt-kraken2) mark and order concurrent accesses
From: Guenter Roeck
Date: Tue Mar 30 2021 - 01:44:56 EST
On 3/29/21 8:16 PM, Jonas Malaco wrote:
> On Mon, Mar 29, 2021 at 06:01:00PM -0700, Guenter Roeck wrote:
>> On 3/29/21 5:21 PM, Jonas Malaco wrote:
>>> On Mon, Mar 29, 2021 at 02:53:39PM -0700, Guenter Roeck wrote:
>>>> On Mon, Mar 29, 2021 at 05:22:01AM -0300, Jonas Malaco wrote:
>>>>> To avoid a spinlock, the driver explores concurrent memory accesses
>>>>> between _raw_event and _read, having the former updating fields on a
>>>>> data structure while the latter could be reading from them. Because
>>>>> these are "plain" accesses, those are data races according to the Linux
>>>>> kernel memory model (LKMM).
>>>>>
>>>>> Data races are undefined behavior in both C11 and LKMM. In practice,
>>>>> the compiler is free to make optimizations assuming there is no data
>>>>> race, including load tearing, load fusing and many others,[1] most of
>>>>> which could result in corruption of the values reported to user-space.
>>>>>
>>>>> Prevent undesirable optimizations to those concurrent accesses by
>>>>> marking them with READ_ONCE() and WRITE_ONCE(). This also removes the
>>>>> data races, according to the LKMM, because both loads and stores to each
>>>>> location are now "marked" accesses.
>>>>>
>>>>> As a special case, use smp_load_acquire() and smp_load_release() when
>>>>> loading and storing ->updated, as it is used to track the validity of
>>>>> the other values, and thus has to be stored after and loaded before
>>>>> them. These imply READ_ONCE()/WRITE_ONCE() but also ensure the desired
>>>>> order of memory accesses.
>>>>>
>>>>> [1] https://lwn.net/Articles/793253/
>>>>>
>>>>
>>>> I think you lost me a bit there. What out-of-order accesses that would be
>>>> triggered by a compiler optimization are you concerned about here ?
>>>> The only "problem" I can think of is that priv->updated may have been
>>>> written before the actual values. The impact would be ... zero. An
>>>> attribute read would return "stale" data for a few microseconds.
>>>> Why is that a concern, and what difference does it make ?
>>>
>>> The impact of out-of-order accesses to priv->updated is indeed minimal.
>>>
>>> That said, smp_load_acquire() and smp_store_release() were meant to
>>> prevent reordering at runtime, and only affect architectures other than
>>> x86. READ_ONCE() and WRITE_ONCE() would already prevent reordering from
>>> compiler optimizations, and x86 provides the load-acquire/store-release
>>> semantics by default.
>>>
>>> But the reordering issue is not a concern to me, I got carried away when
>>> adding READ_ONCE()/WRITE_ONCE(). While smp_load_acquire() and
>>> smp_store_release() make the code work more like I intend it to, they
>>> are (small) costs we can spare.
>>>
>>> I still think that READ_ONCE()/WRITE_ONCE() are necessary, including for
>>> priv->updated. Do you agree?
>>>
>>
>> No. What is the point ? The order of writes doesn't matter, the writes won't
>> be randomly dropped, and it doesn't matter if the reader reports old values
>> for a couple of microseconds either. This would be different if the values
>> were used as synchronization primitives or similar, but that isn't the case
>> here. As for priv->updated, if you are concerned about lost reports and
>> the 4th report is received a few microseconds before the read, I'd suggest
>> to loosen the interval a bit instead.
>>
>> Supposedly we are getting reports every 500ms. We have two situations:
>> - More than three reports are lost, making priv->updated somewhat relevant.
>> In this case, it doesn't matter if outdated values are reported for
>> a few uS since most/many/some reports are outdated more than a second
>> anyway.
>> - A report is received but old values are reported for a few uS. That
>> doesn't matter either because reports are always outdated anyway by
>> much more than a few uS anyway, and the code already tolerates up to
>> 2 seconds of lost reports.
>>
>> Sorry, I completely fail to see the problem you are trying to solve here.
>
> Please disregard the out-of-order accesses, I agree that preventing them
> "are a (small) cost we can spare".
>
> The main problem I still would like to address are the data races.
> While the stores and loads cannot be dropped, and we can tolerate their
> reordering, they could still be teared, fused, perhaps invented...
> According to [1] these types of optimizations are not unheard.
>
> Load tearing alone could easily produce values that are not stale, but
> wrong. Do we also tolerate wrong values, even if they are infrequent?
>
> Another detail I should have mentioned sooner is that READ_ONCE() and
> WRITE_ONCE() cause only minor (gcc) to no (clang) changes to the
> generated code for x86_64 and i386.[2] While this seems contrary to the
> point I am trying to make, I want to show that, for the most part, these
> changes just lock in a reasonable compiler behavior.
>
> Specifically, on x86_64/gcc (the most relevant arch/compiler for this
> driver) the changes are restricted to kraken2_read:
>
> 1. Loading of priv->updated and jiffies are reordered, because
> (with READ_ONCE()) both are volatile and time_after(a, b) is
> defined as b - a.
>
Are you really trying to claim that much of the time_after() code
in the Linux kernel is wrong ?
> 2. When loading priv->fan_input[channel],
> movzx eax,WORD PTR [rdx+rcx*2+0x14]
> is split into
> add rcx,0x8
> movzx eax,WORD PTR [rdx+rcx*2+0x4]
> for no reason I could find in the x86 manual.
>
> 3. Similarly, when loading priv->temp_input[channel]
> movsxd rax,DWORD PTR [rdx+rcx*4+0x10]
> turns into
> add rcx,0x4
> movsxd rax,DWORD PTR [rdx+rcx*4]
>
I hardly see how this matters. In both cases, rax enda up with the same value.
Maybe rcx is reused later on. If not, maybe the compiler had a bad day.
But that is not an argument for using READ_ONCE/WRITE_ONCE; after all, the
same will happen with all other indexed accesses.
Guenter
> Both 2 and 3 admittedly get a bit worse with READ_ONCE()/WRITE_ONCE().
> But that is on gcc, and with the data race it could very well decide to
> produce much worse code than that at any moment.
>
> On Arm64 the code does get a lot more ordered, which we have already
> agreed is not really necessary. But removing smp_load_acquire() and
> smp_store_release() should still allow the CPU to reorder those,
> mitigating some of the impact.
>
> I hope this email clarifies what I am concerned about.
>
> Thanks for the patience,
> Jonas
>
> P.S. Tested with gcc 10.2.0 and clang 11.1.0.
>
> [1] https://lwn.net/Articles/793253/
> [2] (outdated, still with smp_*()): https://github.com/jonasmalacofilho/patches/tree/master/linux/nzxt-kraken2-mark-and-order-concurrent-accesses/objdumps
>