Re: [PATCH 3/3] rust: block: convert `block::mq` to use `Refcount`

From: Andreas Hindborg
Date: Thu Oct 10 2024 - 07:14:02 EST


"Benno Lossin" <benno.lossin@xxxxxxxxx> writes:

> On 10.10.24 11:06, Andreas Hindborg wrote:
>> Andreas Hindborg <a.hindborg@xxxxxxxxxx> writes:
>>
>>> Andreas Hindborg <a.hindborg@xxxxxxxxxx> writes:
>>>
>>>> "Gary Guo" <gary@xxxxxxxxxxx> writes:
>>>>
>>>>> On Sat, 5 Oct 2024 13:59:44 +0200
>>>>> Alice Ryhl <aliceryhl@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On Sat, Oct 5, 2024 at 11:49 AM Andreas Hindborg <a.hindborg@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hi Greg,
>>>>>>>
>>>>>>> "Greg KH" <gregkh@xxxxxxxxxxxxxxxxxxx> writes:
>>>>>>>
>>>>>>>> On Fri, Oct 04, 2024 at 04:52:24PM +0100, Gary Guo wrote:
>>>>>>>>> There is an operation needed by `block::mq`, atomically decreasing
>>>>>>>>> refcount from 2 to 0, which is not available through refcount.h, so
>>>>>>>>> I exposed `Refcount::as_atomic` which allows accessing the refcount
>>>>>>>>> directly.
>>>>>>>>
>>>>>>>> That's scary, and of course feels wrong on many levels, but:
>>>>>>>>
>>>>>>>>
>>>>>>>>> @@ -91,13 +95,17 @@ pub(crate) unsafe fn start_unchecked(this: &ARef<Self>) {
>>>>>>>>> /// C `struct request`. If the operation fails, `this` is returned in the
>>>>>>>>> /// `Err` variant.
>>>>>>>>> fn try_set_end(this: ARef<Self>) -> Result<*mut bindings::request, ARef<Self>> {
>>>>>>>>> - // We can race with `TagSet::tag_to_rq`
>>>>>>>>> - if let Err(_old) = this.wrapper_ref().refcount().compare_exchange(
>>>>>>>>> - 2,
>>>>>>>>> - 0,
>>>>>>>>> - Ordering::Relaxed,
>>>>>>>>> - Ordering::Relaxed,
>>>>>>>>> - ) {
>>>>>>>>> + // To hand back the ownership, we need the current refcount to be 2.
>>>>>>>>> + // Since we can race with `TagSet::tag_to_rq`, this needs to atomically reduce
>>>>>>>>> + // refcount to 0. `Refcount` does not provide a way to do this, so use the underlying
>>>>>>>>> + // atomics directly.
>>>>>>>>> + if this
>>>>>>>>> + .wrapper_ref()
>>>>>>>>> + .refcount()
>>>>>>>>> + .as_atomic()
>>>>>>>>> + .compare_exchange(2, 0, Ordering::Relaxed, Ordering::Relaxed)
>>>>>>>>> + .is_err()
>>>>>>>>
>>>>>>>> Why not just call rust_helper_refcount_set()? Or is the issue that you
>>>>>>>> think you might not be 2 here? And if you HAVE to be 2, why that magic
>>>>>>>> value (i.e. why not just always be 1 and rely on normal
>>>>>>>> increment/decrement?)
>>>>>>>>
>>>>>>>> I know some refcounts are odd in the kernel, but I don't see where the
>>>>>>>> block layer is caring about 2 as a refcount anywhere, what am I missing?
>>>>>>>
>>>>>>> It is in the documentation, rendered version available here [1]. Let me
>>>>>>> know if it is still unclear, then I guess we need to update the docs.
>>>>>>>
>>>>>>> Also, my session from Recipes has a little bit of discussion regarding
>>>>>>> this refcount and its use [2].
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Andreas
>>>>>>>
>>>>>>>
>>>>>>> [1] https://rust.docs.kernel.org/kernel/block/mq/struct.Request.html#implementation-details
>>>>>>> [2] https://youtu.be/1LEvgkhU-t4?si=B1XnJhzCCNnUtRsI&t=1685
>>>>>>
>>>>>> So it sounds like there is one refcount from the C side, and some
>>>>>> number of references from the Rust side. The function checks whether
>>>>>> there's only one Rust reference left, and if so, takes ownership of
>>>>>> the value, correct?
>>>>>>
>>>>>> In that case, the CAS should have an acquire ordering to synchronize
>>>>>> with dropping the refcount 3->2 on another thread. Otherwise, you
>>>>>> might have a data race with the operations that happened just before
>>>>>> the 3->2 refcount drop.
>>>>>>
>>>>>> Alice
>>>>>
>>>>> The code as is is fine since there's no data protected in
>>>>> `RequestDataWrapper` yet (in fact it's not even generic yet). I know
>>>>> Andreas does want to introduce driver-specific data into that, so in
>>>>> the long term the acquire would be necessary.
>>>>>
>>>>> Andreas, please let me know if you want me to make the change now, or
>>>>> you'd rather change the ordering when you introduce data to
>>>>> `RequestDataWrapper`.
>>>>
>>>> I guess we will have said data dependencies when we are going to run
>>>> drop for fields in the private data area. Thanks for pointing that out.
>>>> I will update the ordering when I submit that patch.
>>>>
>>>> As I mentioned before, I would rather we do not apply this patch before
>>>> we get a way to inline helpers.
>>>
>>> As discussed offline, the code that suffers the performance regression
>>> is downstream, and since this change seems to be important, I can apply
>>> the helper LTO patch downstream as well.
>>>
>>> Since the plan for the downstream code _is_ to move upstream, I really
>>> hope to see the helper LTO patch upstream, so we don't get a performance
>>> regression because of these refcounts.
>>>
>>> If we cannot figure out a way to get the LTO patches (or an alternative
>>> solution) upstream, we can always revert back to a more performant
>>> solution in block.
>>
>> I forgot to report the result of the benchmarks. Over the usual
>> benchmark workload that I run for `rnull` I see an average 0.8 percent
>> performance penalty with this patch. For some configurations
>> I see 95% CI N=40 [-18%;-5%]. So it is not insignificant.
>
> Was the benchmark run together with the LTO helper patches?

No, that is the effect of applying this patch set alone. I did apply
the helper LTO patches downstream a few times, but I don't carry them in
my default tree. But I guess I can start doing that now.
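
For anyone following the ordering discussion further up the thread, here
is a minimal userspace sketch of the 2 -> 0 handoff with the acquire
ordering Alice suggested. It uses `std::sync::atomic` rather than the
kernel `Refcount`, and `Count`, `put` and `try_take` are hypothetical
names chosen for illustration, not anything in the patch.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical stand-in for the refcount stored in `RequestDataWrapper`.
/// This is a userspace illustration, not the kernel `Refcount` type.
pub struct Count(AtomicU32);

impl Count {
    pub fn new(initial: u32) -> Self {
        Count(AtomicU32::new(initial))
    }

    /// Drop one reference (for example 3 -> 2) and return the new value.
    /// `Release` publishes the writes made before the drop.
    pub fn put(&self) -> u32 {
        self.0.fetch_sub(1, Ordering::Release) - 1
    }

    /// Try to move the count from 2 to 0 in a single atomic step.
    /// On success, `Acquire` pairs with the `Release` in `put`, so writes
    /// made before another thread's 3 -> 2 drop are visible afterwards.
    /// On failure, the count that was actually observed is returned.
    pub fn try_take(&self) -> Result<(), u32> {
        self.0
            .compare_exchange(2, 0, Ordering::Acquire, Ordering::Relaxed)
            .map(|_| ())
    }
}
```

With a count of 3, `try_take` fails and reports 3. After one `put` it
succeeds and leaves the count at 0, so a second attempt fails again.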

Best regards,
Andreas