Re: [RFC PATCH 1/5] rust: block: introduce `kernel::block::mq` module

From: Benno Lossin
Date: Fri Apr 05 2024 - 05:40:40 EST


On 05.04.24 10:43, Andreas Hindborg wrote:
> Benno Lossin <benno.lossin@xxxxxxxxx> writes:
>
>> On 04.04.24 11:30, Andreas Hindborg wrote:
>>> Benno Lossin <benno.lossin@xxxxxxxxx> writes:
>>>
>>>> On 04.04.24 07:44, Andreas Hindborg wrote:
>>>>> Benno Lossin <benno.lossin@xxxxxxxxx> writes:
>>>>>
>>>>>> On 03.04.24 10:46, Andreas Hindborg wrote:
>>>>>>> Benno Lossin <benno.lossin@xxxxxxxxx> writes:
>>>>>>>
>>>>>>>> On 23.03.24 07:32, Andreas Hindborg wrote:
>>>>>>>>> Benno Lossin <benno.lossin@xxxxxxxxx> writes:
>>>>>>>>>> On 3/13/24 12:05, Andreas Hindborg wrote:
>>>>>>>>>>> +//! implementations of the `Operations` trait.
>>>>>>>>>>> +//!
>>>>>>>>>>> +//! IO requests are passed to the driver as [`Request`] references. The
>>>>>>>>>>> +//! `Request` type is a wrapper around the C `struct request`. The driver must
>>>>>>>>>>> +//! mark start of request processing by calling [`Request::start`] and end of
>>>>>>>>>>> +//! processing by calling one of the [`Request::end`] methods. Failure to do so
>>>>>>>>>>> +//! can lead to IO failures.
>>>>>>>>>>
>>>>>>>>>> I am unfamiliar with this, what are "IO failures"?
>>>>>>>>>> Do you think that it might be better to change the API to use a
>>>>>>>>>> callback? So instead of calling start and end, you would do
>>>>>>>>>>
>>>>>>>>>> request.handle(|req| {
>>>>>>>>>> // do the stuff that would be done between start and end
>>>>>>>>>> });
>>>>>>>>>>
>>>>>>>>>> I took a quick look at the rnull driver and there you are calling
>>>>>>>>>> `Request::end_ok` from a different function. So my suggestion might not
>>>>>>>>>> be possible, since you really need the freedom.
>>>>>>>>>>
>>>>>>>>>> Do you think that a guard approach might work better? I.e. `start` returns
>>>>>>>>>> a guard that when dropped will call `end` and you need the guard to
>>>>>>>>>> operate on the request.
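
Just to spell out what I meant by the guard idea back then, a rough
sketch (the names and exact `end` signature are made up here):

struct ActiveRequest<'a>(&'a Request);

impl Drop for ActiveRequest<'_> {
    fn drop(&mut self) {
        // Ending on drop would guarantee exactly one completion per request.
        self.0.end_ok();
    }
}

impl Request {
    fn start(&self) -> ActiveRequest<'_> {
        // Mark the request as started on the C side, then hand out the
        // guard through which the driver operates on the request.
        todo!()
    }
}
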
>>>>>>>>>
>>>>>>>>> I don't think that would fit, since the driver might not complete the
>>>>>>>>> request immediately. We might be able to call `start` on behalf of the
>>>>>>>>> driver.
>>>>>>>>>
>>>>>>>>> At any rate, since the request is reference counted now, we can
>>>>>>>>> automatically fail a request when the last reference is dropped and it
>>>>>>>>> was not marked successfully completed. I would need to measure the
>>>>>>>>> performance implications of such a feature.
>>>>>>>>
>>>>>>>> Are there cases where you still need access to the request after you
>>>>>>>> have called `end`?
>>>>>>>
>>>>>>> In general no, there is no need to handle the request after calling end.
>>>>>>> C drivers are not allowed to, because this transfers ownership of the
>>>>>>> request back to the block layer. This patch series defers the transfer of
>>>>>>> ownership to the point when the ARef<Request> refcount goes to zero, so
>>>>>>> there should be no danger associated with touching the `Request` after
>>>>>>> end.
>>>>>>>
>>>>>>>> If no, I think it would be better for the request to
>>>>>>>> be consumed by the `end` function.
>>>>>>>> This is a bit difficult with `ARef`, since the user can just clone it
>>>>>>>> though... Do you think that it might be necessary to clone requests?
>>>>>>>
>>>>>>> Looking into the details now I see that calling `Request::end` more than
>>>>>>> once will trigger UAF, because C code decrements the refcount on the
>>>>>>> request. When we have `ARef<Request>` around, that is a problem. It
>>>>>>> probably also messes with other things in C land. Good catch.
>>>>>>>
>>>>>>> I did implement `Request::end` to consume the request at one point
>>>>>>> before I fell back on reference counting. It works fine for simple
>>>>>>> drivers. However, most drivers will need to use the block layer tag set
>>>>>>> service, which allows conversion of an integer id to a request pointer.
>>>>>>> The abstraction for this feature is not part of this patch set. But the
>>>>>>> block layer manages a mapping from integer tags to requests, and drivers
>>>>>>> typically use this to identify the request that corresponds to
>>>>>>> completion messages that arrive from hardware. When drivers are able to
>>>>>>> turn integers into requests like this, consuming the request in the call
>>>>>>> to `end` makes little sense (because we can just construct more).
>>>>>>
>>>>>> How do you ensure that this is fine?:
>>>>>>
>>>>>> let r1 = tagset.get(0);
>>>>>> let r2 = tagset.get(0);
>>>>>> r1.end_ok();
>>>>>> r2.do_something_that_would_only_be_done_while_active();
>>>>>>
>>>>>> One thing that comes to my mind would be to only give out `&Request`
>>>>>> from the tag set. And to destroy, you could have a separate operation
>>>>>> that also removes the request from the tag set. (I am thinking of a tag
>>>>>> set as a `HashMap<u64, Request>`.)
>>>>>
>>>>> This would be similar to
>>>>>
>>>>> let r1 = tagset.get(0)?;
>>>>> let r2 = r1.clone();
>>>>> r1.end_ok();
>>>>> r2.do_something_requires_active();
>>>>>
>>>>> but it is not a problem because we do not implement any actions that are
>>>>> illegal in that position (outside of `end` - that _is_ a problem).
>>>>
>>>> Makes sense, but I think it's a bit weird to still be able to access it
>>>> after `end`ing.
>>>
>>> Yes, that is true.
>>>
>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>> What I do now is issue an `Option<ARef<Request>>` with
>>>>>>> `bindings::req_ref_inc_not_zero(rq_ptr)`, to make sure that the request
>>>>>>> is currently owned by the driver.
>>>>>>>
>>>>>>> I guess we can check the absolute value of the refcount, and only issue
>>>>>>> a request handle if the count matches what we expect. Then we can be certain
>>>>>>> that the handle is unique, and we can require transfer of ownership of
>>>>>>> the handle to `Request::end` to make sure it can never be called more
>>>>>>> than once.
>>>>>>>
>>>>>>> Another option is to error out in `Request::end` if the
>>>>>>> refcount is not what we expect.
>>>>>>
>>>>>> I am a bit confused, why does the refcount matter in this case? Can't
>>>>>> the user just have multiple `ARef`s?
>>>>>
>>>>> Because we want to assert that we are consuming the last handle to the
>>>>> request. After we do that, the user cannot call `Request::end` again.
>>>>> `TagSet::get` will not issue a request reference if the request is not
>>>>> in flight. Although there might be a race condition to watch out for.
>>>>>
>>>>> When the block layer hands over ownership to Rust, the reference count
>>>>> is 1. The first `ARef<Request>` we create increments the count to 2. To
>>>>> complete the request, we must have ownership of all reference counts
>>>>> above 1. The block layer takes the last reference count when it takes
>>>>> back ownership of the request.
>>>>>
>>>>>> I think it would be weird to use `ARef<Request>` if you expect the
>>>>>> refcount to be 1.
>>>>>
>>>>> Yes, that would require a custom smart pointer with a `try_into_unique`
>>>>> method that succeeds when the refcount is exactly 2. It would consume
>>>>> the instance and decrement the refcount to 1. But as I said, there is a
>>>>> potential race with `TagSet::get` when the refcount is 1 that needs to
>>>>> be handled.
>>>>>
>>>>>> Maybe the API should be different?
>>>>>
>>>>> It needs to change a little, yes.
>>>>>
>>>>>> As I understand it, a request has the following life cycle (please
>>>>>> correct me if I am wrong):
>>>>>> 1. A new request is created, it is given to the driver via `queue_rq`.
>>>>>> 2. The driver can now decide what to do with it (theoretically it can
>>>>>> store it somewhere and later do something with it), but it should at
>>>>>> some point call `Request::start`.
>>>>>> 3. Work happens and eventually the driver calls `Request::end`.
>>>>>>
>>>>>> To me this does not seem like something where we need a refcount (we
>>>>>> still might need one for safety, but it does not need to be exposed to
>>>>>> the user).
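
I.e., from the driver's point of view, roughly (simplified; not the
exact signatures from this series):

fn queue_rq(rq: ARef<Request>) -> Result {
    rq.start();
    // ... submit the IO; completion may also happen later, from another
    // context, as long as `end` is eventually called ...
    rq.end_ok();
    Ok(())
}
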
>>>>>
>>>>> It would not need to be exposed to the user, other than a) ending a request
>>>>> can fail OR b) `TagSet::get` can fail.
>>>>>
>>>>> a) would require that ending a request must be done with a unique
>>>>> reference. This could be done by the user calling
>>>>> `try_into_unique` or by making the `end` method fallible.
>>>>>
>>>>> b) would make the reference handle `!Clone` and add a failure mode to
>>>>> `TagSet::get`, so it fails to construct a `Request` handle if one is
>>>>> already in existence.
>>>>>
>>>>> I gravitate towards a) because it allows the user to clone the Request
>>>>> reference without adding an additional `Arc`.
>>>>
>>>> This confuses me a little, since I thought that `TagSet::get` returns
>>>> `Option<ARef<Request>>`.
>>>
>>> It does, but in the current implementation the failure mode returning
>>> `None` is triggered when the refcount is zero, meaning that the request
>>> corresponding to that tag is not currently owned by the driver. For
>>> solution b) we would change the type to be
>>> `Option<CustomSmartPointerHandleThing<Request>>`.
>>>
>>>> (I tried to find the abstractions in your
>>>> github, but I did not find them)
>>>
>>> It's here [1]. It was introduced in the `rnvme-v6.8` branch.
>>
>> Thanks for the pointer.
>>
>>>> I think that this could work: `queue_rq` takes an `OwnedRequest`, which
>>>> the user can store in a `TagSet`, transferring ownership. `TagSet::get`
>>>> returns `Option<&Request>` and you can call `TagSet::remove` to get
>>>> `Option<OwnedRequest>`. `OwnedRequest::end` consumes `self`.
>>>> With this pattern we also do not need to take an additional refcount.
>>>
>>> It would, but the `TagSet` is just a wrapper for the C block layer
>>> `struct blk_mq_tag_set`. This is a highly optimized data structure and
>>> tag mapping is done before the driver sees the request. I would like to
>>> reuse that logic.
>>>
>>> We could implement what you suggest anyhow, but I would not want to add that
>>> additional logic to the hot path.
>>
>> I overlooked an important detail: the `TagSet` is always stored in an
>> `Arc` (IIRC since you want to be able to share it between different
>> `Gendisk`s). This probably makes my suggestion impossible, since you
>> can't mutably borrow the `TagSet` for removal of `Request`s.
>> Depending on how `Request`s are associated to a `TagSet`, there might be
>> a way around this: I saw the `qid` parameter to the `tag_to_rq`
>> function; is that a unique identifier for a queue?
>
> A tag set services a number of request queues. Each queue has a number
> that identifies it uniquely within the tag set.
>
>> Because in that case
>> we might be able to have a unique `QueueTagSetRef` with
>>
>> fn remove(&mut self, tag: u32) -> OwnedRequest;
>
> We would not need exclusive access. Tag set removals are synchronized
> internally with some fancy atomic bit flipping [1].

If we bind the ability to call `Request::end` to `OwnedRequest` and
require exclusive access to the `QueueTagSetRef`, then we could ensure
that the `end` function is only called once.
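
Roughly like this (all names made up, building on the two signatures
quoted above and below):

struct QueueTagSetRef(/* per-queue handle into the shared `TagSet` */);
struct OwnedRequest(ARef<Request>);

impl QueueTagSetRef {
    /// Borrow a request while it stays owned by the tag set.
    fn get(&self, tag: u32) -> Option<&Request> { todo!() }

    /// Take over the request for `tag`; afterwards `get(tag)` returns
    /// `None` on the Rust side (the C side mapping is untouched).
    fn remove(&mut self, tag: u32) -> Option<OwnedRequest> { todo!() }
}

impl OwnedRequest {
    /// Consuming `self` makes a second `end` impossible.
    fn end_ok(self) { todo!() }
}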

>>
>> fn get(&self, tag: u32) -> Option<&Request>;
>>
>> The `TagSet` would still be shared, only the ability to "remove" (I
>> don't know if you do that manually in C, if not, then this would just
>> remove it in the abstraction, but keep it on the C side) is unique to
>> the `QueueTagSetRef` struct.
>
> I would not advise removing tag->request associations from the driver. I
> understand your point and from the perspective of these patches it makes
> sense. But it would be a major layer violation of the current block
> layer architecture, as far as I can tell.

Ah, I should have specified this better: we don't remove the request from
the C side, only from the `TagSet` Rust abstraction. Maybe a better name
would be `end_request` (the function would then return bool to indicate
if there was a request with that tag).
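
I.e. something like this (hypothetical, untested):

impl QueueTagSetRef {
    /// Ends the request mapped to `tag` and drops it from the Rust-side
    /// bookkeeping; returns `false` if no request was in flight for `tag`.
    fn end_request(&mut self, tag: u32) -> bool { todo!() }
}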

> I am having trouble enough trying to justify deferred free of the
> request structure as it is.

Using this approach, there also would not be a deferred free, as we
would call `end` immediately, right?

>> But feel free to use your proposed option a); it is simpler and we can
>> try to make this work when you send the `TagSet` abstractions.
>> I just think that we should try a bit harder to make it even better.
>
> I'll code up a) and see how it looks (and what it costs in
> performance) 👍

Sure.
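
For what it's worth, I imagine a) looking roughly like this (names made
up, untested):

struct RequestRef(ARef<Request>); // what `TagSet::get` would hand out

impl RequestRef {
    /// Succeeds only when this is the last driver-held reference, i.e.
    /// the refcount is exactly 2 (this handle plus the block layer's own).
    fn try_into_unique(self) -> Result<UniqueRequest, Self> {
        // Compare-exchange on the refcount; the race against `TagSet::get`
        // at refcount 1 still needs to be handled.
        todo!()
    }
}

struct UniqueRequest(ARef<Request>);

impl UniqueRequest {
    /// Consuming `self` statically rules out a second `end`.
    fn end_ok(self) { todo!() }
}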

We can also speak about this in the meeting; I have the feeling that
would be easier than trying via mail :)

--
Cheers,
Benno