Re: [PATCH v2 1/3] rust: block: introduce `kernel::block::mq` module
From: Benno Lossin
Date: Wed May 29 2024 - 14:08:32 EST
On 29.05.24 14:52, Andreas Hindborg wrote:
> Benno Lossin <benno.lossin@xxxxxxxxx> writes:
>
>> On 21.05.24 16:03, Andreas Hindborg wrote:
>>> From: Andreas Hindborg <a.hindborg@xxxxxxxxxxx>
>
> [...]
>
>>>>> +
>>> +//! This module provides types for implementing block drivers that interface the
>>> +//! blk-mq subsystem.
>>> +//!
>>> +//! To implement a block device driver, a Rust module must do the following:
>>> +//!
>>> +//! - Implement [`Operations`] for a type `T`
>>> +//! - Create a [`TagSet<T>`]
>>> +//! - Create a [`gen_disk::GenDisk<T>`], passing in the `TagSet` reference
>>
>> I would use short links [`GenDisk<T>`].
>
> `GenDisk` is not in scope, so short link is not working. But I can do
> whatever this is called:
>
>
> //! - Create a [`GenDisk<T>`], passing in the `TagSet` reference
> ...
> //! [`GenDisk<T>`]: gen_disk::GenDisk
>
> Would you prefer that?
Yes that is what I had in mind.
>>> +//! - Add the disk to the system by calling [`gen_disk::GenDisk::add`]
>>> +//!
>>> +//! The types available in this module that have direct C counterparts are:
>>> +//!
>>> +//! - The `TagSet` type that abstracts the C type `struct tag_set`.
>>
>> Missing link? (also below)
>
> `TagSet` was linked above. Would you link it on each mention?
I think that is more resilient to doc updates (eg if the mention above
is removed, then there would be no link). Also the reader won't have to
search for a different, linked location of the same thing.
> [...]
>
>>> +//!
>>> +//! fn commit_rqs(
>>> +//! ) {
>>> +//! }
>>
>> Formatting.
>
> I would love if `rustfmt` would do this. But I think it is both unstable
> and broken for examples like this [1]. I'll fix it up by hand.
>
> How do you manage formatting in examples? By hand?
I usually copy-paste it into a temporary file and then format that
(removing the leading `//! `) the annoying thing is that you then want
to align it to 96 columns. The way I do it is by wrapping the code in a
`mod tmp {}`, since that introduces 4 additional spaces. I think you
could also directly do it in the file by just removing the prefix and
wrapping in a module.
[...]
>>> +impl<T: Operations> GenDisk<T, Initialized> {
>>> + /// Register the device with the kernel. When this function returns, the
>>> + /// device is accessible from VFS. The kernel may issue reads to the device
>>> + /// during registration to discover partition information.
>>> + pub fn add(self) -> Result<GenDisk<T, Added>> {
>>> + crate::error::to_result(
>>> + // SAFETY: By type invariant, `self.gendisk` points to a valid and
>>> + // initialized instance of `struct gendisk`
>>> + unsafe {
>>> + bindings::device_add_disk(
>>> + core::ptr::null_mut(),
>>> + self.gendisk,
>>> + core::ptr::null_mut(),
>>> + )
>>> + },
>>> + )?;
>>> +
>>> + // We don't want to run the destuctor and remove the device from the VFS
>>> + // when `disk` is dropped.
>>> + let mut old = core::mem::ManuallyDrop::new(self);
>>> +
>>> + let new = GenDisk {
>>> + _tagset: old._tagset.clone(),
>>> + gendisk: old.gendisk,
>>> + _phantom: PhantomData,
>>> + };
>>> +
>>> + // But we have to drop the `Arc` or it will leak.
>>> + // SAFETY: `old._tagset` is valid for write, aligned, non-null, and we
>>> + // have exclusive access. We are not accessing the value again after it
>>> + // is dropped.
>>> + unsafe { core::ptr::drop_in_place(&mut old._tagset) };
>>> +
>>> + Ok(new)
>>
>> Instead of using `ManuallyDrop` and `drop_in_place` why not use
>> `transmute`? I feel like that would be a lot simpler.
>
> I was considering the layout not being deterministic for `repr(Rust)`. I
> think that in practice it will be the case that the two types will have
> the same layout, but I could not find the documentation that states
> this. Nomicon does not talk about zero sized types [2].
Ah I forgot about this issue... You are absolutely correct that we do
not get the guarantee for `repr(Rust)`.
[...]
>>> +/// Implement this trait to interface blk-mq as block devices.
>>> +///
>>> +/// To implement a block device driver, implement this trait as described in the
>>> +/// [module level documentation]. The kernel will use the implementation of the
>>> +/// functions defined in this trait to interface a block device driver Note:
>>> +/// There is no need for an exit_request() implementation, because the `drop`
>>> +/// implementation of the [`Request`] type will be invoked by automatically by
>>> +/// the C/Rust glue logic.
>>
>> This text is wrapped to 80 columns, but our usual wrapping is 100.
>
> This had me scratch my head for a bit. I run `make rustfmt` and `make
> rustfmtcheck`, so I had no idea why my code would be formatted
> incorrect. It took me a while to figure out that we are not enabling
> `comment_width = 100`, presumably because it is an unstable `rustfmt`
> feature.
Yeah, Alice also ran into this issue.
> I am not sure what the correct way to enable it but I hacked
> the Makefile and enabled it. It gives a huge diff all across the kernel
> crate.
Oh I did not notice...
> So, it seems we _are_ in fact using 80 line fill column, since that is
> what much of our existing code is using, and that is what the build
> system is configured to use.
>
> Where did you come across the 100 character fill column?
Well, my editor is configured to wrap at the 100th column including
comments. I don't know why we have so many comment blocks at 80 columns.
> Anyways, we should configure our tools to the standard we want. I don't
> care if it's 80 or 100, as long as I can have the tools do the job for
> me.
Agreed.
> Let's discuss this at next meeting.
>
>>
>>> +///
>>> +/// [module level documentation]: kernel::block::mq
>>> +#[macros::vtable]
>>> +pub trait Operations: Sized {
>>> + /// Called by the kernel to queue a request with the driver. If `is_last` is
>>> + /// `false`, the driver is allowed to defer committing the request.
>>> + fn queue_rq(rq: ARef<Request<Self>>, is_last: bool) -> Result;
>>> +
>>> + /// Called by the kernel to indicate that queued requests should be submitted.
>>> + fn commit_rqs();
>>> +
>>> + /// Called by the kernel when the request is completed.
>>> + fn complete(_rq: ARef<Request<Self>>);
>>
>> Is there a reason for the `_`?
>
> Copy pasta probably. Will remove.
>
>>
>> To me it seems this is called when the given request is already done, so
>> would it make more sense to name it `completed` or `on_completion`?
>
> I would agree. But we had a bit of discussion at LSF about naming things
> differently in Rust vs C. C people prefer if we keep the C names, even
> if they do not make sense to the people who write the new Rust code.
That makes sense. In that case, we could also probably add an additional
note to the function that the name might be a bit misleading. In the
sense that it sounds like "here is a request, please complete it"
instead of "here is a request that was just completed".
> In C, the vtable entry is called `complete_callback` and the called
> symbol is usually `my_driver_complete_rq`.
>
> We could go with `completed`, `completed_callback`, or `complete_rq`.
> Although `completed` sounds like it should return a bool indicating
> whether the request was already completed.
>
> I think I'll leave it for now, and we can always change it if we come up
> with a really good name.
[...]
>>> diff --git a/rust/kernel/block/mq/raw_writer.rs b/rust/kernel/block/mq/raw_writer.rs
>>> new file mode 100644
>>> index 000000000000..4f7e4692b592
>>> --- /dev/null
>>> +++ b/rust/kernel/block/mq/raw_writer.rs
>>> @@ -0,0 +1,55 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +use core::fmt::{self, Write};
>>> +
>>> +use crate::error::Result;
>>> +use crate::prelude::EINVAL;
>>> +
>>> +/// A mutable reference to a byte buffer where a string can be written into.
>>> +///
>>> +/// # Invariants
>>> +///
>>> +/// `buffer` is always null terminated.
>>
>> I don't know how valuable this would be, but you could only ask for this
>> invariant, if `buffer` is non-empty. Then you could have the `new` and
>> `from_array` functions return a `RawWriter` without result.
>
> I think guaranteeing at least a null character is always written by
> `write_str` is a good thing. It is used for writing C strings to device
> name fields. `write_str` with a zero size buffer would give undesirable
> results, and is probably not what the caller wants.
Makes sense.
[...]
>>> + /// Notify the block layer that the request has been completed without errors.
>>> + ///
>>> + /// This function will return `Err` if `this` is not the only `ARef`
>>> + /// referencing the request.
>>> + pub fn end_ok(this: ARef<Self>) -> Result<(), ARef<Self>> {
>>
>> I am not yet fully convinced that this is the way we should go. I think
>> I would have to see a more complex usage of `Request` with that id <->
>> request mapping that you mentioned. Because with how rnull uses this
>> API, it could also have a `URef<Self>` parameter (URef := unique ARef).
>
> I considered a `UniqueARef` but it would just move the error handing to
> `ARef::into_unique` and then make `end_ok` infallible.
>
> There are four states for a request that we need to track:
>
> A) Request is owned by block layer (refcount 0)
> B) Request is owned by driver but with zero `ARef`s in existence
> (refcount 1)
> C) Request is owned by driver with exactly one `ARef` in existence
> (refcount 2)
> D) Request is owned by driver with more than one `ARef` in existence
> (refcount > 2)
>
> It is in the doc comments for `RequestDataWrapper` as well.
>
> We need A and B to ensure we fail tag to request conversions for
> requests that are not owned by the driver.
>
> We need C and D to ensure that it is safe to end the request and hand back
> ownership to the block layer.
>
> I will ping you when I hook up the NVMe driver with this.
Thanks. I think that since the C side doesn't use ref-counting, the
lifecycle of a request is probably rather simple. Therefore we should
try to also avoid refcounting in Rust and see if we can eg tie ending
requests to the associated `TagSet` (ie require `&mut` on the tagset)
and tie accessing requests to shared access to the `TagSet`. Then we
would be able to avoid the refcount. But I will first have to take a
look at the nvme driver to gauge the plausibility.
>>> + let this = Self::try_set_end(this)?;
>>> + let request_ptr = this.0.get();
>>> + core::mem::forget(this);
>>
>> Would be a good idea to mention who is going to own this refcount.
>
> The refcount is zero after `try_set_end`, so there is no owner of the
> count. The request will be in state A and thus block layer owns the
> request. Block layer does not honor this refcount, it is only for the
> driver to know.
>
> Perhaps I should move the explanation up into the docs for `Request`.
It wouldn't hurt to have the above as a comment :)
>>> +
>>> + // SAFETY: By type invariant, `self.0` is a valid `struct request`. By
>>> + // existence of `&mut self` we have exclusive access.
>>> + unsafe { bindings::blk_mq_end_request(request_ptr, bindings::BLK_STS_OK as _) };
>>> +
>>> + Ok(())
>>> + }
>>> +
>>> + /// Return a pointer to the `RequestDataWrapper` stored in the private area
>>> + /// of the request structure.
>>> + ///
>>> + /// # Safety
>>> + ///
>>> + /// - `this` must point to a valid allocation.
>>> + pub(crate) unsafe fn wrapper_ptr(this: *mut Self) -> NonNull<RequestDataWrapper> {
>>> + let request_ptr = this.cast::<bindings::request>();
>>> + let wrapper_ptr =
>>> + // SAFETY: By safety requirements for this function, `this` is a
>>> + // valid allocation.
>>> + unsafe { bindings::blk_mq_rq_to_pdu(request_ptr).cast::<RequestDataWrapper>() };
>>> + // SAFETY: By C api contract, wrapper_ptr points to a valid allocation
>>> + // and is not null.
>>> + unsafe { NonNull::new_unchecked(wrapper_ptr) }
>>> + }
>>> +
>>> + /// Return a reference to the `RequestDataWrapper` stored in the private
>>> + /// area of the request structure.
>>> + pub(crate) fn wrapper_ref(&self) -> &RequestDataWrapper {
>>> + // SAFETY: By type invariant, `self.0` is a valid alocation. Further,
>>> + // the private data associated with this request is initialized and
>>> + // valid. The existence of `&self` guarantees that the private data is
>>> + // valid as a shared reference.
>>> + unsafe { Self::wrapper_ptr(self as *const Self as *mut Self).as_ref() }
>>> + }
>>> +}
>>> +
>>> +/// A wrapper around data stored in the private area of the C `struct request`.
>>> +pub(crate) struct RequestDataWrapper {
>>
>> Why is this called `Wrapper`? It doesn't really wrap anything,
>> `RequestData` seems fine.
>
> It will eventually wrap private data associated with the request. Those
> patches will be submitted later. Should I change the name in the
> meanwhile?
I don't think the churn of the name change is worth it, so keep the
wrapper name.
>>> + /// The Rust request refcount has the following states:
>>> + ///
>>> + /// - 0: The request is owned by C block layer.
>>> + /// - 1: The request is owned by Rust abstractions but there are no ARef references to it.
>>> + /// - 2+: There are `ARef` references to the request.
>>> + refcount: AtomicU64,
>>> +}
>>> +
>>> +impl RequestDataWrapper {
>>> + /// Return a reference to the refcount of the request that is embedding
>>> + /// `self`.
>>> + pub(crate) fn refcount(&self) -> &AtomicU64 {
>>> + &self.refcount
>>> + }
>>> +
>>> + /// Return a pointer to the refcount of the request that is embedding the
>>> + /// pointee of `this`.
>>> + ///
>>> + /// # Safety
>>> + ///
>>> + /// - `this` must point to a live allocation of at least the size of `Self`.
>>> + pub(crate) unsafe fn refcount_ptr(this: *mut Self) -> *mut AtomicU64 {
>>> + // SAFETY: Because of the safety requirements of this function, the
>>> + // field projection is safe.
>>> + unsafe { addr_of_mut!((*this).refcount) }
>>> + }
>>> +}
>>> +
>>> +// SAFETY: Exclusive access is thread-safe for `Request`. `Request` has no `&mut
>>> +// self` methods and `&self` methods that mutate `self` are internally
>>> +// synchronzied.
>>> +unsafe impl<T: Operations> Send for Request<T> {}
>>> +
>>> +// SAFETY: Shared access is thread-safe for `Request`. `&self` methods that
>>> +// mutate `self` are internally synchronized`
>>> +unsafe impl<T: Operations> Sync for Request<T> {}
>>> +
>>> +/// Store the result of `op(target.load())` in target, returning new value of
>>> +/// taret.
>>> +fn atomic_relaxed_op_return(target: &AtomicU64, op: impl Fn(u64) -> u64) -> u64 {
>>> + let mut old = target.load(Ordering::Relaxed);
>>> + loop {
>>> + match target.compare_exchange_weak(old, op(old), Ordering::Relaxed, Ordering::Relaxed) {
>>> + Ok(_) => break,
>>> + Err(x) => {
>>> + old = x;
>>> + }
>>> + }
>>> + }
>>
>> This seems like a reimplementation of `AtomicU64::fetch_update` to me.
>
> It looks like it! Except this function is returning the updated value,
> `fetch_update` is returning the old value.
>
> Would you rather that I rewrite in terms of the library function?
If you can just use the fetch_update function, then that would be better
than (almost) reimplementing it. But if you really need to get the new
value, then I guess it can't really be helped. (or do you think you can
just apply `op` to the old value returned by `fetch_update`?)
>>> +
>>> + op(old)
>>> +}
>>> +
>>> +/// Store the result of `op(target.load)` in `target` if `target.load() !=
>>> +/// pred`, returning previous value of target
>>> +fn atomic_relaxed_op_unless(target: &AtomicU64, op: impl Fn(u64) -> u64, pred: u64) -> bool {
>>> + let x = target.load(Ordering::Relaxed);
>>> + loop {
>>> + if x == pred {
>>> + break;
>>> + }
>>> + if target
>>> + .compare_exchange_weak(x, op(x), Ordering::Relaxed, Ordering::Relaxed)
>>> + .is_ok()
>>> + {
>>> + break;
>>> + }
>>> + }
>>> +
>>> + x == pred
>>> +}
>>> +
>>> +// SAFETY: All instances of `Request<T>` are reference counted. This
>>> +// implementation of `AlwaysRefCounted` ensure that increments to the ref count
>>> +// keeps the object alive in memory at least until a matching reference count
>>> +// decrement is executed.
>>> +unsafe impl<T: Operations> AlwaysRefCounted for Request<T> {
>>> + fn inc_ref(&self) {
>>> + let refcount = &self.wrapper_ref().refcount();
>>> +
>>> + #[cfg_attr(not(CONFIG_DEBUG_MISC), allow(unused_variables))]
>>
>> Another option would be to use `_updated` as the name of the variable. I
>> personally would prefer that. What do you guys think?
>
> Either way is fine by me.
>
> [...]
>
>>> diff --git a/rust/kernel/block/mq/tag_set.rs b/rust/kernel/block/mq/tag_set.rs
>>> new file mode 100644
>>> index 000000000000..4217c2b03ff3
>>> --- /dev/null
>>> +++ b/rust/kernel/block/mq/tag_set.rs
>>> @@ -0,0 +1,93 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +//! This module provides the `TagSet` struct to wrap the C `struct blk_mq_tag_set`.
>>> +//!
>>> +//! C header: [`include/linux/blk-mq.h`](srctree/include/linux/blk-mq.h)
>>> +
>>> +use core::pin::Pin;
>>> +
>>> +use crate::{
>>> + bindings,
>>> + block::mq::request::RequestDataWrapper,
>>> + block::mq::{operations::OperationsVTable, Operations},
>>> + error::{self, Error, Result},
>>> + prelude::PinInit,
>>> + try_pin_init,
>>> + types::Opaque,
>>> +};
>>> +use core::{convert::TryInto, marker::PhantomData};
>>> +use macros::{pin_data, pinned_drop};
>>> +
>>> +/// A wrapper for the C `struct blk_mq_tag_set`.
>>> +///
>>> +/// `struct blk_mq_tag_set` contains a `struct list_head` and so must be pinned.
>>> +#[pin_data(PinnedDrop)]
>>> +#[repr(transparent)]
>>> +pub struct TagSet<T: Operations> {
>>> + #[pin]
>>> + inner: Opaque<bindings::blk_mq_tag_set>,
>>> + _p: PhantomData<T>,
>>> +}
>>> +
>>> +impl<T: Operations> TagSet<T> {
>>> + /// Try to create a new tag set
>>> + pub fn try_new(
>>> + nr_hw_queues: u32,
>>> + num_tags: u32,
>>> + num_maps: u32,
>>> + ) -> impl PinInit<Self, error::Error> {
>>> + try_pin_init!( TagSet {
>>> + inner <- unsafe {kernel::init::pin_init_from_closure(move |place: *mut Opaque<bindings::blk_mq_tag_set>| -> Result<()> {
>>
>> We currently do not have `Opaque::try_ffi_init`, I vaguely remember that
>> we talked about it. Do you mind adding it? Otherwise I can also send a
>> patch.
>
> I have a `try_ffi_init` patch queued. I removed it from here to cut
> dependencies. I will submit it soon after this is in.
>
>>
>>> + let place = place.cast::<bindings::blk_mq_tag_set>();
>>> +
>>> + // SAFETY: try_ffi_init promises that `place` is writable, and
>>> + // zeroes is a valid bit pattern for this structure.
>>> + core::ptr::write_bytes(place, 0, 1);
>>> +
>>> + /// For a raw pointer to a struct, write a struct field without
>>> + /// creating a reference to the field
>>> + macro_rules! write_ptr_field {
>>> + ($target:ident, $field:ident, $value:expr) => {
>>> + ::core::ptr::write(::core::ptr::addr_of_mut!((*$target).$field), $value)
>>> + };
>>> + }
>>> +
>>> + // SAFETY: try_ffi_init promises that `place` is writable
>>> + write_ptr_field!(place, ops, OperationsVTable::<T>::build());
>>> + write_ptr_field!(place, nr_hw_queues , nr_hw_queues);
>>> + write_ptr_field!(place, timeout , 0); // 0 means default which is 30 * HZ in C
>>> + write_ptr_field!(place, numa_node , bindings::NUMA_NO_NODE);
>>> + write_ptr_field!(place, queue_depth , num_tags);
>>> + write_ptr_field!(place, cmd_size , core::mem::size_of::<RequestDataWrapper>().try_into()?);
>>> + write_ptr_field!(place, flags , bindings::BLK_MQ_F_SHOULD_MERGE);
>>> + write_ptr_field!(place, driver_data , core::ptr::null_mut::<core::ffi::c_void>());
>>> + write_ptr_field!(place, nr_maps , num_maps);
>>
>> I think that there is some way for pinned-init to do a better job here.
>> I feel like we ought to be able to just write:
>>
>> Opaque::init(
>> try_init!(bindings::blk_mq_tag_set {
>> ops: OperationsVTable::<T>::build(),
>> nr_hw_queues,
>> timeout: 0, // 0 means default, which is 30Hz
>> numa_node: bindings::NUMA_NO_NODE,
>> queue_depth: num_tags,
>> cmd_size: size_of::<RequestDataWrapper>().try_into()?,
>> flags: bindings::BLK_MQ_F_SHOULD_MERGE,
>> driver_data: null_mut(),
>> nr_maps: num_maps,
>> ..Zeroable::zeroed()
>> }? Error)
>> .chain(|tag_set| to_result(bindings::blk_mq_alloc_tag_set(tag_set)))
>> )
>>
>> But we don't have `Opaque::init` (shouldn't be difficult) and
>> `bindings::blk_mq_tag_set` doesn't implement `Zeroable`. We would need
>> bindgen to put `derive(Zeroable)` on certain structs...
>>
>> Another option would be to just list the fields explicitly, since there
>> aren't that many. What do you think?
>
> Both options sound good. Ofc the first one sounds more user friendly
> while the latter one sounds easier to implement. Getting rid of the
> unsafe blocks here would be really nice.
I think since it is not that expensive in this case, you should go for
the second approach.
We can fix it later, when we get the proper bindgen support.
---
Cheers,
Benno
>>> +
>>> + // SAFETY: Relevant fields of `place` are initialised above
>>> + let ret = bindings::blk_mq_alloc_tag_set(place);
>>> + if ret < 0 {
>>> + return Err(Error::from_errno(ret));
>>> + }
>>> +
>>> + Ok(())
>>> + })},
>>> + _p: PhantomData,
>>> + })
>>> + }
>>> +
>>> + /// Return the pointer to the wrapped `struct blk_mq_tag_set`
>>> + pub(crate) fn raw_tag_set(&self) -> *mut bindings::blk_mq_tag_set {
>>> + self.inner.get()
>>> + }
>>> +}
>>> +
>>> +#[pinned_drop]
>>> +impl<T: Operations> PinnedDrop for TagSet<T> {
>>> + fn drop(self: Pin<&mut Self>) {
>>> + // SAFETY: We are not moving self below
>>> + let this = unsafe { Pin::into_inner_unchecked(self) };
>>
>> You don't need to unwrap the `Pin`, since you only need access to `&Self`
>> and `Pin` always allows you to do that. (so just use `self` instead of
>> `this` below)
>
> Thanks 👍
>
>>
>>> +
>>> + // SAFETY: `inner` is valid and has been properly initialised during construction.
>>
>> Should be an invariant.
>
> Ok 👍
>
> Thanks for the review! I will send a new version.
>
>
> Best regards,
> Andreas
>
>
> [1] https://github.com/rust-lang/rustfmt/issues/3348
> [2] https://doc.rust-lang.org/nomicon/repr-rust.html#reprrust