Re: [PATCH v5 14/15] rust: sync: reduce stack usage of `UniqueArc::try_new_uninit`

From: y86-dev
Date: Mon Apr 03 2023 - 17:07:17 EST


On 03.04.23 19:56, Alice Ryhl wrote:
> On 4/3/23 18:05, Benno Lossin wrote:
>> /// Tries to allocate a new [`UniqueArc`] instance whose contents are not initialised yet.
>> pub fn try_new_uninit() -> Result<UniqueArc<MaybeUninit<T>>, AllocError> {
>> - Ok(UniqueArc::<MaybeUninit<T>> {
>> + // INVARIANT: The refcount is initialised to a non-zero value.
>> + let inner = Box::try_init::<AllocError>(try_init!(ArcInner {
>> + // SAFETY: There are no safety requirements for this FFI call.
>> + refcount: Opaque::new(unsafe { bindings::REFCOUNT_INIT(1) }),
>> + data <- init::uninit::<T, AllocError>(),
>> + }? AllocError))?;
>> + Ok(UniqueArc {
>> // INVARIANT: The newly-created object has a ref-count of 1.
>> - inner: Arc::try_new(MaybeUninit::uninit())?,
>> + // SAFETY: The pointer from the `Box` is valid.
>> + inner: unsafe { Arc::from_inner(Box::leak(inner).into()) },
>> })
>> }
>> }
>
> I'm curious - do you know whether this compiles to the same machine code
> as this?
>
> pub fn try_new_uninit() -> Result<UniqueArc<MaybeUninit<T>>, AllocError> {
> let inner: Box<MaybeUninit<ArcInner<T>>> = Box::try_new_uninit()?;
> let ptr = Box::into_raw(inner) as *mut ArcInner<T>;
> addr_of_mut!((*ptr).refcount).write(bindings::REFCOUNT_INIT(1));
> Ok(UniqueArc {
> inner: Arc {
> ptr: unsafe { NonNull::new_unchecked(ptr) },
> _p: PhantomData,
> }
> })
> }

Yes it compiles to exactly the same assembly (byte for byte), because I was
not sure if I was compiling the right thing, I compiled this function:

unsafe { core::arch::asm!("ud2") };
let r1 = UniqueArc::try_new_uninit();
unsafe { core::arch::asm!("ud2") };
let r2 = UniqueArc::try_new_uninit2();
unsafe { core::arch::asm!("ud2") };

The `ud2` instructions are for better visibility, as I have not read a lot
of assembly. The above disassembles to this:

ffffffff8143bb80 <_RNvXCsdVu6umiBwhr_12rust_minimalNtB2_11RustMinimalNtCsfATHBUcknU9_6kernel6Module4init>:
ffffffff8143bb80: 41 57 push r15
ffffffff8143bb82: 41 56 push r14
ffffffff8143bb84: 53 push rbx
ffffffff8143bb85: 49 89 ff mov r15,rdi
ffffffff8143bb88: 0f 0b ud2
ffffffff8143bb8a: bf 04 00 10 00 mov edi,0x100004
ffffffff8143bb8f: be 04 00 00 00 mov esi,0x4
ffffffff8143bb94: e8 a7 23 f5 ff call ffffffff8138df40 <__rust_alloc>
ffffffff8143bb99: 48 85 c0 test rax,rax
ffffffff8143bb9c: 74 12 je ffffffff8143bbb0 <_RNvXCsdVu6umiBwhr_12rust_minimalNtB2_11RustMinimalNtCsfATHBUcknU9_6kernel6Module4init+0x30>
ffffffff8143bb9e: 49 89 c6 mov r14,rax
ffffffff8143bba1: bf 01 00 00 00 mov edi,0x1
ffffffff8143bba6: e8 e5 fc f4 ff call ffffffff8138b890 <rust_helper_REFCOUNT_INIT>
ffffffff8143bbab: 41 89 06 mov DWORD PTR [r14],eax
ffffffff8143bbae: eb 03 jmp ffffffff8143bbb3 <_RNvXCsdVu6umiBwhr_12rust_minimalNtB2_11RustMinimalNtCsfATHBUcknU9_6kernel6Module4init+0x33>
ffffffff8143bbb0: 45 31 f6 xor r14d,r14d
ffffffff8143bbb3: 0f 0b ud2
ffffffff8143bbb5: bf 04 00 10 00 mov edi,0x100004
ffffffff8143bbba: be 04 00 00 00 mov esi,0x4
ffffffff8143bbbf: e8 7c 23 f5 ff call ffffffff8138df40 <__rust_alloc>
ffffffff8143bbc4: 48 85 c0 test rax,rax
ffffffff8143bbc7: 74 11 je ffffffff8143bbda <_RNvXCsdVu6umiBwhr_12rust_minimalNtB2_11RustMinimalNtCsfATHBUcknU9_6kernel6Module4init+0x5a>
ffffffff8143bbc9: 48 89 c3 mov rbx,rax
ffffffff8143bbcc: bf 01 00 00 00 mov edi,0x1
ffffffff8143bbd1: e8 ba fc f4 ff call ffffffff8138b890 <rust_helper_REFCOUNT_INIT>
ffffffff8143bbd6: 89 03 mov DWORD PTR [rbx],eax
ffffffff8143bbd8: eb 02 jmp ffffffff8143bbdc <_RNvXCsdVu6umiBwhr_12rust_minimalNtB2_11RustMinimalNtCsfATHBUcknU9_6kernel6Module4init+0x5c>
ffffffff8143bbda: 31 db xor ebx,ebx
ffffffff8143bbdc: 0f 0b ud2
ffffffff8143bbde: 4d 89 77 08 mov QWORD PTR [r15+0x8],r14
ffffffff8143bbe2: 49 89 5f 10 mov QWORD PTR [r15+0x10],rbx
ffffffff8143bbe6: 41 c7 07 00 00 00 00 mov DWORD PTR [r15],0x0
ffffffff8143bbed: 4c 89 f8 mov rax,r15
ffffffff8143bbf0: 5b pop rbx
ffffffff8143bbf1: 41 5e pop r14
ffffffff8143bbf3: 41 5f pop r15
ffffffff8143bbf5: c3 ret

I have not done extensive enough tests to be sure about this in general,
but most of the examples of the pin-init API that I looked at were
optimized to the same assembly as manual initialization.

The only examples were this was not the case was when I had triply nested
structs with `Box`es that all were initialized via `PinInit`. That was the
point were the initialization closure was not inlined any more.
I also verified that in this particular case the closure was again inlined
after adding `#[inline]` to it (which requires `stmt_expr_attributes`).

At some point I might do a more thorough analysis.

--
Cheers,
Benno