RE: [tip:locking/core] refcount_t: Introduce a special purpose refcount type

From: Reshetova, Elena
Date: Wed Feb 15 2017 - 06:17:29 EST


> * Reshetova, Elena <elena.reshetova@xxxxxxxxx> wrote:
>
> > > Subject: refcount: Out-of-line everything
> > > From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Date: Fri Feb 10 16:27:52 CET 2017
> > >
> > > Linus asked to please make this real C code.
> >
> > Perhaps a completely stupid question, but I am going to ask anyway, since
> > that is the only way I can learn. What real difference does it make? Or are
> > we talking more about style etc.? Is it because of size concerns? This way
> > it is certainly done differently from the rest of atomic and kref...
>
> It's about generated code size mostly.
>
> This inline function is way too large to be inlined:
>
> static inline __refcount_check
> bool refcount_add_not_zero(unsigned int i, refcount_t *r)
> {
> 	unsigned int old, new, val = atomic_read(&r->refs);
>
> 	for (;;) {
> 		/* A zero refcount must not be resurrected. */
> 		if (!val)
> 			return false;
>
> 		/* A saturated counter stays saturated. */
> 		if (unlikely(val == UINT_MAX))
> 			return true;
>
> 		new = val + i;
> 		if (new < val)
> 			new = UINT_MAX;	/* saturate on overflow */
> 		old = atomic_cmpxchg_relaxed(&r->refs, val, new);
> 		if (old == val)
> 			break;
>
> 		val = old;
> 	}
>
> 	REFCOUNT_WARN(new == UINT_MAX, "refcount_t: saturated; leaking memory.\n");
>
> 	return true;
> }
>
> When used, this function generates this much code on an x86-64 defconfig:
>
> 00000000000084d0 <test>:
>     84d0:	8b 0f                	mov    (%rdi),%ecx
>     84d2:	55                   	push   %rbp
>     84d3:	48 89 e5             	mov    %rsp,%rbp
>
>     84d6:	85 c9                	test   %ecx,%ecx              |
>     84d8:	74 30                	je     850a <test+0x3a>       |
>     84da:	83 f9 ff             	cmp    $0xffffffff,%ecx       |
>     84dd:	be ff ff ff ff       	mov    $0xffffffff,%esi       |
>     84e2:	75 04                	jne    84e8 <test+0x18>       |
>     84e4:	eb 1d                	jmp    8503 <test+0x33>       |
>     84e6:	89 c1                	mov    %eax,%ecx              |
>     84e8:	8d 51 01             	lea    0x1(%rcx),%edx         |
>     84eb:	89 c8                	mov    %ecx,%eax              |
>     84ed:	39 ca                	cmp    %ecx,%edx              |
>     84ef:	0f 42 d6             	cmovb  %esi,%edx              |
>     84f2:	f0 0f b1 17          	lock cmpxchg %edx,(%rdi)      |
>     84f6:	39 c8                	cmp    %ecx,%eax              |
>     84f8:	74 09                	je     8503 <test+0x33>       |
>     84fa:	85 c0                	test   %eax,%eax              |
>     84fc:	74 0c                	je     850a <test+0x3a>       |
>     84fe:	83 f8 ff             	cmp    $0xffffffff,%eax       |
>     8501:	75 e3                	jne    84e6 <test+0x16>       |
>     8503:	b8 01 00 00 00       	mov    $0x1,%eax              |
>
>     8508:	5d                   	pop    %rbp
>     8509:	c3                   	retq
>     850a:	31 c0                	xor    %eax,%eax
>     850c:	5d                   	pop    %rbp
>     850d:	c3                   	retq
>
>
> (I've annotated the fastpath impact with '|'. Out-of-line code generally
> does not count.)
>
> It's about ~50 bytes of code per usage site. It might be better in some cases, but
> not by much.
>
> This is way above any sane inlining threshold. The 'unconditionally good' inlining
> threshold is at 1-2 instructions and below ~10 bytes of code.
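>
> (A sketch of what the out-of-lining ends up looking like, assuming the
> usual lib/ pattern rather than quoting Peter's exact diff: the header
> keeps only a declaration, and the single copy of the body lives in
> lib/refcount.c.)
>
> /* include/linux/refcount.h: declaration only, no body in the header */
> extern __must_check bool refcount_add_not_zero(unsigned int i, refcount_t *r);
>
> /*
>  * lib/refcount.c: one out-of-line copy of the cmpxchg loop quoted above,
>  * compiled once for the whole kernel and exported for modules.
>  */
> bool refcount_add_not_zero(unsigned int i, refcount_t *r)
> {
> 	/* ... same loop body as quoted above ... */
> }
> EXPORT_SYMBOL(refcount_add_not_zero);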
>
> So for example refcount_set() and refcount_read() can stay inlined:
>
> static inline void refcount_set(refcount_t *r, unsigned int n)
> {
> 	atomic_set(&r->refs, n);
> }
>
> static inline unsigned int refcount_read(const refcount_t *r)
> {
> 	return atomic_read(&r->refs);
> }
>
>
> ... because they compile into a single instruction with 2-5 bytes of I$ overhead.
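>
> (For illustration, assuming the same x86-64 defconfig as above, each of
> those accessors is a single load or store at the usage site, e.g.:
>
>     8b 07                	mov    (%rdi),%eax	# refcount_read()
>     89 37                	mov    %esi,(%rdi)	# refcount_set()
>
> i.e. 2 bytes each in the simplest register/addressing case.)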
>
> Thanks,
>
> Ingo

Thank you very much, Ingo, for such a detailed and nice explanation!

Best Regards,
Elena