Re: [PATCH] net: deinline netif_tx_stop_queue() and netif_tx_stop_all_queues()

From: Denys Vlasenko
Date: Fri May 08 2015 - 05:45:33 EST


On 05/07/2015 07:14 PM, Alexander Duyck wrote:
> On 05/07/2015 04:41 AM, Denys Vlasenko wrote:
>> These functions compile to ~60 bytes of machine code each.
>>
>> With this .config: http://busybox.net/~vda/kernel_config
>> there are 617 calls to netif_tx_stop_queue()
>> and 49 calls to netif_tx_stop_all_queues() in vmlinux.
>>
>> Code size is reduced by 27 kbytes:
>>
>>     text     data      bss       dec     hex filename
>> 82426986 22255416 20627456 125309858 77813a2 vmlinux.before
>> 82399481 22255416 20627456 125282353 777a831 vmlinux
>>
>> It may seem strange that a seemingly simple code like one in
>> netif_tx_stop_queue() compiles to ~60 bytes of code.
>> Well, it's true. Here's its disassembly:
>>
>> netif_tx_stop_queue:
...
>> 55 push %rbp
>> be 7a 18 00 00 mov $0x187a,%esi
>> 48 c7 c7 50 59 d8 85 mov $.rodata+0x1d85950,%rdi
>> 48 89 e5 mov %rsp,%rbp
>> e8 54 5a 7d fd callq <warn_slowpath_null>
>> 48 c7 c7 5f 59 d8 85 mov $.rodata+0x1d8595f,%rdi
>> 31 c0 xor %eax,%eax
>> e8 b0 47 48 00 callq <printk>
>> eb 09 jmp <netif_tx_stop_queue+0x38>
>
> This is the WARN_ON action. One thing you might try doing is moving
> this to a function of its own instead of moving the entire thing
> out of being an inline.

If the WARN_ON check were moved into a function, the call overhead
would still be there, while each callsite would be larger than with
this patch.
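
To make the comparison concrete, here is a rough userspace sketch of
that alternative: the set-bit stays inline and only the warning goes
out of line. It is not kernel code. struct txq_model, TXQ_MODEL_XOFF,
txq_model_stop() and txq_model_warn_null() are made-up names, C11
atomics stand in for the kernel's locked set-bit, fprintf() stands in
for WARN_ON + printk, and the noinline attribute is a GCC/Clang
extension.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct netdev_queue. */
struct txq_model {
	atomic_ulong state;
};

/* Models the bit set by the "lock orb $0x1" instruction. */
#define TXQ_MODEL_XOFF 0x1UL

/* Counts how often the slow path ran; only here so it can be observed. */
static unsigned int txq_model_warn_count;

/*
 * Out-of-line slow path: the warning code and its string exist once,
 * instead of being duplicated at every call site.
 */
static __attribute__((noinline)) void txq_model_warn_null(void)
{
	txq_model_warn_count++;
	fprintf(stderr,
		"netif_stop_queue() cannot be called before register_netdev()\n");
}

/*
 * Inline fast path: each call site still carries the NULL test, a call
 * to the slow path, and the atomic OR.
 */
static inline void txq_model_stop(struct txq_model *q)
{
	if (!q) {
		txq_model_warn_null();
		return;
	}
	atomic_fetch_or(&q->state, TXQ_MODEL_XOFF);
}
```

In this shape every caller keeps a test, a branch, a call instruction
and the locked OR, which is why each callsite ends up larger than a
single call to a fully out-of-line function.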

> You may find you still get most
> of the space savings as I wonder if the string for the printk
> isn't being duplicated for each caller.

Yes, strings are duplicated:

$ strings vmlinux0 | grep 'cannot be called before register_netdev'
6netif_stop_queue() cannot be called before register_netdev()
6tun: netif_stop_queue() cannot be called before register_netdev()
6cc770: netif_stop_queue() cannot be called before register_netdev()
63c589_cs: netif_stop_queue() cannot be called before register_netdev()
63c574_cs: netif_stop_queue() cannot be called before register_netdev()
6typhoon netif_stop_queue() cannot be called before register_netdev()
6axnet_cs: netif_stop_queue() cannot be called before register_netdev()
6pcnet_cs: netif_stop_queue() cannot be called before register_netdev()
...

However, they amount only to ~5.7k out of 27k:

$ strings vmlinux0 | grep 'cannot be called before register_netdev' | wc -c
5731


>> f0 80 8f e0 01 00 00 01 lock orb $0x1,0x1e0(%rdi)
>
> This is your set bit operation. If you were to drop the whole WARN_ON
> then this is the only thing you would be inlining.

It's up to networking people to decide. I would happily send a patch which drops
WARN_ON if they say that's ok with them. Davem?


> That is only 8 bytes in size which would probably be comparable to the callq
> and register sorting needed for a function call.

"lock or" in my tests takes 21 cycles even when the cache line is held
exclusively in the L1 data cache. The added "call+ret" costs 4-5 cycles.
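
For anyone who wants to check the relative cost on their own machine,
a crude userspace sketch along these lines should do. It is not the
test the numbers above came from. All names here (bench_state,
bench_stop_inline(), bench_stop_outofline(), bench_ns_per_stop()) are
made up, C11 atomic_fetch_or() stands in for the kernel's locked
set-bit, clock() reports CPU time in nanoseconds rather than cycles,
and the noinline attribute is a GCC/Clang extension.

```c
#include <stdatomic.h>
#include <time.h>

/* Hypothetical stand-in for dev_queue->state; stays hot in L1. */
static atomic_ulong bench_state;

/* Out-of-line variant: the atomic OR sits behind a real call/ret. */
static __attribute__((noinline)) void bench_stop_outofline(atomic_ulong *state)
{
	atomic_fetch_or(state, 1UL);
}

/* Inline variant: only the atomic OR, no call overhead when optimized. */
static inline void bench_stop_inline(atomic_ulong *state)
{
	atomic_fetch_or(state, 1UL);
}

/*
 * Returns the average CPU time per operation in nanoseconds.
 * The loop and branch overhead is included in both variants, so only
 * the difference between the two results is meaningful.
 */
static double bench_ns_per_stop(int out_of_line, unsigned long iters)
{
	clock_t start = clock();

	for (unsigned long i = 0; i < iters; i++) {
		if (out_of_line)
			bench_stop_outofline(&bench_state);
		else
			bench_stop_inline(&bench_state);
	}

	clock_t end = clock();

	return (double)(end - start) / CLOCKS_PER_SEC * 1e9 / (double)iters;
}
```

Build with optimization (e.g. -O2), otherwise the "inline" variant is
a real call too. Calling bench_ns_per_stop(0, 10000000UL) and
bench_ns_per_stop(1, 10000000UL) and comparing the two results gives a
rough idea of what the extra call+ret adds on top of the locked OR.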

> Have you done any performance testing on this change?

No.

> I suspect there will likely be a noticeable impact on some tests.

(1) It's a *transmit off* operation. Usually it means that we have to turn
transmit off because the hw TX queue is full. So the bottleneck is likely
the network, not the CPU.

(2) It was auto-deinlined by gcc anyway. We were already unknowingly
using the uninlined version for some time. Apparently, it wasn't noticed.


--
vda
