Re: [PATCH 03/15] print_integer: new and improved way of printing integers

From: Andy Shevchenko
Date: Mon Apr 20 2020 - 17:19:13 EST


On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> Time honored way to print integers via vsnprintf() or equivalent has
> unavoidable slowdown of parsing format string. This can't be fixed in C,
> without introducing external preprocessor.
>
> seq_put_decimal_ull() partially saves the day, but there are a lot of
> branches inside and overcopying still.
>
> _print_integer_*() family of functions is meant to make printing
> integers as fast as possible by deleting format string parsing and doing
> as little work as possible.
>
> It is based on the following observations:
>
> 1) memcpy is done in forward direction
> it can be done backwards but nobody does that,
>
> 2) digits can be extracted in a very simple loop which costs only
> 1 multiplication and shift (division by constant is not division)
>
> All the above asks for the following signature, semantics and pattern of
> printing out beloved /proc files:
>
> /* seq_printf(seq, "%u %llu\n", A, B); */
>
> char buf[10 + 1 + 20 + 1];
> char *p = buf + sizeof(buf);
>
> *--p = '\n';
> p = _print_integer_u64(p, B);
> *--p = ' ';
> p = _print_integer_u32(p, A);
>
> seq_write(seq, p, buf + sizeof(buf) - p);
>
> 1) stack buffer capable of holding the biggest string is allocated.
>
> 2) "p" is pointer to start of the string. Initially it points past
> the end of the buffer WHICH IS NOT NUL-TERMINATED!
>
> 3) _print_integer_*() actually prints an integer from right to left
> and returns new start of the string.
>
>	<--------|
>	      123
>	      ^
>	      |
>	      +-- p
>
> 4) 1 character is printed with
>
> *--p = 'x';
>
> It generates very efficient code as multiple writes can be
> merged.
>
> 5) fixed string is printed with
>
> p = memcpy(p - 3, "foo", 3);
>
> Compilers know what memcpy() does and write-combine it.
> 4/8-byte writes become 1 instruction and are very efficient.
>
> 6) Once everything is printed, the result is written to seq_file buffer.
> It does only one overflow check and 1 copy.
>
> This generates very efficient code (and small!).
>
> In regular seq_printf() calls, the first argument and the format string
> are constantly reloaded. The format string will most likely be addressed
> with [rip+...], which is quite verbose.
>
> seq_put_decimal_ull() will do branches (and even more branches
> with "width" argument)
>

> TODO
> benchmark with mainline because nouveau is broken for me -(
> vsnprintf() changes make the code slower

That is exactly the main point of this exercise. I don't believe that the
algorithms in vsprintf.c are so dumb as to use a division per digit (yes,
division by a constant which is not a power of two is a heavy operation).
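
For reference, the "division by constant is not division" remark quoted
above refers to replacing the divide with a multiply by a precomputed
reciprocal plus a shift. A minimal userspace sketch of that trick for a
32-bit value follows; the names div10() and divmod10() are made up for
illustration and are not part of the patch. Whether this is cheap enough
per digit is what a benchmark would have to show.

```c
#include <stdint.h>

/*
 * x / 10 for a 32-bit unsigned x without a divide instruction.
 * 0xCCCCCCCD is ceil(2^35 / 10); the 64-bit product cannot overflow.
 */
uint32_t div10(uint32_t x)
{
	return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/*
 * One digit-extraction step: returns the quotient and stores the
 * remainder, using a multiply and shift for the quotient and a
 * multiply and subtract for the remainder.
 */
uint32_t divmod10(uint32_t x, uint32_t *rem)
{
	uint32_t q = div10(x);

	*rem = x - q * 10;
	return q;
}
```

Compilers typically generate something of this shape on their own for
"x / 10" and "x % 10", so the sketch only spells out what each digit costs.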


> +noinline
> +char *_print_integer_u32(char *p, u32 x)
> +{
> +	do {
> +		*--p = '0' + (x % 10);
> +	} while (x /= 10);
> +	return p;
> +}

> +noinline
> +char *_print_integer_u64(char *p, u64 x)
> +{
> +	while (x >= 100 * 1000 * 1000) {
> +		u32 r;
> +
> +		x = div_u64_rem(x, 100 * 1000 * 1000, &r);
> +		p = memset(p - 8, '0', 8);
> +		(void)_print_integer_u32(p + 8, r);
> +	}
> +	return _print_integer_u32(p, x);
> +}
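
To make it easier to test and benchmark this outside the kernel, here is a
rough userspace sketch of the two helpers together with the right-to-left
usage pattern from the commit message. Assumptions: u32/u64 become
uint32_t/uint64_t, div_u64_rem() becomes plain / and %, and the names
print_u32(), print_u64() and format_pair() are hypothetical stand-ins, not
the patch's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for _print_integer_u32(): emits digits right to left. */
char *print_u32(char *p, uint32_t x)
{
	do {
		*--p = '0' + (x % 10);
	} while (x /= 10);
	return p;
}

/* Stand-in for _print_integer_u64(): peels off 8 digits per iteration. */
char *print_u64(char *p, uint64_t x)
{
	while (x >= 100000000u) {
		uint32_t r = (uint32_t)(x % 100000000u);

		x /= 100000000u;
		/* Pre-fill 8 zeros so short remainders stay zero-padded. */
		p = memset(p - 8, '0', 8);
		(void)print_u32(p + 8, r);
	}
	return print_u32(p, (uint32_t)x);
}

/*
 * Hypothetical helper doing what "%u %llu\n" would do. Builds the string
 * backwards in a stack buffer, then copies it out once with a NUL added.
 * Returns the string length, or 0 if out is too small.
 */
size_t format_pair(char *out, size_t outsz, uint32_t a, uint64_t b)
{
	char buf[10 + 1 + 20 + 1];	/* max u32, ' ', max u64, '\n' */
	char *end = buf + sizeof(buf);
	char *p = end;
	size_t len;

	*--p = '\n';
	p = print_u64(p, b);
	*--p = ' ';
	p = print_u32(p, a);

	assert(p >= buf);
	len = (size_t)(end - p);
	if (len + 1 > outsz)
		return 0;
	memcpy(out, p, len);
	out[len] = '\0';
	return len;
}
```

With the largest inputs the 32-byte buffer is filled exactly, which is the
sizing the commit message relies on.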

--
With Best Regards,
Andy Shevchenko