Re: [PATCH 03/15] print_integer: new and improved way of printing integers

From: Alexey Dobriyan
Date: Tue Apr 21 2020 - 12:49:31 EST


On Mon, Apr 20, 2020 at 09:54:17PM -0400, Steven Rostedt wrote:
> On Tue, 21 Apr 2020 00:27:23 +0300
> Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx> wrote:
>
> > >
> > > > TODO
> > > > benchmark with mainline because nouveau is broken for me -(
> > > > vsnprintf() changes make the code slower
> > >
> > > Exactly the main point of this exercise. I don't believe the algos in
> > > vsprintf.c are so dumb as to use a division per digit (yes, division by
> > > a constant which is not a power of two is a heavy operation).
> > >
> >
> > And the second point here: why not use the existing algos from vsprintf.c?
>
> Exactly. The code in _print_integer_u32() doesn't look as fast as the
> code in vsprintf() that happens to use lookup tables and converts
> without any loops.
>
> Hint: loops are bad; they cause the CPU to slow down.

Oh, come on! Loops make code fit into the icache and the µop decode cache.
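
For reference, the loop in question is the classic backwards-fill converter.
A minimal sketch of the idea (the name and details here are illustrative,
not the exact _print_integer_u32() from the series):

        /*
         * Illustrative sketch, not the series' code: fill digits right to
         * left; the caller passes p pointing one past the end of the buffer.
         * u32 as in linux/types.h.
         */
        static char *print_u32_backwards(char *p, u32 x)
        {
                do {
                        *--p = '0' + x % 10;
                        x /= 10;
                } while (x);
                return p;       /* points at the first digit */
        }

That's one division and one store per digit, and the whole thing fits in a
couple of cache lines.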

> Anyway, this patch series would require a pretty good improvement, as
> the code replacing the sprintf() usages is pretty ugly compared to a
> simple sprintf() call.

No! Fast code must look ugly. Or, in other words, if you try to optimise
integer printing to death, you'll probably end up with something like
_print_integer().

When the very first patch changed /proc/stat to seq_put_decimal_ull(), the
speedup was 66% (or 33%, depending on how you count). That's how slow
printing was back then. It can be made slightly faster even now.
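
For comparison, the lookup-table trick Steven refers to emits two digits per
step from a 200-byte pair table. Roughly like this (a sketch of the idea
only, with an illustrative name; the real put_dec() in vsprintf.c
additionally replaces the divisions with multiply-shift arithmetic):

        static char *print_u32_by_pairs(char *p, u32 x)
        {
                /* Two ASCII digits per value: "00", "01", ..., "99". */
                static const char pairs[] =
                        "00010203040506070809"
                        "10111213141516171819"
                        "20212223242526272829"
                        "30313233343536373839"
                        "40414243444546474849"
                        "50515253545556575859"
                        "60616263646566676869"
                        "70717273747576777879"
                        "80818283848586878889"
                        "90919293949596979899";

                /* Peel off two digits per division. */
                while (x >= 100) {
                        unsigned int r = x % 100;

                        x /= 100;
                        *--p = pairs[2 * r + 1];
                        *--p = pairs[2 * r];
                }
                /* At most two digits left. */
                if (x >= 10) {
                        *--p = pairs[2 * x + 1];
                        *--p = pairs[2 * x];
                } else {
                        *--p = '0' + x;
                }
                return p;
        }

Halving the number of divisions is where most of that win comes from.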

> Randomly picking patch 6:
>
> static int loadavg_proc_show(struct seq_file *m, void *v)
> {
>         unsigned long avnrun[3];
>
>         get_avenrun(avnrun, FIXED_1/200, 0);
>
>         seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
>                 LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
>                 LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
>                 LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
>                 nr_running(), nr_threads,
>                 idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
>         return 0;
> }
>
> *vs*
>
> static int loadavg_proc_show(struct seq_file *m, void *v)
> {
>         unsigned long avnrun[3];
>         char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1];
>         char *p = buf + sizeof(buf);
>         int i;
>
>         *--p = '\n';
>         p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
>         *--p = ' ';
>         p = _print_integer_u32(p, nr_threads);
>         *--p = '/';
>         p = _print_integer_u32(p, nr_running());
>
>         get_avenrun(avnrun, FIXED_1/200, 0);
>         for (i = 2; i >= 0; i--) {
>                 *--p = ' ';
>                 --p;            /* overwritten */
>                 *--p = '0';     /* conditionally overwritten */
>                 (void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i]));
>                 *--p = '.';
>                 p = _print_integer_ul(p, LOAD_INT(avnrun[i]));
>         }
>
>         seq_write(m, p, buf + sizeof(buf) - p);
>         return 0;
> }
>
>
> I'd much rather keep the first version.

I did the benchmarks (without stack protector, though): everything except
/proc/cpuinfo and /proc/meminfo became faster. That requires investigation,
and I can drop the vsprintf() changes until then.
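
A microbenchmark for this sort of change can be as simple as timing
back-to-back reads of one /proc file from userspace; something along these
lines (an illustration, not the exact harness used here; the file name and
iteration count are arbitrary):

        #include <fcntl.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[4096];
                struct timespec a, b;
                int i;

                clock_gettime(CLOCK_MONOTONIC, &a);
                for (i = 0; i < 1 << 20; i++) {
                        int fd = open("/proc/loadavg", O_RDONLY);

                        if (fd < 0)
                                return 1;
                        read(fd, buf, sizeof(buf));
                        close(fd);
                }
                clock_gettime(CLOCK_MONOTONIC, &b);
                printf("%f s\n", (b.tv_sec - a.tv_sec) +
                                 (b.tv_nsec - a.tv_nsec) / 1e9);
                return 0;
        }

Run the same binary on both kernels and compare the wall-clock times.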

Now, given that the /proc/uptime format is cast in stone, the code may look
a bit ugly and unusual, but it won't require maintenance.