Re: [RFCv2] string: Use faster alternatives when constant arguments are used

From: Rasmus Villemoes
Date: Mon Mar 25 2019 - 17:24:08 EST


On 24/03/2019 23.32, Sultan Alsawaf wrote:
> On Sun, Mar 24, 2019 at 10:17:49PM +0100, Rasmus Villemoes wrote:
>> gcc already knows the semantics of these functions and can optimize
>> accordingly. E.g. for strcpy() of a literal to a buffer, gcc readily
>> compiles
>
> The example you gave appears to get optimized accordingly, but there are
> numerous sane counterexamples that don't get optimized.
>
> On arm64, with GCC 8.3.0, let's look at the simple strcmp usage in
> kernel/trace/preemptirq_delay_test.c:
>
> static int preemptirq_delay_run(void *data)
> {
> unsigned long flags;
>
> if (!strcmp(test_mode, "irq")) {
> local_irq_save(flags);
> busy_wait(delay);
> local_irq_restore(flags);
> } else if (!strcmp(test_mode, "preempt")) {
> preempt_disable();
> busy_wait(delay);
> preempt_enable();
> }
>
> return 0;
> }
>
> This is what happens without my patch:

Excellent, this is the kind of examples I was looking for. On x86-64,
there's no callq strcmp in sight (just a repz cmpsb loop). With your
patch, the code gets compiled similar to the arm64 case (it's a rather
convenient case which can be done as single 4- or 8-byte comparisons).

> I can produce lots of kernel examples that don't seem to follow basic str*
> optimization, which is what prompted me to write this in the first place.

Probably because the compiler doesn't want to introduce memory accesses
that are not allowed by the C language.

>> Does this even compile? It's a well-known (or perhaps
>> not-so-well-known?) pitfall that __builtin_constant_p() is not
>> guaranteed to be usable in __builtin_choose_expr() - the latter only
>> accepts bona fide integer constant expressions, while evaluation of
>> __builtin_constant_p can be delayed until various optimization phases.
>
> It not only compiles, but also seems to work correctly from what I've observed.

Interesting. I'll have to look into how that can be.

>> This seems to be buggy - you don't know that src is at least as long as
>> dest. And arguing "if it's shorter, there's a nul byte, which will
>> differ from dest at that point, so memcmp will/should stop" means that
>> the whole word-at-a-time argument would be out.
>
> You mean reading the last word of a string could read out of bounds of the
> string when using memcmp? There are lots of memcmp instances using a literal
> string in the kernel; I don't think it would be hard to find one that violates
> what you've pointed out, so I'm not really sure why it's a problem.

Existing code may have bugs. Or existing code may use a dubious pattern
which happens to be ok due to the context it appears in, but becomes
buggy when copy-pasted elsewhere.

> After a few minutes of grepping, it seems like the memcmp usage in
> drivers/md/dm-integrity.c can read out of bounds on its arguments:
> } else if (!memcmp(opt_string, "internal_hash:", strlen("internal_hash:"))) {
> r = get_alg_and_key(opt_string, &ic->internal_hash_alg, &ti->error,
> "Invalid internal_hash argument");
> if (r)
> goto bad;

Yeah, this actually looks buggy. Maybe we don't have any memcmp()
implementations that do word-at-a-time (it would mostly make sense on BE
architectures). Or maybe there's something here that guarantees
sufficient slack after the nul byte, so we never read across a page
boundary (which is how bad things could happen).

Something like

char stackbuffer[32];
fill_stackbuffer_somehow();
if (memcmp(stackbuffer, "literal", strlen("literal"))

should be sorta ok, even if memcmp might end up reading uninitialized bytes.

What I'm worried about is your patch changing every single strcmp(,
"literal") into a memcmp, with absolutely no way of knowing or checking
anything about the other buffer. And actually, it doesn't have to be a
BE arch with a word-at-a-time memcmp. If (as is usually the case) the
strcmp() result is compared to zero, after you change

!strcmp(buf, "literal")

into

!memcmp(buf, "literal", 8)

the compiler may (exactly as you want it to) change that into a single
8-byte load (or two 4-byte loads) and comparisons to literals, no
memcmp() involved. And how do you know that _that_ is ok, for every one
of the hundreds, if not thousands, of instances in the tree?

So, for strcpy(), I think gcc does optimize across all arches. For
strcat(), I dunno, it probably doesn't have many users and users of
strcat() obviously do not care about performance anyway. And for
strcmp(), some arch backends already optimize pretty well, some may be
lacking (either because the arch does not have suitable string ops, or
because nobody implemented the optimizations in gcc yet), and converting
blindly to memcmp() seems to be somewhat dangerous - a few pre-existing
"open-coded" such instances doesn't really justify a tree-wide conversion.

Rasmus