Re: Big git diff speedup by avoiding x86 "fast string" memcmp

From: Nick Piggin
Date: Sun Dec 19 2010 - 10:46:17 EST

Next message: Larry Finger: "Re: [Bug #22562] Regression in 2.6.37-rc1 - logs spammed with "unableto enumerate USB port" - bisected to commit 3df7169e"
Previous message: United Nations Poverty Alleviation: "[no subject]"
In reply to: Boaz Harrosh: "Re: Big git diff speedup by avoiding x86 "fast string" memcmp"
Next in thread: George Spelvin: "Re: Big git diff speedup by avoiding x86 "fast string" memcmp"

On Sun, Dec 19, 2010 at 9:54 AM, George Spelvin <linux@xxxxxxxxxxx> wrote:
>> static inline int dentry_memcmp_long(const unsigned char *cs,
>>                 const unsigned char *ct, ssize_t count)
>> {
>>         int ret;
>>         const unsigned long *ls = (const unsigned long *)cs;
>>         const unsigned long *lt = (const unsigned long *)ct;
>>
>>         while (count > 8) {
>>                 ret = (*cs != *ct);
>>                 if (ret)
>>                         break;
>>                 cs++;
>>                 ct++;
>>                 count-=8;
>>         }
>>         if (count) {
>>                 unsigned long t = *ct & ((0xffffffffffffffff >> ((8 - count) * 8))
>>                 ret = (*cs != t)
>>         }
>>
>>         return ret;
>> }
>
> First, let's get the code right, and use correct types, but also, there

You still used the wrong vars in the loop.

> are some tricks to reduce the masking cost.
>
> As long as you have to mask one string, *and* don't have to worry about
> running off the end of mapped memory, there's no additional cost to
> masking both in the loop. Just test (a ^ b) & mask.

I considered using a lookup table, but maybe not carefully enough. It is
another cacheline, but one that is common to all lookups. So it could
well be worth it; let's keep your code around...
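
For illustration only, here is a minimal userspace sketch of a word-at-a-time compare that uses the (a ^ b) & mask test quoted above. It is not the code from this thread: the names load_word, tail_mask and name_cmp_words are hypothetical, it masks only the final partial word rather than every word, and it assumes both buffers are readable up to the next multiple of 8 bytes. The mask is built in memory order, so the sketch does not depend on endianness.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes without alignment or aliasing problems. */
static uint64_t load_word(const unsigned char *p)
{
    uint64_t w;

    memcpy(&w, p, sizeof(w));
    return w;
}

/* Mask that keeps the first n bytes of a word in memory order, 0 <= n <= 8. */
static uint64_t tail_mask(size_t n)
{
    unsigned char bytes[8] = { 0 };
    uint64_t m;

    memset(bytes, 0xff, n);
    memcpy(&m, bytes, sizeof(m));
    return m;
}

/*
 * Returns 0 if the first count bytes of cs and ct are equal, 1 otherwise.
 * ASSUMPTION: both buffers are readable up to the next multiple of 8 bytes,
 * because the final load may read past count.
 */
static int name_cmp_words(const unsigned char *cs, const unsigned char *ct,
                          size_t count)
{
    while (count >= 8) {
        if (load_word(cs) != load_word(ct))
            return 1;
        cs += 8;
        ct += 8;
        count -= 8;
    }
    if (count == 0)
        return 0;
    /* Bytes beyond count are masked off before testing for a difference. */
    return ((load_word(cs) ^ load_word(ct)) & tail_mask(count)) != 0;
}
```

A precomputed table of the eight possible masks would replace tail_mask at the cost of the extra cacheline mentioned above.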

The big problem for CPUs that don't do well on this type of code is
what the string goes through during the entire syscall.

First, a byte-by-byte strcpy_from_user of the whole name string to
kernel space. Then byte-by-byte chunking and hashing of the path
components, split on '/'. Then a byte-by-byte memcmp against the
dentry name.
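
To make the second of those passes concrete, here is a rough userspace sketch of byte-by-byte component splitting and hashing. The function names and the hash step are hypothetical stand-ins, not the kernel's actual routines; the point is only that every byte of the name is touched one at a time.

```c
#include <stddef.h>

/* Hypothetical per-byte hash step; NOT the kernel's real hash function. */
static unsigned long hash_step(unsigned long hash, unsigned char c)
{
    return hash * 31 + c;
}

/*
 * Hash the next path component byte by byte, stopping at '/' or NUL.
 * Stores the component length in *len, advances *pathp past the component
 * and any slashes that follow it, and returns the component's hash.
 */
static unsigned long next_component(const char **pathp, size_t *len)
{
    const char *p = *pathp;
    unsigned long hash = 0;
    size_t n = 0;

    while (*p == '/')            /* skip leading slashes */
        p++;
    while (p[n] != '\0' && p[n] != '/') {
        hash = hash_step(hash, (unsigned char)p[n]);
        n++;
    }
    *len = n;
    p += n;
    while (*p == '/')            /* skip trailing slashes */
        p++;
    *pathp = p;
    return hash;
}
```

Calling next_component repeatedly until *pathp points at the terminating NUL walks the whole path, one component per call.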

I'd love to do everything with 8 byte loads, do the component
separation and hashing at the same time as copy from user, and
have the padded and aligned component strings and their hash
available... but complexity.
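
A very rough sketch of what "padded and aligned component strings and their hash" could look like, assuming a fixed 64 byte capacity per component and a placeholder hash. Everything here (struct component, copy_hash_component, component_equal) is hypothetical. The copy is still done per byte for simplicity, but it fuses the copy and the hash into one pass, and the later compare needs no tail masking because the name is zero padded to a multiple of 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NAME_WORDS 8  /* hypothetical capacity: 64 bytes per component */

struct component {
    unsigned long hash;
    size_t len;
    uint64_t words[NAME_WORDS];  /* name bytes, zero padded, 8 byte aligned */
};

/*
 * Copy one component (up to '/' or NUL) into out while hashing it in the
 * same pass. Returns a pointer to the first byte that was not consumed.
 * LIMITATION: a component longer than the capacity is truncated.
 */
static const char *copy_hash_component(const char *src, struct component *out)
{
    unsigned char *dst = (unsigned char *)out->words;
    unsigned long hash = 0;
    size_t n = 0;

    memset(out->words, 0, sizeof(out->words));
    while (src[n] != '\0' && src[n] != '/' && n < sizeof(out->words)) {
        unsigned char c = (unsigned char)src[n];

        dst[n] = c;
        hash = hash * 31 + c;  /* placeholder hash, not the kernel's */
        n++;
    }
    out->len = n;
    out->hash = hash;
    return src + n;
}

/* Compare two padded components a whole word at a time, with no masking. */
static int component_equal(const struct component *a,
                           const struct component *b)
{
    size_t i, nwords;

    if (a->len != b->len || a->hash != b->hash)
        return 0;
    nwords = (a->len + 7) / 8;
    for (i = 0; i < nwords; i++)
        if (a->words[i] != b->words[i])
            return 0;
    return 1;
}
```

Doing the copy itself with 8 byte loads, as wished for above, would additionally need a way to find the '/' or NUL inside a word, which is where much of the complexity lies.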

On my Westmere system, the time to do a stat is 640 cycles plus 10
cycles for every byte in the string (this cost holds perfectly
from 1 byte names up to 32 byte names, which is my test range).
The average path name in `git diff` is 31 bytes, although that
workload is much less cache friendly and its paths span several
components (my test uses just a single component).

But still, even if the base cost were doubled, it may still
spend 20% or so of kernel cycles in name string handling.
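
As a rough check of that figure, assume the 640 + 10n cycle model carries over to a 31 byte name and that "doubled" means a base cost of 1280 cycles. Both are simplifications, since the real git diff paths have several components and are colder in cache:

```latex
\[
\frac{10 \times 31}{2 \times 640 + 10 \times 31}
  = \frac{310}{1590} \approx 0.195
\]
```

With the measured, undoubled base the same ratio is 310 / 950, or roughly a third.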

This 8 byte memcmp takes my microbenchmark down to 8 cycles per
byte, so it may gain several more % on git diff.
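
As back-of-envelope arithmetic on the single-component microbenchmark only, assuming the 640 cycle base is unchanged and using the 31 byte average name length quoted above:

```latex
\[
T_{10} = 640 + 10 \times 31 = 950, \qquad
T_{8} = 640 + 8 \times 31 = 888, \qquad
\frac{T_{10} - T_{8}}{T_{10}} = \frac{62}{950} \approx 0.065
\]
```

That is about 6.5% of a stat in the microbenchmark; how much of it carries over to a full git diff run is not something this arithmetic can show.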

Some careful thought about the initial strcpy_from_user and the
hashing code could shave another few cycles off it. Well
worth investigating, I think.
