RE: [PATCH] add slice by 8 algorithm to crc32.c

From: Joakim Tjernlund
Date: Mon Aug 08 2011 - 03:15:24 EST


"Bob Pearson" <rpearson@xxxxxxxxxxxxxxxxxxxxx> wrote on 2011/08/05 19:27:26:
>
> > >
> > > >
> > > > >
> > > > > Modify all 'i' loops from for (i = 0; i < foo; i++) { ... } to
> > > > > for (i = foo - 1; i >= 0; i--) { ... }
> > > >
> > > > That should be (i = foo; i ; --i) { ... }
> > >
> > > Shouldn't make much difference, branch on zero bit or branch on sign bit.
> > > But at the end of the day didn't help on Nehalem.
>
> I figured out why "for (i = 0; i < len; i++) {...}" is faster than
> "for (; len; len--) {...}" on my system.
> The current code is
>
> for (; len; len--) {
> load *++p
> ...
> }
>
> Which turns into (in fake assembly)
>
> top:
> dec len
> inc p
> load p
> ...
> test len
> branch neq top
>
> But when I replace that with
>
> for(i = 0; i < len; i++) {
> load *++p
> ...
> }
>
> Gcc turns it into
>
> top:
> load p[i]
> i++
> ...
> compare i, len
> branch lt top
>
> which is fewer instructions and i++ is well scheduled. Incrementing the
> pointer has been moved out of the loop.

I see. Let's leave the pre- vs. post-increment question for now. That is
something that can be sorted out separately.
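
For reference, the two loop forms compared above can be sketched side by
side as below. This is only an illustration, not the code from the patch:
the function names are made up, and the per-byte step is a plain bitwise
CRC-32 (reflected polynomial 0xEDB88320) standing in for the table lookups
in crc32.c. The count-down form uses post-increment here so the pointer
never points before the buffer. Both functions should return the same value
for the same input; which one is faster depends on the compiler and CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fold one byte into the CRC, one bit at a time.
 * Stand-in for the table-driven step; not taken from the patch. */
static uint32_t crc32_step(uint32_t crc, unsigned char byte)
{
    crc ^= byte;
    for (int k = 0; k < 8; k++) {
        /* mask is all ones when the low bit is set, else zero */
        uint32_t mask = (uint32_t)0 - (crc & 1u);
        crc = (crc >> 1) ^ (0xEDB88320u & mask);
    }
    return crc;
}

/* Form 1: count len down to zero and advance the pointer each pass. */
uint32_t crc32_countdown(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (; len; len--)
        crc = crc32_step(crc, *p++);
    return crc ^ 0xFFFFFFFFu;
}

/* Form 2: count i up and index from a fixed base pointer, so the
 * compiler can use an indexed load and keep p constant in the loop. */
uint32_t crc32_indexed(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
        crc = crc32_step(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}
```

Comparing the output of "gcc -O2 -S" for the two functions is one way to
check whether a given compiler really generates the different loop bodies
described above.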

Jocke
