Re: Pentium-M?

From: dean gaudet
Date: Mon Aug 25 2003 - 23:11:00 EST


On Sat, 23 Aug 2003, Zwane Mwaikambo wrote:

> On Sat, 23 Aug 2003, Barry K. Nathan wrote:
>
> > On Sat, Aug 23, 2003 at 09:03:17AM -0400, Zwane Mwaikambo wrote:
> > > That's interesting, intel compiler recommends P4 type optimisations,
> > > also worth noting that the P-M has hardware prefetch.
> >
> > I'm pretty sure the "Tualatin" Pentium III's also have hardware prefetch.
> > So it's not something specific to the P4 or P-M.

yeah tualatin has hw prefetch. p-m can handle more streams than tualatin.


> Someone else (in concordance with Mikael) also pointed out that the
> cacheline size is also the same as the PIII and not P4. So it's best
> going for PIII optimisations. It's best ignoring my previous comment then.

P-M has a 64-byte L1 dcacheline size same as P4 -- p3 has only 32 bytes.
see <http://sandpile.org/impl/pm.htm> for example. (it's possible to
prove it experimentally if you want :) but i thought kernel cacheline
size stuff was only important for SMP locking alignments?
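
for the SMP alignment case, a minimal userspace sketch of the idea (not the kernel's own macros): pad a per-cpu item out to a full line so two cpus never share one. the 64 here assumes the P-M/P4 line size mentioned above, and every name in it is made up for illustration.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 64-byte lines as on P-M/P4; a P3 build would use 32. */
#define EXAMPLE_CACHELINE_BYTES 64

/* Hypothetical per-CPU counter, aligned (and therefore padded) to a full
 * cache line so two CPUs updating neighbouring entries do not keep
 * bouncing the same line between them. */
struct example_counter {
    alignas(EXAMPLE_CACHELINE_BYTES) unsigned long value;
};

static struct example_counter example_counters[2];

/* Byte distance between neighbouring array entries. */
static size_t example_counter_stride(void)
{
    return (size_t)((char *)&example_counters[1] -
                    (char *)&example_counters[0]);
}
```

pick the constant too small and neighbouring entries can still land in one line; pick it too big and you only waste some memory, which is why the choice matters less on a UP build.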

there are details regarding the differences between P-M and P4 in the latest
"P4 optimisation guide", which was document id 24896609 at
developer.intel.com last time i fetched it; it might have been rev'd since
then.

basically the main disadvantage to selecting P4 for kernel compiling on a
centrino is that P-M has the same complex-simple-simple (4-1-1) uop
decoding machinery as the rest of the P6 family... whereas the P4's trace
cache somewhat offsets its decoder's quirks. so stuff scheduled for
p4 may not produce the best complex-simple-simple sequence that a p6
family processor wants to see. i'm not really sure how well gcc does in
either case though... (whereas icc does a stellar job.)
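
to make the 4-1-1 point concrete, here is a toy model of the decoders, nothing more: decoder 0 takes any instruction of 1 to 4 uops, decoders 1 and 2 take only single-uop instructions, and everything decodes in order. it ignores fetch alignment, microcoded instructions and everything else a real p6 core does, and the function name and the uop counts in the usage note are invented for the example.

```c
#include <stddef.h>

/* Toy model: count decode cycles for an in-order instruction stream on a
 * 4-1-1 decoder. uops[i] is the uop count of instruction i and is assumed
 * to be in the range 1..4. */
static unsigned decode_cycles_411(const unsigned *uops, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;

    while (i < n) {
        /* Decoder 0 accepts the next instruction whatever its size. */
        i++;
        /* Decoders 1 and 2 accept only single-uop instructions; the first
         * multi-uop instruction has to wait for decoder 0 next cycle. */
        for (int slot = 0; slot < 2 && i < n && uops[i] == 1; slot++)
            i++;
        cycles++;
    }
    return cycles;
}
```

under this model the stream {2,1,1,2,1,1} decodes in 2 cycles while the same instructions ordered {1,1,2,2,1,1} take 3, which is the kind of loss a p4-oriented schedule could cause on a p6 core.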

i bet intel is saying "optimise for p4" for two reasons:

(a) the reality is way too complex to describe
(b) p4 is their long-term bet and they'd rather code be targeted to it
specifically

however, if/when gcc picks up some of the more wacked p4 optimisations,
such as turning multiplication by 32 into:

add eax,eax
add eax,eax
add eax,eax
add eax,eax
add eax,eax

then you'll start to see penalties on p6 cores ... the p4 does not like
shifts or lea. the above runs in 2.5 cycles on a p4 (double-pumped ALUs)
whereas a shift or lea would be 4 clocks.
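
a quick sketch in C of why the two forms are interchangeable: five doublings is 2^5 = 32, and five dependent adds at half a clock each is presumably where the 2.5 comes from. the function names are made up, and a compiler is free to turn either one into whatever it thinks is fastest for the target.

```c
#include <stdint.h>

/* Multiply by 32 the way the five-add sequence does: double five times. */
static uint32_t mul32_by_adds(uint32_t x)
{
    for (int i = 0; i < 5; i++)
        x += x;              /* corresponds to: add eax,eax */
    return x;
}

/* The same result with a single shift by 5. */
static uint32_t mul32_by_shift(uint32_t x)
{
    return x << 5;
}
```

both wrap the same way on 32-bit overflow, so the choice between them is purely a scheduling question.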

-dean