Neal> Tsk. If only I'd read the Wise One's words more carefully:
Neal> a load which misses the dcache and hits the write buffer
Neal> causes the buffer to flush, and the load gets the data from
Neal> off-chip.
Neal> - in the case of this loop, the load *hits* in the
Neal> Dcache. That's why:
Neal> -- you see a speedup
Neal> -- you get the same performance between 21066 (noname)
Neal> and 21064 (Cabriolet)
Neal> So, the process is:
Neal> -- the store hits in the Dcache, and updates the Dcache
Neal> value. Since the Dcache can't be dirty, the store data posts
Neal> an entry into the write buffer.
Neal> -- the load hits in the Dcache, so it doesn't force a write
Neal> buffer flush.
Neal> Phew! Thanks, Anthony, for setting me straight on that
Neal> one. Saved me having to pore over DavidMT's code at the
Neal> weekend.
Neal> So, it seems like the problem with the store/load
Neal> 'optimisation' is to decide whether the data is likely to be
Neal> Dcached.
Oh, I see! I didn't know that stores *do* update the d-cache
(provided the line is in the cache already). That's cool, because it
means that fp<->integer conversions will usually run at CPU speeds
(because the top of the stack is normally in the d-cache already).
Interesting!
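
For what it's worth, here is a toy model of the store/load behaviour as I understand it from the description above. It is only a sketch under some loud assumptions: the class name ToyDcache, the word-granular "cache" (real hardware works on whole lines), and the addresses are all made up for illustration, and timing is not modelled at all. It only shows which loads would force a write buffer flush.

```python
from collections import deque


class ToyDcache:
    """Hypothetical model: write-through D-cache (never dirty) plus a write buffer."""

    def __init__(self):
        self.memory = {}     # off-chip memory: address -> value
        self.dcache = {}     # cached words: address -> value
        self.wbuf = deque()  # stores waiting to go off-chip: (address, value)
        self.flushes = 0     # how many times a load forced a flush

    def store(self, addr, value):
        # A store that hits updates the cached copy in place...
        if addr in self.dcache:
            self.dcache[addr] = value
        # ...and, since the cache cannot be dirty, every store is also
        # posted to the write buffer. A store miss does not allocate.
        self.wbuf.append((addr, value))

    def flush(self):
        # Drain the whole write buffer to off-chip memory.
        while self.wbuf:
            addr, value = self.wbuf.popleft()
            self.memory[addr] = value
        self.flushes += 1

    def load(self, addr):
        # A load that hits the D-cache never looks at the write buffer.
        if addr in self.dcache:
            return self.dcache[addr], "dcache hit"
        # A load that misses the D-cache but matches a pending store
        # forces the buffer to flush, then reads from off-chip.
        if any(a == addr for a, _ in self.wbuf):
            self.flush()
            outcome = "miss, write buffer flushed"
        else:
            outcome = "miss"
        value = self.memory.get(addr, 0)
        self.dcache[addr] = value  # fill the cache on a load miss
        return value, outcome


cache = ToyDcache()
cache.memory[0x100] = 1
cache.load(0x100)          # warm the cache, like a hot stack slot
cache.store(0x100, 42)
print(cache.load(0x100))   # (42, 'dcache hit')
cache.store(0x200, 7)      # this address is not cached
print(cache.load(0x200))   # (7, 'miss, write buffer flushed')
```

The first store/load pair is the fast case (the slot is already cached, as the top of the stack normally would be); the second pair is the slow case that pays for the flush and the off-chip read.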
Thanks much for digging into this!
Have a great weekend (no more mails---promised! :)
--david