If only I'd read the Wise One's words more carefully:
a load which misses the Dcache and hits the write buffer causes the
buffer to flush, and the load gets the data from off-chip.
- in the case of this loop, the load *hits* in the Dcache. That's why:
-- you see a speedup
-- you get the same performance on the 21066 (noname) and the 21064 (Cabriolet)
So, the process is:
-- the store hits in the Dcache, and updates the Dcache value. Since the
Dcache can't be dirty, the store data posts an entry into the write buffer.
-- the load hits in the Dcache, so it doesn't force a write buffer flush.
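
For anyone who wants to poke at the two cases, here is a toy model in Python.
It is only a sketch of the behaviour described above, not a description of the
real chip: the names (ToyCore, store_then_load) are made up for illustration,
addresses stand in for whole cache lines, and the "no allocate on a store miss,
fill on a load miss" policy is an assumption of the model. Timing, buffer depth
and line size are all ignored; it only counts forced write buffer flushes.

```python
# Toy model only: a never-dirty (write-through) Dcache plus a write buffer.
# All names are hypothetical; nothing here comes from a real Alpha tool.

class ToyCore:
    def __init__(self):
        self.memory = {}        # off-chip memory: addr -> value
        self.dcache = {}        # on-chip Dcache, never holds dirty data
        self.write_buffer = []  # pending (addr, value) stores, oldest first
        self.flushes = 0        # how many times a load forced a flush

    def _flush(self):
        # Drain every pending store to off-chip memory, oldest first.
        for addr, value in self.write_buffer:
            self.memory[addr] = value
        self.write_buffer.clear()
        self.flushes += 1

    def store(self, addr, value):
        # A store that hits updates the Dcache copy; either way the data
        # is posted to the write buffer, since the Dcache can't be dirty.
        # Assumption: a store that misses does not allocate a Dcache entry.
        if addr in self.dcache:
            self.dcache[addr] = value
        self.write_buffer.append((addr, value))

    def load(self, addr):
        # Dcache hit: served on-chip, the write buffer is left alone.
        if addr in self.dcache:
            return self.dcache[addr]
        # Dcache miss that hits the write buffer: flush first, then
        # fetch the data from off-chip.
        if any(a == addr for a, _ in self.write_buffer):
            self._flush()
        value = self.memory.get(addr, 0)
        self.dcache[addr] = value  # assumption: fill on a load miss
        return value


def store_then_load(warm):
    """Run a store followed by a load of the same address.

    If warm is True the address is loaded once beforehand, so it is
    already in the Dcache. Returns (loaded value, forced flushes).
    """
    core = ToyCore()
    core.memory[0x100] = 1
    if warm:
        core.load(0x100)
    core.store(0x100, 2)
    value = core.load(0x100)
    return value, core.flushes
```

Calling store_then_load(True) models the loop discussed here (the data is
already Dcached, so the load never touches the write buffer), while
store_then_load(False) models the case the Wise One was warning about.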
Phew! Thanks, Anthony, for setting me straight on that one. Saved me having
to pore over DavidMT's code at the weekend.
So, it seems like the problem with the store/load 'optimisation' is to decide
whether the data is likely to be Dcached.
Neal.