bypass theory: nice try, but no cigar.

Neal. (crook@rdgeng.enet.dec.com)
Fri, 1 Sep 95 17:46:42 MET DST


David Mosberger-Tang writes:

>>How could it be that a store directly followed by a load is so much
>>faster than a "properly scheduled" instruction sequence? Well, the
>>EV4 CPU has a bypass path that allows to read data directly from the
>>write buffer (which is 4 cache blocks deep). My guess is that in the
>>former sequence, the load is able to read the data from the write
>>buffer, while in the latter sequence the load misses the write buffer
>>and has to go to the bcache instead (this is just a guess
>>though---awfully hard to say exactly what's going on in the CPU).
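For concreteness, the two sequences under discussion might look something like this (Alpha-flavoured pseudocode; the registers, offsets, and filler instruction are made up for illustration):

```asm
; store immediately followed by a load of the same location:
stq  r1, 0(r16)      ; store goes into the write buffer
ldq  r2, 0(r16)      ; load of the same address -- the bypass theory
                     ; says this can be satisfied from the write
                     ; buffer itself

; "properly scheduled" version -- independent work separates the pair:
stq  r1, 0(r16)
addq r3, r4, r5      ; unrelated filler work
ldq  r2, 0(r16)      ; by now the store may have drained, so the load
                     ; misses the write buffer and goes to the Bcache
```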

Well, I thought this sounded fishy. My recollection was that such a miss
would cause the write buffer to *flush* (empty completely), with the load
then honoured from the Bcache/external memory. I checked the hardware
reference manual, and it says:

(that a write buffer tries to empty itself off-chip when) "A
load miss is pending to an address currently valid in the write buffer that
requires the write buffer to be flushed. The write buffer is completely flushed
regardless of which entry matches the address"

Now, that *seems* to say that the load miss *can only* be satisfied by data
from external Bcache/memory, but that word //requires// adds some
ambiguity.

I checked with the designer, and that exalted being said (and these
are his exact ASCII characters):

"a load which misses the dcache and hits the write buffer causes the
buffer to flush, and the load get the data from off-chip."

So, the question remains: just exactly *what is* going on?

Have a good weekend, y'all.

Neal.

---------
APPENDIX

Just for the record, the usual use of the word 'bypass' is for a case like:

add a <- b + c
add d <- a + e

- the second instruction has a data dependency upon the first. The simple-minded
way of handling this is to wait until the value of b+c has been written to a,
then use that value in the calculation of d.

'Bypassing' is the technique of grabbing the value of b+c for the d calculation
*at the same time* as it is being written to a. The benefit of this is that
it reduces the 'producer-consumer' latency (typically by 1 clock cycle).
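As a toy illustration (this is a generic pipeline model, not the EV4 specifically), here is a minimal cycle-count sketch of that producer-consumer latency. The one-cycle writeback delay is an assumed figure chosen to match the "typically 1 clock cycle" saving above:

```python
# Toy model: earliest cycle at which a dependent instruction can
# execute, given the cycle its producer computed the result.
# WRITEBACK_DELAY is an assumed figure, not an EV4 number.

WRITEBACK_DELAY = 1  # cycles from compute to register-file write

def consumer_execute_cycle(producer_cycle, bypass):
    """Earliest cycle the dependent instruction can execute."""
    if bypass:
        # grab the value off the bypass path as it is being written
        return producer_cycle + 1
    # simple-minded: wait for the register file to be updated first
    return producer_cycle + WRITEBACK_DELAY + 1

# add a <- b + c computes at cycle 0; add d <- a + e can then go at:
print(consumer_execute_cycle(0, bypass=True))   # 1
print(consumer_execute_cycle(0, bypass=False))  # 2
```

So in this model bypassing buys exactly one cycle per dependent pair, which is the saving the paragraph above describes.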