Maybe CPUs these days have so much store bandwidth that doing
things like the above is OK, but I doubt it :-)

On modern x86 CPUs the memset may even be faster if the memory isn't in cache;
the "explicit" method ends up doing Write Allocate on the cache lines
(so it reads them from memory) even though they then end up being written entirely.
With memset the CPU is told that the entire range is set to a new value, and
the WA can be avoided for the whole-cachelines in the range.
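
To make the comparison concrete, here is a rough sketch of the two variants being discussed. The struct and function names are made up for illustration, and the 64-byte size only assumes a typical x86 cache line. Whether the write allocate is actually skipped depends on the microarchitecture and on how memset is implemented, and a compiler is free to turn either form into the other.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical object, sized to fill one assumed 64-byte cache line. */
struct item {
    uint64_t w[8];
};

/* "Explicit" method: one store per field. The CPU sees eight
 * independent stores and may fetch the line before the first one. */
static void clear_explicit(struct item *p)
{
    p->w[0] = 0;
    p->w[1] = 0;
    p->w[2] = 0;
    p->w[3] = 0;
    p->w[4] = 0;
    p->w[5] = 0;
    p->w[6] = 0;
    p->w[7] = 0;
}

/* memset: the whole range is handed over at once, so the
 * implementation can know that full cache lines get overwritten. */
static void clear_memset(struct item *p)
{
    memset(p, 0, sizeof(*p));
}
```

Both leave the object in the same state; the only difference under discussion is the memory traffic when the line is not already in cache.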

Don't you have write-combining store buffers? Or is it still speculatively
issuing the reads even before the whole cacheline is combined?