/* Majority: (x^y)|(y&z)|(z&x) = (x & z) + ((x ^ z) & y)
#define F3(x,y,z,dest) \
movl z, TMP; \
andl x, TMP; \
addl TMP, dest; \
movl z, TMP; \
xorl x, TMP; \
andl y, TMP; \
addl TMP, dest
Since y is the most recently computed result (it's rotated in the
previous round), I arranged the code to delay its use as late as
possible.
Now you have one more register to play with.
A faster way is to unroll 5 iterations and do:
e += F(b, c, d) + K + rol32(a, 5) + W[i ]; b = rol32(b, 30);
d += F(a, b, c) + K + rol32(e, 5) + W[i+1]; a = rol32(a, 30);
c += F(e, a, b) + K + rol32(d, 5) + W[i+2]; e = rol32(e, 30);
b += F(d, e, a) + K + rol32(c, 5) + W[i+3]; d = rol32(d, 30);
a += F(c, d, e) + K + rol32(b, 5) + W[i+4]; c = rol32(c, 30);
then loop over that 4 times each. This is somewhat larger, but
still reasonably compact; only 20 of the 80 rounds are written out
long-hand.