Using inline asm on the i386

Colin Plumb (colin@nyx.net)
Sun, 19 Apr 1998 00:15:15 -0600 (MDT)


I notice that a lot of inline asm has code like the following:

#define _set_base(addr,base) \
__asm__("movw %%dx,%0\n\t" \
"rorl $16,%%edx\n\t" \
"movb %%dl,%1\n\t" \
"movb %%dh,%2" \
: /* no output */ \
:"m" (*((addr)+2)), \
"m" (*((addr)+4)), \
"m" (*((addr)+7)), \
"d" (base) \
:"dx")

or the now famous:

#define _set_tssldt_desc(n,addr,limit,type) \
__asm__ __volatile__ ("movw %3,0(%2)\n\t" \
"movw %%ax,2(%2)\n\t" \
"rorl $16,%%eax\n\t" \
"movb %%al,4(%2)\n\t" \
"movb %4,5(%2)\n\t" \
"movb $0,6(%2)\n\t" \
"movb %%ah,7(%2)\n\t" \
"rorl $16,%%eax" \
: "=m"(*(n)) : "a" (addr), "r"(n), "ir"(limit), "i"(type))

Note that specific registers are used, mostly because of the need to
access high and low bytes and so on.

It is quite possible to do this with generic registers. The only trick is
that the details are processor-specific (there is no equivalent to the
%ah/%al business in other processors, so no wonder) and some of it is only
really documented in the i386.h header file in the GCC distribution.

There are two aspects to this, namely using constraints and using
special output substitutions. I'll review constraints for those
who don't remember all the details, and then go on to the output
substitutions where you can get clever.

The "=m" or "r" string is a constraint, specifying what operands
are legal in this position. Each letter names a class of operand
(a register class, memory, an immediate) that is allowed. You can
have several letters, indicating that any of them is legal, such as
"ir", meaning either an immediate value or a register.

The i386 has *lots* of register classes, designed for anything remotely
useful. Common ones are defined in the "constraints" section of the
GCC manual. Here are the most useful:

g - general effective address
m - memory effective address
r - register
i - immediate value, 0..0xffffffff
n - immediate value known at compile time. (i would allow an address
known only at link time)

But there are some i386-specific ones described in the processor-specific
part of the manual and in more detail in GCC's i386.h:

q - byte-addressable register (eax, ebx, ecx, edx)
A - eax or edx
a, b, c, d, S, D - eax, ebx, ..., edi only

I - immediate 0..31
J - immediate 0..63
K - immediate 255
L - immediate 65535
M - immediate 0..3 (shifts that can be done with lea)
N - immediate 0..255 (one-byte immediate value)
O - immediate 0..32

There are some more for floating-point registers, but I won't go into those.
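
To make the immediate classes concrete, here is a minimal sketch (mine, not
from any kernel header) of a rotate whose count uses the "I" constraint. The
function name rotl8 is made up for illustration, and the plain C branch is
only there so the file still builds on a non-x86 compiler.

```c
#include <stdint.h>

/* Rotate x left by 8 bits.  rotl8 is a hypothetical name.
 * "I" only accepts a compile-time constant in 0..31, which is
 * exactly what a 32-bit rotate count can encode. */
static uint32_t rotl8(uint32_t x)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__("roll %1,%0"
            : "=r" (x)      /* %0: result, any general register     */
            : "I" (8),      /* %1: immediate in the range 0..31     */
              "0" (x));     /* input lives in operand 0's register  */
    return x;
#else
    /* Portable fallback with the same result. */
    return (x << 8) | (x >> 24);
#endif
}
```

Changing the 8 to something like 40 should make GCC reject the asm at compile
time, since 40 is outside the "I" range.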

It's also possible to make each constraint a comma-separated list
(of the same length). Then GCC may choose either all of the first
combinations, or all of the second. For example, on the i386, a
legal "add" instruction is

asm("addl %1,%0" : "=g,r" (dest) : "ir,g" (src), "0" (dest));

You may add a register or immediate to a general operand, or you
may add a general operand to a register. Memory-to-memory operations
are not allowed.

This also illustrates one last possibility for register classes:
a digit 0..9, saying that *this* operand must go into the same location
as *that* operand. This is useful in situations like the above,
where an address is both source and destination.
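
As a self-contained illustration of the digit constraint, here is a small
sketch (a simplified cousin of the add above, with register-only output).
The name add_asm is invented for the example. Note that the matched input
does not have to be the same C variable as the output; it only has to end up
in the same place.

```c
/* Return a + b using a two-operand addl.  add_asm is a hypothetical
 * name; the C fallback is only for non-x86 compilers. */
static int add_asm(int a, int b)
{
    int sum;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("addl %2,%0"
            : "=r" (sum)    /* %0: destination register              */
            : "0" (a),      /* %1: a, forced into the same register  */
              "ir" (b));    /* %2: b, immediate or register          */
#else
    sum = a + b;
#endif
    return sum;
}
```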

The GCC docs for the details of constraints are in the section
on writing back ends for GCC, since (not surprisingly) GCC uses the
same basic system as extended asm to encode what it knows about
a machine's instructions.

There are a few other constraint modifier flags that may be included
in an inline asm.

=, as you probably know, is required to flag operands that are written to.
If there's a really good reason this is needed, it's unclear to me, since
the division into input operands and output operands already makes that
clear, but it's not worth making a big fuss over.

+, described in the manual, looks like a nice way to avoid the need to
specify "0" (dest) above, but GCC, at least as of this writing, does not
accept it in extended asm.

% says that the instruction is commutative in two operands; this
operand and the next may be exchanged at the compiler's convenience.

& says that an output operand is written to before the inputs are
read, so this output must not be the same register as any input.
Without this, gcc may place an output and an input in the same register
even if not required by a "0" constraint.
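
Here is a minimal sketch of a case where & matters. The function name
twice_plus is made up. The output is written by the first instruction, but
the second input is not read until the last one, so without the & GCC would
be free to put out and b in the same register and the result would be wrong.

```c
/* Compute 2*a + b.  twice_plus is a hypothetical name. */
static int twice_plus(int a, int b)
{
    int out;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("movl %1,%0\n\t"    /* out is written here ...           */
            "addl %1,%0\n\t"
            "addl %2,%0"        /* ... but b is not read until here  */
            : "=&r" (out)       /* early clobber: shares no input reg */
            : "r" (a), "r" (b));
#else
    out = 2 * a + b;            /* fallback for non-x86 compilers    */
#endif
    return out;
}
```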

Now, having gone through all of this, the real fun comes when trying to
use registers in the "q" class effectively. Sometimes you want %eax,
sometimes %ax, sometimes %al, and sometimes even %ah.

These are encoded in the substitutions in the string part of the asm.
You already know that "%0" is output to the assembler as the addressing
mode that selects the first operand, and so on, but there are modifier
characters that can go between the % and the 0 that can change the way
it's printed. Although these are described in the GCC manual (the
"Output Template" section), the machine-specific details are in the GCC
source code; in particular the i386.h file.

%z0 - Print the opcode suffix for the size of the operand (b, w or l)
%k0 - Print the 32-bit form of the register, regardless of the size of
the operand (%eax). This is usually not necessary.
%w0 - Print the 16-bit form of the operand (%ax)
%b0 - Print the 8-bit form of the operand (%al)
%h0 - Print the high-byte form of the register (%ah)

Obviously, most of those require a "r" or even a "q" constraint.
There are some others in the i386.h file, but the above are the
ones important to most uses. (Actually, it is legal to include
such modifiers with an immediate or memory reference; it just
won't have any effect. So "andb %b1,%b0" : "g,r" : "ri,g"
is legal.)
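
A small sketch of the %b and %h forms working on one generic operand: swap
the two low bytes of a word without naming a register. The name
swap_low_bytes is made up. One assumption beyond the text above: I use the
"Q" constraint (a, b, c or d only) instead of "q", because when building for
x86-64 "q" admits registers that have no high-byte form. On the i386 "q"
would do.

```c
/* Exchange bits 0..7 with bits 8..15 of x; the upper 16 bits are
 * untouched.  swap_low_bytes is a hypothetical name. */
static unsigned int swap_low_bytes(unsigned int x)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__("xchgb %b0,%h0"     /* e.g. xchgb %al,%ah              */
            : "=Q" (x)          /* any of eax, ebx, ecx, edx       */
            : "0" (x));
    return x;
#else
    /* Portable fallback with the same result. */
    return (x & 0xffff0000u) | ((x & 0xffu) << 8) | ((x >> 8) & 0xffu);
#endif
}
```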

%z is rather cool. For example, consider the following code:
#define xchg(m, in, out) \
asm("xchg%z0 %2,%0" : "=g" (*(m)), "=r" (out) : "1" (in))

int
bar(void *m, int x)
{
xchg((char *)m, (char)x, x);
xchg((short *)m, (short)x, x);
xchg((int *)m, (int)x, x);
return x;
}

This produces, as assembly output,
.globl bar
.type bar,@function
bar:
movl 4(%esp),%eax
movb 8(%esp),%dl
#APP
xchgb %dl,(%eax)
xchgw %dx,(%eax)
xchgl %edx,(%eax)
#NO_APP
movl %edx,%eax
ret

As you can see, the sizes of everything match up. Note that if
I'd used "%1" instead of "%2" in the asm() expansion, all
of the registers would have been %edx, because I'm letting gcc believe
that the output of this asm is 32 bits wide.

For an example that already exists in include/asm/system.h:
/*
* Note: no "lock" prefix even on SMP: xchg always implies lock anyway
*/
static inline unsigned long __xchg(unsigned long x, void * ptr, int size)
{
switch (size) {
case 1:
__asm__("xchgb %b0,%1"
:"=q" (x)
:"m" (*__xg(ptr)), "0" (x)
:"memory");
break;
case 2:
__asm__("xchgw %w0,%1"
:"=r" (x)
:"m" (*__xg(ptr)), "0" (x)
:"memory");
break;
case 4:
__asm__("xchgl %0,%1"
:"=r" (x)
:"m" (*__xg(ptr)), "0" (x)
:"memory");
break;
}
return x;
}

Notice how, in the byte case, the constraint is "q" and the output
template says to print the register in "%b0" form. In the 16-bit case,
all registers are allowed, and it's printed in word form.
In the 32-bit case, there's no special format modifier, since x is
already a 32-bit value.

But looking at something like:

#define _set_tssldt_desc(n,addr,limit,type) \
__asm__ __volatile__ ("movw %3,0(%2)\n\t" \
"movw %%ax,2(%2)\n\t" \
"rorl $16,%%eax\n\t" \
"movb %%al,4(%2)\n\t" \
"movb %4,5(%2)\n\t" \
"movb $0,6(%2)\n\t" \
"movb %%ah,7(%2)\n\t" \
"rorl $16,%%eax" \
: "=m"(*(n)) : "a" (addr), "r"(n), "ri"(limit), "i"(type))

It's obvious that the writer didn't know how to take optimal advantage of
this (admittedly complex, but x86 addressing *is* complex) facility.
This could be rewritten as:

#define _set_tssldt_desc(n,addr,limit,type) \
__asm__ __volatile__ ("movw %w3,0(%2)\n\t" \
"movw %w1,2(%2)\n\t" \
"rorl $16,%1\n\t" \
"movb %b1,4(%2)\n\t" \
"movb %4,5(%2)\n\t" \
"movb $0,6(%2)\n\t" \
"movb %h1,7(%2)\n\t" \
"rorl $16,%1" \
: "=m"(*(n)) : "q" (addr), "r"(n), "ri"(limit), "i"(type))

which would give GCC more flexibility in choosing registers, which leads
to smaller and faster code. (And we all live happily ever after.)

Note also how this gets around the difficulty of adding an offset to
an existing effective address by specifying *n as an output
(even though it's not used, this tells GCC that it's modified)
and n as an input which is actually used.
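
For anyone who wants to try the rewritten macro outside the kernel, here is
a sketch of it as a plain function. Everything named here (struct desc,
DESC_TYPE, set_desc) is invented for the example, and 0x89 is just a sample
type byte. Two assumptions beyond the macro above: I use "Q" rather than "q"
for addr and "R" (legacy registers only) rather than "r" for the pointer,
so that the %h1 store is still encodable when building for x86-64. On the
i386 the original "q" and "r" are enough.

```c
#include <stdint.h>

struct desc { uint8_t b[8]; };      /* hypothetical 8-byte descriptor */
enum { DESC_TYPE = 0x89 };          /* sample type byte, an assumption */

/* Fill in *n from addr and limit.  *n is listed as an output only to
 * tell GCC it is modified; the stores go through the pointer in %2. */
static void set_desc(struct desc *n, uint32_t addr, uint16_t limit)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__(
        "movw %w3,0(%2)\n\t"        /* limit                          */
        "movw %w1,2(%2)\n\t"        /* addr bits 0..15                */
        "rorl $16,%1\n\t"
        "movb %b1,4(%2)\n\t"        /* addr bits 16..23               */
        "movb %4,5(%2)\n\t"         /* type                           */
        "movb $0,6(%2)\n\t"
        "movb %h1,7(%2)\n\t"        /* addr bits 24..31               */
        "rorl $16,%1"               /* put addr back as it was        */
        : "=m" (*n)
        : "Q" (addr), "R" (n), "ri" (limit), "i" (DESC_TYPE));
#else
    /* Portable fallback producing the same bytes. */
    n->b[0] = limit & 0xff;         n->b[1] = limit >> 8;
    n->b[2] = addr & 0xff;          n->b[3] = (addr >> 8) & 0xff;
    n->b[4] = (addr >> 16) & 0xff;  n->b[5] = DESC_TYPE;
    n->b[6] = 0;                    n->b[7] = addr >> 24;
#endif
}
```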

It is almost possible to actually tell GCC that it can use more general
addressing modes. In particular, "o" is a constraint for an offsettable
addressing mode, one where adding a small offset is also legal. There
are no non-offsettable addressing modes on the i386 (the 680x0's
postincrement addressing is an example of one), so on this processor
the distinction from "m" is perhaps rather fine.

Letting GCC use this for the operand would allow more than just
register-indirect, since you could do

#define _set_tssldt_desc(n,addr,limit,type) \
__asm__ __volatile__ ("movw %w2,%0\n\t" \
"movw %w1,2+%0\n\t" \
"rorl $16,%1\n\t" \
"movb %b1,4+%0\n\t" \
"movb %3,5+%0\n\t" \
"movb $0,6+%0\n\t" \
"movb %h1,7+%0\n\t" \
"rorl $16,%1" \
: "=o"(*(n)) : "q" (addr), "ri"(limit), "i"(type))

This works great except that the output assembler ends up looking
a bit weird if it turns out that there is no offset. You end up producing
code that looks like:
#APP
movw $235,(%eax)
movw %dx,2+(%eax)
rorl $16,%edx
movb %dl,4+(%eax)
movb $137,5+(%eax)
movb $0,6+(%eax)
movb %dh,7+(%eax)
rorl $16,%edx
#NO_APP

Which gas compiles with warnings:
/tmp/cca03076.s:24: Warning: missing operand; zero assumed
/tmp/cca03076.s:26: Warning: missing operand; zero assumed
/tmp/cca03076.s:27: Warning: missing operand; zero assumed
/tmp/cca03076.s:28: Warning: missing operand; zero assumed
/tmp/cca03076.s:29: Warning: missing operand; zero assumed

It gets it right, but it's not happy. I can't write "2%0" because then
if %0 were 10(%eax), that would come out as "210(%eax)", which is
definitely wrong. And "2+0%0" produces "2+010(%eax)", which is octal,
so you would get "2+8(%eax)" or "10(%eax)", which wasn't quite the
point, either.
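
The arithmetic of those textual pastes can be checked with a few lines of
ordinary C. This is only a stand-in: strtol() with base 0 plays the part of
the assembler's number parsing (leading zero means octal), and the function
name pasted_displacement is made up.

```c
#include <stdio.h>
#include <stdlib.h>

/* Paste prefix in front of operand, as asm template substitution
 * does, and return the displacement that "a" or "a+b" at the start
 * of the pasted text works out to.  A missing number counts as 0. */
static long pasted_displacement(const char *prefix, const char *operand)
{
    char text[64];
    char *end;
    long lhs, rhs = 0;

    snprintf(text, sizeof text, "%s%s", prefix, operand);
    lhs = strtol(text, &end, 0);        /* base 0: "010" is octal 8 */
    if (*end == '+')
        rhs = strtol(end + 1, &end, 0);
    return lhs + rhs;
}
```

With the operand "10(%eax)", the prefix "2" gives 210, "2+0" gives 10, and
only "2+" gives the intended 12. With the operand "(%eax)", "2+" gives 2,
which is the "zero assumed" case gas warns about.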

GCC uses "o" internally, but adds the offset internally instead of
being limited to textual substitutions.

You could always have lots of output substitutions, one for each
offset, and produce a mess of operands like:
: "=m"(*(n)), "=m" (*((char *)(n)+2)),
"=m" (*((char *)(n)+4)), "=m" (*((char *)(n)+5)),
"=m" (*((char *)(n)+6)), "=m" (*((char *)(n)+7))
: "q" (addr), "ri"(limit), "i"(type))

That works, but ugh.
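
For what it's worth, here is a cut-down sketch of the one-operand-per-offset
approach, using a struct so that GCC works out each address itself. The
struct layout, field names and function name are all mine, not the kernel's.
Only the first three fields go through the asm, to keep the operand list
short; the rest are filled in from C and would follow the same pattern.

```c
#include <stdint.h>

/* Hypothetical field-by-field view of the 8-byte descriptor. */
struct desc_fields {
    uint16_t limit;         /* offset 0 */
    uint16_t base_lo;       /* offset 2 */
    uint8_t  base_mid;      /* offset 4 */
    uint8_t  type;          /* offset 5 */
    uint8_t  zero;          /* offset 6 */
    uint8_t  base_hi;       /* offset 7 */
};

static void set_desc_fields(struct desc_fields *n, uint32_t addr,
                            uint16_t limit, uint8_t type)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__(
        "movw %w4,%0\n\t"       /* limit                         */
        "movw %w3,%1\n\t"       /* addr bits 0..15               */
        "rorl $16,%3\n\t"
        "movb %b3,%2\n\t"       /* addr bits 16..23              */
        "rorl $16,%3"           /* restore addr                  */
        : "=m" (n->limit), "=m" (n->base_lo), "=m" (n->base_mid)
        : "q" (addr), "ri" (limit));
#else
    n->limit    = limit;        /* fallback for non-x86 compilers */
    n->base_lo  = addr & 0xffff;
    n->base_mid = (addr >> 16) & 0xff;
#endif
    /* Remaining fields, written from C in this sketch. */
    n->type    = type;
    n->zero    = 0;
    n->base_hi = addr >> 24;
}
```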

Since the address is referred to so often, it's not actually bad
to force it to be a register rather than some complex indexed mode, but
it would be nice to have a solution for the problem.

Any ideas?

As for things like:

#define _set_gate(gate_addr,type,dpl,addr) \
__asm__ __volatile__ ("movw %%dx,%%ax\n\t" \
"movw %2,%%dx\n\t" \
"movl %%eax,%0\n\t" \
"movl %%edx,%1" \
:"=m" (*((long *) (gate_addr))), \
"=m" (*(1+(long *) (gate_addr))) \
:"i" ((short) (0x8000+(dpl<<13)+(type<<8))), \
"d" ((char *) (addr)),"a" (__KERNEL_CS << 16) \
:"ax","dx")

Obviously that could be turned into:

#define _set_gate(gate_addr,type,dpl,addr) \
do { int _1, _2; /* Dummy variables */ \
__asm__ __volatile__ ("movw %w2,%w3\n\t" \
"movw %4,%w2\n\t" \
"movl %3,%0\n\t" \
"movl %2,%1" \
:"=m" (*((long *) (gate_addr))), \
"=m" (*(1+(long *) (gate_addr))), \
"=r" (_1), \
"=r" (_2) \
:"i" ((short) (0x8000+(dpl<<13)+(type<<8))), \
"2" ((char *) (addr)), "3" (__KERNEL_CS << 16)); \
} while (0)

I am assuming that it's actually a good idea to do this because GCC
emits bad code on the obvious C version of all of this.

I kind of wonder, however, about the relative speed of:

#define _set_gate(gate_addr,type,dpl,addr) \
do { int _1, _2; /* Dummy variables */ \
__asm__ __volatile__ ("xchgw %w2,%w3\n\t" \
"movl %3,%0\n\t" \
"movl %2,%1" \
:"=m" (*((long *) (gate_addr))), \
"=m" (*(1+(long *) (gate_addr))), \
"=r" (_1), \
"=r" (_2) \
:"3" ((__KERNEL_CS<<16) + 0x8000+(dpl<<13)+(type<<8)), \
"2" ((char *) (addr))); \
} while (0)
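
For comparison, here is my guess at what the "obvious C version" looks like,
next to a standalone form of the xchgw trick, so the two can be checked
against each other. The names (struct gate, set_gate_c, set_gate_xchg) are
invented, and KERNEL_CS = 0x10 is an assumed value used only for
illustration.

```c
#include <stdint.h>

enum { KERNEL_CS = 0x10 };              /* assumed selector value */

struct gate { uint32_t lo, hi; };       /* hypothetical 8-byte gate */

/* Plain C: build the two words with shifts and masks. */
static void set_gate_c(struct gate *g, unsigned type, unsigned dpl,
                       uint32_t addr)
{
    uint32_t flags = 0x8000 + (dpl << 13) + (type << 8);

    g->lo = ((uint32_t)KERNEL_CS << 16) | (addr & 0xffff);
    g->hi = (addr & 0xffff0000u) | (flags & 0xffff);
}

/* Same result by swapping the low 16 bits of two registers. */
static void set_gate_xchg(struct gate *g, unsigned type, unsigned dpl,
                          uint32_t addr)
{
    uint32_t lo = ((uint32_t)KERNEL_CS << 16)
                  + 0x8000 + (dpl << 13) + (type << 8);
    uint32_t hi = addr;
#if defined(__i386__) || defined(__x86_64__)
    __asm__("xchgw %w0,%w1"
            : "=r" (hi), "=r" (lo)
            : "0" (hi), "1" (lo));
#else
    uint32_t t = (lo ^ hi) & 0xffff;    /* swap low halves in C */
    lo ^= t;
    hi ^= t;
#endif
    g->lo = lo;
    g->hi = hi;
}
```

Which of the two GCC turns into better code is exactly the open question
above; this only shows that they store the same two words.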

Oh, well, there's always more to hack.

I hope this has been of use to some folks. GCC's inline asm features
are really wonderful, since you can use them with full optimization
and you don't have to outsmart the compiler, but this is because the
compiler really understands what's going on.

This has the unfortunate side effect that you have to learn how to
explain to the compiler what's going on. But it's worth it, really!

-- 
	-Colin
