Generated code not matching expectations with Extended ASM - c++

I have a CpuFeatures class. The requirements for the class are simple: (1) preserve EBX or RBX, and (2) record the values returned from CPUID in EAX/EBX/ECX/EDX. I'm not sure the code being generated is the code I intended.
The CpuFeatures class code uses GCC Extended ASM. Here's the relevant code:
struct CPUIDinfo
{
word32 EAX;
word32 EBX;
word32 ECX;
word32 EDX;
};
bool CpuId(word32 func, word32 subfunc, CPUIDinfo& info)
{
uintptr_t scratch;
__asm__ __volatile__ (
".att_syntax \n"
#if defined(__x86_64__)
"\t xchgq %%rbx, %q1 \n"
#else
"\t xchgl %%ebx, %k1 \n"
#endif
"\t cpuid \n"
#if defined(__x86_64__)
"\t xchgq %%rbx, %q1 \n"
#else
"\t xchgl %%ebx, %k1 \n"
#endif
: "=a"(info.EAX), "=&r"(scratch), "=c"(info.ECX), "=d"(info.EDX)
: "a"(func), "c"(subfunc)
);
if(func == 0)
return !!info.EAX;
return true;
}
The code below was compiled with -g3 -Og on Cygwin i386. When I examine it under a debugger, I don't like what I am seeing.
Dump of assembler code for function CpuFeatures::DoDetectX86Features():
...
0x0048f355 <+1>: sub $0x48,%esp
=> 0x0048f358 <+4>: mov $0x0,%ecx
0x0048f35d <+9>: mov %ecx,%eax
0x0048f35f <+11>: xchg %ebx,%ebx
0x0048f361 <+13>: cpuid
0x0048f363 <+15>: xchg %ebx,%ebx
0x0048f365 <+17>: mov %eax,0x10(%esp)
0x0048f369 <+21>: mov %ecx,0x18(%esp)
0x0048f36d <+25>: mov %edx,0x1c(%esp)
0x0048f371 <+29>: mov %ebx,0x14(%esp)
0x0048f375 <+33>: test %eax,%eax
...
I don't like what I am seeing because it appears EBX/RBX is not being preserved (xchg %ebx,%ebx at +11). Additionally, it looks like the preserved EBX/RBX is being saved as the result of CPUID, and not the actual value of EBX returned by CPUID (xchg %ebx,%ebx at +15, before the mov %ebx,0x14(%esp) at +29).
If I change the operand to use a memory op with "=&m"(scratch), then the generated code is:
0x0048f35e <+10>: xchg %ebx,0x40(%esp)
0x0048f362 <+14>: cpuid
0x0048f364 <+16>: xchg %ebx,0x40(%esp)
A related question is What ensures reads/writes of operands occurs at desired times with extended ASM?
What am I doing wrong (besides wasting countless hours on something that should have taken 5 or 15 minutes)?

The code below is a complete example that I used to compile your example code above including the modification to exchange(swap) directly to the info.EBX variable.
#include <inttypes.h>
#define word32 uint32_t
struct CPUIDinfo
{
word32 EAX;
word32 EBX;
word32 ECX;
word32 EDX;
};
bool CpuId(word32 func, word32 subfunc, CPUIDinfo& info)
{
__asm__ __volatile__ (
".att_syntax \n"
#if defined(__x86_64__)
"\t xchgq %%rbx, %q1 \n"
#else
"\t xchgl %%ebx, %k1 \n"
#endif
"\t cpuid \n"
#if defined(__x86_64__)
"\t xchgq %%rbx, %q1 \n"
#else
"\t xchgl %%ebx, %k1 \n"
#endif
: "=a"(info.EAX), "=&m"(info.EBX), "=c"(info.ECX), "=d"(info.EDX)
: "a"(func), "c"(subfunc)
);
if(func == 0)
return !!info.EAX;
return true;
}
int main()
{
CPUIDinfo cpuInfo;
CpuId(1, 0, cpuInfo);
}
The first observation you should make is that I chose to use the info.EBX memory location as the target of the swap. This eliminates the need for another temporary variable or register.
I assembled as 32-bit code with -g3 -Og -S -m32 and got these instructions of interest:
xchgl %ebx, 4(%edi)
cpuid
xchgl %ebx, 4(%edi)
movl %eax, (%edi)
movl %ecx, 8(%edi)
movl %edx, 12(%edi)
%edi happens to contain the address of the info structure. 4(%edi) happens to be the address of info.EBX. We swap %ebx and 4(%edi) after cpuid. With that instruction ebx is restored to what it was before cpuid and 4(%edi) now has what ebx was right after cpuid was executed. The remaining movl lines place eax, ecx, edx registers into the rest of the info structure via the %edi register.
The generated code above is what I would expect it to be.
In your version with the scratch variable (using the constraint "=&m"(scratch)), scratch is never used after the assembler template, so 0x40(%esp) holds the value you want but it never gets moved anywhere useful. You'd have to copy the scratch variable into info.EBX (i.e. info.EBX = scratch;) and look at all of the resulting instructions that get generated. At some point among the generated assembly instructions the data would be copied from the scratch memory location to info.EBX.
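The scratch-variable approach described above can be sketched as follows. This is a minimal sketch rather than the asker's actual class: the names CpuIdInfo, cpu_id and kHasCpuid are made up for illustration, and the non-x86 fallback exists only so the snippet compiles everywhere. The one change relative to the question's code is the explicit copy of scratch into the EBX field after the asm statement.

```cpp
#include <cassert>
#include <cstdint>

struct CpuIdInfo {
    std::uint32_t eax, ebx, ecx, edx;
};

// Hypothetical flag: true when the target is x86 and CPUID exists.
#if defined(__x86_64__) || defined(__i386__)
constexpr bool kHasCpuid = true;
#else
constexpr bool kHasCpuid = false;
#endif

// Hypothetical helper: runs CPUID while preserving EBX/RBX by hand.
inline bool cpu_id(std::uint32_t func, std::uint32_t subfunc, CpuIdInfo& info)
{
#if defined(__x86_64__) || defined(__i386__)
    std::uintptr_t scratch;  // holds CPUID's EBX after the second xchg
    __asm__ __volatile__(
#if defined(__x86_64__)
        "xchgq %%rbx, %q1\n\t"
        "cpuid\n\t"
        "xchgq %%rbx, %q1"
#else
        "xchgl %%ebx, %k1\n\t"
        "cpuid\n\t"
        "xchgl %%ebx, %k1"
#endif
        : "=a"(info.eax), "=&r"(scratch), "=c"(info.ecx), "=d"(info.edx)
        : "a"(func), "c"(subfunc));
    // Without this copy the value swapped out of EBX is simply discarded.
    info.ebx = static_cast<std::uint32_t>(scratch);
    return true;
#else
    (void)func;
    (void)subfunc;
    info = CpuIdInfo{};  // assumption: report "not available" off x86
    return false;
#endif
}
```

If the compiler happens to allocate EBX itself for scratch, the two xchg instructions degenerate into no-ops and scratch still ends up holding the EBX value produced by CPUID, which would be consistent with the xchg %ebx,%ebx seen in the question's disassembly.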
Update - Cygwin and MinGW
I wasn't entirely satisfied that the Cygwin code output was correct. In the middle of the night I had an Aha! moment. Windows already does its own position independent code when the dynamic link loader loads an image (DLL etc) and modifies the image via re-basing. There is no need for additional PIC processing like it is done in Linux 32 bit shared libraries so there is no issue with ebx/rbx. This is why Cygwin and MinGW will present warnings like this when compiling with -fPIC
warning: -fPIC ignored for target (all code is position independent)
This is because under Windows all 32bit code can be re-based when it is loaded by the Windows dynamic loader. More about re-basing can be found in this Dr. Dobbs article. Information on the windows Portable Executable format (PE) can be found in this Wiki article. Cygwin and MinGW don't need to worry about preserving ebx/rbx when targeting 32bit code because on their platforms PIC is already handled by the OS, other re-basing tools, and the linker.

Related

Why does clang make the Quake fast inverse square root code 10x faster than with GCC? (with *(long*)float type punning)

I'm trying to benchmark the fast inverse square root. The full code is here:
#include <benchmark/benchmark.h>
#include <math.h>
float number = 30942;
static void BM_FastInverseSqrRoot(benchmark::State &state) {
for (auto _ : state) {
// from wikipedia:
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) );
// y = y * ( threehalfs - ( x2 * y * y ) );
float result = y;
benchmark::DoNotOptimize(result);
}
}
static void BM_InverseSqrRoot(benchmark::State &state) {
for (auto _ : state) {
float result = 1 / sqrt(number);
benchmark::DoNotOptimize(result);
}
}
BENCHMARK(BM_FastInverseSqrRoot);
BENCHMARK(BM_InverseSqrRoot);
and here is the code in quick-bench if you want to run it yourself.
Compiling with GCC 11.2 and -O3, the BM_FastInverseSqrRoot is around 31 times slower than Noop (around 10 ns when I ran it locally on my machine). Compiling with Clang 13.0 and -O3, it is around 3.6 times slower than Noop (around 1 ns when I ran it locally on my machine). This is a 10x speed difference.
Here is the relevant Assembly (taken from quick-bench).
With GCC:
push %rbp
mov %rdi,%rbp
push %rbx
sub $0x18,%rsp
cmpb $0x0,0x1a(%rdi)
je 408c98 <BM_FastInverseSqrRoot(benchmark::State&)+0x28>
callq 40a770 <benchmark::State::StartKeepRunning()>
408c84 add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
nopw 0x0(%rax,%rax,1)
408c98 mov 0x10(%rdi),%rbx
callq 40a770 <benchmark::State::StartKeepRunning()>
test %rbx,%rbx
je 408c84 <BM_FastInverseSqrRoot(benchmark::State&)+0x14>
movss 0x1b386(%rip),%xmm4 # 424034 <_IO_stdin_used+0x34>
movss 0x1b382(%rip),%xmm3 # 424038 <_IO_stdin_used+0x38>
mov $0x5f3759df,%edx
nopl 0x0(%rax,%rax,1)
408cc0 movss 0x237a8(%rip),%xmm0 # 42c470 <number>
mov %edx,%ecx
movaps %xmm3,%xmm1
2.91% movss %xmm0,0xc(%rsp)
mulss %xmm4,%xmm0
mov 0xc(%rsp),%rax
44.70% sar %rax
3.27% sub %eax,%ecx
3.24% movd %ecx,%xmm2
3.27% mulss %xmm2,%xmm0
9.58% mulss %xmm2,%xmm0
10.00% subss %xmm0,%xmm1
10.03% mulss %xmm2,%xmm1
9.64% movss %xmm1,0x8(%rsp)
3.33% sub $0x1,%rbx
jne 408cc0 <BM_FastInverseSqrRoot(benchmark::State&)+0x50>
add $0x18,%rsp
mov %rbp,%rdi
pop %rbx
pop %rbp
408d0a jmpq 40aa20 <benchmark::State::FinishKeepRunning()>
With Clang:
push %rbp
push %r14
push %rbx
sub $0x10,%rsp
mov %rdi,%r14
mov 0x1a(%rdi),%bpl
mov 0x10(%rdi),%rbx
call 213a80 <benchmark::State::StartKeepRunning()>
test %bpl,%bpl
jne 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
test %rbx,%rbx
je 212e69 <BM_FastInverseSqrRoot(benchmark::State&)+0x79>
movss -0xf12e(%rip),%xmm0 # 203cec <_IO_stdin_used+0x8>
movss -0xf13a(%rip),%xmm1 # 203ce8 <_IO_stdin_used+0x4>
cs nopw 0x0(%rax,%rax,1)
nopl 0x0(%rax)
212e30 2.46% movd 0x3c308(%rip),%xmm2 # 24f140 <number>
4.83% movd %xmm2,%eax
8.07% mulss %xmm0,%xmm2
12.35% shr %eax
2.60% mov $0x5f3759df,%ecx
5.15% sub %eax,%ecx
8.02% movd %ecx,%xmm3
11.53% mulss %xmm3,%xmm2
3.16% mulss %xmm3,%xmm2
5.71% addss %xmm1,%xmm2
8.19% mulss %xmm3,%xmm2
16.44% movss %xmm2,0xc(%rsp)
11.50% add $0xffffffffffffffff,%rbx
jne 212e30 <BM_FastInverseSqrRoot(benchmark::State&)+0x40>
212e69 mov %r14,%rdi
call 213af0 <benchmark::State::FinishKeepRunning()>
add $0x10,%rsp
pop %rbx
pop %r14
pop %rbp
212e79 ret
They look pretty similar to me. Both seem to be using SIMD registers/instructions like mulss. The GCC version has a sar that is supposedly taking 46%? (But I think it's just mislabelled and it's the mulss, mov, sar that together take 46%). Anyway, I'm not familiar enough with Assembly to really tell what is causing such a huge performance difference.
Anyone know?
Just FYI, Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64? - no, obsoleted by SSE1 rsqrtss which you can use with or without a Newton iteration.
As people pointed out in comments, you're using 64-bit long (since this is x86-64 on a non-Windows system), pointing it at a 32-bit float. So as well as a strict-aliasing violation (use memcpy or std::bit_cast<int32_t>(myfloat) for type punning), that's a showstopper for performance as well as correctness.
Your perf report output confirms it; GCC is doing a 32-bit movss %xmm0,0xc(%rsp) store to the stack, then a 64-bit reload mov 0xc(%rsp),%rax, which will cause a store forwarding stall costing much extra latency. And a throughput penalty, since actually you're testing throughput, not latency: the next computation of an inverse sqrt only has a constant input, not the result of the previous iteration. (benchmark::DoNotOptimize contains a "memory" clobber which stops GCC/clang from hoisting most of the computation out of the loop; they have to assume number may have changed since it's not const.)
The instruction waiting for the load result (the sar) is getting the blame for those cycles, as usual. (When an interrupt fires to collect a sample upon the cycles event counter wrapping around, the CPU has to figure out one instruction to blame for that event. Usually this ends up being the one waiting for an earlier slow instruction, or maybe just one after a slow instruction even without a data dependency, I forget.)
Clang chooses to assume that the upper 32 bits are zero, so it uses movd %xmm2, %eax to just copy the register with an ALU uop, and shr instead of sar because it knows it's shifting in a zero from the high half of the 64-bit long it's pretending to work with. (A function call still used %rdi so that isn't Windows clang.)
Bugfixed version: GCC and clang make similar asm
Fixing the code on the quick-bench link in the question to use int32_t and std::bit_cast, https://godbolt.org/z/qbxqsaW4e shows GCC and clang compile similarly with -Ofast, although not identical. e.g. GCC loads number twice, once into an integer register, once into XMM0. Clang loads once and uses movd eax, xmm2 to get it.
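For readers without C++20, the same fix can be sketched with memcpy, the other type-punning option mentioned earlier. This is an illustrative sketch, not the code behind the Godbolt link; the function name fast_inv_sqrt is hypothetical. The key points are the 32-bit integer type and the absence of pointer casts.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical name; Quake-style approximation of 1/sqrt(number).
inline float fast_inv_sqrt(float number)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "the bit trick assumes a 32-bit float");
    assert(number > 0.0f);

    std::uint32_t i;
    std::memcpy(&i, &number, sizeof i);   // well-defined type pun, same width
    i = 0x5f3759df - (i >> 1);            // magic-constant initial guess
    float y;
    std::memcpy(&y, &i, sizeof y);

    // One Newton-Raphson step, as in the question's benchmark.
    return y * (1.5f - 0.5f * number * y * y);
}
```

Compilers typically turn each memcpy into a single register move, so there is no store followed by a wider reload.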
On QB (https://quick-bench.com/q/jYLeX2krrTs0afjQKFp6Nm_G2v8), now GCC's BM_FastInverseSqrRoot is faster by a factor of 2 than the naive version, without -ffast-math
And yes, the naive benchmark compiles to sqrtss / divss without -ffast-math, thanks to C++ inferring sqrtf from sqrt(float). It does check for the number being >=0 every time, since quick-bench doesn't allow compiling with -fno-math-errno to omit that check to maybe call the libm function. But that branch predicts perfectly so the loop should still easily just bottleneck on port 0 throughput (div/sqrt unit).
Quick-bench does allow -Ofast, which is equivalent to -O3 -ffast-math, which uses rsqrtss and a Newton iteration. (Would be even faster with FMA available, but quick-bench doesn't allow -march=native or anything. I guess one could use __attribute__((target("avx,fma"))).)
Quick-bench is now giving Error or timeout whether I use that or not, with Permission error mapping pages. and suggesting a smaller -m/--mmap_pages so I can't test on that system.
rsqrt with a Newton iteration (like compilers use at -Ofast for this) is probably faster or similar to Quake's fast invsqrt, but with about 23 bits of precision.
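The rsqrtss-plus-Newton approach can also be written by hand with SSE intrinsics. The following is a rough sketch under the assumption that the target has SSE (true for any x86-64 build); the function name rsqrt_newton is made up, and the non-SSE branch is only a fallback so the snippet compiles elsewhere. It is not claimed to match what a compiler emits at -Ofast instruction for instruction.

```cpp
#include <cassert>
#include <cmath>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical name; hardware estimate refined by one Newton-Raphson step.
inline float rsqrt_newton(float x)
{
    assert(x > 0.0f);
#if defined(__SSE__)
    // rsqrtss gives roughly 12 bits of precision.
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    float y = 1.0f / std::sqrt(x);  // fallback when SSE is unavailable
#endif
    // One refinement step roughly doubles the number of correct bits.
    return y * (1.5f - 0.5f * x * y * y);
}
```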

Why does my program not check the value of a bitfield member even though there is an "if" statement?

I wrote this program as a test case for the behavior of bit field member comparisons in C++ (I suppose the same behavior would be exhibited in C as well):
#include <cstdint>
#include <cstdio>
union Foo
{
int8_t bar;
struct
{
#if __BYTE_ORDER == __LITTLE_ENDIAN
int8_t baz : 1;
int8_t quux : 7;
#elif __BYTE_ORDER == __BIG_ENDIAN
int8_t quux : 7;
int8_t baz : 1;
#endif
};
};
int main()
{
Foo foo;
scanf("%d", &foo.bar);
if (foo.baz == 1)
printf("foo.baz == 1\n");
else
printf("foo.baz != 1\n");
}
After I compile and run it with 1 as its input, I get the following output:
foo.baz != 1
*** stack smashing detected ***: terminated
fish: “./a.out” terminated by signal SIGABRT (Abort)
One would expect that the foo.baz == 1 check would be evaluated as true since baz is always the least significant bit in the anonymous bit field. However, the opposite seems to happen, as can be seen from the program output (which is, somewhat comfortingly, consistently the same across each program invocation).
Even more weird to me is the fact that the generated AMD64 assembly code for the program (using the GCC 10.2 compiler) does not contain even a single comparison or jump instruction!
.LC0:
.string "%d"
.LC1:
.string "foo.baz != 1"
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-1]
mov rsi, rax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call scanf
mov edi, OFFSET FLAT:.LC1
call puts
mov eax, 0
leave
ret
It seems that the C++ code for the if statement somehow gets optimized out (or something like that), even though I compiled the program with the default settings (i.e. I did not turn on any level of optimization or anything like that).
Interestingly enough, Clang 10.0.1 (when run without optimizations) seems to generate code with a cmp instruction (as well as a jne and a jmp one):
main: # #main
push rbp
mov rbp, rsp
sub rsp, 16
mov dword ptr [rbp - 4], 0
lea rax, [rbp - 8]
movabs rdi, offset .L.str
mov rsi, rax
mov al, 0
call scanf
mov cl, byte ptr [rbp - 8]
shl cl, 7
sar cl, 7
movsx edx, cl
cmp edx, 1
jne .LBB0_2
movabs rdi, offset .L.str.1
mov al, 0
call printf
jmp .LBB0_3
.LBB0_2:
movabs rdi, offset .L.str.2
mov al, 0
call printf
.LBB0_3:
mov eax, dword ptr [rbp - 4]
add rsp, 16
pop rbp
ret
.L.str:
.asciz "%d"
.L.str.1:
.asciz "foo.baz == 1\n"
.L.str.2:
.asciz "foo.baz != 1\n"
Both of the printf strings also seem to be present in the data segment (unlike in the GCC case when only the second one is present). I cannot tell for sure (because I'm not very proficient in assembly) but this seems to be properly generated code (unlike the one which GCC generates).
However, as soon as I try to compile with any kind of optimization (even -O1) using Clang, the comparisons/jumps are gone (as well as the foo.baz == 1 string), and the generated code seems to be very similar to what GCC generates:
(with -O1)
main: # #main
push rax
mov rsi, rsp
mov edi, offset .L.str
xor eax, eax
call scanf
mov edi, offset .Lstr
call puts
xor eax, eax
pop rcx
ret
.L.str:
.asciz "%d"
.Lstr:
.asciz "foo.baz != 1"
(You may want to check the generated assembly code by different compiler versions yourself using Compiler Explorer.)
I'm totally perplexed by this kind of unintuitive behavior. The only thing which comes to mind as an explanation is the interaction of some weird undefined behavior of bitfields containing signed integral types and unions. What makes me think so is that after I replace the signed integer types with their unsigned counterparts, the output of the program becomes exactly as one would expect (with 1 as input):
foo.baz == 1
*** stack smashing detected ***: terminated
fish: “./a.out” terminated by signal SIGABRT (Abort)
Naturally, the program crashing because of a stack smashing (just like before) is something which is not supposed to happen, which leads to my second question: why does this occur?
Here's the modified program:
#include <cstdint>
#include <cstdio>
union Foo
{
uint8_t bar;
struct
{
#if __BYTE_ORDER == __LITTLE_ENDIAN
uint8_t baz : 1;
uint8_t quux : 7;
#elif __BYTE_ORDER == __BIG_ENDIAN
uint8_t quux : 7;
uint8_t baz : 1;
#endif
};
};
int main()
{
Foo foo;
scanf("%d", &foo.bar);
if (foo.baz == 1)
printf("foo.baz == 1\n");
else
printf("foo.baz != 1\n");
}
... and the generated assembly code by GCC:
.LC0:
.string "%d"
.LC1:
.string "foo.baz == 1"
.LC2:
.string "foo.baz != 1"
main:
push rbp
mov rbp, rsp
sub rsp, 16
lea rax, [rbp-1]
mov rsi, rax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call scanf
movzx eax, BYTE PTR [rbp-1]
and eax, 1
test al, al
je .L2
mov edi, OFFSET FLAT:.LC1
call puts
jmp .L3
.L2:
mov edi, OFFSET FLAT:.LC2
call puts
.L3:
mov eax, 0
leave
ret
The stack smashing has nothing to do with member access.
scanf("%d", &foo.bar);
The %d format conversion specifier is for an int. Which is, typically, 4 bytes. But your bar is:
int8_t bar;
just one byte.
So, scanf ends up writing 4 bytes' worth of an int value into a one-byte bar, clobbering three additional bytes in the immediate vicinity.
There's your stack smash.
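One way to avoid the overflow is to use a conversion specifier that matches the one-byte type. The sketch below uses the SCNd8 macro from <cinttypes>; the helper name parse_int8 is hypothetical, and it reads from a string with sscanf instead of stdin only to keep the example easy to check.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: parses a decimal number into exactly one byte.
inline bool parse_int8(const char* text, std::int8_t& out)
{
    // SCNd8 expands to the length-modified specifier for int8_t,
    // so sscanf stores a single byte instead of a 4-byte int.
    return std::sscanf(text, "%" SCNd8, &out) == 1;
}
```

Reading into a plain int and assigning the result to the narrow member afterwards works just as well.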
The answer is trivial.
Your baz struct member is 1 bit long and it is signed, so it will never be 1. The only possible values are 0 and -1.
The compiler knows that, so the condition foo.baz == 1 can never be true and no conditional code has to be generated.
So I'm afraid it is not a compiler bug, only a programmer bug :)
So if we change the code to:
int main()
{
union Foo foo;
int x;
scanf("%d", &x);
foo.bar = x;
if (foo.baz == -1)
printf("foo.baz == -1\n");
else
printf("foo.baz != -1\n");
}
Compiler starts to generate the conditional instructions.
https://godbolt.org/z/fzKMo5
BTW your endianness check does not make any sense here, as endianness defines the byte order, not the bit order.
Not related to the code generation problem is use of the wrong scanf conversion specifier.
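The claim that a signed 1-bit field can only hold 0 and -1 is easy to check in isolation. Here is a small sketch with made-up names (OneBit, read_signed, read_unsigned). One assumption to flag: storing the out-of-range value 1 into the signed field is implementation-defined before C++20; GCC and Clang wrap it to -1, which C++20 guarantees.

```cpp
#include <cassert>
#include <cstdint>

struct OneBit {
    std::int8_t  s : 1;  // signed: the single bit is the sign bit
    std::uint8_t u : 1;  // unsigned: holds 0 or 1
};

// Store the low bit of `bit` in the signed field and read it back.
inline int read_signed(int bit)
{
    OneBit b{};
    b.s = static_cast<std::int8_t>(bit & 1);
    return b.s;
}

// Same round trip through the unsigned field.
inline int read_unsigned(int bit)
{
    OneBit b{};
    b.u = static_cast<std::uint8_t>(bit & 1);
    return b.u;
}
```

With these helpers, read_signed(1) comes back as -1 while read_unsigned(1) comes back as 1, which matches the difference the question observed between the int8_t and uint8_t versions.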

Trying to understand simple disassembled code from g++

I am still struggling with g++ inline assembler and trying to understand how to use it.
I've adapted a piece of code from here: http://asm.sourceforge.net/articles/linasm.html (Quoted from the "Assembler Instructions with C Expressions Operands" section in gcc info files)
static inline uint32_t sum0() {
uint32_t foo = 1, bar=2;
uint32_t ret;
__asm__ __volatile__ (
"add %%ebx,%%eax"
: "=eax"(ret) // ouput
: "eax"(foo), "ebx"(bar) // input
: "eax" // modify
);
return ret;
}
I've compiled disabling optimisations:
g++ -Og -O0 inline1.cpp -o test
The disassembled code puzzles me:
(gdb) disassemble sum0
Dump of assembler code for function sum0():
0x00000000000009de <+0>: push %rbp ;prologue...
0x00000000000009df <+1>: mov %rsp,%rbp ;prologue...
0x00000000000009e2 <+4>: movl $0x1,-0xc(%rbp) ;initialize foo
0x00000000000009e9 <+11>: movl $0x2,-0x8(%rbp) ;initialize bar
0x00000000000009f0 <+18>: mov -0xc(%rbp),%edx ;
0x00000000000009f3 <+21>: mov -0x8(%rbp),%ecx ;
0x00000000000009f6 <+24>: mov %edx,-0x14(%rbp) ; This is unexpected
0x00000000000009f9 <+27>: movd -0x14(%rbp),%xmm1 ; why moving variables
0x00000000000009fe <+32>: mov %ecx,-0x14(%rbp) ; to extended registers?
0x0000000000000a01 <+35>: movd -0x14(%rbp),%xmm2 ;
0x0000000000000a06 <+40>: add %ebx,%eax ; add (as expected)
0x0000000000000a08 <+42>: movd %xmm0,%edx ; copying the wrong result to ret
0x0000000000000a0c <+46>: mov %edx,-0x4(%rbp) ; " " " " " "
0x0000000000000a0f <+49>: mov -0x4(%rbp),%eax ; " " " " " "
0x0000000000000a12 <+52>: pop %rbp ;
0x0000000000000a13 <+53>: retq
End of assembler dump.
As expected, the sum0() function returns the wrong value.
Any thoughts? What is going on? How to get it right?
-- EDIT --
Based on @MarcGlisse's comment, I tried:
static inline uint32_t sum0() {
uint32_t foo = 1, bar=2;
uint32_t ret;
__asm__ __volatile__ (
"add %%ebx,%%eax"
: "=a"(ret) // ouput
: "a"(foo), "b"(bar) // input
: "eax" // modify
);
return ret;
}
It seems that the tutorial I've been following is misleading. "eax" in the output/input field does not mean the register itself, but e,a,x abbreviations on the abbrev table.
Anyway, I still do not get it right. The code above results in a compilation error: 'asm' operand has impossible constraints.
I don't see why.
The Extended inline assembly constraints for x86 are listed in the official documentation.
The complete documentation is also worth reading.
As you can see, the constraints are all single letters.
The constraint "eax" for foo specifies three constraints:
a: The a register.
x: Any SSE register.
e: 32-bit signed integer constant, or ...
Since you are telling GCC that eax is clobbered it cannot put the input operand there and it picks xmm0.
When the compiler selects the registers to use to represent the input operands, it does not use any of the clobbered registers
The proper constraint is simply "a".
You need to remove eax (by the way it should be rax due to zeroing of the upper bits) from the clobbers (and add "cc").
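Putting those corrections together, the function might look like the sketch below. It takes foo and bar as parameters instead of locals purely so it can be exercised with different values, drops volatile because the statement has no side effects beyond its outputs, and falls back to plain C++ on non-x86 targets; those are choices made for this illustration, not part of the original question.

```cpp
#include <cassert>
#include <cstdint>

inline std::uint32_t sum0(std::uint32_t foo, std::uint32_t bar)
{
#if defined(__x86_64__) || defined(__i386__)
    std::uint32_t ret;
    __asm__(
        "addl %%ebx, %%eax"
        : "=a"(ret)            // output: EAX
        : "a"(foo), "b"(bar)   // inputs: EAX and EBX
        : "cc");               // add modifies the flags; EAX is not a clobber
    return ret;
#else
    return foo + bar;          // fallback so the sketch builds off x86
#endif
}
```

Because EAX is named as an output, the compiler already knows it is modified, which is why listing it again as a clobber produces the "impossible constraints" error.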

Error in simple g++ inline assembler

I'm trying to write a "hello world" program to test inline assembler in g++.
(still learning AT&T syntax)
The code is:
#include <stdlib.h>
#include <stdio.h>
# include <iostream>
using namespace std;
int main() {
int c,d;
__asm__ __volatile__ (
"mov %eax,1; \n\t"
"cpuid; \n\t"
"mov %edx, $d; \n\t"
"mov %ecx, $c; \n\t"
);
cout << c << " " << d << "\n";
return 0;
}
I'm getting the following error:
inline1.cpp: Assembler messages:
inline1.cpp:18: Error: unsupported instruction `mov'
inline1.cpp:19: Error: unsupported instruction `mov'
Can you help me to get it done?
Tks
Your assembly code is not valid. Please read the Extended Asm documentation carefully. Here's another good overview.
Here is a CPUID example code from here:
static inline void cpuid(int code, uint32_t* a, uint32_t* d)
{
asm volatile ( "cpuid" : "=a"(*a), "=d"(*d) : "0"(code) : "ebx", "ecx" );
}
Note the format:
first : followed by output operands: : "=a"(*a), "=d"(*d); "=a" is eax and "=d" is edx
second : followed by input operands: : "0"(code); "0" means that code should occupy the same location as output operand 0 (eax in this case)
third : followed by clobbered registers list: : "ebx", "ecx"
I kept @AMA's answer as the accepted one because it was complete enough. But having thought about it, I concluded that it is not 100% correct.
The code I was trying to implement in GCC is the one below (Microsoft Visual Studio version).
int c,d;
_asm
{
mov eax, 1;
cpuid;
mov d, edx;
mov c, ecx;
}
When cpuid executes with eax set to 1, feature information is returned in ecx and edx.
The suggested code returns the values from eax ("=a") and edx ("=d").
This can be easily seen at gdb:
(gdb) disassemble cpuid
Dump of assembler code for function cpuid(int, uint32_t*, uint32_t*):
0x0000000000000a2a <+0>: push %rbp
0x0000000000000a2b <+1>: mov %rsp,%rbp
0x0000000000000a2e <+4>: push %rbx
0x0000000000000a2f <+5>: mov %edi,-0xc(%rbp)
0x0000000000000a32 <+8>: mov %rsi,-0x18(%rbp)
0x0000000000000a36 <+12>: mov %rdx,-0x20(%rbp)
0x0000000000000a3a <+16>: mov -0xc(%rbp),%eax
0x0000000000000a3d <+19>: cpuid
0x0000000000000a3f <+21>: mov -0x18(%rbp),%rcx
0x0000000000000a43 <+25>: mov %eax,(%rcx) <== HERE
0x0000000000000a45 <+27>: mov -0x20(%rbp),%rax
0x0000000000000a49 <+31>: mov %edx,(%rax) <== HERE
0x0000000000000a4b <+33>: nop
0x0000000000000a4c <+34>: pop %rbx
0x0000000000000a4d <+35>: pop %rbp
0x0000000000000a4e <+36>: retq
End of assembler dump.
The code that generates something closer to what I want is (EDITED based on feedback in the comments):
static inline void cpuid2(uint32_t* d, uint32_t* c)
{
int a = 1;
asm volatile ( "cpuid" : "=d"(*d), "=c"(*c), "+a"(a) :: "ebx" );
}
The result is:
(gdb) disassemble cpuid2
Dump of assembler code for function cpuid2(uint32_t*, uint32_t*):
0x00000000000009b0 <+0>: push %rbp
0x00000000000009b1 <+1>: mov %rsp,%rbp
0x00000000000009b4 <+4>: push %rbx
0x00000000000009b5 <+5>: mov %rdi,-0x20(%rbp)
0x00000000000009b9 <+9>: mov %rsi,-0x28(%rbp)
0x00000000000009bd <+13>: movl $0x1,-0xc(%rbp)
0x00000000000009c4 <+20>: mov -0xc(%rbp),%eax
0x00000000000009c7 <+23>: cpuid
0x00000000000009c9 <+25>: mov %edx,%esi
0x00000000000009cb <+27>: mov -0x20(%rbp),%rdx
0x00000000000009cf <+31>: mov %esi,(%rdx)
0x00000000000009d1 <+33>: mov -0x28(%rbp),%rdx
0x00000000000009d5 <+37>: mov %ecx,(%rdx)
0x00000000000009d7 <+39>: mov %eax,-0xc(%rbp)
0x00000000000009da <+42>: nop
0x00000000000009db <+43>: pop %rbx
0x00000000000009dc <+44>: pop %rbp
0x00000000000009dd <+45>: retq
End of assembler dump.
Just to be clear... I know that there are better ways of doing it. But the purpose here is purely educational. Just want to understand how it works ;-)
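For comparison, one of the "better ways" is the helper that GCC and Clang ship in <cpuid.h>. The sketch below assumes one of those toolchains on x86; the wrapper name cpu_features and the flag kIsX86 are made up here, and the non-x86 branch is only a placeholder so the snippet compiles elsewhere.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  // GCC/Clang header providing __get_cpuid
constexpr bool kIsX86 = true;
#else
constexpr bool kIsX86 = false;
#endif

// Hypothetical wrapper: fetches the leaf-1 feature words from ECX and EDX.
inline bool cpu_features(unsigned& c, unsigned& d)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned a = 0, b = 0;
    // __get_cpuid returns 0 when the requested leaf is not supported.
    return __get_cpuid(1, &a, &b, &c, &d) != 0;
#else
    c = 0;
    d = 0;
    return false;
#endif
}
```

The header takes care of the register constraints, so none of the clobber bookkeeping discussed above is needed.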
-- edited (removed personal opinion) ---

Using AT&T inline assembler for GCC

I'm writing a simple but a little specific program:
Purpose: calculate a number from its factorial
Requirements: all calculations must be done on gcc inline asm (at&t syntax)
Source code:
#include <iostream>
int main()
{
unsigned n = 0, f = 0;
std::cin >> n;
asm
(
"mov %0, %%eax \n"
"mov %%eax, %%ecx \n"
"mov 1, %%ebx \n"
"mov 1, %%eax \n"
"jmp cycle_start\n"
"cycle:\n"
"inc %%ebx\n"
"mul %%ebx\n"
"cycle_start:\n"
"cmp %%ecx, %%eax\n"
"jnz cycle\n"
"mov %%ebx, %1 \n":
"=r" (n):
"r" (f)
);
std::cout << f;
return 0;
}
This code causes SIGSEGV.
An identical program in Intel asm syntax (http://pastebin.com/2EqJmGAV) works fine. Why does my "AT&T program" fail, and how can I fix it?
#include <iostream>
int main()
{
unsigned n = 0, f = 0;
std::cin >> n;
__asm
{
mov eax, n
mov ecx, eax
mov eax, 1
mov ebx, 1
jmp cycle_start
cycle:
inc ebx
mul ebx
cycle_start:
cmp eax, ecx
jnz cycle
mov f, ebx
};
std::cout << f;
return 0;
}
UPD: Pushing the used registers to the stack and restoring them afterwards gives the same result: SIGSEGV
You have your input and output the wrong way around.
So, start by altering
"=r" (n):
"r" (f)
to:
"=r" (f) :
"r" (n)
Then I suspect you'll want to tell the compiler about clobbers (registers you are using that aren't inputs or outputs):
So add:
: "eax", "ebx", "ecx"
after the two lines above.
I personally would make some other changes:
Use local labels (1: and 2: etc), which allows the code to be duplicated without "duplicate label".
Use %1 instead of %%ebx - that way, you are not using an extra register.
Move %0 directly to %%ecx. You are loading 1 into %%eax two instructions later, so what purpose has it got to do in %%eax?
[Now, I've written too much, and someone else has answered first... ]
Edit: And, as Anton points out, you need $1 to load the constant 1. A bare 1 means "read from address 1", which doesn't work well and is most likely the cause of your problems.
Hopefully there are no requirements to use nothing but gcc inline asm to figure it out. You can translate your AT&T example with nasm, then disassemble with objdump and see what's the right syntax.
I seem to recall that mov 1,%eax should be mov $1,%eax if you mean literal constant and not a memory reference.
The answer by @MatsPetersson is very useful regarding the interaction of your inline assembly with the compiler (clobbered/input/output registers). I've focused on the reason why you get SIGSEGV; reading from address 1 answers that question.
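Combining the suggestions from both answers (outputs and inputs the right way around, $1 for immediates, local labels, a clobber list), a corrected version might look like the sketch below. The function name inverse_factorial is hypothetical, the non-x86 branch is only a fallback so the snippet compiles elsewhere, and edx is added to the clobbers because mul writes the high half of the product there. As in the original, the loop assumes that n really is a factorial.

```cpp
#include <cassert>

// Hypothetical helper: returns f such that f! == n (n must be a factorial).
inline unsigned inverse_factorial(unsigned n)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned f;
    __asm__(
        "movl $1, %%eax\n\t"    // running product = 1 ($ marks an immediate)
        "movl $1, %0\n\t"       // counter f = 1
        "jmp 2f\n"
        "1:\n\t"
        "incl %0\n\t"           // ++f
        "mull %0\n"             // edx:eax = eax * f
        "2:\n\t"
        "cmpl %1, %%eax\n\t"    // product == n ?
        "jne 1b"
        : "=&r"(f)              // early clobber: written before n is last read
        : "r"(n)
        : "eax", "edx", "cc");
    return f;
#else
    unsigned f = 1, product = 1;
    while (product != n) {      // same loop in plain C++
        ++f;
        product *= f;
    }
    return f;
#endif
}
```

For example, inverse_factorial(120) yields 5, since 5! is 120.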