I would like to learn some inline assembly programming, but my first code snippet does not work. I have a string and I would like to assign the value of the string to the rsi register.
Here is my code:
string s = "Hello world";
const char *ystr = s.c_str();
asm("mov %1,%%rsi"
:"S"(ystr)
:"%rsi" //clobbered register
);
return 0;
It gives me the error :Expected ')' before token. Any help is appreciated.
You left out a : to delimit the empty outputs section. So "S"(ystr) is an input operand in the outputs section, and "%rsi" is in the inputs section, not clobbers.
But as an input it's missing the (var_name) part of the "constraint"(var_name) syntax. So that's a syntax error, as well as a semantic error. That's the immediate source of the error <source>:9:5: error: expected '(' before ')' token. https://godbolt.org/z/97aTdjE8K
As Nate pointed out, you have several other errors, like trying to force the input to pick RSI with "S".
char *output; // could be a dummy var if you don't actually need it.
asm("mov %1, %0"
: "=r" (output) /// compiler picks a reg for you to write to.
:"S"(ystr) // force RSI input
: // no clobbers
);
Note that this does not tell the compiler that you read or write the pointed-to memory, so it's only safe for something like this, which copies the address around but doesn't expect to read or write the pointed-to data.
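If the asm does dereference the pointer, one common way to say so is a dummy "m" input that names the pointed-to object. This is only a minimal sketch, assuming x86-64 and GCC or Clang; first_byte is a hypothetical helper, not something from the question.

```cpp
#include <cassert>

// Sketch for x86-64 with GCC/Clang. first_byte is a hypothetical name.
inline unsigned first_byte(const char *p) {
    assert(p != nullptr);
    unsigned out;
    asm("movzbl (%1), %0"   // load one byte through the pointer, zero-extended
        : "=r"(out)         // compiler picks the destination register
        : "r"(p),           // the address, in any general-purpose register
          "m"(*p));         // dummy input: tells the compiler that *p is read
    return out;
}
```

The "m"(*p) operand is never referenced in the template; it exists only so the optimizer knows the byte at *p must be up to date in memory before the statement runs. For a whole string of unknown length, a "memory" clobber is the blunter alternative.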
Also related:
How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
Can I modify input operands in gcc inline assembly
How to mark as clobbered input operands (C register variables) in extended GCC inline assembly?
In general, when using gcc inline asm on x86, you want to avoid ever using mov instructions, and want to avoid explicit registers in the asm code -- just use the register constraints to get things in the appropriate registers. So for your example, getting a string pointer into the rsi register, you want just:
asm volatile("# ystr will be in %%rsi here"
: // no output constraints
: "S"(ystr) // input constraint
: // no clobber needed
);
Note that there's no actual asm code output here -- just a comment. The input constraint is sufficient to get the operand into the needed register prior to the point where this appears. Yes, rsi might well be used for something else afterwards, but that is as expected -- the register constraints just cover the inputs and outputs of the asm text.
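To see that the constraint alone does the work, here is a small sketch that copies RSI back out after the compiler has satisfied an "S" input. It assumes x86-64 with GCC or Clang, and the function names are made up for illustration.

```cpp
#include <cassert>

// Sketch for x86-64 with GCC/Clang. Both function names are hypothetical.
inline const char *rsi_after_constraint(const char *p) {
    const char *seen;
    asm("mov %%rsi, %0"     // copy whatever is in RSI into the output register
        : "=r"(seen)
        : "S"(p));          // "S" makes the compiler put p in RSI beforehand
    return seen;
}

inline void demo_rsi() {
    const char *s = "Hello world";
    assert(rsi_after_constraint(s) == s);   // RSI held the pointer
}
```

The template never loads RSI itself; the only mov is the one reading it. Note the doubled %% that Extended Asm needs for a hard-coded register name.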
In my case, C++ code was compiled with -std=c++17 and the compiler also reported expected ')' before ':' token
I changed the keyword asm to __asm__ and in my case, this helped.
This modification was inspired by "When writing code that can be compiled with -ansi and the various -std options, use __asm__ instead of asm" from https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, however as commented below, this may not be completely precise.
Assume we are using GCC (or a GCC-compatible compiler) on an x86-64 architecture, and that eax, ebx, ecx, edx and level are variables (unsigned int or unsigned int*) used as inputs and outputs of the instruction.
asm("CPUID":::);
asm volatile("CPUID":::);
asm volatile("CPUID":::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)::"memory");
asm volatile("CPUID":"=a"(eax):"0"(level):"memory");
asm volatile("CPUID"::"a"(level):"memory"); // Not sure of this syntax
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level));
I am not used to the inline assembly syntax, and I am wondering what would be the difference between all these calls, in a context where I just want to use CPUID as a serializing instruction (e.g. nothing will be done with the output of the instruction).
Can some of these calls lead to errors?
Which one(s) of these calls would be the most suitable (given that I want as little overhead as possible, but at the same time the "strongest" serialization possible)?
First of all, lfence may be strongly serializing enough for your use-case, e.g. for rdtsc. If you care about performance, check and see if you can find evidence that lfence is strong enough (at least for your use-case). Possibly even using both mfence; lfence might be better than cpuid, if you want to e.g. drain the store buffer before an rdtsc.
But neither lfence nor mfence are serializing on the whole pipeline in the official technical-terminology meaning, which could matter for cross-modifying code - discarding instructions that might have been fetched before some stores from another core became visible.
2. Yes, all the ones that don't tell the compiler that the asm statement writes E[A-D]X are dangerous and will likely cause hard-to-debug weirdness. (i.e. you need to use (dummy) output operands or clobbers).
You need volatile, because you want the asm code to be executed for the side-effect of serialization, not to produce the outputs.
If you don't want to use the CPUID result for anything (e.g. do double duty by serializing and querying something), you should simply list the registers as clobbers, not outputs, so you don't need any C variables to hold the results.
// volatile is already implied because there are no output operands
// but it doesn't hurt to be explicit.
// Serialize and block compile-time reordering of loads/stores across this
asm volatile("CPUID"::: "eax","ebx","ecx","edx", "memory");
// the "eax" clobber covers RAX in x86-64 code, you don't need an #ifdef __i386__
I am wondering what would be the difference between all these calls
First of all, none of these are "calls". They're asm statements, and inline into the function where you use them. CPUID itself is not a "call" either, although I guess you could look at it as calling a microcode function built-in to the CPU. But by that logic, every instruction is a "call", e.g. mul rcx takes inputs in RAX and RCX, and returns in RDX:RAX.
The first three (and the later one with no outputs, just a level input) destroy RAX through RDX without telling the compiler. It will assume that those registers still hold whatever it was keeping in them. They're obviously unusable.
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory"); (the one without volatile) will optimize away if you don't use any of the outputs. And if you do use them, it can still be hoisted out of loops. A non-volatile asm statement is treated by the optimizer as a pure function with no side effects. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#index-asm-volatile
It has a memory clobber, but (I think) that doesn't stop it from optimizing away, it just means that if / when / where it does run, any variables it could possibly read / write are synced to memory, so memory contents match what the C abstract machine would have at that point. This may exclude locals that haven't had their address taken, though.
asm("" ::: "memory") is very similar to std::atomic_thread_fence(std::memory_order_seq_cst), but note that that asm statement has no outputs, and thus is implicitly volatile. That's why it isn't optimized away, not because of the "memory" clobber itself. A (volatile) asm statement with a memory clobber is a compiler barrier against reordering loads or stores across it.
The optimizer doesn't care at all what's inside the first string literal, only the constraints / clobbers, so asm volatile("anything" ::: register clobbers, "memory") is also a compile-time-only memory barrier. I assume this is what you want, to serialize some memory operations.
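As a rough illustration of the compile-time-only barrier, here is a sketch with an empty template. The names compiler_barrier, publish, data_word and ready_flag are hypothetical, and this says nothing about run-time ordering by the CPU.

```cpp
#include <cassert>

// Hypothetical globals and helpers, for illustration only.
static int data_word;
static int ready_flag;

// Empty template: emits no instructions, but the "memory" clobber stops the
// compiler from moving loads or stores across this point.
inline void compiler_barrier() { asm volatile("" ::: "memory"); }

inline void publish(int v) {
    data_word = v;        // the compiler must emit this store first
    compiler_barrier();
    ready_flag = 1;       // and this store after it
}

inline void demo_publish() {
    publish(42);
    assert(data_word == 42 && ready_flag == 1);
}
```

Replacing the empty string with "cpuid" and adding the four register clobbers gives the same compile-time effect plus the run-time serialization.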
"0"(level) is a matching constraint for the first operand (the "=a"). You could equally have written "a"(level), because in this case the compiler doesn't have a choice of which register to select; the output constraint can only be satisfied by eax. You could also have used "+a"(eax) as the output operand, but then you'd have to set eax=level before the asm statement. Matching constraints instead of read-write operands are sometimes necessary for x87 stack stuff; I think that came up once in an SO question. But other than weird stuff like that, the advantage is being able to use different C variables for input and output, or not using a variable at all for the input. (e.g. a literal constant, or an lvalue (expression)).
Anyway, telling the compiler to provide an input will probably result in an extra instruction, e.g. level=0 would result in an xor-zeroing of eax. This would be a waste of an instruction if it didn't already need a zeroed register for anything earlier. Normally xor-zeroing an input would break a dependency on the previous value, but the whole point of CPUID here is that it's serializing, so it has to wait for all previous instructions to finish executing anyway. Making sure eax is ready early is pointless; if you don't care about the outputs, don't even tell the compiler your asm statement takes an input. Compilers make it difficult or impossible to use an undefined / uninitialized value with no overhead; sometimes leaving a C variable uninitialized will result in loading garbage from the stack, or zeroing a register, instead of just using a register without writing it first.
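For the case where you do want the results, the matching-constraint form discussed above can be wrapped like this. It is a sketch under the assumption of x86-64 with GCC or Clang; CpuidRegs and cpuid are hypothetical names, and the subleaf parameter is an addition that the question does not have.

```cpp
#include <cassert>
#include <cstdint>

struct CpuidRegs { uint32_t eax, ebx, ecx, edx; };  // hypothetical helper type

// Hypothetical wrapper: runs CPUID for the given leaf and returns all outputs.
inline CpuidRegs cpuid(uint32_t leaf, uint32_t subleaf = 0) {
    CpuidRegs r;
    asm volatile("cpuid"
                 : "=a"(r.eax), "=b"(r.ebx), "=c"(r.ecx), "=d"(r.edx)
                 : "0"(leaf),      // same register as operand 0 (EAX)
                   "2"(subleaf)    // same register as operand 2 (ECX)
                 : "memory");      // also act as a compiler barrier
    return r;
}

inline void demo_cpuid() {
    CpuidRegs r = cpuid(0);
    assert(r.eax >= 1);   // leaf 0 reports the highest basic leaf in EAX
}
```

The inputs are plain expressions while the outputs are separate struct members, which is the flexibility that matching constraints give over "+a".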
I want to create a function that adds two 16-bit integers with overflow detection. I have a generic variant written in portable C. But the generic variant is not optimal for an x86 target, because the CPU internally calculates the overflow flag when executing ADD/SUB/etc. Of course, there is __builtin_add_overflow(), but in my case it generates some boilerplate.
So I wrote the following code:
#include <cstdint>
struct result_t
{
uint16_t src;
uint16_t dst;
uint8_t of;
};
static void add_u16_with_overflow(result_t& r)
{
char of, cf;
asm (
" addw %[dst], %[src] "
: [dst] "+mr"(r.dst)//, "=@cco"(of), "=@ccc"(cf)
: [src] "imr" (r.src)
: "cc"
);
asm (" seto %0 " : "=rm" (r.of) );
}
uint16_t test_add(uint16_t a, uint16_t b)
{
result_t r;
r.src = a;
r.dst = b;
add_u16_with_overflow(r);
add_u16_with_overflow(r);
return (r.dst + r.of); // use r.dst and r.of for prevent discarding
}
I've played with https://godbolt.org/g/2mLF55 (gcc 7.2 -O2 -std=c++11) and it results
test_add(unsigned short, unsigned short):
seto %al
movzbl %al, %eax
addw %si, %di
addw %si, %di
addl %esi, %eax
ret
So, seto %0 is reordered. It seems gcc thinks there is no dependency between two consecutive asm() statements. And the "cc" clobber doesn't have any effect on the flags dependency.
I can't use volatile because seto %0 or the whole function can be (and has to be) optimized out if the result (or some part of the result) is not used.
I can add a dependency on r.dst: asm (" seto %0 " : "=rm" (r.of) : "rm"(r.dst) );, and reordering will not happen. But it is not the "right thing", and the compiler can still insert some code that changes flags (but doesn't change r.dst) between the add and seto statements.
Is there a way to say "this asm() statement changes some cpu flags" and "this asm() uses some cpu flags" to create a dependency between statements and prevent reordering?
I haven't looked at gcc's output for __builtin_add_overflow, but how bad is it? @David's suggestion to use it, and https://gcc.gnu.org/wiki/DontUseInlineAsm, is usually good advice, especially if you're worried about how this will optimize. asm defeats constant propagation and some other things.
Also, if you are going to use asm, note that AT&T syntax uses add %[src], %[dst] operand order. See the tag wiki for details, unless you're always going to build your code with -masm=intel.
Is there a way to say "this asm() statement changes some cpu flags" and "this asm() uses some cpu flags" to create a dependency between statements and prevent reordering?
No. Put the flag-consuming instruction (seto) inside the same asm block as the flag-producing instruction. An asm statement can have as many input and output operands as you like, limited only by register-allocation difficulty (but multiple memory outputs can use the same base register with different offsets). Anyway, an extra write-only output on the statement containing the add isn't going to cause any inefficiency.
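A minimal sketch of that single-statement approach, assuming x86-64 with GCC or Clang. Add16 and add16 are hypothetical names, and it uses register-only operands rather than the question's "+mr" to keep the example simple.

```cpp
#include <cassert>
#include <cstdint>

struct Add16 { uint16_t sum; bool overflow; bool carry; };  // hypothetical type

// Hypothetical helper: add and capture OF/CF inside one asm statement, so the
// compiler cannot schedule anything between the add and the setcc pair.
inline Add16 add16(uint16_t a, uint16_t b) {
    Add16 r;
    r.sum = a;
    asm("addw %[src], %[dst]\n\t"   // AT&T order: dst += src
        "seto %[of]\n\t"            // OF: signed overflow (seto leaves flags alone)
        "setc %[cf]"                // CF: unsigned carry
        : [dst] "+r"(r.sum), [of] "=r"(r.overflow), [cf] "=r"(r.carry)
        : [src] "r"(b)
        : "cc");
    return r;
}

inline void demo_add16() {
    Add16 r = add16(0x7fff, 1);
    assert(r.sum == 0x8000 && r.overflow && !r.carry);
}
```

There is no volatile, so the whole statement can still be dropped when none of the outputs are used.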
I was going to suggest that if you want multiple flag outputs from one instruction, use LAHF to Load AH from FLAGS. But that doesn't include OF, only the other condition codes. This is often inconvenient and seems like a bad design choice because there are some unused reserved bits in the low 8 of EFLAGS/RFLAGS, so OF could have been in the low 8 along with CF, SF, ZF, PF, and AF. But since that isn't the case, setc + seto are probably better than pushf / reload, but that is worth considering.
Even if there was syntax for flag-input (like there is for flag-output), there would be very little to gain from letting gcc insert some of its own non-flag-modifying instructions (like lea or mov) between your two separate asm statements.
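For reference, the flag-output syntax mentioned here looks roughly like this. It is a sketch that assumes a compiler with flag-output support (GCC 6 or later, or a recent Clang) on x86-64; Sum16 and add16_flags are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct Sum16 { uint16_t sum; bool overflow; bool carry; };  // hypothetical type

// Hypothetical helper: the "=@cc<cond>" outputs ask the compiler to
// materialize the condition itself, so no seto/setc appears in the template.
inline Sum16 add16_flags(uint16_t a, uint16_t b) {
    uint16_t sum = a;
    bool of, cf;
    asm("addw %[src], %[dst]"
        : [dst] "+r"(sum), "=@cco"(of), "=@ccc"(cf)
        : [src] "rm"(b));
    return {sum, of, cf};
}

inline void demo_add16_flags() {
    Sum16 r = add16_flags(0xffff, 1);
    assert(r.sum == 0 && !r.overflow && r.carry);
}
```

If the caller only branches on the result, the compiler is free to use a jcc directly instead of a setcc followed by a test.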
You don't want them reordered or anything, so putting them in the same asm statement makes by far the most sense. Even on an in-order CPU, add is low latency so it's not a big bottleneck to put a dependent instruction right after it.
And BTW, a jcc might be more efficient if overflow is an error condition that doesn't happen normally. But unfortunately GNU C asm goto didn't support output operands when this was written (GCC 11 and later do accept them). You could take a pointer input and modify dst in memory (and use a "memory" clobber), but forcing a store/reload sucks more than using setc or seto to produce an input for a compiler-generated test/jnz.
If you didn't also need an output, you could put C labels on a return true and a return false statement, which (after inlining) would turn your code into a jcc to wherever the compiler wanted to lay out the branches of an if(). e.g. see how Linux does it (with extra complicating factors in these two examples I found: setting up to patch the code after checking a CPU feature once at boot, or something with a section for a jump table in arch_static_branch).
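A rough sketch of that no-output asm goto pattern, assuming x86-64 with GCC or Clang. add_would_overflow is a hypothetical name, and it does the add in a clobbered scratch register because the statement has no output operand to hold the sum.

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper: reports only whether a + b overflows; the sum itself
// is thrown away because this asm goto has no output operands.
inline bool add_would_overflow(int a, int b) {
    asm goto("movl %0, %%eax\n\t"     // work on a copy; inputs must not be modified
             "addl %1, %%eax\n\t"
             "jo %l[overflow]"        // branch straight to the C label
             : /* no outputs */
             : "r"(a), "r"(b)
             : "eax", "cc"
             : overflow);
    return false;
overflow:
    return true;
}

inline void demo_add_would_overflow() {
    assert(!add_would_overflow(1, 2));
    assert(add_would_overflow(INT_MAX, 1));
}
```

After inlining into an if(), the compiler can lay out the two return paths as the branch targets directly.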
In my C++ / C project I want to set the stack pointer equal to the base pointer... Intuitively I would use something like this:
asm volatile(
"movl %%ebp %%esp"
);
However, when I execute this, I get this error message:
Error: bad register name `%%ebp %%esp'
I use gcc / g++ version 4.9.1 compiler.
I don't know whether I need to set a specific g++ or gcc flag though... There should be a way to manipulate the esp and ebp registers but I just don't know the right way to do it.
Does anybody know how to manipulate these two registers in C++? Maybe I should do it with hex opcodes?
You're using GNU C Basic Asm syntax (no input/output/clobber constraints), so % is not special and therefore, it shouldn't be escaped.
It's only in Extended Asm (with constraints) that % needs to be escaped to end up with a single % in front of hard-coded register names in the compiler's asm output (as required in AT&T syntax).
You also have to separate the operands with a comma:
asm volatile(
"movl %ebp, %esp"
);
asm statements with no output operands are implicitly volatile, but it doesn't hurt to write an explicit volatile.
Note, however, that putting this statement inside a function will likely interfere with the way the compiler handles the stack frame.
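For contrast, here is the Extended Asm form, where the % does have to be doubled. This sketch only reads the stack pointer, which is safe, rather than writing it. It assumes x86-64 (so RSP instead of the question's ESP) with GCC or Clang, and read_stack_pointer is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: Extended Asm (it has an output operand), so the
// hard-coded register needs %% and the operand is referenced as %0.
inline uintptr_t read_stack_pointer() {
    uintptr_t sp;
    asm volatile("mov %%rsp, %0" : "=r"(sp));
    return sp;
}

inline void demo_read_sp() {
    assert(read_stack_pointer() != 0);
}
```

In a 32-bit build the same idea would use movl %%esp, %0 with a 32-bit output variable.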
I have two similar issues when handling arrays when defined in the asm and when passed from c++ to asm. The code works fine inline but I need to separate them from the cpp into an asm file. The compiler may not throw an error or warning but the end result is random each run and should be constant like it was when inline.
The below code works when used in MMX (movq mm6,twosMask_W) but I need the equivalent for SSE2. I thought that this would work but I appear to be incorrect.
.data
align 16
twosMask_W qword 2 dup(0002000200020002h)
.code
...
movdqa xmm6,oword ptr twosMask_W
...
The second issue is when I pass my thresh128 array from C++ to asm (again for SSE2):
//C++
uint64_t thresh128[2];
thresh128[0] = ((thresh-1)<<8)+(thresh-1);
thresh128[0] += (thresh128[0]<<48)+(thresh128[0]<<32)+(thresh128[0]<<16);
thresh128[1] = thresh128[0];
sendToASM(thresh128)
//ASM
;There are more parameters that utilize the registers but not listed.
receivedFromCPP proc thresh:qword
public receivedFromCPP
...
movdqu xmm4,oword ptr thresh
...
I've tried having thresh as an oword parameter in the procedure but it yielded no results. I'm sure I've got some syntax or parameter type wrong. Any help would be greatly appreciated.
Note: Compiled using MASM in VS2013 for x86.
Well, I tested the first part and it seems to work - so I cannot say anything related to this particular issue.
Concerning the second problem: you seem to pass a 64-bit qword on the stack in 32-bit mode (where there is no direct opcode for 64-bit PUSHes), so it would be 2 PUSHes...
receivedFromCPP proc thresh:qword
but are expecting a pointer to a 128 bit value on the stack:
movdqu xmm4,oword ptr thresh
Also keep in mind the little-endianness of x86 - depending on how the compiler chooses to PUSH the 2*64-bit array, it may differ from a little-endian value, resulting in seemingly random values.
EDIT: Because the stack grows downward, a 128-bit value has to be PUSHed in reverse order for referencing it by EBP.
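The "pass an address, then do one 16-byte load from it" idea can be sketched on the C++ side with SSE2 intrinsics. This is not the MASM fix itself, just an illustration of what the callee should receive; make_thresh128, load_thresh and thresh_matches_broadcast are hypothetical names, and thresh is assumed to be in the range 1..256.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: builds the 128-bit threshold the way the question does.
inline void make_thresh128(uint64_t out[2], uint64_t thresh) {
    assert(thresh >= 1 && thresh <= 256);
    uint64_t w = ((thresh - 1) << 8) + (thresh - 1);
    w += (w << 48) + (w << 32) + (w << 16);
    out[0] = w;
    out[1] = w;
}

// The callee gets the ADDRESS of the array and does one unaligned 16-byte
// load through it, the intrinsic counterpart of movdqu from [register].
inline __m128i load_thresh(const uint64_t *p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(p));
}

// Checks that every byte of the loaded vector equals thresh - 1.
inline bool thresh_matches_broadcast(uint64_t thresh) {
    uint64_t t[2];
    make_thresh128(t, thresh);
    __m128i loaded = load_thresh(t);
    __m128i expect = _mm_set1_epi8(static_cast<char>(thresh - 1));
    return std::memcmp(&loaded, &expect, sizeof loaded) == 0;
}
```

In the asm procedure the equivalent would be to declare the parameter as a pointer, load it into a register, and use that register as the memory operand of movdqu.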
I'm having difficulty understanding the role constraints play in GCC inline assembly (x86). I've read the manual, which explains exactly what each constraint does. The problem is that even though I understand what each constraint does, I have very little understanding of why you would use one constraint over another, or what the implications might be.
I realize this is a very broad topic, so a small example should help narrow the focus. The following is a simple asm routine which just adds two numbers. If an integer overflow occurs, it writes a value of 1 to an output C variable.
int32_t a = 10, b = 5;
int32_t c = 0; // overflow flag
__asm__
(
"addl %2,%3;" // Do a + b (the result goes into b)
"jno 0f;" // Jump ahead if no overflow occurred
"movl $1, %1;" // Copy 1 into c
"0:" // We're done.
:"=r"(b), "=m"(c) // Output list
:"r"(a), "0"(b) // Input list
);
Now this works fine, except I had to arbitrarily fiddle with the constraints until I got it to work correctly. Originally, I used the following constraints:
:"=r"(b), "=m"(c) // Output list
:"r"(a), "m"(b) // Input list
Note that instead of a "0", I use an "m" constraint for b. This had a weird side effect where if I compiled with optimization flags and called the function twice, for some reason the result of the addition operation would also get stored in c. I eventually read about "matching constraints", which allows you to specify that a variable is to be used as both an input and output operand. When I changed "m"(b) to "0"(b) it worked.
But I don't really understand why you would use one constraint over another. I mean yeah, I understand that "r" means the variable should be in a register and "m" means it should be in memory - but I don't really understand what the implications of choosing one over another are, or why the addition operation doesn't work correctly if I choose a certain combination of constraints.
Questions: 1) In the above example code, why did the "m" constraint on b cause c to get written to? 2) Is there any tutorial or online resource which goes into more detail about constraints?
Here's an example to better illustrate why you should choose constraints carefully (same function as yours, but perhaps written a little more succinctly):
bool add_and_check_overflow(int32_t& a, int32_t b)
{
bool result;
__asm__("addl %2, %1; seto %b0"
: "=q" (result), "+g" (a)
: "r" (b));
return result;
}
So, the constraints used were: q, r, and g.
q means only eax, ecx, edx, or ebx could be selected. This is because the set* instructions must write to an 8-bit-addressable register (al, ah, ...). The use of b in the %b0 means, use the lowest 8-bit portion (al, cl, ...).
For most two-operand instructions, at least one of the operands must be a register. So don't use m or g for both; use r for at least one of the operands.
For the final operand, it doesn't matter whether it's register or memory, so use g (general).
In the example above, I chose to use g (rather than r) for a because references are usually implemented as memory pointers, so using an r constraint would have required copying the referent to a register first, and then copying back. Using g, the referent could be updated directly.
As to why your original version overwrote your c with the addition's value, that's because you specified =m in the output slot, rather than (say) +m; that means the compiler is allowed to reuse the same memory location for input and output.
In your case, that means two outcomes (since the same memory location was used for b and c):
The addition didn't overflow: then, c got overwritten with the value of b (the result of the addition).
The addition did overflow: then, c became 1 (and b might become 1 also, depending on how the code was generated).
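One way the question's routine could be written so that neither problem can occur is to make both b and c read-write operands. This is a sketch assuming x86 with GCC or Clang; add_check is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical rewrite of the question's routine. "+r"(b) says b is read and
// written in place; "+m"(c) says c's old value matters, so the compiler must
// store the 0 first and may not share c's slot with anything else.
inline bool add_check(int32_t a, int32_t &b) {
    int32_t c = 0;
    asm("addl %[a], %[b]\n\t"
        "jno 0f\n\t"            // skip the store when there is no overflow
        "movl $1, %[c]\n"
        "0:"
        : [b] "+r"(b), [c] "+m"(c)
        : [a] "r"(a)
        : "cc");
    return c != 0;
}

inline void demo_add_check() {
    int32_t b = 5;
    assert(!add_check(10, b) && b == 15);
    b = INT32_MAX;
    assert(add_check(1, b) && b == INT32_MIN);
}
```

With "=m"(c) instead, the initial c = 0 would be a dead store from the compiler's point of view, which is how the no-overflow path can end up reading garbage.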