I'm trying to mimic the following code using atomic inline assembly:
struct Node {
    Node *next;
    int value;
};
typedef struct Node *Node_ptr;

Node_ptr store(Node_ptr **L, Node_ptr *I) {
    pthread_mutex_lock(&queue_mutex);
    Node_ptr tmp = **L;
    **L = *I;
    pthread_mutex_unlock(&queue_mutex);
    return tmp;
}
Here is what I've tried:
Node_ptr tmp;
__asm volatile ("lock; movq %1, %%rax; movq %%rax, %0"
                : "=r" (tmp)
                : "r" (**L)
                : "%rax"
);
__asm volatile ("lock; movq %1, %%rax; movq %%rax, %0"
                : "=r" (**L)
                : "r" (*I)
                : "%rax"
);
return tmp;
However, I'm getting an "Illegal Instruction" error, and I'm having trouble seeing where I went wrong. Does anyone have some insight into what the issue is?
Thanks
Edit: added definition for Node_ptr
Intel's manual says the following on the topic of the LOCK prefix:
The LOCK prefix can be prepended only to the following instructions
and only to those forms of the instructions where the destination
operand is a memory operand: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG,
CMPXCH8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG. If
the LOCK prefix is used with one of these instructions and the source
operand is a memory operand, an undefined opcode exception (#UD) may
be generated. An undefined opcode exception will also be generated if
the LOCK prefix is used with any instruction not in the above list.
The best thing to do here (apart from reading the several-thousand-page manuals from Intel) is to look at the output your compiler generates for the C++ code; that should give you an idea.
What you're looking for is the CMPXCHG instruction. (You'll still need the LOCK prefix.)
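Note that the store() in the question is a plain atomic exchange, which also maps to a single XCHG; with a memory operand, XCHG is implicitly locked. A minimal sketch (my rendering, not tested against the original code):

Node_ptr store(Node_ptr **L, Node_ptr *I) {
    Node_ptr tmp = *I;
    /* XCHG with a memory operand is implicitly locked,
       so no explicit LOCK prefix is needed. */
    __asm__ volatile ("xchgq %0, %1"
                      : "+r" (tmp), "+m" (**L)
                      : /* no separate inputs */
                      : "memory");
    return tmp;  /* the old value of **L */
}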
I recently saw an article on how the swap operation can be performed using XOR instead of a temporary variable. When I compile code like a ^= b; on ints, the result won't simply be (in AT&T syntax)
xor b, a
etc.
Instead, it loads the values into registers, XORs them, and writes them back.
To optimize this, I want to write it in inline assembly so the whole thing takes only three cycles, not the 15 it normally takes.
I've tried multiple keywords like:
asm(...);
asm("...");
asm{...};
asm{"..."};
asm ...
__asm ...
None of these worked; they either gave a syntax error (gcc doesn't seem to accept all of these forms) or produced:
main.cpp: Assembler messages:
main.cpp:12: Error: too many memory references for `xor'
Basically, I want to use the variables defined in my C++ code inside the assembler block, XOR them in three lines, and then have my swapped variables, basically like this:
int main() {
    volatile int a = 5;
    volatile int b = 6;
    asm {
        xor a,b
        xor b,a
        xor a,b
    };
    //a should now be 6, b should be 5
}
To clarify:
I want to avoid the compiler-generated mov operations, since they take more CPU cycles than just doing the three xor operations, which would take three cycles. How could I accomplish this?
To use inline assembly, you should use __asm__ volatile. However, this type of optimization may be premature. Just because there are more instructions does not mean the code is slower - some instructions can be really slow. For example, a floating point BCD store instruction (fbstp), while admittedly rare, takes over 200 cycles - compared to one cycle for a simple mov (Agner Fog's Optimization Guide is a good resource for these timings).
So, I implemented a bunch of "swap" functions, some in C++ and some in assembly, and did a bit of measuring, running each function 100 million times in a row.
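As a rough sketch, the measurement harness looked something like this (the function name, clock choice, and output format are my assumptions, not the exact code used):

#include <chrono>
#include <cstdio>

// Run one swap function 100 million times and report microseconds.
template <typename F>
long long measure(const char *name, F swap_fn) {
    int a = 5, b = 6;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000000; ++i)
        swap_fn(&a, &b);
    auto stop = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("%s: %lld\n", name, us);
    return us;
}

Each test case below would then be invoked as, e.g., measure("std_swap", std_swap);.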
Test cases
std::swap
std::swap is probably the preferred solution here. It does what you want (swap the values of two variables), works for most standard library types and not just for integers, clearly communicates what you are trying to achieve, and is portable across architectures.
void std_swap(int *a, int *b) {
    std::swap(*a, *b);
}
Here is the generated assembly: it loads both values into registers and then writes them back to the opposite memory locations.
movl (%rdi), %eax
movl (%rsi), %edx
movl %edx, (%rdi)
movl %eax, (%rsi)
XOR swap
This is what you were trying to do, in C++:
void xor_swap(int *a, int *b) {
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
This doesn't directly translate to only xor instructions, because there is no instruction on x86 that allows you to directly xor two locations in memory - you always need to load at least one of the two into a register:
movl (%rdi), %eax
xorl (%rsi), %eax
movl %eax, (%rdi)
xorl (%rsi), %eax
movl %eax, (%rsi)
xorl %eax, (%rdi)
You also generate a bunch of extra instructions because the two pointers may alias, i.e. point to overlapping memory areas. In that case, changing one variable would also change the other, so the compiler needs to constantly store and re-load the values. An implementation using the compiler-specific __restrict keyword will compile to the same code as std_swap (thanks to @Ped7g for pointing out this flaw in the comments).
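For illustration, the __restrict variant mentioned above would look like this (a sketch; __restrict is a GCC/Clang extension):

void xor_swap_restrict(int *__restrict a, int *__restrict b) {
    *a ^= *b;  // no aliasing, so the compiler can keep both values in registers
    *b ^= *a;
    *a ^= *b;
}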
Swap with temporary variables
This is the "standard" swap with a temporary variable (that the compiler promptly optimizes out to the same code as std::swap):
void tmp_swap(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
The xchg instruction
xchg can swap a memory value with a register value - it seems perfect at first for your use case. However, it is really slow when you use it to access memory, as you will see later.
void xchg_asm_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"
        "xchgl (%1), %%eax\n\t"
        "movl %%eax, (%0)"
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax"
    );
}
We need to load one of the two values into a register, because there is no xchg for two memory locations.
XOR swap in Assembly
I made two versions of the XOR-based swap in assembly. The first loads only one of the values into a register; the second loads both before swapping them and writing them back.
void xor_asm_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"
        "xorl (%1), %%eax\n\t"
        "xorl %%eax, (%1)\n\t"
        "xorl (%1), %%eax\n\t"
        "movl %%eax, (%0)"
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax"
    );
}

void xor_asm_register_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"
        "movl (%1), %%ecx\n\t"
        "xorl %%ecx, %%eax\n\t"
        "xorl %%eax, %%ecx\n\t"
        "xorl %%ecx, %%eax\n\t"
        "movl %%eax, (%0)\n\t"
        "movl %%ecx, (%1)"
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax", "%ecx"
    );
}
The results
You can view the full compilation results along with the generated assembly code on Godbolt.
On my machine, the timings (in microseconds) vary a bit, but are generally comparable:
std_swap: 127371
xor_swap: 150152
tmp_swap: 125896
xchg_asm_swap: 699355
xor_asm_swap: 130586
xor_asm_register_swap: 124718
You can see that std_swap, tmp_swap, xor_asm_swap, and xor_asm_register_swap are generally very similar in speed - in fact, if I move xor_asm_register_swap to the front, it turns out slightly slower than std_swap. Also note that tmp_swap is exactly the same assembly code as std_swap (although it regularly measures in as a bit faster, probably because of the ordering).
xor_swap implemented in C++ is slightly slower because the compiler generates an additional memory load/store for each of the instructions because of aliasing - as mentioned above, if we modify xor_swap to take int * __restrict a, int * __restrict b instead (meaning that a and b never alias), the compiler generates the same code as for std_swap and tmp_swap.
xchg_asm_swap, despite using the fewest instructions, is terribly slow (over four times slower than any of the other options), just because xchg is not a fast operation if it involves a memory access.
Ultimately, you have the choice between using some custom assembly-based version (that is hard to understand and maintain) or just using std::swap (which is pretty much the opposite, and also benefits from any optimizations that the standard library designers can come up with, e.g. using vectorization on larger types). Since this is over one hundred million iterations, it should be clear that the potential improvement from using assembly code here is very small - if you gain anything at all (which is not clear), you'd shave off a couple of microseconds at most.
TL;DR: You shouldn't do that; just use std::swap(a, b).
Appendix: __asm__ volatile
I figured that it may make sense at this point to explain the inline assembly code a bit. __asm__ (in GNU mode, asm is enough) introduces a block of assembly code. The volatile is there to make sure the compiler doesn't optimize it away - it likes to just remove the block otherwise.
There are two forms of __asm__ volatile. One of them also deals with goto labels; I will not address it here. The other form takes up to four arguments, separated with colons (:):
The simplest form (__asm__ volatile ("rdtsc")) just dumps the assembly code, but does not really interact with the C++ code around it. In particular, you need to guess how variables are assigned to registers, which is not exactly good.
Note that the assembly code instructions are separated with "\n", because this assembly code is passed verbatim to the GNU assembler (gas).
The second argument is a list of output operands. You can specify what "type" they have (in particular, =r means "any register operand", and +r means "any register operand, but it is also used as an input"). For example, : "+r" (a), "+r" (b) tells the compiler to replace %0 (references the first of the operands) with the register containing a, and %1 with the register containing b.
Because of this operand notation, you need to write %%eax to escape the percent sign where you would normally write %eax in AT&T assembly notation.
You can also use ".intel_syntax\n" to switch to Intel's assembly syntax if you prefer.
The third argument is the same, but deals with input-only operands.
The fourth argument, the clobber list, tells the compiler which registers and memory locations lose their values, so that it can still optimize correctly around the assembly code. For example, "clobbering" "memory" will likely prompt the compiler to insert a full memory fence. You can see that I added all the registers I used for temporary storage to this list.
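For example, a hypothetical statement that stores through a pointer the compiler cannot see in the operands needs the "memory" clobber (a sketch):

void store_flag(int *p) {
    __asm__ volatile ("movl $1, (%0)"
                      : /* no outputs */
                      : "r" (p)
                      : "memory");  // the store isn't visible in the operands
}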
I have Ubuntu 16.04, x86_64 arch, kernel version 4.15.0-39-generic.
GCC 8.1.0
I tried to rewrite these functions (from the first post at https://groups.google.com/forum/#!topic/comp.lang.c++.moderated/qHDCU73cEFc) from Intel dialect to AT&T, and I did not succeed.
namespace atomic {
    __declspec(naked)
    static void*
    ldptr_acq(void* volatile*) {
        _asm {
            MOV EAX, [ESP + 4]
            MOV EAX, [EAX]
            RET
        }
    }

    __declspec(naked)
    static void*
    stptr_rel(void* volatile*, void* const) {
        _asm {
            MOV ECX, [ESP + 4]
            MOV EAX, [ESP + 8]
            MOV [ECX], EAX
            RET
        }
    }
}
Then I wrote a simple program to get back the same pointer that I pass in. I installed GCC 8.1, which supports the naked attribute for functions (https://gcc.gnu.org/gcc-8/changes.html: "The x86 port now supports the naked function attribute").
As far as I remember, this attribute tells the compiler not to create the prologue and epilogue of the function, so I can take the parameters from the stack myself and return them.
Code (doesn't work; segfaults):
#include <cstdio>
#include <cstdlib>

__attribute__ ((naked))
int *get_num(int*) {
    __asm__ (
        "movl 4(%esp), %eax\n\t"
        "movl (%eax), %eax\n\t"
        "ret"
    );
}

int main() {
    int *i = (int*) malloc(sizeof(int));
    *i = 5;
    int *j = get_num(i);
    printf("%d\n", *j);
    free(i);
    return 0;
}
Then I tried using 64-bit registers (also segfaults):
__asm__ (
    "movq 4(%rsp), %rax\n\t"
    "movq (%rax), %rax\n\t"
    "ret"
);
It only worked once I took the value out of the rdi register:
__asm__ (
    "movq %rdi, %rax\n\t"
    "ret"
);
Why did I fail to pass the argument through the stack? I probably made a mistake. Please tell me where I went wrong.
Because the x86-64 System V calling convention passes args in registers, not on the stack, unlike the old inefficient i386 System V calling convention.
You always have to write asm that matches the calling convention if you're writing the whole function in asm, as with a naked function or a stand-alone .S file.
GNU C extended asm allows you to use operands to specify the inputs to an asm statement, and the compiler will generate instructions to make that happen. (I wouldn't recommend using it until you understand asm and how compilers turn C into asm with optimization enabled, though.)
Also note that movq %rdi, %rax implements long *foo(long*p){return p;} not return *p. Perhaps you meant mov (%rdi), %rax to dereference the pointer arg?
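A sketch of that dereferencing version under the x86-64 System V convention (hypothetical code following the hint above, not from the answer):

__attribute__((naked))
long deref(long *p) {
    __asm__ (
        "movq (%rdi), %rax\n\t"  /* first integer arg arrives in %rdi; load *p */
        "ret"
    );
}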
And BTW, you definitely don't need and shouldn't use inline asm for this: see https://gcc.gnu.org/wiki/DontUseInlineAsm and https://stackoverflow.com/tags/inline-assembly/info.
In GNU C, you can cast a pointer to volatile uint64_t*. Or you can use __atomic_load_n (ptr, __ATOMIC_ACQUIRE) to get basically everything you were getting from that asm, without the overhead of a function call or any of the cost for the optimizer at the call-site of having all the call-clobbered registers be clobbered.
You can use them on any object: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html Unlike C++11 where you can only do atomic ops on a std::atomic<T>.
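For instance, the ldptr_acq/stptr_rel pair from the question could be written with the builtins and no asm at all; a minimal sketch (my rendering, not code from the answer):

static void *ldptr_acq(void *volatile *p) {
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);   /* acquire load */
}

static void stptr_rel(void *volatile *p, void *const v) {
    __atomic_store_n(p, v, __ATOMIC_RELEASE);      /* release store */
}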
I'm looking to run inline asm in a thread but I get a seg fault on certain instructions, such as the below:
#include <thread>
void Foo(int &x)
{
    int temp;
    asm volatile ("movl $5, %%edx;"
                  "movl $3, %%eax;"
                  "addl %%edx, %%eax;"
                  "movl %%eax, -24(%%rbp);" // seg faults here
                  "movl -24(%%rbp), %0;"
                  : "=r" (temp) : : );
    x = temp;
}

int main()
{
    int x;
    std::thread t1(Foo, std::ref(x));
    t1.join();
    return 0;
}
(I'm using std::ref to be able to pass a reference to an std::thread, but have to use the temp variable because the extended asm syntax doesn't work with references.)
I've tried clobbering all the involved registers and it doesn't help. If I don't pass any arguments to the thread, it seems to work, and strangely it also works if I clobber the %ebx register (which isn't involved). I'm using gcc 4.8.4 on 64-bit Ubuntu 14.04.
What is the best/safest way I can execute inline asm in a thread?
I'm still not completely convinced I understand what your objectives are. I find the comment "but have to use the temp variable because the extended asm syntax doesn't work with references" a bit unusual; I'd expect you can pass a reference into an assembler template, since it really is just a pointer under the hood. Since you are using a contrived example, I'll maintain that, but exclude the move of data to an arbitrary location on the stack: storing to a hard-coded offset like -24(%rbp) overwrites whatever the compiler happens to keep there, which is why it can fault. I'll also add the registers that get clobbered to the list. You could probably go with something like:
#include <thread>

void Foo(int &x)
{
    asm volatile ("movl $5, %%edx;"     // We clobber EDX
                  "movl $3, %%eax;"     // We clobber EAX
                  "addl %%edx, %%eax;"  // Result in EAX=3+5=8
                  "movl %%eax, %0;"     // Move to variable x
                  : "=r" (x) : : "eax", "edx" );
}

int main()
{
    int x;
    std::thread t1(Foo, std::ref(x));
    t1.join();
    return 0;
}
I've added "eax" and "edx" to the clobber list since we destroy them in our assembler template (and they don't appear as input or output constraints). You should also notice I don't use a temporary variable. The assembler code can be reduced to a single instruction since the contrived example is the equivalent of movl $8, %0;.
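That reduced version would look like this (a sketch of the single-instruction equivalent):

void Foo(int &x)
{
    asm volatile ("movl $8, %0"  // the whole contrived computation folded into one move
                  : "=r" (x));
}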
You could also use the memory address (reference) to x like this:
void Foo(int &x)
{
    asm volatile ("movl $5, %%edx;"     // We clobber EDX
                  "movl $3, %%eax;"     // We clobber EAX
                  "addl %%edx, %%eax;"  // Result in EAX=3+5=8
                  "movl %%eax, %0;"     // Move to variable x
                  : "=mr" (x) : : "eax", "edx" );
}
In this case I use =mr (memory operand or register) as an output constraint. This allows us to move a value right to the memory operand without an intermediate register.
I'm playing around with inline assembly in C++ using gcc-4.7 on 64-bit little endian Ubuntu 12.04 LTS with Eclipse CDT and gdb. The general direction of what I'm trying to do is to make some sort of bytecode interpreter for some esoteric stack-based programming language.
In this example, I process the instructions 4-bits at a time (in practice this will depend on the instruction), and when there are no more non-zero instructions (as 0 will be nop) I read the next 64-bit words.
I would like to ask though, how do I use a function-scoped label in inline assembly?
It seems labels in assembly are global, which is unfavourable, and I can't find a way to jump to a C++ function-scoped label from an assembly statement.
The following code is an example of what I'm trying to do (Note the comment):
...
register long ip asm("r8");
register long buf asm("r9");
register long op asm("r10");
...
fetch:
    asm("mov (%r8), %r9");
    asm("add $8, %r8");
control:
    asm("test %r9, %r9");
    asm("jz fetch"); // undefined reference to `fetch'
    asm("shr $4, %r9");
    asm("mov %r9, %r10");
    asm("and $0xf, %r10");
    switch (op) {
    ...
    }
    goto control;
Note the following comment from the gcc inline asm documentation:
Speaking of labels, jumps from one `asm' to another are not supported.
The compiler's optimizers do not know about these jumps, and therefore
they cannot take account of them when deciding how to optimize.
You also can't rely on the flags set in one asm being available in the next, as the compiler might insert something between them.
With gcc 4.5 and later, you can use asm goto to do what you want:
fetch:
    asm("mov (%r8), %r9");
    asm("add $8, %r8");
control:
    asm goto("test %r9, %r9\n\t"
             "jz %l[fetch]" : : : : fetch);
Note that all the rest of your asm is completely unsafe as it uses registers directly without declaring them in its read/write/clobbered lists, so the compiler may decide to put something else in them (despite the vars with the asm declarations on them -- it may decide that those are dead as they are never used). So if you expect this to actually work with -O1 or higher, you need to write it as:
...
long ip;
long buf;
long op;
...
fetch:
    asm("mov (%1), %0" : "=r"(buf) : "r"(ip));
    asm("add $8, %0" : "=r"(ip) : "0"(ip));
control:
    asm goto("test %0, %0\n\t"
             "jz %l[fetch]" : : "r"(buf) : : fetch);
    asm("shr $4, %0" : "=r"(buf) : "0"(buf));
    asm("mov %1, %0" : "=r"(op) : "r"(buf));
    asm("and $0xf, %0" : "=r"(op) : "0"(op));
At which point, it's much easier to just write it as C code:
long *ip, buf, op;
fetch:
do {
    buf = *ip++;
control:
    ;
} while (!buf);
op = (buf >>= 4) & 0xf;
switch (op) {
    ...
}
goto control;
You should be able to do this:
fetch:
    asm("afetch: mov (%r8), %r9");
...
    asm("jz afetch");
Alternatively, putting the label in a separate asm("afetch:"); should work as well. Note the different name to avoid conflicts - I'm not entirely sure that's necessary, but I suspect it is.
I'd like to atomically increment a 64-bit counter (long type in C++) in inline assembly. I know how to do that on a 32-bit value (int):
asm volatile("lock; incl %0" : "=m" (val) : "m"(val));
But I have no idea how to perform that on long value.
(Moved self-answer from the question to an answer.)
It was quite easy; I just wasn't familiar with x86-64.
asm volatile("lock; incq %0" : "=m" (val) : "m"(val));
That should be:
asm volatile("lock; incq %0" : "+m" (val));
Specifying separate operands without constraints that force input into the same location as the output could result in code such as:
val = something;
asm volatile("lock; incq %0" : "=m" (val) : "m"(val));
being optimised wrongly. You may also need a memory clobber to prevent accesses to other variables being moved past the asm.
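As an aside (my addition, not part of the original answer): unless you specifically need inline assembly, the compiler can emit the locked increment for you:

#include <atomic>

std::atomic<long> val{0};

void increment() {
    // Typically compiles to a single locked add/inc instruction.
    val.fetch_add(1, std::memory_order_seq_cst);
}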