We consider that we are using a GCC (or GCC-compatible) compiler on an x86-64 architecture, and that eax, ebx, ecx, edx and level are variables (unsigned int or unsigned int*) for the input and output of the instruction.
asm("CPUID":::);
asm volatile("CPUID":::);
asm volatile("CPUID":::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)::"memory");
asm volatile("CPUID":"=a"(eax):"0"(level):"memory");
asm volatile("CPUID"::"a"(level):"memory"); // Not sure of this syntax
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level));
I am not used to the inline assembly syntax, and I am wondering what would be the difference between all these calls, in a context where I just want to use CPUID as a serializing instruction (i.e. nothing will be done with the output of the instruction).
Can some of these calls lead to errors?
Which one(s) of these calls would be best suited (given that I want as little overhead as possible, but at the same time the "strongest" serialization possible)?
First of all, lfence may be strongly enough serializing for your use-case, e.g. for rdtsc. If you care about performance, check whether you can find evidence that lfence is strong enough (at least for your use-case). Possibly even using both mfence; lfence might be better than cpuid, if you want to e.g. drain the store buffer before an rdtsc.
But neither lfence nor mfence is serializing on the whole pipeline in the official technical-terminology meaning, which could matter for cross-modifying code: discarding instructions that might have been fetched before some stores from another core became visible.
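For instance, here is a hedged sketch of the mfence; lfence idea before a timestamp read (my own illustration, not code from the original answer):

unsigned int lo, hi;
// Drain the store buffer, then keep rdtsc from executing early.
asm volatile("mfence\n\tlfence" ::: "memory");
asm volatile("rdtsc" : "=a"(lo), "=d"(hi));  // EDX:EAX = timestamp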
Yes, all the ones that don't tell the compiler that the asm statement writes E[A-D]X are dangerous and will likely cause hard-to-debug weirdness (i.e. you need to use (dummy) output operands or clobbers).
You need volatile, because you want the asm code to be executed for the side-effect of serialization, not to produce the outputs.
If you don't want to use the CPUID result for anything (e.g. do double duty by serializing and querying something), you should simply list the registers as clobbers, not outputs, so you don't need any C variables to hold the results.
// volatile is already implied because there are no output operands
// but it doesn't hurt to be explicit.
// Serialize and block compile-time reordering of loads/stores across this
asm volatile("CPUID"::: "eax","ebx","ecx","edx", "memory");
// the "eax" clobber covers RAX in x86-64 code, you don't need an #ifdef __i386__
I am wondering what would be the difference between all these calls
First of all, none of these are "calls". They're asm statements, and are inlined into the function where you use them. CPUID itself is not a "call" either, although I guess you could look at it as calling a microcode function built into the CPU. But by that logic, every instruction is a "call"; e.g. mul rcx takes inputs in RAX and RCX, and returns the result in RDX:RAX.
The first three (and the later one with no outputs, just a level input) destroy RAX through RDX without telling the compiler. The compiler will assume that those registers still hold whatever it was keeping in them, so these variants are obviously unusable.
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory"); (the one without volatile) will optimize away if you don't use any of the outputs. And if you do use them, it can still be hoisted out of loops. A non-volatile asm statement is treated by the optimizer as a pure function with no side effects. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#index-asm-volatile
It has a memory clobber, but (I think) that doesn't stop it from optimizing away, it just means that if / when / where it does run, any variables it could possibly read / write are synced to memory, so memory contents match what the C abstract machine would have at that point. This may exclude locals that haven't had their address taken, though.
asm("" ::: "memory") is very similar to std::atomic_thread_fence(std::memory_order_seq_cst), but note that that asm statement has no outputs, and thus is implicitly volatile. That's why it isn't optimized away, not because of the "memory" clobber itself. A (volatile) asm statement with a memory clobber is a compiler barrier against reordering loads or stores across it.
The optimizer doesn't care at all what's inside the first string literal, only the constraints / clobbers, so asm volatile("anything" ::: register clobbers, "memory") is also a compile-time-only memory barrier. I assume this is what you want, to serialize some memory operations.
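For example, this hedged sketch (my own, not from the original answer) uses an empty asm statement as a pure compile-time barrier; it emits no instruction, but the compiler cannot move the store to ready above the store to *data:

int ready;
void publish(int *data) {
    *data = 42;
    asm volatile("" ::: "memory");  // compiler barrier only; CPU ordering unaffected
    ready = 1;
}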
"0"(level) is a matching constraint for the first operand (the "=a"). You could equally have written "a"(level), because in this case the compiler doesn't have a choice of which register to select; the output constraint can only be satisfied by eax. You could also have used "+a"(eax) as the output operand, but then you'd have to set eax=level before the asm statement. Matching constraints instead of read-write operands are sometimes necessary for x87 stack stuff; I think that came up once in an SO question. But other than weird stuff like that, the advantage is being able to use different C variables for input and output, or not using a variable at all for the input. (e.g. a literal constant, or an lvalue (expression)).
Anyway, telling the compiler to provide an input will probably result in an extra instruction, e.g. level=0 would result in an xor-zeroing of eax. This would be a waste of an instruction if it didn't already need a zeroed register for anything earlier. Normally xor-zeroing an input would break a dependency on the previous value, but the whole point of CPUID here is that it's serializing, so it has to wait for all previous instructions to finish executing anyway. Making sure eax is ready early is pointless; if you don't care about the outputs, don't even tell the compiler your asm statement takes an input. Compilers make it difficult or impossible to use an undefined / uninitialized value with no overhead; sometimes leaving a C variable uninitialized will result in loading garbage from the stack, or zeroing a register, instead of just using a register without writing it first.
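To illustrate that last point with a hedged sketch (my own, assuming level == 0): declaring an input forces the compiler to materialize it, here most likely as an xor-zeroing of EAX before the CPUID:

unsigned int level = 0;
asm volatile("CPUID"
             : "+a"(level)                     // EAX: in = level, out = discarded
             :: "ebx", "ecx", "edx", "memory");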
I am in the process of optimizing the code for my n-body simulator, and when profiling it I have seen this:
These two lines,
float diffX = (pNode->CenterOfMassx - pBody->posX);
float diffY = (pNode->CenterOfMassy - pBody->posY);
where pNode is a pointer to an object of type Node that I have defined, which contains (among other things) two floats, CenterOfMassx and CenterOfMassy,
and where pBody is a pointer to an object of type Body that I have defined, which contains (among other things) two floats, posX and posY.
These lines should take the same amount of time, but they do not. In fact the first line accounts for 0.46% of function samples, but the second accounts for 5.20%.
Now I can see the second line has 3 instructions, and the first only has one.
My question is: why do these two lines, which seemingly do the same thing, behave so differently in practice?
As previously stated, the profiler lists only one assembly instruction for the first line, but three for the second line. However, because the optimizer can move code around a lot, this isn't very meaningful. It looks like the code was optimized to load all of the values into registers first, and then perform the subtractions. So it performs an action from the first line, then an action from the second line (the loads), followed by an action from the first line and an action from the second line (the subtractions). Since this is difficult to represent, the profiler just makes a best approximation of which source line corresponds to which assembly code when displaying the disassembly inline with the code.
Take note that the first load is executed and may still be in the CPU pipeline when the next load instruction is executing. The second load has no dependence on the registers used in the first load. However, the first subtraction does. This instruction requires the previous load instruction to be far enough in the pipeline that the result can be used as one of the operands of the subtraction. This will likely cause a stall in the CPU while the pipeline lets the load finish.
All of this really reinforces the concept of memory optimizations being more important on modern CPUs than CPU optimizations. If, for example, you had loaded the required values into registers 15 instructions earlier, the subtractions might have occurred much quicker.
Generally the best thing you can do for optimization is to keep the cache fresh with the memory you are going to be using, and to make sure it gets updated as soon as possible, not right before the memory is needed. Beyond that, optimizations are complicated.
Of course, all of this is further complicated by modern CPUs that might look ahead 40-60 instructions for out-of-order execution.
To optimize this further, you might consider using a library that does vector and matrix operations in an optimized manner. Using one of these libraries, it might be possible to use two vector instructions instead of 4 scalar instructions.
According to the expanded assembly, the instructions are reordered to load pNode's data members before performing the subtraction with pBody's data members. The purpose may be to take advantage of memory caching.
Thus, the execution order is no longer the same as in the C code. Comparing the 1 movss attributed to the 1st C statement with the 1 movss + 2 subss attributed to the 2nd C statement is unfair.
Performance counters aren't cycle-accurate. Sometimes the wrong instruction gets blamed. But in this case, it's probably pointing the finger at the instruction that was producing the result everything else was waiting for.
So probably it ran out of things it could do while waiting for the result of the memory access and FP sub. If cache misses are happening, look for ways to structure your code for better memory locality, or at least for memory access to happen in-order. Hardware prefetchers can detect sequential access patterns up to some limit of stride length.
Also, your compiler could have vectorized this. It loads two scalars from sequential addresses, then subtracts two scalars from sequential addresses. It would be faster to
movq xmm0, [esi+30h] # or movlps, but that wouldn't break the dep
movq xmm1, [edi] # on the old value of xmm0 / xmm1
subps xmm0, xmm1
That leaves diffX and diffY in elements 0 and 1 of xmm0, rather than in 2 different regs, so the benefit depends on the surrounding code.
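The same idea with intrinsics, as a hedged sketch (it assumes CenterOfMassx/CenterOfMassy and posX/posY are adjacent floats in their structs, as the question describes):

#include <emmintrin.h>

// Load each pair of adjacent floats with one 64-bit load,
// then do both subtractions with a single subps.
__m128 com  = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i*)&pNode->CenterOfMassx));
__m128 pos  = _mm_castsi128_ps(_mm_loadl_epi64((const __m128i*)&pBody->posX));
__m128 diff = _mm_sub_ps(com, pos);  // diffX in element 0, diffY in element 1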
I recall from Agner Fog's excellent guide that 64-bit Linux can pass 6 integer function parameters via registers:
http://www.agner.org/optimize/optimizing_cpp.pdf
(page 8)
I have the following function:
void x(signed int a, uint b, char c, uint d, uint e, signed short f);
and I need to pass an additional unsigned short parameter, which would make 7 in total. However, I can actually derive the value of the 7th from one of the existing 6.
So my question is which of the following is a better practice for performance:
Passing the already-calculated value as a 7th argument on 64-bit Linux
Not passing the already-calculated value, but recalculating it in the callee using one of the existing 6 arguments.
The operation in question is a simple bitwise AND:
unsigned short g = c & 1;
Not fully understanding x86 assembly, I am not too sure how precious registers are, and whether it is better to recalculate a value as a local variable than to pass it through function calls as an argument.
My belief is that it would be better to calculate the value twice because it is such a simple 1 CPU cycle task.
EDIT: I know I could just profile this, but I'd also like to understand what is happening under the hood with both approaches. Does having a 7th argument mean that cache/memory is involved, rather than just registers?
The machine convention for passing arguments is called the application binary interface (ABI), and for Linux x86-64 it is described in the x86-64 ABI spec. See also the x86 calling conventions wikipage.
In your case, it is probably not worthwhile to pass c & 1 as an additional parameter (since that 7th parameter would be passed on the stack).
Don't forget that current processor cores (on desktop or laptop computers) are often doing out-of-order execution and are superscalar, so the c & 1 operation could be done in parallel with other operations and might cost "nothing".
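To make the two options concrete, here is a hedged sketch of the recompute-in-the-callee approach (using the question's signature; the typedef and body are my own illustration):

typedef unsigned int uint;   // the question's shorthand

// Option 2: keep six register arguments and recompute g locally.
// The AND is one cheap ALU op that can execute in parallel with other work.
void x(signed int a, uint b, char c, uint d, uint e, signed short f)
{
    unsigned short g = c & 1;   // derived locally, no stack traffic
    /* ... use g ... */
}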
But leave such micro-optimizations to the compiler. If you care a lot about performance, use a recent GCC 4.8 compiler with gcc-4.8 -O3 -flto both for compiling and for linking (i.e. enable link-time optimization).
BTW, cache performance is much more relevant than such micro-optimizations. A single cache miss may take the same time (e.g. 250 nanoseconds) as hundreds of CPU machine instructions. Current CPUs are rumored to mostly wait for the caches. You might want to add a few explicit (and judicious) calls to __builtin_prefetch (see this question and this answer). But adding too many of these prefetches would slow down your code.
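For illustration, a minimal sketch of a judicious prefetch while streaming through an array (the look-ahead distance of 8 is an assumption to tune, not a figure from the original answer):

#include <stddef.h>

float sum(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8]);  // a hint; the CPU may ignore it
        s += a[i];
    }
    return s;
}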
Finally, readability and maintainability of your code should matter much more than raw performance!
Basile's answer is good, I'll just point out another thing to keep in mind:
a) The stack is very likely to be in L1 cache, so passing arguments on the stack should not take more than ~3 cycles extra.
b) The ABI (x86-64 System V, in this case) requires clobbered registers to be restored. Some are saved by the caller, others by the callee. Obviously, the registers used to pass arguments must be saved by the caller if their original contents are needed again. But when your function needs more scratch registers than the caller-saved set provides, any additional temporary results must go into callee-saved registers. So the function ends up spilling a register to the stack, reusing the register for your temporary variable, and then popping the original value back.
The only way you can avoid accessing memory is by using a smaller, simpler function that needs fewer temporary variables.
I can't find a neat explanation of how I'm supposed to write a piece of inline asm, and of the problems that can possibly arise from concurrent use of a function foo that contains asm code.
The problem that I see is that in asm, registers are uniquely named, so one name is strictly tied to one precise portion of your CPU. That's a big problem if you are writing one piece of code that is supposed to run concurrently, because you can't simply conjure up extra registers with the same name.
The other problem is that asm doesn't really use a calling convention; you simply use registers and/or values, and sometimes touching a register implies a silent action on another register that doesn't even show up explicitly in your code. So I can't even expect that my C/C++ function foo will be packed and sealed inside its own stack frame if it contains asm code.
Now, with what gcc calls extended asm, I can basically declare where the inputs and outputs go, so each function can use its own parameters "as registers", and the pattern is the following:
asm ( assembler template
    : output operands
    : input operands
    : clobbered registers
);
Assuming that my main target for now is mathematical operations, and that my function is only supposed to provide a certain piece of functionality and perform some computation (no internal locking): is extended asm good for concurrency? How should I design a piece of asm that is supposed to be used by a concurrent application?
For now I'm using gcc, but I would like a generic answer about the general asm design that I'm supposed to apply to this kind of code snippet.
You seem to be misunderstanding what threading actually is. Let's consider a single-processor system first. The threads don't actually run concurrently, since there is only one unit that can decode and execute them. Your operating system only creates the illusion of running multiple threads (and processes, too) by employing scheduling inside of it: every thread, or process, is allocated a certain amount of time it gets to execute on the processor.
This is why, when threads are executed, they don't overwrite each other's registers. When the currently executing thread or process is switched, the operating system asks the processor to perform something called a context switch. In a nutshell, the processor saves the state it had while executing the previous task/thread/process into some memory area controlled by the OS. The new task/thread/process has its context restored from the previously stored state and continues its execution. When this task/thread/process's time slice on the CPU is up, the scheduler decides which task/thread/process to resume next. The time slice is usually very small, which is why you're given the illusion of multiple streams of code running at the same time. Keep in mind that this is a very, very simplified description: refer to CPU manuals or books on operating systems for more detail.
The situation is analogous on multi-processor systems, with the exception that there is more than one unit that can execute the instructions. This is also true for multi-core processors: every one of the cores has its own set of registers. The basic mechanism stays the same: the scheduler in your OS decides whether the code being executed actually runs at the same time on multiple cores in one processor.
Thus, your concerns in this case are not valid. However, they were raised for very valid reasons. Remember that the only thing threads share is the main memory: each thread has its own registers, and its own stack.
Let me come back to the actual question about gcc's extended inline assembly. The compiler itself cannot work out which registers are modified by the assembly you wrote. That's why you need to specify them. However, it is very rare that an instruction modifies a register without you being able to control it, and it happens only with a small number of instructions, assuming that we're talking about x86. Moreover, gcc can work out the destination/source operands by itself when you want to refer to a C/C++ variable from inside the assembly. In fact, this is the preferred method, since it leaves the compiler much more room for optimization.
Consider this piece of code:
unsigned int get_cr0(void)
{
    unsigned int rc;
    __asm__ (
        "movl %%cr0, %0\n"
        : "=r"(rc)
        :
        :
    );
    return rc;
}
This function's purpose is to return the contents of the control register cr0. This is a privileged instruction, so the program will not work when you run it in user mode, but this is not important right now. See how I put %0 in the instruction, and then specified "=r"(rc) in the output list. This means that %0 will be automagically aliased by the compiler to your rc variable. You can do this for every variable you specify on the input/output list. They are numbered starting from zero, as you can see.
I can't really remember which instructions use registers that are not encoded as operands, so I can't give you an example right now. In that case, you would need to put them on the clobber list (the last one). I'm pretty sure you can refer to this for more information.
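One common case (my example, not the original author's) is mul: it writes the high half of the product into EDX implicitly, so if EDX is neither an input nor an output operand, it must be listed as a clobber:

unsigned int mul_low(unsigned int x, unsigned int y)
{
    unsigned int low;
    __asm__ ("mull %2"
             : "=a"(low)        // low half of the product, in EAX
             : "a"(x), "r"(y)   // EAX * operand 2
             : "edx");          // high half lands in EDX implicitly
    return low;
}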
I also can't answer anything regarding "general asm design", since this is a non-standard extension and thus varies between compilers. The 64-bit Visual Studio compilers don't support it at all, for example.
In LLVM, can one trace back to the instruction that defines the value of a particular register? For example, if I have an instruction such as:
%add14 = add i32 %add7, %add5
Is there a way for me to trace back to the instruction where %add5 is defined?
First of all, there are no registers in LLVM IR: all those things with % in their names are just names of values. You don't store information inside those things; they are not variables or memory locations, they are just names. I recommend reading about SSA form, which helps explain this further.
In any case, what you need to do is invoke the getOperand(n) method on the instruction to get its nth operand - for example, getOperand(0) in your example will return the value named %add7. You can then check whether that value is indeed an instruction (as opposed to, say, a function argument) by checking its type (isa<Instruction>).
To emphasize - calling the getOperand method will give you the actual place in which the operand is defined, nothing else is required.
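A minimal sketch of what that looks like with the C++ API (my own illustration; it assumes you already hold the add as an llvm::Instruction *I):

#include "llvm/IR/Instruction.h"
#include "llvm/Support/Casting.h"
#include "llvm/Support/raw_ostream.h"

void traceSecondOperand(llvm::Instruction *I) {
    llvm::Value *Op = I->getOperand(1);                   // the value named %add5
    if (auto *Def = llvm::dyn_cast<llvm::Instruction>(Op)) {
        Def->print(llvm::errs());                         // Def is the defining instruction
    }
}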