Interchange 2 variables in C++ with asm code - c++

I have a huge function that sorts a very large amount of int data. The code works fine except the fact that it's slower that it should be. My first step into solving this is to place some asm code inside C++. How can I interchange 2 variables using asm? I've tried this:
_asm{ push a[x]; push a[y]; pop a[x]; pop a[y];}
and this:
_asm(mov eax, a[x];mov ebx,a[y]; mov a[x],ebx; mov a[y],eax;}
but both crash. How can I save some time on these interchanges ? I use VS_2010

In general, it is very difficult to do better than your compiler with simple code like this.
A compiler, when faced with a swap operation on integers, will typically issue code like this:
mov eax, [x]
mov ebx, [y]
mov [x], ebx
mov [y], eax
Before you try to override, first check what the compiler is actually generating. If it's something like this, don't bother going any further; you won't be able to do better than this. Moreover, if you leave it to the compiler, it may, if these variables are used immediately thereafter, choose to reuse one of these registers to save on variable loads/stores as well. This is impossible with hand-coded assembly; the compiler must reload the variables after the black box that is hand-coded asm.
Note that the push/push/pop/pop sequence is likely to be much slower; not only does it add an additional four memory operations to the stack, it also introduces dependencies on the stack pointer, eliminating any possibility of pipelining. With the simple mov sequence, it is at least possible to run the pair of reads and pair of writes in parallel if they are on different memory banks, or one is in cache, etc. It also does not introduce stalls on the stack pointer in later code.
As such, you should not try to micro-optimize the cost of an interchange; instead, reduce the number of interchanges performed. There are many sorting algorithms available, each with slightly different characteristics. You may find some are better (cause less swaps) on your dataset than others.

What makes you think you can produce faster assembly than an optimizing compiler?
Even if you'll get it to work properly, all you're likely to achieve is to confuse the optimizer to produce even slower code.

When you do in-line assembly, you can change things so that assumptions the compiler has made about register contents will no longer be true. Often times EAX is used to pass a parameter or return a value, so trashing EAX might not have much effect, but you clobbered EBX and didn't put it back, and that could cause problems. Try pushing EBX before you use it, then pop it when you are done.

You can use the variable names, function names and labels in assembly code as symbols. Note that things like a[x] is not such valid symbol.
Writing more efficient code takes skill and knowledge, using asm does not necessarily help you there.
You can compare assembly code that your compiler produces for both the function with inline assembler and without to see where you did break it.

Related

Using inline assembly with serialization instructions

We consider that we are using GCC (or GCC-compatible) compiler on a X86_64 architecture, and that eax, ebx, ecx, edx and level are variables (unsigned int or unsigned int*) for input and output of the instruction (like here).
asm("CPUID":::);
asm volatile("CPUID":::);
asm volatile("CPUID":::"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx)::"memory");
asm volatile("CPUID":"=a"(eax):"0"(level):"memory");
asm volatile("CPUID"::"a"(level):"memory"); // Not sure of this syntax
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory");
asm volatile("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level));
I am not used to the inline assembly syntax, and I am wondering what would be the difference between all these calls, in a context where I just want to use CPUID as a serializing instruction (e.g. nothing will be done with the output of the instruction).
Can some of these calls lead to errors?
Which one(s) of these calls would be the most suited (given that I want the least overhead as possible, but at the same time the "strongest" serialization possible)?
First of all, lfence may be strongly serializing enough for your use-case, e.g. for rdtsc. If you care about performance, check and see if you can find evidence that lfence is strong enough (at least for your use-case). Possibly even using both mfence; lfence might be better than cpuid, if you want to e.g. drain the store buffer before an rdtsc.
But neither lfence nor mfence are serializing on the whole pipeline in the official technical-terminology meaning, which could matter for cross-modifying code - discarding instructions that might have been fetched before some stores from another core became visible.
2. Yes, all the ones that don't tell the compiler that the asm statement writes E[A-D]X are dangerous and will likely cause hard-to-debug weirdness. (i.e. you need to use (dummy) output operands or clobbers).
You need volatile, because you want the asm code to be executed for the side-effect of serialization, not to produce the outputs.
If you don't want to use the CPUID result for anything (e.g. do double duty by serializing and querying something), you should simply list the registers as clobbers, not outputs, so you don't need any C variables to hold the results.
// volatile is already implied because there are no output operands
// but it doesn't hurt to be explicit.
// Serialize and block compile-time reordering of loads/stores across this
asm volatile("CPUID"::: "eax","ebx","ecx","edx", "memory");
// the "eax" clobber covers RAX in x86-64 code, you don't need an #ifdef __i386__
I am wondering what would be the difference between all these calls
First of all, none of these are "calls". They're asm statements, and inline into the function where you use them. CPUID itself is not a "call" either, although I guess you could look at it as calling a microcode function built-in to the CPU. But by that logic, every instruction is a "call", e.g. mul rcx takes inputs in RAX and RCX, and returns in RDX:RAX.
The first three (and the later one with no outputs, just a level input) destroy RAX through RDX without telling the compiler. It will assume that those registers still hold whatever it was keeping in them. They're obviously unusable.
asm("CPUID":"=a"(eax),"=b"(ebx),"=c"(ecx),"=d"(edx):"0"(level):"memory"); (the one without volatile) will optimize away if you don't use any of the outputs. And if you do use them, it can still be hoisted out of loops. A non-volatile asm statement is treated by the optimizer as a pure function with no side effects. https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#index-asm-volatile
It has a memory clobber, but (I think) that doesn't stop it from optimizing away, it just means that if / when / where it does run, any variables it could possibly read / write are synced to memory, so memory contents match what the C abstract machine would have at that point. This may exclude locals that haven't had their address taken, though.
asm("" ::: "memory") is very similar to std::atomic_thread_fence(std::memory_order_seq_cst), but note that that asm statement has no outputs, and thus is implicitly volatile. That's why it isn't optimized away, not because of the "memory" clobber itself. A (volatile) asm statement with a memory clobber is a compiler barrier against reordering loads or stores across it.
The optimizer doesn't care at all what's inside the first string literal, only the constraints / clobbers, so asm volatile("anything" ::: register clobbers, "memory") is also a compile-time-only memory barrier. I assume this is what you want, to serialize some memory operations.
"0"(level) is a matching constraint for the first operand (the "=a"). You could equally have written "a"(level), because in this case the compiler doesn't have a choice of which register to select; the output constraint can only be satisfied by eax. You could also have used "+a"(eax) as the output operand, but then you'd have to set eax=level before the asm statement. Matching constraints instead of read-write operands are sometimes necessary for x87 stack stuff; I think that came up once in an SO question. But other than weird stuff like that, the advantage is being able to use different C variables for input and output, or not using a variable at all for the input. (e.g. a literal constant, or an lvalue (expression)).
Anyway, telling the compiler to provide an input will probably result in an extra instruction, e.g. level=0 would result in an xor-zeroing of eax. This would be a waste of an instruction if it didn't already need a zeroed register for anything earlier. Normally xor-zeroing an input would break a dependency on the previous value, but the whole point of CPUID here is that it's serializing, so it has to wait for all previous instructions to finish executing anyway. Making sure eax is ready early is pointless; if you don't care about the outputs, don't even tell the compiler your asm statement takes an input. Compilers make it difficult or impossible to use an undefined / uninitialized value with no overhead; sometimes leaving a C variable uninitialized will result in loading garbage from the stack, or zeroing a register, instead of just using a register without writing it first.

How is it known that variables are in registers, or on stack?

I am reading this question about inline on isocpp FAQ, the code is given as
void f()
{
int x = /*...*/;
int y = /*...*/;
int z = /*...*/;
// ...code that uses x, y and z...
g(x, y, z);
// ...more code that uses x, y and z...
}
then it says that
Assuming a typical C++ implementation that has registers and a stack,
the registers and parameters get written to the stack just before the
call to g(), then the parameters get read from the stack inside
g() and read again to restore the registers while g() returns to
f(). But that’s a lot of unnecessary reading and writing, especially
in cases when the compiler is able to use registers for variables x,
y and z: each variable could get written twice (as a register and
also as a parameter) and read twice (when used within g() and to
restore the registers during the return to f()).
I have a big difficulty understanding the paragraph above. I try to list my questions as below:
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
PS
It's very hard to choose an acceptable answer when the answers are all very good(E.g., the ones provided by #MatsPeterson, #TheodorosChatzigiannakis, and #superultranova) I think. I personally like the one by #Potatoswatter a little bit more since the answer offers some guidelines.
Don't take that paragraph too seriously. It seems to be making excessive assumptions and then going into excessive detail, which can't really be generalized.
But, your questions are very good.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
More-or-less, everything needs to be loaded into registers. Most computers are organized around a datapath, a bus connecting the registers, the arithmetic circuits, and the top level of the memory hierarchy. Usually, anything that is broadcast on the datapath is identified with a register.
You may recall the great RISC vs CISC debate. One of the key points was that a computer design can be much simpler if the memory is not allowed to connect directly to the arithmetic circuits.
In modern computers, there are architectural registers, which are a programming construct like a variable, and physical registers, which are actual circuits. The compiler does a lot of heavy lifting to keep track of physical registers while generating a program in terms of architectural registers. For a CISC instruction set like x86, this may involve generating instructions that send operands in memory directly to arithmetic operations. But behind the scenes, it's registers all the way down.
Bottom line: Just let the compiler do its thing.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Each platform defines a way for C functions to call each other. Passing parameters in registers is more efficient. But, there are trade-offs and the total number of registers is limited. Older ABIs more often sacrificed efficiency for simplicity, and put them all on the stack.
Bottom line: The example is arbitrarily assuming a naive ABI.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
The compiler tends to prefer to use registers for more frequently accessed values. Nothing in the example requires the use of the stack. However, less frequently accessed values will be placed on the stack to make more registers available.
Only when you take the address of a variable, such as by &x or passing by reference, and that address escapes the inliner, is the compiler required use memory and not registers.
Bottom line: Avoid taking addresses and passing/storing them willy-nilly.
It is entirely up to the compiler (in conjunction with the processor type) whether a variable is stored in memory or a register [or in some cases more than one register] (and what options you give the compiler, assuming it's got options to decide such things - most "good" compilers do). For example, the LLVM/Clang compiler uses a specific optimisation pass called "mem2reg" that moves variables from memory to registers. The decision to do so is based on how the variable(s) are used - for example, if you take the address of a variable at some point, it needs to be in memory.
Other compilers have similar, but not necessarily identical, functionality.
Also, at least in compilers that have some semblance of portability, there will ALSO be a phase of generatinc machine code for the actual target, which contains target-specific optimisations, which again can move a variable from memory to a register.
It is not possible [without understanding how the particular compiler works] to determine if the variables in your code are in registers or in memory. One can guess, but such a guess is just like guessing other "kind of predictable things", like looking out the window to guess if it's going to rain in a few hours - depending on where you live, this may be a complete random guess, or quite predictable - some tropical countries, you can set your watch based on when the rain arrives each afternoon, in other countries, it rarely rains, and in some countries, like here in England, you can't know for certain beyond "right now it is [not] raining right here".
To answer the actual questions:
This depends on the processor. Proper RISC processors such as ARM, MIPS, 29K, etc have no instructions that use memory operands except the load and store type instructions. So if you need to add two values, you need to load the values into registers, and use the add operation on those registers. Some, such as x86 and 68K allows one of the two operands to be a memory operand, and for example PDP-11 and VAX have "full freedom", whether your operands are in memory or register, you can use the same instruction, just different addressing modes for the different operands.
Your original premise here is wrong - it's not guaranteed that arguments to g are on the stack. That is just one of many options. Many ABIs (application binary interface, aka "calling conventions) use registers for the first few arguments to a function. So, again, it depends on which compiler (to some degree) and what processor (much more than which compiler) the compiler targets whether the arguments are in memory or in registers.
Again, this is a decision that the compiler makes - it depends on how many registers the processor has, which are available, what the cost is if "freeing" some register for x, y and z - which ranges from "no cost at all" to "quite a bit" - again, depending on the processor model and the ABI.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
Not even this statement is always true. It is probably true for all the platforms you'll ever work with, but there surely can be another architecture that doesn't make use of processor registers at all.
Your x86_64 computer does however.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
These two questions cannot be uniquely answered for any compiler and system your code will be compiled on. They cannot even be taken for granted since g's parameters might not be on the stack, it all depends on several concepts I'll explain below.
First you should be aware of the so-called calling conventions which define, among the other things, how function parameters are passed (e.g. pushed on the stack, placed in registers, or a mix of both). This isn't enforced by the C++ standard and calling conventions are a part of the ABI, a broader topic regarding low-level machine code program issues.
Secondly register allocation (i.e. which variables are actually loaded in a register at any given time) is a complex task and a NP-complete problem. Compilers try to do their best with the information they have. In general less frequently accessed variables are put on the stack while more frequently accessed variables are kept on registers. Thus the part Where the data inside g() is stored, register or stack? cannot be answered once-and-for-all since it depends on many factors including register pressure.
Not to mention compiler optimizations which might even eliminate the need for some variables to be around.
Finally the question you linked already states
Naturally your mileage may vary, and there are a zillion variables that are outside the scope of this particular FAQ, but the above serves as an example of the sorts of things that can happen with procedural integration.
i.e. the paragraph you posted makes some assumptions to set things up for an example. Those are just assumptions and you should treat them as such.
As a small addition: regarding the benefits of inline on a function I recommend taking a look at this answer: https://stackoverflow.com/a/145952/1938163
You can't know, without looking at the assembly language, whether a variable is in a register, stack, heap, global memory or elsewhere. A variable is an abstract concept. The compiler is allowed to use registers or other memory as it chooses, as long as the execution isn't changed.
There's also another rule that affects this topic. If you take the address of a variable and store into a pointer, the variable may not be placed into a register because registers don't have addresses.
The variable storage may also depend on the optimization settings for the compiler. Variables can disappear due to simplification. Variables that don't change value may be placed into the executable as a constant.
Regarding your #1 question, yes, non load/store instructions operate on registers.
Regarding your #2 question, if we are assuming that parameters are passed on the stack, then we have to write the registers to the stack, otherwise g() won't be able to access the data, since the code in g() doesn't "know" which registers the parameters are in.
Regarding your #3 question, it is not known that x, y and z will for sure be stored in registers in f(). One could use the register keyword, but that's more of a suggestion. Based on the calling convention, and assuming the compiler doesn't do any optimization involving parameter passing, you may be able to predict whether the parameters are on the stack or in registers.
You should familiarize yourself with calling conventions. Calling conventions deal with the way that parameters are passed to functions and typically involve passing parameters on the stack in a specified order, putting parameters into registers or a combination of both.
stdcall, cdecl, and fastcall are some examples of calling conventions. In terms of parameter passing, stdcall and cdecl are the same, in the parameters are pushed in right to left order onto the stack. In this case, if g() was cdecl or stdcall the caller would push z,y,x in that order:
mov eax, z
push eax
mov eax, x
push eax
mov eax, y
push eax
call g
In 64bit fastcall, registers are used, microsoft uses RCX, RDX, R8, R9 (plus the stack for functions requiring more than 4 params), linux uses RDI, RSI, RDX, RCX, R8, R9. To call g() using MS 64bit fastcall one would do the following (we assume z, x, and y are not in registers)
mov rcx, x
mov rdx, y
mov r8, z
call g
This is how assembly is written by humans, and sometimes compilers. Compilers will use some tricks to avoid passing parameters, as it typically reduces the number of instructions and can reduce the number of time memory is accessed. Take the following code for example (I'm intentionally ignoring non-volatile register rules):
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx y
call g
mov rcx, rax
ret
g:
mov rax, rsi
add rax, rcx
add rax, rdx
ret
For illustrative purposes, rcx is already in use, and x has been loaded into rsi. The compiler can compile g such that it uses rsi instead of rcx, so values don't have to be swapped between the two registers when it comes time to call g. The compiler could also inline g, now that f and g share the same set of registers for x, y, and z. In that case, the call g instruction would be replaced with the contents of g, excluding the ret instruction.
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx y
mov rax, rsi
add rax, rcx
add rax, rdx
mov rcx, rax
ret
This will be even faster, because we don't have to deal with the call instruction, since g has been inlined into f.
Short answer: You can't. It completely depends on your compiler and the optimizing features enabled.
The compiler concern is to translate into assembly your program, but how it is done is tighly coupled to how your compiler works.
Some compilers allows you hint what variable map to register.
Check for example this: https://gcc.gnu.org/onlinedocs/gcc/Global-Reg-Vars.html
Your compiler will apply transformations to your code in order to gain something, may be performance, may be lower code size, and it apply cost functions to estimate this gains, so you normally only can see the result disassembling the compilated unit.
Variables are almost always stored in main memory. Many times, due to compiler optimizations, value of your declared variable will never move to main memory but those are intermediate variable that you use in your method which doesn't hold relevance before any other method is called (i.e. occurrence of stack operation).
This is by design - to improve performance as it is easier (and much faster) for processor to address and manipulate data in registers. Architectural registers are limited in size so everything cannot be put in registers. Even if you 'hint' your compiler to put it in register, eventually, OS may manage it outside register, in main memory, if available registers are full.
Most probably, a variable will be in main memory because it hold relevance further in the near execution and may hold reliance for longer period of CPU time. A variable is in architectural register because it holds relevance in upcoming machine instructions and execution will be almost immediate but may not be relevant for long.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
This depends on the architecture and the instruction set it offers. But in practice, yes - it is the typical case.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
Assuming the compiler doesn't eliminate the local variables, it will prefer to put them in registers, because registers are faster than the stack (which resides in the main memory, or a cache).
But this is far from a universal truth: it depends on the (complicated) inner workings of the compiler (whose details are handwaved in that paragraph).
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Even if we assume that the variables are, in fact, stored in the registers, when you call a function, the calling convention kicks in. That's a convention that describes how a function is called, where the arguments are passed, who cleans up the stack, what registers are preserved.
All calling conventions have some kind of overhead. One source of this overhead is the argument passing. Many calling conventions attempt to reduce that, by preferring to pass arguments through registers, but since the number of CPU registers is limited (compared to the space of the stack), they eventually fall back to pushing through the stack after a number of arguments.
The paragraph in your question assumes a calling convention that passes everything through the stack and based on that assumption, what it's trying to tell you is that it would be beneficial (for execution speed) if we could "copy" (at compile time) the body of the called function inside the caller (instead of emitting a call to the function). This would yield the same results logically, but it would eliminate the runtime cost of the function call.

Just-in-Time compilation of Java bytecode

We are currently working on the JIT compilation part of our own Java Virtual Machine implementation. Our idea was now to do a simple translation of the given Java bytecode into opcodes, writing them to executable memory and CALLing right to the start of the method.
Assuming the given Java code would be:
int a = 13372338;
int b = 32 * a;
return b;
Now, the following approach was made (assuming that the given memory starts at 0x1000 & the return value is expected in eax):
0x1000: first local variable - accessible via [eip - 8]
0x1004: second local variable - accessible via [eip - 4]
0x1008: start of the code - accessible via [eip]
Java bytecode | Assembler code (NASM syntax)
--------------|------------------------------------------------------------------
| // start
| mov edx, eip
| push ebx
|
| // method content
ldc | mov eax, 13372338
| push eax
istore_0 | pop eax
| mov [edx - 8], eax
bipush | push 32
iload_0 | mov eax, [edx - 8]
| push eax
imul | pop ebx
| pop eax
| mul ebx
| push eax
istore_1 | pop eax
| mov [edx - 4], eax
iload_1 | mov eax, [edx - 4]
| push eax
ireturn | pop eax
|
| // end
| pop ebx
| ret
This would simply use the stack just like the virtual machine does itself.
The questions regarding this solution are:
Is this method of compilation viable?
Is it even possible to implement all the Java instructions this way? How could things like athrow/instanceof and similar commands be translated?
This method of compilation works, is easy to get up and running, and it at least removes interpretation overhead. But it results in pretty large amounts of code and pretty awful performance. One big problem is that it transliterates the stack operations 1:1, even though the target machine (x86) is a register machine. As you can see in the snippet you posted (as well as any other code), this always results in several stack manipulation opcodes for every single operation, so it uses the registers - heck, the the whole ISA - about as ineffectively as possible.
You can also support complicated control flow such as exceptions. It's not very different from implementing it in an interpreter. If you want good performance you don't want to perform work every time you enter or exit a try block. There are schemes to avoid this, used by both C++ and other JVMs (keyword: zero-cost or table-driven exception handling). These are quite complex and complicated to implement, understand and debug, so you should go with a simpler alternative first. Just keep it in mind.
As for the generated code: The first optimization, one which you'll almost definitely will need, is converting the stack operations into three address code or some other representation that uses registers. There are several papers on this and implementations of this, so I won't elaborate unless you want me to. Then, of course, you need to map these virtual registers onto physical registers. Register allocation is one of the most well-researched topics in compiler constructions, and there are at least half a dozen heuristics that are reasonably effective and fast enough to use in a JIT compiler. One example off the top of my head is linear scan register allocation (specifically creates for JIT compilation).
Beyond that, most JIT compilers focused on performance of the generated code (as opposed to quick compilation) use one or more intermediate formats and optimize the programs in this form. This is basically your run of the mill compiler optimization suite, including veterans like constant propagation, value numbering, re-association, loop invariant code motion, etc. - these things are not only simple to understand and implement, they've also been described in thirty years of literature up to and including textbooks and Wikipedia.
The code you'll get with the above will be pretty good for straigt-line code using primitives, arrays and object fields. However, you won't be able to optimize method calls at all. Every method is virtual, which means inlining or even moving method calls (for example out of a loop) is basically impossible except in very special cases. You mentioned that this is for a kernel. If you can accept using a subset of Java without dynamic class loading, you can do better (but it'll be nonstandard) by assuming the JIT knows all classes. Then you can, for example, detect leaf classes (or more generally methods which are never overriden) and inline those.
If you do need dynamic class loading, but expect it to be rare, you can also do better, though it takes more work. The advantage is that this approach generalizes to other things, like eliminating logging statements completely. The basic idea is specializing the code based on some assumptions (for example, that this static does not change or that no new classes are loaded), then de-optimizing if those assumptions are violated. This means you'll sometimes have to re-compile code while it is running (this is hard, but not impossible).
If you go further down this road, its logical conclusion is trace-based JIT compilation, which has been applied to Java, but AFAIK it didn't turn out to be superior to method-based JIT compilers. It's more effective when you have to make dozens or hundreds of assumptions to get good code, as it happens with highly dynamic languages.
Some comments about your JIT compiler (I hope I do not write things "delnan" already wrote):
Generic comments
I'm sure "real" JIT compilers work similar to your one. However you could do some optimization (example: "mov eax,nnn" and "push eax" could be replaced by "push nnn").
You should store local variables on the stack; typically "ebp" is used as local pointer:
push ebx
push ebp
sub esp, 8 // 2 variables with 4 bytes each
mov ebp, esp
// Now local variables are addressed using [ebp+0] and [ebp+4]
...
pop ebp
pop ebx
ret
This is necessary because functions may be recursive. Storing a variable at a fixed location (relative to EIP) would cause the variables to behave like "static" ones. (I'm assuming you are not compile a function multiple times in the case of a recursive function.)
Try/Catch
To implement Try/Catch your JIT compiler does not only have to look at the Java Bytecode but also at the Try/Catch information that is stored in a separate Attribute in the Java class. Try/catch can be implemented in the following way:
// push all useful registers (= the ones, that must not be destroyed)
push eax
push ebp
...
// push the "catch" pointers
push dword ptr catch_pointer
push dword ptr catch_stack
// set the "catch" pointers
mov catch_stack,esp
mov dword ptr catch_pointer, my_catch
... // some code
// Here some "throw" instruction...
push exception
jmp dword ptr catch_pointer
... //some code
// End of the "try" section: Pop all registers
pop dword_ptr catch_stack
...
pop eax
...
// The "catch" block
my_catch:
pop ecx // pop the Exception from the stack
mov esp, catch_stack // restore the stack
// Now restore all registers (same as at the end of the "try" section)
pop dword_ptr catch_stack
...
pop eax
push ecx // push the Exception to the stack
In a multi-thread environment each thread requires its own catch_stack and catch_pointer variable!
Specific exception types can be handled by using an "instanceof" the following way:
try {
// some code
} catch(MyException1 ex) {
// code 1
} catch(MyException2 ex) {
// code 2
}
... is actually compiled like this ...:
try {
// some code
} catch(Throwable ex) {
if(ex instanceof MyException1) {
// code 1
}
else if(ex instanceof MyException2) {
// code 2
}
else throw(ex); // not handled!
}
Objects
A JIT compiler of a simplified Java virtual machine not supporting objects (and arrays) would be quite easy but the objects in Java make the virtual machine very complex.
Objects are simply stored as pointers to the object on the stack or in the local variables. Typically JIT compilers will be implemented like this: For each class a piece of memory exists that contains information about the class (eg. which methods exist and at which address the assembler code of the method is located etc.). An object is some piece of memory that contains all instance variables and a pointer to the memory containing information about the class.
"Instanceof" and "checkcast" could be implemented by looking at the pointer to the memory containing information about the class. This information may contain a list of all parent classes and implemented interfaces.
The main problem of objects however is the memory management in Java: Unlike C++ there is a "new" but no "delete". You have to check how often an object is used. If an object is no longer used it must be deleted from memory and the destructor must be called.
The problems here are local variables (the same local variable may contain an object or a number) and try/catch blocks (the "catch" block must take care about the local variables and the stack (!) containing objects before restoring the stack pointer).

Assembly Performance Tuning

I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).
I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:
#include "stdafx.h"
#include <intrin.h>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
__int64 startval;
__int64 stopval;
unsigned int value; // Keep the value to keep from it being optomized out
startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode
// Simple Math: a = (a << 3) + 0x0054E9
_asm {
mov ebx, 0x1E532 // Seed
shl ebx, 3
add ebx, 0x0054E9
mov value, ebx
}
stopval = __rdtsc();
__int64 val = (stopval - startval);
cout << "Result: " << value << " -> " << val << endl;
int i;
cin >> i;
return 0;
}
I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples the number is still impossibly varied.
I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.
Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles to execute the code -- can that not be done with our new-fangeled modern processors?
EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.
To even have a hope of repeatable, determinstic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.
You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.
Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.
As such, the normal timing sequence for your code would be something like this:
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID
XOR EAX, EAX
CPUID ; Intel says by the third execution, the timing will be stable.
RDTSC ; read the clock
push eax ; save the start time
push edx
mov ebx, 0x1E532 // Seed // execute test sequence
shl ebx, 3
add ebx, 0x0054E9
mov value, ebx
XOR EAX, EAX ; serialize
CPUID
rdtsc ; get end time
pop ecx ; get start time back
pop ebp
sub eax, ebp ; find end-start
sbb edx, ecx
We're starting to get close, but there's on last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.
Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.
I'd suggest taking a look at Agner Fog's "Software optimization resources" - in particular, the assembly and microarchitecture manuals (2 and 3), and the test code, which includes a rather more sophisticated framework for measurements using the performance monitor counters.
The Z80, and possibly the TI, had the advantage of synchronized memory access, no caches, and in-order execution of the instructions. That made it a lot easier to calculate to number of clocks per instruction.
On current x86 CPUs, instructions using AX or EAX are not faster per se, but some instructions might be shorter than the instructions using other registers. That might just save a byte in the instruction cache!
Go here and download the Architectures Optimization Reference Manual.
There are many myths. I think the EAX claim is one of them.
Also note that you can't talk anymore about 'which instruction is faster'. On today's hardware there are no 1 to 1 relation between instructions and execution time. Some instructions are preferred to others not because they are 'faster' but because they break dependencies between other instructions.
I believe that if there's a difference nowadays it will only be because some of the legacy instructions have a shorter encoding for the variant that uses EAX. To test this, repeat your test case a million times or more before you compare cycle counts.
You're getting ridiculous variance because rdtsc does not serialize execution. Depending on inaccessible details of the execution state, the instructions you're trying to benchmark may in fact be executed entirely before or after the interval between the rdtsc instructions! You will probably get better results if you insert a serializing instruction (such as cpuid) immediately after the first rdtsc and immediately before the second. See this Intel tech note (PDF) for gory details.
Starting your program is going to take much longer than running 4 assembly instructions once, so any difference from your assembly will drown in the noise. Running the program many times won't help, but it would probably help if you run the 4 assembly instructions inside a loop, say, a million times. That way the program start-up happens only once.
There can still be variation. One especially annoying thing that I've experienced myself is that your CPU might have a feature like Intel's Turbo Boost where it will dynamically adjust it's speed based on things like the temperature of your CPU. This is more likely to be the case on a laptop. If you've got that, then you will have to turn it off for any benchmark results to be reliable.
I think what the article tries to say about the EAX register, is that since some operations can only be performed on EAX, it's better to use it from the start. This was very true with the 8086 (MUL comes to mind), but the 386 made the ISA much more orthogonal, so it's much less true these days.

Why is there no Z80 like LDIR functionality in C/C++/rtl?

In Z80 machine code, a cheap technique to initialize a buffer to a fixed value, say all blanks. So a chunk of code might look something like this.
LD HL, DESTINATION ; point to the source
LD DE, DESTINATION + 1 ; point to the destination
LD BC, DESTINATION_SIZE - 1 ; copying this many bytes
LD (HL), 0X20 ; put a seed space in the first position
LDIR ; move 1 to 2, 2 to 3...
The result being that the chunk of memory at DESTINATION is completely blank filled.
I have experimented with memmove, and memcpy, and can't replicate this behavior. I expected memmove to be able to do it correctly.
Why do memmove and memcpy behave this way?
Is there any reasonable way to do this sort of array initialization?
I am already aware of char array[size] = {0} for array initialization
I am already aware that memset will do the job for single characters.
What other approaches are there to this issue?
There was a quicker way of blanking an area of memory using the stack. Although the use of LDI and LDIR was very common, David Webb (who pushed the ZX Spectrum in all sorts of ways like full screen number countdowns including the border) came up with this technique which is 4 times faster:
saves the Stack Pointer and then
moves it to the end of the screen.
LOADs the HL register pair with
zero,
goes into a massive loop
PUSHing HL onto the Stack.
The Stack moves up the screen and down
through memory and in the process,
clears the screen.
The explanation above was taken from the review of David Webbs game Starion.
The Z80 routine might look a little like this:
DI ; disable interrupts which would write to the stack.
LD HL, 0
ADD HL, SP ; save stack pointer
EX DE, HL ; in DE register
LD HL, 0
LD C, 0x18 ; Screen size in pages
LD SP, 0x4000 ; End of screen
PAGE_LOOP:
LD B, 128 ; inner loop iterates 128 times
LOOP:
PUSH HL ; effectively *--SP = 0; *--SP = 0;
DJNZ LOOP ; loop for 256 bytes
DEC C
JP NZ,PAGE_LOOP
EX DE, HL
LD SP, HL ; restore stack pointer
EI ; re-enable interrupts
However, that routine is a little under twice as fast. LDIR copies one byte every 21 cycles. The inner loop copies two bytes every 24 cycles -- 11 cycles for PUSH HL and 13 for DJNZ LOOP. To get nearly 4 times as fast simply unroll the inner loop:
LOOP:
PUSH HL
PUSH HL
...
PUSH HL ; repeat 128 times
DEC C
JP NZ,LOOP
That is very nearly 11 cycles every two bytes which is about 3.8 times faster than the 21 cycles per byte of LDIR.
Undoubtedly the technique has been reinvented many times. For example, it appeared earlier in sub-Logic's Flight Simulator 1 for the TRS-80 in 1980.
memmove and memcpy don't work that way because it's not a useful semantic for moving or copying memory. It's handy in the Z80 to do be able to fill memory, but why would you expect a function named "memmove" to fill memory with a single byte? It's for moving blocks of memory around. It's implemented to get the right answer (the source bytes are moved to the destination) regardless of how the blocks overlap. It's useful for it to get the right answer for moving memory blocks.
If you want to fill memory, use memset, which is designed to do just what you want.
I believe this goes to the design philosophy of C and C++. As Bjarne Stroustrup once said, one of the major guiding principles of the design of C++ is "What you don’t use, you don’t pay for". And while Dennis Ritchie may not have said it in exactly those same words, I believe that was a guiding principle informing his design of C (and the design of C by subsequent people) as well. Now you may think that if you allocate memory it should automatically be initialized to NULL's and I'd tend to agree with you. But that takes machine cycles and if you're coding in a situation where every cycle is critical, that may not be an acceptable trade-off. Basically C and C++ try to stay out of your way--hence if you want something initialized you have to do it yourself.
The Z80 sequence you show was the fastest way to do that - in 1978. That was 30 years ago. Processors have progressed a lot since then, and today that's just about the slowest way to do it.
Memmove is designed to work when the source and destination ranges overlap, so you can move a chunk of memory up by one byte. That's part of its specified behavior by the C and C++ standards. Memcpy is unspecified; it might work identically to memmove, or it might be different, depending on how your compiler decides to implement it. The compiler is free to choose a method that is more efficient than memmove.
Why do memmove and memcpy behave this way?
Probably because there’s no specific, modern C++ compiler that targets the Z80 hardware? Write one. ;-)
The languages don't specify how a given hardware implements anything. This is entirely up to the programmers of the compiler and libraries. Of course, writing an own, highly specified version for every imaginable hardware configuration is a lot of work. That’ll be the reason.
Is there any reasonable way to do this sort of array initialization?Is there any reasonable way to do this sort of array initialization?
Well, if all else fails you could always use inline assembly. Other than that, I expect std::fill to perform best in a good STL implementation. And yes, I’m fully aware that my expectations are too high and that std::memset often performs better in practice.
If you're fiddling at the hardware level, then some CPUs have DMA controllers that can fill blocks of memory exceedingly quickly (much faster than the CPU could ever do). I've done this on a Freescale i.MX21 CPU.
This be accomplished in x86 assembly just as easily. In fact, it boils down to nearly identical code to your example.
mov esi, source ; set esi to be the source
lea edi, [esi + 1] ; set edi to be the source + 1
mov byte [esi], 0 ; initialize the first byte with the "seed"
mov ecx, 100h ; set ecx to the size of the buffer
rep movsb ; do the fill
However, it is simply more efficient to set more than one byte at a time if you can.
Finally, memcpy/memmove aren't what you are looking for, those are for making copies of blocks of memory from from area to another (memmove allows source and dest to be part of the same buffer). memset fills a block with a byte of your choosing.
There's also calloc that allocates and initializes the memory to 0 before returning the pointer. Of course, calloc only initializes to 0, not something the user specifies.
If this is the most efficient way to set a block of memory to a given value on the Z80, then it's quite possible that memset() might be implemented as you describe on a compiler that targets Z80s.
It might be that memcpy() might also use a similar sequence on that compiler.
But why would compilers targeting CPUs with completely different instruction sets from the Z80 be expected to use a Z80 idiom for these types of things?
Remember that the x86 architecture has a similar set of instructions that could be prefixed with a REP opcode to have them execute repeatedly to do things like copy, fill or compare blocks of memory. However, by the time Intel came out with the 386 (or maybe it was the 486) the CPU would actually run those instructions slower than simpler instructions in a loop. So compilers often stopped using the REP-oriented instructions.
Seriously, if you're writing C/C++, just write a simple for-loop and let the compiler bother for you. As an example, here's some code VS2005 generated for this exact case (using templated size):
template <int S>
class A
{
char s_[S];
public:
A()
{
for(int i = 0; i < S; ++i)
{
s_[i] = 'A';
}
}
int MaxLength() const
{
return S;
}
};
extern void useA(A<5> &a, int n); // fool the optimizer into generating any code at all
void test()
{
A<5> a5;
useA(a5, a5.MaxLength());
}
The assembler output is the following:
test PROC
[snip]
; 25 : A<5> a5;
mov eax, 41414141H ;"AAAA"
mov DWORD PTR a5[esp+40], eax
mov BYTE PTR a5[esp+44], al
; 26 : useA(a5, a5.MaxLength());
lea eax, DWORD PTR a5[esp+40]
push 5 ; MaxLength()
push eax
call useA
It does not get any more efficient than that. Stop worrying and trust your compiler or at least have a look at what your compiler produces before trying to find ways to optimize. For comparison I also compiled the code using std::fill(s_, s_ + S, 'A') and std::memset(s_, 'A', S) instead of the for-loop and the compiler produced the identical output.
If you're on the PowerPC, _dcbz().
There are a number of situations where it would be useful to have a "memspread" function whose defined behavior was to copy the starting portion of a memory range throughout the whole thing. Although memset() does just fine if the goal is to spread a single byte value, there are times when e.g. one may want to fill an array of integers with the same value. On many processor implementations, copying a byte at a time from the source to the destination would be a pretty crummy way to implement it, but a well-designed function could yield good results. For example, start by seeing if the amount of data is less than 32 bytes or so; if so, just do a bytewise copy; otherwise check the source and destination alignment; if they are aligned, round the size down to the nearest word (if necessary), then copy the first word everywhere it goes, copy the next word everywhere it goes, etc.
I too have at times wished for a function that was specified to work as a bottom-up memcpy, intended for use with overlapping ranges. As to why there isn't a standard one, I guess nobody thought it important.
memcpy() should have that behavior. memmove() doesn't by design, if the blocks of memory overlap, it copies the contents starting at the ends of the buffers to avoid that sort of behavior. But to fill a buffer with a specific value you should be using memset() in C or std::fill() in C++, which most modern compilers will optimize to the appropriate block fill instruction (such as REP STOSB on x86 architectures).
As said before, memset() offers the desired functionality.
memcpy() is for moving around blocks of memory in all cases where the source and destination buffers do not overlap, or where dest < source.
memmove() solves the case of buffers overlapping and dest > source.
On x86 architectures, good compilers directly replace memset calls with inline assembly instructions very effectively setting the destination buffer's memory, even applying further optimizations like using 4-byte values to fill as long as possible (if the following code isn't totally syntactically correct blame it on my not using X86 assembly code for a long time):
lea edi,dest
;copy the fill byte to all 4 bytes of eax
mov al,fill
mov ah,al
mov dx,ax
shl eax,16
mov ax,dx
mov ecx,count
mov edx,ecx
shr ecx,2
cld
rep stosd
test edx,2
jz moveByte
stosw
moveByte:
test edx,1
jz fillDone
stosb
fillDone:
Actually this code is far more efficient than your Z80 version, as it doesn't do memory to memory, but only register to memory moves. Your Z80 code is in fact quite a hack as it relies on each copy operation having filled the source of the subsequent copy.
If the compiler is halfway good, it might be able to detect more complicated C++ code that can be broken down to memset (see the post below), but I doubt that this actually happens for nested loops, probably even invoking initialization functions.