Assembly: Why is there empty memory on the stack?

I used an online compiler to write a simple C++ program:
int main()
{
int a = 4;
int&& b = 2;
}
and the assembly for the main function, compiled by gcc 11.20, is shown below:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 4
mov eax, 2
mov DWORD PTR [rbp-20], eax
lea rax, [rbp-20]
mov QWORD PTR [rbp-16], rax
mov eax, 0
pop rbp
ret
I notice that when initializing a, the compiler simply moves an immediate operand directly to memory, while for the r-value reference b, it first stores the immediate value in the register eax and then moves it to memory. There is also an unused region of memory between [rbp-8] and [rbp-4]. I think that whatever the immediate value is, it has to exist somewhere, or else the CPU just uses some signal to initialize it (my guess). I want to know more about the underlying logic.
So my questions are:
Why does the initialization differ?
Why is there an unused 4-byte gap on the stack?

Let me address the second question first.
Note that there are actually three objects defined in this function: the int variable a, the reference b (implemented as a pointer), and the unnamed temporary int with a value of 2 that b points to. In unoptimized compilation, each of these objects needs to be stored at some unique location on the stack, and the compiler allocates stack space naively, processing the variables one by one and assigning each one space below the previous. It evidently chooses to handle them in the following order:
The variable a, an int needing 4 bytes. It goes in the first available stack slot, at [rbp-4].
The reference b, stored as a pointer needing 8 bytes. You might think it would go at [rbp-12], but the x86-64 ABI requires that pointers be naturally aligned on 8-byte boundaries. So the compiler moves down another 4 bytes to achieve this alignment, putting b at [rbp-16]. The 4 bytes at [rbp-8] are unused so far.
The temporary int, also needing 4 bytes. The compiler puts it right below the previously placed variable, at [rbp-20]. True, there was space at [rbp-8] that could have been used instead, which would be more efficient; but since you told the compiler not to optimize, it doesn't perform this optimization. It would if you used one of the -O flags.
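The placement described above can be reproduced with a small model. This is only a sketch of the naive strategy as I understand it, not GCC's actual code; the names Local, assign_offsets and offsets_for_question are made up for illustration, and it assumes rbp itself is suitably aligned.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One local object the compiler must place in the stack frame.
struct Local {
    std::string name;   // label for illustration only
    std::size_t size;   // size in bytes
    std::size_t align;  // required alignment in bytes
};

// Hypothetical model of naive, unoptimized allocation: walk the objects in
// order, move down by each object's size, then keep moving down until the
// distance from rbp is a multiple of the object's alignment.
// Returns the offset N for each object, meaning it lives at [rbp-N].
std::vector<std::size_t> assign_offsets(const std::vector<Local>& locals) {
    std::vector<std::size_t> offsets;
    std::size_t depth = 0;  // bytes used below rbp so far
    for (const Local& obj : locals) {
        depth += obj.size;
        depth = (depth + obj.align - 1) / obj.align * obj.align;  // round up
        offsets.push_back(depth);
    }
    return offsets;
}

// The three objects from the question, in the order GCC appears to handle them.
std::vector<std::size_t> offsets_for_question() {
    return assign_offsets({
        {"a", 4, 4},
        {"b (reference, stored as a pointer)", 8, 8},
        {"temporary int holding 2", 4, 4},
    });
}
```

With these inputs the model yields offsets 4, 16 and 20, matching [rbp-4], [rbp-16] and [rbp-20] in the listing; the 4 bytes starting at [rbp-8] are skipped purely because of the rounding step.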
As to why a is initialized with an immediate store to memory, whereas the temporary is initialized via a register: to really answer this, you'd have to read the details of the GCC source code, and frankly I don't think you'll find that there is anything very interesting behind it. Presumably there are different code paths in the compiler for creating and initializing named variables versus temporaries, and the code for temporaries may happen to be written as two steps.
It may be that for convenience, the programmer chose to create an extra object in the intermediate representation (GIMPLE or RTL), perhaps because it simplifies the compiler code in handling more general cases. They wouldn't take any trouble to avoid this, because they know that later optimization passes will clean it up. But if you have optimization turned off, this doesn't happen and you get actual instructions emitted for this unnecessary transfer.

In
int a = 4;
you declare a (typically) 4-byte variable and ask the compiler to fill it with the bit representation of 4.
In
int&& b = 2;
you declare a reference (an "r-value reference") to, well, to what? To a literal? Is that possible? In C++, references are typically translated, at the assembly level, into pointers. So one can expect that b will be "a pointer in disguise", that is, one without the * and -> syntax, and it will likely occupy 64 bits on a 64-bit machine. Now, a pointer must hold the address of something in memory; a register has no address. So the compiler creates a temporary (unnamed) integer, initializes it with 2, and then binds its address to b. The standard does require this temporary object to exist (and extends its lifetime to match that of b), but it does not say where the object is stored or how the reference is implemented, so the stack slot and the pointer are implementation details. What we know for sure is that there is an extra unnamed object involved in the initialization of b in int&& b = 2;.
As for the assembly, I know too little about it to dare explain anything to you. I guess, however, that the concept of a temporary variable and a pointer behind the && reference solves all your problems here.
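One way to convince yourself that the temporary is a real, addressable object is to take its address through the reference and write to it. A minimal sketch (the function name modify_through_reference is made up for illustration):

```cpp
// Sketch: an rvalue reference bound to a literal refers to a real,
// addressable temporary object whose lifetime is extended to match b.
int modify_through_reference() {
    int&& b = 2;   // a temporary int holding 2 is created; b refers to it
    int* p = &b;   // b is a named reference, so its target has an address
    *p += 1;       // writing through the pointer changes the temporary
    return b;      // reads the same object, which now holds 3
}
```

Note that &2 on its own does not compile; it is only the materialized temporary behind b that has an address.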

Related

Where are expressions and constants stored if not in memory?

From The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie:
& operator only applies to objects in memory: variables and array
elements. It cannot be applied to expressions, constants or register
variables.
Where are expressions and constants stored if not in memory?
What does that quote mean?
E.g:
&(2 + 3)
Why can't we take its address? Where is it stored?
Will the answer be the same for C++, since C is its parent?
This linked question explains that such expressions are rvalue objects and all rvalue objects do not have addresses.
My question is where are these expressions stored such that their addresses can't be retrieved?

Consider the following function:
unsigned sum_evens (unsigned number) {
number &= ~1; // ~1 = 0xfffffffe (32-bit CPU)
unsigned result = 0;
while (number) {
result += number;
number -= 2;
}
return result;
}
Now, let's play the compiler game and try to compile this by hand. I'm going to assume you're using x86 because that's what most desktop computers use. (x86 is the instruction set for Intel compatible CPUs.)
Let's go through a simple (unoptimized) version of what this routine could look like when compiled:
sum_evens:
and edi, 0xfffffffe ;edi is where the first argument goes
xor eax, eax ;set register eax to 0
cmp edi, 0 ;compare number to 0
jz .done ;if edi = 0, jump to .done
.loop:
add eax, edi ;eax = eax + edi
sub edi, 2 ;edi = edi - 2
jnz .loop ;if edi != 0, go back to .loop
.done:
ret ;return (value in eax is returned to caller)
Now, as you can see, the constants in the code (0, 2, 1) actually show up as part of the CPU instructions! In fact, 1 doesn't show up at all; the compiler (in this case, just me) already calculates ~1 and uses the result in the code.
While you can take the address of a CPU instruction, it often makes no sense to take the address of a part of it (in x86 you sometimes can, but in many other CPUs you simply cannot do this at all), and code addresses are fundamentally different from data addresses (which is why you cannot treat a function pointer (a code address) as a regular pointer (a data address)). In some CPU architectures, code addresses and data addresses are completely incompatible (although this is not the case of x86 in the way most modern OSes use it).
Do notice that while (number) is equivalent to while (number != 0). Inside the loop, that 0 doesn't show up in the compiled code at all! It's implied by the jnz instruction (jump if not zero), which tests the zero flag set by the preceding sub. This is another reason why you cannot take the address of that 0: it doesn't have one, it's literally nowhere.
I hope this makes it clearer for you.

where are these expressions stored such that their addresses can't be retrieved?
Your question is not well-formed.
Conceptually
It's like asking why people can discuss ownership of nouns but not verbs. Nouns refer to things that may (potentially) be owned, and verbs refer to actions that are performed. You can't own an action or perform a thing.
In terms of language specification
Expressions are not stored in the first place, they are evaluated.
They may be evaluated by the compiler, at compile time, or they may be evaluated by the processor, at run time.
In terms of language implementation
Consider the statement
int a = 0;
This does two things: first, it declares an integer variable a. This is defined to be something whose address you can take. It's up to the compiler to do whatever makes sense on a given platform, to allow you to take the address of a.
Secondly, it sets that variable's value to zero. This does not mean an integer with value zero exists somewhere in your compiled program. It might commonly be implemented as
xor eax,eax
which is to say, XOR (exclusive-or) the eax register with itself. This always results in zero, whatever was there before. However, there is no fixed object of value 0 in the compiled code to match the integer literal 0 you wrote in the source.
As an aside, when I say that a above is something whose address you can take - it's worth pointing out that it may not really have an address unless you take it. For example, the eax register used in that example doesn't have an address. If the compiler can prove the program is still correct, a can live its whole life in that register and never exist in main memory. Conversely, if you use the expression &a somewhere, the compiler will take care to create some addressable space to store a's value in.
Note for comparison that I can easily choose a different language where I can take the address of an expression.
It'll probably be interpreted, because compilation usually discards these structures once the machine-executable output replaces them. For example Python has runtime introspection and code objects.
Or I can start from LISP and extend it to provide some kind of addressof operation on S-expressions.
The key thing they both have in common is that they are not C, which as a matter of design and definition does not provide those mechanisms.

Such expressions end up part of the machine code. An expression 2 + 3 likely gets translated to the machine code instruction "load 5 into register A". CPU registers don't have addresses.

It does not really make sense to take the address of an expression. The closest thing you can do is a function pointer. Expressions are not stored in the same sense as variables and objects.
Expressions are stored in the actual machine code. Of course you could find the address where the expression is evaluated, but it just doesn't make sense to do it.
Read a bit about assembly. Expressions are stored in the text segment, while variables are stored in other segments, such as data or stack.
https://en.wikipedia.org/wiki/Data_segment
Another way to explain it is that expressions are CPU instructions, while variables are pure data.
One more thing to consider: The compiler often optimizes away things. Consider this code:
int x=0;
while(x<10)
x+=1;
This code will probably be optimized to:
int x=10;
So what would the address to (x+=1) mean in this case? It is not even present in the machine code, so it has - by definition - no address at all.
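To see that the loop really can disappear, here is a sketch that forces the compiler to evaluate it at compile time. It assumes C++14 or later, and the names count_up and folded_value are made up for illustration. It does not prove what any particular optimizer does with the plain loop, only that the result is computable without running x += 1 at run time.

```cpp
// The loop from the answer, wrapped in a constexpr function (C++14 or later).
constexpr int count_up() {
    int x = 0;
    while (x < 10)
        x += 1;
    return x;
}

// The compiler must evaluate the whole loop at compile time to check this,
// so no run-time x += 1 is needed to produce the result.
static_assert(count_up() == 10, "loop folds to the constant 10");

// Run-time accessor for the same value.
int folded_value() { return count_up(); }
```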

Where are expressions and constants stored if not in memory
In some (actually many) cases, a constant expression is not stored at all. In particular, think about optimizing compilers, and see CppCon 2017: Matt Godbolt's talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”
In your particular case of some C code having 2 + 3, most optimizing compilers would have constant folded that into 5, and that 5 constant might be just inside some machine code instruction (as some bitfield) of your code segment and not even have a well defined memory location. If that constant 5 was a loop limit, some compilers could have done loop unrolling, and that constant won't appear anymore in the binary code.
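A small sketch of the same point in C++: 2 + 3 can be used where the language demands a compile-time constant, such as an array bound, so the addition never needs to exist at run time. The names table, table_length and length_of_table are made up for illustration.

```cpp
#include <cstddef>

// An array bound must be a compile-time constant, so 2 + 3 is folded to 5
// by the compiler before any code is generated.
int table[2 + 3];

constexpr std::size_t table_length = sizeof(table) / sizeof(table[0]);
static_assert(table_length == 5, "2 + 3 was evaluated at compile time");

std::size_t length_of_table() {
    // int* p = &(2 + 3);  // does not compile: 2 + 3 is not an lvalue
    return table_length;
}
```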
See also this answer, etc...
Be aware that C11 is a specification written in English. Read its n1570 standard. Read also the much bigger specification of C++11 (or later).
Taking the address of a constant is forbidden by the semantics of C (and of C++).

How is it known that variables are in registers, or on stack?

I am reading this question about inline in the isocpp FAQ. The code is given as
void f()
{
int x = /*...*/;
int y = /*...*/;
int z = /*...*/;
// ...code that uses x, y and z...
g(x, y, z);
// ...more code that uses x, y and z...
}
then it says that
Assuming a typical C++ implementation that has registers and a stack,
the registers and parameters get written to the stack just before the
call to g(), then the parameters get read from the stack inside
g() and read again to restore the registers while g() returns to
f(). But that’s a lot of unnecessary reading and writing, especially
in cases when the compiler is able to use registers for variables x,
y and z: each variable could get written twice (as a register and
also as a parameter) and read twice (when used within g() and to
restore the registers during the return to f()).
I have great difficulty understanding the paragraph above. I have tried to list my questions below:
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
PS
It's very hard to choose an accepted answer when the answers are all very good (e.g., the ones provided by @MatsPeterson, @TheodorosChatzigiannakis, and @superultranova), I think. I personally like the one by @Potatoswatter a little bit more since the answer offers some guidelines.

Don't take that paragraph too seriously. It seems to be making excessive assumptions and then going into excessive detail, which can't really be generalized.
But, your questions are very good.
For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data? (I know this question is not particularly related to C++, but understanding this will be helpful to understand how C++ works.)
More-or-less, everything needs to be loaded into registers. Most computers are organized around a datapath, a bus connecting the registers, the arithmetic circuits, and the top level of the memory hierarchy. Usually, anything that is broadcast on the datapath is identified with a register.
You may recall the great RISC vs CISC debate. One of the key points was that a computer design can be much simpler if the memory is not allowed to connect directly to the arithmetic circuits.
In modern computers, there are architectural registers, which are a programming construct like a variable, and physical registers, which are actual circuits. The compiler does a lot of heavy lifting to allocate architectural registers while generating a program, and the processor maps those onto physical registers as it runs. For a CISC instruction set like x86, code generation may involve emitting instructions that send operands in memory directly to arithmetic operations. But behind the scenes, it's registers all the way down.
Bottom line: Just let the compiler do its thing.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Each platform defines a way for C functions to call each other. Passing parameters in registers is more efficient. But, there are trade-offs and the total number of registers is limited. Older ABIs more often sacrificed efficiency for simplicity, and put them all on the stack.
Bottom line: The example is arbitrarily assuming a naive ABI.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
The compiler tends to prefer to use registers for more frequently accessed values. Nothing in the example requires the use of the stack. However, less frequently accessed values will be placed on the stack to make more registers available.
Only when you take the address of a variable, such as by &x or passing by reference, and that address escapes the inliner, is the compiler required to use memory and not registers.
Bottom line: Avoid taking addresses and passing/storing them willy-nilly.

It is entirely up to the compiler (in conjunction with the processor type) whether a variable is stored in memory or a register [or in some cases more than one register] (and what options you give the compiler, assuming it's got options to decide such things - most "good" compilers do). For example, the LLVM/Clang compiler uses a specific optimisation pass called "mem2reg" that moves variables from memory to registers. The decision to do so is based on how the variable(s) are used - for example, if you take the address of a variable at some point, it needs to be in memory.
Other compilers have similar, but not necessarily identical, functionality.
Also, at least in compilers that have some semblance of portability, there will ALSO be a phase of generating machine code for the actual target, which contains target-specific optimisations, which again can move a variable from memory to a register.
It is not possible [without understanding how the particular compiler works] to determine if the variables in your code are in registers or in memory. One can guess, but such a guess is just like guessing other "kind of predictable things", like looking out the window to guess if it's going to rain in a few hours - depending on where you live, this may be a complete random guess, or quite predictable - some tropical countries, you can set your watch based on when the rain arrives each afternoon, in other countries, it rarely rains, and in some countries, like here in England, you can't know for certain beyond "right now it is [not] raining right here".
To answer the actual questions:
This depends on the processor. Proper RISC processors such as ARM, MIPS, 29K, etc. have no instructions that use memory operands except the load and store type instructions. So if you need to add two values, you need to load the values into registers, and use the add operation on those registers. Some, such as x86 and 68K, allow one of the two operands to be a memory operand, and others, for example PDP-11 and VAX, have "full freedom": whether your operands are in memory or in registers, you can use the same instruction, just with different addressing modes for the different operands.
Your original premise here is wrong - it's not guaranteed that arguments to g are on the stack. That is just one of many options. Many ABIs (application binary interfaces, aka "calling conventions") use registers for the first few arguments to a function. So, again, it depends on which compiler (to some degree) and what processor (much more than which compiler) the compiler targets whether the arguments are in memory or in registers.
Again, this is a decision that the compiler makes - it depends on how many registers the processor has, which are available, and what the cost is of "freeing" some register for x, y and z - which ranges from "no cost at all" to "quite a bit" - again, depending on the processor model and the ABI.

For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
Not even this statement is always true. It is probably true for all the platforms you'll ever work with, but there surely can be another architecture that doesn't make use of processor registers at all.
Your x86_64 computer does however.
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
These two questions cannot be uniquely answered for any compiler and system your code will be compiled on. They cannot even be taken for granted since g's parameters might not be on the stack, it all depends on several concepts I'll explain below.
First you should be aware of the so-called calling conventions which define, among other things, how function parameters are passed (e.g. pushed on the stack, placed in registers, or a mix of both). This isn't enforced by the C++ standard; calling conventions are a part of the ABI, a broader topic regarding low-level machine code program issues.
Secondly, register allocation (i.e. which variables are actually loaded in a register at any given time) is a complex task and an NP-complete problem. Compilers try to do their best with the information they have. In general, less frequently accessed variables are put on the stack while more frequently accessed variables are kept in registers. Thus the part Where the data inside g() is stored, register or stack? cannot be answered once and for all, since it depends on many factors including register pressure.
Not to mention compiler optimizations which might even eliminate the need for some variables to be around.
Finally the question you linked already states
Naturally your mileage may vary, and there are a zillion variables that are outside the scope of this particular FAQ, but the above serves as an example of the sorts of things that can happen with procedural integration.
i.e. the paragraph you posted makes some assumptions to set things up for an example. Those are just assumptions and you should treat them as such.
As a small addition: regarding the benefits of inline on a function I recommend taking a look at this answer: https://stackoverflow.com/a/145952/1938163

You can't know, without looking at the assembly language, whether a variable is in a register, stack, heap, global memory or elsewhere. A variable is an abstract concept. The compiler is allowed to use registers or other memory as it chooses, as long as the execution isn't changed.
There's also another rule that affects this topic. If you take the address of a variable and store it in a pointer, the variable may not be placed in a register, because registers don't have addresses.
The variable storage may also depend on the optimization settings for the compiler. Variables can disappear due to simplification. Variables that don't change value may be placed into the executable as a constant.

Regarding your #1 question, yes, non load/store instructions operate on registers.
Regarding your #2 question, if we are assuming that parameters are passed on the stack, then we have to write the registers to the stack, otherwise g() won't be able to access the data, since the code in g() doesn't "know" which registers the parameters are in.
Regarding your #3 question, it is not known that x, y and z will for sure be stored in registers in f(). One could use the register keyword, but that's more of a suggestion. Based on the calling convention, and assuming the compiler doesn't do any optimization involving parameter passing, you may be able to predict whether the parameters are on the stack or in registers.
You should familiarize yourself with calling conventions. Calling conventions deal with the way that parameters are passed to functions and typically involve passing parameters on the stack in a specified order, putting parameters into registers or a combination of both.
stdcall, cdecl, and fastcall are some examples of calling conventions. In terms of parameter passing, stdcall and cdecl are the same, in that the parameters are pushed onto the stack in right-to-left order. In this case, if g() were cdecl or stdcall, the caller would push z, y, x in that order:
mov eax, z
push eax
mov eax, y
push eax
mov eax, x
push eax
call g
In the 64-bit fastcall-style conventions, registers are used. Microsoft's x64 convention uses RCX, RDX, R8, R9 (plus the stack for functions requiring more than 4 parameters); the System V AMD64 ABI used on Linux uses RDI, RSI, RDX, RCX, R8, R9. To call g() using the Microsoft 64-bit convention, one would do the following (we assume z, x, and y are not in registers):
mov rcx, x
mov rdx, y
mov r8, z
call g
This is how assembly is written by humans, and sometimes by compilers. Compilers will use some tricks to avoid passing parameters, as this typically reduces the number of instructions and can reduce the number of times memory is accessed. Take the following code for example (I'm intentionally ignoring non-volatile register rules):
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx, y
call g
mov rcx, rax
ret
g:
mov rax, rsi
add rax, rcx
add rax, rdx
ret
For illustrative purposes, rcx is already in use, and x has been loaded into rsi. The compiler can compile g such that it uses rsi instead of rcx, so values don't have to be swapped between the two registers when it comes time to call g. The compiler could also inline g, now that f and g share the same set of registers for x, y, and z. In that case, the call g instruction would be replaced with the contents of g, excluding the ret instruction.
f:
xor rcx, rcx
mov rsi, x
mov r8, z
mov rdx, y
mov rax, rsi
add rax, rcx
add rax, rdx
mov rcx, rax
ret
This will be even faster, because we don't have to deal with the call instruction, since g has been inlined into f.

Short answer: You can't. It completely depends on your compiler and the optimizing features enabled.
The compiler's job is to translate your program into assembly, but how it does so is tightly coupled to how that particular compiler works.
Some compilers allow you to hint which variables map to registers.
Check for example this: https://gcc.gnu.org/onlinedocs/gcc/Global-Reg-Vars.html
Your compiler will apply transformations to your code in order to gain something, maybe performance, maybe smaller code size, and it applies cost functions to estimate these gains, so normally you can only see the result by disassembling the compiled unit.

Variables are, by default, stored in main memory. Often, though, compiler optimizations mean that the value of a declared variable never reaches main memory at all; these are intermediate variables used inside a function that are no longer relevant by the time another function is called (i.e. before any stack operation occurs).
This is by design, to improve performance: it is easier (and much faster) for the processor to address and manipulate data in registers. Architectural registers are limited in number, so not everything can be kept in them. Even if you 'hint' to your compiler that a variable should go in a register, the compiler may still spill it to main memory when no registers are available.
Most probably, a variable is in main memory because it stays relevant for a longer stretch of execution. A variable is in an architectural register because it is needed by upcoming machine instructions, which will execute almost immediately, but it may not be relevant for long.

For a computer to do some operations on some data which are residing in the main memory, is it true that the data must be loaded to some registers first then the CPU can operate on the data?
This depends on the architecture and the instruction set it offers. But in practice, yes - it is the typical case.
How is it known that the declarations for x, y, z make them stored in the registers? Where the data inside g() is stored, register or stack?
Assuming the compiler doesn't eliminate the local variables, it will prefer to put them in registers, because registers are faster than the stack (which resides in the main memory, or a cache).
But this is far from a universal truth: it depends on the (complicated) inner workings of the compiler (whose details are handwaved in that paragraph).
I think f() is a function in a way the same as g(x, y, z) is a function. How come x, y, z before calling g() are in the registers, and the parameters passed in g() are on the stack?
Even if we assume that the variables are, in fact, stored in the registers, when you call a function, the calling convention kicks in. That's a convention that describes how a function is called, where the arguments are passed, who cleans up the stack, what registers are preserved.
All calling conventions have some kind of overhead. One source of this overhead is the argument passing. Many calling conventions attempt to reduce that, by preferring to pass arguments through registers, but since the number of CPU registers is limited (compared to the space of the stack), they eventually fall back to pushing through the stack after a number of arguments.
The paragraph in your question assumes a calling convention that passes everything through the stack and based on that assumption, what it's trying to tell you is that it would be beneficial (for execution speed) if we could "copy" (at compile time) the body of the called function inside the caller (instead of emitting a call to the function). This would yield the same results logically, but it would eliminate the runtime cost of the function call.

Just-in-Time compilation of Java bytecode

We are currently working on the JIT compilation part of our own Java Virtual Machine implementation. Our idea was now to do a simple translation of the given Java bytecode into opcodes, writing them to executable memory and CALLing right to the start of the method.
Assuming the given Java code would be:
int a = 13372338;
int b = 32 * a;
return b;
Now, the following approach was made (assuming that the given memory starts at 0x1000 & the return value is expected in eax):
0x1000: first local variable - accessible via [eip - 8]
0x1004: second local variable - accessible via [eip - 4]
0x1008: start of the code - accessible via [eip]
Java bytecode | Assembler code (NASM syntax)
--------------|------------------------------------------------------------------
| // start
| mov edx, eip
| push ebx
|
| // method content
ldc | mov eax, 13372338
| push eax
istore_0 | pop eax
| mov [edx - 8], eax
bipush | push 32
iload_0 | mov eax, [edx - 8]
| push eax
imul | pop ebx
| pop eax
| mul ebx
| push eax
istore_1 | pop eax
| mov [edx - 4], eax
iload_1 | mov eax, [edx - 4]
| push eax
ireturn | pop eax
|
| // end
| pop ebx
| ret
This would simply use the stack just like the virtual machine does itself.
The questions regarding this solution are:
Is this method of compilation viable?
Is it even possible to implement all the Java instructions this way? How could things like athrow/instanceof and similar commands be translated?

This method of compilation works, is easy to get up and running, and it at least removes interpretation overhead. But it results in pretty large amounts of code and pretty awful performance. One big problem is that it transliterates the stack operations 1:1, even though the target machine (x86) is a register machine. As you can see in the snippet you posted (as well as any other code), this always results in several stack manipulation opcodes for every single operation, so it uses the registers - heck, the whole ISA - about as ineffectively as possible.
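For reference, the stack semantics that the 1:1 translation mirrors can be sketched as a tiny interpreter. This is a made-up mini instruction set covering only the opcodes in the question (the names Op, Insn, run and run_example are illustrative, not from any real JVM), and it ignores Java's wrap-around overflow rules because the example values do not overflow.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical mini instruction set covering only the opcodes in the example.
enum class Op { Ldc, Bipush, Iload, Istore, Imul, Ireturn };

struct Insn {
    Op op;
    std::int32_t operand;  // constant for Ldc/Bipush, slot index for Iload/Istore
};

// Executes the instructions on an operand stack, the way the bytecode is defined.
std::int32_t run(const std::vector<Insn>& code) {
    std::vector<std::int32_t> stack;
    std::int32_t locals[2] = {0, 0};
    for (const Insn& insn : code) {
        switch (insn.op) {
        case Op::Ldc:
        case Op::Bipush:
            stack.push_back(insn.operand);
            break;
        case Op::Iload:
            stack.push_back(locals[insn.operand]);
            break;
        case Op::Istore:
            locals[insn.operand] = stack.back();
            stack.pop_back();
            break;
        case Op::Imul: {
            std::int32_t rhs = stack.back(); stack.pop_back();
            std::int32_t lhs = stack.back(); stack.pop_back();
            stack.push_back(lhs * rhs);  // no overflow handling in this sketch
            break;
        }
        case Op::Ireturn:
            return stack.back();
        }
    }
    return 0;  // no ireturn reached
}

// The bytecode sequence from the question: int a = 13372338; int b = 32 * a; return b;
std::int32_t run_example() {
    return run({
        {Op::Ldc, 13372338}, {Op::Istore, 0},
        {Op::Bipush, 32},    {Op::Iload, 0},
        {Op::Imul, 0},       {Op::Istore, 1},
        {Op::Iload, 1},      {Op::Ireturn, 0},
    });
}
```

Every push_back/pop_back pair here corresponds to a push/pop pair in the generated x86, which is where the overhead comes from.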
You can also support complicated control flow such as exceptions. It's not very different from implementing it in an interpreter. If you want good performance you don't want to perform work every time you enter or exit a try block. There are schemes to avoid this, used by both C++ and other JVMs (keyword: zero-cost or table-driven exception handling). These are quite complex and complicated to implement, understand and debug, so you should go with a simpler alternative first. Just keep it in mind.
As for the generated code: The first optimization, one which you'll almost definitely need, is converting the stack operations into three address code or some other representation that uses registers. There are several papers on this and implementations of this, so I won't elaborate unless you want me to. Then, of course, you need to map these virtual registers onto physical registers. Register allocation is one of the most well-researched topics in compiler construction, and there are at least half a dozen heuristics that are reasonably effective and fast enough to use in a JIT compiler. One example off the top of my head is linear scan register allocation (specifically created for JIT compilation).
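To give a feel for linear scan, here is a heavily simplified sketch. It assumes live intervals have already been computed, spills whole intervals rather than splitting them, and uses made-up names (Interval, linear_scan, allocate_example); a production allocator handles many more cases.

```cpp
#include <algorithm>
#include <vector>

struct Interval {
    int vreg;   // virtual register number
    int start;  // first position where the value is live
    int end;    // last position where the value is live
    int preg;   // assigned physical register, or -1 if spilled
};

// Simplified linear scan: visit intervals by increasing start, free the
// registers of intervals that have ended, and when no register is free,
// spill whichever candidate interval ends last.
void linear_scan(std::vector<Interval>& intervals, int num_regs) {
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::vector<int> free_regs;
    for (int r = num_regs - 1; r >= 0; --r) free_regs.push_back(r);

    std::vector<Interval*> active;  // intervals currently holding a register
    for (Interval& cur : intervals) {
        // Expire intervals that ended before cur starts.
        for (auto it = active.begin(); it != active.end();) {
            if ((*it)->end < cur.start) {
                free_regs.push_back((*it)->preg);
                it = active.erase(it);
            } else {
                ++it;
            }
        }
        if (!free_regs.empty()) {
            cur.preg = free_regs.back();
            free_regs.pop_back();
            active.push_back(&cur);
            continue;
        }
        // No register free: find the active interval that ends last.
        auto victim = std::max_element(active.begin(), active.end(),
            [](const Interval* a, const Interval* b) { return a->end < b->end; });
        if (victim != active.end() && (*victim)->end > cur.end) {
            cur.preg = (*victim)->preg;  // take over the victim's register
            (*victim)->preg = -1;        // the victim is spilled
            *victim = &cur;
        } else {
            cur.preg = -1;               // cur itself is spilled
        }
    }
}

// Four virtual registers competing for two physical registers.
std::vector<Interval> allocate_example() {
    std::vector<Interval> v = {
        {0, 0, 10, -1}, {1, 1, 3, -1}, {2, 2, 8, -1}, {3, 4, 6, -1},
    };
    linear_scan(v, 2);
    return v;
}
```

In allocate_example, the long-lived virtual register 0 is the one that ends up spilled, and the other three share the two physical registers.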
Beyond that, most JIT compilers focused on performance of the generated code (as opposed to quick compilation) use one or more intermediate formats and optimize the programs in this form. This is basically your run of the mill compiler optimization suite, including veterans like constant propagation, value numbering, re-association, loop invariant code motion, etc. - these things are not only simple to understand and implement, they've also been described in thirty years of literature up to and including textbooks and Wikipedia.
The code you'll get with the above will be pretty good for straight-line code using primitives, arrays and object fields. However, you won't be able to optimize method calls at all. Every method is virtual, which means inlining or even moving method calls (for example out of a loop) is basically impossible except in very special cases. You mentioned that this is for a kernel. If you can accept using a subset of Java without dynamic class loading, you can do better (but it'll be nonstandard) by assuming the JIT knows all classes. Then you can, for example, detect leaf classes (or more generally methods which are never overridden) and inline those.
If you do need dynamic class loading, but expect it to be rare, you can also do better, though it takes more work. The advantage is that this approach generalizes to other things, like eliminating logging statements completely. The basic idea is specializing the code based on some assumptions (for example, that this static does not change or that no new classes are loaded), then de-optimizing if those assumptions are violated. This means you'll sometimes have to re-compile code while it is running (this is hard, but not impossible).
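A very rough sketch of the bookkeeping side of this, in C++: every method compiled under an assumption registers itself with that assumption, and invalidating the assumption switches all dependents back to their generic code. All names here (CompiledMethod, Assumption, the two scale functions) are hypothetical. The hard part, replacing code that is currently executing on some thread's stack, is deliberately not shown.

```cpp
#include <cassert>
#include <vector>

// One method with two compiled versions: a generic one that is always
// correct and a specialized one that is only correct under an assumption.
class CompiledMethod {
public:
    using Code = int (*)(int);

    CompiledMethod(Code generic, Code specialized)
        : generic_(generic), current_(specialized) {
        assert(generic != nullptr && specialized != nullptr);
    }

    int call(int arg) const { return current_(arg); }
    void deoptimize() { current_ = generic_; }
    bool isSpecialized() const { return current_ != generic_; }

private:
    Code generic_;
    Code current_;
};

// An assumption such as "no new classes have been loaded".
class Assumption {
public:
    void addDependent(CompiledMethod* method) { dependents_.push_back(method); }

    // Called by the runtime when the assumption stops being true.
    void invalidate() {
        for (CompiledMethod* method : dependents_) method->deoptimize();
        dependents_.clear();
    }

private:
    std::vector<CompiledMethod*> dependents_;
};

// Stand-ins for the two versions of the same compiled method.
int scaleGeneric(int x) { return x * 2; }      // pretend: performs a virtual call
int scaleSpecialized(int x) { return x + x; }  // pretend: callee was inlined
```

The class loader would call invalidate() on the relevant assumption before making a newly loaded class visible, so callers never observe specialized code whose assumption no longer holds.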
If you go further down this road, its logical conclusion is trace-based JIT compilation, which has been applied to Java, but AFAIK it didn't turn out to be superior to method-based JIT compilers. It's more effective when you have to make dozens or hundreds of assumptions to get good code, as it happens with highly dynamic languages.

Some comments about your JIT compiler (I hope I do not write things "delnan" already wrote):
Generic comments
I'm sure "real" JIT compilers work similarly to yours. However, you could do some optimization (example: "mov eax,nnn" and "push eax" could be replaced by "push nnn").
You should store local variables on the stack; typically "ebp" is used as local pointer:
push ebx
push ebp
sub esp, 8 // 2 variables with 4 bytes each
mov ebp, esp
// Now local variables are addressed using [ebp+0] and [ebp+4]
...
add esp, 8 // release the 2 local variables again
pop ebp
pop ebx
ret
This is necessary because functions may be recursive. Storing a variable at a fixed location (relative to EIP) would cause the variables to behave like "static" ones. (I'm assuming you do not compile a function multiple times in the case of a recursive function.)
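The effect is easy to reproduce in C++, where a "static" local is exactly such a fixed-location variable. The two functions below are made up for illustration and differ only in where "local" lives:

```cpp
#include <cassert>

// Each activation gets its own 'local' on the stack.
int sumAutomatic(int n) {
    assert(n >= 0);
    int local = n;
    if (n == 0) return 0;
    int rest = sumAutomatic(n - 1);
    return local + rest;  // 'local' still holds this activation's n
}

// All activations share one 'local' at a fixed address.
int sumStatic(int n) {
    assert(n >= 0);
    static int local;
    local = n;
    if (n == 0) return 0;
    int rest = sumStatic(n - 1);  // the recursive calls overwrite 'local'
    return local + rest;          // 'local' is 0 here, set by the innermost call
}
```

sumAutomatic(3) gives 3 + 2 + 1 = 6, while sumStatic(3) gives 0, because by the time any call adds its "local", the innermost call has already overwritten it with 0.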
Try/Catch
To implement Try/Catch your JIT compiler does not only have to look at the Java Bytecode but also at the Try/Catch information that is stored in a separate Attribute in the Java class. Try/catch can be implemented in the following way:
// push all useful registers (= the ones, that must not be destroyed)
push eax
push ebp
...
// push the "catch" pointers
push dword ptr catch_pointer
push dword ptr catch_stack
// set the "catch" pointers
mov catch_stack,esp
mov dword ptr catch_pointer, my_catch
... // some code
// Here some "throw" instruction...
push exception
jmp dword ptr catch_pointer
... //some code
// End of the "try" section: Pop all registers
pop dword ptr catch_stack
...
pop eax
...
// The "catch" block
my_catch:
pop ecx // pop the Exception from the stack
mov esp, catch_stack // restore the stack
// Now restore all registers (same as at the end of the "try" section)
pop dword ptr catch_stack
...
pop eax
push ecx // push the Exception to the stack
In a multi-threaded environment each thread requires its own catch_stack and catch_pointer variables!
Specific exception types can be handled by using "instanceof" in the following way:
try {
// some code
} catch(MyException1 ex) {
// code 1
} catch(MyException2 ex) {
// code 2
}
... is actually compiled like this ...:
try {
// some code
} catch(Throwable ex) {
if(ex instanceof MyException1) {
// code 1
}
else if(ex instanceof MyException2) {
// code 2
}
else throw(ex); // not handled!
}
Objects
A JIT compiler of a simplified Java virtual machine not supporting objects (and arrays) would be quite easy but the objects in Java make the virtual machine very complex.
Objects are simply stored as pointers to the object on the stack or in the local variables. Typically JIT compilers will be implemented like this: For each class a piece of memory exists that contains information about the class (e.g. which methods exist and at which address the assembler code of each method is located). An object is a piece of memory that contains all instance variables and a pointer to the memory containing information about the class.
"Instanceof" and "checkcast" could be implemented by looking at the pointer to the memory containing information about the class. This information may contain a list of all parent classes and implemented interfaces.
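A minimal sketch of that layout and of an "instanceof" check built on it, in C++. The struct names and fields are assumptions for illustration. Instead of a flattened list of all parents, this version walks the superclass chain and the directly implemented interfaces, which gives the same answer at the cost of a few more pointer hops.

```cpp
#include <cassert>
#include <vector>

// Per-class information block (hypothetical layout).
struct ClassInfo {
    const char* name;
    const ClassInfo* super;                    // nullptr for the root class
    std::vector<const ClassInfo*> interfaces;  // directly implemented/extended
};

// Every object starts with a pointer to its class information;
// the instance variables would follow this header.
struct ObjectHeader {
    const ClassInfo* cls;
};

// True if 'cls' is 'target' or inherits from / implements it.
bool isSubtypeOf(const ClassInfo* cls, const ClassInfo* target) {
    for (const ClassInfo* c = cls; c != nullptr; c = c->super) {
        if (c == target) return true;
        for (const ClassInfo* iface : c->interfaces) {
            if (isSubtypeOf(iface, target)) return true;
        }
    }
    return false;
}

// Java semantics: null is never an instance of anything.
bool instanceOf(const ObjectHeader* obj, const ClassInfo* target) {
    assert(target != nullptr);
    return obj != nullptr && isSubtypeOf(obj->cls, target);
}
```

"checkcast" can reuse the same check and throw a ClassCastException when it fails for a non-null reference.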
The main problem of objects however is the memory management in Java: Unlike C++ there is a "new" but no "delete". You have to track whether an object is still in use. If an object is no longer used, it must be deleted from memory and its finalizer (the closest thing Java has to a destructor) must be called.
The problems here are local variables (the same local variable may contain an object or a number) and try/catch blocks (the "catch" block must take care of the local variables and the stack (!) containing objects before restoring the stack pointer).

Assembler push rdi, pop rdi around function call

What's purpose of push rdi and pop rdi when calling function in C++?
VS2010, x64, debug, no optimizations
C++
int calc()
{
return 8 + 7;
}
Disassembly:
int calc()
{
000000013F0B1020 push rdi
return 8 + 7;
000000013F0B1022 mov eax,0Fh
}
000000013F0B1027 pop rdi
000000013F0B1028 ret

There is no purpose to it. This is a common artifact of unoptimized code. The code generator emits the push rdi instruction in anticipation of having to perform an addition. The RDI register must be preserved across function calls. But later it figures out that the addition can be performed at compile time.
Getting rid of extraneous code like this requires "peephole optimization". But that optimization isn't enabled in the Debug build. To know what the real code looks like, you have to turn on the optimizer, best done by building the Release build. It will in fact completely eliminate the function; you can prevent it from doing so with:
__declspec(noinline) int calc()
{
return 8 + 7;
}
Which produces in the Release build:
return 8 + 7;
000007F7038E1000 mov eax,0Fh
000007F7038E1005 ret

Have you heard of "caller-save" and "callee-save" registers?
Since your CPU only has a small, finite number of registers, it's usually impossible for caller/called functions to always use different registers. If a caller function and called function both want to use the same register, it means the value in the caller will have to be saved/restored before/after the call.
Saving/restoring register values can be done either by the caller or by the callee -- which one does so is a matter of convention. The benefit of "caller-save" registers is that if the caller knows it won't need the value in register XYZ after the call, it can omit the save/restore operations. The benefit of "callee-save" registers is that if the callee knows it won't modify the value in register XYZ, it can omit the save/restore operations.
I'm guessing that your compiler treats RDI as a callee-save register, but doesn't omit the unnecessary save/restore operations unless you have compiler optimizations turned on. (If someone knows this is incorrect, please post another answer!)
UPDATE: I found an article on x86 calling conventions: http://en.wikipedia.org/wiki/X86_calling_conventions
It seems to confirm that in the Microsoft x64 calling convention, which Visual Studio uses, RDI is callee-save (in the System V x86-64 convention it is caller-save instead). This doesn't explain why it isn't pushing and popping all the other callee-save registers. Maybe there is something else going on here.

May a pointer ever point to a cpu register?

I'm wondering if a pointer may point to a cpu register since in the case it may not, using reference instead of pointer where possible would give compiler opportunity to do some optimizations because the referenced object may reside in some register but an object pointed to by a pointer may not.

In general, CPU registers do not have memory addresses, though a CPU architecture could make them addressable (I'm not familiar with any - if someone knows of one, I'd appreciate a comment). However, there's no standard way in C to get the address of a register. In fact, if you mark a variable with the register storage class you aren't permitted to take that variable's address using the & operator.
The key issue is aliasing - if the compiler can determine that an object isn't aliased then it can generally perform optimizations (whether the object is accessed via a pointer or a reference). I don't think you'll get any optimization benefit using a reference over a pointer (in general anyway).
However, if you copy the object into a local variable, then the compiler can more easily determine that there's no aliasing of the local, assuming you don't pass the temporary's address around. This is a case where you can help the compiler optimize; however, if the copy operation is expensive, it might not pay off in the end.
For something that would fit in a CPU register, copying to a temp is often a good way to go - compilers are great at optimizing those to registers.
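A small illustration of the copy-to-a-local trick; the function names are made up for this example. In the first version the compiler generally has to assume that a store through "out" might change "*count" (both are pointers to int), so it reloads "*count" on each iteration. In the second version the bound lives in a local whose address is never taken, so it can stay in a register.

```cpp
#include <cassert>

// '*count' may alias an element of 'out', so it is typically reloaded
// from memory every time the loop condition is evaluated.
void fillReloading(int* out, const int* count, int value) {
    assert(out != nullptr && count != nullptr);
    for (int i = 0; i < *count; ++i) {
        out[i] = value;
    }
}

// The bound is read once into a local; nothing can alias 'n'.
void fillWithLocalCopy(int* out, const int* count, int value) {
    assert(out != nullptr && count != nullptr);
    const int n = *count;
    for (int i = 0; i < n; ++i) {
        out[i] = value;
    }
}
```

Note that the two versions are only equivalent when "count" does not actually point into "out"; making the copy is how you tell the compiler that you rely on that.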

When a reference is passed to a function, the compiler will probably implement it as a hidden pointer - so changing the type won't matter.
When a reference is created and used locally, the compiler may be smart enough to know what it refers to and treat it as an alias to the referenced variable. If the variable is optimized to a register, the compiler would know that the reference is also that same register.
A pointer will always need to point to a memory location. Even on the odd architecture that gives memory locations to its registers, it seems unlikely that the compiler would support such an operation.
Edit: As an example, here is the generated code from Microsoft C++ with optimizations on. The code for the pointer and a passed reference are identical. The parameter passed by value for some reason did not end up in a register, even when I rearranged the parameter list. Even so, once the value was copied to a register both the local variable and the local reference used the same register without reloading it.
void __fastcall test(int i, int * ptr, int & ref)
{
_i$ = 8 ; size = 4
_ref$ = 12 ; size = 4
?test@@YIXHPAHAAH@Z PROC ; test, COMDAT
; _ptr$ = ecx
; 8 : global_int1 += *ptr;
mov edx, DWORD PTR [ecx]
; 9 :
; 10 : global_int2 += ref;
mov ecx, DWORD PTR _ref$[esp-4]
mov eax, DWORD PTR _i$[esp-4]
add DWORD PTR ?global_int1@@3HA, edx ; global_int1
mov edx, DWORD PTR [ecx]
add DWORD PTR ?global_int2@@3HA, edx ; global_int2
; 11 :
; 12 : int & ref2 = i;
; 13 : global_int3 += ref2;
add DWORD PTR ?global_int3@@3HA, eax ; global_int3
; 14 :
; 15 : global_int4 += i;
add DWORD PTR ?global_int4@@3HA, eax ; global_int4

I think what you meant to ask is whether an integral value referred to by a reference can reside in a register.
Usually, most compilers treat references the same way as pointers. That is to say, references are just pointers with special "dereference" semantics built in. So, sadly, there is usually no such optimization, unlike with integral values that can fit into registers. The only difference between a reference and a pointer is that a reference must refer to a valid object (though the compiler does not enforce this), whereas a pointer can be NULL.

In many (if not most or all) implementations a reference is implemented internally as a pointer. So I think that doing it via a pointer or a reference is pretty much irrelevant for an optimizer.

I would say generally not. As mentioned in an above comment there are some processors where you can address a register in memory space, but that is probably a bad idea (unless the chip was designed for you to program it that way).
What actually happens is closer to the opposite of what you are asking. The optimizer can see what you are doing with a pointer and what it points to and, depending on the architecture, may not actually use a register for the pointer and a register to hold what it points to, but may, for example, hardcode the address into the instruction using no registers at all. It may load the value pointed to into a register but not use a register for the address, or not use it longer than it takes to get the value. Sometimes it is not that efficient: it may save the value in a register to RAM just so it can read it back into a register using its address, when changing the code would avoid that two-step. It depends heavily on the program/code, the instruction set and the compiler.
So instead of trying to address the register to get some optimization, know the compiler and target and know when it is better to use pointers or arrays or values, etc. Some constructs work well on most processors and some only work well on one but badly on others.

Pointers point to memory locations. So it is not possible to access CPU registers using pointers. References are a less powerful version of pointers (you can't perform arithmetic on references). However, compilers generally put variables into registers to perform operations. For example, the compiler may put a loop counter into one of the CPU registers for quick access. Or it may put function parameters that don't take much space in registers. There is a keyword in C that you can use to request that the compiler put a certain variable into a CPU register. The keyword is register:
for (int i = 0; i < I; i++)
for (int j = 0; j < J; j++)
for (register int k = 0; k < K; k++)
{
// to do
}

Michael Burr is correct. CPU registers do not have memory addresses.
