Get the "size" (length) of a C++ function? [duplicate] - c++

Possible Duplicate:
How to get the length of a function in bytes?
I'm making a Hooking program that will be used to insert a method into the specified section of memory.
I need to get the length of a local C++ function, I've used a cast to get the location of a function, but how would I get the length?
would
int GetFuncLen()
{
    int i = 0;
    while ((DWORD*)Function + i < max)
    {
        if ((DWORD*)Function + i == 0xC3)
        {
            return i;
        }
        i++;
    }
}
work?

Your code seems to be operating system, compiler, and machine architecture specific.
(I know nothing about Windows)
It could be wrong if max is not defined.
It is operating-system specific (probably Windows only) because DWORD is not a standard C++ type. You could use intptr_t (from the <cstdint> header).
Your code is compiler specific, because you assume that every compiled function has a well-defined, unique end and doesn't share any code with other functions. (Some compilers can do such optimizations, e.g. making two functions share a common epilogue or code chunk, using jump instructions.)
Your code is machine specific, because you assume that the last instruction would be a RET coded 0xC3 and this is specific to x86 & x86-64 (won't work on Alpha or ARM, on which Windows is rumored to have been or to be ported). Also, that byte could appear inside other instructions or inlined constants (as Mat commented).
I am not sure that the notion of where a binary function ends has a well-defined meaning. But if it does, I would expect the linker to know about it. On some systems, for example on Linux with ELF executables, the compiler and the linker produce the size of each function.
Perhaps what you really need is to find the symbol nearest to a given address. I don't know if Windows has such functionality (on Linux, the dladdr GNU function from <dlfcn.h> could be useful). Perhaps your operating system provides an equivalent?
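On that note, a hedged sketch (glibc-specific, and not part of the original answer): dladdr1 with RTLD_DL_SYMENT exposes the ELF symbol entry for an address, and its st_size field holds the function size the toolchain recorded. Build with g++ -rdynamic file.cpp -ldl so the symbol is visible to the dynamic linker:

#include <dlfcn.h>   // dladdr1, RTLD_DL_SYMENT (glibc extension, needs _GNU_SOURCE)
#include <link.h>    // ElfW
#include <cstdio>

int target() { return 42; }   // the function whose size we query

int main() {
    Dl_info info;
    ElfW(Sym) *sym = nullptr;
    if (dladdr1(reinterpret_cast<void *>(&target), &info,
                reinterpret_cast<void **>(&sym), RTLD_DL_SYMENT) && sym)
        std::printf("%s is %llu bytes\n", info.dli_sname,
                    static_cast<unsigned long long>(sym->st_size));
    return 0;
}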

No. For a few reasons.
1) 0xC3 is only a 'ret' instruction if it appears at a point where a new instruction is expected. There can easily be other instructions that include a 0xC3 byte within their operands, and you'd only get part of the code (see the byte dump after this list).
2) There can be multiple 'ret' instructions in any given function, depending on the compiler and its settings. Again, you'd only get part of the function.
3) Functions often use constructs like "switch" statements, that use "jump tables" that are located AFTER the ret instruction. Again, you'd only get part of the function.
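To illustrate point 1 with a hedged example (not from the original answer): the ret byte also appears inside the immediate of mov eax, 0xC3, which encodes as B8 C3 00 00 00, so a byte scan would stop in the middle of that instruction:

// x86 machine code for "mov eax, 0xC3": opcode B8 followed by the
// little-endian 32-bit immediate; the second byte is 0xC3 but is not a 'ret'.
const unsigned char mov_eax_0xC3[] = { 0xB8, 0xC3, 0x00, 0x00, 0x00 };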
And what you're trying to do is not likely to work anyway.
The biggest problem is that various assembly instructions will often reference specific areas of memory by using offsets rather than absolute addresses. So while extremely minimal functions might work, any functions that call out into other functions will likely fail.
Assuming you're trying to load these functions into an external process, and you're trying to do this on Windows, a better solution is to use DLL injection to load a DLL into your target process.
If you really need to inject the memory, then you'll need an assembly-language parser for your particular platform to update all of the memory addresses in the relevant instructions, which is a very complex task. Or you could write your functions in assembly language and make sure you only use relative offsets to reference parts of your own code, which is a bit easier, but more limiting in what you can do.

You could force your function to be put in a section all by itself (see eg http://msdn.microsoft.com/en-us/library/s20kdbse(v=VS.71).aspx).
I think that if you define a section, declare a variable in it, define your function in it, and then define another variable in it, the addresses of the two variables will bracket your function.
Better still, put the two variables and the function in separate sections and then use section merging to control the order in which they appear in the resulting code (see How to refer to the start of a user-defined segment in a Visual Studio project?).
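A hedged, untested sketch of that idea (MSVC-specific; all section names invented): the linker merges sections whose names share the part before the $ and orders the pieces alphabetically by suffix, so the two marker variables should end up bracketing the function in the final image:

#include <cstddef>

#pragma section("hook$a", read)
#pragma section("hook$c", read)

__declspec(allocate("hook$a")) const unsigned char func_begin = 0;

#pragma code_seg("hook$b")          // the function lands between the markers
static int HookedFunction(int x) { return x + 1; }
#pragma code_seg()

__declspec(allocate("hook$c")) const unsigned char func_end = 0;

// Approximate length, including any padding between the markers:
static const size_t kFuncLen = static_cast<size_t>(&func_end - &func_begin);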
As others have pointed out you probably can't do anything useful with this, and it's not at all portable.

The only reliable way to do this is: compile your code with a dummy number for the length of the function (but don't run it), disassemble it and calculate the length by hand, then substitute that number for the dummy one and recompile your program.
When I needed to do this, I just made a guess as to how big the function should be. As long as your guess is not too small (and not way, way too big) you should have no problems.

You can use objdump to get the size of objects with external linkage. Otherwise, you can take the assembly output of the compiler (gcc -S, for example) and read through it; you'll see what names the size expressions get:
        .file   "test.cpp"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    %edi, -4(%rbp)
        movq    %rsi, -16(%rbp)
        movl    $0, %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu 4.6.0-3~ppa1) 4.6.1 20110409 (prerelease)"
        .section .note.GNU-stack,"",@progbits
See the .size main, .-main directive: it computes the function's size as the difference between the current location (.) and the start of main.

Related

Using base pointer register in C++ inline asm

I want to be able to use the base pointer register (%rbp) within inline asm. A toy example of this is like so:
void Foo(int &x)
{
    asm volatile ("pushq %%rbp;"          // 'prologue'
                  "movq %%rsp, %%rbp;"    // 'prologue'
                  "subq $12, %%rsp;"      // make room
                  "movl $5, -12(%%rbp);"  // some asm instruction
                  "movq %%rbp, %%rsp;"    // 'epilogue'
                  "popq %%rbp;"           // 'epilogue'
                  : : : );
    x = 5;
}

int main()
{
    int x;
    Foo(x);
    return 0;
}
I hoped that, since I am using the usual prologue/epilogue function-calling method of pushing and popping the old %rbp, this would be ok. However, it seg faults when I try to access x after the inline asm.
The GCC-generated assembly code (slightly stripped-down) is:
_Foo:
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -8(%rbp)
        # INLINEASM
        pushq   %rbp;           // prologue
        movq    %rsp, %rbp;     // prologue
        subq    $12, %rsp;      // make room
        movl    $5, -12(%rbp);  // some asm instruction
        movq    %rbp, %rsp;     // epilogue
        popq    %rbp;           // epilogue
        # /INLINEASM
        movq    -8(%rbp), %rax
        movl    $5, (%rax)      // x=5;
        popq    %rbp
        ret
main:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        leaq    -4(%rbp), %rax
        movq    %rax, %rdi
        call    _Foo
        movl    $0, %eax
        leave
        ret
Can anyone tell me why this seg faults? It seems that I somehow corrupt %rbp but I don't see how. Thanks in advance.
I'm running GCC 4.8.4 on 64-bit Ubuntu 14.04.
See the bottom of this answer for a collection of links to other inline-asm Q&As.
Your code is broken because you step on the red-zone below RSP (with push) where GCC was keeping a value.
What are you hoping to accomplish with inline asm? If you want to learn inline asm, learn to use it to make efficient code, rather than horrible stuff like this. If you want to write function prologues and push/pop to save/restore registers, you should write whole functions in asm. (Then you can easily use nasm or yasm, rather than the less-preferred-by-most AT&T syntax with GNU assembler directives [1].)
GNU inline asm is hard to use, but allows you to mix custom asm fragments into C and C++ while letting the compiler handle register allocation and any saving/restoring if necessary. Sometimes the compiler will be able to avoid the save and restore by giving you a register that's allowed to be clobbered. Without volatile, it can even hoist asm statements out of loops when the input would be the same. (i.e. unless you use volatile, the outputs are assumed to be a "pure" function of the inputs.)
If you're just trying to learn asm in the first place, GNU inline asm is a terrible choice. You have to fully understand almost everything that's going on with the asm, and understand what the compiler needs to know, to write correct input/output constraints and get everything right. Mistakes will lead to clobbering things and hard-to-debug breakage. The function-call ABI is a much simpler and easier to keep track of boundary between your code and the compiler's code.
Why this breaks
You compiled with -O0, so gcc's code spills the function parameter from %rdi to a location on the stack. (This could happen in a non-trivial function even with -O3).
Since the target ABI is the x86-64 SysV ABI, it uses the "Red Zone" (128 bytes below %rsp that even asynchronous signal handlers aren't allowed to clobber), instead of wasting an instruction decrementing the stack pointer to reserve space.
It stores the 8B pointer function arg at -8(rsp_at_function_entry). Then your inline asm pushes %rbp, which decrements %rsp by 8 and then writes there, clobbering the low 32b of &x (the pointer).
When your inline asm is done:
gcc reloads -8(%rbp) (which has been overwritten with %rbp) and uses it as the address for a 4B store.
Foo returns to main with %rbp = (upper32)|5 (the original value with the low 32 bits set to 5).
main runs leave: %rsp = (upper32)|5.
main runs ret with %rsp = (upper32)|5, reading the return address from virtual address (void*)(upper32|5), which from your comment is 0x7fff0000000d.
I didn't check with a debugger; one of those steps might be slightly off, but the problem is definitely that you clobber the red zone, leading to gcc's code trashing the stack.
Even adding a "memory" clobber doesn't get gcc to avoid using the red zone, so it looks like allocating your own stack memory from inline asm is just a bad idea. (A memory clobber means you might have written some memory you're allowed to write to, e.g. a global variable or something pointed-to by a global, not that you might have overwritten something you're not supposed to.)
If you want to use scratch space from inline asm, you should probably declare an array as a local variable and use it as an output-only operand (which you never read from).
AFAIK, there's no syntax for declaring that you modify the red-zone, so your only options are:
use an "=m" output operand (possibly an array) for scratch space; the compiler will probably fill in that operand with an addressing mode relative to RBP or RSP. You can index into it with constants like 4 + %[tmp] or whatever. You might get an assembler warning from 4 + (%rsp) but not an error (a sketch follows this list).
skip over the red-zone with add $-128, %rsp / sub $-128, %rsp around your code. (Necessary if you want to use an unknown amount of extra stack space, e.g. push in a loop, or making a function call. Yet another reason to deref a function pointer in pure C, not inline asm.)
compile with -mno-red-zone (I don't think you can enable that on a per-function basis, only per-file)
Don't use scratch space in the first place. Tell the compiler what registers you clobber and let it save them.
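For what it's worth, here is a minimal sketch of the first option above (my code, not the answer's): the "=m" operands make the compiler allocate the slots, so the asm never has to touch the red zone itself. The answer also describes indexing a single array operand with constants like 4 + %[tmp].

void scratch_demo() {
    long scratch[2];
    asm volatile ("movq $5, %0\n\t"   // write the first compiler-chosen slot
                  "movq $6, %1\n\t"   // write the second slot
                  : "=m" (scratch[0]), "=m" (scratch[1]));
}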
Here's what you should have done:
void Bar(int &x)
{
    int tmp;
    long tmplong;
    asm ("lea -16 + %[mem1], %%rbp\n\t"
         "imul $10, %%rbp, %q[reg1]\n\t"  // q modifier: 64bit name.
         "add  %k[reg1], %k[reg1]\n\t"    // k modifier: 32bit name
         "movl $5, %[mem1]\n\t"           // some asm instruction writing to mem
         : [mem1] "=m" (tmp), [reg1] "=r" (tmplong)  // tmp vars -> tmp regs / mem for use inside asm
         :
         : "%rbp"  // tell compiler it needs to save/restore %rbp.
                   // gcc refuses to let you clobber %rbp with -fno-omit-frame-pointer (the default at -O0)
                   // clang lets you, but memory operands still use an offset from %rbp, which will crash!
                   // gcc memory operands still reference %rsp, so don't modify it. Declaring a clobber on %rsp does nothing
    );
    x = 5;
}
Note the push/pop of %rbp in the code outside the #APP / #NO_APP section, emitted by gcc. Also note that the scratch memory it gives you is in the red zone. If you compile with -O0, you'll see that it's at a different position from where it spills &x.
To get more scratch regs, it's better to just declare more output operands that are never used by the surrounding non-asm code. That leaves register allocation to the compiler, so it can be different when inlined into different places. Choosing ahead of time and declaring a clobber only makes sense if you need to use a specific register (e.g. shift count in %cl). Of course, an input constraint like "c" (count) gets gcc to put the count in rcx/ecx/cx/cl, so you don't emit a potentially redundant mov %[count], %%ecx.
If this looks too complicated, don't use inline asm. Either lead the compiler to the asm you want with C that's like the optimal asm, or write a whole function in asm.
When using inline asm, keep it as small as possible: ideally just the one or two instructions that gcc isn't emitting on its own, with input/output constraints to tell it how to get data into / out of the asm statement. This is what it's designed for.
Rule of thumb: if your GNU C inline asm start or ends with a mov, you're usually doing it wrong and should have used a constraint instead.
Footnotes:
[1] You can use GAS's Intel syntax in inline asm by building with -masm=intel (in which case your code will only work with that option), or by using dialect alternatives so it works whether the compiler emits Intel or AT&T asm output syntax. But that doesn't change the directives, and GAS's Intel syntax is not well documented. (It's like MASM, not NASM, though.) I don't really recommend it unless you really hate AT&T syntax.
Inline asm links:
x86 wiki. (The tag wiki also links to this question, for this collection of links)
The inline-assembly tag wiki
The manual. Read this. Note that inline asm was designed to wrap single instructions that the compiler doesn't normally emit. That's why it's worded to say things like "the instruction", not "the block of code".
A tutorial
Looping over arrays with inline assembly Using r constraints for pointers/indices and using your choice of addressing mode, vs. using m constraints to let gcc choose between incrementing pointers vs. indexing arrays.
How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (pointer inputs in registers do not imply that the pointed-to memory is read and/or written, so it might not be in sync if you don't tell the compiler).
In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?. Using %q0 to get %rax vs. %w0 to get %ax. Using %g[scalar] to get %zmm0 instead of %xmm0.
Efficient 128-bit addition using carry flag Stephen Canon's answer explains a case where an early-clobber declaration is needed on a read+write operand. Also note that x86/x86-64 inline asm doesn't need to declare a "cc" clobber (the condition codes, aka flags); it's implicit. (gcc6 introduces syntax for using flag conditions as input/output operands. Before that you have to setcc a register that gcc will emit code to test, which is obviously worse.)
Questions about the performance of different implementations of strlen: my answer on a question with some badly-used inline asm, with an answer similar to this one.
llvm reports: unsupported inline asm: input with type 'void *' matching output with type 'int': Using offsetable memory operands (in x86, all effective addresses are offsettable: you can always add a displacement).
When not to use inline asm, with an example of 32b/32b => 32b division and remainder that the compiler can already do with a single div. (The code in the question is an example of how not to use inline asm: many instructions for setup and save/restore that should be left to the compiler by writing proper in/out constraints.)
MSVC inline asm vs. GNU C inline asm for wrapping a single instruction, with a correct example of inline asm for 64b/32b=>32bit division. MSVC's design and syntax require a round trip through memory for inputs and outputs, making it terrible for short functions. It's also "never very reliable" according to Ross Ridge's comment on that answer.
Using x87 floating point, and commutative operands. Not a great example, because I didn't find a way to get gcc to emit ideal code.
Some of those re-iterate some of the same stuff I explained here. I didn't re-read them to try to avoid redundancy, sorry.
In x86-64, the stack pointer needs to stay properly aligned (push and pop work in 8-byte units, and the ABI requires 16-byte alignment at function-call boundaries).
This:
subq $12, %rsp; // make room
should be:
subq $16, %rsp; // make room

VM interpreter - weighting performance benefits and drawbacks of larger instruction set / dispatch loop

I am developing a simple VM and I am in the middle of a crossroad.
My initial goal was to use byte long instruction, and therefore a small loop and a quick computed goto dispatch.
However, it turns out reality could not be further from that: 256 opcodes are nowhere near enough to cover signed and unsigned 8-, 16-, 32- and 64-bit integers, floats and doubles, pointer operations, and the different combinations of addressing. One option was to not implement bytes and shorts, but the goal is to make a VM that supports the full C subset as well as vector operations, since those are pretty much everywhere anyway, albeit in different implementations.
So I switched to 16-bit instructions; now I am also able to add portable SIMD intrinsics and more compiled common routines that really save on performance by not being interpreted. There is also caching of global addresses: they are initially compiled as base-pointer offsets, and the first time an address is used, the offset and instruction are simply overwritten so that the next use is a direct jump, at the cost of an extra instruction in the set for each use of a global by an instruction.
Since I am not yet at the profiling stage, I am in a dilemma: are the extra instructions worth the added flexibility? Will the presence of more instructions, and therefore the absence of back-and-forth copying instructions, make up for the increased dispatch-loop size? Keep in mind each instruction is just a few assembly instructions, e.g.:
        .globl  __Z20assign_i8u_reg8_imm8v
        .def    __Z20assign_i8u_reg8_imm8v;  .scl 2;  .type 32;  .endef
__Z20assign_i8u_reg8_imm8v:
LFB13:
        .cfi_startproc
        movl    _ip, %eax
        movb    3(%eax), %cl
        movzbl  2(%eax), %eax
        movl    _sp, %edx
        movb    %cl, (%edx,%eax)
        addl    $4, _ip
        ret
        .cfi_endproc
LFE13:
        .p2align 2,,3
        .globl  __Z18assign_i8u_reg_regv
        .def    __Z18assign_i8u_reg_regv;  .scl 2;  .type 32;  .endef
__Z18assign_i8u_reg_regv:
LFB14:
        .cfi_startproc
        movl    _ip, %edx
        movl    _sp, %eax
        movzbl  3(%edx), %ecx
        movb    (%ecx,%eax), %cl
        movzbl  2(%edx), %edx
        movb    %cl, (%eax,%edx)
        addl    $4, _ip
        ret
        .cfi_endproc
LFE14:
        .p2align 2,,3
        .globl  __Z24assign_i8u_reg_globCachev
        .def    __Z24assign_i8u_reg_globCachev;  .scl 2;  .type 32;  .endef
__Z24assign_i8u_reg_globCachev:
LFB15:
        .cfi_startproc
        movl    _ip, %eax
        movl    _sp, %edx
        movl    4(%eax), %ecx
        addl    %edx, %ecx
        movl    %ecx, 4(%eax)
        movb    (%ecx), %cl
        movzwl  2(%eax), %eax
        movb    %cl, (%eax,%edx)
        addl    $8, _ip
        ret
        .cfi_endproc
LFE15:
        .p2align 2,,3
        .globl  __Z19assign_i8u_reg_globv
        .def    __Z19assign_i8u_reg_globv;  .scl 2;  .type 32;  .endef
__Z19assign_i8u_reg_globv:
LFB16:
        .cfi_startproc
        movl    _ip, %eax
        movl    4(%eax), %edx
        movb    (%edx), %cl
        movzwl  2(%eax), %eax
        movl    _sp, %edx
        movb    %cl, (%edx,%eax)
        addl    $8, _ip
        ret
        .cfi_endproc
This example contains the instructions to:
assign unsigned byte from immediate value to register
assign unsigned byte from register to register
assign unsigned byte from global offset to register, and cache the address and patch to the direct instruction
assign unsigned byte from global offset to register (the now cached previous version)
... and so on...
Naturally, when I produce a compiler for it, I will be able to test the instruction flow in production code and optimize the arrangement of the instructions in memory to pack together the frequently used ones and get more cache hits.
I just have a hard time deciding whether such a strategy is a good idea: the flexibility may justify the bloat, but what about performance? Will more compiled routines make up for a larger dispatch loop? Is it worth caching global addresses?
I would also like someone decent at assembly to give an opinion on the quality of the code generated by GCC: are there any obvious inefficiencies and room for optimization? To make the situation clear: sp points to the stack that implements the registers (there is no other stack), ip is logically the current instruction pointer, and gp is the global pointer (not referenced directly, but accessed as an offset).
EDIT: Also, this is the basic format I am implementing the instructions in:
INSTRUCTION assign_i8u_reg16_glob() { // assign unsigned byte to reg from global offset
    FETCH(globalAddressCache);
    REG(quint8, i.d16_1) = GLOB(quint8);
    INC(globalAddressCache);
}
FETCH returns a reference to the struct the instruction uses, selected by the opcode.
REG returns a reference to a register value of type T at the given offset.
GLOB returns a reference to a global value from a cached global offset (effectively an absolute address).
INC just increments the instruction pointer by the size of the instruction.
Some people will probably advise against the use of macros, but with templates it is much less readable. This way the code is pretty obvious.
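To make that concrete, here is a hypothetical sketch (all names and layouts invented; the post doesn't show the real definitions) of how such macros might expand, assuming ip walks the bytecode and sp is the register-stack base:

typedef unsigned char quint8;        // Qt-style fixed-width alias

extern char *ip;                     // current instruction pointer
extern char *sp;                     // register stack base

struct globalAddressCache {          // per-instruction operand struct
    unsigned short opcode;
    unsigned short d16_1;            // destination register offset
    char          *cachedAddress;    // patched-in absolute address
};

#define FETCH(T)    T &i = *reinterpret_cast<T *>(ip)
#define REG(T, off) (*reinterpret_cast<T *>(sp + (off)))
#define GLOB(T)     (*reinterpret_cast<T *>(i.cachedAddress))
#define INC(T)      (ip += sizeof(T))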
EDIT: I would like to add a few points to the question:
I could go for a "register operations only" solution, which can only move data between registers and "memory", be that global or heap. In this case, every global and heap access has to copy the value, modify or use it, and move it back to update. This way I have a shorter dispatch loop, but a few extra instructions for each instruction that addresses non-register data. So the dilemma is: a few times more native code with longer direct jumps, or a few times more interpreted instructions with a shorter dispatch loop? Will a short dispatch loop give me enough performance to make up for the extra, costly memory operations? Maybe the delta between the shorter and longer dispatch loops is not enough to make a real difference, in terms of cache hits and the cost of the assembly jumps.
I could go for additional decoding and only 8-bit-wide instructions; however, this may add another jump: jump to wherever the instruction is handled, then waste time either jumping to the case where the particular addressing scheme is handled, or decoding operations with a more complex execution method. In the first case, the dispatch loop still grows, plus yet another jump is added. In the second option, register operations can be used to decode the addressing, but a more complex instruction with more compile-time unknowns will be needed in order to address anything. I am not really sure how this will stack up against a shorter dispatch loop; once again, I'm uncertain how my "shorter and longer dispatch loop" relates to what is considered short or long in terms of assembly instructions, the memory they need, and the speed of their execution.
I could go for the "many instructions" solution: the dispatch loop is a few times larger, but it still uses pre-computed direct jumping. Complex addressing is specific to and optimized for each instruction and compiled to native code, so the extra memory operations that would be needed by the "register only" approach will be compiled and mostly executed in registers, which is good for performance. Generally, the idea is to add more to the instruction set, but also to add to the amount of work that can be compiled in advance and done in a single "instruction". The longer instruction set also means a longer dispatch loop, longer jumps (although that can be optimized to minimize them), and fewer cache hits, but the question is BY HOW MUCH? Considering every "instruction" is just a few assembly instructions, is an assembly snippet of about 7-8k instructions considered normal, or too much? Considering the average instruction size is around 2-3 bytes, this should be no more than about 20k of memory, enough to fit completely in most L1 caches. But this is not concrete math, just stuff I found googling around, so maybe my "calculations" are off? Or maybe it doesn't work that way? I am not that experienced with caching mechanisms.
To me, as I currently weigh the arguments, the "many instructions" approach appears to have the best chance of top performance, provided of course that my theory about fitting the "extended dispatch loop" in the L1 cache holds. So here is where your expertise and experience come into play. Now that the context is narrowed and a few supporting ideas are presented, maybe it will be easier to give a more concrete answer on whether the benefits of a larger instruction set outweigh the size increase of the native code by decreasing the amount of slower, interpreted code.
My instruction size data is based on those stats.
You might want to consider separating the VM ISA and its implementation.
For instance, in a VM I wrote I had a "load value direct" instruction. The next value in the instruction stream wasn't decoded as an instruction, but loaded as a value into a register. You can consider this one macro instruction or two separate values.
Another instruction I implemented was a "load constant value", which loaded a constant from memory (using a base address for the table of constants and an offset). A common pattern in the instruction stream was therefore load value direct (index); load constant value. Your VM implementation may recognize this pattern and handle the pair with a single optimized implementation.
Obviously, if you have enough bits, you can use some of them to identify a register. With 8 bits it may be necessary to have a single register for all operations. But again, you could add another instruction with register X which modifies the next operation. In your C++ code, that instruction would merely set the currentRegister pointer which the other instructions use.
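As a hedged illustration (my code, not the answer's), the "with register X" prefix could be as simple as a handler that retargets a currentRegister pointer used by subsequent operations:

#include <cstdint>

int64_t  registers[16];
int64_t *currentRegister = &registers[0];   // default register

// The prefix instruction: all it does is select the register that the
// following operations will implicitly use.
void op_with_register(uint8_t x) { currentRegister = &registers[x]; }

// An ordinary operation that acts on the currently selected register.
void op_add_imm(int64_t v) { *currentRegister += v; }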
Will more compiled routines make up for a larger dispatch loop?
I take it you didn't fancy having single byte instructions with a second byte of extra opcode for certain instructions? I think a decode for 16-bit opcodes may be less efficient than 8-bit + extra byte(s), assuming the extra byte(s) aren't too common or too difficult to decode in themselves.
If it was me, I'd work on getting the compiler (not necessarily a full-fledged compiler with "everything", but a basic model) going with a fairly limited set of "instructions". Keep the code generation part fairly flexible so that it'll be easy to alter the actual encoding later. Once you have that working, you can experiment with various encodings and see what the result is in performance, and other aspects.
A lot of your minor question points are very hard to answer for anyone that hasn't done both of the choices. I have never written a VM in this sense, but I have worked on several disassemblers, instruction set simulators and such things. I have also implemented a couple of languages of different kinds, in terms of interpreted languages.
You probably also want to consider a JIT approach, where instead of loading bytecode, you interpret the bytecode and produce direct machine code for the architecture in question.
The GCC code doesn't look terrible, but there are several places where code depends on the value of the immediately preceding instruction - which is not great in modern processors. Unfortunately, I don't see any solution to that - it's a "too short code to shuffle things around" problem - adding more instructions obviously won't work.
I do see one little problem: loading a 32-bit constant will require that it's 32-bit aligned for best performance. I have no idea how (or if) Java VMs deal with that.
I think you are asking the wrong question, and not because it is a bad question, on the contrary, it is an interesting subject and I suspect many people are interested in the results just as I am.
However, so far no one is sharing similar experience, so I guess you may have to do some pioneering. Instead of wondering which approach to use and wasting time on the implementation of boilerplate code, focus on creating a "reflection" component that describes the structure and properties of the language. Create a nice polymorphic structure with virtual methods, without worrying about performance, and create modular components you can assemble during runtime; there is even the option to use a declarative language once you have established the object hierarchy. Since you appear to use Qt, you have half the work cut out for you. Then you can use the tree structure to analyze and generate a variety of different code: C code to compile, or bytecode for a specific VM implementation, of which you can create multiple. You can even use it to programmatically generate the C code for your VM instead of typing it all by hand.
I think this set of advices will be more beneficial in case you resort to pioneering on the subject without a concrete answer in advance, it will allow you to easily test out all the scenarios and make your mind based on actual performance rather than personal assumptions and those of others. Then maybe you can share the results and answer your question with performance data.
Instruction lengths in bytes have been handled the same way for quite a while. Obviously, being limited to 256 instructions isn't a good thing when there are so many types of operations you wish to perform.
This is why there is a prefix value. Back in the Game Boy architecture, there wasn't enough room to include the needed 256 bit-manipulation instructions, so one opcode was used as a prefix instruction. This kept the original 256 opcodes and added 256 more for instructions starting with that prefix byte.
For example:
One operation might look like this: D6 FF = SUB A, 0xFF
But a prefixed instruction would be presented as: CB D6 = SET 2, (HL)
If the processor reads CB, it immediately starts looking in the second set of 256 opcodes.
The same goes for the x86 architecture today, where any instruction prefixed with 0F is, essentially, part of another instruction set.
With the sort of execution you're using for your emulator, this is the best way of extending your instruction set: 16-bit opcodes would take up far more space than necessary, while a prefix byte keeps decoding down to a single extra lookup.
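A minimal sketch of that scheme (not the asker's code; the 0xCB escape byte is borrowed from the Game Boy example above):

#include <cstdint>

using Handler = void (*)();
static void nop() {}              // placeholder for real handlers

static Handler primary[256];      // ordinary opcodes; one slot is the prefix
static Handler extended[256];     // reachable only through the prefix

void dispatch(const uint8_t *&pc) {
    const uint8_t op = *pc++;
    if (op == 0xCB)               // escape byte: fetch a second opcode
        extended[*pc++]();
    else
        primary[op]();
}

int main() {
    for (int j = 0; j < 256; ++j) primary[j] = extended[j] = nop;
    const uint8_t program[] = { 0x00, 0xCB, 0xD6 };  // plain op; prefixed op
    const uint8_t *pc = program;
    dispatch(pc);                 // runs primary[0x00]
    dispatch(pc);                 // runs extended[0xD6]
}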
One thing you should decide is what balance you wish to strike between code-file size efficiency, cache efficiency, and raw execution speed. Depending on the coding patterns of the code you're interpreting, it may be helpful to translate each instruction, regardless of its length in the code file, into a structure containing a pointer and an integer. The pointer points to a function that takes a pointer to the instruction-info structure as well as the execution context. The main execution loop would thus be something like:
do
{
    pc = pc->func(pc, &context);
} while (pc);
the function associated with an "add short immediate instruction" would be something like:
INSTRUCTION *add_short_imm(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    context->op_stack[0].asInt64 += pc->operand;
    return pc + 1;
}
while "add long immediate" would be:
INSTRUCTION *add_long_imm(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    context->op_stack[0].asInt64 += (uint32_t)pc->operand + ((int64_t)pc[1].operand << 32);
    return pc + 2;
}
and the function associated with an "add local" instruction would be:
INSTRUCTION *add_local(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    CONTEXT_ITEM *op_stack = context->op_stack;
    op_stack[0].asInt64 += op_stack[pc->operand].asInt64;
    return pc + 1;
}
Your "executables" would consist of a compressed bytecode format, but they would then get translated into a table of such instruction structures, eliminating a level of indirection when decoding instructions at run time.
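For reference, a hedged sketch of the types the snippets above assume (field names inferred from their usage; the answer leaves them implicit):

#include <cstdint>

union CONTEXT_ITEM {
    int64_t asInt64;
    double  asDouble;
};

struct EXECUTION_CONTEXT {
    CONTEXT_ITEM op_stack[64];    // operand stack; top of stack at index 0
};

struct INSTRUCTION {
    // Each handler returns the next instruction to execute, or NULL to halt.
    INSTRUCTION *(*func)(INSTRUCTION *pc, EXECUTION_CONTEXT *context);
    int64_t operand;              // immediate value, local index, or half-word
};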

Where is the one to one correlation between the assembly and cpp code?

I tried to examine what this code becomes in assembly:
int main(){
    if (0){
        int x = 2;
        x++;
    }
    return 0;
}
I was also wondering what the if (0) ends up doing.
I used the shell command g++ -S helloWorld.cpp in Linux
and got this code:
        .file   "helloWorld.cpp"
        .text
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    $0, %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
        .section .note.GNU-stack,"",@progbits
I expected the assembly to contain some JZ, but where is it?
How can I compile the code without optimization?
There is no direct, guaranteed relationship between C++ source code and the generated assembler. The C++ source code defines a certain semantics, and the compiler outputs machine code which will implement the observable behavior of those semantics. How the compiler does this, and the actual code it outputs, can vary enormously, even over the same underlying hardware; I would be very disappointed in a compiler which generated code which compared 0 with 0, and then did a conditional jump if the results were equal, regardless of what the C++ source code was.

In your example, the only observable behavior in your code is to return 0 to the OS. Anything the compiler generates must do this (and have no other observable behavior). The code you show isn't optimal for this:

        xorl    %eax, %eax
        ret

is really all that is needed. But of course, the compiler is free to generate a lot more if it wants. (Your code, for example, sets up a frame to support local variables, even though there aren't any. Many compilers do this systematically, because most debuggers expect it, and get confused if there is no frame.)

With regards to optimization, this depends on the compiler. With g++, -O0 (that's the letter O followed by the number zero) turns off all optimization. This is the default, however, so it is effectively what you are seeing. In addition to having several different levels of optimization, g++ supports turning individual optimizations off or on. You might want to look at the complete list: http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Optimize-Options.html#Optimize-Options.
The compiler eliminates that code as dead code, i.e. code that will never run. What you're left with is establishing the stack frame and setting the return value of the function. if(0) is never true, after all. If you want to get a JZ, then you should probably do something like if(variable == 0), as in the example below. Keep in mind that the compiler is in no way required to actually emit the JZ instruction; it may use any other means to achieve the same thing. Compiling a high-level language to assembly is very rarely a clear, one-to-one correlation.
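For instance, here is a hedged example (not from the original post) of a condition the compiler cannot fold away, so g++ -S -O0 emits a compare followed by a conditional jump (je or jne, depending on how the branch is arranged):

int f(int variable) {
    if (variable == 0)
        return 1;       // the branch now depends on a runtime value
    return 0;
}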
The code has probably been optimized.
if (0){
    int x = 2;
    x++;
}
has been eliminated.
movl $0, %eax is where the return value is set. The other instructions are just program init and exit.
There is a possibility that the compiler optimized it away, since it's never true.
The optimizer removed the if conditional and all of the code inside, so it doesn't show up at all.
The if (0) {} block has been optimized out by the compiler, as it will never be executed.
So your function only returns 0 (movl $0, %eax).

Whether variable name in any programming language takes memory space

e.g.
int a=3;//-----------------------(1)
and
int a_long_variable_name_used_instead_of_small_one=3;//-------------(2)
out of (1) and (2), which will occupy more memory space, or will they occupy equal space?
In C++ and most statically compiled languages, variable names may take up more space during the compilation process but by run time the names will have been discarded and thus take up no space at all.
In interpreted languages and compiled languages which provide run time introspection/reflection the name may take up more space.
Also, language implementation will affect how much space variable names take up. The implementer may have decided to use a fixed-length buffer for each variable name, in which case each name takes up the same space regardless of length. Or they may have dynamically allocated space based on the length.
Both occupy the same amount of memory. Variable names are just to help you, the programmer, remember what the variable is for, and to help the compiler associate different uses of the same variable. With the exception of debugging symbols, they make no appearance in the compiled code.
The name you give to a variable in C/C++ will not affect the size of the resulting executable code. When you declare a variable like that, the compiler reserves memory space (in the case of an int on x86/x64, four bytes) to store the value. To access or alter the value it will then use the address rather than the variable name (which is lost in the compilation process).
In most interpreted languages, the name would be stored in a table somewhere in memory, thus taking up different amounts of space.
If my understanding is correct, they'll take up the same amount of memory.
I believe (and am ready to get shot down in flames) that in C++ the names are symbolic to help the user and the compiler will just create a block of memory sufficient to hold the type you're declaring, in this case an int.
So, they should both occupy the same memory size, i.e., the memory required to hold an int.
For C++,
$ cat name.cpp
int main() {
    int a = 74678;
    int bcdefghijklmnopqrstuvwxyz = 5664;
}
$ g++ -S name.cpp
$ cat name.s
        .file   "name.cpp"
        .text
        .align  2
        .globl  main
        .type   main, @function
main:
.LFB2:
        pushl   %ebp
.LCFI0:
        movl    %esp, %ebp
.LCFI1:
        subl    $8, %esp
.LCFI2:
        andl    $-16, %esp
        movl    $0, %eax
        addl    $15, %eax
        addl    $15, %eax
        shrl    $4, %eax
        sall    $4, %eax
        subl    %eax, %esp
        movl    $74678, -4(%ebp)
        movl    $5664, -8(%ebp)
        movl    $0, %eax
        leave
        ret
.LFE2:
        .size   main, .-main
        .section .note.GNU-stack,"",@progbits
        .ident  "GCC: (GNU) 3.4.6 20060404 (Red Hat 3.4.6-11.0.1)"
$
As you can see, neither a nor bcdefghijklmnopqrstuvwxyz reflect in the assembler output. So, the length of the variable name does not matter at runtime in terms of memory.
But, variable names are huge contributors to the program design. Some programmers even rely on good naming conventions instead of comments to explain the design of their program.
A relevant quote from Hacker News,
Code should be written so as to completely describe the program's functionality to human readers, and only incidentally to be interpreted by computers. We have a hard time remembering short names for a long time, and we have a hard time looking at long names over and over again in a row. Additionally, short names carry a higher likelihood of collisions (since the search space is smaller), but are easier to "hold onto" for short periods of reading.
Thus, our conventions for naming things should take into consideration the limitations of the human brain. The length of a variable's name should be proportional to the distance between its definition and its use, and inversely proportional to its frequency of use.
In modern compilers the name of a variable does not impact the amount of space that is required to hold it in C++.
Field names (instance variable names) in Java use memory, but only once per field. This is for reflection to work. The same goes for other languages that are based on the JVM, and I guess for DotNet.
Compilers are there for a reason. They optimize code to use as little space as possible and run as fast as possible, especially modern ones.
No. Both will occupy equal space.

what will be the addressing mode in assembly code generated by the compiler here?

Suppose we've got two variables, an integer and a character:
int adad=12345;
char character;
Assuming we're discussing a platform on which an integer variable is at least three bytes long, I want to access the third byte of this integer and put it in the character variable. With that said, I'd write it like this:
character=*((char *)(&adad)+2);
Considering that line of code, and the fact that I'm not a compiler or assembly expert (I know a little about addressing modes in assembly), I'm wondering: would the address (or rather, the offset) of the third byte be embedded within the instructions generated by that line of code themselves, or would it be in a separate variable whose address (or offset) is within those instructions?
The best thing to do in situations like this is to try it. Here's an example program:
int main(int argc, char **argv)
{
    int adad = 12345;
    volatile char character;
    character = *((char *)(&adad)+2);
    return 0;
}
I added the volatile to avoid the assignment line being completely optimized away. Now, here's what the compiler came up with (for -Oz on my Mac):
_main:
        pushq   %rbp
        movq    %rsp,%rbp
        movl    $0x00003039,0xf8(%rbp)
        movb    0xfa(%rbp),%al
        movb    %al,0xff(%rbp)
        xorl    %eax,%eax
        leave
        ret
The only three lines that we care about are:
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
The movl is the initialization of adad. Then, as you can see, it reads out the 3rd byte of adad, and stores it back into memory (the volatile is forcing that store back).
I guess a good question is why does it matter to you what assembly gets generated? For example, just by changing my optimization flag to -O0, the assembly output for the interesting part of the code is:
movl $0x00003039,0xf8(%rbp)
leaq 0xf8(%rbp),%rax
addq $0x02,%rax
movzbl (%rax),%eax
movb %al,0xff(%rbp)
Which is pretty straightforwardly seen as the exact logical operations of your code:
Initialize adad
Take the address of adad
Add 2 to that address
Load one byte by dereferencing the new address
Store one byte into character
Various optimizations will change the output... if you really need some specific behaviour/addressing mode for some reason, you might have to write the assembly yourself.
Without knowing anything about the compiler and the underlying CPU architecture, no definitive answer can be given. For example, not all CPU architectures allow addressing every arbitrary byte in memory (though I believe all the currently popular ones do): on a CPU that's word-addressed instead of byte-addressed, what the compiler will generate is inevitably a load of the whole word adad into some register (presumably at an offset from a base pointer register, if the variable in question is on the stack [1]), followed by shifting and masking to isolate the byte of interest.
[1] note that, without knowing what CPU architecture we're talking about and how the compiler uses it, we can't even say whether "load a word at a fixed offset from a base register" is something that's done inline within the instruction (as one might hope, and many popular architectures definitely do support;-) or needs separate address arithmetic in an auxiliary register.
IOW, whether it's a good idea or not, it's definitely possible to define a CPU architecture which cannot load / store registers except from other registers or memory addresses defined by other registers or constant, and some such architectures exist (though they may not be all that popular at this time;-).
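For comparison, here is a hedged sketch (my code, not the answer's) of that shift-and-mask form; on a little-endian machine it yields the same byte as the pointer expression in the question:

#include <cstdio>

int main() {
    int adad = 12345;                                   // 0x3039
    char character =
        static_cast<char>((static_cast<unsigned int>(adad) >> 16) & 0xFF);
    std::printf("%d\n", character);                     // prints 0 for 12345
    return 0;
}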