gcov and switch statements - c++

I'm running gcov over some C code with a switch statement. I've written test cases to cover every possible path through that switch statement, but it still reports a branch in the switch statement as not taken and less than 100% on the "Taken at least once" stat.
Here's some sample code to demonstrate:
#include "stdio.h"
void foo(int i)
{
switch(i)
{
case 1:printf("a\n");break;
case 2:printf("b\n");break;
case 3:printf("c\n");break;
default: printf("other\n");
}
}
int main()
{
int i;
for(i=0;i<4;++i)
foo(i);
return 0;
}
I built with "gcc temp.c -fprofile-arcs -ftest-coverage", ran "a", then did "gcov -b -c temp.c". The output indicates eight branches on the switch and one (branch 6) not taken.
What are all those branches and how do I get 100% coverage?

Oho! bde's assembly dump shows that that version of GCC is compiling this switch statement as some approximation of a binary tree, starting at the middle of the set. So it checks if i is equal to 2, then checks if it's greater or less than 2, and then for each side it checks if it's equal to 1 or 3 respectively, and if not, then it goes to default.
That means there are two different code paths for it to get to the default result -- one for numbers higher than 2 that aren't 3, and one for numbers lower than 2 that aren't 1.
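In C terms, the dispatch that old GCC emits looks roughly like this (a hand-written sketch of the compare tree, not actual compiler output):

if (i == 2) { /* case 2 */ }
else if (i > 2) {
    if (i == 3) { /* case 3 */ }
    else { /* default, reached from the high side */ }
}
else {
    if (i == 1) { /* case 1 */ }
    else { /* default, reached from the low side */ }
}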
Looks like you'll get to 100% coverage if you change that i<4 in your loop to i<=4, so as to test the path on each side.
(And, yes, that's something that's very likely to have changed from GCC 3.x to GCC 4.x. I wouldn't say it's "fixed", as it's not "wrong" exactly aside from making the gcov results confusing. It's just that on a modern processor with branch prediction, it's probably slow as well as overly complicated.)

I get the same result using gcc/gcov 3.4.6.
For a switch statement, gcov normally reports two branches for each case statement: one taken when the case matches and its body executes, and a "fallthrough" branch that goes on to test the next case.
In your situation, it looks like gcc is making a "fallthrough" branch for the last case, which doesn't make sense since there is nothing to fall into.
Here's an excerpt from the assembly code generated by gcc (I changed some of the labels for readability):
    cmpl    $2, -4(%ebp)
    je      CASE2
    cmpl    $2, -4(%ebp)
    jg      L7
    cmpl    $1, -4(%ebp)
    je      CASE1
    addl    $1, LPBX1+16
    adcl    $0, LPBX1+20
    jmp     DEFAULT
L7:
    cmpl    $3, -4(%ebp)
    je      CASE3
    addl    $1, LPBX1+32
    adcl    $0, LPBX1+36
    jmp     DEFAULT
I admit that I don't know much about x86 assembly, and I don't understand the use of the L7 label, but it might have something to do with the extra branch. Maybe someone with more knowledge of gcc can explain what is going on here.
It sounds like it might be an issue with the older version of gcc/gcov; upgrading to a newer gcc/gcov might fix the problem, especially given the other post where the results look correct.

Are you sure you are running a.out? Here are my results (gcc 4.4.1):
File 't.c'
Lines executed:100.00% of 11
Branches executed:100.00% of 6
Taken at least once:100.00% of 6
Calls executed:100.00% of 5
t.c:creating 't.c.gcov'

I'm using MinGW on Windows (which is not the latest gcc), and it looks like this may be sorted out in newer versions of gcc.

Related

Why might a C++ compiler duplicate a function exit basic block?

Consider the following snippet of code:
int* find_ptr(int* mem, int sz, int val) {
    for (int i = 0; i < sz; i++) {
        if (mem[i] == val) {
            return &mem[i];
        }
    }
    return nullptr;
}
GCC on -O3 compiles this to:
find_ptr(int*, int, int):
        mov     rax, rdi
        test    esi, esi
        jle     .L4             # why not .L8?
        lea     ecx, [rsi-1]
        lea     rcx, [rdi+4+rcx*4]
        jmp     .L3
.L9:
        add     rax, 4
        cmp     rax, rcx
        je      .L8
.L3:
        cmp     DWORD PTR [rax], edx
        jne     .L9
        ret
.L8:
        xor     eax, eax
        ret
.L4:
        xor     eax, eax
        ret
In this assembly, the blocks with labels .L4 and .L8 are identical. Would it not be better to rewrite jumps to .L4 to .L8 and drop .L4? I thought this might be a bug, but clang also duplicates the xor-ret sequence back to back. However, ICC and MSVC each take a pretty different approach.
Is this an optimization in this case and, if not, are there times when it would be? What is the rationale behind this behavior?
This is always a missed optimization. Having both return-0 paths use the same basic block would be pure win on all microarchitectures that current compilers care about.
But unfortunately this missed-optimization is not rare with gcc. Often it's a separate bare ret that gcc conditionally branches to, instead of branching to a ret in another existing path. (x86 doesn't have a conditional ret, so simple functions that don't need any stack cleanup often just need to branch to a ret.
Often functions this small would get inlined in a complete program, so maybe it doesn't hurt a lot in real life?)
CPUs (since Pentium Pro if not earlier) have a return-address predictor stack that easily predicts the branch target for ret instructions, so there's not going to be an effect from one ret instruction more often returning to one caller and another ret more often returning to another caller. It doesn't help branch prediction to separate them and let them use different entries.
IDK about Pentium 4 and whether the traces in its trace cache follow call/ret. But fortunately that's not relevant anymore. The decoded-uop cache in SnB-family and Ryzen is not a trace cache; a line/way of uop cache holds uops for a contiguous block of x86 machine code, and unconditional jumps end a uop cache line. (https://agner.org/optimize/) So if anything, this could be worse for SnB-family because each return path needs a separate line of the uop cache even though they're each only 2 uops total (xor-zero and ret are both single-uop instructions).
Report this MCVE to gcc's bugzilla with keyword missed-optimization: https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc
(update: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90178 was reported by the OP. A fix was attempted, but reverted; for now it's still open. In this case it seems to be caused by -mavx, perhaps some interaction with return paths that need vzeroupper or not.)
Cause:
You can kind of see how it might arrive at 2 exit blocks: compilers normally transform for loops into if(sz>0) { do{}while(); } if there's a possibility of it needing to run 0 times, like gcc did here. So there's one branch that leaves the function without entering the loop at all. But the other exit is from fall through from the loop. Perhaps before optimizing away some stuff, there was some extra cleanup. Or just those paths got split up when the first branch was created.
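As a hand-written sketch of that rotation (not actual compiler output), the find_ptr loop becomes something like:

int* find_ptr(int* mem, int sz, int val) {
    if (sz <= 0)
        return nullptr;        // exit 1: loop never entered (.L4)
    int* p = mem;
    int* end = mem + sz;       // one-past-the-end, like the lea into rcx
    do {
        if (*p == val)
            return p;
    } while (++p != end);
    return nullptr;            // exit 2: fell out of the loop (.L8)
}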
I don't know why gcc fails to notice and merge two identical basic blocks that end with ret.
Maybe it only looked for that in some GIMPLE or RTL pass where they weren't actually identical, and they only became identical during final x86 code-gen. Maybe after optimizing away the save/restore of a register to hold some temporary that it ended up not needing?
You could dig deeper if you look at GCC's GIMPLE or RTL with -fdump-tree-... options after certain optimization passes: Godbolt has UI for that, in the + dropdown -> tree / RTL output. https://godbolt.org/z/l9mVlE. But unless you're a gcc-internals expert and planning to work on a patch or idea to help gcc find this optimization, it's probably not worth your time.
Interesting discovery that it only happens with -mavx (enabled by -march=skylake or directly). GCC and clang don't know how to auto-vectorize loops where the trip count is not known before the first iteration. e.g. search loops like this or memchr or strlen. So IDK why AVX even makes a difference at all.
(Note that the C abstract machine never reads mem[i] beyond the search point, and those elements might not actually exist. e.g. there's no UB if you passed this function a pointer to the last int before an unmapped page, and sz=1000, as long as *mem == val. So to auto-vectorize without int mem[static sz] guaranteed object size, the compiler would have to align the pointer... Not that C11 int mem[static sz] would even help; even a static array of compile-time-constant size larger than the max possible trip count wouldn't get gcc to auto-vectorize.)

Changing a number defined in a C++(C) program without compiling the source again

Suppose I have this simple program which prints a number:
#include <iostream>

int unique_id = 112233;

int main()
{
    std::cout << unique_id;
    return 0;
}
Then I compile it to something like a.exe. Now I want to create another application that opens a.exe and changes unique_id to something else. Is it possible?
I'm not going to pass a parameter to the program because of some restrictions.
I want to use the unique_id, as its name implies, to uniquely identify where my program is running. But I don't want to compile my program 1000 times for 1000 customers. I know I can use the hard disk serial number, but in virtual machines that serial number may be omitted. I know I can use the CPU serial number, but I read in Stack Overflow posts that this serial number is deprecated. I know I can use the MAC address too :), but that address can be changed easily. So I decided to put the unique ID in the exe file itself.
Considering the motivation you added to the question, you could simply make the exe read the id from a .txt file, and ship a different .txt file with the exe for every customer.
Or, equivalently, you could make a DLL (or the equivalent for your platform) that has a function returning the id, and only recompile the DLL for every customer.
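For the file-based variant, a minimal sketch (the name id.txt and its format are just assumptions for the example):

#include <fstream>
#include <iostream>

int main()
{
    int unique_id = 0;
    std::ifstream in("id.txt");   // shipped next to the exe, different per customer
    if (!(in >> unique_id)) {
        std::cerr << "missing or unreadable id.txt\n";
        return 1;
    }
    std::cout << unique_id;
    return 0;
}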
In general, you cannot change anything without re-compiling.
In practice and in very limited cases, you might patch your binary. This is mostly processor-specific (and executable-format-specific and ABI-specific) and depends less on your particular operating system version (e.g. a patch that works on one version of Windows is likely to keep working on the next).
(However, I don't know and never used Windows; I'm only using Linux; you should adapt my answer to your operating system)
So in some cases you might reverse-engineer your binary executable. If you do have the C source code, you could ask your compiler to emit the assembler code (e.g. by compiling with gcc -O -fverbose-asm -S with GCC). Then you might disassemble your executable, and change, with a binary or hexadecimal editor, the machine code containing that constant.
This won't always work, because the machine instruction (and its size) could depend on the magnitude (bit size) of your constant.
To take a simple example, in C, for GCC 7, on Linux/x86-64, consider the following C file:
/// A, B, C are preprocessor symbols defined as integers
int f(int x) {
    if (x > 0)
        return A*x + B;
    return C;
}
If I compile that with gcc -fverbose-asm -S -O -DA=12751 -DB=32 -DC=11 e.c I'm getting:
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
# e.c:3: if (x > 0)
        testl   %edi, %edi          # x
        jle     .L3                 #,
# e.c:4: return A * x + B;
        imull   $12751, %edi, %edi  #, x, tmp90
        leal    32(%rdi), %eax      #, <retval>
        ret
.L3:
# e.c:5: return C;
        movl    $11, %eax           #, <retval>
# e.c:6: }
        ret
        .cfi_endproc
.LFE0:
        .size   f, .-f
But if I do gcc -S -O -fverbose-asm -DA=12753 -DB=32 -DC=10 e.c I'm getting
        .type   f, @function
f:
.LFB0:
        .cfi_startproc
# e.c:3: if (x > 0)
        testl   %edi, %edi          # x
        jle     .L3                 #,
# e.c:4: return A * x + B;
        imull   $12753, %edi, %edi  #, x, tmp90
        leal    32(%rdi), %eax      #, <retval>
        ret
.L3:
# e.c:5: return C;
        movl    $10, %eax           #, <retval>
# e.c:6: }
        ret
So indeed, in the above case I could patch the binary (I would need to find the 12751 and 11 constants in machine code; it is doable but tedious in that case).
Now, let's try with A being a small power of two, like 16, and C being 0, so
gcc -S -O -fverbose-asm -DA=16 -DB=32 -DC=0 e.c:
f:
.LFB0:
        .cfi_startproc
# e.c:4: return A * x + B;
        leal    2(%rdi), %eax       #, tmp90
        sall    $4, %eax            #, tmp93
        testl   %edi, %edi          # x
        movl    $0, %edx            #, tmp92
        cmovle  %edx, %eax          # tmp93,, tmp92, <retval>
# e.c:6: }
        ret
Because of compiler optimizations, the code changed significantly. It is not easy to patch.
Important notice
With enough effort, money and time (think of NSA-like abilities) a lot of things are possible.
If your goal is to obfuscate some data in your binary (e.g. some password), you might encrypt it to make hackers' lives harder (but don't be naive; the NSA will still be able to get it). Remember the motto: there is No Silver Bullet. It looks like that is your goal, but don't be too naive (BTW, the legal protections around your software, e.g. the license, matter even more, so you need a lawyer to write a good EULA).
If your goal is on the contrary to adapt some performance-critical code, you could use metaprogramming and partial-evaluation techniques. A practice I like is to generate at runtime some temporary C (or C++) code (better suited to your particular situation and data), compile that temporary code as a plugin, then dynamically load that plugin (using dlopen and dlsym on Linux; on Windows you'll need LoadLibrary, but I leave you to work out the details and consequences). Instead of generating C or C++ code at runtime, you could use some JIT-compiling library like libgccjit. If you are fond of such techniques, consider using better-suited programming languages (like Common Lisp with SBCL) if your management allows them.
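To illustrate the plugin idea, here is a minimal sketch for Linux (the library name libcustomer.so and the symbol get_unique_id are made up for the example; link with -ldl):

#include <dlfcn.h>
#include <iostream>

int main()
{
    void *handle = dlopen("./libcustomer.so", RTLD_NOW);
    if (!handle) { std::cerr << dlerror() << '\n'; return 1; }

    auto get_unique_id =
        reinterpret_cast<int (*)()>(dlsym(handle, "get_unique_id"));
    if (!get_unique_id) { std::cerr << dlerror() << '\n'; return 1; }

    std::cout << get_unique_id() << '\n';  // per-customer id from the plugin
    dlclose(handle);
    return 0;
}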
But I don't want to compile my program 1000 times for 1000 customers
That surprises me a lot. Compiling a simple (short) C file containing just constants is quick, and linking time is also quick. I would instead consider recompilation for each customer.
BTW, I feel you are incredibly naive. The most important protection is not technical in your binary, it is a legal protection (and you need a good contract, so find and pay a good lawyer).
Did you consider, on the contrary, making your product free software? Many companies are doing that (and making money on something other than licenses, e.g. support).
NB: there are lots of existing license managers. Did you consider buying and using one? Notice also that corporations have large incentives to avoid cheating, and those willing to steal your software will be able to do that anyway. You'll sell more products by working on software quality, not by spending effort on vain "protection" measures which annoy your customers, increase your logistics, distribution and maintenance costs, and make debugging customer-found bugs harder.
No, the behaviour of changing a variable that is const is undefined. So you can't do this with standard C or C++.
Your best bet is to resort to an inline assembly solution; but note that UNIQUE_ID might be compiled out altogether (neither C nor C++ are reflective languages). In order to increase the probability of UNIQUE_ID being retained, remove the const qualifier and possibly introduce volatile.
Personally I'd pass UNIQUE_ID on the command line to your program.
Starting point: https://msdn.microsoft.com/en-us/library/fabdxz08.aspx

Using base pointer register in C++ inline asm

I want to be able to use the base pointer register (%rbp) within inline asm. A toy example of this is like so:
void Foo(int &x)
{
    asm volatile ("pushq %%rbp;"          // 'prologue'
                  "movq %%rsp, %%rbp;"    // 'prologue'
                  "subq $12, %%rsp;"      // make room
                  "movl $5, -12(%%rbp);"  // some asm instruction
                  "movq %%rbp, %%rsp;"    // 'epilogue'
                  "popq %%rbp;"           // 'epilogue'
                  : : : );
    x = 5;
}

int main()
{
    int x;
    Foo(x);
    return 0;
}
I hoped that, since I am using the usual prologue/epilogue function-calling method of pushing and popping the old %rbp, this would be ok. However, it seg faults when I try to access x after the inline asm.
The GCC-generated assembly code (slightly stripped-down) is:
_Foo:
        pushq   %rbp
        movq    %rsp, %rbp
        movq    %rdi, -8(%rbp)
# INLINEASM
        pushq   %rbp;               // prologue
        movq    %rsp, %rbp;         // prologue
        subq    $12, %rsp;          // make room
        movl    $5, -12(%rbp);      // some asm instruction
        movq    %rbp, %rsp;         // epilogue
        popq    %rbp;               // epilogue
# /INLINEASM
        movq    -8(%rbp), %rax
        movl    $5, (%rax)          // x = 5;
        popq    %rbp
        ret
main:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $16, %rsp
        leaq    -4(%rbp), %rax
        movq    %rax, %rdi
        call    _Foo
        movl    $0, %eax
        leave
        ret
Can anyone tell me why this seg faults? It seems that I somehow corrupt %rbp but I don't see how. Thanks in advance.
I'm running GCC 4.8.4 on 64-bit Ubuntu 14.04.
See the bottom of this answer for a collection of links to other inline-asm Q&As.
Your code is broken because you step on the red-zone below RSP (with push) where GCC was keeping a value.
What are you hoping to accomplish with inline asm? If you want to learn inline asm, learn to use it to make efficient code, rather than horrible stuff like this. If you want to write function prologues and push/pop to save/restore registers, you should write whole functions in asm. (Then you can easily use nasm or yasm, rather than the less-preferred-by-most AT&T syntax with GNU assembler directives¹.)
GNU inline asm is hard to use, but allows you to mix custom asm fragments into C and C++ while letting the compiler handle register allocation and any saving/restoring if necessary. Sometimes the compiler will be able to avoid the save and restore by giving you a register that's allowed to be clobbered. Without volatile, it can even hoist asm statements out of loops when the input would be the same. (i.e. unless you use volatile, the outputs are assumed to be a "pure" function of the inputs.)
If you're just trying to learn asm in the first place, GNU inline asm is a terrible choice. You have to fully understand almost everything that's going on with the asm, and understand what the compiler needs to know, to write correct input/output constraints and get everything right. Mistakes will lead to clobbering things and hard-to-debug breakage. The function-call ABI is a much simpler and easier to keep track of boundary between your code and the compiler's code.
Why this breaks
You compiled with -O0, so gcc's code spills the function parameter from %rdi to a location on the stack. (This could happen in a non-trivial function even with -O3).
Since the target ABI is the x86-64 SysV ABI, it uses the "Red Zone" (128 bytes below %rsp that even asynchronous signal handlers aren't allowed to clobber), instead of wasting an instruction decrementing the stack pointer to reserve space.
It stores the 8B pointer function arg at -8(rsp_at_function_entry). Then your inline asm pushes %rbp, which decrements %rsp by 8 and then writes there, clobbering the low 32b of &x (the pointer).
When your inline asm is done,
gcc reloads -8(%rbp) (which has been overwritten with %rbp) and uses it as the address for a 4B store.
Foo returns to main with %rbp = (upper32)|5 (orig value with the low 32 set to 5).
main runs leave: %rsp = (upper32)|5
main runs ret with %rsp = (upper32)|5, reading the return address from virtual address (void*)(upper32|5), which from your comment is 0x7fff0000000d.
I didn't check with a debugger; one of those steps might be slightly off, but the problem is definitely that you clobber the red zone, leading to gcc's code trashing the stack.
Even adding a "memory" clobber doesn't get gcc to avoid using the red zone, so it looks like allocating your own stack memory from inline asm is just a bad idea. (A memory clobber means you might have written some memory you're allowed to write to, e.g. a global variable or something pointed-to by a global, not that you might have overwritten something you're not supposed to.)
If you want to use scratch space from inline asm, you should probably declare an array as a local variable and use it as an output-only operand (which you never read from).
AFAIK, there's no syntax for declaring that you modify the red-zone, so your only options are:
use an "=m" output operand (possibly an array) for scratch space; the compiler will probably fill in that operand with an addressing mode relative to RBP or RSP. You can index into it with constants like 4 + %[tmp] or whatever. You might get an assembler warning from 4 + (%rsp) but not an error.
skip over the red-zone with add $-128, %rsp / sub $-128, %rsp around your code. (Necessary if you want to use an unknown amount of extra stack space, e.g. push in a loop, or making a function call. Yet another reason to deref a function pointer in pure C, not inline asm.)
compile with -mno-red-zone (I don't think you can enable that on a per-function basis, only per-file)
Don't use scratch space in the first place. Tell the compiler what registers you clobber and let it save them.
Here's what you should have done:
void Bar(int &x)
{
    int tmp;
    long tmplong;
    asm ("lea -16 + %[mem1], %%rbp\n\t"
         "imul $10, %%rbp, %q[reg1]\n\t"  // q modifier: 64bit name.
         "add %k[reg1], %k[reg1]\n\t"     // k modifier: 32bit name
         "movl $5, %[mem1]\n\t"           // some asm instruction writing to mem
         : [mem1] "=m" (tmp), [reg1] "=r" (tmplong)  // tmp vars -> tmp regs / mem for use inside asm
         :
         : "%rbp"  // tell compiler it needs to save/restore %rbp.
                   // gcc refuses to let you clobber %rbp with -fno-omit-frame-pointer (the default at -O0)
                   // clang lets you, but memory operands still use an offset from %rbp, which will crash!
                   // gcc memory operands still reference %rsp, so don't modify it. Declaring a clobber on %rsp does nothing
    );
    x = 5;
}
Note the push/pop of %rbp in the code outside the #APP / #NO_APP section, emitted by gcc. Also note that the scratch memory it gives you is in the red zone. If you compile with -O0, you'll see that it's at a different position from where it spills &x.
To get more scratch regs, it's better to just declare more output operands that are never used by the surrounding non-asm code. That leaves register allocation to the compiler, so it can be different when inlined into different places. Choosing ahead of time and declaring a clobber only makes sense if you need to use a specific register (e.g. shift count in %cl). Of course, an input constraint like "c" (count) gets gcc to put the count in rcx/ecx/cx/cl, so you don't emit a potentially redundant mov %[count], %%ecx.
If this looks too complicated, don't use inline asm. Either lead the compiler to the asm you want with C that's like the optimal asm, or write a whole function in asm.
When using inline asm, keep it as small as possible: ideally just the one or two instructions that gcc isn't emitting on its own, with input/output constraints to tell it how to get data into / out of the asm statement. This is what it's designed for.
Rule of thumb: if your GNU C inline asm starts or ends with a mov, you're usually doing it wrong and should have used a constraint instead.
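A made-up illustration of that rule (both versions double x; neither is from the question):

// Anti-pattern: mov into a hard-coded register yourself
int bad_double(int x) {
    int r;
    asm ("mov %1, %%eax\n\t"
         "add %%eax, %%eax\n\t"
         "mov %%eax, %0"
         : "=r"(r) : "r"(x) : "eax");   // must clobber eax by name
    return r;
}

// Better: a "+r" constraint hands the value to the instruction directly
int good_double(int x) {
    asm ("add %0, %0" : "+r"(x));
    return x;
}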
Footnotes:
1. You can use GAS's intel-syntax in inline asm by building with -masm=intel (in which case your code will only work with that option), or use dialect alternatives so it works whether the compiler is in Intel or AT&T asm output mode. But that doesn't change the directives, and GAS's Intel syntax is not well documented. (It's like MASM, not NASM, though.) I don't really recommend it unless you really hate AT&T syntax.
Inline asm links:
x86 wiki. (The tag wiki also links to this question, for this collection of links)
The inline-assembly tag wiki
The manual. Read this. Note that inline asm was designed to wrap single instructions that the compiler doesn't normally emit. That's why it's worded to say things like "the instruction", not "the block of code".
A tutorial
Looping over arrays with inline assembly Using r constraints for pointers/indices and using your choice of addressing mode, vs. using m constraints to let gcc choose between incrementing pointers vs. indexing arrays.
How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (pointer inputs in registers do not imply that the pointed-to memory is read and/or written, so it might not be in sync if you don't tell the compiler).
In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?. Using %q0 to get %rax vs. %w0 to get %ax. Using %g[scalar] to get %zmm0 instead of %xmm0.
Efficient 128-bit addition using carry flag Stephen Canon's answer explains a case where an early-clobber declaration is needed on a read+write operand. Also note that x86/x86-64 inline asm doesn't need to declare a "cc" clobber (the condition codes, aka flags); it's implicit. (gcc6 introduces syntax for using flag conditions as input/output operands. Before that you have to setcc a register that gcc will emit code to test, which is obviously worse.)
Questions about the performance of different implementations of strlen: my answer on a question with some badly-used inline asm, with an answer similar to this one.
llvm reports: unsupported inline asm: input with type 'void *' matching output with type 'int': Using offsetable memory operands (in x86, all effective addresses are offsettable: you can always add a displacement).
When not to use inline asm, with an example of 32b/32b => 32b division and remainder that the compiler can already do with a single div. (The code in the question is an example of how not to use inline asm: many instructions for setup and save/restore that should be left to the compiler by writing proper in/out constraints.)
MSVC inline asm vs. GNU C inline asm for wrapping a single instruction, with a correct example of inline asm for 64b/32b=>32bit division. MSVC's design and syntax require a round trip through memory for inputs and outputs, making it terrible for short functions. It's also "never very reliable" according to Ross Ridge's comment on that answer.
Using x87 floating point, and commutative operands. Not a great example, because I didn't find a way to get gcc to emit ideal code.
Some of those re-iterate some of the same stuff I explained here. I didn't re-read them to try to avoid redundancy, sorry.
In x86-64, the stack pointer needs to be aligned to 8 bytes.
This:
subq $12, %rsp; // make room
should be:
subq $16, %rsp; // make room

VM interpreter - weighting performance benefits and drawbacks of larger instruction set / dispatch loop

I am developing a simple VM and I am in the middle of a crossroad.
My initial goal was to use byte long instruction, and therefore a small loop and a quick computed goto dispatch.
However, it turns out reality could not be further from that: 256 opcodes are nowhere near enough to cover signed and unsigned 8-, 16-, 32- and 64-bit integers, floats and doubles, pointer operations, and the different combinations of addressing. One option was to not implement bytes and shorts, but the goal is to make a VM that supports the full C subset as well as vector operations, since they are pretty much everywhere anyway, albeit in different implementations.
So I switched to 16-bit instructions, which also lets me add portable SIMD intrinsics and more compiled common routines that really save on performance by not being interpreted. There is also caching of global addresses: they are initially compiled as base-pointer offsets, and the first time an address is used, the instruction overwrites the offset and the opcode so that next time it is a direct jump, at the cost of an extra instruction in the set for each use of a global.
Since I am not yet at the profiling stage, I am in a dilemma: are the extra instructions worth the added flexibility? Will the presence of more instructions, and therefore the absence of copying back and forth, make up for the increased dispatch loop size? Keep in mind the instructions are just a few assembly instructions each, e.g.:
        .globl  __Z20assign_i8u_reg8_imm8v
        .def    __Z20assign_i8u_reg8_imm8v;  .scl 2;  .type 32;  .endef
__Z20assign_i8u_reg8_imm8v:
LFB13:
        .cfi_startproc
        movl    _ip, %eax
        movb    3(%eax), %cl
        movzbl  2(%eax), %eax
        movl    _sp, %edx
        movb    %cl, (%edx,%eax)
        addl    $4, _ip
        ret
        .cfi_endproc
LFE13:
        .p2align 2,,3
        .globl  __Z18assign_i8u_reg_regv
        .def    __Z18assign_i8u_reg_regv;  .scl 2;  .type 32;  .endef
__Z18assign_i8u_reg_regv:
LFB14:
        .cfi_startproc
        movl    _ip, %edx
        movl    _sp, %eax
        movzbl  3(%edx), %ecx
        movb    (%ecx,%eax), %cl
        movzbl  2(%edx), %edx
        movb    %cl, (%eax,%edx)
        addl    $4, _ip
        ret
        .cfi_endproc
LFE14:
        .p2align 2,,3
        .globl  __Z24assign_i8u_reg_globCachev
        .def    __Z24assign_i8u_reg_globCachev;  .scl 2;  .type 32;  .endef
__Z24assign_i8u_reg_globCachev:
LFB15:
        .cfi_startproc
        movl    _ip, %eax
        movl    _sp, %edx
        movl    4(%eax), %ecx
        addl    %edx, %ecx
        movl    %ecx, 4(%eax)
        movb    (%ecx), %cl
        movzwl  2(%eax), %eax
        movb    %cl, (%eax,%edx)
        addl    $8, _ip
        ret
        .cfi_endproc
LFE15:
        .p2align 2,,3
        .globl  __Z19assign_i8u_reg_globv
        .def    __Z19assign_i8u_reg_globv;  .scl 2;  .type 32;  .endef
__Z19assign_i8u_reg_globv:
LFB16:
        .cfi_startproc
        movl    _ip, %eax
        movl    4(%eax), %edx
        movb    (%edx), %cl
        movzwl  2(%eax), %eax
        movl    _sp, %edx
        movb    %cl, (%edx,%eax)
        addl    $8, _ip
        ret
        .cfi_endproc
This example contains the instructions to:
assign unsigned byte from immediate value to register
assign unsigned byte from register to register
assign unsigned byte from global offset to register, cache the address, and change to the direct instruction
assign unsigned byte from global offset to register (the now cached previous version)
... and so on...
Naturally, when I produce a compiler for it, I will be able to test the instruction flow in production code and optimize the arrangement of the instructions in memory to pack together the frequently used ones and get more cache hits.
I just have a hard time figuring if such a strategy is a good idea, the bloat will make up for flexibility, but what about performance? Will more compiled routines make up for a larger dispatch loop? Is it worth caching global addresses?
I would also like someone decent in assembly to express an opinion on the quality of the code generated by GCC - are there any obvious inefficiencies and room for optimization? To make the situation clear, there is an sp pointer, which points to the stack that implements the registers (there is no other stack), ip is logically the current instruction pointer, and gp is the global pointer (not referenced directly, accessed as an offset).
EDIT: Also, this is the basic format I am implementing the instructions in:
INSTRUCTION assign_i8u_reg16_glob() { // assign unsigned byte to reg from global offset
    FETCH(globallAddressCache);
    REG(quint8, i.d16_1) = GLOB(quint8);
    INC(globallAddressCache);
}
FETCH returns a reference to the instruction struct the opcode corresponds to
REG returns a reference to a register value of type T at the given offset
GLOB returns a reference to a global value from a cached global offset (effectively an absolute address)
INC just increments the instruction pointer by the size of the instruction.
Some people will probably suggest against the use of macros, but with templates it is much less readable. This way the code is pretty obvious.
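For what it's worth, here is a guess at what those macros might expand to; the ip/sp variables and the cachedAddress field are assumptions based on the description, not the OP's actual code:

// assumed: ip is a const char* instruction pointer, sp the register-stack base
#define FETCH(T)    const T &i = *reinterpret_cast<const T *>(ip)
#define REG(T, off) (*reinterpret_cast<T *>(sp + (off)))
#define GLOB(T)     (*reinterpret_cast<T *>(i.cachedAddress))
#define INC(T)      (ip += sizeof(T))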
EDIT: I would like to add a few points to the question:
I could go for a "register operations only" solution which can only move data between registers and "memory" - be that global or heap. In this case, every "global" and heap access will have to copy the value, modify or use it, and move it back to update. This way I have a shorter dispatch loop, but a few extra instructions for each instruction that addresses non-register data. So the dilemma is a few times more native code with longer direct jumps, or a few times more interpreted instructions with shorter dispatch loop. Will a short dispatch loop give me enough performance to make up for the extra and costly memory operations? Maybe the delta between the shorter and longer dispatch loop is not enough to make a real difference? In terms of cache hits, in terms of the cost of assembly jumps.
I could go for additional decoding and only 8bit wide instructions, however, this may add another jump - jump to wherever this instruction is handled, then waste time on either jumping to the case the particular addressing scheme is handled or decoding operations and a more complex execution method. And in the first case, the dispatch loop still grows, plus adding yet another jump. The second option - register operations can be used to decode the addressing, but a more complex instruction with more compile time unknown will be needed in order to address anything. I am not really sure how will this stack up with a shorter dispatch loop, once again, uncertain how my "shorter and longer dispatch loop" relates to what is considered short or long in terms of assembly instructions, the memory they need and the speed of their execution.
I could go for the "many instructions" solution - the dispatch loop is a few times larger, but it still uses pre-computed direct jumping. Complex addressing is specific and optimized for each instruction and compiled to native, so the extra memory operations that would be needed by the "register only" approach will be compiled and mostly executed on the registers, which is good for performance. Generally, the idea is add more to the instruction set but also add to the amount of work that can be compiled in advance and done in a single "instruction". The loner instruction set also means longer dispatch loop, longer jumps (although that can be optimized to minimize), less cache hits, but the question is BY HOW MUCH? Considering every "instruction" is just a few assembly instructions, is an assembly snippet of about 7-8k instructions considered normal, or too much? Considering the average instruction size varies around 2-3b, this should not be more than 20k of memory, enough to completely fit in most L1 caches. But this is not concrete math, just stuff I came at googling around, so maybe my "calculations" are off? Or maybe it doesn't work that way? I am not that experienced in caching mechanisms.
To me, as I currently weight the arguments, the "many instructions" approach appears to have the biggest chances for best performance, provided of course, my theory about fitting the "extended dispatch loop" in the L1 cache holds. So here is where your expertise and experience comes into play. Now that the context is narrowed and a few support ideas presented, maybe it will be easier to give a more concrete answer whether the benefits of a larger instruction set prevail over the size increase of native code by decreasing the amount of the slower, interpreted code.
My instruction size data is based on those stats.
You might want to consider separating the VM ISA and its implementation.
For instance, in a VM I wrote I had a "load value direct" instruction. The next value in the instruction stream wasn't decoded as an instruction, but loaded as a value into a register. You can consider this one macro instruction or two separate values.
Another instruction I implemented was "load constant value", which loaded a constant from memory (using a base address for the table of constants and an offset). A common pattern in the instruction stream was therefore load value direct (index); load constant value. Your VM implementation may recognize this pattern and handle the pair with a single optimized implementation.
Obviously, if you have enough bits, you can use some of them to identify a register. With 8 bits it may be necessary to have a single register for all operations. But again, you could add another instruction with register X which modifies the next operation. In your C++ code, that instruction would merely set the currentRegister pointer which the other instructions use.
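A self-contained sketch of that "with register X" prefix idea (all names are illustrative, not from the OP's VM):

struct VM {
    long regs[4] = {};
    long *currentRegister = &regs[0];    // default register
    const unsigned char *ip = nullptr;   // bytecode instruction pointer
    unsigned char fetch8() { return *ip++; }
};

// Prefix instruction: repoint currentRegister for the instructions that follow
void op_with_register(VM &vm) { vm.currentRegister = &vm.regs[vm.fetch8()]; }

// An ordinary instruction that implicitly works through currentRegister
void op_add_imm(VM &vm)       { *vm.currentRegister += vm.fetch8(); }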
Will more compiled routines make up for a larger dispatch loop?
I take it you didn't fancy having single-byte instructions with a second byte of extra opcode for certain instructions? I think decoding 16-bit opcodes may be less efficient than 8-bit + extra byte(s), assuming the extra byte(s) aren't too common or too difficult to decode in themselves.
If it was me, I'd work on getting the compiler (not necessarily a full-fledged compiler with "everything", but a basic model) going with a fairly limited set of "instructions". Keep the code generation part fairly flexible so that it'll be easy to alter the actual encoding later. Once you have that working, you can experiment with various encodings and see what the result is in performance, and other aspects.
A lot of your minor question points are very hard to answer for anyone that hasn't done both of the choices. I have never written a VM in this sense, but I have worked on several disassemblers, instruction set simulators and such things. I have also implemented a couple of languages of different kinds, in terms of interpreted languages.
You probably also want to consider a JIT approach, where instead of interpreting the bytecode you translate it into direct machine code for the architecture in question.
The GCC code doesn't look terrible, but there are several places where the code depends on the value of the immediately preceding instruction - which is not great on modern processors. Unfortunately, I don't see any solution to that - it's a "too short code to shuffle things around" problem - adding more instructions obviously won't work.
I do see one little problem: loading a 32-bit constant will require that it's 32-bit aligned for best performance. I have no idea how (or if) Java VMs deal with that.
I think you are asking the wrong question - not because it is a bad question; on the contrary, it is an interesting subject, and I suspect many people are interested in the results just as I am.
However, so far no one is sharing similar experience, so I guess you may have to do some pioneering. Instead of wondering which approach to use and wasting time on the implementation of boilerplate code, focus on creating a "reflection" component that describes the structure and properties of the language. Create a nice polymorphic structure with virtual methods, without worrying about performance, and create modular components you can assemble during runtime; there is even the option to use a declarative language once you have established the object hierarchy. Since you appear to use Qt, you have half the work cut out for you. Then you can use the tree structure to analyze and generate a variety of different code - C code to compile, or bytecode for a specific VM implementation, of which you can create multiple; you can even use it to programmatically generate the C code for your VM instead of typing it all by hand.
I think this advice will be more beneficial in case you resort to pioneering on the subject without a concrete answer in advance; it will allow you to easily test all the scenarios and make up your mind based on actual performance rather than personal assumptions and those of others. Then maybe you can share the results and answer your own question with performance data.
Instruction length in bytes has been handled the same way for quite a while. Obviously, being limited to 256 instructions isn't a good thing when there are so many types of operations you wish to perform.
This is why there's a prefix value. Back on the Game Boy architecture, there wasn't enough room to include all the needed bit-manipulation instructions, so one opcode was used as a prefix instruction. This kept the original 256 opcodes and added 256 more for instructions starting with that prefix byte.
For example:
One operation might look like this: D6 FF = SUB A, 0xFF
But a prefixed instruction would be presented as: CB D6 FF = SET 2, (HL)
If the processor read CB it'd immediately start looking in another instruction set of 256 opcodes.
The same goes for the x86 architecture today, where any instruction prefixed with 0F is essentially part of another instruction set.
With the sort of execution you're using for your emulator, this is the best way of extending your instruction set. 16-bit opcodes would take up far more space than necessary, while a prefix byte only costs a lookup in a second table for the less common instructions.
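In an interpreter, the prefix scheme boils down to a second dispatch table (this sketch borrows 0xCB as the prefix value from the Game Boy example above; the tables must be filled in elsewhere):

typedef void (*Handler)();
Handler primary[256];    // normal opcodes
Handler prefixed[256];   // opcodes that follow the prefix byte

void step(const unsigned char *&ip)
{
    unsigned char op = *ip++;
    if (op == 0xCB)              // prefix: dispatch from the second table
        prefixed[*ip++]();
    else
        primary[op]();
}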
One thing you should decide is what balance you wish to strike between code-file size efficiency, cache efficiency, and raw-execution-speed efficiency. Depending upon the coding patterns for the code you're interpreting, it may be helpful to have each instruction, regardless of its length in the code file, get translated into a structure containing a pointer and an integer. The first pointer would point to a function that takes a pointer to the instruction-info structure as well as to the execution context. The main execution loop would thus be something like:
do
{
    pc = pc->func(pc, &context);
} while(pc);
the function associated with an "add short immediate instruction" would be something like:
INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    context->op_stack[0] += pc->operand;
    return pc+1;
}
while "add long immediate" would be:
INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    context->op_stack[0] += (uint32_t)pc->operand + ((int64_t)(pc[1].operand) << 32);
    return pc+2;
}
and the function associated with an "add local" instruction would be:
INSTRUCTION *add_instruction(INSTRUCTION *pc, EXECUTION_CONTEXT *context)
{
    CONTEXT_ITEM *op_stack = context->op_stack;
    op_stack[0].asInt64 += op_stack[pc->operand].asInt64;
    return pc+1;
}
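For reference, the types these handlers assume might look like this (a sketch; the fields beyond func and operand are guesses based on the examples):

#include <stdint.h>

typedef struct INSTRUCTION INSTRUCTION;
typedef struct { int64_t asInt64; } CONTEXT_ITEM;
typedef struct { CONTEXT_ITEM *op_stack; } EXECUTION_CONTEXT;

struct INSTRUCTION {
    INSTRUCTION *(*func)(INSTRUCTION *pc, EXECUTION_CONTEXT *context);
    int64_t operand;   // immediate value or stack index, depending on the opcode
};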
Your "executables" would consist of compressed bytecode format, but they would then get translated into a table of instructions, eliminating a level of indirection when decoding the instructions at run-time.

Correlate Source with Assembly Listing of a C++ Program

Analyzing a core dump from a retail build often requires correlating the objdump of a specific module with the source. Correlating the assembly dump with the source normally becomes a pain if the function is quite involved.
Today I tried to create an assembly listing of one particular module (with the compile option -S), expecting I would see the source interleaved with the assembly, or some correlation. Unfortunately the listing was not friendly enough to correlate, so I was wondering, given:
a core dump from which I can determine the crash location,
an objdump of the failing module,
an assembly listing produced by recompiling the module with the -S option,
Is it possible to do a one-to-one correspondence with the source?
As an example I see the assembly listing as
.LBE7923:
        .loc 2 4863 0
        movq    %rdi, %r14
        movl    %esi, %r12d
        movl    696(%rsp), %r15d
        movq    704(%rsp), %rbp
.LBB7924:
        .loc 2 4880 0
        testq   %rdx, %rdx
        je      .L2680
.LVL2123:
        testl   %ecx, %ecx
        jle     .L2680
        movslq  %ecx, %rax
        .loc 2 4882 0
        testl   %r15d, %r15d
        .loc 2 4880 0
        leaq    (%rax,%rax,4), %rax
        leaq    -40(%rdx,%rax,8), %rdx
        movq    %rdx, 64(%rsp)
but I could not understand how to interpret labels like .LVL2123 and directives like .loc 2 4863 0.
Note
As the answers depicted, reading through the assembly source and intuitively recognizing patterns based on symbols (like function calls, branches, return statements) is what I generally do. I am not denying that it works, but when a function is quite involved, reading through pages of assembly listing is a pain, and you often end up with a listing that seldom matches, either because functions get inlined or because the optimizer has tossed the code around as it pleased. Seeing how well Valgrind handles optimized binaries, and how WinDBG on Windows can handle optimized binaries, I have a feeling there is something I am missing. So I thought I would start with the compiler output and use it to correlate. If my compiler is responsible for mangling the binary, it would be the best one to say how to correlate with the source, but unfortunately that was least helpful, and the .loc is really misleading.
Unfortunately, I often have to read through unreproducible dumps across various platforms; I spend the least time debugging Windows minidumps in WinDBG and considerable time debugging Linux core dumps. I thought that maybe I am not doing things correctly, so I came up with this question.
Is it possible to do a one-to-one correspondence with the source?
A: no, unless all optimisation is disabled.
The compiler may emit some group of instructions (or instruction-like things) per line initially, but the optimiser then reorders, splits, fuses and generally changes them completely.
If I'm disassembling release code, I look at the instructions which should have a clear logical relationship to the code. Eg,
.LBB7924:
        .loc 2 4880 0
        testq   %rdx, %rdx
        je      .L2680
looks like a branch if %rdx is zero, and it comes from line 4880. Find the line, identify the variable being tested, make a note that it's currently assigned to %rdx.
.LVL2123:
        testl   %ecx, %ecx
        jle     .L2680
OK, so this test and branch has the same target, so whatever comes next knows %rdx and %ecx are both nonzero. The original code might be structured like:
if (a && b) {
or perhaps it was:
if (!a || !b) {
and the optimiser reordered the two branches ...
Now you've got some structure you can hopefully match to the original code, you can also figure out the register assignments. Eg, if you know the thing being tested is the data member of some structure, read backwards to see where %rdx was loaded from memory: was it loaded from a fixed offset to some other register? If so, that register is probably the object address.
Good luck!
The .loc directive is what you're looking for. These indicate line 4863, 4880, etc. There is no perfect mapping between source and optimized assembler (which is why you see 4880 more than once), but .loc is how you know where you are in the file. The syntax is:
.loc <file> <line> <column>
where <file> is the index of a .file directive emitted earlier in the listing.
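If you can rebuild the module with debug info, you can also let the tools do the correlation for you (module.c, ./program and the address are placeholders):

gcc -g -O2 -c module.c
objdump -dS module.o                    # disassembly interleaved with source lines
addr2line -e ./program -f -C 0x400123   # map a crash address to function and line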
Unless you statically link against system libraries, even without debug symbols there will be symbolic names in the binary - that of the system library functions linked to.
These can often help narrow down where you are in the code. For example, if you see that in function foo() it calls open() and then ioctl() and then it crashes right before calling read(), you can probably find that point in the source of foo quite easily. (For that matter you might not even need the dump - on Linux you can get a record of the crash relative to library and system functions using ltrace or strace.)
Note that in some binary formats though, there may be an indirection to library functions via tiny wrappers elsewhere in the binary. Often a dump will still have relevant symbolic name information at the address of the invocation in the program flow. But even if not, you can recognize those external linkage wrappers by their range of address in the binary, and when you see one you can go find its code and figure out what external function it links to.
But as others have mentioned, if you have the source code and the system where it crashes frequently enough to be helpful, your fastest bet would usually be to rebuild with debug symbols, or insert logging output and get a more useful crash record.