How to offload memory offset calculation from runtime in C/C++? - c++

I am implementing a simple VM, and currently I am using runtime arithmetic to calculate individual program object addresses as offsets from base pointers.
I asked a couple of questions on the subject today, but I seem to be going slowly nowhere.
I learned a couple of things though, from question one -
Object and struct member access and address offset calculation -
I learned that modern processors have flexible addressing modes, which allow memory offsets to be calculated without any additional cycles devoted to arithmetic.
And from question two - Are address offsets resolved during compile time in C/C++? - I learned that there is no guarantee for this happening when doing the offsets manually.
By now it should be clear that what I want to achieve is to take advantage of the virtual memory addressing features of the hardware and offload those from the runtime.
I am using GCC; as for platform, I am developing on x86 on Windows, but since it is a VM I'd like to have it running efficiently on all platforms supported by GCC.
So ANY information on the subject is welcome and will be very appreciated.
Thanks in advance!
EDIT: Some overview on my program code generation - during the design stage the program is built as a tree hierarchy, which is then recursively serialized into one continuous memory block, along with indexing objects and calculating their offset from the beginning of the program memory block.
EDIT 2: Here is some pseudo code of the VM:
switch (*instruction) {
case 1: call_fn1(*(instruction+1)); instruction += (1+sizeof(parameter1)); break;
case 2: call_fn2(*(instruction+1), *(instruction+1+sizeof(parameter1)));
        instruction += (1+sizeof(parameter1)+sizeof(parameter2)); break;
case 3: instruction += *(instruction+1); break;
}
Case 1 is a function that takes one parameter, which is found immediately after the instruction, so it is passed as an offset of 1 byte from the instruction. The instruction pointer is incremented by 1 + the size of the first parameter to find the next instruction.
Case 2 is a function that takes two parameters, same as before, first parameter passed as 1 byte offset, second parameter passed as offset of 1 byte plus the size of the first parameter. The instruction pointer is then incremented by the size of the instruction plus sizes of both parameters.
Case 3 is a goto statement, the instruction pointer is incremented by an offset which immediately follows the goto instruction.
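Fleshed out, the dispatch above might look like the following self-contained C sketch. The opcode numbers, the int32_t operand type, and the stub call_fn* bodies are illustrative assumptions; the point is that every offset added to the instruction pointer is a compile-time constant, which the compiler can fold into an addressing-mode displacement rather than emitting arithmetic instructions:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative opcodes and operand type -- not from the original program. */
enum { OP_HALT = 0, OP_FN1 = 1, OP_FN2 = 2, OP_JMP = 3 };

static int result;                            /* accumulator the stub "functions" update */
static void call_fn1(int32_t a)            { result += a; }
static void call_fn2(int32_t a, int32_t b) { result += a * b; }

/* memcpy avoids unaligned-access traps; compilers turn it into a plain load. */
static int32_t read_i32(const uint8_t *p) { int32_t v; memcpy(&v, p, sizeof v); return v; }

static void run(const uint8_t *ip)
{
    for (;;) {
        switch (*ip) {
        case OP_FN1:   /* operand sits at constant offset 1 from the opcode */
            call_fn1(read_i32(ip + 1));
            ip += 1 + sizeof(int32_t);
            break;
        case OP_FN2:   /* second operand follows the first at a constant offset */
            call_fn2(read_i32(ip + 1), read_i32(ip + 1 + sizeof(int32_t)));
            ip += 1 + 2 * sizeof(int32_t);
            break;
        case OP_JMP:   /* relative goto: signed offset stored after the opcode */
            ip += read_i32(ip + 1);
            break;
        case OP_HALT:
            return;
        }
    }
}

/* Build and run a tiny program: fn1(7); fn2(3, 4); halt. */
static int run_demo(void)
{
    uint8_t prog[16] = {0};
    int32_t v;
    prog[0] = OP_FN1;  v = 7; memcpy(prog + 1, &v, sizeof v);
    prog[5] = OP_FN2;  v = 3; memcpy(prog + 6, &v, sizeof v);
                       v = 4; memcpy(prog + 10, &v, sizeof v);
    prog[14] = OP_HALT;
    result = 0;
    run(prog);
    return result;    /* 7 + 3*4 */
}
```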
EDIT 3: To my understanding, the OS will provide each process with its own dedicated virtual address space. If so, does this mean the first address is always ... well, zero, so the offset from the first byte of the memory block is actually the very address of this element? If every process gets a dedicated address space, and I know the offset of my program memory block AND the offset of every program object from the first byte of the memory block, then are the object addresses resolved during compile time?
Problem is those offsets are not available during the compilation of the C code, they become known during the "compilation" phase and translation to bytecode. Does this mean there is no way to do object memory address calculation for "free"?
How is this done in Java, for example, where only the virtual machine is compiled to machine code - does this mean the calculation of object addresses takes a performance penalty because of runtime arithmetic?

Here's an attempt to shed some light on how the linked questions and answers apply to this situation.
The answer to the first question mixes two different things, the first is the addressing modes in X86 instruction and the second is virtual-to-physical address mapping. The first is something that is done by compilers and the second is something that is (typically) set up by the operating system. In your case you should only be worrying about the first.
Instructions in X86 assembly have great flexibility in how they access a memory address. Instructions that read or write memory have the address calculated according to the following formula:
segment + base + index * size + offset
The segment portion of the address is almost always the default DS segment and can usually be ignored. The base portion is given by one of the general purpose registers or the stack pointer. The index part is given by one of the general purpose registers and the size is either 1, 2, 4, or 8. Finally the offset is a constant value embedded in the instruction. Each of these components is optional, but obviously at least one must be given.
This addressing capability is what is generally meant when talking about computing addresses without explicit arithmetic instructions. There is a special instruction that one of the commenters mentioned: LEA which does the address calculation but instead of reading or writing memory, stores the computed address in a register.
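In C terms, an expression like p[i].field maps directly onto that formula: the struct field's offset becomes the constant displacement and the element index becomes the scaled index. A tiny illustration (the struct and names are made up for this sketch):

```c
#include <stddef.h>

struct obj { long tag; long value; };   /* illustrative 16-byte layout */

/* The access below needs no explicit arithmetic instructions on x86:
   base = objs, index derived from i (a 16-byte stride is done with
   scale 8 plus a doubled index, or one shift), and the displacement is
   offsetof(struct obj, value) -- all folded into the load's address. */
static long get_value(const struct obj *objs, size_t i)
{
    return objs[i].value;
}
```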
For the code you included in the question, it is quite plausible that the compiler would use these addressing modes to avoid explicit arithmetic instructions.
As an example, the current value of the instruction variable could be held in the ESI register. Additionally, each of sizeof(parameter1) and sizeof(parameter2) are compile time constants. In the standard X86 calling conventions function arguments are pushed in reverse order (so the first argument is at the top of the stack) so the assembly codes might look something like
case1:
    PUSH [ESI+1]
    CALL fn1
    ADD ESP,4       ; drop arguments from stack
    ADD ESI,5
    JMP end_switch
case2:
    PUSH [ESI+5]
    PUSH [ESI+1]
    CALL fn2
    ADD ESP,8       ; drop arguments from stack
    ADD ESI,9
    JMP end_switch
case3:
    MOV ESI,[ESI+1]
    JMP end_switch
end_switch:
this is assuming that the size of both parameters is 4 bytes. Of course the actual code is up to the compiler, and it is reasonable to expect that the compiler will output fairly efficient code as long as you ask for some level of optimization.

You have a data item X in the VM, at relative address A, and an instruction that says (for instance) push X, is that right? And you want to be able to execute this instruction without having to add A to the base address of the VM's data area.
I have written a VM that solves this problem by mapping the VM's data area to a fixed Virtual Address. The compiler knows this Virtual Address, and so can adjust A at compile time. Would this solution work for you? Can you change the compiler yourself?
My VM runs on a smart card, and I have complete control over the OS, so it's a very different environment from yours. But Windows does have some facilities for allocating memory at a fixed address -- the VirtualAlloc function, for instance. You might like to try this out. If you do try it out, you might find that Windows is allocating regions that clash with your fixed-address data area, so you will probably have to load by hand any DLLs that you use, after you have allocated the VM's data area.
But there will probably be unforeseen problems to overcome, and it might not be worth the trouble.

Playing with virtual address translation, page tables or TLBs is something that can only be done at the OS kernel level, and is unportable between platforms and processor families. Furthermore hardware address translation on most CPU ISAs is usually supported only at the level of certain page sizes.

To answer my own question, based on the many responses I got.
Turns out what I want to achieve is not really possible in my situation; getting memory address calculation for free is attainable only when specific requirements are met, and it requires compilation to machine-specific instructions.
I am developing a visual, Lego-style drag-and-drop programming environment for educational purposes, which relies on a simple VM to execute the program code. I was hoping to maximize performance, but it is just not possible in my scenario. It is not that big of a deal though, because program elements can also generate their C code equivalent, which can then be compiled conventionally to maximize performance.
Thanks to everyone who responded and clarified a matter that wasn't really clear to me!

Related

Why in C++ recursion the addresses of stack variables grow backwards? [duplicate]

I am preparing some training materials in C and I want my examples to fit the typical stack model.
What direction does a C stack grow in Linux, Windows, Mac OSX (PPC and x86), Solaris, and most recent Unixes?
Stack growth doesn't usually depend on the operating system itself, but on the processor it's running on. Solaris, for example, runs on x86 and SPARC. Mac OSX (as you mentioned) runs on PPC and x86. Linux runs on everything from my big honkin' System z at work to a puny little wristwatch.
If the CPU provides any kind of choice, the ABI / calling convention used by the OS specifies which choice you need to make if you want your code to call everyone else's code.
The processors and their direction are:
x86: down.
SPARC: selectable. The standard ABI uses down.
PPC: down, I think.
System z: in a linked list, I kid you not (but still down, at least for zLinux).
ARM: selectable, but Thumb2 has compact encodings only for down (LDMIA = increment after, STMDB = decrement before).
6502: down (but only 256 bytes).
RCA 1802A: any way you want, subject to SCRT implementation.
PDP11: down.
8051: up.
Showing my age on those last few, the 1802 was the chip used to control the early shuttles (sensing if the doors were open, I suspect, based on the processing power it had :-) and my second computer, the COMX-35 (following my ZX80).
PDP11 details gleaned from here, 8051 details from here.
The SPARC architecture uses a sliding window register model. The architecturally visible details also include a circular buffer of register-windows that are valid and cached internally, with traps when that over/underflows. See here for details. As the SPARCv8 manual explains, SAVE and RESTORE instructions are like ADD instructions plus register-window rotation. Using a positive constant instead of the usual negative would give an upward-growing stack.
The afore-mentioned SCRT technique is another - the 1802 used some of its sixteen 16-bit registers for SCRT (standard call and return technique). One was the program counter; you could use any register as the PC with the SEP Rn instruction. One was the stack pointer, and two were set to always point to the SCRT code address, one for call, one for return. No register was treated in a special way. Keep in mind these details are from memory, they may not be totally correct.
For example, if R3 was the PC, R4 was the SCRT call address, R5 was the SCRT return address and R2 was the "stack" (quotes as it's implemented in software), SEP R4 would set R4 to be the PC and start running the SCRT call code.
It would then store R3 on the R2 "stack" (I think R6 was used for temp storage), adjusting it up or down, grab the two bytes following R3, load them into R3, then do SEP R3 and be running at the new address.
To return, it would SEP R5 which would pull the old address off the R2 stack, add two to it (to skip the address bytes of the call), load it into R3 and SEP R3 to start running the previous code.
Very hard to wrap your head around initially after all the 6502/6809/z80 stack-based code but still elegant in a bang-your-head-against-the-wall sort of way. Also one of the big selling features of the chip was a full suite of 16 16-bit registers, despite the fact you immediately lost 7 of those (5 for SCRT, two for DMA and interrupts from memory). Ahh, the triumph of marketing over reality :-)
System z is actually quite similar, using its R14 and R15 registers for call/return.
In C++ (adaptable to C) stack.cc:
static int
find_stack_direction ()
{
    static char *addr = 0;
    char dummy;
    if (addr == 0)
    {
        addr = &dummy;
        return find_stack_direction ();
    }
    else
    {
        return ((&dummy > addr) ? 1 : -1);
    }
}
The advantage of growing down is in older systems the stack was typically at the top of memory. Programs typically filled memory starting from the bottom thus this sort of memory management minimized the need to measure and place the bottom of the stack somewhere sensible.
Just a small addition to the other answers, which as far as I can see have not touched this point:
Having the stack grow downwards makes all addresses within the stack have a positive offset relative to the stack pointer. There's no need for negative offsets, as they would only point to unused stack space. This simplifies accessing stack locations when the processor supports stackpointer-relative addressing.
Many processors have instructions that allow accesses with a positive-only offset relative to some register. Those include many modern architectures, as well as some old ones. For example, the ARM Thumb ABI provides for stackpointer-relative accesses with a positive offset encoded within a single 16-bit instruction word.
If the stack grew upwards, all useful offsets relative to the stackpointer would be negative, which is less intuitive and less convenient. It also is at odds with other applications of register-relative addressing, for example for accessing fields of a struct.
Stack grows down on x86 (defined by the architecture, pop increments stack pointer, push decrements.)
In MIPS and many modern RISC architectures (like PowerPC, RISC-V, SPARC...) there are no push and pop instructions. Those operations are done explicitly, by manually adjusting the stack pointer and then loading/storing values relative to the adjusted pointer. All registers (except the zero register) are general purpose, so in theory any register can be a stack pointer, and the stack can grow in any direction the programmer wants.
That said, the stack typically grows down on most architectures, probably to avoid the stack clashing with program or heap data growing up from the bottom. There are also the good addressing reasons mentioned in the answer above. Some examples: MIPS ABIs grow the stack downwards and use $29 (a.k.a. $sp) as the stack pointer; the RISC-V ABI also grows it downwards and uses x2 as the stack pointer.
In the Intel 8051 the stack grows up, probably because the memory space is so tiny (128 bytes in the original version) that there is no heap, so there is no need to keep the stack at the top, separated from a heap growing from the bottom.
You can find more information about the stack usage in various architectures in https://en.wikipedia.org/wiki/Calling_convention
See also
Why does the stack grow downward?
What are the advantages to having the stack grow downward?
Why do stacks typically grow downwards?
Does stack grow upward or downward?
On most systems, the stack grows down, and my article at https://gist.github.com/cpq/8598782 explains WHY it grows down. It is simple: how do you lay out two growing memory blocks (heap and stack) in a fixed chunk of memory? The best solution is to put them at opposite ends and let them grow towards each other.
It grows down because the memory allocated to the program has the "permanent data" i.e. code for the program itself at the bottom, then the heap in the middle. You need another fixed point from which to reference the stack, so that leaves you the top. This means the stack grows down, until it is potentially adjacent to objects on the heap.
This macro should detect it at runtime without UB:
#include <stdint.h>

#define stk_grows_up_eh() stk_grows_up__(&(char){0})

_Bool stk_grows_up__(char *ParentsLocal);
__attribute((__noinline__))
_Bool stk_grows_up__(char *ParentsLocal) {
    return (uintptr_t)ParentsLocal < (uintptr_t)&ParentsLocal;
}

Why stack get backwards? [duplicate]

I know that in the architectures I'm personally familiar with (x86, 6502, etc), the stack typically grows downwards (i.e. every item pushed onto the stack results in a decremented SP, not an incremented one).
I'm wondering about the historical rationale for this. I know that in a unified address space, it's convenient to start the stack on the opposite end of the data segment (say) so there's only a problem if the two sides collide in the middle. But why does the stack traditionally get the top part? Especially given how this is the opposite of the "conceptual" model?
(And note that in the 6502 architecture, the stack also grows downwards, even though it is bounded to a single 256-byte page, and this direction choice seems arbitrary.)
As to the historic rationale, I can't say for certain (because I didn't design them). My thoughts on the matter are that early CPUs got their original program counter set to 0 and it was a natural desire to start the stack at the other end and grow downwards, since their code naturally grows upward.
As an aside, note that this setting of the program counter to 0 on reset is not the case for all early CPUs. For example, the Motorola 6809 would fetch the program counter from addresses 0xfffe/f so you could start running at an arbitrary location, depending on what was supplied at that address (usually, but by no means limited to, ROM).
One of the first things some historical systems would do would be to scan memory from the top until it found a location that would read back the same value written, so that it would know the actual RAM installed (e.g., a z80 with a 64K address space didn't necessarily have 64K of RAM; in fact 64K would have been massive in my early days). Once it found the top actual address, it would set the stack pointer appropriately and could then start calling subroutines. This scanning would generally be done by the CPU running code in ROM as part of start-up.
With regard to the stacks growth, not all of them grow downwards, see this answer for details.
One good explanation I heard was that some machines in the past could only have unsigned offsets, so you'd want the stack to grow downward so you could hit your locals without having to lose the extra instruction to fake a negative offset.
Stanley Mazor (4004 and 8080 architect) explains how stack growth direction was chosen for 8080 (and eventually for 8086) in "Intel Microprocessors: 8008 to 8086":
The stack pointer was chosen to run "downhill" (with the stack advancing toward lower memory) to simplify indexing into the stack from the user's program (positive indexing) and to simplify displaying the contents of the stack from a front panel.
One possible reason might be that it simplifies alignment. If you place a local variable on the stack which must be placed on a 4-byte boundary, you can simply subtract the size of the object from the stack pointer, and then zero out the two lower bits to get a properly aligned address. If the stack grows upwards, ensuring alignment becomes a bit trickier.
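With a downward stack, that alignment step is just a subtract and a mask; a tiny sketch (the helper name is made up):

```c
#include <stdint.h>

/* Allocate `size` bytes on a downward-growing stack, rounded to a 4-byte
   boundary: clearing the two low bits can only move the pointer further
   down, never back up into the object just allocated. An upward stack
   would instead need an add followed by rounding up. */
static uintptr_t push_aligned_down(uintptr_t sp, uintptr_t size)
{
    return (sp - size) & ~(uintptr_t)3;
}
```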
IIRC the stack grows downwards because the heap grows upwards. It could have been the other way around.
I believe it's purely a design decision. Not all of them grow downward -- see this SO thread for some good discussion on the direction of stack growth on different architectures.
I'm not certain but I did some programming for the VAX/VMS back in the days. I seem to remember one part of memory (the heap??) going up and the stack going down. When the two met, then you were out of memory.
I believe the convention began with the IBM 704 and its infamous "decrement register". Modern speech would call it an offset field of the instruction, but the point is they went down, not up.
Just 2c more:
Beyond all the historic rationale mentioned, I'm quite certain there's no reason which is valid in modern processors. All processors can take signed offsets, and maximizing the heap/stack distance is rather moot ever since we started dealing with multiple threads.
I personally consider this a security design flaw. If, say, the designers of the x64 architecture had reversed the stack growth direction, most stack buffer overflows would have been eliminated, since strings grow upward - which is kind of a big deal.
Because then a POP uses the same addressing mode that is commonly used to scan through strings and arrays
An instruction that pops a value off of a stack needs to do two things: read the value out of memory, and adjust the stack pointer. There are four possible design choices for this operation:
Preincrement the stack pointer first, then read the value. This implies that the stack will grow "downwards" (towards lower memory addresses).
Predecrement the stack pointer first, then read the value. This implies that the stack will grow "upwards" (towards higher memory addresses).
Read the value first, then postincrement the stack pointer. This implies that the stack will grow downwards.
Read the value first, then postdecrement the stack pointer. This implies that the stack will grow upwards.
In many computer languages (particularly C), strings and arrays are passed to functions as pointers to their first element. A very common operation is to read the elements of the string or array in order, starting with the first element. Such an operation needs only the postincrement addressing mode described above.
Furthermore, reading the elements of a string or array is more common than writing the elements. Indeed, there are many standard library functions that perform no writing at all (e.g. strlen(), strchr(), strcmp())!
Therefore, if you have a limited number of addressing modes in your instruction set design, the most useful addressing mode would be a read that postincrements. This results in not only the most useful string and array operations, but also a POP instruction that grows the stack downward.
The second-most-useful addressing mode would then be a post-decrement write, which can be used for the matching PUSH instruction.
Indeed, the PDP-11 had postincrement and predecrement addressing modes, which produced a downward-growing stack. Even the VAX did not have preincrement or postdecrement.
One advantage of descending stack growth in a minimal embedded system is that a single chunk of RAM can be redundantly mapped into both page 0 and page 1, allowing zero-page variables to be assigned starting at 0x000 and the stack to grow downwards from 0x1FF, maximizing the amount it would have to grow before overwriting variables.
One of the original design goals of the 6502 was that it could be combined with, for example, a 6530, resulting in a two-chip microcontroller system with 1 KB of program ROM, timer, I/O, and 64 bytes of RAM shared between stack and page zero variables. By comparison, the minimal embedded system of that time based on an 8080 or 6800 would be four or five chips.

How to safely implement "Using Uninitialized Memory For Fun And Profit"?

I would like to build a dense integer set in C++ using the trick described at https://research.swtch.com/sparse . This approach achieves good performance by allowing itself to read uninitialized memory.
How can I implement this data structure without triggering undefined behavior, and without running afoul of tools like Valgrind or ASAN?
Edit: It seems like responders are focusing on the word "uninitialized" and interpreting it in the context of the language standard. That was probably a poor word choice on my part - here "uninitialized" means only that its value is not important to the correct functioning of the algorithm. It's obviously possible to implement this data structure safely (LLVM does it in SparseMultiSet). My question is what is the best and most performant way to do so?
I can see four basic approaches you can take. These are applicable not only to C++ but also most other low-level languages like C that make uninitialized access possible but not allowed, and the last is applicable even to higher-level "safe" languages.
Ignore the standard, implement it in the usual way
This is the one crazy trick language lawyers hate! Don't freak out yet though - the solutions following this one won't break the rules, so just skip this part if you are of the rules-stickler variety.
The standard makes most uses of uninitialized values undefined, and the few loopholes it does allow (e.g., copying one undefined value to another) don't really give you enough rope to actually implement what you want - even in C, which is slightly less restrictive (see for example this answer covering C11, which explains that while accessing an indeterminate value may not directly trigger UB, anything that results is also indeterminate, and indeed the value may appear to change from access to access).
So you just implement it anyway, with the knowledge that most or all current compilers will just compile it to the expected code, and know that your code is not standards compliant.
At least in my test all of gcc, clang and icc didn't take advantage of the illegal access to do anything crazy. Of course, the test is not comprehensive and even if you could construct one, the behavior could change in a new version of the compiler.
You would be safest if the implementation of the methods that access uninitialized memory was compiled, once, in a separate compilation unit - this makes it easy to check that it does the right thing (just check the assembly once) and makes it nearly impossible (outside of LTO) for the compiler to do anything tricky, since it can't prove whether uninitialized values are being accessed.
Still, this approach is theoretically unsafe and you should check the compiled output very carefully and have additional safeguards in place if you take it.
If you take this approach, tools like valgrind are fairly likely to report an uninitialized read error.
Now these tools work at the assembly level, and some uninitialized reads may be fine (see, for example, the next item on fast standard library implementations), so they don't actually report an uninitialized read immediately, but rather have a variety of heuristics to determine if invalid values are actually used. For example, they may avoid reporting an error until they determine the uninitialized value is used to determine the direction of a conditional jump, or some other action that is not trackable/recoverable according to the heuristic. You may be able to get the compiler to emit code that reads uninitialized memory but is safe according to this heuristic.
More likely, you won't be able to do that (since the logic here is fairly subtle as it relies on the relationship between the values in two arrays), so you can use the suppression options in your tools of choice to hide the errors. For example, valgrind can suppress based on the stack trace - and in fact there are already many such suppression entries used by default to hide false-positives in various standard libraries.
Since it works based on stack traces, you'll probably have difficulties if the reads occur in inlined code, since the top-of-stack will then be different for every call site. You could avoid this by making sure the function is not inlined.
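For example, a valgrind suppression entry for such a read, matched by function name, might look like the following (the function name pattern here is illustrative):

```
{
   sparse_set_intentional_uninit_read
   Memcheck:Cond
   fun:*is_member*
}
```

Entries like this go in a file passed via --suppressions=, and valgrind can generate candidate entries for you with --gen-suppressions=all.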
Use assembly
What is ill-defined in the standard, is usually well-defined at the assembly level. It is why the compiler and standard library can often implement things in a faster way than you could achieve with C or C++: a libc routine written in assembly is already targeting a specific architecture and doesn't have to worry about the various caveats in the language specification that are there to make things run fast on a variety of hardware.
Normally, implementing any serious amount of code in assembly is a costly undertaking, but here it is only a handful of functions, so it may be feasible depending on how many platforms you are targeting. You don't even really need to write the methods yourself - just compile the C++ version (or use godbolt) and copy the assembly. The is_member function, for example, looks like:
sparse_array::is_member(unsigned long):
    mov  rax, QWORD PTR [rdi+16]
    mov  rdx, QWORD PTR [rax+rsi*8]
    xor  eax, eax
    cmp  rdx, QWORD PTR [rdi]
    jnb  .L1
    mov  rax, QWORD PTR [rdi+8]
    cmp  QWORD PTR [rax+rdx*8], rsi
    sete al
Rely on calloc magic
If you use calloc, you explicitly request zeroed memory from the underlying allocator. Now a correct version of calloc may simply call malloc and then zero out the returned memory, but actual implementations rely on the fact that the OS-level memory allocation routines (sbrk and mmap, pretty much) will generally return you zeroed memory on any OS with protected memory (i.e., all the big ones), and so avoid zeroing out the memory again.
As a practical matter, for large allocations this is typically satisfied by backing an anonymous mmap with a special page of all zeros. Only when (if ever) the memory is written does copy-on-write actually allocate a new page. So the allocation of large, zeroed memory regions may be free, since the OS already needs to zero the pages.
In that case, implementing your sparse set on top of calloc could be just as fast as the nominally uninitialized version, while being safe and standards compliant.
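A minimal C sketch of that calloc-backed variant - the layout follows the idea in the linked article, but the names and API here are illustrative, not the article's code:

```c
#include <stdlib.h>

/* Sparse set over the universe [0, cap). calloc gives well-defined zeroed
   memory, so reading a never-written sparse[] slot is legal: it just
   yields 0, and the dense[]-side check rejects false positives. */
struct sparse_set {
    size_t *sparse;   /* maps value -> candidate index into dense[] */
    size_t *dense;    /* the members, packed */
    size_t  n;        /* number of members */
    size_t  cap;      /* universe size */
};

static int ss_init(struct sparse_set *s, size_t cap)
{
    s->sparse = calloc(cap, sizeof *s->sparse);
    s->dense  = calloc(cap, sizeof *s->dense);
    s->n   = 0;
    s->cap = cap;
    return s->sparse != NULL && s->dense != NULL;
}

static int ss_is_member(const struct sparse_set *s, size_t i)
{
    size_t d = s->sparse[i];          /* possibly never written; zero here */
    return d < s->n && s->dense[d] == i;
}

static void ss_insert(struct sparse_set *s, size_t i)
{
    if (!ss_is_member(s, i)) {
        s->dense[s->n] = i;
        s->sparse[i]   = s->n++;
    }
}
```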
Calloc Caveats
You should of course test to ensure that calloc is behaving as expected. The optimized behavior is usually only going to occur when your program initializes a lot of long-lived zeroed memory approximately "up-front". That is, the typical logic for an optimized calloc is something like this:
calloc(N):
    if (can satisfy a request for N bytes from allocated-then-freed memory)
        memset those bytes to zero and return them
    else
        ask the OS for memory, return it directly because it is zeroed
Basically, the malloc infrastructure (which also underlies new and friends) has a (possibly empty) pool of memory that it has already requested from the OS and generally tries to allocate from first. This pool is composed of memory from the last block requested from the OS but not yet handed out (e.g., because the user requested 32 bytes but the allocator asks the OS for chunks in 1 MB blocks, so there is a lot left over), and also of memory that was handed out to the process but subsequently returned via free or delete or whatever. The memory in that pool has arbitrary values, and if a calloc can be satisfied from that pool, you don't get your magic, since the zero-init has to occur.
On the other hand, if the memory has to be allocated from the OS, you get the magic. So it depends on your use case: if you are frequently creating and destroying sparse_set objects, you will generally just be drawing from the internal malloc pools and will pay the zeroing costs. If you have long-lived sparse_set objects which take up a lot of memory, they were likely allocated by asking the OS, and you got the zeroing nearly for free.
The good news is that if you don't want to rely on the calloc behavior above (indeed, on your OS or with your allocator it may not even be optimized in that way), you could usually replicate the behavior by mapping in /dev/zero manually for your allocations. On OSes that offer it, this guarantees that you get the "cheap" behavior.
Use Lazy Initialization
For a solution that is totally platform agnostic you could simply use yet another array which tracks the initialization state of the array.
First you choose some granule at which you will track initialization, and use a bitmap where each bit tracks the initialization state of one granule of the sparse array.
For example, let's say you choose your granule to be 4 elements, and the size of the elements in your array is 4 bytes (e.g., int32_t values): you need 1 bit to track every 4 elements * 4 bytes/element * 8 bits/byte = 128 bits, which is an overhead of less than 1% in allocated memory.
Now you simply check the corresponding bit in this array before accessing sparse. This adds some small cost to accessing the sparse array, but doesn't change the overall complexity, and the check is still quite fast.
For example, your is_member function now looks like:
bool sparse_set::is_member(size_t i) {
    bool init = is_init[i >> INIT_SHIFT] & (1UL << (i & INIT_MASK));
    return init && sparse[i] < n && dense[sparse[i]] == i;
}
The generated assembly on x86 (gcc) now starts with:
    mov rax, QWORD PTR [rdi+24]
    mov rdx, rsi
    shr rdx, 8
    mov rdx, QWORD PTR [rax+rdx*8]
    xor eax, eax
    bt  rdx, rsi
    jnc .L2
    ...
.L2:
    ret
That's all that is associated with the bitmap check. It's all going to be pretty quick (and often off the critical path, since it isn't part of the data flow).
In general, the cost of this approach depends on the density of your set and what functions you are calling - is_member is about the worst case for this approach, since some functions (e.g., clear) aren't affected at all, and others (e.g., iterate) can batch up the is_init check and only do it once every INIT_COVERAGE elements (meaning the overhead would again be ~1% for the example values).
Sometimes this approach will be faster than the approach suggested in the OP's link, especially when handling elements not in the set - in that case the is_init check will fail and often short-circuit the remaining code, and you have a working set that is much smaller (256 times smaller, using the example granule size) than the sparse array itself, so you may greatly reduce misses to DRAM or outer cache levels.
The granule size itself is an important tunable for this approach. Intuitively, a larger granule size pays a larger initialization cost when an element covered by the granule is accessed for the first time, but saves on memory and up-front is_init initialization cost. You can come up with a formula that finds the optimum size in the simple case - but the behavior also depends on the "clustering" of values and other factors. Finally, it is totally reasonable to use a dynamic granule size to cover your bases under varying workloads - but it comes at the cost of variable shifts.
Really Lazy Solution
It is worth noting that there is a similarity between the calloc and lazy init solutions: both lazily initialize blocks of memory as they are needed, but the calloc solution tracks this implicitly in hardware through MMU magic with page tables and TLB entries, while the lazy init solution does it in software, with a bitmap explicitly tracking which granules have been initialized.
The hardware approach has the advantage of being nearly free (for the "hit" case, anyway) since it uses the always-present virtual memory support in the CPU to detect misses, but the software case has the advantage of being portable and allowing precise control over the granule size, etc.
You can actually combine these approaches to make a lazy approach that doesn't use a bitmap, and doesn't even need the dense array at all: just allocate your sparse array with mmap as PROT_NONE, so you fault whenever you read from an un-allocated page in the sparse array. You catch the fault and allocate the page in the sparse array with a sentinel value indicating "not present" for every element.
This is the fastest of all for the "hot" case: you don't need any of the ... && dense[sparse[i]] == i checks any more.
The downsides are:
Your code is really not portable since you need to implement the fault-handling logic, which is usually platform specific.
You can't control the granule size: it must be at page granularity (or some multiple thereof). If your set is very sparse (say less than 1 out of 4096 elements occupied) and uniformly distributed, you end up paying a high initialization cost since you need to handle a fault and initialize a full page of values for every element.
Misses (i.e., non-insert accesses to set elements that don't exist) either need to allocate a page even if no elements will exist in that range, or will be very slow (incurring a fault) every time.
1 This implementation has no "range checking" - i.e., it doesn't check if i is greater than MAX_ELEM - depending on your use case you may want to check this. My implementation used a template parameter for MAX_ELEM, which may result in slightly faster code, but also more bloat, and you'd do fine to just make the max size a class member as well.
2 Really, the only requirement is that you use something that calls calloc under the covers or performs the equivalent zero-fill optimization, but based on my tests more idiomatic C++ approaches like new int[size]() just do the allocation followed by a memset. gcc does optimize malloc followed by memset into calloc, but that's not useful if you are trying to avoid the use of C routines anyway!
3 Precisely, you need 1 extra bit to track every 128 bits of the sparse array.
If we reword your question:
What code reads from uninitialized memory without tripping tools designed to catch reads from uninitialized memory?
Then the answer becomes clear -- it is not possible. Any way of doing this that you could find represents a bug in Valgrind that would be fixed.
Maybe it's possible to get the same performance without UB, but the restrictions you put on your question "I would like to... use the trick... allowing itself to read uninitialized memory" guarantee UB. Any competing method avoiding UB will not be using the trick that you so love.
Valgrind does not complain if you just read uninitialised memory. Valgrind will complain if you use this data in a way that influences the visible behaviour of the program, e.g. using this data as input in a syscall, or doing a jump based on this data, or using this data to compute another address. See http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.uninitvals for more info.
So, it might very well be that you will have no problem with Valgrind.
If Valgrind still complains but your algorithm is correct even when using this uninit data, then you can use client requests to declare this memory as initialised.

Which costs more, computed goto/jump vs fastcall through function pointer?

I am in a dilemma: which would be the better-performing option for the main loop of a VM:
option 1 - force inline for the instruction functions, and use computed goto to jump to the label holding the (effectively inlined) call of that instruction... or...
option 2 - use a lookup array of function pointers, each pointing to a fastcall function, where the instruction determines the index.
Basically, what is better: a lookup table with jump addresses and inline code, or a lookup table with fastcall function addresses? Yes, I know, both are effectively just memory addresses and jumps back and forth, but I think fastcall may still cause some data to be pushed on the stack if out of register space, even if forced to use registers for the parameters.
Compiler is GCC.
I assume that with "virtual machine" you refer to a simulated processor executing some sort of bytecode, similar to the "Java virtual machine", and not a whole simulated computer that allows installation of another OS (like in VirtualBox/VMware).
My suggestion is to let the compiler make the decision about what has the best performance, and create a big traditional "switch" on the current item of the bytecode stream. This will likely result in a jump table created by the compiler, so it is as fast (or slow) as your computed-goto variant, but more portable.
Your variant 2 - lookup array of function pointers - is likely slower than inlined functions, as there is likely extra overhead with non-inlined functions, such as the handling of return values. After all, some of your VM-op functions (like "goto" or "set-register-to-immediate") have to modify the instruction pointer, others don't need to.
Generally, calls to function pointers (or jumps via a jump table) are slow on current CPUs, as they are hardly ever predicted right by branch prediction. So, if you think about optimizing your VM, try to find a set of instructions that requires as few dispatches as possible.

What is a jump table?

Can someone explain the mechanics of a jump table and why is would be needed in embedded systems?
A jump table can be either an array of pointers to functions or an array of machine code jump instructions. If you have a relatively static set of functions (such as system calls or virtual functions for a class) then you can create this table once and call the functions using a simple index into the array. This would mean retrieving the pointer and calling a function or jumping to the machine code depending on the type of table used.
The benefits of doing this in embedded programming are:
Indexes are more memory efficient than machine code or pointers, so there is a potential for memory savings in constrained environments.
For any particular function the index will remain stable and changing the function merely requires swapping out the function pointer.
It does cost you a tiny bit of performance for accessing the table, but this is no worse than any other virtual function call.
A jump table, also known as a branch table, is a series of instructions, all unconditionally branching to another point in code.
You can think of them as a switch (or select) statement where all the cases are filled:
void MyJump(int c)
{
    switch(c)
    {
    case 0:
        goto func0label;
    case 1:
        goto func1label;
    case 2:
        goto func2label;
    }
}
Note that there's no return - the code that it jumps to will execute the return, and control will go back to wherever MyJump was called.
This is useful for state machines where you execute certain code based on the state variable. There are many, many other uses, but this is one of the main uses.
It's used where you don't want to waste time fiddling with the stack, and want to save code space. It is especially of use in interrupt handlers where speed is extremely important, and the peripheral that caused the interrupt is only known by a single variable. This is similar to the vector table in processors with interrupt controllers.
One use would be taking a $0.60 microcontroller and generating a composite (TV) signal for video applications. The micro isn't powerful - in fact it's just barely fast enough to write each scan line. A jump table would be used to draw characters, because it would take too long to load a bitmap from memory and use a for() loop to shove the bitmap out. Instead there's a separate jump to the letter and scan line, and then 8 or so instructions that actually write the data directly to the port.
-Adam
Jump tables are commonly (but not exclusively) used in finite state machines to make them data driven.
Instead of nested switch/case
switch (state) {
case A:
    switch (event) {
    case e1: ....
    case e2: ....
    }
case B:
    switch (event) {
    case e3: ....
    case e1: ....
    }
}
you can make a 2D array of function pointers and just call handleEvent[state][event]
From Wikipedia:
In computer programming, a branch table (sometimes known as a jump table) is a term used to describe an efficient method of transferring program control (branching) to another part of a program (or a different program that may have been dynamically loaded) using a table of branch instructions. The branch table construction is commonly used when programming in assembly language but may also be generated by a compiler.

A branch table consists of a serial list of unconditional branch instructions that is branched into using an offset created by multiplying a sequential index by the instruction length (the number of bytes in memory occupied by each branch instruction). It makes use of the fact that machine code instructions for branching have a fixed length and can be executed extremely efficiently by most hardware, and is most useful when dealing with raw data values that may be easily converted to sequential index values. Given such data, a branch table can be extremely efficient; it usually consists of the following steps: optionally validating the input data to ensure it is acceptable; transforming the data into an offset into the branch table, which usually involves multiplying or shifting it to take into account the instruction length; and branching to an address made up of the base of the table and the generated offset, which often involves an addition of the offset onto the program counter register.
A jump table is described here, but briefly, it's an array of addresses the CPU should jump to based on certain conditions. As an example, a C switch statement is often implemented as a jump table where each jump entry will go to a particular "case" label.
In embedded systems, where memory usage is at a premium, many constructs are better served by using a jump table instead of more memory-intensive methods (like a massive if-else-if).
Wikipedia sums it up pretty well:
In computer programming, a branch table (sometimes known as a jump table) is a term used to describe an efficient method of transferring program control (branching) to another part of a program (or a different program that may have been dynamically loaded) using a table of branch instructions. The branch table construction is commonly used when programming in assembly language but may also be generated by a compiler.

... Use of branch tables and other raw data encoding was common in the early days of computing when memory was expensive, CPUs were slower and compact data representation and efficient choice of alternatives were important. Nowadays, they are commonly used in embedded programming and operating system development.
In other words, it's a useful construct to use when your system is extremely memory and/or CPU limited, as is often the case in an embedded platform.
Jump tables, more often known as branch tables, are usually used only by the machine.
The compiler creates a list of all labels in an assembly program and links each label to a memory location. A jump table is pretty much a reference card to where a function or variable (or whatever the label may be) is stored in memory.
So as a function executes, on finishing it jumps back to its previous memory location or jumps to the next function, etc.
And if you're talking about what I think you are, you don't just need them in embedded systems but in any type of compiled/interpreted environment.
Brian Gianforcaro