Context
My question is twofold (really two questions) but quite basic*. But first, I will show some relevant code for some context. For the TL;DR 'meat and potatoes', skip to the bottom for the actual questions.
*(I'm assuming answerers are aware of what is happening/how a virtual machine operates fundamentally before attempting to answer).
As mentioned, I am writing a (toy) VM, which executes a custom byte code instruction set.
(ellipses here only represent omission of some cases)
Here is a snippet of my code:
for (ip = 0; (ip < _PROGRAM_SIZE || !cstackempty); ip++) {
    if (breakPending) { break; }
    switch (_instr) {
        case INST::PUSH: {
            AssertAbort(wontoverflow(1), "Stack overflow (1 byte)");
            cmd_ "PUSH";
            push(_incbyte);
            printStack();
            break;
        }
        ...
        case INST::ADD: {
            AssertAbort(stackhas(2), "Can't pop stack to add 2 bytes. Stack does not contain 2 bytes");
            cmd_ "ADD";
            byte popped8_a = pop();
            byte popped8_b = pop();
            byte result = popped8_a + popped8_b;
            push(result);
            cmd_ " "; cmd_ (byte)result;
            printStack();
            break;
        }
        case INST::ADD16: {
            AssertAbort(stackhas(4), "Can't pop stack to add 4 bytes. Stack does not contain 4 bytes");
            cmd_ "ADD16";
            u16 popped16_a = pop16();
            u16 popped16_b = pop16();
            u16 result = popped16_a + popped16_b;
            push16(result);
            cmd << " "; cmd << (u16)result;
            printStack();
            break;
        }
        ...
    }
}
Only because it's relevant, I will mention that _cstack is the call stack, hence the !cstackempty macro, which checks whether the call stack is empty before calling it quits (exiting the for loop) just because the last instruction is being executed (because that last instruction could well be part of a function, or even a return). Also, ip (the instruction pointer) is simply an unsigned long long (u64), as is _PROGRAM_SIZE (the size of the program in bytes). _instr is a byte and is a reference to the current instruction (1 byte).
Meat and potatoes
Question 1: Since I'm initialising two new integers of variable size per block/case (segmented into blocks to avoid redeclaration errors and such), would declaring them above the for loop be helpful in terms of speed, assignment latency, program size, etc.?
Question 2: Would continue be faster than break in this case, and is there any faster way to execute such a conditional loop, such as some kind of goto-pointer-to-label like in this post, that is implementation agnostic, or somehow avoid the cost of either continue or break?
To summarize, my priorities are speed, then memory costs (speed, efficiency), then file size (of the VM).
Before answering the specific questions, a note: There isn't any CPU that executes C++ directly. So any question of this type of micro-optimization at the language level depends heavily on the compiler, software runtime environment and target hardware. It is entirely possible that one technique works better on the compiler you are using today, but worse on the one you use tomorrow. Similarly for hardware choices such as CPU architecture.
The only way to get a definitive answer of which is better is to benchmark it in a realistic situation, and often the only way to understand the benchmark results is to dive into the generated assembly. If this kind of optimization is important to you, consider learning a bit about the assembly language for your development architecture.
Given that, I'll pick a specific compiler (gcc) and a common architecture (x86) and answer in that context. The details will differ slightly for other choices, but I expect the broad strokes to be similar for any decent compiler and hardware combination.
Question 1
The place of declaration doesn't matter. The declaration itself doesn't even really turn into code - it's only the definition and use that generate code.
For example, consider the two variants of a simple loop below (the external sink() method is just there to avoid optimizing away the assignment to a):
Declaration Inside Loop
int func(int* num) {
    for (unsigned int i=0; i<100; i++) {
        int a = *num + *num;
        sink(a);
        sink(a);
    }
}
Declaration Outside Loop
int func(int* num) {
    int a;
    for (unsigned int i=0; i<100; i++) {
        a = *num + *num;
        sink(a);
        sink(a);
    }
}
We can use the godbolt compiler explorer to easily check the assembly generated for the first and second variants. They are identical - here's the loop:
.L2:
    mov ebp, DWORD PTR [r12]
    add ebx, 1
    add ebp, ebp
    mov edi, ebp
    call sink(int)
    mov edi, ebp
    call sink(int)
    cmp ebx, 100
    jne .L2
Basically the declaration doesn't produce any code - only the assignment does.
Question 2
Here it is key to note that at the hardware level, there aren't instructions like "break" or "continue". You really only have jumps, either conditional or not, which are basically gotos. Both break and continue will be translated to jumps. In your case, a break inside a switch, where the break is the last statement in the loop, and a continue inside the switch have exactly the same effect, so I expect them to be compiled identically, but let's check.
Let's use this test case:
int func(unsigned int num, int iters) {
    for (; iters > 0; iters--) {
        switch (num) {
            case 0:
                sinka();
                break;
            case 1:
                sinkb();
                break;
            case 2:
                sinkc();
                break;
            case 3:
                sinkd();
                break;
            case 4:
                sinkd();
                break;
        }
    }
}
It uses the break to exit the case. Here's the godbolt output on gcc 4.4.7 for x86, ignoring the function prologue:
.L13:
    cmp ebp, 4
    ja .L3
    jmp [QWORD PTR [r13+r12*8]] # indirect jump
.L9:
    .quad .L4
    .quad .L5
    .quad .L6
    .quad .L7
    .quad .L8
.L4:
    call sinka()
    jmp .L3
.L5:
    call sinkb()
    jmp .L3
.L6:
    call sinkc()
    jmp .L3
.L7:
    call sinkd()
    jmp .L3
.L8:
    call sinkd()
.L3:
    sub ebx, 1
    test ebx, ebx
    jg .L13
Here, the compiler has chosen a jump table approach. The value of num is used to look up a jump address (the table is the series of .quad directives), and then an indirect jump is made to one of the labels .L4 through .L8. The breaks change into jmp .L3, which executes the loop logic.
Note that a jump table isn't the only way to compile a switch - if I used 4 or fewer case statements, the compiler instead chose a series of compares and branches.
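For illustration only (a sketch, not actual compiler output), that branchy lowering of a small switch amounts to an if/else chain:

switch (num) {              // few cases: compare-and-branch wins
    case 0: sinka(); break;
    case 1: sinkb(); break;
    case 2: sinkc(); break;
}
// is lowered roughly as if it were written:
// if (num == 0)      sinka();
// else if (num == 1) sinkb();
// else if (num == 2) sinkc();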
Let's try the same example, but with each break replaced with a continue:
int func(unsigned int num, int iters) {
    for (; iters > 0; iters--) {
        switch (num) {
            case 0:
                sinka();
                continue;
            ... [16 lines omitted] ...
        }
    }
}
As you might have guessed by now, the results are identical - at least for this particular compiler and target. The continue statements and break statements imply the exact same control flow, so I'd expect this to be true for most decent compilers with optimization turned on.
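As for the goto-pointer-to-label idea mentioned in the question: it is not implementation agnostic. GCC and clang support "labels as values" (computed goto) as an extension, so if you can accept the non-portability, a direct-threaded dispatch loop looks roughly like this sketch (the opcodes here are made up, not taken from the VM above):

// GCC/clang extension, not ISO C++: a hypothetical 3-opcode dispatch loop.
void run(const unsigned char* code) {
    // One label address per opcode value.
    static void* dispatch[] = { &&op_push, &&op_add, &&op_halt };
#define NEXT goto *dispatch[*code++]
    NEXT;                   // jump to the first instruction's handler
op_push:
    /* ...push a byte... */
    NEXT;                   // each handler jumps straight to the next one
op_add:
    /* ...pop two bytes, push their sum... */
    NEXT;
op_halt:
    ;
#undef NEXT
}

This removes the jump back to the top of the loop entirely, but whether it beats a compiler-generated jump table is, again, something only a benchmark can tell you.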
For Question 2, a processor should be able to handle break reasonably well: it compiles to an unconditional branch (one that is always taken), so the branch prediction unit should get it right and it shouldn't cause a pipeline flush. I believe Question 1 was answered above.
So it is possible to make the system call a custom function for pure virtual functions[1]. This raises the question of what such a function can do. For GCC, the vtable is:
Vtable for Foo
Foo::_ZTV3Foo: 5u entries
0 (int (*)(...))0
8 (int (*)(...))(& _ZTI3Foo)
16 0u
24 0u
32 (int (*)(...))__cxa_pure_virtual
And it is placed directly in the slot for the pure virtual function. Since the function prototype void foo() does not match the true signature, is the stack still sane? In particular, can I throw an exception and catch it somewhere and continue execution?
[1] Is there an equivalent of _set_purecall_handler() in Linux?
Read the x86-64 ABI supplement to understand what is really happening; notably about calling conventions.
In your case, the stack is safe (because calling a void foo(void) in place of a function with any other signature doesn't disturb it), and you probably can throw some exception.
Details are compiler and processor specific. Your hack might perhaps work (but probably not), and it is really unportable (since it is technically undefined behavior, IIUC).
I'm not sure it will work. Perhaps the compiler would emit an indirect jump, and you'll jump to the nil address, and that is a SIGSEGV.
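For concreteness, the kind of replacement being discussed (and the answer to the footnote) is simply defining your own __cxa_pure_virtual; on GCC/Itanium-ABI systems the linker picks your definition over the one in libstdc++, roughly analogous to _set_purecall_handler. A sketch - whether the throw is then recoverable is exactly the open question:

#include <stdexcept>

// Replaces the default handler from the runtime (which normally terminates).
extern "C" void __cxa_pure_virtual()
{
    throw std::runtime_error("pure virtual function called");
}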
Notice that a virtual call is just an indirect jump; with
class Foo {
public:
    virtual void bar(void) =0;
    virtual ~Foo();
};

extern "C" void doit(Foo*f) {
    f->bar();
}
The assembly code (produced with g++-4.9 -Wall -O -fverbose-asm -S foo.cc) is:
        .type   doit, @function
doit:
.LFB0:
        .file 1 "foo.cc"
        .loc 1 7 0
        .cfi_startproc
.LVL0:
        subq    $8, %rsp        #,
        .cfi_def_cfa_offset 16
        .loc 1 8 0
        movq    (%rdi), %rax    # f_2(D)->_vptr.Foo, f_2(D)->_vptr.Foo
        call    *(%rax)         # *_3
.LVL1:
        .loc 1 9 0
        addq    $8, %rsp        #,
        .cfi_def_cfa_offset 8
        ret
        .cfi_endproc
.LFE0:
        .size   doit, .-doit
and I don't see any checks against unbound virtual methods above.
It is much better, and more portable, to define the virtual methods in the base class to throw the exception.
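For instance (a sketch): make the method an ordinary virtual with a throwing default body instead of a pure one, so an unimplemented call fails in a well-defined, portable way:

#include <stdexcept>

class Foo {
public:
    // Not pure: a call through an unoverridden slot now throws predictably.
    virtual void bar() { throw std::logic_error("Foo::bar not implemented"); }
    virtual ~Foo() {}
};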
You might customize your GCC compiler using MELT to fit your bizarre needs!
The application I am dealing with right now uses some brute-force numerical algorithm that calls many tiny functions billions of times. I was wondering how much the performance can be improved by eliminating function calls using inlining and static polymorphism.
What is the cost of the following, relative to a plain (non-inline, non-intrinsic) function call:
1) function call via function pointer
2) virtual function call
I know that it is hard to measure, but a very rough estimate would do.
Thank you!
To make a plain member function call, the compiler needs to:
Fetch address of function -> Call function
To call a virtual function, the compiler needs to:
Fetch address of vptr -> Fetch address of the function -> Call function
Note: the virtual mechanism is a compiler implementation detail, so the implementation might differ across compilers; there may not even be a vptr or vtable for that matter. Having said that, compilers usually implement it with a vptr and vtable, and then the above holds true.
So there is some overhead for sure (one additional fetch). To know precisely how much it impacts you, you will have to profile your source code; there is no simpler way.
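In rough pseudo-C++, a vptr/vtable dispatch of obj->foo() amounts to the following (a hypothetical layout for illustration only; real compilers differ in the details):

typedef void (*method_t)(void* self);

struct vtable { method_t slots[1]; };          // one slot per virtual function
struct object { const vtable* vptr; /* data members... */ };

void call_foo(object* obj) {
    const vtable* vt = obj->vptr;              // fetch 1: load the vptr
    method_t fn = vt->slots[0];                // fetch 2: load the function address
    fn(obj);                                   // indirect call
}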
It depends on your target architecture and your compiler, but one thing you can do is write a small test and check the assembly generated.
I wrote one to do the test:
// test.h
#ifndef FOO_H
#define FOO_H

void bar();

class A {
public:
    virtual ~A();
    virtual void foo();
};

#endif

// main.cpp
#include "test.h"

void doFunctionPointerCall(void (*func)()) {
    func();
}

void doVirtualCall(A *a) {
    a->foo();
}

int main() {
    doFunctionPointerCall(bar);
    A a;
    doVirtualCall(&a);
    return 0;
}
Note that you don't even need to write test.cpp, since you just need to check the assembly for main.cpp.
To see the compiler assembly output, with gcc use the flag -S:
gcc main.cpp -S -O3
It will create a file main.s, with the assembly output.
Now we can see what gcc generated for the calls.
doFunctionPointerCall:

        .globl  _Z21doFunctionPointerCallPFvvE
        .type   _Z21doFunctionPointerCallPFvvE, @function
_Z21doFunctionPointerCallPFvvE:
.LFB0:
        .cfi_startproc
        jmp     *%rdi
        .cfi_endproc
.LFE0:
        .size   _Z21doFunctionPointerCallPFvvE, .-_Z21doFunctionPointerCallPFvvE

doVirtualCall:

        .globl  _Z13doVirtualCallP1A
        .type   _Z13doVirtualCallP1A, @function
_Z13doVirtualCallP1A:
.LFB1:
        .cfi_startproc
        movq    (%rdi), %rax
        movq    16(%rax), %rax
        jmp     *%rax
        .cfi_endproc
.LFE1:
        .size   _Z13doVirtualCallP1A, .-_Z13doVirtualCallP1A
Note that here I'm using x86_64; the assembly will change for other architectures.
Looking at the assembly, the virtual call uses two extra movq instructions, presumably computing an offset into the vtable. Note that in real code, some registers would need to be saved (be it for the function pointer or the virtual call), but the virtual call would still need two extra movq over the function pointer.
Just use a profiler like AMD's CodeAnalyst (using IBS and TBS); otherwise you can go the more 'hardcore' route and give Agner Fog's optimization manuals a read (they will help both with precise instruction timings and with optimizing your code): http://www.agner.org/optimize/
Function calls are a significant overhead if the functions are small. The CALL and RETURN, while optimized on modern CPUs, will still be noticeable when many calls are made. Small functions may also be spread across memory, so the CALL/RETURN may induce cache misses and excessive paging.
// code
int Add(int a, int b) { return a + b; }

int main() {
    Add(1, Add(2, 3));
    ...
}

// NON-inline x86 ASM
Add:
    MOV eax, [esp+4]   // 1st argument a
    ADD eax, [esp+8]   // 2nd argument b
    RET 8              // return and fix stack, 2 args * 4 bytes each
                       // eax is the returned value
Main:
    PUSH 3
    PUSH 2
    CALL [Add]
    PUSH eax
    PUSH 1
    CALL [Add]
    ...

// INLINE x86 ASM
Main:
    MOV eax, 3
    ADD eax, 2
    ADD eax, 1
    ...
If optimization is your goal and you're calling many small functions, it's always best to inline them. Sorry, I don't care for the ugly ASM syntax used by c/c++ compilers.
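Regarding the static polymorphism mentioned in the question: templates (e.g., CRTP) let the compiler see the exact callee, so the call can be resolved statically and inlined, removing the indirection entirely. A sketch under invented names:

#include <cstdio>

// CRTP: the base class knows the concrete type at compile time, so
// apply() resolves statically - no vtable, no indirect call, inlinable.
template <typename Derived>
struct Op {
    int apply(int x) { return static_cast<Derived*>(this)->do_apply(x); }
};

struct AddOne : Op<AddOne> {
    int do_apply(int x) { return x + 1; }
};

int main() {
    AddOne op;
    int acc = 0;
    for (int i = 0; i < 1000000; ++i)
        acc = op.apply(acc);        // typically compiles down to a plain add
    std::printf("%d\n", acc);
    return 0;
}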
I have some constant data that I want to store in ROM since there is a fair amount of it and I'm working with a memory-constrained ARM7 embedded device. I'm trying to do this using structures that look something like this:
struct objdef
{
    int x;
    int y;
    bool (*function_ptr)(int);
    some_other_struct * const struct_array; // array of similar structures
    const void* vp; // previously omitted to shorten code
};
which I then create and initialize as globals:
const objdef def_instance = { 2, 3, function, array, NULL };
However, this eats up quite a bit of RAM despite the const at the beginning. More specifically, it significantly increases the amount of RW data and eventually causes the device to lock up if enough instances are created.
I'm using uVision and the ARM compiler, along with the RTX real-time kernel.
Does anybody know why this doesn't work, or know a better way to store structured heterogeneous data in ROM?
Update
Thank you all for your answers and my apologies for not getting back to you guys earlier. So here is the score so far and some additional observations on my part.
Sadly, __attribute__ has zero effect on RAM vs ROM and the same goes for static const. I haven't had time to try the assembly route yet.
My coworkers and I have discovered some more unusual behavior, though.
First, I must note that for the sake of simplicity I did not mention that my objdef structure contains a const void* field. The field is sometimes assigned a value from a string table defined as
char const * const string_table [ROWS][COLS] =
{
    { "row1_1", "row1_2", "row1_3" },
    { "row2_1", "row2_2", "row2_3" },
    ...
};
const objdef def_instance = { 2, 3, function, array, NULL }; //-> ROM
const objdef def_instance = { 2, 3, function, array, string_table[0][0] }; //-> RAM
string_table is in ROM as expected. And here's the kicker: instances of objdef get put in ROM until one of the values in string_table is assigned to that const void* field. After that the struct instance is moved to RAM.
But when string_table is changed to
char const string_table [ROWS][COLS][MAX_CHARS] =
{
    { "row1_1", "row1_2", "row1_3" },
    { "row2_1", "row2_2", "row2_3" },
    ...
};
const objdef def_instance = { 2, 3, function, array, NULL }; //-> ROM
const objdef def_instance = { 2, 3, function, array, string_table[0][0] }; //-> ROM
those instances of objdef are placed in ROM despite that const void* assignment. I have no idea why this should matter.
I'm beginning to suspect that Dan is right and that our configuration is messed up somewhere.
I assume you have a scatter file that separates your RAM and ROM sections. What you want to do is specify an attribute on your structure for the section it will be placed in, or put it in its own object file and then assign that object file to the section you want in the scatter file.
__attribute__((section("ROM"))) const objdef def_instance = { 2, 3, function, array };
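For reference, a minimal scatter-file sketch (the region names, addresses and sizes here are invented; adjust them to your part's memory map) that keeps RO data in flash and sends RW/ZI data to RAM:

LR_ROM 0x00000000 0x00080000 {      ; load region: flash
    ER_ROM 0x00000000 0x00080000 {  ; execution region: flash
        * (+RO)                     ; code and const data stay in ROM
    }
    RW_RAM 0x40000000 0x00010000 {  ; execution region: RAM
        * (+RW, +ZI)                ; initialized and zero-initialized data
    }
}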
The C "const" keyword doesn't really force the compiler to put something in the text or const section. It only allows the compiler to warn you of attempts to modify it. It's still possible to take a pointer to a const object, cast away the const, and write through it (formally undefined behavior if the object really is read-only), so const alone doesn't determine placement.
Your thinking is correct and reasonable. I've used Keil / uVision (this was v3, maybe 3 years ago?) and it always worked how you expected it to, i.e. it put const data in flash/ROM.
I'd suspect your linker configuration / script. I'll try to go back to my old work & see how I had it configured. I didn't have to add #pragma or __attribute__ directives, I just had it place .const & .text in flash/ROM. I set up the linker configuration / memory map quite a while ago, so unfortunately, my recall isn't very fresh.
(I'm a bit confused by people who are talking about casting & const pointers, etc... You didn't ask anything about that & you seem to understand how "const" works. You want to place the initialized data in flash/ROM to save RAM (not ROM->RAM copy at startup), not to mention a slight speedup at boot time, right? You're not asking if it's possible to change it or whatever...)
EDIT / UPDATE:
I just noticed the last field in your (const) struct is a some_other_struct * const (constant pointer to a some_other_struct). You might want to try making it a (constant) pointer to a constant some_other_struct [some_other_struct const * const] (assuming what it points to is indeed constant). In that case it might just work. I don't remember the specifics (see a theme here?), but this is starting to seem familiar. Even if your pointer target isn't a const item, and you can't eventually do this, try changing the struct definition & initializing it w/ a pointer to const and just see if that drops it into ROM. Even though you have it as a const pointer and it can't change once the structure is built, I seem to remember something where if the target isn't also const, the linker doesn't think it can be fully initialized at link time & defers the initialization to when the C runtime startup code is executed, incl. the ROM to RAM copy of initialized RW memory.
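Concretely, the change being suggested is just this (a sketch of the struct from the question):

struct objdef
{
    int x;
    int y;
    bool (*function_ptr)(int);
    some_other_struct const * const struct_array; // pointer AND pointee const
    const void* vp;
};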
You could always try using assembly language.
Put in the information using DATA statements and publish (make public) the starting addresses of the data.
In my experience, large Read-Only data was declared in a source file as static const. A simple global function inside the source file would return the address of the data.
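A sketch of that pattern (file and identifier names are hypothetical):

/* big_table.c */
static const int big_table[] = { 1, 2, 3 /* ... */ };

/* The only way in: returns the ROM address of the data. */
const int* get_big_table(void)
{
    return big_table;
}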
If you are doing stuff on ARM, you are probably using the ELF binary format. ELF files contain a number of sections, but constant data should find its way into the .rodata or .text sections of the ELF binary. You should be able to check this with the GNU utility readelf or the RVCT utility fromelf.
Now, assuming your symbols find themselves in the correct part of the ELF file, you need to find out how the RTX loader does its job. There is also no reason why the instances cannot share the same read-only memory, but this will depend on the loader. If the executable is stored in the ROM, it may be run in place, but it may still be loaded into RAM. This also depends on the loader.
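For example, with GNU binutils (the exact triplet prefix depends on your toolchain):

# arm-none-linux-gnueabi-readelf -S struct.elf          # list sections; compare .rodata vs .data sizes
# arm-none-linux-gnueabi-readelf -x .rodata struct.elf  # hex-dump .rodata to find your instances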
A complete example would have been best. If I take something like this:
typedef struct
{
    char a;
    char b;
} some_other_struct;

struct objdef
{
    int x;
    int y;
    const some_other_struct * struct_array;
};

typedef struct
{
    int x;
    int y;
    const some_other_struct * struct_array;
} tobjdef;

const some_other_struct def_other = {4,5};
const struct objdef def_instance = { 2, 3, &def_other};
const tobjdef tdef_instance = { 2, 3, &def_other};
unsigned int read_write = 7;
And compile it with the latest CodeSourcery Lite:
arm-none-linux-gnueabi-gcc -S struct.c
I get
        .arch armv5te
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 6
        .eabi_attribute 18, 4
        .file   "struct.c"
        .global def_other
        .section        .rodata
        .align  2
        .type   def_other, %object
        .size   def_other, 2
def_other:
        .byte   4
        .byte   5
        .global def_instance
        .align  2
        .type   def_instance, %object
        .size   def_instance, 12
def_instance:
        .word   2
        .word   3
        .word   def_other
        .global tdef_instance
        .align  2
        .type   tdef_instance, %object
        .size   tdef_instance, 12
tdef_instance:
        .word   2
        .word   3
        .word   def_other
        .global read_write
        .data
        .align  2
        .type   read_write, %object
        .size   read_write, 4
read_write:
        .word   7
        .ident  "GCC: (Sourcery G++ Lite 2010.09-50) 4.5.1"
        .section        .note.GNU-stack,"",%progbits
With the section marked as .rodata, which I would assume is desired. Then it is up to the linker script to make sure that RO data is put in ROM. And note that the read_write variable comes after the switch from .rodata to .data, which is read/write.
So, to make this a complete binary and see if it gets placed in ROM or RAM (.text or .data):
start.s

.globl _start
_start:
    b reset
    b hang
    b hang
    b hang
    b hang
    b hang
    b hang
    b hang
reset:
hang: b hang
Then
# arm-none-linux-gnueabi-gcc -c -o struct.o struct.c
# arm-none-linux-gnueabi-as -o start.o start.s
# arm-none-linux-gnueabi-ld -Ttext=0 -Tdata=0x1000 start.o struct.o -o struct.elf
# arm-none-linux-gnueabi-objdump -D struct.elf > struct.list
And we get
Disassembly of section .text:
00000000 <_start>:
0: ea000006 b 20 <reset>
4: ea000008 b 2c <hang>
8: ea000007 b 2c <hang>
c: ea000006 b 2c <hang>
10: ea000005 b 2c <hang>
14: ea000004 b 2c <hang>
18: ea000003 b 2c <hang>
1c: ea000002 b 2c <hang>
00000020 <reset>:
20: e59f0008 ldr r0, [pc, #8] ; 30 <hang+0x4>
24: e5901000 ldr r1, [r0]
28: e5801000 str r1, [r0]
0000002c <hang>:
2c: eafffffe b 2c <hang>
30: 00001000 andeq r1, r0, r0
Disassembly of section .data:
00001000 <read_write>:
1000: 00000007 andeq r0, r0, r7
Disassembly of section .rodata:
00000034 <def_other>:
34: 00000504 andeq r0, r0, r4, lsl #10
00000038 <def_instance>:
38: 00000002 andeq r0, r0, r2
3c: 00000003 andeq r0, r0, r3
40: 00000034 andeq r0, r0, r4, lsr r0
00000044 <tdef_instance>:
44: 00000002 andeq r0, r0, r2
48: 00000003 andeq r0, r0, r3
4c: 00000034 andeq r0, r0, r4, lsr r0
And that achieved the desired result: the read_write variable is in RAM, the structs are in ROM. You need to make sure the const declarations are in the right places; the compiler gives no warnings about, say, putting a const on some pointer to another structure that it cannot determine at compile time to be const. And even with all of that, getting the linker script (if you use one) to work as desired can take some effort. For example, this one seems to work:
MEMORY
{
    bob(RX)   : ORIGIN = 0x0000000, LENGTH = 0x8000
    ted(WAIL) : ORIGIN = 0x2000000, LENGTH = 0x8000
}
SECTIONS
{
    .text : { *(.text*) } > bob
    .data : { *(.data*) } > ted
}
I'm having trouble understanding the behavior of the MS VC compiler on this one. This line compiles fine, but the result I get is not what I'd expect at all:
this->Test((char *)&CS2 - (char *)&CS1 == sizeof(void *));
The CS1 and CS2 arguments are declared as follows:
myFunction(tCS1* CS1, tCS2* CS2) {...
tCS1 and tCS2 are structures containing one int and one __int64, resp.
This is meant to check the distance on the stack between my arguments CS1 and CS2, which are both pointers. When I break execution on this line and use the debugger to get the addresses of my two variables, I find that they indeed are 8 bytes away from each other (x64 platform).
However, the result of the comparison is false.
Here is the assembly code generated by the compiler:
mov rax,qword ptr [CS1]
mov rdi,qword ptr [CS2]
sub rdi,rax
(then it does the comparison using the result stored in rdi, and makes the call)
Yes, the compiler is comparing the values of my pointer arguments, rather than their addresses. I'm missing a level of indirection here, where did it go?
Of course I can't reproduce this in a test environment, and I have no clue where to look anymore.
I'm cross-compiling this bit of code on a 32-bit machine to an x64 platform (I have to); that's the only 'odd' thing about it. Any idea, any hint?
The assembly
mov rax,qword ptr [CS1]
mov rdi,qword ptr [CS2]
sub rdi,rax
indicates CS1 and CS2 are not really stack arguments, but rather some global symbols - if I wanted to produce similar results, I'd do something like this:
int* CS1 = NULL, *CS2 = NULL; /* or any other value...*/
#define CS1 *CS1
#define CS2 *CS2
Of course this is ugly code - but have you checked that you don't have such things in your code? Also, the dynamic linker might play a role in it.
And last but not least: If you attempt to write code like:
void foo()
{
    int a;
    int b;
    printf("%d", &a-&b);
}
You should be aware that this is actually undefined behaviour, as C (and C++) only permits subtracting pointers that point inside a single object (e.g. an array).
As @jpalacek and commenters observed, this is undefined behaviour and the compiler may be taking advantage of it to do whatever it likes. It is pretty strange.
This code "works" on gcc:
#include <stdio.h>

int func(int *a, int *b)
{
    return (char *)&a - (char *)&b;
}

int main(void)
{
    int a, b;
    printf("%d", func(&a, &b));
    return 0;
}
(gdb) disassemble func
Dump of assembler code for function func:
   0x080483e4 <+0>:     push   %ebp
   0x080483e5 <+1>:     mov    %esp,%ebp
=> 0x080483e7 <+3>:     lea    0x8(%ebp),%edx
   0x080483ea <+6>:     lea    0xc(%ebp),%eax
   0x080483ed <+9>:     mov    %edx,%ecx
   0x080483ef <+11>:    sub    %eax,%ecx
   0x080483f1 <+13>:    mov    %ecx,%eax
   0x080483f3 <+15>:    pop    %ebp
   0x080483f4 <+16>:    ret
End of assembler dump.
and with optimization it just knows their relative addresses:
(gdb) disassemble func
Dump of assembler code for function func:
   0x08048410 <+0>:     push   %ebp
   0x08048411 <+1>:     mov    $0xfffffffc,%eax
   0x08048416 <+6>:     mov    %esp,%ebp
   0x08048418 <+8>:     pop    %ebp
   0x08048419 <+9>:     ret
End of assembler dump.
The interesting thing is that with -O4 optimization it returns +4 and without it, it returns -4.
Why are you trying to do this anyhow? There's no guarantee in general that the arguments have any memory address: they may be passed in registers.
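For example (a sketch of the same undefined construct): on x86-64, the first integer arguments arrive in registers (rdi, rsi, ...), and taking a parameter's address merely forces the compiler to spill it into a stack slot of its own choosing, so any "distance" between parameters is an artifact of the compiler, not of the call:

#include <cstdio>

// x86-64 SysV: 'a' and 'b' arrive in edi/esi, not on the stack. Taking
// their addresses forces spills to compiler-chosen slots, so the result
// below is unspecified (and the subtraction itself is undefined behavior).
static int gap(int a, int b)
{
    return (int)((char *)&a - (char *)&b);
}

int main()
{
    std::printf("%d\n", gap(1, 2));
    return 0;
}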