I hope that this question hasn’t been asked before, but I couldn’t find any answer after googling for an hour.
I have the following problem: I do numerical mathematics and I have about 40 MB of doubles in the form of certain matrices that are compile time constants. I very frequently use these matrices throughout my program. The creation of these matrices takes a week of computation, so I do not want to compute them on the fly but use precomputed values.
What is the best way of doing this in a portable manner? Right now I have some very large CPP files, in which I create dynamically allocated matrices with the constants as initialiser lists. Something along these lines:
data.cpp:
namespace // Anonymous
{
// math::matrix uses dynamic memory internally
const math::matrix K { 3.1337e2, 3.1415926e00, /* a couple of hundred-thousand numbers */ };
}
const math::matrix& get_K() { return K; }
data.h
const math::matrix& get_K();
I am not sure if this can cause problems with too much data on the stack, due to the initialiser list. Or if this can crash some compilers. Or if this is the right way to go about things. It does seem to be working though.
I also thought about loading the matrices at program startup from an external file, but that also seems messy to me.
Thank you very much in advance!
I am not sure if this can cause problems with too much data on the stack, due to the initialiser list.
There should not be a problem, assuming the object has static storage duration and is not dynamically initialised, which should be the case if math::matrix is an aggregate.
Given that the values will be compile time constant, you might consider defining them in a header, so that all translation units can take advantage of them at compile time. How beneficial that would be depends on how the values are used.
I also thought about loading the matrices at program startup from an external file
The benefit of this approach is the added flexibility that you gain because you would not need to recompile the program when you change the data. This is particularly useful if the program is expensive to compile.
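A minimal sketch of the load-at-startup option, assuming the values are stored as raw doubles in a binary file that was written on a machine with the same endianness and floating-point format. The names load_doubles and save_doubles are hypothetical helpers, not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: writes the values as raw bytes (done once, offline).
void save_doubles(const std::string& path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(values.data()),
              static_cast<std::streamsize>(values.size() * sizeof(double)));
    if (!out) throw std::runtime_error("write failed for " + path);
}

// Hypothetical helper: reads exactly `count` raw doubles at program startup.
std::vector<double> load_doubles(const std::string& path, std::size_t count) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::vector<double> values(count);
    const auto bytes = static_cast<std::streamsize>(count * sizeof(double));
    in.read(reinterpret_cast<char*>(values.data()), bytes);
    if (in.gcount() != bytes) throw std::runtime_error("short read from " + path);
    assert(values.size() == count);
    return values;
}
```

The raw format is not portable across platforms with different byte order; a text format would trade load time for portability.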
A slightly cleaner approach:
// math::matrix uses dynamic memory internally
const math::matrix K {
#include "matrix_initial_values.h"
};
And, in the included header,
3.1337e2, 3.1415926e00, 1,2,3,4,5e6, 7.0,...
Considering your comment of "A few hundred thousand" floating-point values: 1M double values take 8,000,000 bytes, or about 7.6 MiB. Default stack limits are often smaller than that (commonly 1 MB on Windows and 8 MB on Linux, both adjustable), so this would only be a concern if the values were actually placed on the stack, which they should not be given that the object is const with static storage.
This is probably implementation-specific, but a large block of literals is typically stored as a big chunk of data alongside the code, loaded directly into the process's memory space. The identifier (K) is really just a label for that data. It doesn't exist on the stack or the heap any more than code does.
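To illustrate data that lives in the program image rather than on the stack or heap, here is a sketch that keeps the constants in a constexpr array and hands out a non-owning view. ConstMatrixView, get_K_view and the 2x3 shape are assumptions made for the example; the real math::matrix type from the question is not used here:

```cpp
#include <cassert>
#include <cstddef>

namespace {
// Placeholder values; the real table would be generated offline.
constexpr double K_data[] = {3.1337e2, 3.1415926e00, 1.0, 2.0, 3.0, 4.0};
constexpr std::size_t K_rows = 2;
constexpr std::size_t K_cols = 3;
static_assert(sizeof(K_data) / sizeof(K_data[0]) == K_rows * K_cols,
              "shape must match the number of values");
}  // namespace

// Hypothetical non-owning, row-major view over constant data.
struct ConstMatrixView {
    const double* data;
    std::size_t rows;
    std::size_t cols;

    double operator()(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// No allocation and no copy: the view only points at the static array.
ConstMatrixView get_K_view() { return {K_data, K_rows, K_cols}; }
```

Because the array is a constant-initialised object with static storage duration, no initialiser list has to be copied at startup.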
For example, I want to create a square root table using array SQRT[i] to optimize a game, but I don't know if there is a performance difference between the following initialization when accessing the value of SQRT[i]:
Hardcode array
int SQRT[]={0,1,1,1,2,2,2,2,2,3,3,.......255,255,255};
Generate value at run time
int SQRT[65536];
int main(){
for(int i=0;i<65536;i++){
SQRT[i]=sqrt(i);
}
//other code
return 0;
}
Some example of accessing them:
if(SQRT[a*a+b*b]>something)
...
I'm not clear whether the program stores or accesses a hard-coded array in a different way, and I don't know whether the compiler would optimize a hard-coded array to speed up access. Is there a performance difference between the two when accessing the array?
First, you should do the hard-coded array right:
static const int SQRT[]={0,1,1,1,2,2,2,2,2,3,3,.......255,255,255};
(Using uint8_t instead of int is probably better too, as it reduces the size of the array and makes it more cache-friendly.)
Doing this has one big advantage over the alternative: the compiler can easily check that the contents of the array cannot be changed.
Without this, the compiler has to be paranoid -- every function call might change the contents of SQRT, and every pointer could potentially be pointing into SQRT and thus any write through an int* or char* might be modifying the array. If the compiler cannot prove this doesn't happen, then that puts limits on the kinds of optimizations it can do, which in some cases could show a performance impact.
Another potential advantage is the ability to resolve things involving constants at compile-time.
If needed, you may be able to help the compiler figure things out with clever use of __restrict__.
In modern C++ you can get the best of both worlds; it should be possible (and in a reasonable way) to write code that runs at compile time to initialize SQRT as a constexpr. That would best be asked as a new question, though.
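A sketch of the compile-time variant mentioned above, assuming a C++17 compiler. The names make_sqrt_table, SQRT_TABLE and isqrt16 are made up for this example:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Builds floor(sqrt(i)) for every i in [0, 65536) during compilation.
constexpr std::array<std::uint8_t, 65536> make_sqrt_table() {
    std::array<std::uint8_t, 65536> table{};
    std::size_t idx = 0;
    for (unsigned root = 0; root < 256; ++root) {
        // All indices below (root + 1)^2 that are not yet filled map to root.
        const std::size_t next = (root + 1u) * (root + 1u);
        for (; idx < next; ++idx) {
            table[idx] = static_cast<std::uint8_t>(root);
        }
    }
    return table;
}

inline constexpr auto SQRT_TABLE = make_sqrt_table();

// Spot checks evaluated by the compiler.
static_assert(SQRT_TABLE[0] == 0, "sqrt(0)");
static_assert(SQRT_TABLE[15] == 3, "sqrt(15)");
static_assert(SQRT_TABLE[16] == 4, "sqrt(16)");

// Runtime accessor with a bounds check in debug builds.
inline std::uint8_t isqrt16(std::uint32_t v) {
    assert(v < SQRT_TABLE.size());
    return SQRT_TABLE[v];
}
```

The table is const data in the executable, so the compiler knows it cannot be modified, at the cost of a larger binary than a run-time loop would produce.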
As people said in comments:
if(SQRT[a*a+b*b]>something)
is a horrible example use-case. If that's all you need SQRT for, just square something.
As long as you can tell the compiler that SQRT doesn't alias anything, then a run-time loop will make your executable smaller, and only add a tiny amount of CPU overhead during startup. Definitely use uint8_t, not int. Loading a 32bit temporary from an 8bit memory location is no slower than from a zero-padded 32b memory location. (The extra movzx instruction on x86, instead of using a memory operand, will more than pay for itself in reduced cache pollution. RISC machines typically don't allow memory operands anyway, so you always need an instruction to load the value into a register.)
Also, sqrt is 10-21 cycle latency on Sandybridge. If you don't need it often, the int->double, sqrt, double->int chain is not much worse than an L2 cache hit. And better than going to L3 or main memory. If you need a lot of sqrt, then sure, make a LUT. The throughput will be much better, even if you're bouncing around in the table and causing L1 misses.
You could optimize the initialization by squaring instead of sqrting, with something like
#include <stdint.h>
#include <string.h>

uint8_t sqrt_lookup[65536];

void init_sqrt (void)
{
    int idx = 0;
    for (int i=0 ; i < 256 ; i++) {
        // Entries idx .. (i+1)^2 - 1 all have floor(sqrt) == i.
        // The last pass (i == 255) ends exactly at index 65535.
        int iplus1_sqr = (i+1)*(i+1);
        memset(sqrt_lookup+idx, i, iplus1_sqr-idx);
        idx = iplus1_sqr;
    }
}
You can still get the benefits of having sqrt_lookup being const (compiler knows it can't alias). Either use restrict, or lie to the compiler, so users of the table see a const array, but you actually write to it.
This might involve lying to the compiler, by having it declared extern const in most places, but not in the file that initializes it. You'd have to make sure this actually works, and doesn't create code referring to two different symbols. If you just cast away the const in the function that initializes it, you might get a segfault if the compiler placed it in rodata (or read-only bss memory if it's uninitialized, if that's possible on some platforms?)
Maybe we can avoid lying to the compiler, with:
uint8_t private_sqrt_table[65536]; // restrict is only valid on pointer types, so it cannot qualify the array itself
const uint8_t *const sqrt_lookup = private_sqrt_table;
Actually, that's just a const pointer to const data, not a guarantee that what it's pointing to can't be changed by another reference.
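One way to get a genuinely const table that is still filled at run time, without casting away const, is a function-local static initialised from a lambda. This is a C++ sketch (the discussion above is about C-style code); sqrt_table and lookup_sqrt are names invented for the example:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns a reference to a const table that is built once, on first use.
const std::array<std::uint8_t, 65536>& sqrt_table() {
    static const std::array<std::uint8_t, 65536> table = [] {
        std::array<std::uint8_t, 65536> t{};
        std::size_t idx = 0;
        for (unsigned root = 0; root < 256; ++root) {
            const std::size_t next = (root + 1u) * (root + 1u);
            // Fill the run [idx, next) with the current root.
            std::fill(t.begin() + idx, t.begin() + next,
                      static_cast<std::uint8_t>(root));
            idx = next;
        }
        return t;
    }();
    return table;
}

// Convenience accessor with a bounds check in debug builds.
std::uint8_t lookup_sqrt(std::uint32_t v) {
    assert(v < 65536u);
    return sqrt_table()[v];
}
```

The trade-off is a small "already initialised?" check on each call to sqrt_table(), which callers can avoid by keeping the returned reference.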
The access time will be the same. When you hard-code an array, the startup code that runs before main initializes it. (In an embedded system, the startup code copies read-write data, i.e. the hard-coded values, from ROM to the RAM address where the array is located; if the array is constant, it is accessed directly from ROM.)
If the for loop is used to initialize the array, then there is the overhead of calling the sqrt function.
I am a beginner programmer with some experience in C and C++ programming. I was assigned by the university to make a physics simulator, so as you might imagine there's a big emphasis on performance.
My questions are the following:
1. How many assembly instructions does an instance data member access through a pointer translate to (for example, vector->x)?
2. Is it much more than, say, another approach where you simply access the memory through a char* (at the same memory location as variable x), or is it the same?
3. Is there a big impact on performance, compiler-wise, if I use an object to access that memory location or if I just access it directly?
4. Another question on the subject: is accessing heap memory faster than accessing stack memory?
C++ is a compiled language. Accessing a memory location through a pointer is the same regardless of whether that's a pointer to an object or a char* - it's one instruction in either case. There are a couple of spots where C++ adds overhead, but it always buys you some flexibility. For example, invoking a virtual function requires an extra level of indirection. However, you would need the same indirection anyway if you were to emulate the virtual function with function pointers, or you would spend a comparable number of CPU cycles if you were to emulate it with a switch or a sequence of ifs.
In general, you should not start optimizing before you know what part of your code to optimize. Usually only a small part of your code is responsible for the bulk of the CPU time used by your program. You do not know what part to optimize until you profile your code. Almost universally it's programmer's code, not the language features of C++, that is responsible for the slowdown. The only way to know for sure is to profile.
1. On x86, a pointer access is typically one extra instruction, above and beyond what you normally need to perform the operation (e.g. y = object->x; would be one load of the address in object, one load of the value of x, and one store to y; in x86 assembler both loads and stores are mov instructions with a memory operand). Sometimes it's "zero" instructions, because the compiler can optimise away the load of the object pointer. On other architectures, it really comes down to how the architecture works: some have very limited ways of accessing memory and/or loading addresses into pointers, which makes pointer accesses more awkward.
2. Exactly the same number of instructions - this applies for all
3. As #2 - objects in themselves have no impact at all.
4. Heap memory and stack memory are the same kind of memory. One answer says that "stack memory is always in the cache", which is true if it's "near the top of the stack", where all the activity goes on. But if you have an object that was created in main, and a pointer to it is passed down through several layers of function calls and then used to access it, there is an obvious chance that this memory hasn't been touched for a long while, so there is no real difference there either. The big differences are that "heap memory has plenty of space, stack is limited", along with "when running out of heap it is possible to do limited recovery, running out of stack is an immediate end of execution [without tricks that aren't very portable]".
If you look at class as a synonym for struct in C (which aside from some details, they really are), then you will realize that class and objects are not really adding any extra "effort" to the code generated.
Of course, used correctly, C++ can make it much easier to write code where you deal with things that are "do this in a very similar way, but subtly differently". In C, you often end up with:
void drawStuff(Shape *shapes, int count)
{
    for(int i = 0; i < count; i++)
    {
        switch (shapes[i].shapeType)
        {
        case Circle:
            /* ... code to draw a circle ... */
            break;
        case Rectangle:
            /* ... code to draw a rectangle ... */
            break;
        case Square:
            /* ... */
            break;
        case Triangle:
            /* ... */
            break;
        }
    }
}
In C++, we can do this at object creation time, and your "drawStuff" becomes:
void drawStuff(std::vector<Shape*> shapes)
{
for(auto s : shapes)
{
s->Draw();
}
}
"Look Ma, no switch..." ;)
(Of course, you do need a switch or something to do the selection of which object to create, but once choice is made, assuming your objects and the surrounding architecture are well defined, everything should work "magically" like the above example).
Finally, if performance is IMPORTANT, run benchmarks, profile, and check where the code is spending its time. Don't optimise too early (but if you have strict performance criteria for something, keep an eye on it, because deciding in the last week of a project that you need to dramatically re-organise your data and code because some bad decision made performance suck is also not the best of ideas!). And don't optimise for individual instructions; look at where the time is spent, and come up with better algorithms WHERE you need to. (In the above example, using const std::vector<Shape*>& shapes will effectively pass a pointer to the shapes vector passed in, instead of copying the entire thing, which may make a difference if there are a few thousand elements in shapes.)
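For completeness, a self-contained sketch of the virtual-dispatch version with the vector passed by const reference. The concrete classes, the use of std::unique_ptr, and the std::ostream parameter are assumptions added so the example can be compiled and checked:

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual void Draw(std::ostream& out) const = 0;
};

struct Circle : Shape {
    void Draw(std::ostream& out) const override { out << "circle\n"; }
};

struct Rectangle : Shape {
    void Draw(std::ostream& out) const override { out << "rectangle\n"; }
};

// Passing by const reference avoids copying the vector on every call.
void drawStuff(const std::vector<std::unique_ptr<Shape>>& shapes,
               std::ostream& out) {
    for (const auto& s : shapes) {
        assert(s != nullptr);
        s->Draw(out);  // one indirect call replaces the switch
    }
}
```

The choice of which object to create still happens somewhere, but only once per shape rather than once per draw.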
It depends on your target architecture. A struct in C (and a class in C++) is just a block of memory containing the members in sequence. An access to such a field through a pointer means adding an offset to the pointer and loading from there. Many architectures allow a load to already specify an offset to the target address, meaning that there is no performance penalty there; but even on extreme RISC machines that don't have that, adding the offset should be so cheap that the load completely shadows it.
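The "pointer plus offset" description can be made concrete with a small sketch. Vec3 and the two reader functions are hypothetical; the second one spells out by hand what the member access does implicitly:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct Vec3 {
    double x;
    double y;
    double z;
};

// Ordinary member access through a pointer.
double read_y_member(const Vec3* v) {
    assert(v != nullptr);
    return v->y;
}

// The same access written as "base address + offset of y, then load".
double read_y_raw(const Vec3* v) {
    assert(v != nullptr);
    const char* base = reinterpret_cast<const char*>(v);
    double value;
    std::memcpy(&value, base + offsetof(Vec3, y), sizeof value);
    return value;
}
```

With optimisation enabled, compilers commonly emit the same single load for both functions, though that is worth confirming in the assembly listing for your own target.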
Stack and heap memory are really the same thing. Just different areas. Their basic access speed is therefore the same. The main difference is that the stack will most likely already be in the cache no matter what, whereas heap memory might not be if it hasn't been accessed lately.
1. Variable. On most processors, instructions are decoded internally into micro-operations, loosely similar to how Java bytecode is translated to processor-specific instructions before you run it. How many micro-operations you actually get differs between processor manufacturers and models.
2. Same as above; it depends on processor internals most of us know little about.
1+2. What you should be asking is how many clock cycles these operations take. On modern platforms the answer is the same for both: the address arithmetic is folded into the load, so it does not matter how many instructions they are. I will not go into detail here. In other words, when talking about CPU load there is no difference at all.
Here you have the tricky part. While there is no difference in how many clock cycles the instruction itself takes, it needs to have its data from memory before it can run, and this can take a HUGE amount of clock cycles. It has been claimed that even a very optimized program on an x86 processor spends at least 50% of its time waiting for memory access.
When you use stack memory you are actually doing the same thing as creating an array of structs. The instructions are not duplicated per object, only the data (plus a vtable pointer if you have virtual functions). This keeps the data contiguous, and if you are going to do sequential access, you will have optimal cache hits. When you allocate each object on the heap you create an array of pointers, and each object has its own block of memory. This memory will NOT be contiguous, and therefore sequential access will cause a lot of cache misses. And cache misses are what will really make your application slower, and should be avoided at all cost.
I do not know exactly what you are doing, but in many cases even using objects is much slower than plain arrays. An array of objects is laid out as [object1][object2] etc. If you do something like the pseudocode "for each object o { o.setX(o.getX() + 1) }", you only access one variable, so your sequential access jumps over the other variables in each object and gets more cache misses than if your X variables were laid out in their own array. And if you have code that uses all the variables in your object, separate arrays will not be slower than an array of objects; the different arrays will just be loaded into different cache blocks.
While plain arrays are faster in C++, they are MUCH faster in other languages like Java, where you should NEVER store bulk data in objects, as Java objects use more memory and are always stored on the heap. This is the most common mistake that C++ programmers make in Java, and then they complain that Java is slow. However, if they know how to write optimal C++ programs, they store data in arrays, which are as fast in Java as in C++.
What I usually do is write a class to store the data, which contains arrays. Even if you use the heap, it's just one object, which becomes as fast as using the stack. Then I have something like "class myitem { private: int pos; mydata data; public: getVar1() { return data.getVar1(pos); } }". I have not written out all of the code here; this just illustrates how I do it. Then, when I iterate through it, the iterator class does not actually return a new myitem instance for each item; it increases the pos value and returns the same object. This means you get a nice OO API while you actually only have a few objects and nicely aligned arrays. This pattern is the fastest pattern in C++, and if you don't use it in Java you will know pain.
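A minimal sketch of that pattern, with one array per field and a reusable cursor object on top. ParticleData, ParticleRef and shift_all_x are names invented for this illustration, and the two fields are placeholders:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Struct-of-arrays storage: each field is contiguous in its own array.
struct ParticleData {
    std::vector<double> x;
    std::vector<double> mass;

    explicit ParticleData(std::size_t n) : x(n, 0.0), mass(n, 1.0) {}
    std::size_t size() const { return x.size(); }
};

// Cursor that exposes an object-like API for the element at index pos_.
class ParticleRef {
public:
    ParticleRef(ParticleData& data, std::size_t pos) : data_(data), pos_(pos) {}

    bool valid() const { return pos_ < data_.size(); }
    void advance() { ++pos_; }  // move to the next element, same object

    double getX() const { assert(valid()); return data_.x[pos_]; }
    void setX(double v) { assert(valid()); data_.x[pos_] = v; }

private:
    ParticleData& data_;
    std::size_t pos_;
};

// Touches only the x array, so the mass values never enter the cache.
void shift_all_x(ParticleData& data, double delta) {
    for (ParticleRef p(data, 0); p.valid(); p.advance()) {
        p.setX(p.getX() + delta);
    }
}
```

Whether this beats an array of structs depends on the access pattern, so it is worth measuring both layouts.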
The fact that we get multiple function calls does not really matter. Small accessor functions like these are normally inlined by an optimizing compiler, and for the calls that remain, modern processors have branch prediction, which hides most of their cost.
Also, even if all the calls did run, each would take far fewer clock cycles than the memory access it requires, which, as I pointed out, makes memory layout the only issue that should bother you.
So I can fix this manually so it isn't an urgent question but I thought it was really strange:
Here is the entirety of my code before the weird thing that happens:
int main(int argc, char** arg) {
int memory[100];
int loadCounter = 0;
bool getInput = true;
print_memory(memory);
and then some other unrelated stuff.
print_memory just prints the array, which should've been initialized to all zeros, but instead the first few numbers are:
+1606636544 +32767 +1606418432 +32767 +1856227894 +1212071026 +1790564758 +813168429 +0000 +0000
(the plus and the filler zeros are just for formatting since all the numbers are supposed to be from 0-1000 once the array is filled. The rest of the list is zeros)
It also isn't a memory leak, because I tried initializing a different array variable and on the first run it also gave me a ton of weird numbers. Why is this happening?
Since you asked "What do C++ arrays init to?", the answer is they init to whatever happens to be in the memory they have been allocated at the time they come into scope.
I.e. they are not initialized.
Do note that some compilers will initialize stack variables to zero in debug builds; this can lead to nasty, randomly occurring issues once you start doing release builds.
The array you are using is stack allocated:
int memory[100];
When the particular function scope exits (In this case main) or returns, the memory will be reclaimed and it will not leak. This is how stack allocated memory works. In this case you allocated 100 integers (32 bits each on my compiler) on the stack as opposed to on the heap. A heap allocation is just somewhere else in memory hopefully far far away from the stack. Anyways, heap allocated memory has a chance for leaking. Low level Plain Old Data allocated on the stack (like you wrote in your code) won't leak.
The reason you got random values in your function was probably because you didn't initialize the data in the 'memory' array of integers. In release mode the application or the C runtime (in windows at least) will not take care of initializing that memory to a known base value. So the memory that is in the array is memory left over from last time the stack was using that memory. It could be a few milli-seconds old (most likely) to a few seconds old (less likely) to a few minutes old (way less likely). Anyways, it's considered garbage memory and it's to be avoided at all costs.
The problem is we don't know what is in your function called print_memory. But if that function doesn't alter the memory in any way, then that would explain why you are getting seemingly random values. You need to initialize those values to something first before using them. I like to declare my stack based buffers like this:
int memory[100] = {0};
That's a shortcut that makes the compiler fill the entire array with zeros.
It works for strings and any other basic data type too:
char MyName[100] = {0};
float NoMoney[100] = {0};
Not sure what compiler you are using, but if you are using a Microsoft compiler with Visual Studio you should be just fine.
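One detail of the = {0} shortcut is easy to misread: the braces list only the leading elements, and every element that is not listed is set to zero. The sketch below, with function names made up for the example, shows the difference between {}, {1} and an explicit fill:

```cpp
#include <array>
#include <cassert>
#include <iterator>
#include <numeric>

// Empty braces: every element is zero.
int sum_empty_braces() {
    int memory[100] = {};
    assert(memory[99] == 0);
    return std::accumulate(std::begin(memory), std::end(memory), 0);
}

// {1} sets only memory[0] to 1; the remaining 99 elements are zero.
int sum_single_one() {
    int memory[100] = {1};
    return std::accumulate(std::begin(memory), std::end(memory), 0);
}

// To give every element a non-zero value, fill the array explicitly.
int sum_filled_with_ones() {
    std::array<int, 100> memory;
    memory.fill(1);
    return std::accumulate(memory.begin(), memory.end(), 0);
}
```

So = {0} zeroes the array because the unlisted elements default to zero, not because the 0 is repeated.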
In addition to other answers, consider this: What is an array?
In managed languages, such as Java or C#, you work with high-level abstractions. C and C++ don't provide abstractions (I mean hardware abstractions, not language abstractions like OO features). They are designed to work close to the metal; that is, the language uses the hardware (memory, in this case) directly, without abstractions.
That means when you declare a local variable, int a for example, what the compiler does is say "OK, I'm going to interpret the chunk of memory [A,A + sizeof(int)] as an integer, which I call 'a'" (where A is the offset between the beginning of that chunk and the start address of the function's stack frame).
As you can see, the compiler only "assigns" memory-segments to variables. It does not do any "magic", like "creating" variables. You have to understand that your code is executed in a machine, and the machine has only a memory and a CPU. There is no magic.
So what is the value of a variable when the function execution starts? Whatever value is represented by the data already in the variable's chunk of memory. Commonly, that data makes no sense from our current point of view (it could be part of the data previously used by a string, for example), so when you access that variable you get strange values. That's what we call "garbage": data written previously which makes no sense in our context.
The same applies to an array: An array is only a bigger chunk of memory, with enough space to fit all the values of the array: [A,A + (length of the array)*sizeof(type of array elements)]. So as in the variable case, the memory contains garbage.
Commonly you want to initialize an array with a set of values during its declaration. You could achieve that using an initialiser list:
int array[] = {1,2,3,4};
In that case, the compiler adds code to the function to initialize the array's chunk of memory with those values.
Sidenote: Non-POD types and static storage
The things explained above only apply to POD types such as basic types and arrays of basic types. With non-POD types like classes, the compiler adds calls to the constructors of the variables, which are designed to initialise the values (attributes) of a class instance.
In addition, even if you use POD types, if variables have static storage duration, their memory is zero-initialized by default, because static variables are set up at program start.
A local variable on the stack is not initialized in C/C++. C/C++ is designed to be fast, so it doesn't zero the stack on function calls.
Before main() runs, the language runtime sets up the environment. Exactly what it's doing you'd have to discover by breaking at the load module's entry point and watching the stack pointer, but at any rate your stack space on entering main is not guaranteed clean.
Anything that needs clean stack or malloc or new space gets to clean it itself. Plenty of things don't. C[++] isn't in the business of doing unnecessary things. In C++ a class object can have non-trivial constructors that run implicitly, and those guarantee the object is set up for use. But arrays and plain scalars don't have constructors, so if you want an initial value you have to declare an initializer.
Suppose I have a function in a single threaded program that looks like this
void f(some arguments){
char buffer[32];
some operations on buffer;
}
and f appears inside some loop that gets called often, so I'd like to make it as fast as possible. It looks to me like the buffer needs to get allocated every time f is called, but if I declare it to be static, this wouldn't happen. Is that correct reasoning? Is that a free speed up? And just because of that fact (that it's an easy speed up), does an optimizing compiler already do something like this for me?
No, it's not a free speedup.
First, the allocation is almost free to begin with (it consists merely of adjusting the stack pointer by 32 bytes), and second, there are at least two reasons why a static variable might be slower:
1. You lose cache locality. Data allocated on the stack is likely to be in the CPU cache already, so accessing it is extremely cheap. Static data is allocated in a different area of memory, so it may not be cached; in that case it will cause a cache miss, and you'll have to wait hundreds of clock cycles for the data to be fetched from main memory.
2. You lose thread safety. If two threads execute the function simultaneously, it'll crash and burn, unless a lock is placed so only one thread at a time is allowed to execute that section of the code. And that would mean you'd lose the benefit of having multiple CPU cores.
So it's not a free speedup. But it is possible that it is faster in your case (although I doubt it).
So try it out, benchmark it, and see what works best in your particular scenario.
Adjusting the stack pointer by 32 bytes will cost virtually nothing on nearly all systems. But you should test it out. Benchmark a static version and a local version and post back.
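A rough sketch of such a benchmark, assuming std::chrono::steady_clock is precise enough for the iteration count you choose. The functions fill_local, fill_static and seconds_for are hypothetical, and the memset merely stands in for "some operations on buffer":

```cpp
#include <cassert>
#include <chrono>
#include <cstring>

// Variant with an automatic (stack) buffer.
unsigned fill_local(unsigned seed) {
    char buffer[32];
    std::memset(buffer, static_cast<int>(seed & 0xFFu), sizeof buffer);
    return static_cast<unsigned char>(buffer[seed % 32u]);
}

// Same work, but the buffer has static storage duration.
unsigned fill_static(unsigned seed) {
    static char buffer[32];
    std::memset(buffer, static_cast<int>(seed & 0xFFu), sizeof buffer);
    return static_cast<unsigned char>(buffer[seed % 32u]);
}

// Times `iterations` calls of f. The volatile sink forces each result to be
// stored, so the loop itself cannot be removed.
template <typename F>
double seconds_for(F f, unsigned iterations) {
    assert(iterations > 0);
    volatile unsigned sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < iterations; ++i) {
        sink = sink + f(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling seconds_for(fill_local, n) and seconds_for(fill_static, n) several times with a large n gives comparable numbers. An optimiser may still simplify a body this small, so inspect the generated assembly before trusting the result.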
For implementations that use a stack for local variables, allocation often involves nothing more than adjusting a register such as the Stack Pointer (SP). The time this takes is negligible, usually one instruction or less.
However, initialization of stack variables takes a little longer, but again, not much. Check out your assembly language listing (generated by compiler or debugger) for exact details. There is nothing in the standard about the duration or number of instructions required to initialize variables.
Allocation of static local variables is usually treated differently. A common approach is to place these variables in the same area as global variables. Usually all the variables in this area are initialized before calling main(). Allocation in this case is a matter of assigning addresses to registers or storing the area information in memory. Not much execution time wasted here.
Dynamic allocation is the case where execution cycles are burned. But that is not in the scope of your question.
The way it is written now, there is no cost for allocation: the 32 bytes are on the stack. The only real work would be zero-initializing the buffer, if you need that.
Local statics are not a good idea here. It won't be faster, and your function can't be used from multiple threads anymore, as all calls share the same buffer. Not to mention that, before C++11, initialization of local statics was not guaranteed to be thread safe.
I would suggest that a more general approach to this problem is that if you have a function called many times that needs some local variables, then consider wrapping it in a class and making these variables member variables. Consider if you needed to make the size dynamic, so instead of char buffer[32] you had std::vector<char> buffer(requiredSize). This is more expensive than an array to initialise every time through the loop.
class BufferMunger {
public:
BufferMunger() {};
void DoFunction(args);
private:
char buffer[32];
};
BufferMunger m;
for (int i=0; i<1000; i++) {
m.DoFunction(arg[i]); // only one allocation of buffer
}
There's another implication of making the buffer static, which is that the function is now unsafe in a multithreaded application, as two threads may call it and overwrite the data in the buffer at the same time. On the other hand it's safe to use a separate BufferMunger in each thread that requires it.
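For the dynamically sized case, the same idea can be sketched with a std::vector member that is allocated once in the constructor and reused by every call. VectorMunger and its copy-into-buffer operation are assumptions chosen to keep the example short:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class VectorMunger {
public:
    // The only allocation happens here, once per object.
    explicit VectorMunger(std::size_t size) : buffer_(size) {}

    // Copies as much of `input` as fits into the reused buffer and
    // returns the number of bytes copied.
    std::size_t DoFunction(const std::string& input) {
        const std::size_t n = std::min(input.size(), buffer_.size());
        assert(n <= buffer_.size());
        std::copy_n(input.begin(), n, buffer_.begin());
        return n;
    }

    const char* data() const { return buffer_.data(); }

private:
    std::vector<char> buffer_;
};
```

Each thread can own its own VectorMunger, which keeps the reuse without sharing state between threads.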
Note that block-level static variables in C++ (as opposed to C) are initialized on first use. This implies that you'll be introducing the cost of an extra runtime check. The branch potentially could end up making performance worse, not better. (But really, you should profile, as others have mentioned.)
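The first-use behaviour can be observed directly with a counter. In this sketch, g_init_calls, expensive_init and get_value are hypothetical names; the static is given a non-constant initialiser so that it needs the run-time check described above:

```cpp
#include <cassert>

int g_init_calls = 0;  // counts how often the initialiser runs

int expensive_init() {
    ++g_init_calls;
    return 42;
}

int get_value() {
    // Initialised the first time control passes through this line;
    // later calls only test the hidden "already initialised" guard.
    static const int value = expensive_init();
    assert(g_init_calls >= 1);
    return value;
}
```

A static char buffer[32] with no initialiser needs no such guard, since it is zero-initialised before the program starts running.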
Regardless, I don't think it's worth it, especially since you'd be intentionally sacrificing re-entrancy.
If you are writing code for a PC, there is unlikely to be any meaningful speed advantage either way. On some embedded systems, it may be advantageous to avoid all local variables. On some other systems, local variables may be faster.
An example of the former: on the Z80, the code to set up the stack frame for a function with any local variables was pretty long. Further, the code to access local variables was limited to using the (IX+d) addressing mode, which was only available for 8-bit instructions. If X and Y were both global/static or both local variables, the statement "X=Y" could assemble as either:
; If both are static or global: 6 bytes; 32 cycles
ld HL,(_Y) ; 16 cycles
ld (_X),HL ; 16 cycles
; If both are local: 12 bytes; 56 cycles
ld E,(IX+_Y) ; 14 cycles
ld D,(IX+_Y+1) ; 14 cycles
ld (IX+_X),E ; 14 cycles
ld (IX+_X+1),D ; 14 cycles
A 100% code space penalty and 75% time penalty in addition to the code and time to set up the stack frame!
On the ARM processor, a single instruction can load a variable which is located within +/-2K of an address pointer. If a function's local variables total 2K or less, they may be accessed with a single instruction. Global variables will generally require two or more instructions to load, depending upon where they are stored.
With gcc, I do see some speedup:
void f() {
char buffer[4096];
}
int main() {
int i;
for (i = 0; i < 100000000; ++i) {
f();
}
}
And the time:
$ time ./a.out
real 0m0.453s
user 0m0.450s
sys 0m0.010s
changing buffer to static:
$ time ./a.out
real 0m0.352s
user 0m0.360s
sys 0m0.000s
Depending on what exactly the variable is doing and how it's used, the speedup is somewhere between almost nothing and nothing. On x86 systems, stack memory for all local variables is allocated at once with a single instruction (sub esp, amount), so having just one other stack variable eliminates any gain. The only exception is really huge buffers, in which case a compiler might insert a call to _chkstk to allocate the memory (but if your buffer is that big you should re-evaluate your code). The compiler cannot turn stack memory into static memory as an optimization, because it cannot assume that the function will only be used in a single-threaded environment; it would also interfere with object constructors and destructors.
If there are any local automatic variables in the function at all, the stack pointer needs to be adjusted. The time taken for the adjustment is constant, and will not vary based on the number of variables declared. You might save some time if your function is left with no local automatic variables whatsoever.
If a static variable is initialized, there will be a flag somewhere to determine if the variable has already been initialized. Checking the flag will take some time. In your example the variable is not initialized, so this part can be ignored.
Static variables should be avoided if your function has any chance of being called recursively or from two different threads.
It will make the function substantially slower in most real cases. This is because the static data segment is not near the stack and you will lose cache locality, so you are likely to get a cache miss when you try to access it. However, when you allocate a regular char[32] on the stack, it is right next to all your other needed data and costs very little to access. The initialization costs of a stack-based array of char are meaningless.
This is ignoring that statics have many other problems.
You really need to actually profile your code and see where the slowdowns are, because no profiler will tell you that allocating a statically-sized buffer of characters is a performance problem.