Is program runtime affected by where the objects reside in the memory? - c++

My program allocates all of its resources, slightly below 1 MB in total, at startup and nothing more afterwards, except primitive local variables. The allocation originally took place with malloc, so on the heap, but I wondered whether there would be any difference if I put them on the stack.
In various tests with program runtimes from 3 seconds to 3 minutes, accessing the stack consistently appears to be up to 10% faster. All I changed was whether to malloc the structs or to declare them as automatic variables.
Another interesting fact I found is that when I declare the objects as static, the program runs 20~30% slower. I have no idea why. I double-checked whether I had made a mistake, but the only difference really is whether static is there or not. Do static variables go somewhere else in memory than automatic variables?
Before that I had quite the opposite experience: in a C++ class, when I changed a const member array from non-static to static, the program ran faster. Memory consumption was the same because there was only one instance of that object.
Is program runtime affected by where the objects reside in the memory? Even if so, can't the compiler manage to place the objects in the right place for maximum efficiency?

Well, yeah, program performance is affected by where objects reside in memory.
The problem is, unless you have intimate knowledge of how your compiler works and how it uses the features of your particular host system (operating system services, hardware, processor cache, etc.), and how those things are configured, you will not be able to exploit it consistently. Even if you do succeed, small changes (e.g. upgrading a compiler, changing optimisation settings, a change of process quotas, a change in the amount of physical memory, changing a hard drive [e.g. one that is used for swap space]) can affect performance, and it won't always be easy to predict whether a change will improve or degrade things. Performance is sensitive to all these factors - and to the interactions between them - in ways that are not always obvious without close analysis.

Is program runtime affected by where the objects reside in the memory?
Yes, program performance will be affected by the location of an object in memory, among other factors.
Whenever an object on the heap is accessed, it's done by dereferencing a pointer. Dereferencing pointers requires additional computation to find the next memory address, and every additional pointer between you and your data (e.g. pointer -> pointer -> actual data) makes it slightly worse. In addition, it increases the cache miss rate for the CPU, because the data it expects to access next is not actually in contiguous memory blocks. In other words, the assumption the CPU made to try to optimize its pipeline turns out to be false, and it pays the penalty for an incorrect prediction.
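To illustrate (a toy sketch; actual timings depend on the allocator, the data set size and the cache), compare iterating a contiguous array with walking a linked list, which dereferences a pointer per element:

```cpp
#include <cstddef>
#include <vector>

// Contiguous layout: the next element is adjacent in memory,
// so the hardware prefetcher can keep the cache warm.
long sum_contiguous(const std::vector<long>& v) {
    long total = 0;
    for (long x : v) total += x;
    return total;
}

// Pointer-chasing layout: each node lives wherever the allocator
// put it, so every step may be a cache miss.
struct Node { long value; Node* next; };

long sum_list(const Node* head) {
    long total = 0;
    for (const Node* n = head; n != nullptr; n = n->next)
        total += n->value;
    return total;
}
```

Both functions compute the same result; on large data sets the contiguous version typically wins because of cache behaviour, not because the arithmetic differs.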
Even if so, can't the compiler manage to place the objects in the right place for maximum efficiency?
C and C++ compilers will place objects wherever you tell them to. If you're using malloc, you shouldn't expect the compiler to put things on the stack, and vice versa.
The compiler couldn't do this even in principle, because malloc allocates memory dynamically at runtime, long after the compiler's job has ended. What is known at compile time is the size and layout of each stack frame; how the contents of dynamically allocated memory are organized depends entirely on you.
You might be able to get some benefits from compiler optimization settings, but most of your benefits in optimization efforts will be better spent improving data structures and/or algorithms being used.

Related

Understanding Memory Pools

To my understanding, a memory pool is a block, or multiple blocks, of memory allocated on the stack before runtime.
By contrast, to my understanding, dynamic memory is requested from the operating system and then allocated on the heap during run time.
// EDIT //
Memory pools are evidently not necessarily allocated on the stack, i.e. a memory pool can be used with dynamic memory.
Non-dynamic memory is evidently also not necessarily allocated on the stack, as per the answer to this question.
The topics of 'dynamic vs. static memory' and 'memory pools' are thus not really related, although the answer is still relevant.
From what I can tell, the purpose of a memory pool is to provide manual management of RAM, where the memory must be tracked and reused by the programmer.
This is theoretically advantageous for performance for a number of reasons:
1. Dynamic memory becomes fragmented over time.
2. The CPU can parse static blocks of memory faster than dynamic blocks.
3. When the programmer has control over memory, they can choose to free and rebuild data when it is best to do so, according to the specific program.
4. When multithreading, separate pools allow separate threads to operate independently, without waiting for the shared heap (Davislor).
Is my understanding of memory pools correct? If so, why does it seem like memory pools are not used very often?
It seems this question is fraught with the XY problem and premature optimisation.
You should focus on writing legible code, then using a profiler to perform optimisations if necessary.
Is my understanding of memory pools correct?
Not quite.
... on the stack ...
... on the heap ...
Storage duration is orthogonal to the concept of pools; pools can be allocated to have any of the four storage durations (they are: static, thread, automatic and dynamic storage duration).
The C++ standard doesn't require that any of these go into a stack or a heap; it might be useful to think of all of them as though they go into the same place... after all, they all (commonly) go onto silicon chips!
... allocate ... before runtime ...
What matters is that the allocation of multiple objects occurs before (or at least less often than) those objects are first used; this saves having to allocate each object separately. I assume this is what you meant by "before runtime". When choosing the size of the allocation, the closer you get to the total number of objects required at any given time the less waste from excessive allocation and the less waste from excessive resizing.
If your OS isn't prehistoric, however, the advantages of pools will quickly diminish. You'd probably see this if you used a profiler before and after conducting your optimisation!
Dynamic memory becomes fragmented over time
This may be true for a naive operating system such as Windows 1.0. However, in this day and age objects with dynamic storage duration are commonly stored in virtual memory, which periodically gets written to and read back from disk (this is called paging). As a consequence, fragmented memory can be defragmented, and objects, functions and methods that are commonly used together might even end up united in common pages.
That is, paging forms an implicit pool (and cache prediction) for you!
The CPU can parse static blocks of memory faster than dynamic blocks
While objects with automatic storage duration are commonly located on the stack, and objects with static storage duration in a fixed data segment, neither placement is mandated by the C++ standard. It's entirely possible for a C++ implementation to exist whereby "static" blocks of memory are allocated on the heap instead.
A cache hit on a dynamic object will be just as fast as a cache hit on a static object. It just so happens that the stack is commonly kept in cache; you should try programming without the stack some time, and you might find that the cache has more room for the heap!
BEFORE you optimise you should ALWAYS use a profiler to measure the most significant bottleneck! Then you should perform the optimisation, and then run the profiler again to make sure the optimisation was a success!
This is not a machine-independent process! You need to optimise per-implementation! An optimisation for one implementation is likely a pessimisation for another.
If so, why does it seem like memory pools are not used very often?
The virtual memory abstraction described above, in conjunction with eliminating guess-work using cache profilers virtually eliminates the usefulness of pools in all but the least-informed (i.e. use a profiler) scenarios.
A customized allocator can help performance since the default allocator is optimized for a specific use case, which is infrequently allocating large chunks of memory.
But in, say, a simulator or a game, you may have a lot happening in one frame, with memory allocated and freed very frequently. In this case the default allocator is not as good.
A simple solution can be to allocate a block of memory for all the throwaway stuff happening during a frame. This block of memory can be overwritten over and over again, and deletion can be deferred to a later time, e.g. the end of a game level.
Memory pools are used to implement custom allocators.
A commonly used one is the linear allocator. It keeps only a pointer separating allocated from free memory. Allocating with it is just a matter of incrementing the pointer by the N bytes requested and returning its previous value; deallocation is done by resetting the pointer to the start of the pool.
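A minimal sketch of such a linear (bump) allocator, ignoring per-type alignment subtleties (everything here is aligned to alignof(std::max_align_t); the class name is illustrative):

```cpp
#include <cstddef>

// Minimal linear ("bump") allocator over a caller-provided buffer.
// Real implementations also handle alignment per requested type.
class LinearAllocator {
public:
    LinearAllocator(void* buffer, std::size_t size)
        : base_(static_cast<char*>(buffer)), size_(size), offset_(0) {}

    void* allocate(std::size_t n) {
        const std::size_t align = alignof(std::max_align_t);
        std::size_t aligned = (offset_ + align - 1) / align * align;
        if (aligned + n > size_) return nullptr;   // pool exhausted
        offset_ = aligned + n;
        return base_ + aligned;
    }

    // Deallocation is wholesale: reset the pointer to the start.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    char* base_;
    std::size_t size_;
    std::size_t offset_;
};
```

A per-frame allocator for a game is then just a LinearAllocator that gets reset() once per frame.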

access speed to static and heap memory

There are a lot of topics about access speed to stack and heap variables, but I couldn't find question and proper answer about access speed for static and heap variables.
What should I prefer (in terms of access speed) if object lifetime is the same as program lifetime? Which is faster - using statically-allocated object or object in a heap?
I'm talking about C++, if relevant.
There is no difference. Absolutely none. Once your program is loaded, the CPU simply does not know what sort of memory (heap or static) it is dealing with.
The above statement is true for the vast majority of common CPU architectures and implementations. Some computers, though, may have different areas of memory that work at different speeds; if this is the case, you need to check. How this special memory is mapped depends on the particular platform/configuration.
Depending on the compiler/environment, programs with big static areas may load somewhat slower. But this is not an absolute rule.
It would be better to think about the locality of your data (do your pieces of data stay close to each other or not) and how one value will kick another value out of the CPU cache. Loading something into cache is 10-100 times slower than accessing something that is already in the cache. This makes a VERY noticeable difference.

Allocating on heap vs allocating on stack in recursive functions

When I'm defining a recursive function, is it better/safer to allocate local variables on the heap, and then clean them up before the function returns, than to allocate them on the stack? The stack size on embedded systems is very limited, and there is a danger of stack overflow when the recursion runs too deep.
The answer depends on your application domain and the specifics of your platform. "Embedded system" is a fuzzy term. A big oscilloscope running MS Windows or Linux is on one end of the spectrum. Most of it can be programmed like a normal PC. They should not fail, but if they do, one simply reboots them.
On the other end of the spectrum are controllers in safety-critical fields. Consider these Siemens switches which must react under all circumstances within milliseconds. Failure is not an option. They probably do not run Windows. Available resources are limited. Programming rules and procedures are much different here.
Now let's examine the options you have, recursion (or not!) with dynamic or automatic memory allocation.
Performance-wise the stack is much faster, so it is preferred. Dynamic allocation also involves some memory overhead for bookkeeping, which is significant if the data units are small. And there can be issues with memory fragmentation, which cannot occur with automatic memory (although the scenarios which lead to fragmentation -- different lifetimes of objects -- are probably not directly solvable without dynamic allocation).
But it is true that on some systems the stack size is (much) smaller than the heap size; you must read about the memory layout on your system. The oscilloscope will have lots of memory and big stacks; the power switch will not.
If you are concerned about running out of memory I would follow Christian's advice to avoid recursion altogether and instead use a loop. Iteration may just keep the memory usage flat. Besides, recursion always uses stack space, e.g. for the return address and -value. The idea to allocate "local" variables dynamically would only make sense for larger data structures because you would still have to hold pointers to the data as automatic variables, which use up space on the stack, too (and would increase overall memory foot print).
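As a toy sketch of that iteration advice (not from the original answer): the recursive form consumes a stack frame per level of depth, while the loop keeps memory use flat.

```cpp
#include <cstdint>

// Recursive form: each call consumes a stack frame (return
// address, saved registers), so depth n costs O(n) stack space.
std::uint64_t sum_to_rec(std::uint32_t n) {
    return n == 0 ? 0 : n + sum_to_rec(n - 1);
}

// Iterative form: memory use is flat regardless of n.
std::uint64_t sum_to_iter(std::uint32_t n) {
    std::uint64_t total = 0;
    for (std::uint32_t i = 1; i <= n; ++i) total += i;
    return total;
}
```

With a large n the recursive version can overflow a small embedded stack; the loop cannot.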
In general, on systems with limited resources it is important to limit the maximum resource use of your program. Time is also a resource; dynamic allocation makes real-time guarantees nearly impossible.
The application domain dictates the safety requirements. In safety-critical fields (pacemaker!) the program must not fail at all. That ideal is impossible to achieve unless the program is trivial, but great efforts are made to come close. In other fields a program may fail in a defined way under circumstances it cannot handle, but it must not fail silently or in undefined ways (for example, by overwriting data). For example, instead of allocating unknown amounts of data dynamically one would just have a pre-defined array of fixed size for data and use the elements in the array instead, with bound checks.
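A minimal sketch of that pre-defined-array-with-bounds-checks pattern (the class name and capacity are illustrative, not from any particular safety standard):

```cpp
#include <cstddef>

// Fixed capacity decided at compile time; every insert is
// bounds-checked, so the program fails in a defined way
// (push returns false) instead of exhausting memory.
template <typename T, std::size_t Capacity>
class FixedBuffer {
public:
    bool push(const T& value) {
        if (count_ >= Capacity) return false;  // defined failure
        items_[count_++] = value;
        return true;
    }

    std::size_t size() const { return count_; }

private:
    T items_[Capacity];
    std::size_t count_ = 0;
};
```

The caller must handle the false return, but that is exactly the point: the failure mode is explicit and testable.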
When I'm defining a recursive function, is it better/safer to allocate local variables on heap, and then clean them up before the function returns, than to allocate them on stack.
You have both the C and C++ tags. It's a valid question for both, but I can only comment on C++.
In C++, it's better to use the heap even though it's slightly less efficient. That's because new can fail if you run out of heap memory, and in the case of failure new will throw an exception. Running out of stack space, however, does not cause an exception, and that's one of the reasons alloca is frowned upon in C++.
I think generally speaking you should avoid recursion in embedded systems altogether, as every time the function is called the return address is pushed to the stack. This could cause an unexpected overflow. Try switching to a loop.
Back to your question, though: mallocing will be slower but safer. If there is no heap space, malloc will return an error (a null pointer) and you can safely clean up the memory. This comes at a large cost in speed, as malloc can be quite slow.
If you know how many iterations you expect ahead of time, you have the option of mallocing an array of the variables that you need. That way you will only malloc once which saves time, and won't risk unexpectedly filling up the heap or stack. Also you will only be left with one variable to free.
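For illustration, a sketch of that allocate-once approach (the struct, sizes and function names here are hypothetical):

```cpp
#include <cstdlib>

// If the iteration count is known up front, allocate one
// workspace array instead of a per-call buffer. One malloc,
// one free, and no risk of unexpectedly growing the stack.
struct Scratch { double data[16]; };

int process_all(std::size_t iterations) {
    Scratch* work = static_cast<Scratch*>(
        std::malloc(iterations * sizeof(Scratch)));
    if (work == nullptr) return -1;   // heap exhausted: fail safely

    for (std::size_t i = 0; i < iterations; ++i) {
        work[i].data[0] = static_cast<double>(i);  // use slot i
    }

    std::free(work);                  // only one variable to free
    return 0;
}
```

The error path is also simpler: a single null check replaces per-iteration failure handling.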

C program with minimum RAM

I want to understand memory management in C and C++ programming for application development. The application will run on a PC.
If I want to make a program which uses as little RAM as possible while running, what are the points I need to consider while programming?
Here are two points according to what I understand, but I am not sure:
(1) Use a minimum of local variables in main() and other functions.
As local variables are saved on the stack, which is in RAM?
(2) Instead of local variables, use global variables at the top.
As global variables are saved in the uninitialized and initialized ROM area?
Thanks.
1) Generally the alternative to allocating on the stack is allocating on the heap (e.g., with malloc) which actually has a greater overhead due to bookkeeping/etc, and the stack already has memory reserved for it, so allocating on the stack where possible is often preferable. On the other hand there is less space on the stack while the heap can be close to “unlimited” on modern systems with virtual memory and 64-bit address space.
2) On PCs and other non-embedded system, everything in your program goes in RAM, i.e., it is not flashed to a ROM-like memory, so global versus local does not help in that regard. Also globals† tend to “live” as long as the application is running, while locals can be allocated and freed (either on the stack or heap) as required, and are thus preferable.
 
† More accurately, there can also be local variables with static duration, and variables with global scope that are pointers to dynamically allocated memory, so the terms local and global are used quite loosely here.
In general, modern desktop/laptop and even mobile operating systems are quite good at managing memory, so you probably shouldn't be trying to micro-optimize everything as you may actually do more harm than good.
If you really do need to bring down the memory footprint of your program, you must realize that everything in the program is stored in RAM, and so you need to work on reducing the number and size of the things you have, rather than trying to juggle their location. The other place where you can store things locally on a PC is the hard drive, so store large resources there and only load them as required (preferably only exactly the parts required). But remember that disk access is orders of magnitude slower than memory access, and that the operating system can also swap things out to disk if its memory gets full.
The program code itself is also stored in RAM, so have your compiler optimize for size (-Os or /Os option in many common compilers). Also remember that if you save a bit of space in variables by writing more complex code, the effort may be undone by the increased code size; save your optimizations for big wins (e.g., compressing large resources will require the added decompression code, but may still yield a large net win). Use of dynamically linked libraries (and other resources) also helps the overall memory footprint of the system if the same library is used by multiple programs running at the same time.
(Note that some of the above does not apply in embedded development, e.g., code and static constants may be indeed be stored in flash instead of RAM, etc.)
This is difficult because on your PC, the program will be running out of RAM unless you can somehow execute it out of a ROM or Flash.
Here are the points to consider:
Reduce your code size.
Code takes up RAM.
Reduce variable quantity and size.
Variables need to live somewhere and that somewhere is in RAM.
Reduce character literals.
They, too, take up space.
Reduce function call nesting.
A function may require parameters, which are placed in RAM.
A function that calls other functions needs a return path; the path is stored in RAM.
Use RAM from other devices.
Other devices, such as the Graphics Processor and your harddrive adaptor card, may have RAM you can use. If you use this RAM, you're not using the primary RAM.
Page memory to external device.
The OS is capable of virtual memory and can page memory out to an external device, such as a hard drive.
Edit 1 - Dynamic libraries
To reduce the RAM footprint of your program, you could reserve an area into which you swap library functions. This is similar to the DLL concept: when a function is needed, you load it from the hard drive into the reserved area.
Typically, a certain amount of space will be allocated for the stack; such space will be unavailable for other purposes whether or not it is used. If the space turns out to be inadequate, the program will die a gruesome death.
Local variables will be stored using some combination of registers and stack space. Some compilers will use the same registers or stack space for variables which are "live" at different times in a program's execution; others will not. Further, function arguments are typically pushed on the stack before calling a function and removed at the caller's convenience. In evaluating the code sequence:
function1(1,2,3,4,5);
function2(6,7,8,9,10);
the arguments for the first function will be pushed on the stack and that function will be called. At that point the compiler could remove those five values from the stack, but since a single instruction can remove any number of pushed values, many compilers will push the arguments for the second function (leaving the arguments of the first on the stack), call the second function, and then use one instruction to eliminate all ten. Normally this would be a non-issue, but in some deeply-nested recursive scenarios it could potentially be a problem.
Unless the "PC" you're developing for is tiny by today's standards, I wouldn't worry too much about trying to micro-optimize RAM usage. I've developed code for microcontrollers with only 25 bytes of RAM, and even written full-fledged games for a microprocessor-based console with a whopping 128 bytes (not kilobytes!) of RAM, and on such systems it makes sense to worry about each individual byte. For PC applications, though, the only time it makes sense to worry about individual bytes is when they're part of a data structure which will get replicated many thousands of times in RAM.
You might want to get a book on "embedded" programming. Such a book will likely discuss ways to keep the memory footprint down, as embedded systems are more constrained than modern desktop or server systems.
When you use "local" variables, they are saved on the stack. As long as you don't use too much stack, this is basically free memory, as when the function exits the memory is returned. How much is "too much" varies... recently I had to work on a system where there is a limit of 8 KB of data for the stack per process.
When you use "global" variables or other static variables, the memory you use is tied up for the duration of the program. Thus you should minimize your use of globals, and/or find ways to share the same memory across multiple functions in your program.
I wrote a fairly elaborate "object manager" for a project I wrote a few years ago. A function can use the "get" operation to borrow an object, and then use the "release" operation when it is done borrowing the object. This means that all the functions in the system were able to share a relatively small amount of data space by taking turns using the shared objects. It's up to you to decide whether it is worth your time to build an "object manager" sort of thing or if you have enough memory to just use simple variables.
You can get much of the benefit of an "object manager" by simply calling malloc() and free() a lot. Then the heap allocator manages the shared resource, the heap memory, for you. The reason I wrote my own "object manager" was a need for speed. My system keeps using identical data objects, and it is way faster to just keep re-using the same ones than to keep freeing them and malloc-ing them again. Also, my system can be run on a DSP chip, and malloc() can be a surprisingly slow function on some DSP architectures.
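A minimal free-list pool in that get/release spirit might look like this (an illustrative sketch, not the author's object manager; the names are made up):

```cpp
#include <cstddef>

// Fixed set of N objects threaded on a free list. get() hands
// out an object; release() returns it for reuse. No malloc/free
// after construction, so "allocation" is a couple of pointer ops.
template <typename T, std::size_t N>
class ObjectPool {
public:
    ObjectPool() {
        for (std::size_t i = 0; i + 1 < N; ++i)
            slots_[i].next = &slots_[i + 1];
        slots_[N - 1].next = nullptr;
        free_ = &slots_[0];
    }

    // Borrow an object; returns nullptr when the pool is exhausted.
    T* get() {
        if (free_ == nullptr) return nullptr;
        Slot* s = free_;
        free_ = s->next;
        return &s->object;
    }

    // Return a borrowed object so it can be handed out again.
    void release(T* obj) {
        // Valid because 'object' is the first member of the
        // standard-layout Slot, so both share an address.
        Slot* s = reinterpret_cast<Slot*>(obj);
        s->next = free_;
        free_ = s;
    }

private:
    struct Slot { T object; Slot* next; };
    Slot slots_[N];
    Slot* free_ = nullptr;
};
```

Because released objects go straight back on the free list, reuse avoids both allocator overhead and (on platforms like the DSP mentioned above) slow malloc calls.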
Having multiple functions using the same global variables can lead you to tricky bugs, if one function tries to hold on to a global buffer while another one is overwriting the data. So your program will likely be more robust if you use malloc() and free() as long as each function only writes into data it allocated for itself. (But malloc() and free() can introduce bugs of their own: memory leaks, double-free errors, continuing to use a pointer after the data to which it points has been freed... if you use malloc() and free() be sure to use a tool such as Valgrind to check your code.)
Any variables, by definition, must be stored in read/write memory (or RAM). If you are talking about an embedded system with the code initially in ROM, then the runtime will copy the ROM image you identified into RAM to hold the values of the global variables.
Only items marked unchangeable (const) may be kept in the ROM during runtime.
Further, you need to reduce the depth of the calling structure of the program as each function call requires stack space (also in RAM) to record the return address and other values.
To minimise the use of memory, you can try to flag local variables with the register keyword, but this may not be honoured by your compiler.
Another common technique is to dynamically generate large variable data on the fly whenever it is required, to avoid having to keep buffers around. Such buffers usually take up much more space than simple variables.
If this is a PC, then by default you will be given a stack of a certain size (you can make it bigger or smaller). Using this stack is more efficient, RAM-wise, than using global variables, because your RAM usage will be the fixed stack size + globals + other stuff (program, heap, etc.). The stack acts as a reusable piece of memory.

C++ Some Stack questions

Let me start by saying that I have read this tutorial and have read this question. My questions are:

1. How big can the stack get? Is it processor/architecture/compiler dependent?
2. Is there a way to know exactly how much memory is available to my function/class stack and how much is currently being used, in order to avoid overflows?
3. Using modern compilers (say gcc 4.5) on a modern computer (say 6 GB RAM), do I need to worry about stack overflows or is it a thing of the past?
4. Is the actual stack memory physically in RAM or in CPU cache(s)?
5. How much faster is stack memory access and read compared to heap access and read? I realize that times are PC specific, so a ratio is enough.
6. I've read that it is not advisable to allocate big vars/objects on the stack. How much is too big? This question here is given an answer of 1 MB for a thread in win32. How about a thread on Linux amd64?

I apologize if those questions have been asked and answered already; any link is welcome!
Yes, the limit on the stack size varies, but if you care you're probably doing something wrong.
Generally no, you can't get information about how much memory is available to your program. Even if you could obtain such information, it would usually be stale before you could use it.
If you share access to data across threads, then yes you normally need to serialize access unless they're strictly read-only.
You can pass the address of a stack-allocated object to another thread, in which case you (again) have to serialize unless the access is strictly read-only.
You can certainly overflow the stack even on a modern machine with lots of memory. The stack is often limited to only a fairly small fraction of overall memory (e.g., 4 MB).
The stack is allocated as system memory, but usually used enough that at least the top page or two will typically be in the cache at any given time.
Being part of the stack vs. heap makes no direct difference to access speed -- the two typically reside in identical memory chips, often even at different addresses in the same memory chip. The main difference is that the stack is normally contiguous and heavily used, so the top few pages will almost always be in the cache. Heap-based memory is typically fragmented, so there's a much greater chance of needing data that's not in the cache.
Little has changed with respect to the maximum size of object you should allocate on the stack. Even if the stack can be larger, there's little reason to allocate huge objects there.
The primary way to avoid memory leaks in C++ is RAII (AKA SBRM, Stack-based resource management).
Smart pointers are a large subject in themselves, and Boost provides several kinds. In my experience, collections make a bigger difference, but the basic idea is largely the same either way: relieve the programmer of keeping track of every circumstance when a particular object can be used or should be freed.
1. How big can the stack get? Is it processor/architecture/compiler dependent?
The size of the stack is limited by the amount of memory on the platform and the amount of memory allocated to the process by the operating system.
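On POSIX systems (Linux, macOS) there is a platform-specific way to query the OS-imposed stack limit, getrlimit; this is a hedged sketch, not a standard C++ facility:

```cpp
#include <sys/resource.h>

// POSIX-only: ask the OS for the soft limit it places on this
// process's stack. Returns -1 if the query fails, 0 if the
// limit is "unlimited", otherwise the limit in bytes.
long stack_limit_bytes() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) return -1;
    if (rl.rlim_cur == RLIM_INFINITY) return 0;
    return static_cast<long>(rl.rlim_cur);
}
```

On a typical Linux desktop this reports around 8 MB, which is why multi-megabyte local arrays are risky even on machines with gigabytes of RAM.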
2. Is there a way to know exactly how much memory is available to my function/class stack and how much is currently being used, in order to avoid overflows?
There is no C or C++ facility for determining the amount of available memory. There may be platform specific functions for this. In general, most programs try to allocate memory, then come up with a solution for when the allocation fails.
3. Using modern compilers (say gcc 4.5) on a modern computer (say 6 GB RAM), do I need to worry about stack overflows or is it a thing of the past?
Stack Overflows can happen depending on the design of the program. Recursion is a good example of depleting the stack, regardless of the amount of memory.
4. Is the actual stack memory physically in RAM or in CPU cache(s)?
Platform dependent. Some CPUs can load up their cache with local variables on the stack. There is a wide variety of scenarios on this topic; it is not defined in the language specification.
5. How much faster is stack memory access and read compared to heap access and read? I realize that times are PC specific, so a ratio is enough.
Usually there is no difference in speed. It depends on how the platform organizes its memory (physically) and how the executable's memory is laid out. The heap or stack could reside in a serial-access memory chip (a slow method) or even on a Flash memory chip. This is not specified in the language specification.
6. I've read that it is not advisable to allocate big vars/objects on the stack. How much is too big? This question here is given an answer of 1 MB for a thread in win32. How about a thread on Linux amd64?
The best advice is to allocate small local variables as needed (a.k.a. on the stack). Huge items are either allocated from dynamic memory (a.k.a. the heap), or as some kind of global (a static local to a function, a local to a translation unit, or even a global variable). If the size is known at compile time, use the global type of allocation. Use dynamic memory when the size may change during run-time.
The stack also contains information about function return addresses. This is one major reason not to allocate a lot of objects locally. Some compilers have smaller limits for stacks than for heap or global variables. The premise is that nested function calls require less memory than large data arrays or buffers.
Remember that when switching threads or tasks, the OS needs to save the state somewhere. The OS may have different rules for saving stack memory versus other types.
1-2 : On some embedded CPUs the stack may be limited to a few kbytes; on some machines it may expand to gigabytes. There's no platform-independent way to know how big the stack can get, in some measure because some platforms are capable of expanding the stack when they reach the limit; the success of such an operation cannot always be predicted in advance.
3 : The effects of nearly-simultaneous writes, or of writes in one thread that occur nearly simultaneously with reads in another, are largely unpredictable in the absence of locks, mutexes, or other such devices. Certain things can be assumed (for example, if one thread reads a heap-stored 'int' while another thread changes it from 4 to 5, the first thread may see 4 or it may see 5; on most platforms, it would be guaranteed not to see 27).
4 : Some platforms share stack address space among threads; others do not. Passing pointers to things on the stack is usually a bad idea, though, since the foreign thread receiving the pointer has no way of ensuring that the target is in scope and won't go out of scope.
5 : Generally one does not need to worry about stack space in any routine which is written to limit recursion to a reasonable level. One does, however, need to worry about the possibility of defective data structures causing infinite recursion, which would wipe out any stack no matter how large it might be. One should also be mindful of the possibility of nasty input which would cause a much greater stack depth than expected. For example, a compiler using a recursive-descent parser might choke if fed a file containing a billion repetitions of the sequence "1+(". Even if the machine has a gig of stack space, if each nested sub-expression uses 64 bytes of stack, the aforementioned three-gig file could kill it.
6 : Stack is stored generally in RAM and/or cache; the most-recently-accessed parts will generally be in cache, while the less-recently-accessed parts will be in main memory. The same is generally true of code, heap, and static storage areas as well.
7 : That is very system dependent; generally, "finding" something on the heap will take as much time as accessing a few things on the stack, but in many cases making multiple accesses to different parts of the same heap object can be as fast as accessing a stack object.