In which cases is alloca() useful? - c++

Why would you ever want to use alloca() when you could always allocate a fixed size buffer on the stack large enough to fit all uses? This is not a rhetorical question...

It could be useful if the size of the buffer varies at runtime, or if you only sometimes need it: this would use less stack space overall than a fixed-size buffer in each call. Particularly if the function is high up the stack or recursive.

You might want to use it if there's no way to know the maximum size you might need at compile time.
Whether you should is another question - it's not standard, and there's no way to tell whether it might cause a stack overflow.
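For illustration, here is a minimal sketch of the variable-size case (the helper process_line() and the copying are invented for the example; alloca() itself is non-standard, declared in <alloca.h> with glibc and spelled _alloca in <malloc.h> on MSVC):

#include <alloca.h>   // non-standard: glibc; MSVC spells it _alloca in <malloc.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical helper standing in for whatever work is done with the copy.
static void process_line(char* copy) { std::printf("%s\n", copy); }

void handle(const char* line) {
    // Grab exactly as much stack as this call needs; it is released automatically
    // when handle() returns. A fixed "char buf[4096]" would pay the full cost on
    // every call and still impose an arbitrary upper limit on the line length.
    std::size_t n = std::strlen(line) + 1;
    char* copy = static_cast<char*>(alloca(n));
    std::memcpy(copy, line, n);
    process_line(copy);
}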

In which cases is alloca() useful?
The only time I ever saw alloca being used was in Open Dynamics Engine.
AFAIK they were allocating HUGE matrices with it (so the compiled program could require a 100 MB stack), which were automatically freed when the function returned (looks like a smart-pointer ripoff to me). This was quite a while ago.
Although it was probably much faster than new/malloc, I still think it was a bad idea.
Instead of politely running out of RAM, the program could crash with a stack overflow (i.e. misleading) when the scene became too complex to handle. Not nice behavior, IMO, especially for a physics engine, where you can easily expect someone to throw a few thousand bricks into the scene and see what happens when they all collide at once. Plus you had to set the stack size manually - i.e. on a system with more RAM, the program would still be limited by the stack size.
a fixed size buffer on the stack large enough to fit all uses? This is not a rhetorical question...
If you need a fixed-size buffer for all uses, then you could just as well put it into a static/global variable or use heap memory.

Using alloca() may be reasonable when you are unable to use malloc() (or new in C++, or another memory allocator) reliably, or at all, but you can assume there's more space available on your stack - that is, when you can't really do anything else.
For example, in glibc's segfault.c, we have:
/* This function is called when a segmentation fault is caught.  The system
   is in an unstable state now.  This means especially that malloc() might
   not work anymore.  */
static void
catch_segfault (int signal, SIGCONTEXT ctx)
{
  void **arr;
  /* ... */
  /* Get the backtrace.  */
  arr = alloca (256 * sizeof (void *));
  /* ... */
}

Never - it's not part of C++, and not useful in C. However, you cannot allocate "a static buffer on the stack" - static buffers are allocated at compile time, and not on the stack.
The point of alloca() is of course that it is not of fixed size, it is on the stack, and it is freed automatically when the function exits. Both C++ and C have better mechanisms to handle this.
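As a sketch of those better mechanisms (nothing here comes from the answer above; it simply shows the usual replacements): a std::vector in C++, or a C99 variable-length array, covers the same "runtime-sized, automatically released" case without alloca():

#include <vector>

void work(unsigned n) {
    // Runtime-sized, released automatically when 'buf' goes out of scope, and it
    // reports failure via std::bad_alloc instead of silently overrunning the stack.
    std::vector<float> buf(n);
    // ... use buf.data(), buf.size() ...
}

/* C99 equivalent (a VLA; valid C99, not standard C++):
void work_c(unsigned n) {
    float buf[n];   // sized at runtime, freed on scope exit
}
*/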

The alloca() function is virtually never needed; for memory allocation purposes, you can use malloc()/free() in C (or one of the collection of possibilities in C++) and achieve pretty much the same practical effect. This has the advantage of coping better with smaller stack sizes.
However I have seen[1] one legit (if hacky!) use of it: for detecting potential stack overflow on Windows; if the allocation (of the amount of slop space you wanted to access) failed, you were out but had enough room to recover gracefully. It was wrapped in __try/__except so that it didn't crash, and needed extra assembler tricks to avoid gcc-induced trouble. As I said, a hack. But a clever one that is the only valid use for alloca() that I've ever seen.
But don't do that. Better to write the code to not need such games.
[1] It was in Tcl 8.4 (and possibly earlier versions of Tcl). It was removed in later versions because it was finicky, very tricky and deeply troubling. 8.6 uses a stackless implementation of the execution engine instead of that sort of funkiness.

What are the next steps to improve a malloc() algorithm? [closed]

I'm writing my own simple malloc() function and I would like to make a faster and more efficient variant. I've written a function that uses linear search and allocates sequentially and contiguously in memory.
What is the next step to improve this algorithm? What are the main shortcomings of my current version? I would be very grateful for any feedback and recommendations.
#include <stddef.h>    /* size_t, NULL */
#include <stdbool.h>   /* bool */

typedef struct heap_block
{
    struct heap_block* next;
    size_t size;
    bool isfree;
} header;

#define Heap_Capacity 100000
static char heap[Heap_Capacity];
size_t heap_size;

void set_data_to_block(header* block, size_t sz);
header* to_end_data(header* block);

void* malloc(size_t sz)
{
    if(sz == 0 || sz > Heap_Capacity) { return NULL; }
    header* block = (header*)heap;
    if(heap_size == 0)
    {
        set_data_to_block(block, sz);
        return (void*)(block+1);
    }
    while(block->next != NULL) { block = block->next; }
    block->next = (header*)((char*)to_end_data(block) + 8);
    header* new_block = block->next;
    set_data_to_block(new_block, sz);
    return (void*)(new_block+1);
}

void set_data_to_block(header* block, size_t sz)
{
    block->size = sz;
    block->isfree = false;
    block->next = NULL;
    heap_size += sz;
}

header* to_end_data(header* block)
{
    return (header*)((size_t)(block+1) + block->size);
}
Notice that malloc is often built on top of lower-level memory-related syscalls (e.g. mmap(2) on Linux). See this answer, which mentions GNU glibc and musl-libc. Look also inside tcmalloc, and study the source code of several free-software malloc implementations.
Some general ideas for your malloc:
retrieve memory from the OS using mmap (and eventually release it back to the OS kernel with munmap). You certainly should not allocate a fixed-size heap (since on a 64-bit computer with 128 GB of RAM you would want a malloc of a 10-billion-byte zone to succeed).
segregate small allocations from big ones, so handle a malloc of 16 bytes differently from a malloc of a megabyte. The typical threshold between small and large allocations is generally a small multiple of the page size (which is often 4 KB). Small allocations happen inside pages; large allocations are rounded up to whole pages. You might even handle a malloc of two words (like in many linked lists) as a special case.
round up the requested size to some fancy number (e.g. powers of 2, or 3 times a power of 2).
manage memory zones of similar sizes - that is, of the same "fancy" size - together.
for small memory zones, avoid reclaiming the memory too early; keep previously free-d zones of the same (small) size around to reuse them in future calls to malloc.
you might use some tricks on the address (but your system might have ASLR), or keep near each memory zone a word of meta-data describing the chunk of which it is a member.
a significant issue is, given some address previously returned by malloc and later passed as the argument of free, finding out the allocated size of that memory zone. You could manipulate the address bits, you could store that size in the word just before the returned pointer, you could use some hash table, etc. Details are tricky (a sketch of the size-in-a-header-word approach is shown below).
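A minimal sketch of that last idea - storing the size in a word just before the returned pointer - follows. It only illustrates the common technique; the names my_alloc/my_free are invented, and it layers on std::malloc purely to keep the sketch short:

#include <cstddef>
#include <cstdlib>

struct alloc_header { std::size_t size; };   // one word of metadata per block

void* my_alloc(std::size_t sz)
{
    // Reserve room for the header plus the payload (plain std::malloc keeps the
    // sketch short; a real allocator would carve this from its own arena).
    alloc_header* h = static_cast<alloc_header*>(std::malloc(sizeof(alloc_header) + sz));
    if (!h) return nullptr;
    h->size = sz;      // remember the payload size
    return h + 1;      // hand out the address just past the header
}

void my_free(void* p)
{
    if (!p) return;
    alloc_header* h = static_cast<alloc_header*>(p) - 1;   // step back to the header
    // h->size is now known, so the block can be returned to the right size class.
    std::free(h);
}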
Notice that the details are tricky, and it might be quite hard to write a malloc implementation better than your system's one. In practice, writing a good malloc is not a simple task. You should find many academic papers on the subject.
Look also into garbage collection techniques. Consider perhaps Boehm's conservative GC: you will replace malloc by GC_MALLOC and you won't bother about free... Learn about memory pools.
There are 3 ways to improve:
make it more robust
optimise the way memory is allocated
optimise the code that allocates the memory
Make It More Robust
There are many common programmer mistakes that could easily be detected (e.g. modifying data beyond the end of the allocated block). For a very simple (and relatively fast) example, it's possible to insert "canaries" before and after the allocated block, and to detect programmer mistakes during free and realloc by checking whether the canaries are intact (if they aren't, then the programmer has trashed something by mistake). This only works "sometimes, maybe". The problem is that (for a simple implementation of malloc) the metadata is mixed in with the allocated blocks, so there's a chance that the metadata has been corrupted even if the canaries haven't. To fix that, it'd be a good idea to separate the metadata from the allocated blocks. Also, merely reporting "something was corrupted" doesn't help as much as you'd hope. Ideally you'd want some sort of identifier for each block (e.g. the name of the function that allocated it) so that if/when problems occur you can report what was corrupted. Of course this could/should maybe be done via a macro, so that those identifiers can be omitted when not debugging.
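For illustration only (the pattern, the names, and passing the size to dbg_free are all invented to keep the sketch short; a real allocator would look the size up itself), a canary scheme typically looks like this:

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

static const std::uint32_t CANARY = 0xDEADBEEF;   // arbitrary known pattern

void* dbg_alloc(std::size_t sz)
{
    // Layout: [front canary][payload of sz bytes][back canary]
    char* raw = static_cast<char*>(std::malloc(sizeof(CANARY) + sz + sizeof(CANARY)));
    if (!raw) return nullptr;
    std::memcpy(raw, &CANARY, sizeof(CANARY));
    std::memcpy(raw + sizeof(CANARY) + sz, &CANARY, sizeof(CANARY));
    return raw + sizeof(CANARY);
}

void dbg_free(void* p, std::size_t sz)   // size passed in to keep the sketch short
{
    if (!p) return;
    char* raw = static_cast<char*>(p) - sizeof(CANARY);
    std::uint32_t front, back;
    std::memcpy(&front, raw, sizeof(front));
    std::memcpy(&back, raw + sizeof(CANARY) + sz, sizeof(back));
    assert(front == CANARY && back == CANARY && "heap block over/underrun detected");
    std::free(raw);
}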
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to return acceptable error conditions ("failed to allocate" is the only error it can return) and no way to pass additional information. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier) (with similar alterations to free and realloc to enable them to return a status code and identifier).
Optimise the Way Memory is Allocated
It's naive to assume that all memory is the same. It's not. Cache locality (including TLB locality) and other cache effects, and things like NUMA optimisation, all matter. For a simple example, imagine you're writing an application that has a structure describing a person (including a hash of their name) and a pointer to the person's name string; and both the structure and the name string are allocated via malloc. The normal end result is that those structures and strings end up mixed together in the heap; so when you're searching through these structures (e.g. trying to find the structure that contains the correct hash) you end up pounding caches and TLBs. To optimise this properly you'd want to ensure that all the structures are close together in the heap. For that to work, malloc needs to know the difference between allocating 32 bytes for the structure and allocating 32 bytes for the name string. You need to introduce the concept of "memory pools" (e.g. where everything in "memory pool number 1" is kept close together in the heap).
Other important optimisations include "cache colouring" (see http://en.wikipedia.org/wiki/Cache_coloring ). For NUMA systems, it can be important to know the difference between allocations where maximum bandwidth is needed (where using memory from multiple NUMA domains increases bandwidth) and allocations where staying within the local NUMA domain matters more.
Finally, it'd be nice (to manage heap fragmentation, etc.) to use different strategies for "temporary, likely to be freed soon" allocations and for longer-term allocations (e.g. where it's worth doing extra work to minimise fragmentation and wasted space/RAM).
Note: I'd estimate that getting all of this right can mean software running up to 20% faster in specific cases, due to far less cache misses, more bandwidth where it's needed, etc.
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to pass the additional information to malloc in the first place. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier, int pool, int optimisationFlags) (with similar alterations to realloc).
Optimise the Code That Allocates the Memory
Given that you can assume memory is used more frequently than it's allocated, this is the least important of the three (e.g. less important than getting things like cache locality for the allocated blocks right).
Quite frankly, anyone that actually wants decent performance or decent debugging shouldn't be using malloc to begin with - generic solutions to specific problems are never ideal. With this in mind (and not forgetting that the interface to malloc is lame and broken and prevents everything that's important anyway) I'd recommend simply not bothering with malloc and creating something that's actually good (but non-standard) instead. For this you can adapt the algorithms used by existing implementations of malloc.

Detecting that the stack is full

When writing C++ code I've learned that using the stack to store data is a good idea.
But recently I ran into a problem:
I was experimenting with code that looked like this:
void fun(const unsigned int N) {
    float data_1[N*N];
    float data_2[N*N];
    /* Do magic */
}
The code exploded with a segmentation fault at random, and I had no idea why.
It turned out that the problem was that I was trying to store things that were too big on my stack. Is there a way of detecting this? Or at least detecting that it has gone wrong?
float data_1[N*N];
float data_2[N*N];
These are variable-length arrays (VLAs), because N is not a constant expression. The const-ness of the parameter only ensures that N is read-only; it doesn't tell the compiler that N is a constant expression.
VLAs are allowed in C99 only; in other versions of C, and in all versions of C++, they're not allowed. However, some compilers provide VLAs as a compiler extension. If you're compiling with GCC, try the -pedantic option; it will tell you they are not allowed.
Now, why does your program segfault? Probably because of a stack overflow due to a large value of N * N.
Consider using std::vector instead:
#include <vector>

void fun(const unsigned int N)
{
    std::vector<float> data_1(N*N);
    std::vector<float> data_2(N*N);
    //your code
}
It's extremely difficult to detect that the stack is full, and not at all portable. One of the biggest problems is that stack frames are of variable size (especially when using variable-length arrays, which are really just a more standard way of doing what people were doing before with alloca()) so you can't use simple proxies like the number of stack frames.
One of the simplest methods that is mostly portable is to put a variable (probably of type char, so that a pointer to it is a char*) at a known depth on the stack and then measure the distance from that point to a variable (of the same type) in the current stack frame by simple pointer arithmetic. Add an estimate of how much space you're about to allocate, and you have a good guess as to whether the stack is about to blow up on you. The problems with this are that you don't know which direction the stack grows in (no, they don't all grow in the same direction!) and that working out the size of the stack space is itself rather messy (you can try things like system limits, but they're really quite awkward). Plus the hack factor is very high.
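A rough sketch of that idea (all names and the 1 MB stack-size constant are invented; pointer arithmetic between unrelated objects is technically undefined behaviour, which is part of why this is a hack):

#include <cstddef>
#include <cstdio>

static char* g_stack_anchor = nullptr;   // points at a char living in main()'s frame

// Rough estimate of how many bytes of stack lie between main()'s frame and here.
// Note: pointer arithmetic between different objects is technically undefined
// behaviour - this is exactly the kind of hack described above.
static std::size_t stack_used_estimate()
{
    char marker;
    std::ptrdiff_t diff = g_stack_anchor - &marker;
    return static_cast<std::size_t>(diff < 0 ? -diff : diff);   // either growth direction
}

static bool probably_enough_stack(std::size_t about_to_allocate)
{
    const std::size_t ASSUMED_STACK_SIZE = 1u << 20;   // 1 MB: a platform-specific guess!
    const std::size_t SAFETY_MARGIN = 16u * 1024;
    return stack_used_estimate() + about_to_allocate + SAFETY_MARGIN < ASSUMED_STACK_SIZE;
}

int main()
{
    char anchor;                 // stays near the base of the stack for the whole run
    g_stack_anchor = &anchor;
    std::printf("enough room for 64 KB? %d\n", probably_enough_stack(64 * 1024));
}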
The other trick I've seen used on 32-bit Windows only was to try to alloca() sufficient space and handle the system exception that would occur if there was insufficient room.
int have_enough_stack_space(void) {
    int enough_space = 0;
    __try { /* Yes, that's got a double-underscore. */
        alloca(SOME_VALUE_THAT_MEANS_ENOUGH_SPACE);
        enough_space = 1;
    } __except (EXCEPTION_EXECUTE_HANDLER) {}
    return enough_space;
}
This code is very non-portable (e.g., don't count on it working on 64-bit Windows) and building with older gcc requires some nasty inline assembler instead! Structured exception handling (which this is a use of) is amongst the blackest of black arts on Windows. (And don't return from inside the __try construct.)
Try instead using functions like malloc. It will explicitly return NULL if it fails to find a block of memory of the size you requested.
Of course, in that case don't forget to free this memory at the end of the function, after you are done.
Also, you can check your compiler settings to see what stack size limit it builds the binaries with.
One of the reasons people say it is better to use the stack instead of heap memory is that variables allocated on top of the stack are popped off automatically when you leave the body of the function. For storing big blocks of information it is usual to use heap memory and other data structures like linked lists or trees. Also, memory allocated on the stack is limited and much smaller than what you can allocate in the heap. I think it is better to manage memory allocation and release more carefully instead of trying to use the stack for storing big data.
You can use a framework which manages your memory allocations. You can also use VDL to check for memory leaks and memory which is not released.
is there a way of detecting this?
In general, no.
Stack size is platform dependent. Typically, the operating system decides the size of the stack, so you can check your OS (ulimit -s on Linux) to see how much stack memory it allocates for your program.
If your compiler supports stackavail(), then you can check it. It's better to go with heap-allocated memory in situations where you are unsure whether you'd exceed the stack limit.
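For example, on a POSIX system you can query the configured limit from inside the program with getrlimit (this reports the limit the process runs with, not how much of it is already used):

#include <sys/resource.h>
#include <cstdio>

int main()
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
    {
        // rlim_cur is the soft limit the process actually runs with;
        // RLIM_INFINITY means the stack may grow without a fixed cap.
        if (rl.rlim_cur == RLIM_INFINITY)
            std::puts("stack size: unlimited");
        else
            std::printf("stack size limit: %llu bytes\n",
                        static_cast<unsigned long long>(rl.rlim_cur));
    }
    return 0;
}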

Stack Allocation vs Heap Allocation [duplicate]

This question may sound fairly elementary, but this is a debate I had with another developer I work with.
I was taking care to stack allocate things where I could, instead of heap allocating them. He was talking to me and watching over my shoulder and commented that it wasn't necessary because they are the same performance wise.
I was always under the impression that growing the stack was constant time, and heap allocation's performance depended on the current complexity of the heap for both allocation (finding a hole of the proper size) and de-allocating (collapsing holes to reduce fragmentation, as many standard library implementations take time to do this during deletes if I am not mistaken).
This strikes me as something that would probably be very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC and MSVC++, what is the case? Is heap allocation not as high performing as stack allocation? Is there no difference? Or are the differences so minute that it becomes pointless micro-optimization?
Stack allocation is much faster since all it really does is move the stack pointer.
Using memory pools, you can get comparable performance out of heap allocation, but that comes with a slight added complexity and its own headaches.
Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.
Stack is much faster. It literally only uses a single instruction on most architectures, in most cases, e.g. on x86:
sub esp, 0x10
(That moves the stack pointer down by 0x10 bytes and thereby "allocates" those bytes for use by a variable.)
Of course, the stack's size is very, very finite, as you will quickly find out if you overuse stack allocation or try to do recursion :-)
Also, there's little reason to optimize the performance of code that doesn't verifiably need it, such as demonstrated by profiling. "Premature optimization" often causes more problems than it's worth.
My rule of thumb: if I know I'm going to need some data at compile-time, and it's under a few hundred bytes in size, I stack-allocate it. Otherwise I heap-allocate it.
Honestly, it's trivial to write a program to compare the performance:
#include <ctime>
#include <iostream>

namespace {
    class empty { }; // even empty classes take up 1 byte of space, minimum
}

int main()
{
    std::clock_t start = std::clock();
    for (int i = 0; i < 100000; ++i)
        empty e;
    std::clock_t duration = std::clock() - start;
    std::cout << "stack allocation took " << duration << " clock ticks\n";

    start = std::clock();
    for (int i = 0; i < 100000; ++i) {
        empty* e = new empty;
        delete e;
    }
    duration = std::clock() - start;
    std::cout << "heap allocation took " << duration << " clock ticks\n";
}
It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.
An optimizing compiler may notice that this code does nothing, and may optimize it all away. It is the optimizer's job to do stuff like that, and fighting the optimizer is a fool's errand.
I would recommend compiling this code with optimization turned off because there is no good way to fool every optimizer currently in use or that will be in use in the future.
Anybody who turns the optimizer on and then complains about fighting it should be subject to public ridicule.
If I cared about nanosecond precision I wouldn't use std::clock(). If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.
The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:
Add a data member to empty, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."
Declare e volatile, but volatile is often compiled incorrectly (PDF).
Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.
Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement.
On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.
Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000 stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.
In the years since I wrote this, the preference on Stack Overflow has been to post performance from optimized builds. In general, I think this is correct. However, I still think it's silly to ask the compiler to optimize code when you in fact do not want that code optimized. It strikes me as being very similar to paying extra for valet parking, but refusing to hand over the keys. In this particular case, I don't want the optimizer running.
Using a slightly modified version of the benchmark (to address the valid point that the original program didn't allocate something on the stack each time through the loop) and compiling without optimizations but linking to release libraries (to address the valid point that we don't want to include any slowdown caused by linking to debug libraries):
#include <cstdio>
#include <chrono>

namespace {
    void on_stack()
    {
        int i;
    }

    void on_heap()
    {
        int* i = new int;
        delete i;
    }
}

int main()
{
    auto begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_stack();
    auto end = std::chrono::system_clock::now();
    std::printf("on_stack took %f seconds\n", std::chrono::duration<double>(end - begin).count());

    begin = std::chrono::system_clock::now();
    for (int i = 0; i < 1000000000; ++i)
        on_heap();
    end = std::chrono::system_clock::now();
    std::printf("on_heap took %f seconds\n", std::chrono::duration<double>(end - begin).count());
    return 0;
}
displays:
on_stack took 2.070003 seconds
on_heap took 57.980081 seconds
on my system when compiled with the command line cl foo.cc /Od /MT /EHsc.
You may not agree with my approach to getting a non-optimized build. That's fine: feel free modify the benchmark as much as you want. When I turn on optimization, I get:
on_stack took 0.000000 seconds
on_heap took 51.608723 seconds
Not because stack allocation is actually instantaneous but because any half-decent compiler can notice that on_stack doesn't do anything useful and can be optimized away. GCC on my Linux laptop also notices that on_heap doesn't do anything useful, and optimizes it away as well:
on_stack took 0.000003 seconds
on_heap took 0.000002 seconds
An interesting thing I learned about Stack vs. Heap Allocation on the Xbox 360 Xenon processor, which may also apply to other multicore systems, is that allocating on the Heap causes a Critical Section to be entered to halt all other cores so that the alloc doesn't conflict. Thus, in a tight loop, Stack Allocation was the way to go for fixed sized arrays as it prevented stalls.
This may be another speedup to consider if you're coding for multicore/multiproc, in that your stack allocation will only be viewable by the core running your scoped function, and that will not affect any other cores/CPUs.
You can write a special heap allocator for specific sizes of objects that is very performant. However, the general heap allocator is not particularly performant.
Also I agree with Torbjörn Gyllebring about the expected lifetime of objects. Good point!
Concerns Specific to the C++ Language
First of all, there is no so-called "stack" or "heap" allocation mandated by C++. If you are talking about automatic objects in block scopes, they are not even "allocated". (By the way, automatic storage duration in C is definitely NOT the same as "allocated"; the latter is "dynamic" in the C++ parlance.) The dynamically allocated memory is on the free store, not necessarily on "the heap", though the latter is often the (default) implementation.
Although as per the abstract machine semantic rules, automatic objects still occupy memory, a conforming C++ implementation is allowed to ignore this fact when it can prove this does not matter (when it does not change the observable behavior of the program). This permission is granted by the as-if rule in ISO C++, which is also the general clause enabling the usual optimizations (and there is also an almost same rule in ISO C). Besides the as-if rule, ISO C++ also has copy elision rules to allow omission of specific creations of objects. The constructor and destructor calls involved are thereby omitted. As a result, the automatic objects (if any) in these constructors and destructors are also eliminated, compared to naive abstract semantics implied by the source code.
On the other hand, free store allocation is definitely "allocation" by design. Under ISO C++ rules, such an allocation can be achieved by a call of an allocation function. However, since ISO C++14, there is a new (non-as-if) rule to allow merging global allocation function (i.e. ::operator new) calls in specific cases. So parts of dynamic allocation operations can also be no-op like the case of automatic objects.
Allocation functions allocate resources of memory. Objects can then be allocated on top of that memory using allocators. Automatic objects, by contrast, are presented directly: although the underlying memory can be accessed and used to provide storage for other objects (by placement new), this does not make as much sense as it does for the free store, because there is no way to move the resource elsewhere.
All other concerns are out of the scope of C++. Nevertheless, they can be still significant.
About Implementations of C++
C++ does not expose reified activation records or any sort of first-class continuation (e.g. the famous call/cc); there is no way to directly manipulate the activation record frames - the place where the implementation needs to put the automatic objects. Once there are no (non-portable) interoperations with the underlying implementation ("native" non-portable code, such as inline assembly code), an omission of the underlying allocation of the frames can be quite trivial. For example, when the called function is inlined, the frames can be effectively merged into others, so there is no way to show what the "allocation" is.
However, once interops are respected, things get complex. A typical implementation of C++ will expose the ability to interoperate at the ISA (instruction-set architecture) level, with some calling conventions forming the binary boundary shared with the native (ISA-level machine) code. This is noticeably costly, particularly when maintaining the stack pointer, which is often held directly in an ISA-level register (with possibly specific machine instructions to access it). The stack pointer indicates the boundary of the top frame of the (currently active) function call. When a function call is entered, a new frame is needed and the stack pointer is added to or subtracted from (depending on the convention of the ISA) by a value not less than the required frame size; the frame is then said to be allocated. Parameters of the function may be passed onto the stack frame as well, depending on the calling convention used for the call. The frame can hold the memory of the automatic objects (possibly including the parameters) specified by the C++ source code. In the sense of such implementations, these objects are "allocated". When control exits the function call, the frame is no longer needed, and it is usually released by restoring the stack pointer back to the state before the call (saved previously according to the calling convention). This can be viewed as "deallocation". These operations make the activation records effectively a LIFO data structure, so the whole thing is often called "the (call) stack". The stack pointer effectively indicates the top position of the stack.
Because most C++ implementations (particularly the ones targeting ISA-level native code and using the assembly language as its immediate output) use similar strategies like this, such a confusing "allocation" scheme is popular. Such allocations (as well as deallocations) do spend machine cycles, and it can be expensive when the (non-optimized) calls occur frequently, even though modern CPU microarchitectures can have complex optimizations implemented by hardware for the common code pattern (like using a stack engine in implementing PUSH/POP instructions).
But anyway, in general, it is true that the cost of stack frame allocation - a little work to maintain the stack pointer and other state - is significantly less than the cost of a call to an allocation function operating on the free store (unless that call is totally optimized away), which itself can involve hundreds (if not millions :-) of operations. Allocation functions are typically based on an API provided by the hosted environment (e.g. a runtime provided by the OS). Unlike the purpose of holding automatic objects for function calls, such allocations are general-purpose, so they will not have a frame structure like a stack. Traditionally, they allocate space from a pool of storage called the heap (or several heaps). Different from the "stack", the concept "heap" here does not indicate the data structure being used; it is derived from early language implementations decades ago. (By the way, the call stack is usually allocated with a fixed or user-specified size from the heap by the environment at program/thread startup.) The nature of the use cases makes allocations and deallocations from a heap far more complicated (than pushing/popping stack frames), and hardly possible to optimize directly in hardware.
Effects on Memory Access
The usual stack allocation always puts the new frame on the top, so it has a quite good locality. This is friendly to the cache. OTOH, memory allocated randomly in the free store has no such property. Since ISO C++17, there are pool resource templates provided by <memory_resource>. The direct purpose of such an interface is to allow the results of consecutive allocations being close together in memory. This acknowledges the fact that this strategy is generally good for performance with contemporary implementations, e.g. being friendly to cache in modern architectures. This is about the performance of access rather than allocation, though.
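A minimal C++17 sketch of those pool resources (the buffer size and the container are arbitrary choices): a monotonic buffer resource carves consecutive allocations out of one local buffer, keeping them adjacent in memory:

#include <memory_resource>
#include <vector>

void example()
{
    // One local buffer; every allocation made through 'pool' is carved out of it
    // sequentially, so consecutive elements/nodes end up close together.
    char buffer[4096];
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof buffer);

    std::pmr::vector<int> v(&pool);   // uses the pool instead of the global heap
    for (int i = 0; i < 100; ++i)
        v.push_back(i);
}   // the memory is released wholesale when 'pool' goes out of scope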
Concurrency
Expectation of concurrent access to memory can have different effects between the stack and heaps. A call stack is usually exclusively owned by one thread of execution in a typical C++ implementation. OTOH, heaps are often shared among the threads in a process. For such heaps, the allocation and deallocation functions have to protect the shared internal administrative data structure from the data race. As a result, heap allocations and deallocations may have additional overhead due to internal synchronization operations.
Space Efficiency
Due to the nature of the use cases and internal data structures, heaps may suffer from internal memory fragmentation, while the stack does not. This does not have a direct impact on the performance of memory allocation, but in a system with virtual memory, low space efficiency may degrade the overall performance of memory access. This is particularly awful when an HDD is used as swap for physical memory. It can cause quite long latency - sometimes billions of cycles.
Limitations of Stack Allocations
Although stack allocations are often superior in performance to heap allocations in reality, it certainly does not mean stack allocations can always replace heap allocations.
First, there is no portable way with ISO C++ to allocate space on the stack with a size specified at runtime. There are extensions provided by implementations, like alloca and G++'s VLA (variable-length array), but there are reasons to avoid them. (IIRC, the Linux kernel source removed its uses of VLAs recently.) (Also note that ISO C99 does mandate VLAs, but ISO C11 makes the support optional.)
Second, there is no reliable and portable way to detect stack space exhaustion. This is often called stack overflow (hmm, the etymology of this site's name), but probably more accurately, stack overrun. In reality, this often causes an invalid memory access, and the state of the program is then corrupted (... or maybe worse, it becomes a security hole). In fact, ISO C++ has no concept of "the stack" and makes it undefined behavior when the resource is exhausted. Be cautious about how much room should be left for automatic objects.
If the stack space runs out, there are too many objects allocated on the stack, which can be caused by too many active calls of functions or improper use of automatic objects. Such cases may suggest the existence of bugs, e.g. a recursive function call without a correct exit condition.
Nevertheless, deep recursive calls are sometimes desired. In implementations of languages requiring support for unbounded active calls (where the call depth is limited only by total memory), it is impossible to use the (contemporary) native call stack directly as the target language's activation records, as typical C++ implementations do. To work around the problem, alternative ways of constructing activation records are needed. For example, SML/NJ explicitly allocates frames on the heap and uses cactus stacks. The complicated allocation of such activation record frames is usually not as fast as call stack frames. However, if such languages are implemented further with a guarantee of proper tail recursion, direct stack allocation in the object language (that is, where an "object" in the language is not stored as a reference, but as a native primitive value which can be mapped one-to-one to an unshared C++ object) is even more complicated, with more performance penalty in general. When using C++ to implement such languages, it is difficult to estimate the performance impact.
I don't think stack allocation and heap allocation are generally interchangeable. I also hope that the performance of both of them is sufficient for general use.
I'd strongly recommend for small items, whichever one is more suitable to the scope of the allocation. For large items, the heap is probably necessary.
On 32-bit operating systems that have multiple threads, stack is often rather limited (albeit typically to at least a few mb), because the address space needs to be carved up and sooner or later one thread stack will run into another. On single threaded systems (Linux glibc single threaded anyway) the limitation is much less because the stack can just grow and grow.
On 64-bit operating systems there is enough address space to make thread stacks quite large.
Usually stack allocation just consists of subtracting from the stack pointer register. This is tons faster than searching a heap.
Sometimes stack allocation requires adding a page(s) of virtual memory. Adding a new page of zeroed memory doesn't require reading a page from disk, so usually this is still going to be tons faster than searching a heap (especially if part of the heap was paged out too). In a rare situation, and you could construct such an example, enough space just happens to be available in part of the heap which is already in RAM, but allocating a new page for the stack has to wait for some other page to get written out to disk. In that rare situation, the heap is faster.
Aside from the orders-of-magnitude performance advantage over heap allocation, stack allocation is preferable for long running server applications. Even the best managed heaps eventually get so fragmented that application performance degrades.
Probably the biggest problem of heap allocation versus stack allocation, is that heap allocation in the general case is an unbounded operation, and thus you can't use it where timing is an issue.
For other applications where timing isn't an issue, it may not matter as much, but if you heap-allocate a lot, this will affect the execution speed. Always try to use the stack for short-lived and often-allocated memory (for instance in loops), and, as much as possible, do heap allocation during application startup.
A stack has a limited capacity, while a heap does not. The typical default stack for a process or thread is only a few megabytes (for example, 8 MB on Linux and 1 MB on Windows by default), and you cannot change the size once it's allocated.
A stack variable follows the scoping rules, while a heap one doesn't. If your instruction pointer goes beyond a function, all the new variables associated with the function go away.
Most important of all, you can't predict the overall function call chain in advance. So a mere 200-byte allocation on your part may raise a stack overflow. This is especially important if you're writing a library, not an application.
It's not just stack allocation that's faster. You also win a lot by using stack variables. They have better locality of reference. And finally, deallocation is a lot cheaper too.
As others have said, stack allocation is generally much faster.
However, if your objects are expensive to copy, allocating on the stack may lead to a huge performance hit later when you use the objects, if you aren't careful.
For example, if you allocate something on the stack, and then put it into a container, it would have been better to allocate on the heap and store the pointer in the container (e.g. with a std::shared_ptr<>). The same thing is true if you are passing or returning objects by value, and other similar scenarios.
The point is that although stack allocation is usually better than heap allocation in many cases, sometimes if you go out of your way to stack allocate when it doesn't best fit the model of computation, it can cause more problems than it solves.
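A small sketch of that point (the Widget type and sizes are invented): copying a heavy stack object into a container versus storing a heap-allocated one through a std::shared_ptr:

#include <array>
#include <memory>
#include <vector>

struct Widget {
    std::array<double, 4096> payload{};   // deliberately expensive to copy (~32 KB)
};

void fill(std::vector<Widget>& by_value,
          std::vector<std::shared_ptr<Widget>>& by_pointer)
{
    Widget w;                      // stack-allocated
    by_value.push_back(w);         // copies the whole ~32 KB into the container

    auto p = std::make_shared<Widget>();   // heap-allocated once
    by_pointer.push_back(p);               // copies only a pointer (plus a refcount bump)
}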
I think the lifetime is crucial, and whether the thing being allocated has to be constructed in a complex way. For example, in transaction-driven modeling, you usually have to fill in and pass in a transaction structure with a bunch of fields to operation functions. Look at the OSCI SystemC TLM-2.0 standard for an example.
Allocating these on the stack close to the call to the operation tends to cause enormous overhead, as the construction is expensive. The good way there is to allocate on the heap and reuse the transaction objects either by pooling or a simple policy like "this module only needs one transaction object ever".
This is many times faster than allocating the object on each operation call.
The reason is simply that the object has an expensive construction and a fairly long useful lifetime.
I would say: try both and see what works best in your case, because it can really depend on the behavior of your code.
Stack allocation is a couple of instructions, whereas the fastest RTOS heap allocator known to me (TLSF) uses on average on the order of 150 instructions. Also, stack allocations don't require a lock, because each thread has its own stack, which is another huge performance win. So stack allocations can be 2-3 orders of magnitude faster, depending on how heavily multithreaded your environment is.
In general, heap allocation is your last resort if you care about performance. A viable in-between option can be a fixed pool allocator, which is also only a couple of instructions and has very little per-allocation overhead, so it's great for small, fixed-size objects. On the downside, it only works with fixed-size objects, is not inherently thread-safe, and has block fragmentation problems.
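A minimal free-list pool allocator along those lines might look like this (an illustrative sketch, not code from TLSF or any particular RTOS; single-threaded, fixed block size):

#include <cstddef>

template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
    union Node { Node* next; alignas(std::max_align_t) char storage[BlockSize]; };
    Node  nodes_[BlockCount];
    Node* free_ = nullptr;
public:
    FixedPool() {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            nodes_[i].next = free_;
            free_ = &nodes_[i];
        }
    }
    void* allocate() {                    // O(1): pop the head of the free list
        if (!free_) return nullptr;
        Node* n = free_;
        free_ = n->next;
        return n->storage;
    }
    void deallocate(void* p) {            // O(1): push the block back onto the free list
        Node* n = reinterpret_cast<Node*>(p);
        n->next = free_;
        free_ = n;
    }
};

// Usage: FixedPool<64, 1024> pool; void* p = pool.allocate(); pool.deallocate(p);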
There's a general point to be made about such optimizations.
The optimization you get is proportional to the amount of time the program counter is actually in that code.
If you sample the program counter, you will find out where it spends its time, and that is usually in a tiny part of the code, and often in library routines you have no control over.
Only if you find it spending much time in the heap-allocation of your objects will it be noticeably faster to stack-allocate them.
Stack allocation will almost always be as fast or faster than heap allocation, although it is certainly possible for a heap allocator to simply use a stack based allocation technique.
However, there are larger issues when dealing with the overall performance of stack vs. heap based allocation (or in slightly better terms, local vs. external allocation). Usually, heap (external) allocation is slow because it is dealing with many different kinds of allocations and allocation patterns. Reducing the scope of the allocator you are using (making it local to the algorithm/code) will tend to increase performance without any major changes. Adding better structure to your allocation patterns, for example, forcing a LIFO ordering on allocation and deallocation pairs can also improve your allocator's performance by using the allocator in a simpler and more structured way. Or, you can use or write an allocator tuned for your particular allocation pattern; most programs allocate a few discrete sizes frequently, so a heap that is based on a lookaside buffer of a few fixed (preferably known) sizes will perform extremely well. Windows uses its low-fragmentation-heap for this very reason.
On the other hand, stack-based allocation on a 32-bit memory range is also fraught with peril if you have too many threads. Stacks need a contiguous memory range, so the more threads you have, the more virtual address space you will need for them to run without a stack overflow. This won't be a problem (for now) with 64-bit, but it can certainly wreak havoc in long running programs with lots of threads. Running out of virtual address space due to fragmentation is always a pain to deal with.
Remark that the considerations are typically not about speed and performance when choosing stack versus heap allocation. The stack acts like a stack, which means it is well suited for pushing blocks and popping them again, last in, first out. Execution of procedures is also stack-like, last procedure entered is first to be exited. In most programming languages, all the variables needed in a procedure will only be visible during the procedure's execution, thus they are pushed upon entering a procedure and popped off the stack upon exit or return.
Now for an example where the stack cannot be used:
Proc P
{
    pointer x;
    Proc S
    {
        pointer y;
        y = allocate_some_data();
        x = y;
    }
}
If you allocate some memory in procedure S and put it on the stack and then exit S, the allocated data will be popped off the stack. But the variable x in P also pointed to that data, so x is now pointing to some place underneath the stack pointer (assume stack grows downwards) with an unknown content. The content might still be there if the stack pointer is just moved up without clearing the data beneath it, but if you start allocating new data on the stack, the pointer x might actually point to that new data instead.
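The same situation expressed in C++ (a sketch): a pointer that outlives the stack frame it points into, and the free-store fix:

#include <memory>

int* x = nullptr;

void S()
{
    int y = 42;
    x = &y;    // x outlives y: once S() returns, this pointer dangles
}

void P()
{
    S();
    // Using *x here is undefined behaviour - y's stack slot has been popped
    // (and may be overwritten by the next call). The fix is the free store:
    auto owned = std::make_unique<int>(42);
    x = owned.get();   // valid for as long as 'owned' lives
}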
#include <iostream>

class Foo {
public:
    Foo(int a) {
    }
};

int func() {
    int a1, a2;
    std::cin >> a1;
    std::cin >> a2;

    Foo f1(a1);
    // __asm push a1;
    // __asm lea ecx, [this];
    // __asm call Foo::Foo(int);

    Foo* f2 = new Foo(a2);
    // __asm push sizeof(Foo);
    // __asm call operator new;   // there are a lot of instructions here (depends on the system)
    // __asm push a2;
    // __asm call Foo::Foo(int);

    delete f2;
    return 0;
}
It would be like this in asm. When you're in func, f1 and the pointer f2 have been allocated on the stack (automatic storage). And by the way, Foo f1(a1) has no instruction effect on the stack pointer (esp); it has already been allocated. If func wants to get the member f1, its instructions are something like: lea ecx, [ebp+f1], call Foo::SomeFunc(). Another thing: stack allocation may make someone think the memory is handled in a FIFO-like way, but that only happens when you enter some function. If you are already in the function and allocate something like int i = 0, no push happens.
It has been mentioned before that stack allocation is simply moving the stack pointer, that is, a single instruction on most architectures. Compare that to what generally happens in the case of heap allocation.
The operating system maintains portions of free memory as a linked list, with the payload data consisting of a pointer to the starting address of the free portion and the size of that portion. To allocate X bytes of memory, the linked list is traversed and each node is visited in sequence, checking whether its size is at least X. When a portion with size P >= X is found, P is split into two parts of sizes X and P-X. The linked list is updated and the pointer to the first part is returned.
As you can see, heap allocation depends on many factors, like how much memory you request, how fragmented the memory is, and so on.
In general, stack allocation is faster than heap allocation as mentioned by almost every answer above. A stack push or pop is O(1), whereas allocating or freeing from a heap could require a walk of previous allocations. However you shouldn't usually be allocating in tight, performance-intensive loops, so the choice will usually come down to other factors.
It might be good to make this distinction: you can use a "stack allocator" on the heap. Strictly speaking, I take stack allocation to mean the actual method of allocation rather than the location of the allocation. If you're allocating a lot of stuff on the actual program stack, that could be bad for a variety of reasons. On the other hand, using a stack method to allocate on the heap when possible is the best choice you can make for an allocation method.
Since you mentioned Metrowerks and PPC, I'm guessing you mean Wii. In this case memory is at a premium, and using a stack allocation method wherever possible guarantees that you don't waste memory on fragments. Of course, doing this requires a lot more care than "normal" heap allocation methods. It's wise to evaluate the tradeoffs for each situation.
Never make premature assumptions, as other application code and usage can impact your function. Looking at a function in isolation is of no use.
If you are serious about the application, then profile it with VTune or any similar profiling tool and look at the hotspots.
Ketan
Naturally, stack allocation is faster. With heap allocation, the allocator has to find the free memory somewhere. With stack allocation, the compiler does it for you by simply giving your function a larger stack frame, which means the allocation costs no time at all. (I'm assuming you're not using alloca or anything to allocate a dynamic amount of stack space, but even then it's very fast.)
However, you do have to be wary of hidden dynamic allocation. For example:
void some_func()
{
    std::vector<int> my_vector(0x1000);
    // Do stuff with the vector...
}
You might think this allocates 4 KiB on the stack, but you'd be wrong. It allocates the vector instance on the stack, but that vector instance in turn allocates its 4 KiB on the heap, because vector always allocates its internal array on the heap (at least unless you specify a custom allocator, which I won't get into here). If you want to allocate on the stack using an STL-like container, you probably want std::array, or possibly boost::static_vector (provided by the external Boost library).
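For instance, a tiny sketch of the std::array alternative just mentioned (the size must be a compile-time constant):

#include <array>

void some_func()
{
    std::array<int, 0x1000> my_array{};   // the 0x1000 ints really do live on the stack
    // Do stuff with my_array...
}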
I'd like to point out that code generated by GCC (and, as I recall, VS too) doesn't have overhead for stack allocation.
Take the following function:
int f(int i)
{
    if (i > 0)
    {
        int array[1000];
    }
}
Here is the generated code:
__Z1fi:
Leh_func_begin1:
    pushq   %rbp
Ltmp0:
    movq    %rsp, %rbp
Ltmp1:
    subq    $3880, %rsp        <--- here the array's space is allocated, even if the if branch is never entered
Ltmp2:
    movl    %edi, -4(%rbp)
    movl    -8(%rbp), %eax
    addq    $3880, %rsp
    popq    %rbp
    ret
Leh_func_end1:
So however many local variables you have (even inside an if or a switch), only the 3880 changes to another value. Unless you have no local variables at all, this instruction just needs to execute. So allocating local variables doesn't add overhead.

C++: High speed stack

As far as I can tell, std::stack and all such 'handmade' stacks work much slower than the application's own stack.
Maybe there's already a good low-level stack implementation out there, so I don't reinvent the wheel?
Or is it a good idea to create a new thread and use its own stack?
And how can I work directly with the application stack? (Only with asm {}?)
std::stack is a collection of C++ objects that has stack semantics. It has nothing to do with a thread's stack or the push and pop instructions in assembler code.
Which one are you trying to do?
The 'assembler' stack is usually maintained by the hardware and required by various calling conventions, so you have no choice about how to 'allocate' it or 'manage' it. Some architectures have highly configurable stacks, but you don't say which architecture you are on.
If you want a collection with stack semantics and you are writing in C++, then std::stack is your choice, unless you can prove that it's not fast enough.
The only way in which std::stack is significantly slower than the processor stack is that it has to allocate memory from the free store. By default, it uses std::deque for storage, which allocates memory in chunks as needed. As long as you don't keep destroying and recreating the stack, it will keep that memory and not need to allocate more unless it grows bigger than before. So structure code like this:
std::stack<int> stack;
for (int i = 0; i < HUGE_NUMBER; ++i)
    do_lots_of_work(stack); // uses stack
rather than:
for (int i = 0; i < HUGE_NUMBER; ++i)
    do_lots_of_work(); // creates its own stack
If, after profiling, you find that it's still spending too long allocating memory, then you could preallocate a large block so you only need a single allocation when your program starts up (assuming you can find an upper limit for the stack size). You need to get into the innards of the stack to do this, but it is possible by deriving your own stack type. Something like this (not tested):
class PreallocatedStack : public std::stack< int, std::vector<int> >
{
public:
    explicit PreallocatedStack(size_t size) { c.reserve(size); }
};
EDIT: this is quite a gruesome hack, but it is supported by the C++ Standard. More tasteful would be to initialise a stack with a reserved vector, at the cost of an extra allocation. And don't try to use this class polymorphically - STL containers aren't designed for that.
Using the processor stack won't be portable, and on some platforms might make it impossible to use local variables after pushing something - you might end up having to code everything in assembly. (That is an option, if you really need to count every last cycle and don't need portability, but make sure you use a profiler to check that it really is worthwhile). There's no way to use another thread's stack that will be faster than a stack container.
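The "more tasteful" variant mentioned above might look like this (a sketch using std::stack's constructor that takes the underlying container):

#include <cstddef>
#include <stack>
#include <vector>

std::stack<int, std::vector<int>> make_preallocated_stack(std::size_t size)
{
    std::vector<int> storage;
    storage.reserve(size);   // the single up-front allocation
    return std::stack<int, std::vector<int>>(std::move(storage));
}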
MInner, are you sure that stack operations are (or can be) bottlenecks of your application? If not - and I'd bet they aren't - just use std::stack and forget about it.
The basic idea that a "handmade" stack is necessarily slower than the one used for function calls is fundamentally flawed. The two work sufficiently similarly that they will typically be close to the same speed. The biggest point that favors the hardware stack is that it's used often enough that the data at or close to the top of that stack will almost always be in the cache. Another stack that you create usually won't be used as often, so there's a much better chance that any given reference will end up going to main memory instead of the cache.
In the other direction, you have a bit more flexibility in allocating memory for your stack. You can create a specialized allocator just for your stack. When the hardware stack overflows, it normally allocates memory using the kernel allocator. The kernel allocator is usually tuned quite carefully, so it's usually pretty efficient -- but it's also extremely general purpose. It can't be written just to do stack allocation really well; it has to be written to do any kind of allocation at least reasonably well. In the process, its ability to do one thing exceptionally well often suffers a bit.
It's certainly possible to create a stack that's arbitrarily slow, but there's no fundamental reason that your stack can't be just as fast (or possibly even faster) than the one provided by the (usual) hardware. I'll repeat though: the single biggest reason for it to be slower is cache allocation, which simply reflects usage.
It depends on your requirements.
If you want to push a user-defined data type onto a stack, you will need a 'handmade' stack.
For other cases - say you want to push integers, chars, or pointers to objects - you can use
asm
{
    push
    pop
}
but don't mess it up.

Which is faster: Stack allocation or Heap allocation

This question may sound fairly elementary, but this is a debate I had with another developer I work with.
I was taking care to stack allocate things where I could, instead of heap allocating them. He was talking to me and watching over my shoulder and commented that it wasn't necessary because they are the same performance wise.
I was always under the impression that growing the stack was constant time, and heap allocation's performance depended on the current complexity of the heap for both allocation (finding a hole of the proper size) and de-allocating (collapsing holes to reduce fragmentation, as many standard library implementations take time to do this during deletes if I am not mistaken).
This strikes me as something that would probably be very compiler dependent. For this project in particular I am using a Metrowerks compiler for the PPC architecture. Insight on this combination would be most helpful, but in general, for GCC, and MSVC++, what is the case? Is heap allocation not as high performing as stack allocation? Is there no difference? Or are the differences so minute it becomes pointless micro-optimization.
Stack allocation is much faster since all it really does is move the stack pointer.
Using memory pools, you can get comparable performance out of heap allocation, but that comes with a slight added complexity and its own headaches.
Also, stack vs. heap is not only a performance consideration; it also tells you a lot about the expected lifetime of objects.
Stack is much faster. It literally only uses a single instruction on most architectures, in most cases, e.g. on x86:
sub esp, 0x10
(That moves the stack pointer down by 0x10 bytes and thereby "allocates" those bytes for use by a variable.)
Of course, the stack's size is very, very finite, as you will quickly find out if you overuse stack allocation or try to do recursion :-)
Also, there's little reason to optimize the performance of code that doesn't verifiably need it, such as demonstrated by profiling. "Premature optimization" often causes more problems than it's worth.
My rule of thumb: if I know I'm going to need some data at compile-time, and it's under a few hundred bytes in size, I stack-allocate it. Otherwise I heap-allocate it.
Honestly, it's trivial to write a program to compare the performance:
#include <ctime>
#include <iostream>
namespace {
class empty { }; // even empty classes take up 1 byte of space, minimum
}
int main()
{
std::clock_t start = std::clock();
for (int i = 0; i < 100000; ++i)
empty e;
std::clock_t duration = std::clock() - start;
std::cout << "stack allocation took " << duration << " clock ticks\n";
start = std::clock();
for (int i = 0; i < 100000; ++i) {
empty* e = new empty;
delete e;
}
duration = std::clock() - start;
std::cout << "heap allocation took " << duration << " clock ticks\n";
}
It's said that a foolish consistency is the hobgoblin of little minds. Apparently optimizing compilers are the hobgoblins of many programmers' minds. This discussion used to be at the bottom of the answer, but people apparently can't be bothered to read that far, so I'm moving it up here to avoid getting questions that I've already answered.
An optimizing compiler may notice that this code does nothing, and may optimize it all away. It is the optimizer's job to do stuff like that, and fighting the optimizer is a fool's errand.
I would recommend compiling this code with optimization turned off because there is no good way to fool every optimizer currently in use or that will be in use in the future.
Anybody who turns the optimizer on and then complains about fighting it should be subject to public ridicule.
If I cared about nanosecond precision I wouldn't use std::clock(). If I wanted to publish the results as a doctoral thesis I would make a bigger deal about this, and I would probably compare GCC, Tendra/Ten15, LLVM, Watcom, Borland, Visual C++, Digital Mars, ICC and other compilers. As it is, heap allocation takes hundreds of times longer than stack allocation, and I don't see anything useful about investigating the question any further.
The optimizer has a mission to get rid of the code I'm testing. I don't see any reason to tell the optimizer to run and then try to fool the optimizer into not actually optimizing. But if I saw value in doing that, I would do one or more of the following:
1. Add a data member to empty, and access that data member in the loop; but if I only ever read from the data member the optimizer can do constant folding and remove the loop; if I only ever write to the data member, the optimizer may skip all but the very last iteration of the loop. Additionally, the question wasn't "stack allocation and data access vs. heap allocation and data access."
2. Declare e volatile, but volatile is often compiled incorrectly.
3. Take the address of e inside the loop (and maybe assign it to a variable that is declared extern and defined in another file). But even in this case, the compiler may notice that -- on the stack at least -- e will always be allocated at the same memory address, and then do constant folding like in (1) above. I get all iterations of the loop, but the object is never actually allocated.
Beyond the obvious, this test is flawed in that it measures both allocation and deallocation, and the original question didn't ask about deallocation. Of course variables allocated on the stack are automatically deallocated at the end of their scope, so not calling delete would (1) skew the numbers (stack deallocation is included in the numbers about stack allocation, so it's only fair to measure heap deallocation) and (2) cause a pretty bad memory leak, unless we keep a reference to the new pointer and call delete after we've got our time measurement.
On my machine, using g++ 3.4.4 on Windows, I get "0 clock ticks" for both stack and heap allocation for anything less than 100000 allocations, and even then I get "0 clock ticks" for stack allocation and "15 clock ticks" for heap allocation. When I measure 10,000,000 allocations, stack allocation takes 31 clock ticks and heap allocation takes 1562 clock ticks.
Yes, an optimizing compiler may elide creating the empty objects. If I understand correctly, it may even elide the whole first loop. When I bumped up the iterations to 10,000,000 stack allocation took 31 clock ticks and heap allocation took 1562 clock ticks. I think it's safe to say that without telling g++ to optimize the executable, g++ did not elide the constructors.
In the years since I wrote this, the preference on Stack Overflow has been to post performance from optimized builds. In general, I think this is correct. However, I still think it's silly to ask the compiler to optimize code when you in fact do not want that code optimized. It strikes me as being very similar to paying extra for valet parking, but refusing to hand over the keys. In this particular case, I don't want the optimizer running.
Using a slightly modified version of the benchmark (to address the valid point that the original program didn't allocate something on the stack each time through the loop) and compiling without optimizations but linking to release libraries (to address the valid point that we don't want to include any slowdown caused by linking to debug libraries):
#include <cstdio>
#include <chrono>
namespace {
void on_stack()
{
int i;
}
void on_heap()
{
int* i = new int;
delete i;
}
}
int main()
{
auto begin = std::chrono::system_clock::now();
for (int i = 0; i < 1000000000; ++i)
on_stack();
auto end = std::chrono::system_clock::now();
std::printf("on_stack took %f seconds\n", std::chrono::duration<double>(end - begin).count());
begin = std::chrono::system_clock::now();
for (int i = 0; i < 1000000000; ++i)
on_heap();
end = std::chrono::system_clock::now();
std::printf("on_heap took %f seconds\n", std::chrono::duration<double>(end - begin).count());
return 0;
}
displays:
on_stack took 2.070003 seconds
on_heap took 57.980081 seconds
on my system when compiled with the command line cl foo.cc /Od /MT /EHsc.
You may not agree with my approach to getting a non-optimized build. That's fine: feel free modify the benchmark as much as you want. When I turn on optimization, I get:
on_stack took 0.000000 seconds
on_heap took 51.608723 seconds
Not because stack allocation is actually instantaneous but because any half-decent compiler can notice that on_stack doesn't do anything useful and can be optimized away. GCC on my Linux laptop also notices that on_heap doesn't do anything useful, and optimizes it away as well:
on_stack took 0.000003 seconds
on_heap took 0.000002 seconds
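For readers who do insist on optimized numbers, one hedged variant (in the spirit of option (3) above, taking the address or value and writing it somewhere the compiler cannot see through) is to route each result through a volatile sink. This is only a sketch: since C++14 the compiler is allowed to elide the new/delete pair anyway, so even this may not measure what you hope it measures.

#include <chrono>
#include <cstdio>

namespace {
volatile int sink;   // forces the stores below to actually happen

void on_stack() {
    int i = 42;             // automatic storage
    sink = i;
}

void on_heap() {
    int* i = new int(42);   // dynamic storage (may still be elided per C++14)
    sink = *i;
    delete i;
}
}

int main() {
    auto time_it = [](void (*f)()) {
        auto begin = std::chrono::steady_clock::now();
        for (int i = 0; i < 100000000; ++i) f();
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(end - begin).count();
    };
    std::printf("on_stack took %f seconds\n", time_it(on_stack));
    std::printf("on_heap  took %f seconds\n", time_it(on_heap));
}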
An interesting thing I learned about Stack vs. Heap Allocation on the Xbox 360 Xenon processor, which may also apply to other multicore systems, is that allocating on the Heap causes a Critical Section to be entered to halt all other cores so that the alloc doesn't conflict. Thus, in a tight loop, Stack Allocation was the way to go for fixed sized arrays as it prevented stalls.
This may be another speedup to consider if you're coding for multicore/multiproc, in that your stack allocation will only be viewable by the core running your scoped function, and that will not affect any other cores/CPUs.
You can write a special heap allocator for specific sizes of objects that is very performant. However, the general heap allocator is not particularly performant.
Also I agree with Torbjörn Gyllebring about the expected lifetime of objects. Good point!
Concerns Specific to the C++ Language
First of all, there is no "stack" or "heap" allocation mandated by C++. If you are talking about automatic objects in block scopes, they are not even "allocated". (By the way, automatic storage duration in C is definitely not the same as "allocated"; the latter is "dynamic" in C++ parlance.) Dynamically allocated memory is on the free store, not necessarily on "the heap", though the latter is often the (default) implementation.
Although as per the abstract machine semantic rules, automatic objects still occupy memory, a conforming C++ implementation is allowed to ignore this fact when it can prove this does not matter (when it does not change the observable behavior of the program). This permission is granted by the as-if rule in ISO C++, which is also the general clause enabling the usual optimizations (and there is also an almost same rule in ISO C). Besides the as-if rule, ISO C++ also has copy elision rules to allow omission of specific creations of objects. The constructor and destructor calls involved are thereby omitted. As a result, the automatic objects (if any) in these constructors and destructors are also eliminated, compared to naive abstract semantics implied by the source code.
On the other hand, free store allocation is definitely "allocation" by design. Under ISO C++ rules, such an allocation can be achieved by a call of an allocation function. However, since ISO C++14, there is a new (non-as-if) rule to allow merging global allocation function (i.e. ::operator new) calls in specific cases. So parts of dynamic allocation operations can also be no-op like the case of automatic objects.
Allocation functions allocate memory resources, and objects can then be constructed in that memory using allocators. Automatic objects, by contrast, are simply there: their underlying memory can be accessed and used to provide storage for other objects (via placement new), but this makes little sense as a substitute for the free store, because there is no way to move those resources elsewhere.
All other concerns are out of the scope of C++. Nevertheless, they can be still significant.
About Implementations of C++
Since C++ does not expose reified activation records or any sort of first-class continuation (e.g. the famous call/cc), there is no way to directly manipulate the activation record frames - the place where the implementation needs to put the automatic objects. As long as there are no (non-portable) interoperations with the underlying implementation ("native" non-portable code, such as inline assembly), omitting the underlying allocation of the frames can be quite trivial. For example, when the called function is inlined, the frames can effectively be merged into others, so there is nothing left to show as "the allocation".
However, once interop is respected, things get complex. A typical implementation of C++ exposes the ability to interoperate at the ISA (instruction-set architecture) level, with some calling conventions as the binary boundary shared with native (ISA-level machine) code. This is measurably costly, notably when maintaining the stack pointer, which is often held directly in an ISA-level register (with possibly specific machine instructions to access it). The stack pointer indicates the boundary of the top frame of the (currently active) function call. When a function call is entered, a new frame is needed, and the stack pointer is added to or subtracted from (depending on the convention of the ISA) by a value not less than the required frame size; the frame is then said to be allocated once the stack pointer has been adjusted. Parameters of functions may be passed on the stack frame as well, depending on the calling convention used for the call. The frame can hold the memory of the automatic objects (possibly including the parameters) specified by the C++ source code. In the sense of such implementations, these objects are "allocated". When control exits the function call, the frame is no longer needed, and it is usually released by restoring the stack pointer to its state before the call (saved previously according to the calling convention). This can be viewed as "deallocation". These operations make the activation records effectively a LIFO data structure, so it is often called "the (call) stack". The stack pointer effectively indicates the top position of the stack.
Because most C++ implementations (particularly those targeting ISA-level native code and using assembly language as their immediate output) use similar strategies, this somewhat confusing "allocation" scheme is popular. Such allocations (as well as deallocations) do spend machine cycles, and they can be expensive when (non-optimized) calls occur frequently, even though modern CPU microarchitectures can have complex optimizations implemented in hardware for the common code pattern (like using a stack engine for the PUSH/POP instructions).
But anyway, in general, it is true that the cost of stack frame allocation is significantly less than a call to an allocation function operating on the free store (unless it is totally optimized away): while the former only maintains the stack pointer and a bit of other state, the latter can itself involve hundreds of (if not millions of :-) operations. Allocation functions are typically based on an API provided by the hosted environment (e.g. a runtime provided by the OS). Unlike holding automatic objects for function calls, such allocations are general-purpose, so they will not have a frame structure like a stack. Traditionally, they allocate space from a pool of storage called the heap (or several heaps). Unlike "the stack", the concept "heap" here does not indicate the data structure being used; it derives from early language implementations decades ago. (By the way, the call stack is usually allocated with a fixed or user-specified size from the heap by the environment at program/thread startup.) The nature of the use cases makes allocations and deallocations from a heap far more complicated (than pushing/popping stack frames), and hardly possible to optimize directly in hardware.
Effects on Memory Access
The usual stack allocation always puts the new frame on top, so it has quite good locality, which is friendly to the cache. OTOH, memory allocated randomly in the free store has no such property. Since ISO C++17, there are pool resource templates provided by <memory_resource>. The direct purpose of such an interface is to allow the results of consecutive allocations to be close together in memory. This acknowledges the fact that this strategy is generally good for performance with contemporary implementations, e.g. being friendly to the cache in modern architectures. This is about the performance of access rather than allocation, though.
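As a hedged illustration of those <memory_resource> pool resources (C++17 or later assumed), a monotonic_buffer_resource can serve consecutive allocations out of a single buffer - even one with automatic storage duration - so the allocations end up close together:

#include <array>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

int main() {
    // A buffer with automatic storage duration; the resource carves
    // consecutive allocations out of it, keeping them close together.
    std::array<std::byte, 4096> buffer;
    std::pmr::monotonic_buffer_resource arena(buffer.data(), buffer.size());

    std::pmr::vector<std::pmr::string> names(&arena);
    names.emplace_back("this string's storage comes from the buffer above");
    names.emplace_back("and, if it doesn't fit the small-string buffer, so does this one");
    // Nothing is released until arena goes out of scope (monotonic = no reuse).
}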
Concurrency
Expectation of concurrent access to memory can have different effects between the stack and heaps. A call stack is usually exclusively owned by one thread of execution in a typical C++ implementation. OTOH, heaps are often shared among the threads in a process. For such heaps, the allocation and deallocation functions have to protect the shared internal administrative data structures from data races. As a result, heap allocations and deallocations may have additional overhead due to internal synchronization operations.
Space Efficiency
Due to the nature of the use cases and the internal data structures, heaps may suffer from internal memory fragmentation, while the stack does not. This does not have a direct impact on the performance of memory allocation, but in a system with virtual memory, low space efficiency may degrade the overall performance of memory access. This is particularly awful when an HDD is used as swap for physical memory; it can cause quite long latency - sometimes billions of cycles.
Limitations of Stack Allocations
Although stack allocations are often superior in performance to heap allocations in reality, that certainly does not mean stack allocations can always replace heap allocations.
First, there is no portable way in ISO C++ to allocate space on the stack with a size specified at runtime. There are extensions provided by implementations, like alloca and G++'s VLA (variable-length array), but there are reasons to avoid them. (IIRC, the Linux kernel source removed its uses of VLAs recently.) (Also note that ISO C99 mandates VLAs, but ISO C11 makes the support optional.)
Second, there is no reliable and portable way to detect stack space exhaustion. This is often called stack overflow (hmm, the etymology of this site), but probably more accurately, stack overrun. In reality, this often causes an invalid memory access, and the state of the program is then corrupted (... or maybe worse, a security hole). In fact, ISO C++ has no concept of "the stack" and makes it undefined behavior when the resource is exhausted. Be cautious about how much room to leave for automatic objects.
If the stack space runs out, there are too many objects allocated on the stack, which can be caused by too many active function calls or improper use of automatic objects. Such cases may suggest the existence of bugs, e.g. a recursive function call without a correct exit condition.
Nevertheless, deep recursive calls are sometimes desired. In implementations of languages requiring support for unbounded active calls (where the call depth is limited only by total memory), it is impossible to use the (contemporary) native call stack directly as the target language's activation records, as typical C++ implementations do. To work around the problem, alternative ways of constructing activation records are needed. For example, SML/NJ explicitly allocates frames on the heap and uses cactus stacks. The complicated allocation of such activation record frames is usually not as fast as that of call stack frames. However, if such languages are implemented further with a guarantee of proper tail recursion, direct stack allocation in the object language (that is, where an "object" in the language is not stored as a reference but as a native primitive value that can be mapped one-to-one to an unshared C++ object) is even more complicated, with a greater performance penalty in general. When using C++ to implement such languages, it is difficult to estimate the performance impact.
I don't think stack allocation and heap allocation are generally interchangeable. I also hope that the performance of both of them is sufficient for general use.
I'd strongly recommend for small items, whichever one is more suitable to the scope of the allocation. For large items, the heap is probably necessary.
On 32-bit operating systems that have multiple threads, the stack is often rather limited (albeit typically to at least a few MB), because the address space needs to be carved up and sooner or later one thread stack will run into another. On single-threaded systems (Linux glibc single-threaded, anyway) the limitation is much less severe because the stack can just grow and grow.
On 64-bit operating systems there is enough address space to make thread stacks quite large.
Usually stack allocation just consists of subtracting from the stack pointer register. This is tons faster than searching a heap.
Sometimes stack allocation requires adding a page (or pages) of virtual memory. Adding a new page of zeroed memory doesn't require reading a page from disk, so usually this is still going to be tons faster than searching a heap (especially if part of the heap was paged out too). In rare situations, and you could construct such an example, enough space just happens to be available in a part of the heap which is already in RAM, but allocating a new page for the stack has to wait for some other page to be written out to disk. In that rare situation, the heap is faster.
Aside from the orders-of-magnitude performance advantage over heap allocation, stack allocation is preferable for long running server applications. Even the best managed heaps eventually get so fragmented that application performance degrades.
Probably the biggest problem of heap allocation versus stack allocation, is that heap allocation in the general case is an unbounded operation, and thus you can't use it where timing is an issue.
For other applications where timing isn't as critical, it may not matter as much, but if you heap-allocate a lot, it will still affect execution speed. Always try to use the stack for short-lived and frequently allocated memory (for instance, in loops), and, as far as possible, do heap allocation during application startup.
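A small sketch of that advice (the function names here are invented for illustration): hoist the allocation out of the hot loop and reuse the capacity.

#include <string>
#include <vector>

namespace {
// Stand-in for real per-record work; the point is only where allocation happens.
void process_record(std::vector<char>& scratch, const std::string& record) {
    scratch.assign(record.begin(), record.end());
}
}

void process_all(const std::vector<std::string>& records) {
    std::vector<char> scratch;
    scratch.reserve(64 * 1024);          // one heap allocation, up front
    for (const auto& record : records) {
        scratch.clear();                 // reuses capacity; no reallocation
        process_record(scratch, record);
    }
}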
A stack has a limited capacity, while a heap does not. The default stack for a process or thread is typically somewhere between a few hundred kilobytes and a few megabytes, depending on the platform, and you cannot change the size once it's allocated.
A stack variable follows scoping rules, while a heap one doesn't: once execution leaves a function, all the automatic variables associated with it go away.
Most important of all, you can't predict the overall function call chain in advance, so even a mere 200-byte allocation on your part may raise a stack overflow. This is especially important if you're writing a library, not an application.
It's not just stack allocation that's faster. You also win a lot by using stack variables: they have better locality of reference. And finally, deallocation is a lot cheaper too.
As others have said, stack allocation is generally much faster.
However, if your objects are expensive to copy, allocating on the stack may lead to a huge performance hit later when you use the objects, if you aren't careful.
For example, if you allocate something on the stack, and then put it into a container, it would have been better to allocate on the heap and store the pointer in the container (e.g. with a std::shared_ptr<>). The same thing is true if you are passing or returning objects by value, and other similar scenarios.
The point is that although stack allocation is usually better than heap allocation in many cases, sometimes if you go out of your way to stack allocate when it doesn't best fit the model of computation, it can cause more problems than it solves.
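As a hedged sketch of the container case above (Big is a made-up, expensive-to-copy type):

#include <array>
#include <memory>
#include <vector>

struct Big {                                        // invented expensive-to-copy type
    std::array<double, 1024> payload{};
};

int main() {
    std::vector<Big> by_value;
    Big b{};                                        // lives on the stack...
    by_value.push_back(b);                          // ...but gets copied into the vector anyway

    std::vector<std::shared_ptr<Big>> by_pointer;
    by_pointer.push_back(std::make_shared<Big>());  // one heap object, no bulk copies
}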
I think the lifetime is crucial, and whether the thing being allocated has to be constructed in a complex way. For example, in transaction-driven modeling, you usually have to fill in and pass in a transaction structure with a bunch of fields to operation functions. Look at the OSCI SystemC TLM-2.0 standard for an example.
Allocating these on the stack close to the call to the operation tends to cause enormous overhead, as the construction is expensive. The good way there is to allocate on the heap and reuse the transaction objects either by pooling or a simple policy like "this module only needs one transaction object ever".
This is many times faster than allocating the object on each operation call.
The reason is simply that the object has an expensive construction and a fairly long useful lifetime.
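A hedged sketch of the "this module only needs one transaction object ever" policy (the Transaction type and its fields are invented here; a real TLM-2.0 payload looks different):

#include <cstddef>
#include <cstdint>
#include <vector>

struct Transaction {                 // invented stand-in for a TLM-style payload
    std::uint64_t address = 0;
    std::vector<std::uint8_t> data;  // expensive to construct over and over
};

class Module {
public:
    Transaction& acquire(std::uint64_t addr, std::size_t bytes) {
        // Reuse the same object for every operation instead of constructing
        // a fresh Transaction (and its buffer) on each call.
        txn_.address = addr;
        txn_.data.resize(bytes);     // reuses existing capacity when possible
        return txn_;
    }

private:
    Transaction txn_;                // constructed once, lives as long as the module
};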
I would say: try both and see what works best in your case, because it can really depend on the behavior of your code.
Stack allocation is a couple of instructions, whereas the fastest RTOS heap allocator known to me (TLSF) uses on average on the order of 150 instructions. Also, stack allocations don't require a lock, because each thread has its own stack, which is another huge performance win. So stack allocations can be 2-3 orders of magnitude faster, depending on how heavily multithreaded your environment is.
In general heap allocation is your last resort if you care about performance. A viable in-between option can be a fixed pool allocator which is also only a couple instructions and has very little per-allocation overhead so it's great for small fixed size objects. On the downside it only works with fixed size objects, is not inherently thread safe and has block fragmentation problems.
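A minimal sketch of such a fixed-size pool: the free blocks are linked through their own storage, so allocate and free are each a handful of instructions (no thread safety, a single block size, and object-lifetime formalities are glossed over):

#include <cstddef>

// Fixed-size block pool; assumes BlockSize is a multiple of the needed alignment.
template <std::size_t BlockSize, std::size_t BlockCount>
class FixedPool {
    static_assert(BlockSize >= sizeof(void*), "block must be able to hold a pointer");
public:
    FixedPool() {
        // Thread the free list through the raw buffer.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            void* block = buffer_ + i * BlockSize;
            *static_cast<void**>(block) = free_;
            free_ = block;
        }
    }
    void* allocate() {
        if (!free_) return nullptr;             // pool exhausted
        void* block = free_;
        free_ = *static_cast<void**>(block);    // pop the free list
        return block;
    }
    void deallocate(void* block) {
        *static_cast<void**>(block) = free_;    // push back onto the free list
        free_ = block;
    }

private:
    alignas(std::max_align_t) unsigned char buffer_[BlockSize * BlockCount];
    void* free_ = nullptr;
};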
There's a general point to be made about such optimizations.
The optimization you get is proportional to the amount of time the program counter is actually in that code.
If you sample the program counter, you will find out where it spends its time, and that is usually in a tiny part of the code, and often in library routines you have no control over.
Only if you find it spending much time in the heap-allocation of your objects will it be noticeably faster to stack-allocate them.
Stack allocation will almost always be as fast or faster than heap allocation, although it is certainly possible for a heap allocator to simply use a stack based allocation technique.
However, there are larger issues when dealing with the overall performance of stack vs. heap based allocation (or in slightly better terms, local vs. external allocation). Usually, heap (external) allocation is slow because it is dealing with many different kinds of allocations and allocation patterns. Reducing the scope of the allocator you are using (making it local to the algorithm/code) will tend to increase performance without any major changes. Adding better structure to your allocation patterns, for example, forcing a LIFO ordering on allocation and deallocation pairs can also improve your allocator's performance by using the allocator in a simpler and more structured way. Or, you can use or write an allocator tuned for your particular allocation pattern; most programs allocate a few discrete sizes frequently, so a heap that is based on a lookaside buffer of a few fixed (preferably known) sizes will perform extremely well. Windows uses its low-fragmentation-heap for this very reason.
On the other hand, stack-based allocation on a 32-bit memory range is also fraught with peril if you have too many threads. Stacks need a contiguous memory range, so the more threads you have, the more virtual address space you will need for them to run without a stack overflow. This won't be a problem (for now) with 64-bit, but it can certainly wreak havoc in long running programs with lots of threads. Running out of virtual address space due to fragmentation is always a pain to deal with.
Note that the considerations when choosing stack versus heap allocation are typically not about speed and performance. The stack acts like a stack, which means it is well suited for pushing blocks and popping them again, last in, first out. Execution of procedures is also stack-like: the last procedure entered is the first to be exited. In most programming languages, all the variables needed in a procedure are only visible during that procedure's execution, thus they are pushed upon entering the procedure and popped off the stack upon exit or return.
Now for an example where the stack cannot be used:
Proc P
{
    pointer x;
    Proc S
    {
        pointer y;
        y = allocate_some_data();
        x = y;
    }
}
If you allocate some memory in procedure S and put it on the stack and then exit S, the allocated data will be popped off the stack. But the variable x in P also pointed to that data, so x is now pointing to some place underneath the stack pointer (assume stack grows downwards) with an unknown content. The content might still be there if the stack pointer is just moved up without clearing the data beneath it, but if you start allocating new data on the stack, the pointer x might actually point to that new data instead.
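The same pitfall rendered as plain C++ (a sketch; the nested procedures above are flattened into a global pointer plus two functions):

int* x;                  // plays the role of x in P

void S() {
    int y = 42;          // lives in S's stack frame
    x = &y;              // x now points into that frame
}                        // frame released here: x dangles from now on

void P() {
    S();
    // Reading *x here is undefined behaviour: the memory may still hold 42,
    // or it may already have been reused by a later call's stack frame.
}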
#include <iostream>

class Foo {
public:
    Foo(int a) {
    }
};

int func() {
    int a1, a2;
    std::cin >> a1;
    std::cin >> a2;

    Foo f1(a1);
    // Roughly what the compiler emits for the stack object (MSVC-style pseudo-asm):
    //   push a1
    //   lea  ecx, [f1]          ; this = &f1
    //   call Foo::Foo(int)

    Foo* f2 = new Foo(a2);
    //   push sizeof(Foo)
    //   call operator new       ; a lot of instructions here (depends on the system)
    //   push a2
    //   call Foo::Foo(int)

    delete f2;
    return 0;
}
The commented pseudo-assembly shows roughly what this looks like. When you are inside func, both f1 and the pointer f2 have already been allocated on the stack (automatic storage). By the way, Foo f1(a1) has no instruction effect on the stack pointer (esp): the space was already reserved, and if func wants to access the member f1, its instructions are something like lea ecx, [ebp+f1] followed by call Foo::SomeFunc(). Another thing: stack allocation may make someone think the memory works as per-variable pushes and pops, but the push really only happens when you enter a function; once you are inside the function and define something like int i = 0, no additional push happens.
It has been mentioned before that stack allocation is simply moving the stack pointer, that is, a single instruction on most architectures. Compare that to what generally happens in the case of heap allocation.
The heap allocator (with help from the operating system on some platforms) maintains the portions of free memory as a linked list, with the payload data consisting of a pointer to the starting address of the free portion and the size of the free portion. To allocate X bytes of memory, the linked list is traversed and each node is visited in sequence, checking to see whether its size is at least X. When a portion with size P >= X is found, P is split into two parts with sizes X and P-X. The linked list is updated and the pointer to the first part is returned.
As you can see, heap allocation depends on many factors, like how much memory you are requesting, how fragmented the memory is, and so on.
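A hedged, toy sketch of the first-fit walk just described (not how any production malloc actually works; header bookkeeping and coalescing are ignored):

#include <cstddef>

struct FreeBlock {
    std::size_t size;    // bytes available in this free portion
    FreeBlock*  next;    // next free portion in the list
};

// Walk the free list until a portion of at least `bytes` is found,
// split it, and return the carved-off piece; nullptr if nothing fits.
void* first_fit(FreeBlock*& head, std::size_t bytes) {
    for (FreeBlock** link = &head; *link; link = &(*link)->next) {
        FreeBlock* block = *link;
        if (block->size < bytes)
            continue;                             // keep walking: cost grows with list length
        if (block->size == bytes) {
            *link = block->next;                  // exact fit: unlink the whole block
            return block;
        }
        block->size -= bytes;                     // split: shrink the free part...
        return reinterpret_cast<char*>(block) + block->size;   // ...and hand out the tail
    }
    return nullptr;
}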
In general, stack allocation is faster than heap allocation as mentioned by almost every answer above. A stack push or pop is O(1), whereas allocating or freeing from a heap could require a walk of previous allocations. However you shouldn't usually be allocating in tight, performance-intensive loops, so the choice will usually come down to other factors.
It might be good to make this distinction: you can use a "stack allocator" on the heap. Strictly speaking, I take stack allocation to mean the actual method of allocation rather than the location of the allocation. If you're allocating a lot of stuff on the actual program stack, that could be bad for a variety of reasons. On the other hand, using a stack method to allocate on the heap when possible is the best choice you can make for an allocation method.
Since you mentioned Metrowerks and PPC, I'm guessing you mean Wii. In this case memory is at a premium, and using a stack allocation method wherever possible guarantees that you don't waste memory on fragments. Of course, doing this requires a lot more care than "normal" heap allocation methods. It's wise to evaluate the tradeoffs for each situation.
Never make premature assumptions, as other application code and usage can impact your function, so looking at a function in isolation is of little use.
If you are serious about the application, then profile it with VTune or a similar profiling tool and look at the hotspots.
Ketan
Naturally, stack allocation is faster. With heap allocation, the allocator has to find the free memory somewhere. With stack allocation, the compiler does it for you by simply giving your function a larger stack frame, which means the allocation costs no time at all. (I'm assuming you're not using alloca or anything to allocate a dynamic amount of stack space, but even then it's very fast.)
However, you do have to be wary of hidden dynamic allocation. For example:
#include <vector>

void some_func()
{
    std::vector<int> my_vector(0x1000);
    // Do stuff with the vector...
}
You might think this allocates 16 KiB (0x1000 four-byte ints) on the stack, but you'd be wrong. It allocates the vector object itself on the stack, but that vector in turn allocates its 16 KiB of element storage on the heap, because vector always allocates its internal array on the heap (at least unless you specify a custom allocator, which I won't get into here). If you want to allocate on the stack using an STL-like container, you probably want std::array, or possibly boost::static_vector (provided by the external Boost library).
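For comparison, a stack-only counterpart using std::array (a sketch; note that 16 KiB is already a noticeable chunk of a default thread stack):

#include <array>

void some_func()
{
    std::array<int, 0x1000> my_array{};   // all 16 KiB live in this function's stack frame
    // Do stuff with the array...
}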
I'd like to point out that code generated by GCC (and, as I remember, VS too) doesn't have per-variable overhead for stack allocation.
Take the following function:
int f(int i)
{
    if (i > 0)
    {
        int array[1000];   // only reserves space; never actually used
    }
    // (no return statement; this toy function exists only to inspect the frame setup)
}
Here is the generated code:
__Z1fi:
Leh_func_begin1:
        pushq   %rbp
Ltmp0:
        movq    %rsp, %rbp
Ltmp1:
        subq    $3880, %rsp        <--- here the array is allocated, even when the if branch is not taken
Ltmp2:
        movl    %edi, -4(%rbp)
        movl    -8(%rbp), %eax
        addq    $3880, %rsp
        popq    %rbp
        ret
Leh_func_end1:
So no matter how many local variables you have (even inside an if or a switch), only the 3880 changes to another value. Unless you have no local variables at all, this instruction has to execute anyway, so allocating a local variable adds no overhead.