Increase stack size to use alloca()? - c++

This is two overlapping questions: I wish to try out alloca() for large arrays instead of allocating dynamically-sized arrays on the heap. This is so that I can increase performance without having to make heap allocations. However, I get the impression that stack sizes are usually quite small. Are there any disadvantages to increasing the size of my stack so that I can take full advantage of alloca()? Is it the case that the more RAM I have, the larger I can proportionally make my stack?
EDIT1: Preferably Linux
EDIT2: I don't have a specific size in mind; I would rather know how to judge what determines the limit/boundaries.

Stack sizes are (by default) 8 MB on most Unix-y platforms, and 1 MB on Windows (largely because Windows has a deterministic way of recovering from out-of-stack problems, while Unix-y platforms usually just raise a generic SIGSEGV).
If your allocations are large, you won't see much of a performance difference between allocating on the heap versus allocating on the stack. Sure, the stack is slightly more efficient per allocation, but if your allocations are large the number of allocations is likely to be small.
If you want a larger stack-like structure you can always write your own allocator which obtains a large block from malloc and then handles allocation/deallocation in a stack-like fashion.
#include <cstddef>
#include <new>

class StackLikeAllocator
{
    std::size_t usedSize;
    std::size_t maximumSize;
    char *memory; // char* so that delete[] matches the new[] below

public:
    explicit StackLikeAllocator(std::size_t backingSize)
    {
        memory = new char[backingSize];
        usedSize = 0;
        maximumSize = backingSize;
    }

    ~StackLikeAllocator()
    {
        delete[] memory;
    }

    void *Allocate(std::size_t desiredSize)
    {
        // You would have to make sure alignment was correct for your
        // platform (exercise for the reader)
        std::size_t newUsedSize = usedSize + desiredSize;
        if (newUsedSize > maximumSize)
        {
            // std::bad_alloc takes no message; use a custom exception type
            // if you need to report "exceeded maximum size for this allocator"
            throw std::bad_alloc();
        }
        void *result = memory + usedSize;
        usedSize = newUsedSize;
        return result;
    }

    // If you need to support deallocation then modifying this shouldn't be
    // too difficult
};
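A possible use looks like this (names and sizes are illustrative, and note the alignment caveat inside Allocate): grab one big backing block up front, then hand out cheap pointer-bump allocations in the hot path.
int main()
{
    StackLikeAllocator arena(1 << 20); // 1 MB backing store
    double *samples = static_cast<double *>(arena.Allocate(4096 * sizeof(double)));
    int *counts = static_cast<int *>(arena.Allocate(256 * sizeof(int)));
    // ... use samples and counts; everything is released at once when
    // arena goes out of scope ...
    return 0;
}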

The default stack size that the main thread of a program gets is a compiler-specific (and/or OS-specific) thing, and you should see the appropriate documentation to find out how to enlarge it (on Linux, for example, the main thread's stack limit is controlled by ulimit -s, i.e. setrlimit(RLIMIT_STACK)).
You may be unable to enlarge the program's default stack to an arbitrarily large size.
You may, however, as has been pointed out, be able to create a thread at run time with a stack of the size you want.
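For example, on Linux you can ask for a bigger stack when spawning a worker thread via the POSIX threads API (a minimal sketch; the 64 MB figure is arbitrary; compile with -pthread):
#include <pthread.h>

void *worker(void *)
{
    // Safe here because we asked for a 64 MB stack; this array would
    // overflow the default 8 MB stack.
    char big[32 * 1024 * 1024];
    big[sizeof big - 1] = 0;
    return nullptr;
}

int main()
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024 * 1024); // request a 64 MB stack
    pthread_t t;
    pthread_create(&t, &attr, worker, nullptr);
    pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}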
In any event, there's not much benefit to alloca() over a large buffer allocated once; you don't need to free and reallocate such a buffer many times.

The most important difference between alloca() and new / malloc() is that all memory allocated with alloca() will be gone when you return from the current function.
alloca() is only useful for small temporary data structures: big data structures will destroy the cache locality of your stack, which will give you a rather big performance hit. The same goes for large arrays as local variables.
Use alloca() only in very specific circumstances. If unsure, do not use it at all.
The general rule is: Do not put big data structures (>= 1k) on the stack. The stack does not scale. It is a very limited resource.
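If you do use it, the typical shape is a small, short-lived scratch buffer that dies with the function (a minimal sketch; print_upper is just an illustrative name):
#include <alloca.h> // Linux; some platforms declare alloca in <stdlib.h>
#include <cctype>
#include <cstdio>
#include <cstring>

void print_upper(const char *s)
{
    std::size_t n = std::strlen(s) + 1;
    // Gone when this function returns: never return or store this pointer,
    // and keep n small to avoid blowing the stack.
    char *tmp = static_cast<char *>(alloca(n));
    for (std::size_t i = 0; i < n; ++i)
        tmp[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[i])));
    std::puts(tmp);
}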

To answer the first question: The stack size is typically small relative to the heap size (this would hold true in most Linux applications).
If the allocations you are planning are large relative to the actual default stack size, then I think it would be better to use dynamic allocation from the heap (rather than trying to increase stack sizes). The cost of using the memory (filling it, reading it, manipulating it) is probably going to far exceed the cost of the allocation. It is unlikely that you would see a measurable benefit by allocating from the stack in this scenario.

Related

Does dynamic allocation store data in random locations in the heap?

I know that local variables are stored on the stack in order, but what happens when I dynamically allocate variables on the heap in C++ like this?
int * a = new int{1};
int * a2 = new int{2};
int * a3 = new int{3};
int * a4 = new int{4};
Question 1: are these variables stored in contiguous memory locations?
Question 2: if not, is it because dynamic allocation stores variables at random locations in the heap?
Question 3: does dynamic allocation therefore increase the possibility of cache misses and have low spatial locality?
Part 1: Are separate allocations contiguous?
The answer is probably not. How dynamic allocation occurs is implementation dependent. If you allocate memory like in the above example, two separate allocations might be contiguous, but there is no guarantee of this happening (and it should never be relied on to occur).
Different implementations of C++ use different algorithms for deciding how memory is allocated.
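An easy way to observe this on your own implementation is to print the addresses; the output varies by platform and allocator:
#include <iostream>

int main()
{
    int *a = new int{1};
    int *a2 = new int{2};
    // Adjacent or not is implementation dependent; typical allocators put
    // headers and alignment padding between blocks.
    std::cout << a << '\n' << a2 << '\n';
    delete a;
    delete a2;
    return 0;
}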
Part 2: Is allocation random?
Somewhat, but not entirely. Memory doesn't get allocated in an intentionally random fashion. Oftentimes memory allocators will try to allocate blocks of memory near each other in order to minimize page faults and cache misses, but it's not always possible to do so.
Allocation happens in two stages:
The allocator asks for a large chunk of memory from the OS
The allocator takes pieces of that large chunk and returns them whenever you call new, until you ask for more memory than it has available, in which case it asks the OS for another large chunk.
This second stage is where an implementation can make attempts to give you memory that's near other recent allocations; however, it has little control over the first stage (and the OS usually just provides whatever memory is available, without any knowledge of other allocations made by your program).
Part 3: avoiding cache misses
If cache misses are a bottleneck in your code,
Try to reduce the amount of indirection (by having arrays store objects by value, rather than by pointer);
Ensure that the memory you’re operating on is as contiguous as the design permits (so use a std::array or std::vector, instead of a linked list, and prefer a few big allocations to lots of small ones); and
Try to design the algorithm so that it has to jump around in memory as little as possible.
A good general principle is to just use a std::vector of objects unless you have a good reason to use something fancier. Because it has better cache locality, std::vector is often faster at inserting and deleting elements than std::list, even for containers of dozens or hundreds of elements.
Finally: try to take advantage of the stack. Unless there's a good reason for something to be a pointer, just declare it as a variable that lives on the stack. When possible,
Prefer to use MyClass x{}; instead of MyClass* x = new MyClass{};, and
Prefer std::vector<MyClass> instead of std::vector<MyClass*>.
By extension, if you can use static polymorphism (i.e., templates), use that instead of dynamic polymorphism.
IMHO this is operating-system specific and C++ standard library implementation specific.
new ultimately uses lower-level virtual memory allocation services, allocating several pages at once using system calls like mmap and munmap. The implementation of new can reuse previously freed memory space when relevant.
The implementation of new could use various and different strategies for "large" and "small" allocations.
In the example you gave, the first new results in a system call for memory allocation (usually several pages); the allocated memory could be large enough that subsequent new calls result in contiguous allocations. But this depends on the implementation.
In short:
not at all (there is padding due to alignment, heap housekeeping data, allocated chunks may be reused, etc.),
not at all (AFAIK, heap algorithms are deterministic without any randomness),
generally yes (e.g., memory pooling might help here).
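As a hedged illustration of the pooling idea in the last point: keep related objects in one contiguous allocation and link them by index, so traversals stay cache-friendly.
#include <vector>

struct Node { int value; int next_index; };

int main()
{
    std::vector<Node> pool;
    pool.reserve(1024);       // one heap allocation up front
    pool.push_back({1, -1});  // nodes are contiguous in memory,
    pool.push_back({2, 0});   // linked by index instead of by pointer
    return 0;
}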

Vector object that allocates on stack for small size, or on heap for larger ones

In real-time applications, e.g. audio programming, you should avoid allocating heap memory during callbacks, because execution time is unbounded. Indeed, if your process has run out of memory, you'll need to wait for the OS to allocate a new chunk, which can take longer than the time until the next callback. I could put the data on the stack, e.g. using variable-length arrays (VLAs) or alloca(), but if the array is too large I get a stack overflow.
Instead, I was thinking of defining a class with an interface similar to std::vector, but that internally uses the stack if the size is smaller than a certain threshold, and the heap otherwise (I prefer a possible unbounded operation to a certain stack overflow). For the heap part I could use an std::vector or new/delete. But what about the stack? VLAs and alloca() are deallocated when they get out of scope. What alternative could I use?
I could use an std::array<T, threshold>, but that would waste memory. I expect my threshold to be in the order of 2048.
If there is a possibility that data might need to be stored on the stack, and whether the size is greater than the threshold is only determined at run time, then your type will have to include some stack container big enough to hold something of size threshold (whether std::array or something else), making your concern
"I could use an std::array, but that would waste memory. I expect my threshold to be in the order of 2048."
unavoidable. I don't think there is any way around this. E.g. in code like
uint32_t N = code_that_determines_size_at_runtime();
ThresholdContainer container(N);
the compiler cannot know whether N is above or below the threshold. So for this to work, the memory layout for ThresholdContainer has to contain the stack memory for data up to size threshold, which would be unused and wasted when N > threshold. (Stitching together the stack and heap memory with some iterator that goes between the two would be horrendous, and you probably want contiguous memory).
If, on the other hand, the size versus threshold is known at compile time, you could define a class templated on the size N that essentially holds an std::array if N < threshold or a vector otherwise, and provides a common interface.
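A hedged sketch of such a class (the name SmallBuffer and the default threshold are illustrative, and it assumes C++17 for if constexpr):
#include <array>
#include <cstddef>
#include <type_traits>
#include <vector>

template <typename T, std::size_t N, std::size_t Threshold = 2048>
class SmallBuffer
{
    // std::array lives inside the object (on the stack if the object does);
    // std::vector keeps its elements on the heap.
    using Storage = std::conditional_t<(N <= Threshold), std::array<T, N>, std::vector<T>>;
    Storage storage_;

public:
    SmallBuffer()
    {
        if constexpr (N > Threshold)
            storage_.resize(N); // heap path: allocate once, up front
    }
    T &operator[](std::size_t i) { return storage_[i]; }
    std::size_t size() const { return N; }
    T *data() { return storage_.data(); }
};

// SmallBuffer<float, 512> lives entirely on the stack;
// SmallBuffer<float, 100000> transparently uses the heap.
The run-time variant of this idea (branching on the actual size rather than on a template parameter) is essentially the small-buffer optimisation used by containers such as llvm::SmallVector.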

When is heap memory preferred over stack memory

I know that local arrays are created on the stack, and have automatic storage duration, since they are destroyed when the function they're in ends. They necessarily have a fixed size:
{
int foo[16];
}
Arrays created with operator new[] have dynamic storage duration and are stored on the heap. They can have varying sizes.
{
const int size = 16;
int* foo = new int[size];
// do something with foo
delete[] foo;
}
The size of the stack is fixed and limited for every process.
My question is:
Is there a rule of thumb when to switch from stack memory to heap memory, in order to reduce the stack memory consumption?
Example:
double a[2] is perfectly reasonable;
double a[1000000000] will most likely result in a stack overflow if the stack size is 1 MB.
Where is a reasonable limit to switch to dynamic allocation?
See this answer for a discussion about heap allocation.
Where is a reasonable limit to switch to dynamic allocation?
In several cases, including:
too large automatic variables. As a rule of thumb, I recommend avoiding call frames of more than a few kilobytes (and a call stack of more than a megabyte). That limit might be increased if you are sure that your function is not called recursively. On many small embedded systems, the stack is much more limited (e.g. to a few kilobytes), so you need to limit each call frame even more (e.g. to only a hundred bytes). BTW, on some systems you can increase the call stack limit much more (perhaps to several gigabytes), but this is also a sysadmin issue.
a non-LIFO allocation discipline, which happens quite often.
Notice that most C++ standard containers allocate their data in the heap, even if the container is on the stack. For example, an automatic variable of vector type, e.g. a local std::vector<double> autovec; has its data heap allocated (and released when the vector is destroyed). Read more about RAII.
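To make that concrete (a minimal sketch):
#include <vector>

void f()
{
    // The vector object itself (a few pointers) sits on the stack, but its
    // one million doubles live on the heap.
    std::vector<double> autovec(1000000);
    autovec[0] = 3.14;
} // RAII: the heap buffer is released here when autovec is destroyed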

What are the next steps to improve a malloc() algorithm? [closed]

I'm writing my own simple malloc() function and I would like to make it faster and more efficient. I've written a function that uses linear search and allocates sequentially and contiguously in memory.
What is the next step to improve this algorithm? What are the main shortcomings of my current version? I would be very grateful for any feedback and recommendations.
#include <stddef.h>
#include <stdbool.h>

typedef struct heap_block
{
    struct heap_block* next;
    size_t size;
    bool isfree;
} header;

#define Heap_Capacity 100000

static char heap[Heap_Capacity];
size_t heap_size;

void set_data_to_block(header* block, size_t sz);
header* to_end_data(header* block);

void* malloc(size_t sz)
{
    if(sz == 0 || sz > Heap_Capacity) { return NULL; }
    header* block = (header*)heap;
    if(heap_size == 0)
    {
        set_data_to_block(block, sz);
        return (void*)(block+1);
    }
    while(block->next != NULL) { block = block->next; }
    block->next = (header*)((char*)to_end_data(block) + 8); /* hard-coded 8-byte gap */
    header* new_block = block->next;
    set_data_to_block(new_block, sz);
    return (void*)(new_block+1);
}

void set_data_to_block(header* block, size_t sz)
{
    block->size = sz;
    block->isfree = false;
    block->next = NULL;
    heap_size += sz;
}

header* to_end_data(header* block)
{
    return (header*)((size_t)(block+1) + block->size);
}
Notice that malloc is often built above lower-level memory-related syscalls (e.g. mmap(2) on Linux). See this answer, which mentions GNU glibc and musl-libc. Look also inside tcmalloc, and study the source code of several free-software malloc implementations.
Some general ideas for your malloc:
retrieve memory from the OS using mmap (and release it back to the OS kernel eventually with munmap); see the sketch after this list. You certainly should not allocate a fixed-size heap (since on a 64-bit computer with 128 GB of RAM you would want to succeed in malloc-ing a 10-billion-byte zone).
segregate small allocations from big ones, so handle a malloc of 16 bytes differently from a malloc of a megabyte. The typical threshold between small and large allocations is generally a small multiple of the page size (which is often 4 KB). Small allocations happen inside pages. Large allocations are rounded up to whole pages. You might even special-case malloc of two words (as used in many linked lists).
round up the requested size to some fancy number (e.g. powers of 2, or 3 times a power of 2).
manage together memory zones of similar sizes, that is of same "fancy" size.
for small memory zones, avoid reclaiming too early the memory zone, so keep previously free-d zones of same (small) size to reuse them in future calls to malloc.
you might use some tricks on the address (but your system might have ASLR), or keep near each memory zone a word of meta-data describing the chunk of which it is a member.
a significant issue is, given some address previously returned by malloc and passed as the argument of free, to find out the allocated size of that memory zone. You could manipulate the address bits, you could store that size in the word before, you could use some hash table, etc. Details are tricky.
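As a sketch of the first idea, retrieving and releasing chunks on Linux might look like this (error handling trimmed; a real allocator would carve these chunks up rather than hand them out whole):
#include <sys/mman.h>
#include <cstddef>

void *os_alloc_chunk(std::size_t bytes)
{
    void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p; // MAP_FAILED, not NULL, signals failure
}

void os_free_chunk(void *p, std::size_t bytes)
{
    munmap(p, bytes); // give the pages back to the kernel
}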
Notice that details are tricky, and it might be quite hard to write a malloc implementation better than your system's. In practice, writing a good malloc is not a simple task. You should find many academic papers on the subject.
Look also into garbage collection techniques. Consider perhaps Boehm's conservative GC: you will replace malloc by GC_MALLOC and you won't bother about free... Learn about memory pools.
There are 3 ways to improve:
make it more robust
optimise the way memory is allocated
optimise the code that allocates the memory
Make It More Robust
There are many common programmer mistakes that could be easily detected (e.g. modifying data beyond the end of the allocated block). For a very simple (and relatively fast) example, it's possible to insert "canaries" before and after the allocated block, and detect programmer mistakes during free and realloc by checking if the canaries are correct (if they aren't, then the programmer has trashed something by mistake). This only works "sometimes, maybe". The problem is that (for a simple implementation of malloc) the meta-data is mixed in with the allocated blocks, so there's a chance that the meta-data has been corrupted even if the canaries haven't. To fix that it'd be a good idea to separate the meta-data from the allocated blocks. Also, merely reporting "something was corrupted" doesn't help as much as you'd hope. Ideally you'd want some sort of identifier for each block (e.g. the name of the function that allocated it) so that if/when problems occur you can report what was corrupted. Of course this could/should maybe be done via a macro, so that those identifiers can be omitted when not debugging.
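A hedged sketch of the canary idea (constants and layout are illustrative; as noted above, a real implementation would keep the size in meta-data next to the block rather than requiring the caller to pass it):
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>

static const std::uint32_t CANARY = 0xDEADBEEF;

void *debug_malloc(std::size_t size)
{
    char *raw = static_cast<char *>(std::malloc(size + 2 * sizeof(CANARY)));
    if (!raw) return nullptr;
    std::memcpy(raw, &CANARY, sizeof(CANARY));                         // front canary
    std::memcpy(raw + sizeof(CANARY) + size, &CANARY, sizeof(CANARY)); // back canary
    return raw + sizeof(CANARY);
}

void debug_free(void *p, std::size_t size)
{
    char *raw = static_cast<char *>(p) - sizeof(CANARY);
    std::uint32_t front, back;
    std::memcpy(&front, raw, sizeof(front));
    std::memcpy(&back, raw + sizeof(CANARY) + size, sizeof(back));
    assert(front == CANARY && back == CANARY && "buffer over/underrun detected");
    std::free(raw);
}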
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to return acceptable error conditions ("failed to allocate" is the only error it can return) and no way to pass additional information. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier) (with similar alterations to free and realloc to enable them to return a status code and identifier).
Optimise the Way Memory is Allocated
It's naive to assume that all memory is the same. It's not. Cache locality (including TLB locality) and other cache effects, and things like NUMA optimisation, all matter. For a simple example, imagine you're writing an application that has a structure describing a person (including a hash of their name) and a pointer to the person's name string, and both the structure and the name string are allocated via malloc. The normal end result is that those structures and strings end up mixed together in the heap; so when you're searching through these structures (e.g. trying to find the structure that contains the correct hash) you end up pounding caches and TLBs. To optimise this properly you'd want to ensure that all the structures are close together in the heap. For that to work, malloc needs to know the difference between allocating 32 bytes for the structure and allocating 32 bytes for the name string. You need to introduce the concept of "memory pools" (e.g. where everything in "memory pool number 1" is kept close in the heap).
Other important optimisations include "cache colouring" (see http://en.wikipedia.org/wiki/Cache_coloring ). For NUMA systems, it can be important to know the difference between allocations where maximum bandwidth is needed (where using memory from multiple NUMA domains increases bandwidth) and allocations where low latency is needed (where memory in the thread's local NUMA domain is preferable).
Finally, it'd be nice (to manage heap fragmentation, etc.) to use different strategies for "temporary, likely to be freed soon" allocations and for longer-term allocations (e.g. where it's worth doing a little extra work to minimise fragmentation and wasted space/RAM).
Note: I'd estimate that getting all of this right can mean software running up to 20% faster in specific cases, due to far fewer cache misses, more bandwidth where it's needed, etc.
The main problem here is that the interface provided by malloc is lame and broken - there's simply no way to pass the additional information to malloc in the first place. You'd want something more like int malloc(void **outPointer, size_t size, char *identifier, int pool, int optimisationFlags) (with similar alterations to realloc).
Optimise the Code That Allocates the Memory
Given that memory is used far more frequently than it's allocated, this is the least important optimisation (e.g. less important than getting things like cache locality for the allocated blocks right).
Quite frankly, anyone that actually wants decent performance or decent debugging shouldn't be using malloc to begin with - generic solutions to specific problems are never ideal. With this in mind (and not forgetting that the interface to malloc is lame and broken and prevents everything that's important anyway) I'd recommend simply not bothering with malloc and creating something that's actually good (but non-standard) instead. For this you can adapt the algorithms used by existing implementations of malloc.

C++ stack array limit?

I'm running some code which may be showing that I don't understand the difference between the heap and the stack all that well. Below I have some example code, where I declare an array of 1234567 elements either on the stack or on the heap. Both work.
int main(int argc, char** argv){
    int N = 1234567;
    int A[N];               // variable-length array: a GCC extension, not standard C++
    //int* A = new int[N];  // the heap alternative
}
But if we take N to be 12345678, I get a seg fault with int A[N], whereas the heap declaration still works fine. (I'm using g++ -O3 -std=c++0x, if that matters.) What madness is this? Does the stack have a (rather small) array size limit?
This is because the stack is of a much smaller size than the heap. The heap can occupy all memory available to the program. By default VC++ compiles the stack with a size of 1 MB. The stack offers better performance but is for smaller quantities of data. In general it is not used for large data structures. This is why functions accepting lists/arrays/dictionaries/etc. in C++ generally take a pointer or reference to that structure. Parameters passed by value are copied onto the stack, and passing such structures by value would frequently cause programs to crash.
In your example you're using N ints, and an int is typically 4 bytes. With N = 1234567 the array is about 4.7 MB, which fits in the 8 MB default Linux stack (though not in VC++'s 1 MB); with N = 12345678 it is about 47 MB, much larger than the stack on either platform.
The heap grows dynamically with allocation through malloc and co. The stack grows with each function call made in the course of running a program. The return address, arguments, local variables are usually stored in the stack (except that in certain processor architectures a handful of these are stored in registers instead). It is also possible (but not common) to allocate stack space dynamically.
The heap and the stack compete for the use of the same memory. You can think of one growing left to right and the other growing right to left. There is a possibility that, if left unchecked, they may collide. The stack is typically restrained from growing beyond a certain bound. This is relatively small because it is expected that it will use only a few bytes for most calls and only a few stack levels will be used. The limit is small but sufficient for most tasks. You can expand this limit by changing your build settings (not for Linux ELF binaries, though) or by calling setrlimit. The OS may also impose a limit which you can change. There may be soft and hard limits (http://www.nics.tennessee.edu/node/327).
Going into greater detail about the limits falls outside the scope of the question. The bottom line is that the stack is limited, and it is quite small because it competes with the heap for actual memory and for typical applications it need not be bigger.
http://en.wikipedia.org/wiki/Call_stack
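For reference, querying and raising the soft limit from inside the process looks roughly like this on Linux (a sketch; the soft limit can only be raised up to the hard limit, and it's more common to set it before the program starts with ulimit -s):
#include <sys/resource.h>
#include <cstdio>

int main()
{
    rlimit rl;
    getrlimit(RLIMIT_STACK, &rl);
    std::printf("soft: %lld, hard: %lld\n",
                (long long)rl.rlim_cur, (long long)rl.rlim_max);
    rl.rlim_cur = 64L * 1024 * 1024; // ask for a 64 MB soft limit
    if (setrlimit(RLIMIT_STACK, &rl) != 0)
        std::perror("setrlimit");
    return 0;
}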