Some stack questions - C++

Let me start by saying that I have read this tutorial and have read this question. My questions are:
1. How big can the stack get? Is it processor/architecture/compiler dependent?
2. Is there a way to know exactly how much memory is available to my function/class stack and how much is currently being used, in order to avoid overflows?
3. Using modern compilers (say gcc 4.5) on a modern computer (say 6 GB RAM), do I need to worry about stack overflows, or are they a thing of the past?
4. Is the actual stack memory physically in RAM or in CPU cache(s)?
5. How much faster is stack memory access and read compared to heap access and read? I realize that times are PC specific, so a ratio is enough.
6. I've read that it is not advisable to allocate big vars/objects on the stack. How much is too big? This question here is given an answer of 1 MB for a thread in win32. How about a thread in Linux amd64?
I apologize if those questions have been asked and answered already; any link is welcome!

Yes, the limit on the stack size varies, but if you care you're probably doing something wrong.
Generally, no, you can't get information about how much memory is available to your program. Even if you could obtain such information, it would usually be stale before you could use it.
If you share access to data across threads, then yes you normally need to serialize access unless they're strictly read-only.
You can pass the address of a stack-allocated object to another thread, in which case you (again) have to serialize unless the access is strictly read-only.
You can certainly overflow the stack even on a modern machine with lots of memory. The stack is often limited to only a fairly small fraction of overall memory (e.g., 4 MB).
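For instance, a minimal sketch (the exact limit and failure mode are platform dependent, and an optimizer may elide the array):

int main() {
    int big[2000000];   // ~8 MB of automatic storage: larger than a typical 4 MB stack
    big[0] = 1;         // touching it faults once the stack limit is exceeded
    return big[0];      // illustrative only; build without optimizations to see the crash
}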
The stack is ordinary system memory, but it is used heavily enough that the top page or two will typically be in the cache at any given time.
Being part of the stack vs. heap makes no direct difference to access speed -- the two typically reside in identical memory chips, and often even at different addresses in the same memory chip. The main difference is that the stack is normally contiguous and heavily used, so the top few pages will almost always be in the cache. Heap-based memory is typically fragmented, so there's a much greater chance of needing data that's not in the cache.
Little has changed with respect to the maximum size of object you should allocate on the stack. Even if the stack can be larger, there's little reason to allocate huge objects there.
The primary way to avoid memory leaks in C++ is RAII (AKA SBRM, Stack-based resource management).
Smart pointers are a large subject in themselves, and Boost provides several kinds. In my experience, collections make a bigger difference, but the basic idea is largely the same either way: relieve the programmer of keeping track of every circumstance when a particular object can be used or should be freed.
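As an illustration of the RAII idea, a minimal sketch (the File wrapper and "data.txt" are purely illustrative):

#include <cstdio>

class File {
    std::FILE* f_;
public:
    explicit File(const char* name) : f_(std::fopen(name, "r")) {}
    ~File() { if (f_) std::fclose(f_); }    // resource released on scope exit
    File(const File&) = delete;             // non-copyable: exactly one owner
    File& operator=(const File&) = delete;
    std::FILE* get() const { return f_; }
};

void use() {
    File f("data.txt");       // acquisition is initialization
    if (!f.get()) return;     // early return: the destructor still runs
    // ... read from f.get() ...
}                             // normal exit or exception: fclose happens automatically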

1. How big can the stack get? Is it processor/architecture/compiler dependent?
The size of the stack is limited by the amount of memory on the platform and the amount of memory allocated to the process by the operating system.
2. Is there a way to know exactly how much memory is available to my function/class stack and how much is currently being used, in order to avoid overflows?
There is no C or C++ facility for determining the amount of available memory, though there may be platform-specific functions for this. In general, most programs simply try to allocate memory, then handle the case where the allocation fails.
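For example, a minimal sketch of that try-and-handle pattern:

#include <new>
#include <cstdio>

int main() {
    // ~800 MB: don't ask whether it will fit, just try and handle failure
    double* big = new (std::nothrow) double[100000000];
    if (!big) {
        std::puts("allocation failed - fall back to a smaller buffer");
        return 1;
    }
    delete[] big;
    return 0;
}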
3. Using modern compilers (say gcc 4.5) on a modern computer (say 6 GB RAM), do I need to worry about stack overflows, or are they a thing of the past?
Stack overflows can happen depending on the design of the program. Recursion is a good example of depleting the stack, regardless of the amount of memory.
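A minimal sketch (compile without optimizations; an optimizer may turn the tail call into a loop):

#include <cstdio>

void recurse(unsigned long long depth) {
    char pad[1024];                       // make each frame cost at least 1 KB
    pad[0] = static_cast<char>(depth);    // touch it so it isn't optimized away
    if (depth % 100000 == 0)
        std::printf("depth %llu\n", depth);
    recurse(depth + 1);                   // no base case: the stack must overflow
}

int main() {
    recurse(0);    // crashes long before 6 GB of RAM is exhausted
}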
4. Is the actual stack memory physically in RAM or in CPU cache(s)?
Platform dependent. Some CPUs can load their caches with local variables from the stack. There is a wide variety of scenarios on this topic, and it is not defined in the language specification.
5. How much faster is stack memory access and read compared to heap access and read? I realize that times are PC specific, so a ratio is enough.
Usually there is no difference in speed. It depends on how the platform organizes its memory (physically) and how the executable's memory is laid out. The heap or the stack could reside in a serial-access memory chip (a slow method) or even on a flash memory chip. This is not specified in the language standard.
6. I've read that it is not advisable to allocate big vars/objects on the stack. How much is too big? This question here is given an answer of 1 MB for a thread in win32. How about a thread in Linux amd64?
The best advice is to allocate small local variables as needed (i.e., on the stack). Huge items are either allocated from dynamic memory (the heap) or made some kind of global (a static local to a function, a variable local to a translation unit, or even a global variable). If the size is known at compile time, use the global type of allocation; use dynamic memory when the size may change during run time.
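An illustrative sketch of those three choices (all names here are made up):

#include <cstddef>
#include <vector>

static int lookup_table[65536];          // size known at compile time: static storage

void work(std::size_t n) {
    int scratch[64];                     // small and fixed: the stack is fine
    std::vector<double> samples(n);      // size known only at run time: heap
    scratch[0] = lookup_table[0];        // (uses only to keep compilers quiet)
    samples.clear();
}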
The stack also contains information about function return addresses. This is one major reason not to allocate a lot of objects locally. Some compilers have smaller limits for stacks than for heap or global variables; the premise is that nested function calls require less memory than large data arrays or buffers.
Remember that when switching threads or tasks, the OS needs to save the state somewhere. The OS may have different rules for saving stack memory versus other types.

1-2 : On some embedded CPUs the stack may be limited to a few kilobytes; on some machines it may expand to gigabytes. There's no platform-independent way to know how big the stack can get, in some measure because some platforms are capable of expanding the stack when they reach the limit; the success of such an expansion cannot always be predicted in advance.
3 : The effects of nearly-simultaneous writes, or of writes in one thread that occur nearly simultaneously with reads in another, are largely unpredictable in the absence of locks, mutexes, or other such devices. Certain things can be assumed (for example, if one thread reads a heap-stored 'int' while another thread changes it from 4 to 5, the first thread may see 4 or it may see 5; on most platforms, it would be guaranteed not to see 27).
4 : Some platforms share stack address space among threads; others do not. Passing pointers to things on the stack is usually a bad idea, though, since the foreign thread receiving the pointer has no way of ensuring that the target is still in scope and won't go out of scope.
5 : Generally one does not need to worry about stack space in any routine which is written to limit recursion to a reasonable level. One does, however, need to worry about the possibility of defective data structures causing infinite recursion, which would wipe out any stack no matter how large it might be. One should also be mindful of the possibility of nasty input which would cause a much greater stack depth than expected. For example, a compiler using a recursive-descent parser might choke if fed a file containing a billion repetitions of the sequence "1+(". Even if the machine has a gig of stack space, if each nested sub-expression uses 64 bytes of stack, the aforementioned three-gig file could kill it.
6 : The stack is generally stored in RAM and/or cache; the most recently accessed parts will generally be in cache, while the less recently accessed parts will be in main memory. The same is generally true of code, heap, and static storage areas as well.
7 : That is very system dependent; generally, "finding" something on the heap will take as much time as accessing a few things on the stack, but in many cases making multiple accesses to different parts of the same heap object can be as fast as accessing a stack object.

Related

How to calculate remaining size of the stack? [duplicate]

I'm architecting a small software engine and I'd like to make extensive use of the stack for rapid iterations over large number sets. But then it occurred to me that this might be a bad idea, since the stack isn't as large a memory store as the heap. But I am attracted to the stack's speed and freedom from dynamic-allocation coding practices.
Is there a way to find out how far I can push the stack on a given platform? I am looking mainly at mobile devices but the issue could come up on any platform.
On *nix, use getrlimit:
RLIMIT_STACK
The maximum size of the process stack, in bytes. Upon
reaching this limit, a SIGSEGV signal is generated. To handle
this signal, a process must employ an alternate signal stack
(sigaltstack(2)).
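A minimal sketch of using it (POSIX only; note that rlim_cur may report RLIM_INFINITY):

#include <sys/resource.h>
#include <cstdio>

int main() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        // rlim_cur is the soft limit the process currently runs under
        std::printf("soft limit: %llu bytes, hard limit: %llu bytes\n",
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);
    }
    return 0;
}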
On Windows, use VirtualQuery:
For the first call, pass it the address of any value on the stack to
get the base address and size, in bytes, of the committed stack space.
On an x86 machine where the stack grows downwards, subtract the size
from the base address and VirtualQuery again: this will give you the
size of the space reserved for the stack (assuming you're not
precisely on the limit of stack size at the time). Summing the two
naturally gives you the total stack size.
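A hedged sketch of that recipe (Windows only; it assumes the address just below the committed region falls inside the reserved stack region):

#include <windows.h>
#include <cstdio>

int main() {
    MEMORY_BASIC_INFORMATION committed, reserved;
    int probe = 0;                                   // any value on the stack
    VirtualQuery(&probe, &committed, sizeof committed);
    // step below the committed region to find the reserved remainder
    VirtualQuery((char*)committed.BaseAddress - committed.RegionSize,
                 &reserved, sizeof reserved);
    std::printf("committed: %zu, reserved: %zu, total: %zu bytes\n",
                committed.RegionSize, reserved.RegionSize,
                committed.RegionSize + reserved.RegionSize);
    return 0;
}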
There is no platform-independent method, since stack size is logically left to the implementation and the host system - on an embedded mini-SoC there are fewer resources to distribute than on a server with 128 GB of RAM. You can, however, influence the stack size of a specific thread on all OSes with API-specific calls.
A possible portable solution is to write an allocator yourself.
You do not have to make use of the process stack, just simulate it in the heap.
Allocate a large amount of memory in the beginning, and write a stack allocator on top of it to use it while allocating.
Google 'Allocator Requirements' for information on how to achieve it in C++.
I'm not sure if the term 'Stack Allocator' is canonical, but I mean that you have to put stack-like restrictions on where the allocation or deallocation has to happen.
Since you said that your algorithm is suited to this pattern, I think it'd be easy.
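A minimal sketch of such a simulated stack (names illustrative; alignment handling omitted for brevity):

#include <cstddef>
#include <cstdlib>

class StackArena {
    char*       base_;
    std::size_t size_;
    std::size_t top_ = 0;                 // bump pointer: top of the fake stack
public:
    explicit StackArena(std::size_t size)
        : base_(static_cast<char*>(std::malloc(size))), size_(size) {}
    ~StackArena() { std::free(base_); }

    void* push(std::size_t n) {           // "allocate"
        if (!base_ || top_ + n > size_)
            return nullptr;               // overflow is recoverable here
        void* p = base_ + top_;
        top_ += n;
        return p;
    }
    void pop(std::size_t n) { top_ -= n; }   // strictly LIFO, like a real stack
};

With something like StackArena arena(64 * 1024 * 1024); you get a 64 MB "stack" regardless of what the OS grants the real one, and running out is a recoverable nullptr rather than a crash.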
In standard C++, definitely not. In a portable way, probably not. On a particular OS, sometimes. If nothing else, you could open your own executable and inspect the headers of the executable file to see its stack size. [The next problem is of course "how much of the stack was used before this bit of code?" - which can be difficult to determine.]
If you run the code in a separate thread, many of the (low-level) thread interfaces allow you to specify a stack (or stack size), e.g. POSIX threads' pthread_attr_setstacksize or MS _beginthread. Again, you don't know EXACTLY how much space has been used up before it gets to the actual thread code - but it's probably not a huge amount.
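For example, a sketch with POSIX threads (the 8 MB figure is an arbitrary choice):

#include <pthread.h>

void* worker(void*) {
    // at most ~8 MB of stack is available here (minus startup overhead)
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);   // request an 8 MB stack
    pthread_t t;
    pthread_create(&t, &attr, worker, nullptr);
    pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}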
Of course, in an embedded system (e.g. a mobile phone), the stack size is typically quite small; 4 KB, 12 KB or 64 KB is very much normal - sometimes even a lot smaller than that in some systems.
Another potential problem is that you can't really know how much space is ACTUALLY used on the stack - you can measure after the fact in a compiled system, and of course, if you have a stack-local array int array[25]; we know it takes up at least 25 * sizeof(int) - but there may be padding, the compiler saves registers on the stack, etc.
Edit, as an afterthought:
I also don't really see much benefit in having two code-paths:
if (enough_stack_space_for_something)
    use_stack_based_algorithm();
else
    use_heap_based_algorithm();
This would add a fair amount of extra overhead, and more code is generally not a good plan in an embedded/mobile system.
Edit 2: Also, if allocating memory is a major part of the runtime, perhaps look at why that is - for example, would creating objects in blocks help?
To expand on the answers already given about why there is no portable way to do this, the entire concept of an actual stack is not part of the standard. You could write a C or C++ runtime that doesn't use a stack at all other than the function call records (which might internally be a linked list or something else).
The stack is an implementation detail of a particular machine/OS/compiler. Hence any technique to access stack metrics will be specific to machine/OS/compiler.
While not an actual answer to your specific question (Niels covered that quite well), here is some advice for your problem domain: just allocate a large chunk of memory on the heap. There's no reason aside from convenience that the "real" stack is any different. Highly recursive (non-tail-recursive) algorithms often need to do this to ensure that they have a virtually unbounded "stack." Scripting languages that want to give a runtime error/exception rather than crashing the host application also often do this. To be efficient about things, you can either implement a "split stack" (like a std::deque would give you) or you can just be sure to preallocate a stack big enough for your needs.
There's no standard way to do it from within the language. I'm not even aware of a documented extension that is able to query.
However some compilers have options to set the stack size. And platform may specify what it does when launching a process, and/or provide ways to set stack size of a new thread, maybe even manipulate existing one.
For small platforms it's usual to know the whole memory size, put all the data segments at one end, reserve a fixed-size arena for the heap (which may be 0), and let the rest be stack, approaching from the other side.

Max amount one should allocate on the stack

I've been searching Stack Overflow for a guideline on the max amount of memory one should allocate on the stack.
I see best practices for stack vs. heap allocation, but nothing with numbers giving a guideline on how much should be allocated on the stack and how much on the heap.
Any ideas/numbers I can use as a guideline? When should I allocate on the stack vs. the heap and how much is too much?
In a typical case, the stack is limited to around 1-4 megabytes. To leave space for other parts of the code, you typically want to limit a single stack frame to no more than a few tens of kilobytes or so if possible. When/if recursion gets (or might get) involved, you typically want to limit it quite a bit more than that.
The answer here depends on the environment in which the code is running. On a small embedded system, the whole stack may be a few kilobytes. On a large system running on a desktop, the stack is typically in the megabytes.
For desktop/big embedded system, a few kilobytes is typically fine. For small embedded systems, that may not work well at all.
On the other hand, excessive use of the heap can lead to excessive overhead from calling new/delete frequently. So in a typical situation, you shouldn't use heap allocation for very small objects - unless it is necessary for other design reasons (e.g. you need a pointer you can store permanently somewhere, and the stack won't work for that because you return from the current function before the object's use is finished).
Of course, it's the overall design that matters. If you have a very simple application, with a few functions, none of which are recursive, it could be fine to allocate a few hundred kilobytes in main or a level above. On the other hand, if you are making a library for generic use, using more than a few kilobytes will probably not make you popular with the developers using the library. And if the library is being developed to run on low memory systems (in a washing machine, old style mobile phone, etc) then using more than a couple of hundred bytes is probably a bad idea.
Allocate as little as possible on the stack. Use the heap for large datasets, or else the stack allocation will persist for the whole life of the scope, possibly thrashing the cache.

C program with minimum RAM

I want to understand the memory management in C and C++ programming for Application Development. Application will run on the PC.
If I want to make a program which uses as little RAM as possible while running, what are the points I need to consider while programming?
Here are two points according to what I understand, but I am not sure:
(1) Use minimum local variables in main() and other functions.
As local variables are saved on the stack, which is in RAM?
(2) Instead of local variables, use global variables at the top.
As global variables are saved in the uninitialized and initialized ROM areas?
Thanks.
1) Generally the alternative to allocating on the stack is allocating on the heap (e.g., with malloc), which actually has greater overhead due to bookkeeping, and the stack already has memory reserved for it, so allocating on the stack where possible is often preferable. On the other hand, there is less space on the stack, while the heap can be close to "unlimited" on modern systems with virtual memory and a 64-bit address space.
2) On PCs and other non-embedded systems, everything in your program goes in RAM, i.e., it is not flashed to a ROM-like memory, so global versus local does not help in that regard. Also, globals† tend to "live" as long as the application is running, while locals can be allocated and freed (either on the stack or heap) as required, and are thus preferable.
 
† More accurately, there can also be local variables with static duration, and variables with global scope that are pointers to dynamically allocated memory, so the terms local and global are used quite loosely here.
In general, modern desktop/laptop and even mobile operating systems are quite good at managing memory, so you probably shouldn't be trying to micro-optimize everything as you may actually do more harm than good.
If you really do need to bring down the memory footprint of your program, you must realize that everything in the program is stored in RAM, and so you need to work on reducing the number and size of the things you have, rather than trying to juggle their location. The other place where you can store things locally on a PC is the hard drive, so store large resources there and only load them as required (preferably only exactly the parts required). But remember that disk access is orders of magnitude slower than memory access, and that the operating system can also swap things out to disk if its memory gets full.
The program code itself is also stored in RAM, so have your compiler optimize for size (-Os or /Os option in many common compilers). Also remember that if you save a bit of space in variables by writing more complex code, the effort may be undone by the increased code size; save your optimizations for big wins (e.g., compressing large resources will require the added decompression code, but may still yield a large net win). Use of dynamically linked libraries (and other resources) also helps the overall memory footprint of the system if the same library is used by multiple programs running at the same time.
(Note that some of the above does not apply in embedded development, e.g., code and static constants may be indeed be stored in flash instead of RAM, etc.)
This is difficult, because on your PC the program will be running out of RAM unless you can somehow execute it out of a ROM or flash.
Here are the points to consider:
Reduce your code size.
Code takes up RAM.
Reduce variable quantity and size.
Variables need to live somewhere and that somewhere is in RAM.
Reduce character literals.
They, too, take up space.
Reduce function call nesting.
A function may require parameters, which are placed in RAM.
A function that calls other functions needs a return path; the path is stored in RAM.
Use RAM from other devices.
Other devices, such as the graphics processor and your hard drive adapter card, may have RAM you can use. If you use this RAM, you're not using the primary RAM.
Page memory to external device.
The OS is capable of virtual memory and can page memory out to an external device, such as a hard drive.
Edit 1 - Dynamic libraries
To reduce the RAM footprint of your program, you could reserve an area into which you swap library functions. This is similar to the DLL concept: when a function is needed, you load it from the hard drive into the reserved area.
Typically, a certain amount of space will be allocated for the stack; such space will be unavailable for other purposes whether or not it is used. If the space turns out to be inadequate, the program will die a gruesome death.
Local variables will be stored using some combination of registers and stack space. Some compilers will use the same registers or stack space for variables which are "live" at different times in a program's execution; others will not. Further, function arguments are typically pushed on the stack before calling a function and removed at the caller's convenience. In evaluating the code sequence:
function1(1,2,3,4,5);
function2(6,7,8,9,10);
the arguments for the first function will be pushed on the stack and that function will be called. At that point the compiler could remove those five values off the stack, but since a single instruction can remove any number of pushed values, many compilers will push the arguments for the second function (leaving the arguments of the first on the stack), call the second function, and then use one instruction to eliminate all ten. Normally this would be a non-issue, but in some deeply-nested recursive scenarios it could potentially be a problem.
Unless the "PC" you're developing for is tiny by today's standards, I wouldn't worry too much about trying to micro-optimize RAM usage. I've developed code for microcontrollers with only 25 bytes of RAM, and even written full-fledged games for a microprocessor-based console with a whopping 128 bytes (not kilobytes!) of RAM, and on such systems it makes sense to worry about each individual byte. For PC applications, though, the only time it makes sense to worry about individual bytes is when they're part of a data structure that will be replicated many thousands of times in RAM.
You might want to get a book on "embedded" programming. Such a book will likely discuss ways to keep the memory footprint down, as embedded systems are more constrained than modern desktop or server systems.
When you use "local" variables, they are saved on the stack. As long as you don't use too much stack, this is basically free memory, as when the function exits the memory is returned. How much is "too much" varies... recently I had to work on a system where there is a limit of 8 KB of data for the stack per process.
When you use "global" variables or other static variables, the memory you use is tied up for the duration of the program. Thus you should minimize your use of globals, and/or find ways to share the same memory across multiple functions in your program.
I wrote a fairly elaborate "object manager" for a project I wrote a few years ago. A function can use the "get" operation to borrow an object, and then use the "release" operation when it is done borrowing the object. This means that all the functions in the system were able to share a relatively small amount of data space by taking turns using the shared objects. It's up to you to decide whether it is worth your time to build an "object manager" sort of thing or if you have enough memory to just use simple variables.
You can get much of the benefit of an "object manager" by simply calling malloc() and free() a lot. Then the heap allocator manages the shared resource, the heap memory, for you. The reason I wrote my own "object manager" was a need for speed. My system keeps using identical data objects, and it is way faster to just keep re-using the same ones than to keep freeing them and malloc-ing them again. Also, my system can be run on a DSP chip, and malloc() can be a surprisingly slow function on some DSP architectures.
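A much-simplified sketch of that "object manager" pattern (the real one had more bookkeeping; T must be default-constructible here):

#include <cstddef>
#include <vector>

template <typename T>
class Pool {
    std::vector<T*> free_;               // objects waiting to be borrowed
public:
    explicit Pool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            free_.push_back(new T());    // allocate once, up front
    }
    ~Pool() {
        for (T* p : free_) delete p;     // assumes everything was released
    }
    T* get() {                           // borrow an object; no malloc here
        if (free_.empty()) return nullptr;
        T* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(T* p) { free_.push_back(p); }   // return it for reuse
};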
Having multiple functions using the same global variables can lead you to tricky bugs, if one function tries to hold on to a global buffer while another one is overwriting the data. So your program will likely be more robust if you use malloc() and free() as long as each function only writes into data it allocated for itself. (But malloc() and free() can introduce bugs of their own: memory leaks, double-free errors, continuing to use a pointer after the data to which it points has been freed... if you use malloc() and free() be sure to use a tool such as Valgrind to check your code.)
Any variables, by definition, must be stored in read/write memory (or RAM). If you are talking about an embedded system with the code initially in ROM, then the runtime will copy the ROM image you identified into RAM to hold the values of the global variables.
Only items marked unchangeable (const) may be kept in the ROM during runtime.
Further, you need to reduce the depth of the calling structure of the program as each function call requires stack space (also in RAM) to record the return address and other values.
To minimise the use of memory, you can try to flag local variables with the register attribute, but this may not be honoured by your compiler.
Another common technique is to dynamically generate large variable data on the fly whenever it is required, to avoid having to keep buffers around. Such buffers usually take up much more space than simple variables.
If this is a PC, then by default you will be given a stack of a certain size (you can make it bigger or smaller). Using this stack is more RAM-efficient than using global variables, because your RAM usage will be the fixed stack size + globals + other stuff (program, heap, etc.), and the stack acts as a reusable piece of memory.


Why is stack memory size so limited?

When you allocate memory on the heap, the only limit is free RAM (or virtual memory) - gigabytes of memory.
So why is the stack size so limited (around 1 MB)? What technical reason prevents you from creating really big objects on the stack?
Update: my intent might not be clear; I do not want to allocate huge objects on the stack, and I do not need a bigger stack. This question is pure curiosity!
My intuition is the following. The stack is not as easy to manage as the heap. The stack needs to be stored in contiguous memory locations. This means that you cannot randomly allocate the stack as needed; you need to at least reserve virtual addresses for that purpose. The larger the reserved virtual address space, the fewer threads you can create.
For example, a 32-bit application generally has a virtual address space of 2 GB. This means that if the stack size is 2 MB (the default in pthreads), then you can create a maximum of 1024 threads. This can be too few for applications such as web servers. Increasing the stack size to, say, 100 MB (i.e., reserving 100 MB, though not necessarily allocating 100 MB to the stack immediately) would limit the number of threads to about 20, which can be limiting even for simple GUI applications.
An interesting question is: why do we still have this limit on 64-bit platforms? I do not know the answer, but I assume that people are already used to some "stack best practices": be careful to allocate huge objects on the heap and, if needed, manually increase the stack size. Therefore, nobody has found it useful to add "huge" stack support on 64-bit platforms.
One aspect that nobody has mentioned yet:
A limited stack size is an error detection and containment mechanism.
Generally, the main job of the stack in C and C++ is to keep track of the call stack and local variables, and if the stack grows out of bounds, it is almost always an error in the design and/or the behaviour of the application.
If the stack were allowed to grow arbitrarily large, these errors (like infinite recursion) would be caught very late, only after the operating system's resources were exhausted. An arbitrary limit on the stack size prevents this. The actual size is not that important, apart from it being small enough to prevent system degradation.
It is just a default size. If you need more, you can get more - most often by telling the linker to allocate extra stack space.
The downside to having large stacks is that if you create many threads, they will need one stack each. If all the stacks are multiple MB in size but mostly unused, the space will be wasted.
You have to find the proper balance for your program.
Some people, like @BJovke, believe that virtual memory is essentially free. It is true that you don't need to have physical memory backing all of the virtual memory, but you do have to be able to at least hand out addresses for it.
However, on a typical 32-bit PC the size of the virtual memory is the same as the size of the physical memory - because we only have 32 bits for any address, virtual or not.
Because all threads in a process share the same address space, they have to divide it between them. And after the operating system has taken its part, there is "only" 2-3 GB left for an application. And that size is the limit for both the physical and the virtual memory, because there just aren't any more addresses.
For one thing, the stack is contiguous, so if you allocate 12 MB, you must remove 12 MB when you want to go below whatever you created. Moving objects around also becomes much harder. Here is a real-world example that may make things easier to understand:
Say you are stacking boxes around a room. Which is easier to manage:
Stacking boxes of any weight on top of each other, but when you need to get something from the bottom you have to undo your entire pile. If you want to take an item out of the pile and give it to someone else, you must take off all of the boxes above it and move the box to the other person's pile. (Stack only)
You put all of your boxes (except for really small boxes) in a special area where you do not stack stuff on top of other stuff, write down where you put each one on a piece of paper (a pointer), and put the paper on the pile. If you need to give a box to someone else, you just hand them the slip of paper from your pile, or give them a photocopy of the paper and leave the original where it was in your pile. (Stack + heap)
Those two examples are gross generalizations and there are some points that are blatantly wrong in the analogy but it is close enough that it hopefully will help you see the advantages in both cases.
Think of the stack in the order of near to far. Registers are close to the CPU (fast), the stack is a bit further (but still relatively close) and the heap is far away (slow access).
The stack lives in main memory too, of course, but since it is used continuously it probably never leaves the CPU cache(s), making access faster than the average heap access.
This is a reason to keep the stack reasonably sized; to keep it cached as much as possible. Allocating big stack objects (possibly automatically resizing the stack as you get overflows) goes against this principle.
So it's a good paradigm for performance, not just a left-over from old times.
Allocating large objects in a, say, 100MB stack would make it impossible on most machines to have them loaded at once into cache, which pretty much defeats the purpose of the stack.
The point of the stack is to have small objects that belong to the same scope (and are therefore usually needed together or close to each other) stored together at contiguous memory addresses, so that the program can have them all loaded into cache at the same time, minimizing cache misses and, in general, the time the CPU has to wait until it gets some missing piece of data from the slower RAM.
A 50MB object stored in the stack would not fit into the cache, meaning after every cache line there would be a CPU waiting time until the next piece of data is brought from RAM, meaning one would be clogging the call stack and not getting any significant benefit (in terms of speed) as compared to loading from the heap.
Many of the things you think you need a big stack for, can be done some other way.
Sedgewick's "Algorithms" has a couple of good examples of "removing" recursion from recursive algorithms such as QuickSort by replacing the recursion with iteration. In reality the algorithm is still recursive, and there is still a stack, but you allocate the sorting stack on the heap rather than using the runtime stack.
(I favor the second edition, with the algorithms given in Pascal. It can be bought used for eight bucks.)
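A minimal sketch of that transformation (Lomuto partitioning; the explicit "stack" of subranges lives on the heap in a std::vector):

#include <algorithm>
#include <utility>
#include <vector>

void quicksort(std::vector<int>& a) {
    if (a.empty()) return;
    std::vector<std::pair<int, int>> todo;     // explicit stack on the heap
    todo.push_back({0, (int)a.size() - 1});
    while (!todo.empty()) {
        auto [lo, hi] = todo.back();
        todo.pop_back();
        if (lo >= hi) continue;
        int pivot = a[hi], i = lo;             // Lomuto partition around a[hi]
        for (int j = lo; j < hi; ++j)
            if (a[j] < pivot) std::swap(a[i++], a[j]);
        std::swap(a[i], a[hi]);
        todo.push_back({lo, i - 1});           // "recurse" by pushing subranges
        todo.push_back({i + 1, hi});
    }
}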
Another way to look at it, is if you think you need a big stack, your code is inefficient. There is a better way that uses less stack.
If you could have an infinite stack, then every virtual address could potentially be used by the stack. If the stack can use every address, then there is no place for the heap to go: every address you picked for a heap variable could be overwritten by a growing stack.
To put it another way, variables on the stack and variables on the heap occupy the same virtual address space. We need some way of preventing the heap allocator from allocating data where the stack might grow into. A stack size is an easy way to do it. The heap allocator knows that the stack addresses are taken and so it uses something else.
I don't think there is any technical reason, but it would be a strange app that created just one huge super-object on the stack. Stack objects lack flexibility, and that becomes more problematic with increasing size - you cannot return without destroying them, and you cannot queue them to other threads.