I am trying to write a basic OS to better understand OS fundamentals and I am running into a strange problem. After switching to protected mode I jump into my kernel. In my kernel.cpp file I declare the following global variables (where IdtPointer_t and IdtEntry_t are both structs.)
IdtPointer_t idtPtr;
IdtEntry_t idtEntries[256];
This creates the idtPtr and idtEntries variables in the bss section.
Then later in my code when I do the following
IdtEntry_t* entry = &idtEntries[0];
the value returned by &idtEntries[0] isn't the correct address. Using GDB I have done the following
p &idtEntries[0]
(IdtEntry_t *) 0x87a0 <idtEntries>
p entry
(IdtEntry_t *) 0x87e0 <idtEntries+64>
There is a 64 byte difference between the locations of the two variables. Why does referencing the variable return a different address than where the variable is stored in memory?
Also, I am running this using the qemu i386 emulator.
Why does referencing the variable return a different address than where the variable is stored in memory? It does not. I strongly suspect that what GDB is displaying is not what you think it is displaying (although I'm no GDB expert).
Assuming you are developing this on a linux system, I suggest supplementing your observations with the output of 'nm' (or it's cross-compiler relative).
nm -n <elf file>
This will reliably give you a list of all the symbols in your kernel/OS and their addresses (sorted by numerical order). Then compare the addresses of 'idtEntries' and 'entry' against what you got in GDB.
I'm trying to calculate maximum stack usage of an embedded program using static analysis.
I've used the compiler flag -fstack-usage to get the maximum stack usage for each function and the flag -fdump-rtl-expand to generate a graph of all function calls.
The last missing ingredient is stack usage of built-in functions. (at the moment it's only memset)
I guess I could measure it some other way and put a constant into my script. However, I don't want a situation where the implementation of the built-in function changes in a new version of GCC and the value in my script stays the same.
Maybe there is some way to compile built-in functions with the flag -fstack-usage? Or some other way to measure their stack usage via static analysis?
Edit:
This question is not a duplicate of Stack Size Estimation. The other question is about estimating stack usage of an entire program while I asked how to estimate it for a single built-in library function. The other question doesn't even mention built-in library functions nor any of the answers for it does.
Approach 1 (dynamic analysis)
You could determine stack size at runtime by filling stack with a predefined pattern, executing memset and then checking how many bytes have been modified. This is slower and more involved as you need to compile a sample program, upload it to target (unless you have a simulator) and collect results. You'll also need to be careful about test data that you supply to the function as execution path may change depending on size, data alignment, etc.
For a real-world example of this approach check Abseil's code.
Approach 2 (static analysis)
In general static analysis of binary code is tricky (even disassembling it isn't trivial) and you'd need sophisticated symbolic execution machinery to deal with it (e.g. miasm). But in most cases you can safely rely on detecting patterns which your compiler uses to allocate frames. E.g. for x86_64 GCC you could do something like:
objdump -d /lib64/libc.so.6 | sed -ne '/<__memset_x86_64>:/,/^$/p' > memset.d
NUM_PUSHES=$(grep -c pushq memset.d)
LOCALS=$(sed -ne '/sub .*%rsp/{ s/.*sub \+\$\([^,]\+\),%rsp.*/\1/; p }' memset.d)
LOCALS=$(printf '%d' $LOCALS) # Unhex
echo $(( LOCALS + 8 * NUM_PUSHES ))
Note that this simple approach produces a conservative estimate (getting more precise result is doable but would require a path-sensitive analysis which requires proper parsing, building control-flow graph, etc.) and does not handle nested function calls (can be easily added but should probly be done in a language more expressive than shell).
AVR assembly is in general more complicated because you can't easily detect allocation of space for local variables (modification of stack pointer is split across several in, out and adiw instructions so would require non-trivial parsing in e.g. Python). Simple functions like memset or memcpy don't use local variables so you can still get away with simple greps:
NUM_PUSHES=$(grep -c 'push ' memset.d)
NUM_RCALLS=$(grep -c 'rcall \+\.+0' memset.d)
# A safety check for functions which we can't handle
if grep -qi 'out \+0x3[de]' memset.d; then
echo >&2 'Unable to parse stack modification'
exit 1
fi
echo $((NUM_PUSHES + 2 * NUM_RCALLS))
This is not a great answer but it still may be useful.
Many of built-in functions are very simple. For example memset can be implemented just as a simple loop. From my observation it appears that compiler avoid using stack if it can just use registers (which makes perfect sense). Only very long function need more stack. All that shorter ones need is the return address for ret instruction.
It is relatively safe to assume that simple built-in functions don't use stack at all aside from instructions call and ret, so the amount of memory is equal to size of pointer to a function. (2 bytes in my case)
Keep in mind that embedded systems don't always have Von Neumann architecture and they often store instructions and data in separate memories. Size of pointers to function and data may be different.
i can create this array:
int Array[490000000];
cout << "Array Byte= " << sizeof(Array) << endl;
Array byte = 1,960,000,000 byte and convert gb = 1,96 GB about 2 gb whatever.
but i cant create same time these:
int Array[490000000];
int Array2[490000000];
it give error why ? sorry for bad englisgh :)
Also i checked my compiler like this:
printf("%d\n", sizeof(char *));
it gives me 8.
C++ programs are not usually compiled to have 2Gb+ of stack space, regardless of whether it is compiled in 32-bit mode or 64-bit mode. Stack space can be increased as part of the compiler options, but even in the scenario where it is permissible to set the stack size that high, it's still not an ideomatic solution or recommended.
If you need an array of 2Gb, you should use std::vector<int> Array(490'000'000); (strongly recommended) or a manually created array, i.e. int* Array = new int[490'000'000]; (remember that manually allocated memory must be manually deallocated with delete[]), either of which will allocate dynamic memory. You'll still want to be compiling in 64-bit mode, since this will brush up against the maximum memory limit of your application if you don't, but in your scenario, it's not strictly necessary, since 2Gb is less than the maximum memory of a 32-bit application.
But still I can't use more than 2 GB :( why?
The C++ language does not have semantics to modify (nor report) how much automatic memory is available (or at least I have not seen it.) The compilers rely on the OS to provide some 'useful' amount. You will have to search (google? your hw documents, user's manuals, etc) for how much. This limit is 'machine' dependent, in that some machines do not have as much memory as you may want.
On Ubuntu, for the last few releases, the Posix function ::pthread_attr_getstacksize(...) reports 8 M Bytes per thread. (I am not sure of the proper terminology, but) what linux calls 'Stack' is the resource that the C++ compiler uses for the automatic memory. For this release of OS and compiler, the limit for automatic var's is thus 8M (much smaller than 2G).
I suppose that because the next machine might have more memory, the compiler might be given a bigger automatic memory, and I've seen no semantics that will limit the size of your array based on memory size of the machine performing the compile ...
there can can be no compile-time-report that the stack will overflow.
I see Posix has a function suggesting a way to adjust size of stack. I've not tried it.
I have also found Ubuntu commands that can report and adjust size of various memory issues.
From https://www.nics.tennessee.edu/:
The command to modify limits varies by shell. The C shell (csh) and
its derivatives (such as tcsh) use the limit command to modify limits.
The Bourne shell (sh) and its derivatives (such as ksh and bash) use
the ulimit command. The syntax for these commands varies slightly and
is shown below. More detailed information can be found in the man page
for the shell you are using.
One minor experiment ... the command prompt
& dtb_chimes
launches this work-in-progress app which uses Posix and reports 8 MByte stack (automatic var)
With the ulimit prefix command
$ ulimit -S -s 131072 ; dtb_chimes
the app now reports 134,217,728
./dtb_chimes_ut
default Stack size: 134,217,728
argc: 1
1 ./dtb_chimes_ut
But I have not confirmed the actual allocation ... and this is still a lot smaller than 1.96 GBytes ... but, maybe you can get there.
Note: I strongly recommend std::vector versus big array.
On my Ubuntu desktop, there is 4 GByte total dram (I have memory test utilities), and my dynamic memory is limited to about 3.5 GB. Again, the amount of dynamic memory is machine dependent.
64 bits address a lot more memory than I can afford.
I'm have an Arduino Uno R3. I'm making logical objects for each of my sensors using C++. The Arduino has very limited on-board memory 32KB*, and, on average, my compiled objects are coming out around 6KB*.
I am already using the smallest possible data types required, in an attempt to minimize my memory footprint. Is there a compiler flag to minimize the size of the binary, or do I need to use shorter variable and function names, less functions, etc. to minimize my code base?
Also, any other tips or words of advice for minimizing binary size would be appreciated.
*It may not be measured in KB (as I don't have it sitting in front of me), but 1 object is approximately 1/5 of my total memory size, which is prompting my concern.
There are lots of techniques to reduce binary size in addition to what us2012 and others mentioned in the comments, summing them up with some points of my own:
Use -Os to make gcc/g++ optimize for size.
Use -ffunction-sections -fdata-sections to separate each function or data into distinct sections within the translation unit. Combine it with the linker option -Wl,--gc-sections to get rid of any unreferenced sections.
Run strip with at least the following options: -s -R .comment -R .gnu.version. It can be combined with --strip-unneeded to remove all symbols that are not necessary for relocation processing.
If your code does not contain c++-exception-handling you can save a lot of space (up to 30k after all optimize steps mentioned by Tuxdude).
Therefore you have to provide the following flag:
-fno-exceptions
But even if you don't use exceptions, the exception handling can be included!
Check the following steps:
no usage of new, delete. If you really need it replace them by malloc/free wrappers. For an example search for "tinynew.cpp"
provide function for pure virtual call, e.g.extern "C" void __cxa_pure_virtual() { while(1); }
overwrite __gnu_cxx::__verbose_terminate_handler(). It handles unhandled exceptions and does name demangling, which is quite large! (e.g d_print_comp.part.10 with 9.5k or d_type with 1.8k)
Cheers
Flo
I want to do DFS on a 100 X 100 array. (Say elements of array represents graph nodes) So assuming worst case, depth of recursive function calls can go upto 10000 with each call taking upto say 20 bytes. So is it feasible means is there a possibility of stackoverflow?
What is the maximum size of stack in C/C++?
Please specify for gcc for both
1) cygwin on Windows
2) Unix
What are the general limits?
In Visual Studio the default stack size is 1 MB i think, so with a recursion depth of 10,000 each stack frame can be at most ~100 bytes which should be sufficient for a DFS algorithm.
Most compilers including Visual Studio let you specify the stack size. On some (all?) linux flavours the stack size isn't part of the executable but an environment variable in the OS. You can then check the stack size with ulimit -s and set it to a new value with for example ulimit -s 16384.
Here's a link with default stack sizes for gcc.
DFS without recursion:
std::stack<Node> dfs;
dfs.push(start);
do {
Node top = dfs.top();
if (top is what we are looking for) {
break;
}
dfs.pop();
for (outgoing nodes from top) {
dfs.push(outgoing node);
}
} while (!dfs.empty())
Stacks for threads are often smaller.
You can change the default at link time,
or change at run time also.
For reference, some defaults are:
glibc i386, x86_64: 7.4 MB
Tru64 5.1: 5.2 MB
Cygwin: 1.8 MB
Solaris 7..10: 1 MB
MacOS X 10.5: 460 KB
AIX 5: 98 KB
OpenBSD 4.0: 64 KB
HP-UX 11: 16 KB
Platform-dependent, toolchain-dependent, ulimit-dependent, parameter-dependent.... It is not at all specified, and there are many static and dynamic properties that can influence it.
Yes, there is a possibility of stack overflow. The C and C++ standard do not dictate things like stack depth, those are generally an environmental issue.
Most decent development environments and/or operating systems will let you tailor the stack size of a process, either at link or load time.
You should specify which OS and development environment you're using for more targeted assistance.
For example, under Ubuntu Karmic Koala, the default for gcc is 2M reserved and 4K committed but this can be changed when you link the program. Use the --stack option of ld to do that.
I just ran out of stack at work, it was a database and it was running some threads, basically the previous developer had thrown a big array on the stack, and the stack was low anyway. The software was compiled using Microsoft Visual Studio 2015.
Even though the thread had run out of stack, it silently failed and continued on, it only stack overflowed when it came to access the contents of the data on the stack.
The best advice i can give is to not declare arrays on the stack - especially in complex applications and particularly in threads, instead use heap. That's what it's there for ;)
Also just keep in mind it may not fail immediately when declaring the stack, but only on access. My guess is that the compiler declares stack under windows "optimistically", i.e. it will assume that the stack has been declared and is sufficiently sized until it comes to use it and then finds out that the stack isn't there.
Different operating systems may have different stack declaration policies. Please leave a comment if you know what these policies are.
I am not sure what you mean by doing a depth first search on a rectangular array, but I assume you know what you are doing.
If the stack limit is a problem you should be able to convert your recursive solution into an iterative solution that pushes intermediate values onto a stack which is allocated from the heap.
(Added 26 Sept. 2020)
On 24 Oct. 2009, as #pixelbeat first pointed out here, Bruno Haible empirically discovered the following default thread stack sizes for several systems. He said that in a multithreaded program, "the default thread stack size is" as follows. I added in the "Actual" size column because #Peter.Cordes indicates in his comments below my answer, however, that the odd tested numbers shown below do not include all of the thread stack, since some of it was used in initialization. If I run ulimit -s to see "the maximum stack size" that my Linux computer is configured for, it outputs 8192 kB, which is exactly 8 MB, not the odd 7.4 MB listed in the table below for my x86-64 computer with the gcc compiler and glibc. So, you can probably add a little to the numbers in the table below to get the actual full stack size for a given thread.
Note also that the below "Tested" column units are all in MB and KB (base 1000 numbers), NOT MiB and KiB (base 1024 numbers). I've proven this to myself by verifying the 7.4 MB case.
Thread stack sizes
System and std library Tested Actual
---------------------- ------ ------
- glibc i386, x86_64 7.4 MB 8 MiB (8192 KiB, as shown by `ulimit -s`)
- Tru64 5.1 5.2 MB ?
- Cygwin 1.8 MB ?
- Solaris 7..10 1 MB ?
- MacOS X 10.5 460 KB ?
- AIX 5 98 KB ?
- OpenBSD 4.0 64 KB ?
- HP-UX 11 16 KB ?
Bruno Haible also stated that:
32 KB is more than you can safely allocate on the stack in a multithreaded program
And he said:
And the default stack size for sigaltstack, SIGSTKSZ, is
only 16 KB on some platforms: IRIX, OSF/1, Haiku.
only 8 KB on some platforms: glibc, NetBSD, OpenBSD, HP-UX, Solaris.
only 4 KB on some platforms: AIX.
Bruno
He wrote the following simple Linux C program to empirically determine the above values. You can run it on your system today to quickly see what your maximum thread stack size is, or you can run it online on GDBOnline here: https://onlinegdb.com/rkO9JnaHD.
Explanation: It simply creates a single new thread, so as to check the thread stack size and NOT the program stack size, in case they differ, then it has that thread repeatedly allocate 128 bytes of memory on the stack (NOT the heap), using the Linux alloca() call, after which it writes a 0 to the first byte of this new memory block, and then it prints out how many total bytes it has allocated. It repeats this process, allocating 128 more bytes on the stack each time, until the program crashes with a Segmentation fault (core dumped) error. The last value printed is the estimated maximum thread stack size allowed for your system.
Important note: alloca() allocates on the stack: even though this looks like dynamic memory allocation onto the heap, similar to a malloc() call, alloca() does NOT dynamically allocate onto the heap. Rather, alloca() is a specialized Linux function to "pseudo-dynamically" (I'm not sure what I'd call this, so that's the term I chose) allocate directly onto the stack as though it was statically-allocated memory. Stack memory used and returned by alloca() is scoped at the function-level, and is therefore "automatically freed when the function that called alloca() returns to its caller." That's why its static scope isn't exited and memory allocated by alloca() is NOT freed each time a for loop iteration is completed and the end of the for loop scope is reached. See man 3 alloca for details. Here's the pertinent quote (emphasis added):
DESCRIPTION
The alloca() function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca() returns to its caller.
RETURN VALUE
The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.
Here is Bruno Haible's program from 24 Oct. 2009, copied directly from the GNU mailing list here:
Again, you can run it live online here.
// By Bruno Haible
// 24 Oct. 2009
// Source: https://lists.gnu.org/archive/html/bug-coreutils/2009-10/msg00262.html
// =============== Program for determining the default thread stack size =========
#include <alloca.h>
#include <pthread.h>
#include <stdio.h>
void* threadfunc (void*p) {
int n = 0;
for (;;) {
printf("Allocated %d bytes\n", n);
fflush(stdout);
n += 128;
*((volatile char *) alloca(128)) = 0;
}
}
int main()
{
pthread_t thread;
pthread_create(&thread, NULL, threadfunc, NULL);
for (;;) {}
}
When I run it on GDBOnline using the link above, I get the exact same results each time I run it, as both a C and a C++17 program. It takes about 10 seconds or so to run. Here are the last several lines of the output:
Allocated 7449856 bytes
Allocated 7449984 bytes
Allocated 7450112 bytes
Allocated 7450240 bytes
Allocated 7450368 bytes
Allocated 7450496 bytes
Allocated 7450624 bytes
Allocated 7450752 bytes
Allocated 7450880 bytes
Segmentation fault (core dumped)
So, the thread stack size is ~7.45 MB for this system, as Bruno mentioned above (7.4 MB).
I've made a few changes to the program, mostly just for clarity, but also for efficiency, and a bit for learning.
Summary of my changes:
[learning] I passed in BYTES_TO_ALLOCATE_EACH_LOOP as an argument to the threadfunc() just for practice passing in and using generic void* arguments in C.
Note: This is also the required function prototype, as required by the pthread_create() function, for the callback function (threadfunc() in my case) passed to pthread_create(). See: https://www.man7.org/linux/man-pages/man3/pthread_create.3.html.
[efficiency] I made the main thread sleep instead of wastefully spinning.
[clarity] I added more-verbose variable names, such as BYTES_TO_ALLOCATE_EACH_LOOP and bytes_allocated.
[clarity] I changed this:
*((volatile char *) alloca(128)) = 0;
to this:
volatile uint8_t * byte_buff =
(volatile uint8_t *)alloca(BYTES_TO_ALLOCATE_EACH_LOOP);
byte_buff[0] = 0;
Here is my modified test program, which does exactly the same thing as Bruno's, and even has the same results:
You can run it online here, or download it from my repo here. If you choose to run it locally from my repo, here's the build and run commands I used for testing:
Build and run it as a C program:
mkdir -p bin && \
gcc -Wall -Werror -g3 -O3 -std=c11 -pthread -o bin/tmp \
onlinegdb--empirically_determine_max_thread_stack_size_GS_version.c && \
time bin/tmp
Build and run it as a C++ program:
mkdir -p bin && \
g++ -Wall -Werror -g3 -O3 -std=c++17 -pthread -o bin/tmp \
onlinegdb--empirically_determine_max_thread_stack_size_GS_version.c && \
time bin/tmp
It takes < 0.5 seconds to run locally on a fast computer with a thread stack size of ~7.4 MB.
Here's the program:
// =============== Program for determining the default thread stack size =========
// Modified by Gabriel Staples, 26 Sept. 2020
// Originally by Bruno Haible
// 24 Oct. 2009
// Source: https://lists.gnu.org/archive/html/bug-coreutils/2009-10/msg00262.html
#include <alloca.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h> // sleep
/// Thread function to repeatedly allocate memory within a thread, printing
/// the total memory allocated each time, until the program crashes. The last
/// value printed before the crash indicates how big a thread's stack size is.
///
/// Note: passing in a `uint32_t` as a `void *` type here is for practice,
/// to learn how to pass in ANY type to a func by using a `void *` parameter.
/// This is also the required function prototype, as required by the
/// `pthread_create()` function, for the callback function (this function)
/// passed to `pthread_create()`. See:
/// https://www.man7.org/linux/man-pages/man3/pthread_create.3.html
void* threadfunc(void* bytes_to_allocate_each_loop)
{
const uint32_t BYTES_TO_ALLOCATE_EACH_LOOP =
*(uint32_t*)bytes_to_allocate_each_loop;
uint32_t bytes_allocated = 0;
while (true)
{
printf("bytes_allocated = %u\n", bytes_allocated);
fflush(stdout);
// NB: it appears that you don't necessarily need `volatile` here,
// but you DO definitely need to actually use (ex: write to) the
// memory allocated by `alloca()`, as we do below, or else the
// `alloca()` call does seem to get optimized out on some systems,
// making this whole program just run infinitely forever without
// ever hitting the expected segmentation fault.
volatile uint8_t * byte_buff =
(volatile uint8_t *)alloca(BYTES_TO_ALLOCATE_EACH_LOOP);
byte_buff[0] = 0;
bytes_allocated += BYTES_TO_ALLOCATE_EACH_LOOP;
}
}
int main()
{
const uint32_t BYTES_TO_ALLOCATE_EACH_LOOP = 128;
pthread_t thread;
pthread_create(&thread, NULL, threadfunc,
(void*)(&BYTES_TO_ALLOCATE_EACH_LOOP));
while (true)
{
const unsigned int SLEEP_SEC = 10000;
sleep(SLEEP_SEC);
}
return 0;
}
Sample output (same results as Bruno Haible's original program):
bytes_allocated = 7450240
bytes_allocated = 7450368
bytes_allocated = 7450496
bytes_allocated = 7450624
bytes_allocated = 7450752
bytes_allocated = 7450880
Segmentation fault (core dumped)