As I was reviewing memory organisation and storage in C/C++ I came upon this:
"Initialized data segment, usually called simply the Data Segment. A data segment is a portion of virtual address space of a program, which contains the global variables and static variables that are initialized by the programmer.
Note that, data segment is not read-only, since the values of the variables can be altered at run time."
(found in http://www.geeksforgeeks.org/memory-layout-of-c-program/ )
I was under the impression that a static and/or global variable remained immutable throughout an application; I thought this was the point of their existence. Can they really be altered at run time?
Can they really be altered at run time?
Yes. Unless you declare them as const, of course.
I was under the impression that a static and/or global variable
remained immutable throughout an application; I thought this was the
point of their existence.
No, you're describing constants. Variables with so-called static storage duration have, as the name implies, a different lifetime. [basic.stc.static]:
All variables which do not have dynamic storage duration, do not have
thread storage duration, and are not local have static storage
duration. The storage for these entities shall last for the duration
of the program (3.6.2, 3.6.3).
Just think about cout, a global stream object that you modify by inserting data into it.
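A minimal sketch of that (the names here are just for illustration): a plain global and a static local are perfectly mutable at run time; only the const one is locked down.

#include <iostream>

int counter = 1;          // global variable: static storage duration, writable
const int limit = 10;     // global constant: cannot be modified

int next_id() {
    static int id = 100;  // static local: static storage duration, but mutable
    return ++id;          // altered on every call
}

int main() {
    counter = 5;          // fine: it's a variable, not a constant
    // limit = 20;        // error: assignment of read-only variable 'limit'
    std::cout << next_id() << '\n';   // 101
    std::cout << next_id() << '\n';   // 102
    std::cout << counter << '\n';     // 5
}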
You'll generally find better documentation on a site that more people take an interest in updating, for example - from Wikipedia:
In computing, a data segment (often denoted .data) is a portion of an object file or the corresponding virtual address space of a program that contains initialized static variables, that is, global variables and static local variables. The size of this segment is determined by the size of the values in the program's source code, and does not change at run time.
The data segment is read-write, since the values of variables can be altered at run time. This is in contrast to the read-only data segment (rodata segment or .rodata), which contains static constants rather than variables; it also contrasts to the code segment, also known as the text segment, which is read-only on many architectures. Uninitialized data, both variables and constants, is instead in the BSS segment.
So, it's just a matter of definition:
the data segment holds the read-write variables
the "read only" data segment holds the constants
On some old/hokey systems they might not bother with a read-only data segment and just lump it all together - the main thing about the read-only segment is that a few more bugs get reported more dramatically, rather than letting the program corrupt that data and potentially spew bogus results. That's probably why .data is general and sometime later - as OS/compiler writers had time and motivation to care - .rodata ended up being contrasted with it, but .data wasn't renamed to e.g. .rwdata. These names - .data, .rodata, .text, .bss etc. - were and are often used in assembly languages to denote where variables should be located.
As far as things go... global variables and static variables are similar in that the [possibly virtual] memory address for them - and indeed their total size - can typically be calculated (at least relative to some supporting CPU "segment" register that's left at a convenient value most of the time) at compile time. That's in contrast to automatic (stack) and dynamic (heap) variables, where the memory's transient. Most systems only have control over write-access to memory on a per-page basis (e.g. 4k, 8k), so it's far less practical to keep granting and removing write access to put transient const automatic and heap-based variables into memory that seems read-only to the process, and it's impractical when you consider the race conditions in a threaded application. That's why this whole distinction between read-write and read-only memory's normally discussed in the context of global and static variables.
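If you want to see this on a concrete toolchain, here's a rough sketch - the section names (.data/.rodata/.bss) and the exact placement are typical ELF/GCC behaviour, not something the language standard promises:

// sections.cpp - placement shown in the comments is what a typical ELF/GCC
// toolchain does; the standard only talks about storage duration, not sections.
int writable = 42;        // initialized, read-write  -> usually .data
const int readonly = 42;  // initialized, read-only   -> usually .rodata (or folded away)
int zeroed;               // zero-initialized         -> usually .bss

int main() { return writable + readonly + zeroed; }

Building that and inspecting the object file with a tool like nm or objdump -t should show the symbols in roughly those sections, though an optimizing compiler is free to fold the const away entirely.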
Related
Different sources say different things - some StackOverflow answers say that it is allocated at compile time, while others say it is "defined" at compile time and allocated at the very beginning of runtime ("load time" is what some called it). When is static memory exactly allocated in C/C++? (If it has to do with "defining" variables - can someone tell me what it means to "define" a variable at the memory level? That would be highly appreciated!)
Also, how would you during runtime set a pointer to the start of the allocated static memory?
In typical tools, memory with static storage duration is arranged in multiple steps:
The compiler generates data in object modules (likely passing through some form of assembly code) that describes needs for various kinds of memory: memory initialized to zero, memory initialized to particular values and is read-only thereafter, memory initialized to particular values and may be modified, memory that does not need to be initialized, and possibly others. The compiler also includes initial data as necessary, information about symbols that refer to various places in the required memory, and other information. At this point, the allocation of memory is in forms roughly like “8 bytes are needed in the constant data section, and a symbol called foo should be set to their address.”
The linker combines this information into similar information in an executable file. It also resolves some or all information about symbols. At this point, the allocation of memory is in forms like “The initialized non-constant data section requires 3048 bytes, and here is the initial data for it. When it is assigned a virtual address, the following symbols should be adjusted: bar is at offset 124 from the start of the section, baz is at offset 900…”
The program loader reads this information, allocates locations in the virtual address space for it, and may read some of the data from the executable file into memory or inform the operating system where the data is to be found when it is needed. At this point, the places in the code that refer to various symbols have been modified according to the final values of those symbols.
The operating system allocates physical memory for the virtual addresses. Often, this is done “on demand” in pieces (memory pages) when a process attempts to access the memory in a specific page, rather than being done at the time the program is initially loaded.
All-in-all, static memory is not allocated at any particular time. It is a combination of many activities. The effect on the program is largely that it occurs the same as if it were all allocated when the program started, but the physical memory might only be allocated just before an instruction actually executes. (The physical memory can even be taken away from the process and restored to it later.)
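As for the second part of the question: there is no portable way to get a pointer to "the start of static memory" as a whole. What you can do at run time is take the address of any particular object with static storage duration - its name already denotes storage whose (virtual) address was fixed by the linker/loader. A small sketch (the array name and size are just made up for illustration; some platforms also expose linker symbols like __data_start, but that is outside the standard):

#include <cstdio>

static int table[256];    // an object with static storage duration, zero-initialized

int main() {
    int *start = table;           // pointer to its first element
    int *end   = table + 256;     // one past its last element
    std::printf("table occupies %zu bytes starting at %p\n",
                sizeof table, static_cast<void*>(start));
    (void)end;                    // silence unused-variable warnings in this sketch
}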
The C standard says only this:
C11 5.1.2p1
[...] All objects with static storage duration shall be initialized (set to their initial values) before program startup. The manner and timing of such initialization are otherwise unspecified.
and
C11 6.2.4p2-3
2 The lifetime of an object is the portion of program execution during which storage is guaranteed to be reserved for it. An object exists, has a constant address,33) and retains its last-stored value throughout its lifetime.34) If an object is referred to outside of its lifetime, the behavior is undefined. The value of a pointer becomes indeterminate when the object it points to (or just past) reaches the end of its lifetime.
3 An object whose identifier is declared without the storage-class specifier _Thread_local, and either with external or internal linkage or with the storage-class specifier static, has static storage duration. Its lifetime is the entire execution of the program and its stored value is initialized only once, prior to program startup.
But... this is further complicated by the as-if rule: the actual implementation needs to do this only as far as observable side effects go.
In fact, on Linux for example, one could argue that variables with static storage duration are initialized and allocated by the compiler and the linker when producing the executable file. When the program is run, the dynamic linker (ld.so) then prepares the program segments so that the initialized data is memory-mapped (mmap) from the executable image into RAM, and default (zero-initialized) data is mapped from zeroed pages.
While the virtual memory is allocated by the compiler, linker and dynamic linker, the actual writable RAM page frames are allocated only when you write to a variable on a page for the first time...
but you do not need to know about this in basic cases. It is as if the memory for variables with static storage duration were allocated and initialized just before main was entered, even though this is not actually the case.
Static memory is allocated in two steps.
Step 1 is carried out by the linker as it lays out the executable image and says where the static variables live in relative address space.
Step 2 is carried out by the loader when the process memory is actually allocated.
In C++, static objects are initialized before entering main. If you're not careful with your code you can see objects that are still zeros even though they have constructors that would always change that. (The compiler does as much constant evaluation as it can, so toy examples won't show it.)
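A minimal sketch of what "initialized before entering main" looks like in practice - a global object whose constructor has a visible side effect runs before main does:

#include <iostream>

struct Tracer {
    Tracer() { std::cout << "constructed before main\n"; }
};

Tracer global_tracer;   // dynamic initialization: its constructor runs before main

int main() {
    std::cout << "main begins\n";
}
// Expected output:
//   constructed before main
//   main begins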
I was asked this question and I am not quite sure about the answer. I know that the values (contents) of local variables live on the stack, and dynamically allocated data lives on the heap (in C/C++). But:
1- Where are the addresses of those local variables stored? How does the program know where on the stack it should look for each of the local variables? Are these references (the address of each variable) saved in the data segment? What about the addresses of other variable types (global, pointer, ...)?
2- Am I right that programs directly (not using pop/push) read/write to different addresses in stack segment when dealing with local variables?
The compiler will track where, relative to the top of the stack, each argument and local variable is located. And if possible, the compiler will use registers for "important" variables (such as loop counters) - it will use statistics of how many times each variable is used to see which ones are "hot" (used a lot) and which are "cold" (not used much).
Note that "addresses of local variables" doesn't always apply. Registers have no (direct) address [except in the TI TMS9900 processor and a few others, where registers and memory have slightly blurred lines].
The compiler will know where each of the things are - it's what compilers do - just like it knows WHICH variable has been stored where in the data section. Exactly how this is done is the subject of a small book. For now, just trust that the compiler does this.
Yes, nearly all processors today allow reads and writes from stack + offset (where offset is typically negative, so further down the stack, as the stack normally grows towards zero).
Although the stack sometimes counts as the "data segment", it's typically its own section of memory on modern machines - and if you have multiple threads, each thread will have its own stack.
First, for the record, let's just note that the answers to both questions are subject to the compiler implementation and are not dictated by the language standard (neither C nor C++).
Where are the addresses of those local variables stored?
The symbols (names of functions and variables) are translated into addresses during compilation, i.e., they are not stored anywhere in the memory of the executed program:
Addresses of functions are in the code-segment of the executable image
They are constant throughout the execution of the program
Addresses of static and/or global variables are in the data-segment of the executable image
They are constant throughout the execution of the program
Addresses of non-static local variables are in the stack of the executable image
They may be different each time the function where these variables are declared is invoked
Am I right that programs directly (not using pop/push) read/write to different addresses in stack segment when dealing with local variables?
Depends on your platform (underlying HW architecture + designated compiler).
Re #2: "the stack" is of stack frames, not separate local variables. While individual values are sometimes pushed and popped (e.g., return addresses), local variables are normally created all at once by simply adjusting the stack pointer. That "creates" an area on the stack without storing any particular values there (which is why uninitialized variables might have any value), and then offsets from the stack pointer are used to find the individual variables.
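A hedged illustration of that: taking the addresses of a few locals forces them into the frame (otherwise they might live only in registers), and the exact layout, spacing and order is entirely up to the compiler.

#include <cstdio>

void frame_demo() {
    int a = 1;
    int b = 2;
    int c;          // uninitialized: indeterminate until the assignment below
    c = a + b;
    // All three typically sit in the same stack frame, addressed as offsets
    // from the stack or frame pointer rather than pushed/popped individually.
    std::printf("&a=%p &b=%p &c=%p c=%d\n",
                static_cast<void*>(&a),
                static_cast<void*>(&b),
                static_cast<void*>(&c), c);
}

int main() { frame_demo(); }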
See also a similar CS SE question.
Since global and static variables are initialized to 0 by default, why are local variables not initialized to 0 by default as well?
Because such zero-initializations take execution time. It would make your program significantly slower. Each time you call a function, the program would have to execute pointless overhead code, which sets the variables to zero.
Static variables persist for the whole lifetime of the program, so there you can afford the luxury of zero-initializing them, because they are only initialized once. Locals, by contrast, would have to be initialized at run time, on every call.
It is not uncommon in realtime systems to enable a compiler option which stops the zero initialization of static storage objects as well. Such an option makes the program non-standard, but also makes it start up faster.
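A small sketch of the difference (the variable names are just for illustration) - the static one is guaranteed to start at zero, the automatic one is indeterminate until you write to it, and you only pay for a zero when you ask for one:

#include <iostream>

int global_counter;            // static storage duration: guaranteed to start at 0

void f() {
    int scratch;               // automatic: indeterminate; reading it now would be UB
    scratch = 0;               // you only pay for the zero if you ask for it
    std::cout << scratch << '\n';
}

int main() {
    std::cout << global_counter << '\n';   // prints 0, guaranteed
    f();
}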
This is because global and static variables live in different memory regions than local variables.
uninitialized static and global variables live in the .bss segment, which is a memory region that is guaranteed to be initialized to zero on program startup, before the program enters `main'
explicitly initialized static and global variables are part of the actual application file, their value is determined at compile-time and loaded into memory together with the application
local variables are dynamically generated at runtime, by growing the stack. If your stack grows over a memory region that holds garbage, then your uninitialized local variables will contain garbage (garbage in, garbage out).
Because that would take time, and it's not always the case that you need them to be zero.
The allocation of local variables (typically on the CPU's hardware stack) is very fast, much less than one instruction per variable and basically independent of the size of the variables.
So any initialization code (which generally would not be independent of the size of the variables) would add a relatively massive amount of overhead, compared to the allocation, and since you cannot be sure that the initialization is needed, it would be very disruptive when optimizing for performance.
Global/static variables are different, they generally live in a segment of the program's binary that is set to 0 by the program loader anyway, so you get that "for free".
Mainly historical. Back when C was being defined, the zero initialization of static variables was handled automatically by the OS, and would occur anyway, whereas the zero initialization of local variables would require extra runtime code. The first is still true today on a lot of systems (including all Unix and Windows). The second is far less of an issue, however; most compilers would detect a superfluous initialization in most cases and skip it, and in the cases where the compiler couldn't do so, the rest of the code would be complicated enough that the time required for the initialization wouldn't be measurable. You can still construct special cases where this wouldn't be the case, but they're certainly very rare. However, the original C was specified like this, and none of the committees have reviewed the issue since.
Global and static variables are stored in the data segment [data in the uninitialised data segment is initialized by the kernel to arithmetic 0 before the program starts executing], while local variables are stored on the call stack.
The global or static variables that are initialized by you explicitly will be stored in the .data segment (initialized data) and the uninitialized global or static variables are stored in the .bss (uninitialized data).
This .bss is not stored in the compiled .obj files because there is no data available for these variables (remember you have not initialized them with any certain data).
Now, when the OS loads an exe, it just looks at the size of the .bss segment, allocates that much memory, and zero-initializes it for you (as part of exec). That's how the .bss segment ends up initialized to zero.
The local variables are not initialized because there is no built-in need to initialize them. They live on the stack, which is simply carved out as the program runs, so why do extra initialization of local variables and make the program slower?
Suppose you need to call a function 100 times, and suppose its local variables were initialised to 0 every time... oops, there would be extra overhead and wasted time on every call.
On the other hand, global variables are initialised only once, so we can afford their default initialisation to 0.
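Put differently, you pay for initialization only where you write it; a rough sketch (the function names are made up):

int accumulate(const int *data, int n) {
    int sum = 0;                 // explicit: this zero is genuinely needed
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

int scale(int x) {
    int result;                  // no initializer: no hidden zeroing cost
    result = x * 2;              // first use is a write, so this is fine
    return result;
}

int main() {
    int data[3] = {1, 2, 3};
    return accumulate(data, 3) + scale(10) - 26;   // exits with 0
}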
A four-byte memory slot is reserved for every defined integer. An uninitialised variable maintains the old value of that slot; hence, the initial value is somehow randomised.
int x = 5; // definition with initialisation
As far as I know, this holds in most C++ compilers for scoped variables. But when it comes to global variables, a value of zero will be set.
int x; // uninitialised definition
Why does the C++ compiler behave differently regarding the initial value of global and scoped variables?
Is it fundamental?
Namespace-level variables (which means globals) have static storage duration, and as per the Standard, all variables with static storage duration are zero-initialized, which means all bits are set to 0:
§3.6.2/2 from the C++ Standard (n3242) says,
Variables with static storage duration (3.7.1) or thread storage duration (3.7.2) shall be zero-initialized (8.5) before any other initialization takes place.
In the case of local variables with automatic storage duration, the Standard imposes no such requirement on compilers. So automatic variables are usually left uninitialized for performance reasons — almost all major compilers choose this approach, though a compiler is free to initialize automatic variables as well.
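A small sketch of what that quote means in practice - zero-initialization happens first, so even a variable whose dynamic initializer runs later is never seen holding garbage (names are just for illustration):

#include <iostream>

int compute();                 // defined below

int configured = compute();    // zero-initialized first, then dynamically initialized
int raw;                       // only zero-initialized; stays 0

int compute() {
    // By the time this runs, 'configured' has already been zero-initialized,
    // so reading it here yields 0, not garbage.
    return configured + 41;
}

int main() {
    std::cout << configured << ' ' << raw << '\n';   // prints "41 0"
}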
"A Four bytes memory slot is reserved for every defined integer.".
No, it isn't. Disregarding the "4 bytes" size, the main problem with the statement is that modern compilers often find a new location for a variable each time it's assigned to. This can be a register or some place in memory. There's a lot of smartness involved.
An uninitialized variable isn't written to, so in general there's not even a place assigned for it. Trying to read "it" might not produce a value at all; the compiler can fail outright to generate code for that.
Now globals are another matter. Since they can be read and written from anywhere, the compiler can't just find new places for them on each write. They necessarily have to stick to one place, and it can't realistically be a register. Often they're all allocated together in one chunk of memory. Zeroing that chunk can typically be done very efficiently; that's why globals are different.
As you might expect there are efficiency driven reasons behind this behavior as well.
Stack space is generally "allocated" simply by adjusting the stack pointer.
If you have 32 bytes of simple variables in a function then the compiler emits an instruction equivalent to "sp = sp - 32"
Any initialization of those variables would take additional code and execution time - hence they end up being initialized to apparently random values.
Global variables are another beast entirely.
Simple variables are effectively allocated by the program loader and can be located in what is commonly called "BSS". These variables take almost no space at all in the executable file. All of them can be merged together into a single block - so the executable image needs only specify the size of the block. Since the OS must ensure that a new process doesn't get to see any left-over data in memory from some now dead process, the memory needs to be filled with something - and you might as well fill it with zeros.
Global variables that are initialized to non-zeros actually do take up space in the executable file, they appear as a block of data and just get loaded into memory - there is no code in the executable to initialize these.
C++ also allows global variables that require code to be executed to initialize them; C doesn't allow this.
For example, "int x = rand();"
gets initialized at run time by code in the executable.
Try adding this global variable
int x[1024 * 1024];
and see if it makes a difference to the executable size.
Now try:
int x[1024 * 1024] = {1,2,3};
And see what difference that makes.
Where are variables in C++ stored?
Inside the RAM or the processor's cache?
Named variables are stored:
On the stack, if they're function-local variables.
C++ calls this "automatic storage"1 and doesn't require it to actually be the asm call stack, and in some rare implementations it isn't. But in mainstream implementations it is.
In a per-process data area if they are global or static.
C++ calls this "static storage class"; it's implemented in asm by putting / reserving bytes in section .data, .bss, .rodata, or similar.
If the variable is a pointer initialized with int *p = new int[10]; or similar, the pointer variable p will go in automatic storage or static storage as above. The pointed-to object in memory is:
On the heap (what C++ calls dynamic storage), allocated with new or malloc, etc.
In asm, this means calling an allocator function, which may ultimately get new memory from the OS via some kind of system call if its free-list is empty. "The heap" isn't a single contiguous region in modern OSes / C++ implementations.
C and C++ don't do automatic garbage collection, and named variables can't themselves be in dynamic storage ("the heap"). Objects in dynamic storage are anonymous, other than being pointed-to by other objects, some of which may be proper variables. (An object of struct or class type, as opposed to primitive types like int, can let you refer to named class members in this anonymous object. In a member function they even look identical.)
This is why you can't (safely/usefully) return a pointer or reference to a local variable.
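A hedged sketch of that pitfall and the usual alternatives (the function names are made up; make_unique needs C++14):

#include <memory>

int* bad() {
    int local = 42;
    return &local;              // dangling: 'local' dies when bad() returns
}

int* valid_but_manual() {
    return new int(42);         // pointee is in dynamic storage until deleted
}

std::unique_ptr<int> better() {
    return std::make_unique<int>(42);   // ownership is explicit, nothing to leak
}

int main() {
    auto p = better();          // *p is valid here; freed automatically
    int *q = valid_but_manual();
    delete q;                   // caller must remember to free it
    // int *r = bad();          // compiles (usually with a warning), but using *r is UB
    return *p - 42;             // exits with 0
}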
This is all in RAM, of course. Caching is transparent to userspace processes, though it may visibly affect performance.
Compilers may optimize code to store variables in registers. This is highly compiler and code-dependent, but good compilers will do so aggressively.
Footnote 1: Fun fact: auto in C++03 and earlier, and still in C, meant automatic storage-class, but now (C++11) it infers types.
For C++ in general, the proper answer is "wherever your compiler decides to put them". You should not make assumptions otherwise, unless you somehow direct your compiler otherwise. Some variables can be stored entirely in registers, and some might be totally optimized away and replaced by a literal somewhere. With some compilers on some platforms, constants might actually end up in ROM.
The part of your question about "the processor's cache" is a bit confused. There are some tools for directing how the processor handles its cache, but in general that is the processor's business and should be invisible to you. You can think of the cache as your CPU's window into RAM. Pretty much any memory access goes through the cache.
On the other end of the equation, unused RAM sometimes will get swapped out to disk on most OSes. So it's possible (but unlikely) that at some moments your variables are actually being stored on disk. :-)
Variables are usually stored in RAM. This is either in static storage (e.g. global variables, static variables in methods/functions), on the Stack (e.g. non-static variables declared within a method/function) or on the Heap (data allocated with new/malloc). Stack, Heap and static storage are all RAM, just different locations.
Pointers are a bit special. Pointers themselves follow the rules above but the data they point to is typically stored on the Heap (memory blocks created with malloc, objects created with new). Yet you can create pointers pointing to stack memory: int a = 10; int * b = &a;; b points to the memory of a and a is stored on the stack.
What goes into the CPU cache is beyond the compiler's control; the CPU decides itself what to cache and how long to cache it (depending on factors like "Has this data been used recently?" or "Is it expected that the data will be used again pretty soon?"), and of course the size of the cache has a big influence as well.
The compiler can only decide which data goes into a CPU register. Usually data is kept there if it's accessed very often in a row since register access is faster than cache and much faster than RAM. Some operations on certain systems can actually only be performed if the data is in a register, in that case the compiler must move data to a register before performing the operation and can only decide when to move the data back to RAM.
Compilers will always try to keep the most often accessed data in a register. When a method/function is called, usually all register values are written back to RAM, unless the compiler can say for sure that the called function/method will not access the memory where the data came from. Also on return of a method/function it must write all register data back to RAM, otherwise the new values would be lost. The return value itself is passed in a register on some CPU architectures, it is passed via stack otherwise.
Variables in C++ are stored either on the stack or the heap.
stack:
int x;
heap:
int *p = new int;   // remember to delete p when done (or use a smart pointer)
That being said, both are structures built in RAM.
If your RAM usage is high, though, the OS (Windows, for example) can swap this out to disk.
When computation is done on variables, the memory will be copied to registers.
C++ is not aware of your processor's cache.
When you are running a program, written in C++ or any other language, your CPU will keep a copy of "popular" chunks of RAM in a cache. That's done at the hardware level.
Don't think of CPU cache as "other" or "more" memory...it's just a mechanism to keep some chunks of RAM close by.
I think you are mixing up two concepts. One, how does the C++ language store variables in memory. Two, how does the computer and operating system manage that memory.
In C++, variables can be allocated on the stack, which is memory that is reserved for the program's use and fixed in size at thread start, or in dynamic memory, which can be allocated on the fly using new. A compiler can also choose to store variables in processor registers if analysis of the code allows it. Those variables would never see the system memory.
If a variable ends up in memory, the OS and the processor chipset take over. Both stack-based addresses and dynamic addresses are virtual. That means they may or may not be resident in system memory at any given time. The in-memory variable may be stored in system memory, paged out to disk, or resident in a cache on or near the processor. So it's hard to know where that data is actually living. If a program has been idle for a while, or two programs are competing for memory resources, the value can be saved off to disk in the page file and restored when it is the program's turn to run. If the variable is local to some work being done, it could be modified in the processor's cache several times before it is finally flushed back to system memory. The code you wrote would never know this happened. All it knows is that it has an address to operate on, and all of the other systems take care of the rest.
Variables can be held in a number of different places, sometimes in more than one place. Most variables are placed in RAM when a program is loaded; sometimes variables which are declared const are instead placed in ROM. Whenever a variable is accessed, if it is not in the processor's cache, a cache miss will result, and the processor will stall while the variable is copied from RAM/ROM into the cache.
If you have any halfway decent optimizing compiler, local variables will often instead be stored in a processor's register file. Variables will move back and forth between RAM, the cache, and the register file as they are read and written, but they will generally always have a copy in RAM/ROM, unless the compiler decides that's not necessary.
The C++ language supports two kinds of memory allocation through the variables in C++ programs:
Static allocation is what happens when you declare a static or global variable. Each static or global variable defines one block of space, of a fixed size. The space is allocated once, when your program is started (part of the exec operation), and is never freed.
Automatic allocation happens when you declare an automatic variable, such as a function argument or a local variable. The space for an automatic variable is allocated when the compound statement containing the declaration is entered, and is freed when that compound statement is exited. With some compilers (as a language extension, e.g. variable-length arrays in GNU C), the size of the automatic storage can be an expression that varies; in standard C++ it must be a constant.
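For example, a small sketch of the difference in lifetime (the names are just for illustration):

#include <iostream>

void visit() {
    static int program_lifetime = 0;   // static allocation: one instance, persists
    int call_lifetime = 0;             // automatic allocation: a fresh instance per call
    ++program_lifetime;
    ++call_lifetime;
    std::cout << program_lifetime << ' ' << call_lifetime << '\n';
}

int main() {
    visit();   // prints "1 1"
    visit();   // prints "2 1" - the static kept its value, the automatic did not
}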
A third important kind of memory allocation, dynamic allocation, is not supported by C++ variables but is available via library functions.
Dynamic Memory Allocation
Dynamic memory allocation is a technique in which programs determine as they are running where to store some information. You need dynamic allocation when the amount of memory you need, or how long you continue to need it, depends on factors that are not known before the program runs.
For example, you may need a block to store a line read from an input file; since there is no limit to how long a line can be, you must allocate the memory dynamically and make it dynamically larger as you read more of the line.
Or, you may need a block for each record or each definition in the input data; since you can't know in advance how many there will be, you must allocate a new block for each record or definition as you read it.
When you use dynamic allocation, the allocation of a block of memory is an action that the program requests explicitly. You call a function or macro when you want to allocate space, and specify the size with an argument. If you want to free the space, you do so by calling another function or macro. You can do these things whenever you want, as often as you want.
Dynamic allocation is not supported by C++ variables; there is no storage class “dynamic”, and there can never be a C++ variable whose value is stored in dynamically allocated space. The only way to get dynamically allocated memory is via an allocation call (malloc, new, etc.), and the only way to refer to dynamically allocated space is through a pointer. Because it is less convenient, and because the actual process of dynamic allocation requires more computation time, programmers generally use dynamic allocation only when neither static nor automatic allocation will serve.
For example, if you want to allocate dynamically some space to hold a struct foobar, you cannot declare a variable of type struct foobar whose contents are the dynamically allocated space. But you can declare a variable of pointer type struct foobar * and assign it the address of the space. Then you can use the operators ‘*’ and ‘->’ on this pointer variable to refer to the contents of the space:
{
  /* assumes: #include <stdlib.h>, a definition like
     struct foobar { char *name; struct foobar *next; };
     and existing variables x and current_foobar */
  struct foobar *ptr
    = (struct foobar *) malloc (sizeof (struct foobar));
  ptr->name = x;
  ptr->next = current_foobar;
  current_foobar = ptr;
}
Depending on how they are declared, they will be stored either on the "heap" or on the "stack".
The heap is a dynamic data structure that the application can use.
When the application uses data, it has to be moved into the CPU's registers right before it is consumed; however, registers are very volatile and temporary storage.