Where is an ordinary variable defined inside a __device__ function placed? - c++

In CUDA, I understand that a variable is placed in shared memory if it is declared __shared__, and in constant memory if it is declared __constant__. Also, memory allocated with cudaMalloc() is put in GPU global memory. But
where are variables without qualifiers such as __shared__, __constant__ and register placed? For example, the variable i below:
__device__ void func(){
    int i = 0;
    return;
}

Automatic variables, i.e. variables without memory space specification within the scope of functions, are placed in one of the following locations:
When optimized away:
1.1 Nowhere - if the variable isn't actually necessary. This actually happens a lot, since CUDA functions are often inlined, with some variables becoming copies of a variable in the calling function. Example (note the x from foo() in the compilation of bar() - completely gone).
1.2 Immediate values in the program's compiled code - if the variable's value is constant and never updated, its value may simply be "plugged" into the code. Here's an example with two variables holding constants, which are replaced by the single constant that is their sum.
When not optimized away:
2.1 Registers - If your variable can't be optimized away, the better alternative is to keep it in a hardware register on (one of the streaming multiprocessor cores of) the GPU. This is the best and most performant option, and the one the compiler will choose whenever it can. Example (the variables x and y are placed in registers %r1 and %r2).
2.2 'Local' memory - The 'local memory' of a CUDA thread is an area in global device memory which is (in principle) accessible only by that thread.
Now, obviously, local memory is much slower to use. When will the compiler choose it, then?
The CUDA Programming Guide gives us the answer:
When the automatic variable is too large to fit in the register file for the current thread (each thread typically gets between 63 and 255 4-byte registers).
When the automatic variable is an array, or has an array member, which is indexed with a non-constant offset (see the sketch after this list). Unfortunately, NVIDIA GPU multiprocessors don't support register indexing.
When the kernel's available quota of registers is already used up by other variables or by the compiled code itself - even if the variable in question is very small. This is referred to as register spilling.
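As an illustration of the dynamically-indexed case, here is a minimal sketch (my own, not one of the linked examples; the kernel name and sizes are made up). Because idx is only known at run time, the compiler cannot keep tmp entirely in registers and will typically place it in local memory:
__global__ void dynamicIndexing(const int *in, int *out, int idx)
{
    // A small per-thread array. If it were only ever indexed with
    // compile-time constants, its elements could live in registers.
    int tmp[8];
    for (int i = 0; i < 8; ++i)
        tmp[i] = in[threadIdx.x * 8 + i];

    // Indexing with a runtime value usually forces tmp into 'local' memory,
    // because registers cannot be indexed dynamically.
    out[threadIdx.x] = tmp[idx & 7];
}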

Local variables are either placed in hardware registers or local memory (which is effectively global memory).
In your example, however, variable i will be removed by the compiler because it is unused.

GPUs have a lot of space dedicated to registers, stored directly in the GPU's compute units (i.e. Streaming Multiprocessors). Registers are not stored in memory unless some register spilling happens (typically when a kernel uses too many registers). Registers have no address, unlike memory bytes. The same thing happens on a CPU, except that the number of CPU registers is usually much smaller than on a GPU: for example, an Intel Skylake core has 180 integer registers and 168 vector registers, while the instruction set architecture only exposes something like 16 integer/vector registers. Note that in case of register spilling, the value of a register is temporarily stored in local memory (in the L1 cache if possible).
Figure: the overall memory hierarchy of a basic (Nvidia Fermi) GPU (not reproduced here).
For more information, consider reading: Local Memory and Register Spilling.

Related

What is stored in Code memory and Data memory

Can someone please explain the difference between Code and Data memory? I know code is stored in Flash and data is stored in RAM, but I am confused.
#include <iostream>
using namespace std;

int main()
{
    int a = 10, b = 20;
    int c = a + b;
    return 0;
}
Here a, b, c are stored in data memory (RAM), but what gets stored in Code memory? Is this entire code stored in Code memory? If yes, does this mean we are storing a, b, c in both data and code memory?
In your example, there are many possible scenarios, depending on the optimization level of your compiler.
Constants placed in "code memory"
In the code below:
int a =10, b=20;
int c = a+b;
return 0;
The variables a and b are constants; they don't change. A compiler could optimize them away, transforming the code to:
int c = 10 + 20;
So the values 10 and 20 can be placed into code memory, eliminating the variables a and b.
Registers not Memory
The compiler is allowed to assign the variables a and b to registers. Registers are inside the processor, so they don't take up any RAM or memory space. Registers are not part of the code space either.
(This can happen because there are no statements that require the addresses of a or b).
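As a small sketch of this point (not from the original answer; the helper function and global are purely illustrative), a statement that requires the address of a variable generally prevents the compiler from keeping it purely in a register, because a register has no address:
int globalSink;                                   // illustrative only

void observe(const int *p) { globalSink = *p; }   // a call that needs an address

int main()
{
    int a = 10, b = 20;
    // Passing &a and &b means a and b generally must exist at real stack
    // addresses around these calls (unless the compiler can see through the
    // calls and remove the address-taking entirely).
    observe(&a);
    observe(&b);
    return a + b;
}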
All code dropped
On higher optimization settings, the compiler can delete all of your code and replace it with a bare return 0;, because:
The variables a and b are not changed.
The variable c is changed but not used by any other statements.
Your program has no effect (nothing is printed, there are no external actions like writing to hardware).
Thus your program can be reduced down to return 0;.
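As a hedged sketch (mine, not the answer's) of why observability matters: once the result is used for something externally visible, the computation can no longer be discarded entirely, although the compiler will still most likely fold it into the constant 30:
#include <iostream>

int main()
{
    int a = 10, b = 20;
    int c = a + b;
    // The output is an observable effect, so the program can no longer be
    // reduced to a bare "return 0;". With optimization on, the compiler will
    // almost certainly just print the precomputed constant 30.
    std::cout << c << std::endl;
    return 0;
}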
Code Memory vs. Data Memory
In general, processor instructions are placed in a segment you will call "code memory". This may actually reside in RAM and not in Flash or ROM. For example, on a PC, your code could be loaded from the hard drive into RAM and executed in RAM. Similarly with Flash, your code could be loaded from Flash into RAM and executed in RAM.
Constants, like numbers, can be placed into a Read-Only segment or in the Code Segment. Many processors can load constants from the Code Segment (see ARM and Intel assembly instructions). The Read-Only segment can live on a Read Only device, (ROM or Flash) or may live in RAM (or on a device like hard drive). All you can guarantee, is that the code will not write to the Read-Only segment.
Data Memory is different. The C++ language has at least 3 areas of "data" memory (where variables live): 1) Local (a.k.a. stack), where short-lifetime variables reside; 2) Dynamic memory (a.k.a. heap), allocated by using new or malloc; and 3) Static/Global variables. These memory areas can be placed anywhere, as long as the memory has read and write capabilities. They don't need to be fast, only readable and writable (for example, a hard drive can be used as data memory).
Memory organization is more complicated than having Code, Stack and Heap. In the embedded-systems world, memory can be placed in non-standard locations, and there may be a need for more detailed memory segments so they can be placed in different areas. For example, an embedded system may want to place constants into Flash so that they can be changed easily (even though they may be more efficiently accessed in the Code Segment). Some code may need to be placed into the Boot Area of the processor (which is programmed by the processor manufacturer). Some embedded systems have non-volatile memory (e.g. battery-backed memory), which can behave like Read-Only memory.
Trust Your Compiler
Trust your compiler to place code, data and variables in the most efficient areas possible. Your compiler knows your platform and will make the best decisions for you. If you need to change your compiler's settings, you can, but you should really know what you are doing and why you need to change them. Most PC platforms load code from a hard drive (or SSD) into RAM and execute the code from RAM. Embedded systems are different and depend on the hardware devices. Code may be run from flash because the platform has minimal RAM. Some may store the code compressed in a serial-access read-only device and have to decompress it into RAM before executing. In these situations, the compilers are configured for these specializations. So trust your compiler and let it place the code and data into the correct segments and locations.
A quick oversimplification:
Code memory stores the sequence of machine-language instructions compiled from your C++ code (traditionally ROM, though on a PC the code is loaded into RAM to run).
The actual data that the program creates and manipulates is stored in RAM, which can be thought of as stack plus heap: dynamically allocated data lives in the larger heap, while local variables and the pointers that refer to heap data typically live on the stack. The stack itself is also in RAM; only the stack pointer and the values currently being worked on are held in the CPU's fast registers.
Pointers on the stack are followed to reach data in the heap when the current instruction needs it.

CUDA programming: Memory access speed and memory usage: thread-local variables vs. shared-memory variables vs. numeric literals? [duplicate]

This question already has answers here:
Using constants with CUDA
Suppose I have an array with several fixed numerical values that would be accessed multiple times by multiple threads within the same block, what are some pros and cons in terms of access speed and memory usage if I store these values in:
thread-local memory: double x[3] = {1,2,3};
shared memory: __shared__ double x[3] = {1,2,3};
numeric literals: directly hardcode these values in the expression where they appear
Thanks!
TL;DR
use __constant__ double x[3]; // ... initialization ...
First, know where a variable actually resides
In your question:
thread-local memory: double x[3] = {1,2,3};
This is imprecise. Depending on how your code accesses x[], x[] can reside in either registers or local memory.
Since there are no type qualifiers, the compiler will try its best to put it in registers,
An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ qualifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory,
but when it can't, it will put them in local memory:
Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known as register spilling).
You really don't want x to be in local memory, it's slow. In your situation,
an array with several fixed numerical values that would be accessed multiple times by multiple threads within the same block
Both __constant__ and __shared__ can be a good choice.
For a complete description on this topic, check: CUDA Toolkit Documentation: variable-type-qualifiers
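To make the TL;DR concrete, here is a minimal sketch (the kernel name, sizes and launch configuration are made up for illustration) of how a __constant__ table is typically declared and then filled from the host with cudaMemcpyToSymbol, since __constant__ variables cannot be assigned directly from host code:
#include <cuda_runtime.h>

__constant__ double x[3];                 // lives in constant memory, cached on-chip

__global__ void useX(double *out)
{
    int tid = threadIdx.x;
    out[tid] = 2.0 * x[tid % 3];          // every thread reads the same small table
}

int main()
{
    const double hostX[3] = {1.0, 2.0, 3.0};
    cudaMemcpyToSymbol(x, hostX, sizeof(hostX));   // initialize the constant array

    double *out;
    cudaMalloc(&out, 32 * sizeof(double));
    useX<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}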
Then, consider speed & availability
Hardcode
The number will be embedded in the instructions. You may expect some performance improvement, but it's better to benchmark your program before and after doing this.
Register
It's fast, but scarce. Consider a block of 16x16 threads: with a maximum of 64K registers per block, each thread can use at most 256 registers. (Well, maybe not that scarce; it should be enough for most kernels.)
Local Memory
It's slow. However, a thread can use up to 512KB local memory.
The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses...
Shared Memory
It's fast, but scarce. Typically 48KB per block (less than registers!).
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
Constant Memory
It's fast in a different way (see below): performance depends heavily on the constant cache, and that cache is scarce. Typically 8KB ~ 10KB of cache per multiprocessor.
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x.
A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
read: CUDA Toolkit Documentation: device-memory-accesses
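For completeness, a hedged sketch (not part of the original answer, kernel name and sizes illustrative only) of the __shared__ alternative from the question. A __shared__ array cannot take an initializer in its declaration, so the block has to fill it at run time and synchronize before use:
__global__ void useSharedTable(double *out)
{
    __shared__ double x[3];            // one copy per thread block, on-chip
    if (threadIdx.x == 0) {            // let a single thread write the table
        x[0] = 1.0;
        x[1] = 2.0;
        x[2] = 3.0;
    }
    __syncthreads();                   // make the table visible to all threads of the block
    out[threadIdx.x] = 2.0 * x[threadIdx.x % 3];
}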

C++ how are variables accessed in memory?

When I create a new variable in a C++ program, eg a char:
char c = 'a';
how does C++ then have access to this variable in memory? I would imagine that it would need to store the memory location of the variable, but then that would require a pointer variable, and this pointer would again need to be accessed.
See the docs:
When a variable is declared, the memory needed to store its value is
assigned a specific location in memory (its memory address).
Generally, C++ programs do not actively decide the exact memory
addresses where its variables are stored. Fortunately, that task is
left to the environment where the program is run - generally, an
operating system that decides the particular memory locations on
runtime. However, it may be useful for a program to be able to obtain
the address of a variable during runtime in order to access data cells
that are at a certain position relative to it.
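As a small sketch of that last point (mine, not from the quoted docs), the & operator is how a program obtains that runtime address; note the cast so the char's address prints as a pointer rather than as a string:
#include <iostream>

int main()
{
    char c = 'a';
    // The compiler knows c only as a location (a stack slot or a register);
    // taking &c exposes that location to the program as a pointer value.
    std::cout << "value: " << c
              << ", address: " << static_cast<void*>(&c) << std::endl;
    return 0;
}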
You can also refer to this article on Variables and Memory:
The Stack
The stack is where local variables and function parameters reside. It
is called a stack because it follows the last-in, first-out principle.
As data is added or pushed to the stack, it grows, and when data is
removed or popped it shrinks. In reality, memory addresses are not
physically moved around every time data is pushed or popped from the
stack, instead the stack pointer, which as the name implies points to
the memory address at the top of the stack, moves up and down.
Everything below this address is considered to be on the stack and
usable, whereas everything above it is off the stack, and invalid.
This is all accomplished automatically by the operating system, and as
a result it is sometimes also called automatic memory. On the
extremely rare occasions that one needs to be able to explicitly
invoke this type of memory, the C++ key word auto can be used.
Normally, one declares variables on the stack like this:
void func () {
int i; float x[100];
...
}
Variables that are declared on the stack are only valid within the
scope of their declaration. That means when the function func() listed
above returns, i and x will no longer be accessible or valid.
There is another limitation to variables that are placed on the stack:
the operating system only allocates a certain amount of space to the
stack. As each part of a program that is being executed comes into
scope, the operating system allocates the appropriate amount of memory
that is required to hold all the local variables on the stack. If this
is greater than the amount of memory that the OS has allowed for the
total size of the stack, then the program will crash. While the
maximum size of the stack can sometimes be changed by compile time
parameters, it is usually fairly small, and nowhere near the total
amount of RAM available on a machine.
Assuming this is a local variable, it is allocated on the stack - i.e. in RAM. The compiler keeps track of the variable's offset within the stack frame. In the basic scenario, if any computation is performed with the variable, it is moved into one of the processor's registers, the CPU performs the computation, and the result is written back to RAM. Modern compilers may keep entire stack frames in registers, and processors add multiple levels of caches, so in practice it can get quite complex.
Please note that the name "c" is no longer mentioned in the binary (unless you have debugging symbols). The binary then works only with memory locations. E.g. it would look like this (simple addition):
a = b + c
take the value at memory offset 1 and put it in register 1
take the value at memory offset 2 and put it in register 2
sum registers 1 and 2 and store the result in register 3
copy register 3 to memory location 3
The binary doesn't know "a", "b" or "c". The compiler just said "a is in memory 1, b is in memory 2, c is in memory 3". And the CPU just blindly executes the commands the compiler has generated.
C++ itself (or, the compiler) would have access to this variable in terms of the program structure, represented as a data structure. Perhaps you're asking how other parts in the program would have access to it at run time.
The answer is that it varies. It can be stored either in a register, on the stack, on the heap, or in the data/bss sections (global/static variables), depending on its context and the platform it was compiled for: If you needed to pass it around by reference (or pointer) to other functions, then it would likely be stored on the stack. If you only need it in the context of your function, it would probably be handled in a register. If it's a member variable of an object on the heap, then it's on the heap, and you reference it by an offset into the object. If it's a global/static variable, then its address is determined once the program is fully loaded into memory.
C++ eventually compiles down to machine language, and often runs within the context of an operating system, so you might want to brush up a bit on Assembly basics, or even some OS principles, to better understand what's going on under the hood.
Let's say our program starts with a stack address of 4000000.
When you call a function, depending on how much stack it uses, the stack is "allocated" like this.
Let's say we have 2 ints (8 bytes):
void function()
{
    int a = 0;
    int b = 0;
}
Then what's going to happen in assembly is:
MOV EBP,ESP //Here we store the original value of the stack address (4000000) in EBP, and we restore it at the end of the function back to 4000000
SUB ESP, 8 //here we "allocate" 8 bytes in the stack, which basically just decreases the ESP addr by 8
so our ESP address was changed from
4000000
to
3999992
That's how the program knows the stack address for the first int is 3999992, and the second int goes from 3999996 to 4000000.
Even though this pretty much has nothing to do with the compiler, it's really important to know, because once you know how the stack is "allocated", you realize how cheap it is to do things like
char my_array[20000];
since all it's doing is a single sub esp, 20000 assembly instruction.
But if you actually use all those bytes, for example with memset(my_array, 0, 20000), that's a different story.
how does C++ then have access to this variable in memory?
It doesn't!
Your computer does, and it is instructed on how to do that by loading the location of the variable in memory into a register. This is all handled by assembly language. I shan't go into the details here of how such languages work (you can look it up!) but this is rather the purpose of a C++ compiler: to turn an abstract, high-level set of "instructions" into actual technical instructions that a computer can understand and execute. You could sort of say that assembly programs contain a lot of pointers, though most of them are literals rather than "variables".

What happens when a computer program runs?

I know the general theory but I can't fit in the details.
I know that a program resides in the secondary memory of a computer. Once the program begins execution, it is entirely copied to RAM. Then the processor retrieves a few instructions at a time (it depends on the size of the bus), puts them in registers and executes them.
I also know that a computer program uses two kinds of memory: stack and heap, which are also part of the primary memory of the computer. The stack is used for non-dynamic memory, and the heap for dynamic memory (for example, everything related to the new operator in C++)
What I can't understand is how those two things connect. At what point is the stack used for the execution of the instructions? Instructions go from the RAM, to the stack, to the registers?
It really depends on the system, but modern OSes with virtual memory tend to load their process images and allocate memory something like this:
+---------+
| stack | function-local variables, return addresses, return values, etc.
| | often grows downward, commonly accessed via "push" and "pop" (but can be
| | accessed randomly, as well; disassemble a program to see)
+---------+
| shared | mapped shared libraries (C libraries, math libs, etc.)
| libs |
+---------+
| hole | unused memory allocated between the heap and stack "chunks", spans the
| | difference between your max and min memory, minus the other totals
+---------+
| heap | dynamic, random-access storage, allocated with 'malloc' and the like.
+---------+
| bss | Uninitialized global variables; must be in read-write memory area
+---------+
| data | data segment, for globals and static variables that are initialized
| | (can further be split up into read-only and read-write areas, with
| | read-only areas being stored elsewhere in ROM on some systems)
+---------+
| text | program code, this is the actual executable code that is running.
+---------+
This is the general process address space on many common virtual-memory systems. The "hole" is the size of your total memory, minus the space taken up by all the other areas; this gives a large amount of space for the heap to grow into. This is also "virtual", meaning it maps to your actual memory through a translation table, and may be actually stored at any location in actual memory. It is done this way to protect one process from accessing another process's memory, and to make each process think it's running on a complete system.
Note that the positions of, e.g., the stack and heap may be in a different order on some systems (see Billy O'Neal's answer below for more details on Win32).
Other systems can be very different. DOS, for instance, ran in real mode, and its memory allocation when running programs looked very different:
+-----------+ top of memory
| extended | above the high memory area, and up to your total memory; needed drivers to
| | be able to access it.
+-----------+ 0x110000
| high | just over 1MB->1MB+64KB, used by 286s and above.
+-----------+ 0x100000
| upper | upper memory area, from 640kb->1MB, had mapped memory for video devices, the
| | DOS "transient" area, etc. some was often free, and could be used for drivers
+-----------+ 0xA0000
| USER PROC | user process address space, from the end of DOS up to 640KB
+-----------+
|command.com| DOS command interpreter
+-----------+
| DOS | DOS permanent area, kept as small as possible, provided routines for display,
| kernel | *basic* hardware access, etc.
+-----------+ 0x600
| BIOS data | BIOS data area, contained simple hardware descriptions, etc.
+-----------+ 0x400
| interrupt | the interrupt vector table, starting from 0 and going to 1k, contained
| vector | the addresses of routines called when interrupts occurred. e.g.
| table | interrupt 0x21 checked the address at 0x21*4 and far-jumped to that
| | location to service the interrupt.
+-----------+ 0x0
You can see that DOS allowed direct access to the operating system memory, with no protection, which meant that user-space programs could generally directly access or overwrite anything they liked.
In the process address space, however, the programs tended to look similar, only they were described as code segment, data segment, heap, stack segment, etc., and it was mapped a little differently. But most of the general areas were still there.
Upon loading the program and necessary shared libs into memory, and distributing the parts of the program into the right areas, the OS begins executing your process wherever its main method is at, and your program takes over from there, making system calls as necessary when it needs them.
Different systems (embedded, whatever) may have very different architectures, such as stackless systems, Harvard architecture systems (with code and data being kept in separate physical memory), systems which actually keep the BSS in read-only memory (initially set by the programmer), etc. But this is the general gist.
You said:
I also know that a computer program uses two kinds of memory: stack and heap, which are also part of the primary memory of the computer.
"Stack" and "heap" are just abstract concepts, rather than (necessarily) physically distinct "kinds" of memory.
A stack is merely a last-in, first-out data structure. In the x86 architecture, it can actually be addressed randomly by using an offset from the end, but the most common functions are PUSH and POP to add and remove items from it, respectively. It is commonly used for function-local variables (so-called "automatic storage"), function arguments, return addresses, etc. (more below)
A "heap" is just a nickname for a chunk of memory that can be allocated on demand, and is addressed randomly (meaning, you can access any location in it directly). It is commonly used for data structures that you allocate at runtime (in C++, using new and delete, and malloc and friends in C, etc).
The stack and heap, on the x86 architecture, both physically reside in your system memory (RAM), and are mapped through virtual memory allocation into the process address space as described above.
The registers (still on x86), physically reside inside the processor (as opposed to RAM), and are loaded by the processor, from the TEXT area (and can also be loaded from elsewhere in memory or other places depending on the CPU instructions that are actually executed). They are essentially just very small, very fast on-chip memory locations that are used for a number of different purposes.
Register layout is highly dependent on the architecture (in fact, registers, the instruction set, and memory layout/design, are exactly what is meant by "architecture"), and so I won't expand upon it, but recommend you take an assembly language course to understand them better.
Your question:
At what point is the stack used for the execution of the instructions? Instructions go from the RAM, to the stack, to the registers?
The stack (in systems/languages that have and use them) is most often used like this:
int mul( int x, int y ) {
    return x * y;    // this stores the result of MULtiplying the two variables
                     // from the stack into the return value address previously
                     // allocated, then issues a RET, which resets the stack frame
                     // based on the arg list, and returns to the address set by
                     // the CALLer.
}

int main() {
    int x = 2, y = 3;  // these variables are stored on the stack
    mul( x, y );       // this pushes y onto the stack, then x, then a return address,
                       // allocates space on the stack for a return value,
                       // then issues an assembly CALL instruction.
}
Write a simple program like this, and then compile it to assembly (gcc -S foo.c if you have access to GCC), and take a look. The assembly is pretty easy to follow. You can see that the stack is used for function local variables, and for calling functions, storing their arguments and return values. This is also why when you do something like:
f( g( h( i ) ) );
All of these get called in turn. It's literally building up a stack of function calls and their arguments, executing them, and then popping them off as it winds back down (or up ;). However, as mentioned above, the stack (on x86) actually resides in your process memory space (in virtual memory), and so it can be manipulated directly; it's not a separate step during execution (or at least is orthogonal to the process).
FYI, the above is the C calling convention, also used by C++. Other languages/systems may push arguments onto the stack in a different order, and some languages/platforms don't even use stacks, and go about it in different ways.
Also note, these aren't actual lines of C code executing. The compiler has converted them into machine language instructions in your executable. They are then (generally) copied from the TEXT area into the CPU pipeline, then into the CPU registers, and executed from there. [This was incorrect. See Ben Voigt's correction below.]
Sdaz has gotten a remarkable number of upvotes in a very short time, but sadly is perpetuating a misconception about how instructions move through the CPU.
The question asked:
Instructions go from the RAM, to the stack, to the registers?
Sdaz said:
Also note, these aren't actual lines of C code executing. The compiler has converted them into machine language instructions in your executable. They are then (generally) copied from the TEXT area into the CPU pipeline, then into the CPU registers, and executed from there.
But this is wrong. Except for the special case of self-modifying code, instructions never enter the datapath. And they are not, cannot be, executed from the datapath.
The x86 CPU registers are:
General registers
EAX EBX ECX EDX
Segment registers
CS DS ES FS GS SS
Index and pointers
ESI EDI EBP EIP ESP
Indicator
EFLAGS
There are also some floating-point and SIMD registers, but for the purposes of this discussion we'll classify those as part of the coprocessor and not the CPU. The memory-management unit inside the CPU also has some registers of its own, we'll again treat that as a separate processing unit.
None of these registers are used for executable code. EIP contains the address of the executing instruction, not the instruction itself.
Instructions go through a completely different path in the CPU than data does (Harvard architecture). All current machines are Harvard architecture inside the CPU. Most these days are also Harvard architecture in the cache. x86 (your common desktop machine) is Von Neumann architecture in main memory, meaning data and code are intermingled in RAM. That's beside the point, since we're talking about what happens inside the CPU.
The classic sequence taught in computer architecture is fetch-decode-execute. The memory controller looks up the instruction stored at the address EIP. The bits of the instruction go through some combinational logic to create all the control signals for the different multiplexers in the processor. And after some cycles, the arithmetic logic unit arrives at a result, which is clocked into the destination. Then the next instruction is fetched.
On a modern processor, things work a little differently. Each incoming instruction is translated into a whole series of microcode instructions. This enables pipelining, because the resources used by the first microinstruction aren't needed later, so the CPU can begin working on the first microinstruction of the next instruction.
To top it off, terminology is slightly confused because register is an electrical engineering term for a collection of D-flipflops. And instructions (or especially microinstructions) may very well be stored temporarily in such a collection of D-flipflops. But this is not what is meant when a computer scientist or software engineer or run-of-the-mill developer uses the term register. They mean the datapath registers as listed above, and these are not used for transporting code.
The names and number of datapath registers vary for other CPU architectures, such as ARM, MIPS, Alpha, PowerPC, but all of them execute instructions without passing them through the ALU.
The exact layout of the memory while a process is executing is completely dependent on the platform which you're using. Consider the following test program:
#include <stdlib.h>
#include <stdio.h>

int main()
{
    int stackValue = 0;
    int *addressOnStack = &stackValue;
    int *addressOnHeap = malloc(sizeof(int));
    if (addressOnStack > addressOnHeap)
    {
        puts("The stack is above the heap.");
    }
    else
    {
        puts("The heap is above the stack.");
    }
}
On Windows NT (and its children), this program is generally going to produce:
The heap is above the stack
On POSIX boxes, it's going to say:
The stack is above the heap
The UNIX memory model is quite well explained here by @Sdaz MacSkibbons, so I won't reiterate that here. But that is not the only memory model. The reason POSIX requires this model is the sbrk system call. Basically, on a POSIX box, to get more memory, a process merely tells the kernel to move the divider between the "hole" and the "heap" further into the "hole" region. There is no way to return memory to the operating system, and the operating system itself does not manage your heap. Your C runtime library has to provide that (via malloc).
This also has implications for the kind of code actually used in POSIX binaries. POSIX boxes (almost universally) use the ELF file format. In this format, the operating system is responsible for communications between libraries in different ELF files. Therefore, all the libraries use position-independent code (That is, the code itself can be loaded into different memory addresses and still operate), and all calls between libraries are passed through a lookup table to find out where control needs to jump for cross library function calls. This adds some overhead and can be exploited if one of the libraries changes the lookup table.
Windows' memory model is different because the kind of code it uses is different. Windows uses the PE file format, which leaves the code in position-dependent format. That is, the code depends on where exactly in virtual memory it is loaded. There is a flag in the PE spec which tells the OS where exactly in memory the library or executable would like to be mapped when your program runs. If a program or library cannot be loaded at its preferred address, the Windows loader must rebase the library/executable - basically, it moves the position-dependent code to point at the new positions - which doesn't require lookup tables and cannot be exploited because there's no lookup table to overwrite. Unfortunately, this requires a very complicated implementation in the Windows loader, and does have considerable startup-time overhead if an image needs to be rebased. Large commercial software packages often modify their libraries to start purposely at different addresses to avoid rebasing; Windows itself does this with its own libraries (e.g. ntdll.dll, kernel32.dll, psapi.dll, etc. - all have different start addresses by default).
On Windows, virtual memory is obtained from the system via a call to VirtualAlloc, and it is returned to the system via VirtualFree (okay, technically VirtualAlloc farms out to NtAllocateVirtualMemory, but that's an implementation detail). (Contrast this with POSIX, where memory cannot be reclaimed.) This process is slow (and IIRC, requires that you allocate in physical-page-sized chunks; typically 4kb or more). Windows also provides its own heap functions (HeapAlloc, HeapFree, etc.) as part of a library known as RtlHeap, which is included as a part of Windows itself, and upon which the C runtime (that is, malloc and friends) is typically implemented.
Windows also has quite a few legacy memory allocation APIs from the days when it had to deal with old 80386s, and these functions are now built on top of RtlHeap. For more information about the various APIs that control memory management in Windows, see this MSDN article: http://msdn.microsoft.com/en-us/library/ms810627 .
Note also that this means that on Windows a single process can (and usually does) have more than one heap. (Typically, each shared library creates its own heap.)
(Most of this information comes from "Secure Coding in C and C++" by Robert Seacord)
The stack
In the x86 architecture the CPU executes operations on registers. The stack is only used for convenience. You can save the contents of your registers to the stack before calling a subroutine or a system function and then load them back to continue your operation where you left off. (You could do it manually without the stack, but it is such a frequently used function that it has CPU support.) But you can do pretty much anything without the stack on a PC.
For example an integer multiplication:
MUL BX
Multiplies AX register with BX register. (The result will be in DX and AX, DX containing the higher bits).
Stack based machines (like JAVA VM) use the stack for their basic operations. The above multiplication:
DMUL
This pops two values from the top of the stack, multiplies them, then pushes the result back onto the stack. The stack is essential for this kind of machine.
Some higher-level programming languages (like C and Pascal) use this latter method for passing parameters to functions: the parameters are pushed onto the stack (right to left in the common C calling convention, left to right in Pascal's), popped by the function body, and the return values are pushed back. (This is a choice that the compiler manufacturers make, and it somewhat abuses the way the x86 uses the stack.)
The heap
The heap is another concept that exists only in the realm of compilers and runtime libraries. It takes away the pain of handling the memory behind your variables, but it is not a function of the CPU or the OS; it is just a way of housekeeping the block of memory which is handed out by the OS. You could do this manually if you wanted to.
Accessing system resources
The operating system has a public interface how you can access its functions. In DOS parameters are passed in registers of the CPU. Windows uses the stack for passing parameters for OS functions (the Windows API).

Can static local variables cut down on memory allocation time?

Suppose I have a function in a single threaded program that looks like this
void f(some arguments){
char buffer[32];
some operations on buffer;
}
and f appears inside some loop that gets called often, so I'd like to make it as fast as possible. It looks to me like the buffer needs to get allocated every time f is called, but if I declare it to be static, this wouldn't happen. Is that correct reasoning? Is that a free speed up? And just because of that fact (that it's an easy speed up), does an optimizing compiler already do something like this for me?
No, it's not a free speedup.
First, the allocation is almost free to begin with (since it consists merely of adjusting the stack pointer by 32), and secondly, there are at least two reasons why a static variable might be slower:
you lose cache locality. Data allocated on the stack is going to be in the CPU cache already, so accessing it is extremely cheap. Static data is allocated in a different area of memory, so it may not be cached, in which case you get a cache miss and have to wait hundreds of clock cycles for the data to be fetched from main memory.
you lose thread safety. If two threads execute the function simultaneously, it'll crash and burn, unless a lock is placed so only one thread at a time is allowed to execute that section of the code. And that would mean you'd lose the benefit of having multiple CPU cores.
So it's not a free speedup. But it is possible that it is faster in your case (although I doubt it).
So try it out, benchmark it, and see what works best in your particular scenario.
Reserving 32 bytes on the stack costs virtually nothing on nearly all systems. But you should test it out. Benchmark a static version and a local version and post back.
For implementations that use a stack for local variables, allocation often just means adjusting a register (adding or subtracting a value), such as the Stack Pointer (SP) register. This time is negligible, usually one instruction or less.
However, initialization of stack variables takes a little longer, but again, not much. Check out your assembly language listing (generated by compiler or debugger) for exact details. There is nothing in the standard about the duration or number of instructions required to initialize variables.
Allocation of static local variables is usually treated differently. A common approach is to place these variables in the same area as global variables. Usually all the variables in this area are initialized before calling main(). Allocation in this case is a matter of assigning addresses to registers or storing the area information in memory. Not much execution time wasted here.
Dynamic allocation is the case where execution cycles are burned. But that is not in the scope of your question.
The way it is written now, there is no cost for allocation: the 32 bytes are simply on the stack. The only real work is whatever initialization you perform yourself (note that a plain char buffer[32] is not zero-initialized).
Local statics are not a good idea here. They won't be faster, and your function can no longer be used from multiple threads, as all calls share the same buffer. Also, before C++11, initialization of local statics was not guaranteed to be thread-safe (since C++11 it is, at the cost of a hidden guard check).
I would suggest a more general approach to this problem: if you have a function that is called many times and needs some local state, consider wrapping it in a class and making those variables data members. Consider also the case where you need the size to be dynamic, so instead of char buffer[32] you have std::vector<char> buffer(requiredSize); a vector is more expensive than a plain array to construct on every pass through the loop, which is another reason to hoist it into a member:
class BufferMunger {
public:
    BufferMunger() {}
    void DoFunction(args);
private:
    char buffer[32];
};

BufferMunger m;
for (int i = 0; i < 1000; i++) {
    m.DoFunction(arg[i]);  // only one allocation of buffer
}
There's another implication of making the buffer static, which is that the function is now unsafe in a multithreaded application, as two threads may call it and overwrite the data in the buffer at the same time. On the other hand it's safe to use a separate BufferMunger in each thread that requires it.
Note that block-level static variables in C++ (as opposed to C) are initialized on first use when they have a non-constant initializer. This implies that you'll be introducing the cost of an extra runtime check. The branch could potentially end up making performance worse, not better. (But really, you should profile, as others have mentioned.)
Regardless, I don't think it's worth it, especially since you'd be intentionally sacrificing re-entrancy.
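A hedged sketch of that first-use cost (the names here are illustrative, and the guard only applies to statics with dynamic, non-constant initialization):
#include <string>

std::string expensiveSetup() { return "some data"; }  // stand-in for a non-trivial initializer

void g()
{
    // Initialized on the first call to g(). The compiler emits a hidden guard
    // variable and, since C++11, a thread-safe check around it; that check is
    // the extra runtime cost referred to above.
    static std::string cache = expensiveSetup();
    (void)cache;  // ... use cache ...
}

int main()
{
    g();  // first call pays for the guarded initialization
    g();  // later calls only test the guard
    return 0;
}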
If you are writing code for a PC, there is unlikely to be any meaningful speed advantage either way. On some embedded systems, it may be advantageous to avoid all local variables. On some other systems, local variables may be faster.
An example of the former: on the Z80, the code to set up the stack frame for a function with any local variables was pretty long. Further, the code to access local variables was limited to using the (IX+d) addressing mode, which was only available for 8-bit instructions. If X and Y were both global/static or both local variables, the statement "X=Y" could assemble as either:
; If both are static or global: 6 bytes; 32 cycles
ld HL,(_Y) ; 16 cycles
ld (_X),HL ; 16 cycles
; If both are local: 12 bytes; 56 cycles
ld E,(IX+_Y) ; 14 cycles
ld D,(IX+_Y+1) ; 14 cycles
ld (IX+_X),E ; 14 cycles
ld (IX+_X+1),D ; 14 cycles
A 100% code space penalty and 75% time penalty in addition to the code and time to set up the stack frame!
On the ARM processor, a single instruction can load a variable which is located within +/-2K of an address pointer. If a function's local variables total 2K or less, they may be accessed with a single instruction. Global variables will generally require two or more instructions to load, depending upon where they are stored.
With gcc, I do see some speedup:
void f() {
    char buffer[4096];
}

int main() {
    int i;
    for (i = 0; i < 100000000; ++i) {
        f();
    }
}
And the time:
$ time ./a.out
real 0m0.453s
user 0m0.450s
sys 0m0.010s
changing buffer to static:
$ time ./a.out
real 0m0.352s
user 0m0.360s
sys 0m0.000s
Depending on what exactly the variable is doing and how it's used, the speedup ranges from almost nothing to nothing. Because (on x86 systems) stack memory is allocated for all local vars at the same time with a single instruction (sub esp, amount), having just one other stack var eliminates any gain. The only exception to this is really huge buffers, in which case a compiler might insert a call to _chkstk to allocate the memory (but if your buffer is that big you should re-evaluate your code). The compiler cannot turn stack memory into static memory via optimization, as it cannot assume that the function is going to be used in a single-threaded environment; it would also interfere with object constructors and destructors, etc.
If there are any local automatic variables in the function at all, the stack pointer needs to be adjusted. The time taken for the adjustment is constant, and will not vary based on the number of variables declared. You might save some time if your function is left with no local automatic variables whatsoever.
If a static variable is initialized, there will be a flag somewhere to determine if the variable has already been initialized. Checking the flag will take some time. In your example the variable is not initialized, so this part can be ignored.
Static variables should be avoided if your function has any chance of being called recursively or from two different threads.
It will make the function substantially slower in most real cases. This is because the static data segment is not near the stack, so you lose cache locality and will likely take a cache miss when you try to access it. When you allocate a regular char[32] on the stack, however, it is right next to all your other needed data and costs very little to access. The initialization costs of a stack-based array of char are meaningless.
This is ignoring that statics have many other problems.
You really need to actually profile your code and see where the slowdowns are, because no profiler will tell you that allocating a statically-sized buffer of characters is a performance problem.