Why does the VS Debug build allocate variables so far apart? - c++

I'm using Visual Studio 2019, and I noticed that in debug builds, variables are allocated far apart from one another. I looked at the project properties and tried searching online but could not find anything. I ran the code below in both Debug and Release mode; here are the respective outputs.
#include <iostream>

int main() {
    int a = 3;
    int b = 5;
    int c = 8;
    int d[5] = { 10,10,10,10,10 };
    int e = 14;
    std::cout << "a: " << &a
              << "\nb: " << &b
              << "\nc: " << &c
              << "\nd_start: " << &d[0]
              << "\nd_end: " << &d[4] + 1
              << "\ne: " << &e
              << std::endl;
}
As you can see below, variables are allocated as you would expect (one after the other) with no wasted memory in between. Even the last variable, e, is optimized to slot between c and d.
// Release_x64 Build Output
a: 0000003893EFFC40
b: 0000003893EFFC44
c: 0000003893EFFC48
d_start: 0000003893EFFC50
d_end: 0000003893EFFC64
e: 0000003893EFFC4C // e is optimized in between c and d
Below is the output that confuses me. Here you can see that a and b are allocated 32 bytes apart! So there are 28 bytes of wasted/uninitialized memory between them. The same thing happens for the other variables, except for int d[5]: d has 32 uninitialized bytes after c but only 24 uninitialized bytes before e.
// Debug_x64 Build Output
a: 00000086D7EFF3F4
b: 00000086D7EFF414
c: 00000086D7EFF434
d_start: 00000086D7EFF458
d_end: 00000086D7EFF46C
e: 00000086D7EFF484
My question is: why is this happening? Why does MSVC allocate these variables so far apart from one another, and what determines the amount of separation, such that it is different for arrays?

The debug version of the allocator allocates storage differently than the release version. In particular, the debug version allocates some space at the beginning and end of each block of storage, so its allocation patterns are somewhat different.
The debug allocator also checks the storage at the start and end of the block it allocated to see if it has been damaged in any way.
Storage is allocated in quantized chunks, where the quantum is unspecified but is something like 16 or 32 bytes. Thus, if you allocate a DWORD array of six elements (size = 6 * sizeof(DWORD) = 24 bytes), the allocator will actually deliver 32 bytes (one 32-byte quantum or two 16-byte quanta). So if you write element [6] (the seventh element), you overwrite some of the "dead space" and the error is not detected. But in the release version, the quantum might be 8 bytes, so three 8-byte quanta would be allocated, and writing the [6] element of the array would overwrite part of the storage allocator's data structure that belongs to the next chunk. After that it is all downhill. The error might not even show up until the program exits! You can construct similar "boundary condition" situations for any quantum size. Because the quantum size is the same for both versions of the allocator, but the debug version of the allocator adds hidden space for its own purposes, you will get different storage allocation patterns in debug and release mode.

Related

Max number of elements in a vector

I was looking to see how many elements I can stick into a vector before the program crashes. When running the code below, the program crashed with a bad_alloc at i=90811045, i.e. when trying to add the 90811045th element. My question is: why 90811045?
it is:
not a power of two
not the value that vector.max_size() gives
the same number both in debug and release
the same number after restarting my computer
the same number regardless of what the value of the long long is
note: I know I can fix this by using vector.reserve() or other methods, I am just interested in where 90811045 comes from.
code used:
#include <iostream>
#include <vector>

int main() {
    std::vector<long long> myLongs;
    std::cout << "Max size expected : " << myLongs.max_size() << std::endl;
    for (int i = 0; i < 160000000; i++) {
        myLongs.push_back(i);
        if (i % 10000 == 0) {
            std::cout << "Still going! : " << i << " \r";
        }
    }
    return 0;
}
extra info:
I am currently using 64 bit windows with 16 GB of ram.
Why 90811045?
It's probably just incidental.
That vector is not the only thing that uses memory in your process. There is the execution stack where local variables are stored. There is memory allocated for buffering the input and output streams. Furthermore, the global memory allocator uses some of the memory for bookkeeping.
90811044 elements were added successfully. The vector implementation (typically) has a deterministic strategy for allocating a larger internal buffer. Typically, it multiplies the previous capacity by a constant factor (greater than 1). Hence, we can conclude that 90811044 * sizeof(long long) + other_usage is consistently small enough to be allocated successfully, but (90811044 * sizeof(long long)) * some_factor + other_usage is consistently too much.

Use 2 malloc calls to define 1 array c

I am trying to make an unsigned char array in c++ that is ~ 4 gigabytes in size.
The code I am using to malloc the space for the array is below:
unsigned char *myArray = (unsigned char*)malloc(sizeof(char)*3774873600);
if (myArray == NULL)
{
    cout << "Error! memory could not be allocated. \n";
}
else
{
    cout << "You allocated memory for myArray \n";
}
When I run the program I get the success message saying that the memory was allocated. Then when I run:
myArray[0] = 20;
cout << myArray[0];
I get 20 as the answer.
However if I run:
myArray[3774873599] = 20;
cout << myArray[3774873599];
The program crashes.
I was thinking it is probably because the malloc call is asking for too much contiguous memory in one call (~4 GB).
Perhaps it would be better to split the malloc call into 2 parts, and then join them together as 1 contiguous array. Would that be possible?
Also, in case you are wondering, my computer has 64 GB of memory on a 64-bit OS, and the program is compiled as 64-bit, so I don't think it's at its limits or anything.
Any help would be much appreciated!

What is the largest amount of memory I can allocate on my MacBook Pro? [duplicate]

This question already has answers here:
Allocating more memory than there exists using malloc
(6 answers)
Closed 6 years ago.
I'm trying to figure out how much memory I can allocate before the allocation will fail.
This simple C++ code allocates a buffer (of size 1024 bytes), assigns to the last five characters of the buffer, reports, and then deletes the buffer. It then doubles the size of the buffer and repeats until it fails.
Unless I'm missing something, the code is able to allocate up to 64 terabytes of memory before it fails on my MacBook Pro. Is this even possible? How can it allocate so much more memory than I have on the machine? I must be missing something simple.
#include <iostream>
using namespace std;

int main(int argc, char *argv[])
{
    long long size = 1024;
    long cnt = 0;
    while (true)
    {
        char *buffer = new char[size];
        // Assume the alloc succeeded. We are looking for the failure after all.
        // Try to write to the allocated memory, may fail
        buffer[size-5] = 'T';
        buffer[size-4] = 'e';
        buffer[size-3] = 's';
        buffer[size-2] = 't';
        buffer[size-1] = '\0';
        // report
        if (cnt < 10)
            cout << "size[" << cnt << "]: " << (size/1024.) << "Kb ";
        else if (cnt < 20)
            cout << "size[" << cnt << "]: " << (size/1024./1024.) << "Mb ";
        else
            cout << "size[" << cnt << "]: " << (size/1024./1024./1024.) << "Gb ";
        cout << "addr: " << (void*)buffer << " ";  // print the pointer in hex
        cout << "str: " << &buffer[size-5] << "\n";
        // cleanup
        delete [] buffer;
        // double size and continue
        size *= 2;
        cnt++;
    }
    return 0;
}
When you ask for memory, an operating system reserves the right not to actually give you that memory until you actually use it.
That's what's happening here: you're only ever using 5 bytes. My ZX81 from the 1980s could handle that.
MacOS X, like almost every modern operating system, uses "delayed allocation" for memory. When you call new, the OS doesn't actually allocate any memory. It simply makes a note that your program wants a certain amount of memory, and that memory area you want starts at a certain address. Memory is only actually allocated when your program tries to use it.
Further, memory is allocated in units called "pages". I believe MacOS X uses 4kb pages, so when your program writes to the end of the buffer, the OS gives you 4096 bytes there, while retaining the rest of the buffer as simply a "your program wants this memory" note.
As for why you're hitting the limit at 64 terabytes, it's because current x86-64 processors use 48-bit addressing. This gives 256 TB of address space, which is split evenly between the operating system and your program. Doubling the 64 TB allocation would exactly fit in your program's 128 TB half of the address space, except that the program is already taking up a little bit of it.
Virtual memory is the key to allocating more address space than you have physical RAM+swap space.
malloc uses the mmap(MAP_ANONYMOUS) system call to get pages from the OS. (Assuming OS X works like Linux, since they're both POSIX OSes.) These pages are all copy-on-write mapped to a single physical zero page, i.e. they all read as zero with only a TLB miss (no page fault and no allocation of physical RAM). An x86 page is 4 KiB. (I'm not mentioning hugepages because they're not relevant here.)
Writing to any of those pages triggers a soft page fault for the kernel to handle the copy-on-write. The kernel allocates a zeroed page of physical memory and re-wires that virtual page to be backed by the physical page. On return from the page fault, the store is re-executed and succeeds this time.
So after allocating 64TiB and storing 5 bytes to the end of it, you've used one extra page of physical memory. (And added an entry to malloc's bookkeeping data, but that was probably already allocated and in a dirty page. In a similar question about multiple tiny allocations, malloc's bookkeeping data was what eventually used up all the space).
If you actually dirtied more pages than the system had RAM + swap, the kernel would have a problem because it's too late for malloc to return NULL. This is called "overcommit", and some OSes enable it by default while others don't. In Linux, it's configurable.
As Mark explains, you run out of steam at 64TiB because current x86-64 implementations only support 48-bit virtual addresses. The upper 16 bits need to be copies of bit 47. (i.e. an address is only canonical if the 64-bit value is the sign-extension of the low 48 bits).
This requirement stops programs from doing anything "clever" with the high bits, and then breaking on future hardware that does support even larger virtual address spaces.

Why is iterating a large array on the heap faster than iterating same size array on the stack?

I am allocating 2 same size arrays, one on stack, one on heap, then iterating over them with trivial assignment.
The executable is compiled to allocate 40 MB for the main thread's stack.
This code has only been tested to compile in VC++ with the /STACK:41943040 linker flag.
#include "stdafx.h"
#include <string>
#include <iostream>
#include <malloc.h>
#include <windows.h>
#include <ctime>
using namespace std;

size_t stackavail()
{
    static unsigned StackPtr;                        // top of stack ptr
    __asm mov [StackPtr],esp                         // mov pointer to top of stack
    static MEMORY_BASIC_INFORMATION mbi;             // page range
    VirtualQuery((PVOID)StackPtr,&mbi,sizeof(mbi));  // get range
    return StackPtr-(unsigned)mbi.AllocationBase;    // subtract from top (stack grows downward on win)
}

int _tmain(int argc, _TCHAR* argv[])
{
    string input;

    cout << "Allocating 22mb on stack." << endl;
    unsigned int start = clock();
    char eathalfastack[23068672]; // approx 22mb
    auto length = sizeof(eathalfastack)/sizeof(char);
    cout << "Time taken in ms: " << clock()-start << endl;

    cout << "Setting through array." << endl;
    start = clock();
    for( int i = 0; i < length; i++ ){
        eathalfastack[i] = i;
    }
    cout << "Time taken in ms: " << clock()-start << endl;
    cout << "Free stack space: " << stackavail() << endl;

    cout << "Allocating 22mb on heap." << endl;
    start = clock();
    // auto* heaparr = new int[23068672]; // original buggy line
    auto* heaparr = new byte[23068672];   // corrected
    cout << "Time taken in ms: " << clock()-start << endl;

    start = clock();
    cout << "Setting through array." << endl;
    for( int i = 0; i < length; i++ ){
        heaparr[i] = i;
    }
    cout << "Time taken in ms: " << clock()-start << endl;
    delete[] heaparr;
    getline(cin, input);
}
The output is this:
Allocating 22mb on stack.
Time taken in ms: 0
Setting through array.
Time taken in ms: 45
Free stack space: 18872076
Allocating 22mb on heap.
Time taken in ms: 20
Setting through array.
Time taken in ms: 35
Why is iteration of stack array slower than same thing on heap?
EDIT:
nneonneo caught my error
Now output is identical:
Allocating 22mb on stack.
Time taken in ms: 0
Setting through array.
Time taken in ms: 42
Free stack space: 18871952
Allocating 22mb on heap.
Time taken in ms: 4
Setting through array.
Time taken in ms: 41
Release build per Öö Tiib's answer below:
Allocating 22mb on stack.
Time taken in ms: 0
Setting through array.
Time taken in ms: 5
Free stack space: 18873508
Allocating 22mb on heap.
Time taken in ms: 0
Setting through array.
Time taken in ms: 10
Your arrays are not the same size; sizeof(char[23068672]) != sizeof(int[23068672]), and the elements are of different types.
Something is wrong with your PC; on my ages-old Pentium 4 it takes 15 ms to assign such a stack-based char array. Did you try a debug build or something?
There are two parts to your question:
Allocating space on the stack vs the heap
Accessing a memory location on the stack vs a globally visible one
Allocating space
First, let's look at allocating space on the stack. The stack, as we know, grows downwards on the x86 architecture. So, in order to allocate space on the stack, all you have to do is decrement the stack pointer: just one assembly instruction (sub esp, amount). This instruction is always present in the prologue of a function (the function set-up code). So, as far as I know, allocating space on the stack takes essentially no time. Cost of allocating space on the stack = (one decrement of the stack pointer). On a modern super-scalar machine, the execution of this instruction will be overlapped with other instructions.
Allocating space on the heap, on the other hand, requires a library call to new/malloc. The library call first checks if there is some space on the heap. If yes, then it just returns a pointer to the first available address. If space is not available on the heap, it uses a brk system call to request that the kernel modify the page-table entries for an additional page. A system call is a costly operation: it causes a pipeline flush, TLB pollution, etc. So, the cost of allocating space on the heap = (function call + computation to find space + (brk system call)?). Allocating space on the heap definitely seems an order of magnitude slower than on the stack.
Accessing element
The addressing modes of the x86 ISA allow a memory operand for a global variable to use direct addressing (temp = mem[addr]), while variables on the stack are generally accessed using indexed addressing (temp = mem[stack-pointer + offset-on-stack]). My assumption is that both memory operands should take almost the same time; however, the direct addressing mode seems definitely faster than the indexed addressing mode. Regarding memory access of an array, we have two operands for accessing any element: the base address of the array and the index variable. When accessing an array on the stack, we add one more operand: the stack pointer. The x86 addressing modes have a provision for such addresses: base + scale*index + offset. So a stack array element access is temp = mem[sp + base-offset + iterator*element-size], and a heap array access is temp = mem[base-address + iterator*element-size]. By this reasoning, the stack access should be costlier than the heap access.
Now, coming to the generic case of iteration: if the iteration is slower for the stack, the addressing mode may be (I am not completely sure) the bottleneck; and if allocating the space is slow, the system call may be the bottleneck.

Maximum memory for stack, static and heap memory in C++

I am trying to find the maximum memory that I could allocate on stack, global and heap memory in C++. I am trying this program on a Linux system with 32 GB of memory, and on my Mac with 2 GB of RAM.
/* test to determine the maximum memory that could be allocated for static, heap and stack memory */
#include <iostream>
using namespace std;

//static/global
long double a[200000000];

int main()
{
    //stack
    long double b[999999999];
    //heap
    long double *c = new long double[3999999999];
    cout << "Sizeof(long double) = " << sizeof(long double) << " bytes\n";
    cout << "Allocated Global (Static) size of a = " << (double)((sizeof(a))/(double)(1024*1024*1024)) << " Gbytes \n";
    cout << "Allocated Stack size of b = " << (double)((sizeof(b))/(double)(1024*1024*1024)) << " Gbytes \n";
    cout << "Allocated Heap Size of c = " << (double)((3999999999 * sizeof(long double))/(double)(1024*1024*1024)) << " Gbytes \n";
    delete[] c;
    return 0;
}
Results (on both):
Sizeof(long double) = 16 bytes
Allocated Global (Static) size of a = 2.98023 Gbytes
Allocated Stack size of b = 14.9012 Gbytes
Allocated Heap Size of c = 59.6046 Gbytes
I am using GCC 4.2.1. My question is:
Why does my program run at all? I expected that, since the stack would be exhausted (16 MB on Linux, 8 MB on Mac), the program would throw an error. I saw many of the questions asked on this topic, but I couldn't solve my problem from the answers given there.
On some systems you can allocate any amount of memory that fits in the address space. The problems begin when you start actually using that memory.
What happens is that the OS reserves a virtual address range for the process, without mapping it to anything physical, or even checking that there's enough physical memory (including swap) to back that address range up. The mapping only happens in a page-by-page fashion, when the process tries to access newly allocated pages. This is called memory overcommitment.
Try accessing every sysconf(_SC_PAGESIZE)th byte of your huge arrays and see what happens.
Linux overcommits, meaning that it can allow a process more memory than is available on the system, but it is not until that memory is actually used by the process that actual memory (physical main memory or swap space on disk) is allocated for the process. My guess would be that Mac OS X works in a similar way.