Why is deallocating heap memory much slower than allocating it? - c++

This is an empirical assumption (that allocating is faster than deallocating).
I guess this is also one of the reasons why heap-based storage (such as STL containers) chooses not to return currently unused memory to the system (which is why the shrink-to-fit idiom was born).
And of course we shouldn't confuse 'heap' memory with 'heap'-like data structures.
So why is deallocation slower?
Is it Windows-specific (I see it on Win 8.1) or OS-independent?
Is some C++-specific memory manager automatically involved when using 'new' / 'delete', or does memory management rely completely on the OS? (I know C++11 introduced some garbage-collection support, which I have never really used, preferring to rely on the good old stack and static storage duration, or on self-managed containers and RAII.)
Also, in the code of the FOLLY string I saw the old C heap allocation / deallocation being used. Is it faster than C++ 'new' / 'delete'?
P.S. Please note that the question is not about virtual memory mechanics; I understand that user-space programs don't use real memory addresses.

The assertion that allocating memory is faster than deallocating it seemed a bit odd to me, so I tested it. I ran a test where I allocated 64MB of memory in 32-byte chunks (so 2M calls to new), and I tried deleting that memory in the same order it was allocated, and in a random order. I found that linear-order deallocation was about 3% faster than allocation, and that random deallocation was about 10% slower than linear allocation.
I then ran a test where I started with 64MB of allocated memory, and then 2M times either allocated new memory or deleted existing memory (at random). Here, I found that deallocation was about 4.3% slower than allocation.
So, it turns out you were correct - deallocation is slower than allocation (though I wouldn't call it "much" slower). I suspect this has simply to do with more random accesses, but I have no evidence for this other than that the linear deallocation was faster.
To answer some of your questions:
Is there some C++ specific memory manager automatically involved on using 'new' / 'delete'?
Yes. The OS has system calls which allocate pages of memory (typically 4KB chunks) to processes. It's the process' job to divide up those pages into objects. Try looking up the "GNU Memory Allocator."
I saw using old C heap allocation / deallocation, is it faster than C++ 'new' / 'delete'?
Most C++ new/delete implementations just call malloc and free under the hood. This is not required by the standard, however, so always pair each allocation with its matching deallocation function: memory obtained with new must be released with delete, and memory obtained with malloc must be released with free.
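To make the layering concrete, here is a minimal sketch of a class whose own operator new and operator delete simply forward to malloc and free, which is roughly what many implementations do for the global operators. The type Tracked and its counters are hypothetical names used only for illustration; nothing here is taken from a particular standard library.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical example type: its class-specific operator new/delete forward
// to malloc/free and count how often they are called.
struct Tracked {
    static std::size_t allocations;
    static std::size_t deallocations;

    unsigned char payload[32];  // 32 bytes, like Obj32 in the test code

    static void* operator new(std::size_t size) {
        void* p = std::malloc(size);
        if (!p) throw std::bad_alloc();  // new reports failure by throwing
        ++allocations;
        return p;
    }

    static void operator delete(void* p) noexcept {
        if (p) ++deallocations;
        std::free(p);  // must match the malloc above
    }
};

std::size_t Tracked::allocations = 0;
std::size_t Tracked::deallocations = 0;
```

Writing `Tracked* t = new Tracked; delete t;` goes through both functions once, so the object must never be released with free directly: the pairing is decided by whoever implemented the operators.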
I ran my tests with the native testing framework provided in Visual Studio 2015, on a Windows 10 64-bit machine (The tests were also 64-bit). Here's the code:
#include "stdafx.h"
#include "CppUnitTest.h"
#include <cstdint> // uint64_t
#include <cstdlib> // srand, rand
using namespace Microsoft::VisualStudio::CppUnitTestFramework;
namespace AllocationSpeedTest
{
class Obj32 {
uint64_t a;
uint64_t b;
uint64_t c;
uint64_t d;
};
constexpr int len = 1024 * 1024 * 2;
Obj32* ptrs[len];
TEST_CLASS(UnitTest1)
{
public:
TEST_METHOD(Linear32Alloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
}
TEST_METHOD(Linear32AllocDealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
for (int i = 0; i < len; ++i) {
delete ptrs[i];
}
}
TEST_METHOD(Random32AllocShuffle)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
int pos = (rand() % (len - i)) + i;
Obj32* temp = ptrs[i];
ptrs[i] = ptrs[pos];
ptrs[pos] = temp;
}
}
TEST_METHOD(Random32AllocShuffleDealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
int pos = (rand() % (len - i)) + i;
Obj32* temp = ptrs[i];
ptrs[i] = ptrs[pos];
ptrs[pos] = temp;
}
for (int i = 0; i < len; ++i) {
delete ptrs[i];
}
}
TEST_METHOD(Mixed32Both)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
ptrs[i] = new Obj32();
}
else {
delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Alloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
ptrs[i] = new Obj32();
}
else {
//delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Dealloc)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
//ptrs[i] = new Obj32();
}
else {
delete ptrs[i];
}
}
}
TEST_METHOD(Mixed32Neither)
{
for (int i = 0; i < len; ++i) {
ptrs[i] = new Obj32();
}
srand(0);
for (int i = 0; i < len; ++i) {
if (rand() % 2) {
//ptrs[i] = new Obj32();
}
else {
//delete ptrs[i];
}
}
}
};
}
And here are the raw results over several runs. All numbers are in milliseconds.

I had much the same idea as @Basile: I wondered whether your base assumption was actually (even close to) correct. Since you tagged the question C++, I wrote a quick benchmark in C++ instead.
#include <vector>
#include <iostream>
#include <numeric>
#include <chrono>
#include <iomanip>
#include <locale>
int main() {
std::cout.imbue(std::locale(""));
using namespace std::chrono;
using factor = microseconds;
auto const size = 2000;
std::vector<int *> allocs(size);
auto start = high_resolution_clock::now();
for (int i = 0; i < size; i++)
allocs[i] = new int[size];
auto stop = high_resolution_clock::now();
auto alloc_time = duration_cast<factor>(stop - start).count();
start = high_resolution_clock::now();
for (int i = 0; i < size; i++)
delete[] allocs[i];
stop = high_resolution_clock::now();
auto del_time = duration_cast<factor>(stop - start).count();
std::cout << std::left << std::setw(20) << "alloc time: " << alloc_time << " uS\n";
std::cout << std::left << std::setw(20) << "del time: " << del_time << " uS\n";
}
I also used VC++ on Windows instead of gcc on Linux. The result wasn't much different though: freeing the memory took substantially less time than allocating it did. Here are the results from three successive runs.
alloc time: 2,381 uS
del time: 1,429 uS
alloc time: 2,764 uS
del time: 1,592 uS
alloc time: 2,492 uS
del time: 1,442 uS
I'd warn, however, that allocation and freeing are handled (primarily) by the standard library, so this could differ between one standard library and another (even when using the same compiler). I'd also note that it wouldn't surprise me if this were to change somewhat in multi-threaded code. Although it's not actually correct, a few authors appear to be under the misapprehension that freeing in a multithreaded environment requires locking a heap for exclusive access. This can be avoided, but the means to do so aren't necessarily immediately obvious.

I am not sure of your observation. I wrote the following program (on Linux, hopefully you could port it to your system).
// public domain code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <errno.h>
#include <string.h>
#include <assert.h>
const unsigned possible_word_sizes[] = {
1, 2, 3, 4, 5,
8, 12, 16, 24,
32, 48, 64, 128,
256, 384, 2048
};
long long totalsize;
// return a calloc-ed array of nbchunks malloced zones of
// somehow random size
void **
malloc_chunks (int nbchunks)
{
const int nbsizes =
(int) (sizeof (possible_word_sizes)
/ sizeof (possible_word_sizes[0]));
void **ad = calloc (nbchunks, sizeof (void *));
if (!ad)
{
perror ("calloc chunks");
exit (EXIT_FAILURE);
};
for (int ix = 0; ix < nbchunks; ix++)
{
unsigned sizindex = random () % nbsizes;
unsigned size = possible_word_sizes[sizindex];
void *zon = malloc (size * sizeof (void *));
if (!zon)
{
fprintf (stderr,
"malloc#%d (%d words) failed (total %lld) %s\n",
ix, size, totalsize, strerror (errno));
exit (EXIT_FAILURE);
}
((int *) zon)[0] = ix;
totalsize += size;
ad[ix] = zon;
}
return ad;
}
void
free_chunks (void **chks, int nbchunks)
{
// first, free the two thirds of chunks in random order
for (int i = 0; 3 * i < 2 * nbchunks; i++)
{
int pix = random () % nbchunks;
if (chks[pix])
{
free (chks[pix]);
chks[pix] = NULL;
}
}
// then, free the rest in reverse order
for (int i = nbchunks - 1; i >= 0; i--)
if (chks[i])
{
free (chks[i]);
chks[i] = NULL;
}
}
int
main (int argc, char **argv)
{
assert (sizeof (int) <= sizeof (void *));
int nbchunks = (argc > 1) ? atoi (argv[1]) : 32768;
if (nbchunks < 128)
nbchunks = 128;
srandom (time (NULL));
printf ("nbchunks=%d\n", nbchunks);
void **chks = malloc_chunks (nbchunks);
clock_t clomall = clock ();
printf ("clomall=%ld totalsize=%lld words\n",
(long) clomall, totalsize);
free_chunks (chks, nbchunks);
clock_t clofree = clock ();
printf ("clofree=%ld\n", (long) clofree);
return 0;
}
I compiled it with gcc -O2 -Wall mf.c -o mf on my Debian/Sid/x86-64 (i3770k, 16 GB). I ran time ./mf 100000 and got:
nbchunks=100000
clomall=54162 totalsize=19115681 words
clofree=83895
./mf 100000 0.02s user 0.06s system 95% cpu 0.089 total
On my system, clock gives CPU microseconds. If the call to random is negligible (and I don't know if it is) w.r.t. malloc & free time, I tend to disagree with your observations. free seems to be twice as fast as malloc. My gcc is 6.1, my libc is Glibc 2.22.
Please take time to compile the above benchmark on your system and report the timings.
FWIW, I took Jerry's code and
g++ -O3 -march=native jerry.cc -o jerry
time ./jerry; time ./jerry; time ./jerry
gives
alloc time: 1940516
del time: 602203
./jerry 0.00s user 0.01s system 68% cpu 0.016 total
alloc time: 1893057
del time: 558399
./jerry 0.00s user 0.01s system 68% cpu 0.014 total
alloc time: 1818884
del time: 527618
./jerry 0.00s user 0.01s system 70% cpu 0.014 total

When you allocate small memory blocks, the block size you specify maps directly to a suballocator for that size, which is commonly represented as a "slab" of memory containing same-size records, to avoid memory fragmentation. This can be very fast, similar to an array access. But freeing such blocks is not so straightforward, because you are passing a pointer to memory of unknown size, requiring additional work to determine which slab it belongs to before the block can be returned to its proper place.
When you allocate large blocks of virtual memory, a memory page range is set up in your process space without actually mapping any physical memory to it, and that requires very little work to accomplish. But freeing such large blocks can require much more work, because the pointer freed must first be matched to the page tables for that range, followed by walking through all of the page entries for the memory range that it spans, and releasing all of the physical memory pages assigned to that range by the intervening page faults.
Of course, the details of this will vary depending on the implementation being used, but the principles remain much the same: memory allocation of a known block size requires less effort than releasing a pointer to a memory block of unknown size. My knowledge of this comes directly from my experience developing high-performance commercial grade RAII memory allocators.
I should also point out that since every heap allocation has a matching and corresponding release, this pair of operations represents a single allocation cycle, i.e. as the two sides of one coin. Together, their execution time can be accurately measured, but separately such measurement is difficult to pin down, as it varies widely depending on block size, previous activity across similar sizes, caching and other operational considerations. But in the end, allocate/free differences may not much matter, since you don't do one without the other.
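The asymmetry described above can be sketched with a toy size-class allocator. Everything here (Slab, SizeClassAllocator, the 32- and 64-byte classes, the slot counts) is a hypothetical illustration, not the design of any real allocator: allocation picks the slab straight from the requested size, while release only receives a pointer and first has to find out which slab owns it.

```cpp
#include <cstddef>
#include <functional>
#include <new>
#include <vector>

// Hypothetical slab: a block of memory cut into equally sized slots,
// with the free slots linked into a singly linked list.
class Slab {
public:
    Slab(std::size_t slot_size, std::size_t slot_count)
        : slot_size_(round_up(slot_size)),
          storage_(slot_size_ * slot_count) {
        // Push slots in reverse so the first allocation returns slot 0.
        for (std::size_t i = slot_count; i-- > 0;)
            push(storage_.data() + i * slot_size_);
    }
    Slab(const Slab&) = delete;
    Slab& operator=(const Slab&) = delete;

    // Allocation: pop the head of the free list.
    void* allocate() {
        if (!head_) return nullptr;  // slab exhausted
        FreeNode* node = head_;
        head_ = node->next;
        return node;
    }

    // Does this pointer lie inside this slab's storage?
    bool owns(const void* p) const {
        const unsigned char* c = static_cast<const unsigned char*>(p);
        std::less<const unsigned char*> before;
        return !before(c, storage_.data()) &&
               before(c, storage_.data() + storage_.size());
    }

    // Release: push the slot back, once the owning slab is known.
    void deallocate(void* p) { push(p); }

    std::size_t slot_size() const { return slot_size_; }

private:
    struct FreeNode { FreeNode* next; };

    // Slots must be able to hold (and align) a FreeNode.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = sizeof(FreeNode);
        return ((n < a ? a : n) + a - 1) / a * a;
    }
    void push(void* p) { head_ = new (p) FreeNode{head_}; }

    std::size_t slot_size_;
    std::vector<unsigned char> storage_;
    FreeNode* head_ = nullptr;
};

// Hypothetical two-class allocator built from two slabs.
class SizeClassAllocator {
public:
    // The size is known, so the slab is selected directly.
    void* allocate(std::size_t size) {
        if (size <= small_.slot_size()) return small_.allocate();
        if (size <= large_.slot_size()) return large_.allocate();
        return nullptr;  // larger requests are out of scope for this sketch
    }
    // Only a pointer is known, so the owner has to be looked up first.
    void deallocate(void* p) {
        if (small_.owns(p)) small_.deallocate(p);
        else if (large_.owns(p)) large_.deallocate(p);
    }

private:
    Slab small_{32, 1024};
    Slab large_{64, 1024};
};
```

With only two slabs the lookup is trivial; the point is merely that release needs a step that allocation does not. How expensive that step is in practice depends entirely on the implementation.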

The problem here is heap fragmentation. Programs written in languages with explicit pointer arithmetic have no realistic way of defragmenting the heap.
If your heap is fragmented, you can't return memory to the OS. The OS, barring virtual memory, depends on a brk(2)-like mechanism, i.e. you set an upper bound for all memory addresses you'll refer to. But when you have even one buffer allocated and still in use near the existing boundary, you can't return memory to the OS explicitly. It doesn't matter if 99% of all the memory in your program is freed.
Deallocation doesn't have to be slower than allocation. But the fact that you have manual deallocation with heap fragmentation makes allocation slower and more complex.
GCs fight this by compacting the heap. This way, allocation is just incrementing a pointer for them, and deallocation is not needed for the bulk of objects.
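The "allocation is just incrementing a pointer" part can be sketched with a bump (arena) allocator. This is not a garbage collector and does no compaction; BumpArena is a hypothetical name, and the sketch assumes the buffer returned by std::vector is aligned for std::max_align_t, which is typical but not something the code checks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bump ("arena") allocator: allocating only advances an offset.
class BumpArena {
public:
    explicit BumpArena(std::size_t capacity) : buffer_(capacity) {}

    // Returns nullptr when the arena is full. `align` must be a power of two.
    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
        // Round the current offset up to the requested alignment.
        std::size_t start = (offset_ + align - 1) & ~(align - 1);
        if (start > buffer_.size() || size > buffer_.size() - start)
            return nullptr;
        offset_ = start + size;
        return buffer_.data() + start;
    }

    // There is no per-object free: everything is released at once.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_ = 0;
};
```

A compacting collector gets the same cheap allocation path by moving live objects together so that the free space is one contiguous region again; with raw pointers into the heap, as in C or C++, objects cannot be moved like that.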

Related

Check memory usage of radixsort C++

I have implemented radix sort in c++
...
void countSort(int *tab, int size, int exp, string *comp, bool *stat) {
int output[size];
int i, index, count[10] = {0};
sysinfo(&amem);
for (i = 0; i < size; i++){
index = (tab[i]/exp)%10;
count[index]++;
}
for (i = 1; i < 10; i++)
count[i] += count[i - 1];
for (i = size - 1; i >= 0; i--) {
index = count[ (tab[i]/exp)%10 ] - 1;
output[index] = tab[i];
count[ (tab[i]/exp)%10 ]--;
}
if((*comp).rfind("<",0) == 0){
for (i = 0; i < size; i++){
tab[i] = output[i];
swap_counter++;
if(!*stat){ fprintf(stderr, "przestawiam\n"); }
}
}else{
for (i = 0; i < size; i++){
tab[i] = output[size-i-1];
swap_counter++;
if(!*stat){ fprintf(stderr, "przestawiam\n"); }
}
}
}
void radix_sort(int size, int *tab, string *comp, bool *stat) {
int m;
auto max = [tab, size](){
int m = tab[0];
for (int i = 1; i < size; i++) {
if (tab[i] > m)
m = tab[i];
}
return m;
};
m = max();
for (int exp = 1; m/exp > 0; exp *= 10)
countSort(tab, size, exp, comp, stat);
}
...
int main(){
for(int n = 100; n <= 10000; n += 100){
int *tab = (int *) malloc(n*sizeof(int)); // allocate a fresh array for each size
generate_random_tab(tab, n);
string comp = ">=";
bool stat = true;
radix_sort(n, tab, &comp, &stat);
free(tab);
}
}
Now I want to check and print out how much memory radix sort uses.
I want to do this to compare how much memory different sorting algorithms use.
How can I achieve this?
I was given a hint to use sysinfo() to analyze how system memory usage changes, but I couldn't get consistent results.
(I'm working on linux)
Your program has linear memory usage: malloc(n*sizeof(int)) and int output[size];, one of them on the heap, the other on the stack. So basically you don't need run-time measurements, as you can calculate it easily.
As you are on Linux, for more complicated cases there is e.g. the massif tool in valgrind, but it focuses on heap measurements (which is enough in the normal cases where you want to measure memory usage, as the stack is usually too small for serious amounts of data).
sysinfo only shows whole system memory, not individual process memory.
For process memory usage, you might try mallinfo, e.g.
struct mallinfo before = mallinfo();
// radix sort code
struct mallinfo after = mallinfo();
Now you can compare the various entries before and after your sorting code.
Be aware that this doesn't include stack memory.
I don't know, however, how accurate these numbers are in a C++ context.
Testing a complete example
#include <malloc.h>
#include <stdio.h>
#define SHOW(m) printf(#m "=%d-%d\n", after.m, before.m)
int main()
{
struct mallinfo before = mallinfo();
void *p1 = malloc(1000000);
//int *p2 = new int[1000000];
struct mallinfo after = mallinfo();
SHOW(arena);
SHOW(ordblks);
SHOW(smblks);
SHOW(hblks);
SHOW(hblkhd);
SHOW(usmblks);
SHOW(fsmblks);
SHOW(uordblks);
SHOW(fordblks);
SHOW(keepcost);
return 0;
}
shows different values, depending on whether you use malloc
arena=135168-0
ordblks=1-1
smblks=0-0
hblks=1-0
hblkhd=1003520-0
usmblks=0-0
fsmblks=0-0
uordblks=656-0
fordblks=134512-0
keepcost=134512-0
or new
arena=135168-135168
ordblks=1-1
smblks=0-0
hblks=1-0
hblkhd=4001792-0
usmblks=0-0
fsmblks=0-0
uordblks=73376-73376
fordblks=61792-61792
keepcost=61792-61792
It looks like C++ (Ubuntu, GCC 9.2.1) does some preallocation, but the relevant number seems to be hblkhd (on my machine).
Since your only dynamic allocation is at the beginning of main, you must do the first mallinfo there. Testing only the radix sort code reveals that there are no additional dynamic memory allocations.
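Because mallinfo leaves out stack memory, a different and coarser option is the peak resident set size reported by getrusage. This is not what the answers above use; it is a swapped-in alternative, sketched here on the assumption of a Linux system, where ru_maxrss is given in kilobytes. The function name peak_rss_kb is made up for this example.

```cpp
#include <sys/resource.h>

// Peak resident set size of the calling process, as reported by getrusage.
// On Linux, ru_maxrss is expressed in kilobytes. Returns -1 on failure.
long peak_rss_kb() {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}
```

The value is a high-water mark for the whole process, so it never goes down and it includes everything resident (code, heap, stack). It is best suited to running one algorithm per process and comparing the final figures.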

Should I use static or dynamic memory allocation in this program?

I am writing a program to get all the prime numbers up to a number n (input).
In this program I have used static storage allocation, int arr[n+1]. However, my compiler doesn't know the value of n during compilation (n is provided by the user as input) and hence doesn't know how much space should be allocated.
Should one use dynamic storage allocation in this program?
int *arr=new int[n+1]
However, the program runs perfectly in both cases.
I just want to know why my program runs fine with static storage allocation even though n is unknown during compilation and the compiler doesn't know how much storage should be allocated.
void prime(int n) {
int arr[n + 1]; // <=======
for (int i = 0; i < n + 1; i++) {
arr[i] = 1;
}
for (int i = 2; i <= n; i++) {
for (int j = 2 * i, l = 0; j < n + 1; j = (2 + l) * i, l++) {
arr[j] = 0;
}
}
for (int i = 2; i < n + 1; i++) {
if (arr[i] == 1) {
cout << i << " ";
}
}
}
int main() {
int n;
cin >> n;
prime(n);
}
This
void prime(int n) {
int arr[n + 1]; // <=======
is not a static allocation. It is a dynamic stack allocation, known as a variable-length array (VLA). VLAs are not part of standard C++, even though many compilers (GCC and Clang, for example) accept such code as an extension. For more information read here: Why aren't variable-length arrays part of the C++ standard?
Anyway, the rule of thumb is: use dynamic heap allocation when (a) you need some object to live a "long" time (i.e. longer than the function call itself), (b) you need lots of memory, or (c) you have a variable-length collection. (There are still ways to dynamically allocate stack memory, e.g. alloca, but I consider that a micro-optimization that is tricky to work with; avoid it if possible.)
Also, you may want to use std::vector. It uses dynamic heap allocation under the hood as well, but is generally safer than manual new/delete.
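As a sketch of the std::vector suggestion, the sieve from the question could be written as below. The function name primes_up_to is made up for this example, and it returns the primes instead of printing them. Starting the inner loop at i * i is a common shortcut that the question's code does not use.

```cpp
#include <cstddef>
#include <vector>

// Sieve of Eratosthenes using std::vector instead of a variable-length array.
// Returns all primes <= n (empty for n < 2).
std::vector<int> primes_up_to(int n) {
    std::vector<int> primes;
    if (n < 2) return primes;

    // One flag per number in [0, n]; the storage lives on the heap and is
    // released automatically when is_prime goes out of scope.
    std::vector<char> is_prime(static_cast<std::size_t>(n) + 1, 1);

    for (long long i = 2; i * i <= n; ++i) {
        if (!is_prime[i]) continue;
        // Smaller multiples of i were already crossed out by smaller primes.
        for (long long j = i * i; j <= n; j += i)
            is_prime[j] = 0;
    }

    for (int i = 2; i <= n; ++i)
        if (is_prime[i]) primes.push_back(i);
    return primes;
}
```

Unlike int arr[n + 1], this works with any standard C++ compiler and does not risk overflowing the stack for large n.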

Faster way than memcpy to copy a 0-terminated string

I have a question about duplicating a 0-terminated string:
const char * str = "Hello World !";
size_t getSize = strlen(str);
char * temp = new char[getSize + 1];
... I know I can use this function
memcpy(temp, str, getSize);
but I want to use my own copy function, which works like this
int Count = 0;
while (str[Count] != '\0') {
temp[Count] = str[Count];
Count++;
}
Both ways work correctly. Now I want to run each of them 10 million times. For memcpy I do this:
const char * str = "Hello World !";
size_t getSize = strlen(str);
for (size_t i = 0; i < 10000000; i++) {
char * temp = new char[getSize + 1];
memcpy(temp, str, getSize);
}
and this is for my own way
const char * str = "Hello World !";
size_t getSize = strlen(str);
for (size_t i = 0; i < 10000000; i++) {
char * temp = new char[getSize + 1];
int Count = 0;
while (str[Count] != '\0') {
temp[Count] = str[Count];
Count++;
}
}
The first version finished in 420 milliseconds and the second in 650 milliseconds.
... Why? Both ways do the same thing! I want to use my own function, not memcpy. Is there any way to make my own version faster (as fast as memcpy, or maybe faster)? How can I change my while loop to make it at least as fast as memcpy?
full source
int main() {
const char * str = "Hello world !";
size_t getSize = strlen(str);
auto start_t = chrono::high_resolution_clock::now();
for (size_t i = 0; i < 10000000; i++) {
char * temp = new char[getSize + 1];
memcpy(temp, str, getSize);
}
cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - start_t).count() << " milliseconds\n";
start_t = chrono::high_resolution_clock::now();
for (size_t i = 0; i < 10000000; i++) {
char * temp = new char[getSize + 1];
int done = 0;
while (str[done] != '\0') {
temp[done] = str[done];
done++;
}
}
cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - start_t).count() << " milliseconds\n";
return 0;
}
results:
482 milliseconds
654 milliseconds
Replacing library functions with your own often leads to inferior performance.
memcpy represents a very fundamental memory operation. Because of that, it is highly optimized by its authors. Unlike a "naïve" implementation, the library version moves more than a single byte at a time whenever possible, and uses hardware assistance on platforms where it is available.
Moreover, the compiler itself "knows" about the inner workings of memcpy and other library functions, and it can optimize them out completely in cases where the length is known at compile time.
Note: Your implementation has semantics of strcpy, not memcpy.
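To make the strcpy/memcpy distinction concrete, here is a small sketch of both ways of duplicating a 0-terminated string. The function names are made up for this example. Note that both snippets in the question copy only getSize bytes and so never write the terminating '\0'; the versions below do.

```cpp
#include <cstddef>
#include <cstring>

// Duplicate a 0-terminated string with one strlen and one memcpy.
// Copying len + 1 bytes carries the terminating '\0' along.
// The caller owns the result and releases it with delete[].
char* duplicate_with_memcpy(const char* src) {
    std::size_t len = std::strlen(src);
    char* copy = new char[len + 1];
    std::memcpy(copy, src, len + 1);
    return copy;
}

// Byte-by-byte version with strcpy semantics: every byte is tested
// against '\0' before it is copied.
char* duplicate_with_loop(const char* src) {
    std::size_t len = std::strlen(src);
    char* copy = new char[len + 1];
    std::size_t i = 0;
    while (src[i] != '\0') {
        copy[i] = src[i];
        ++i;
    }
    copy[i] = '\0';  // the loop stops before the terminator, so add it
    return copy;
}
```

Since the length is already known from strlen, the per-byte test in the loop is redundant work, which is one reason the memcpy version has less to do.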
... both of those ways are same !
No, they aren't:
memcpy() doesn't check whether each character is '\0'.
The implementers may have applied more optimizations than your naive approach has.
It's unlikely that your approach can be made faster than memcpy().
Seeing that you didn't use pointers, and that you compare what you are doing (strcpy) with memcpy, clearly shows that you are a beginner, and as everyone else has already stated, it is difficult to outsmart experienced programmers like those who wrote your library.
But I'm gonna give you some hints to optimize your code.
I took a quick look at Microsoft's C Standard Library implementation (dubbed C Runtime Library) and they are doing it in assembly which is faster than doing it in C. So that is one point for speed.
On most 32-bit architectures with 32-bit buses, the CPU can fetch 32 bits of information from memory in one request (assuming the data is properly aligned), but even if you need only 16 bits or 8 bits, it still has to make that one request. So working with your machine's word size probably gives you some speed-up.
Lastly I want to direct your attention to SIMD. If your CPU provides it, you can use it and gain that extra speed. Again MSCRT has some SSE2 optimization options.
In the past, from time to time, I had to write code that outperformed my library implementation, because I had a specific need or a specific type of data that I could optimize for. While this might have some educational value, unless it is specifically needed your time is better spent on your actual code than on re-implementing your library functions.

Unable to measure static arrays memory usage with GetProcessMemoryInfo

I am trying to learn both the details of how memory usage works and how to measure it using C++. I know that under Windows, a quick way to retrieve the amount of RAM being used by the current application process, when including <Windows.h>, is:
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo( GetCurrentProcess( ), &info, sizeof(info) );
(uint64_t)info.WorkingSetSize;
Then, I used that to run a very simple test:
#include <iostream>
#include <cstdint>
#include <Windows.h>
#include <Psapi.h> // GetProcessMemoryInfo, PROCESS_MEMORY_COUNTERS
int main(void)
{
uint64_t currentUsedRAM(0);
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize;
const int N(1000000);
int x[N]; //in the second run, comment this line out
int y[N]; //in the second run, comment this line out
//int *x = new int[N]; //in the second run UNcomment this line out
//int *y = new int[N]; //in the second run UNcomment this line out
for (int i = 0; i < N; i++)
{
x[i] = 1;
y[i] = 2;
}
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize - currentUsedRAM;
std::cout << "Current RAM used: " << currentUsedRAM << "\n";
return 0;
}
What I don't understand at all is that when I run the code above, the output is: Current RAM used: 0, while I was expecting something around 8 MB since I filled two 1D int arrays of 1 million entries each. Now, if I re-run the code with x and y as dynamically allocated arrays, the output is, as expected: Current RAM used: 8007680.
Why is that? How to make it detect memory-usage in both cases?
The compiler has optimised your code. In fact, for your first run, neither x nor y is allocated. Considering that there is a visible side effect, the return value of GetProcessMemoryInfo, this optimisation seems kind of weird.
Anyway, you can prevent this by adding some other side effect, such as outputting the sum of the elements of those two arrays, which will guarantee the crash.
The memory for local objects with automatic storage duration is allocated at the beginning of the enclosing code block and deallocated at the end. So your code can't measure the memory usage of any automatic storage duration variable in main (nor could my deleted code snippet, which I wasn't aware of). But things are different for objects with dynamic storage duration: they are allocated per request.
I designed a test that involves recursion for the discussion in the comment area. You can see that the memory usage increases as the program goes deeper. This is proof that it counts the memory usage on the stack. BTW, it isn't counting how much memory your objects need, but how much your program needs.
void foo(int depth, int *a, int *b, uint64_t usage) {
if (depth >= 100)
return ;
int x[100], y[100];
for (int i = 0; i < 100; i++)
{
x[i] = 1 + (a==nullptr?0:a[i]);
y[i] = 2 + (b==nullptr?0:b[i]);
}
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
std::cout << "Current RAM used: " << info.WorkingSetSize - usage << "\n";
foo(depth+1,x,y,usage);
int sum = 0;
for (int i=0; i<100; i++)
sum += x[i] + y[i];
std::cout << sum << std::endl;
}
int main(void)
{
uint64_t currentUsedRAM(0);
PROCESS_MEMORY_COUNTERS info;
GetProcessMemoryInfo(GetCurrentProcess(), &info, sizeof(info));
currentUsedRAM = info.WorkingSetSize;
foo(0, nullptr, nullptr, currentUsedRAM);
return 0;
}
/*
Current RAM used: 0
Current RAM used: 61440
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 65536
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 69632
Current RAM used: 73728
*/
The system allocates 4k each time, which is the size of a page. I don't know why it shows 0 and then suddenly 61440. Explaining how Windows manages memory is very hard and far beyond my ability, though I am confident in the 4k thing... and that it does count the memory usage of variables with automatic storage duration.

Linux really allocating memory it shouldn't in C++ code

In Linux, the kernel doesn't allocate any physical memory pages until we actually use that memory, but I am having a hard time finding out why it does in fact allocate this memory:
for(int t = 0; t < T; t++){
for(int b = 0; b < B; b++){
Matrix[t][b].length = 0;
Matrix[t][b].size = 60;
Matrix[t][b].pointers = (Node**)malloc(60*sizeof(Node*));
}
}
I then access this data structure to add one element to it like this:
Node* elem = NULL;
Matrix[a][b].length++;
Matrix[a][b].pointers[ Matrix[a][b].length ] = elem;
Essentially, I run my program with htop on the side and Linux does allocate more memory if I increase the no. "60" I have in the code above. Why? Shouldn't it only allocate one page when the first element is added to the array?
It depends on how your Linux system is configured.
Here's a simple C program that tries to allocate 1TB of memory and touches some of it.
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main()
{
char *array[1000];
int i;
for (i = 0; i < 1000; ++i)
{
if (NULL == (array[i] = malloc((int) 1e9)))
{
perror("malloc failed!");
return -1;
}
array[i][0] = 'H';
}
for (i = 0; i < 1000; ++i)
printf("%c", array[i][0]);
printf("\n");
sleep(10);
return 0;
}
When I run top by its side, it says the VIRT memory usage goes to 931g (where g means GiB), while RES only goes to 4380 KiB.
Now, when I change my system to use a different overcommit strategy by /sbin/sysctl -w vm.overcommit_memory=2 and re-run it, I get:
malloc failed!: Cannot allocate memory
So your system may be using a different overcommit strategy than you expected. For more information read this.
Your assumption that malloc / new doesn't cause any memory to be written, and therefore assigned physical memory by the OS, is incorrect (for the memory allocator implementation you have).
I've reproduced the behavior you are describing in the following simple program:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char **argv)
{
char **array[128][128];
int size;
int i, j;
if (1 == argc || 0 >= (size = atoi(argv[1])))
fprintf(stderr, "usage: %s <num>; where num > 0\n", argv[0]), exit(-1);
for (i = 0; i < 128; ++i)
for (j = 0; j < 128; ++j)
if (NULL == (array[i][j] = malloc(size * sizeof(char*))))
{
fprintf(stderr, "malloc failed when i = %d, j = %d\n", i, j);
perror(NULL);
return -1;
}
sleep(10);
return 0;
}
When I run this with various small size parameters as input, the VIRT and RES memory footprints (as reported by top) grow together in-step, even though I'm not explicitly touching the inner arrays that I'm allocating.
This basically holds true until size exceeds ~512. Thereafter, RES stays constant at 64 MiB while VIRT can be extremely large (e.g. - 1220 GiB when size is 10M). That is because 512 * 8 = 4096, which is a common virtual page size on Linux systems, and 128 * 128 * 4096 B = 64 MiB.
Therefore, it looks like the first page of every allocation is being mapped to physical memory, probably because malloc / new itself is writing to part of the allocation for its own internal book keeping. Of course, lots of small allocations may fit in and be placed on the same page, so only one page gets mapped to physical memory for many such allocations.
In your code example, changing the size of the array matters because it means fewer of those arrays fit on one page, therefore requiring more memory pages to be touched by malloc / new itself (and therefore mapped to physical memory by the OS) over the run of the program.
When you use 60, that takes about 480 bytes, so ~8 of those allocations can be put on one page. When you use 100, that takes about 800 bytes, so only ~5 of those allocations can be put on one page. So, I'd expect the "100 program" to use about 8/5ths as much memory as the "60 program", which seems to be a big enough difference to make your machine start swapping to stable storage.
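The arithmetic in the paragraph above can be checked with a few lines. The helper name is made up for this example, and it deliberately ignores the allocator's own per-block bookkeeping, so the real numbers may be slightly lower.

```cpp
#include <cstddef>

// How many allocations of `count` pointers fit on one page, ignoring
// allocator bookkeeping overhead (an assumption of this sketch).
// `count` and `pointer_size` must be non-zero.
constexpr std::size_t allocations_per_page(std::size_t count,
                                           std::size_t pointer_size = 8,
                                           std::size_t page_size = 4096) {
    return page_size / (count * pointer_size);
}
```

With 8-byte pointers and 4096-byte pages, allocations_per_page(60) is 8 and allocations_per_page(100) is 5, which is where the 8/5 ratio comes from; allocations_per_page(512) is 1, matching the point at which RES stops growing.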
If each of your smaller "60" allocations were already over 1 page in size, then changing it to be bigger "100" wouldn't affect your program's initial physical memory usage, just like you originally expected.
PS - I think whether you explicitly touch the initial page of your allocations or not will be irrelevant as malloc / new will have already done so (for the memory allocator implementation you have).
Here's a sketch of what you could do if you expect that your b arrays will usually be small, typically fewer than 2^X pointers (X = 5 in the code below), while still handling exceptional cases where they get even bigger.
You can adjust X down if your expected usage doesn't match. You could also adjust the minimum size arrays up from 0 (and not allocate the smaller 2^i levels), if you expect most of your arrays will usually use at least 2^Y pointers (e.g. - Y = 3).
If you think that actually X == Y (e.g. - 4) for your usage pattern, then you can just do one allocation of B * (0x1 << X) * sizeof(Node*) and divvy up that T array to your b's. Then if a b array needs to exceed 2^X pointers, then resort to malloc for it followed by realloc's if it needs to grow even further.
The main point here is that the initial allocation will map to very little physical memory, addressing the problem that initially spurred your original question.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#define T 1278
#define B 131072
#define CAP_MAX_LG2 5
#define CAP_MAX (0x1 << CAP_MAX_LG2) // pre-alloc T's to handle all B arrays of length up to 2^CAP_MAX_LG2
typedef struct Node { int value; } Node; // just a dummy definition so that sizeof(Node) compiles
typedef struct
{
int t; // so a matrix element can know to which T_Allocation it belongs
int length;
int cap_lg2; // log base 2 of capacity; -1 if capacity is zero
Node **pointers;
} MatrixElem;
typedef struct
{
Node **base; // pre-allocs B * 2^(CAP_MAX_LG2 + 1) Node pointers; every b array can be any of { 0, 1, 2, 4, 8, ..., CAP_MAX } capacity
Node **frees_pow2[CAP_MAX_LG2 + 1]; // frees_pow2[i] will point at the next free array of 2^i pointers to Node to allocate to a growing b array
} T_Allocation;
MatrixElem Matrix[T][B];
T_Allocation T_Allocs[T];
int Node_init(Node *n) { return 0; } // just a dummy
void Node_fini(Node *n) { } // just a dummy
int Node_eq(const Node *n1, const Node *n2) { return 0; } // just a dummy
void Init(void)
{
for(int t = 0; t < T; t++)
{
T_Allocs[t].base = malloc(B * (0x1 << (CAP_MAX_LG2 + 1)) * sizeof(Node*));
if (NULL == T_Allocs[t].base)
abort();
T_Allocs[t].frees_pow2[0] = T_Allocs[t].base;
for (int x = 1; x <= CAP_MAX_LG2; ++x)
T_Allocs[t].frees_pow2[x] = &T_Allocs[t].base[B * ((0x1 << x) - 1)]; // region x starts after B arrays of each smaller capacity: B * (2^0 + ... + 2^(x-1))
for(int b = 0; b < B; b++)
{
Matrix[t][b].t = t;
Matrix[t][b].length = 0;
Matrix[t][b].cap_lg2 = -1;
Matrix[t][b].pointers = NULL;
}
}
}
Node *addElement(MatrixElem *elem)
{
if (-1 == elem->cap_lg2 || elem->length == (0x1 << elem->cap_lg2)) // elem needs a bigger pointers array to add an element
{
int new_cap_lg2 = elem->cap_lg2 + 1;
int new_cap = (0x1 << new_cap_lg2);
if (new_cap_lg2 <= CAP_MAX_LG2) // new b array can still fit in pre-allocated space in T
{
Node **new_pointers = T_Allocs[elem->t].frees_pow2[new_cap_lg2];
memcpy(new_pointers, elem->pointers, elem->length * sizeof(Node*));
elem->pointers = new_pointers;
T_Allocs[elem->t].frees_pow2[new_cap_lg2] += new_cap;
}
else if (elem->cap_lg2 == CAP_MAX_LG2) // exceeding pre-alloc'ed arrays in T; use malloc
{
Node **new_pointers = malloc(new_cap * sizeof(Node*));
if (NULL == new_pointers)
return NULL;
memcpy(new_pointers, elem->pointers, elem->length * sizeof(Node*));
elem->pointers = new_pointers;
}
else // already exceeded pre-alloc'ed arrays in T; use realloc
{
Node **new_pointers = realloc(elem->pointers, new_cap * sizeof(Node*));
if (NULL == new_pointers)
return NULL;
elem->pointers = new_pointers;
}
++elem->cap_lg2;
}
Node *ret = malloc(sizeof(Node));
if (ret)
{
Node_init(ret);
elem->pointers[elem->length] = ret;
++elem->length;
}
return ret;
}
int removeElement(const Node *a, MatrixElem *elem)
{
int i;
for (i = 0; i < elem->length && !Node_eq(a, elem->pointers[i]); ++i);
if (i == elem->length)
return -1;
Node_fini(elem->pointers[i]);
free(elem->pointers[i]);
--elem->length;
memmove(&elem->pointers[i], &elem->pointers[i+1], sizeof(Node*) * (elem->length - i));
return 0;
}
int main()
{
return 0;
}