I'm getting a memory leak when I create a vector of 50,000 doubles, and I don't know why.
#include "stdafx.h"
#include <Windows.h>
#include <psapi.h>
#include <iostream>
#include <vector>
using std::vector;
#define MEMLOGINIT double mem1, mem2;\
    PROCESS_MEMORY_COUNTERS_EX pmc;\
    GetProcessMemoryInfo(GetCurrentProcess(), (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc));\
    SIZE_T virtualMemUsedByMe = pmc.PrivateUsage;\
    mem1 = virtualMemUsedByMe/1024.0;\
    std::cout << "1st measure \n Memory used : " << mem1 <<" KB.\n\n";
#define MEMLOG(stepName) GetProcessMemoryInfo(GetCurrentProcess(), (PROCESS_MEMORY_COUNTERS*)&pmc, sizeof(pmc));\
    virtualMemUsedByMe = pmc.PrivateUsage; \
    mem2 = virtualMemUsedByMe/1024.0; \
    std::cout << stepName << "\n Memory used : " << mem2 << " KB.\n Difference with previous measure : " << mem2 - mem1 <<" KB.\n\n";\
    mem1 = mem2;
int _tmain(int argc, _TCHAR* argv[])
{
    MEMLOGINIT;
    {
        vector<double> spTestmemo(50000, 100.0);
        MEMLOG("measure before destruction");
    }
    MEMLOG("measure after destruction");
}
output with 50k values
Clearly the 400 KB allocated by the vector are not released here.
However, the destruction works as expected with a vector of 500,000 values.
int _tmain(int argc, _TCHAR* argv[])
{
    MEMLOGINIT;
    {
        //vector<double> spTestmemo(50000, 100.0);
        vector<double> spTestmemo(500000, 100.0); //instead of the line above
        MEMLOG("measure before destruction");
    }
    MEMLOG("measure after destruction");
}
output with 500k values
Here, a vector ten times bigger than the previous one is almost completely released (small residue of 4 KB).
Thanks for your help.
As NathanOlivier and PaulMcKenzie pointed out in their comments, this is not a memory leak.
The C++ standard library may not release all the memory back to the OS when you free it, but the memory is still accounted for.
So don't worry too much about the OS-reported virtual memory usage of your program, as long as it is not abnormally high or continuously increasing while the program runs.
--- begin visual studio specific:
Since you seem to be building your code with Visual Studio, its debug runtime library has facilities for doing what you are doing with your MEMLOGINIT and MEMLOG macros; see https://msdn.microsoft.com/en-us/library/974tc9t1.aspx#BKMK_Check_for_heap_integrity_and_memory_leaks
Basically, you can use _CrtMemCheckpoint to capture the state of what has been allocated, and _CrtMemDifference and _CrtMemDumpStatistics to compare and log the difference between two checkpoints.
The debug version of the runtime library also automatically dumps leaked memory to the debugger console when the program exits. If you define new as DEBUG_NEW, it will even log the source file and line number where each leaked allocation was made. That is often very valuable when hunting down memory leaks.
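For illustration, here is a minimal sketch of the checkpoint/difference flow (debug builds only; the vector in the middle is a stand-in for whatever allocations you want to measure):
#define _CRTDBG_MAP_ALLOC
#include <crtdbg.h>
#include <vector>
int main()
{
    _CrtMemState before, after, diff;
    _CrtMemCheckpoint(&before);
    {
        std::vector<double> v(50000, 100.0);
    } // v is destroyed here
    _CrtMemCheckpoint(&after);
    if (_CrtMemDifference(&diff, &before, &after)) // nonzero if the states differ significantly
        _CrtMemDumpStatistics(&diff);
    _CrtDumpMemoryLeaks(); // dumps anything still allocated to the debugger output
    return 0;
}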
Related
My question is related to a problem described here. I have written a C++ implementation of the Sieve of Eratosthenes that hits a memory overflow if I set the target value too high. As suggested in that question, I am able to fix the problem by using a vector<bool> instead of a plain bool array.
However, I am hitting the memory overflow at a much lower value than expected, around n = 1 200 000. The discussion in the thread linked above suggests that the normal C++ boolean array uses a byte for each entry, so with 2 GB of RAM, I expect to be able to get to somewhere on the order of n = 2 000 000 000. Why is the practical memory limit so much smaller?
And why does using <vector>, which encodes the booleans as bits instead of bytes, yield more than an eightfold increase in the computable limit?
Here is a working example of my code, with n set to a small value.
#include <iostream>
#include <cmath>
#include <vector>
using namespace std;
int main() {
    // Count and sum of primes below target
    const int target = 100000;

    // Code I want to use:
    bool is_idx_prime[target];
    for (unsigned int i = 0; i < target; i++) {
        // initialize by assuming prime
        is_idx_prime[i] = true;
    }

    // But doesn't work for target larger than ~1200000
    // Have to use this instead
    // vector<bool> is_idx_prime(target, true);

    for (unsigned int i = 2; i < sqrt(target); i++) {
        // All multiples of i * i are nonprime
        // If i itself is nonprime, no need to check
        if (is_idx_prime[i]) {
            for (int j = i; i * j < target; j++) {
                is_idx_prime[i * j] = 0;
            }
        }
    }

    // 0 and 1 are nonprime by definition
    is_idx_prime[0] = 0; is_idx_prime[1] = 0;

    unsigned long long int total = 0;
    unsigned int count = 0;
    for (int i = 0; i < target; i++) {
        // cout << "\n" << i << ": " << is_idx_prime[i];
        if (is_idx_prime[i]) {
            total += i;
            count++;
        }
    }
    cout << "\nCount: " << count;
    cout << "\nTotal: " << total;
    return 0;
}
This outputs
Count: 9592
Total: 454396537
C:\Users\[...].exe (process 1004) exited with code 0.
Press any key to close this window . . .
Or, changing n to 1 200 000 yields
C:\Users\[...].exe (process 3144) exited with code -1073741571.
Press any key to close this window . . .
I am using the Microsoft Visual Studio compiler on Windows with the default settings.
Turning the comment into a full answer:
Your operating system reserves a special section of memory to represent the call stack of your program. Each function call pushes a new stack frame onto the stack; when the function returns, its frame is removed. The stack frame includes the memory for the parameters and the local variables of the function. The remaining memory is referred to as the heap. On the heap, arbitrary memory allocations can be made, whereas the structure of the stack is governed by the control flow of your program. Only a limited amount of memory is reserved for the stack; when it gets full (e.g. due to too many nested function calls or too large local objects), you get a stack overflow. For this reason, large objects should be allocated on the heap.
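To make the distinction concrete, here is a minimal sketch (the 10-million-element size is an arbitrary illustration, chosen to exceed a typical 1-8 MB stack):
#include <vector>
// Roughly 10 MB of locals: calling this will likely crash with a stack overflow.
void stack_version() {
    bool flags[10 * 1000 * 1000];
    flags[0] = true;
}
// The same number of flags on the heap is fine (vector<bool> even packs them into bits).
void heap_version() {
    std::vector<bool> flags(10 * 1000 * 1000, true);
}
int main() {
    heap_version();    // works
    // stack_version(); // would likely trigger a stack overflow
}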
To allocate memory on the heap in C++, you can do any of the following (a combined sketch follows this list):
Use vector<bool> is_idx_prime(target), which internally does a heap allocation and deallocates the memory for you when the vector goes out of scope. This is the most convenient way.
Use a smart pointer to manage the allocation: auto is_idx_prime = std::make_unique<bool[]>(target);. This will also automatically deallocate the memory when the pointer goes out of scope.
Allocate the memory manually. I am mentioning this only for educational purposes. As mentioned by Paul in the comments, manual memory allocation is generally not advisable, because you have to manually deallocate the memory again. If you have a large program with many memory allocations, inevitably you will forget to free some allocation, creating a memory leak. In a long-running program, such as a system service, repeated memory leaks will eventually fill up the entire memory (and speaking from personal experience, this absolutely does happen in practice). But if you did want to make a manual allocation, you would write bool *is_idx_prime = new bool[target]; and later deallocate it with delete[] is_idx_prime.
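For illustration, a minimal combined sketch of the three options (assumes C++14 for std::make_unique):
#include <cstddef>
#include <memory>
#include <vector>
int main() {
    const std::size_t target = 1200000;
    // 1. std::vector owns the heap block and frees it when it goes out of scope.
    std::vector<bool> is_idx_prime(target, true);
    // 2. A smart pointer owns the array and also frees it automatically.
    auto is_idx_prime2 = std::make_unique<bool[]>(target);
    // 3. Manual allocation (educational only): you must pair it with delete[].
    bool* is_idx_prime3 = new bool[target];
    delete[] is_idx_prime3;
    return 0;
}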
I would like to find out the number of bytes used by a process from within a C++ program by inspecting the operating system's memory information. The reason I want this is to find the possible overhead of memory allocation (due to memory control blocks/nodes in free lists, etc.). Currently I am on a Mac and am using this code:
#include <mach/mach.h>
#include <iostream>
int getResidentMemoryUsage() {
    task_basic_info t_info;
    mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;
    if (task_info(mach_task_self(), TASK_BASIC_INFO,
                  reinterpret_cast<task_info_t>(&t_info),
                  &t_info_count) == KERN_SUCCESS) {
        return t_info.resident_size;
    }
    return -1;
}

int getVirtualMemoryUsage() {
    task_basic_info t_info;
    mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;
    if (task_info(mach_task_self(), TASK_BASIC_INFO,
                  reinterpret_cast<task_info_t>(&t_info),
                  &t_info_count) == KERN_SUCCESS) {
        return t_info.virtual_size;
    }
    return -1;
}

int main(void) {
    int virtualMemoryBefore = getVirtualMemoryUsage();
    int residentMemoryBefore = getResidentMemoryUsage();

    int* a = new int(5);

    int virtualMemoryAfter = getVirtualMemoryUsage();
    int residentMemoryAfter = getResidentMemoryUsage();

    std::cout << virtualMemoryBefore << " " << virtualMemoryAfter << std::endl;
    std::cout << residentMemoryBefore << " " << residentMemoryAfter << std::endl;
    return 0;
}
When running this code, I would have expected to see the memory usage increase after allocating an int. However, when I run the above code I get the following output:
75190272 75190272
819200 819200
I have several questions, because this output does not make sense to me:
Why hasn't either the virtual or resident memory changed after an integer has been allocated?
How come the operating system allocates such large amounts of memory to a running process?
When I run the code and check Activity Monitor, I find that 304 KB of memory is used, but that number differs from the virtual/resident memory usage obtained programmatically.
My end goal is to find the memory overhead of storing data, so is there a way to do this (i.e. determine the bytes used by the OS and compare them with the bytes allocated to find the difference)?
Thank you for reading
The C++ runtime typically allocates a block of memory when the program starts up, and then parcels it out to your code when you use things like new, adding it back to the block when you call delete. Hence the operating system doesn't know anything about individual new or delete calls. The same is true for malloc and free in C (or C++).
First, you are measuring the number of pages, not really the memory allocated. Second, the runtime pre-allocates a few pages at startup. If you want to observe something, allocate more than a single int: try allocating several thousand and you will observe some changes.
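To see that effect, here is a minimal sketch (it reuses the getVirtualMemoryUsage()/getResidentMemoryUsage() helpers from the question; the ~40 MB block size is an arbitrary choice, well past anything the runtime preallocates):
#include <cstddef>
#include <iostream>
int getResidentMemoryUsage(); // defined in the question above
int getVirtualMemoryUsage();  // defined in the question above
int main() {
    int virtualBefore = getVirtualMemoryUsage();
    int residentBefore = getResidentMemoryUsage();

    const std::size_t count = 10 * 1024 * 1024; // ~40 MB of ints
    int* big = new int[count];
    for (std::size_t i = 0; i < count; ++i) big[i] = 5; // touch the pages so they are committed

    std::cout << "virtual:  " << virtualBefore << " -> " << getVirtualMemoryUsage() << std::endl;
    std::cout << "resident: " << residentBefore << " -> " << getResidentMemoryUsage() << std::endl;

    delete[] big;
    return 0;
}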
As far as I can tell, calling malloc() basically means the program is asking the OS for a chunk of memory. I'm writing a program to interface with a camera, in which I need to allocate chunks of memory large enough to store hundreds of images at a time (it's a fast camera).
When I allocate space for about 1.9 GB worth of images, everything works just fine. The allocation calculation is pretty simple:
int allocateBurst( int numImages )
{
    int streamSize = ZIMAGESIZE * numImages;
    data.images = new unsigned short [streamSize];
    return 0;
}
But as soon as I go over the 2 GB limit, I get runtime errors like this:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
It seems like 2 GB might be the maximum size that I can allocate at once. I have 32 GB of RAM, and would like to simply be able to allocate larger pieces of memory in one allocation. Is this possible?
I'm running Ubuntu 12.10.
There may be an underlying issue that the OS can't grant your large memory allocation because it is using memory for other applications. Check with your OS to see what the limits are.
Also know that some OSes will "page" memory to the hard disk. When your program asks for memory outside a resident page, the OS will swap pages with the hard disk. Knowing this, I recommend the classic technique of "Double Buffering" or "Multiple Buffering".
You will need at least two threads: reading and writing. One thread is responsible for reading data from the camera and placing it into a buffer. When it fills up one buffer, it starts on another. Meanwhile, the writing thread starts at the first buffer and writes it to disk (block file writes). When the writing thread finishes a buffer, it starts on the next one. The buffers should form a circular sequence so they can be reused.
The magic is to have enough buffers so that the reader never catches up to the writer.
Since you are using a couple of small buffers, you should not get any errors from the OS.
There are methods to optimize this, such as obtaining static buffers from the OS.
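Here is a minimal sketch of the pattern, with plain condition variables standing in for the camera and disk I/O (the buffer count, frame size, and frame count are made-up numbers, not anything from your setup):
#include <array>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t kFrameWords = 1 << 20; // one fake "image" of unsigned shorts
constexpr int kNumBuffers = 4;               // ring of reusable buffers
constexpr int kNumFrames  = 32;              // frames to move in this demo

struct BufferRing {
    std::array<std::vector<unsigned short>, kNumBuffers> buffers;
    int filled = 0;           // buffers produced but not yet written out
    int head = 0, tail = 0;   // producer / consumer slots
    std::mutex m;
    std::condition_variable cv;
};

// Stands in for the camera-acquisition thread.
void reader(BufferRing& r) {
    for (int f = 0; f < kNumFrames; ++f) {
        {   // wait until at least one slot is free
            std::unique_lock<std::mutex> lk(r.m);
            r.cv.wait(lk, [&] { return r.filled < kNumBuffers; });
        }
        // Fill the slot outside the lock; only this thread touches head.
        r.buffers[r.head].assign(kFrameWords, static_cast<unsigned short>(f));
        r.head = (r.head + 1) % kNumBuffers;
        {
            std::lock_guard<std::mutex> lk(r.m);
            ++r.filled;
        }
        r.cv.notify_all();
    }
}

// Stands in for the disk-writing thread.
void writer(BufferRing& r) {
    for (int f = 0; f < kNumFrames; ++f) {
        {   // wait until at least one slot is full
            std::unique_lock<std::mutex> lk(r.m);
            r.cv.wait(lk, [&] { return r.filled > 0; });
        }
        // r.buffers[r.tail] would be written to disk here (block file writes).
        r.tail = (r.tail + 1) % kNumBuffers;
        {
            std::lock_guard<std::mutex> lk(r.m);
            --r.filled;
        }
        r.cv.notify_all();
    }
}

int main() {
    BufferRing r;
    std::thread t1(reader, std::ref(r));
    std::thread t2(writer, std::ref(r));
    t1.join();
    t2.join();
    return 0;
}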
The problem is you're using a signed 32-bit variable to describe an unsigned 64-bit number.
Use "size_t" instead of "int" for holding the storage count. This has nothing to do with what you intend to store, just how large a count of them you need.
#include <iostream>

int main(int /*argc*/, const char** /*argv*/)
{
    int units = 2;
    // 32-bit signed, i.e. 31-bit numbers.
    int intSize = units * 1024 * 1024 * 1024;
    // 64-bit values (ULL suffix)
    size_t sizetSize = units * 1024ULL * 1024ULL * 1024ULL;
    std::cout << "intSize = " << intSize << ", sizetSize = " << sizetSize << std::endl;

    try {
        unsigned short* intAlloc = new unsigned short[intSize];
        std::cout << "intAlloc = " << intAlloc << std::endl;
        delete [] intAlloc;
    } catch (const std::bad_alloc&) {
        std::cout << "intAlloc failed (std::bad_alloc)" << std::endl;
    }

    try {
        unsigned short* sizetAlloc = new unsigned short[sizetSize];
        std::cout << "sizetAlloc = " << sizetAlloc << std::endl;
        delete [] sizetAlloc;
    } catch (const std::bad_alloc&) {
        std::cout << "sizetAlloc failed (std::bad_alloc)" << std::endl;
    }
    return 0;
}
Output (g++ -m64 -o test test.cpp under Mint 15 64-bit with g++ 4.7.3 on a virtual machine with 4 GB of memory):
intSize = -2147483648, sizetSize = 2147483648
intAlloc failed
sizetAlloc = 0x7f55affff010
int allocateBurst( int numImages )
{
    // change that from int to long
    long streamSize = ZIMAGESIZE * numImages;
    data.images = new unsigned short [streamSize];
    return 0;
}
Try using long, or change the type of streamSize and the return type of allocateBurst to uint64_t.
With int you get a 32-bit allocation size, while long or uint64_t gives you a 64-bit size, which can represent a much larger allocation.
Hope that helps
The question title is quite self-explanatory. I have a run loop that needs a dynamically sized array. But I do know what the maximum of that size is going to be, so if needed, I can allocate the maximum instead of sizing it dynamically.
Here's my code. I know that clock_t is probably not the best choice for timing in terms of portability, and that its accuracy is poor.
#include <iostream>
#include <algorithm> // for std::fill
#include <cstdlib>
#include <cstring>
#include <cstdio>
#include <ctime>
#define TEST_SIZE 1000000
using namespace std;
int main(int argc, char *argv[])
{
    int* arrayPtr = NULL;
    int array[TEST_SIZE];
    int it = 0;
    clock_t begin, end;

    begin = clock();
    memset(array, 0, sizeof(int) * TEST_SIZE);
    end = clock();
    cout << "Time to memset: " << end - begin << endl;

    begin = clock();
    fill(array, array + TEST_SIZE, 0);
    end = clock();
    cout << "Time to fill: " << end - begin << endl;

    begin = clock();
    for ( it = 0 ; it < TEST_SIZE ; ++it ) array[it] = 0;
    end = clock();
    cout << "Time to for: " << end - begin << endl;
}
Here's my result:
Time to memset: 1590
Time to fill: 2334
Time to for: 2371
Now that I know new and delete do not zero out the array, is there any way faster than these?
Please help me!
Basically you are comparing apples and oranges.
memset and the for-loop explicitly set the memory content to a particular value (in your example, 0), while new merely allocates sufficient memory (at least as much as requested) and delete merely marks the memory as free for reuse. There is no change to the content of that memory. So new and delete do not initialize or de-initialize the actual memory content.
Technically, the content of that memory has an indeterminate value. Quite literally, the values may be anything, and you cannot rely on them being anything specific. They might be 0, but they are not guaranteed to be. In fact, using these values will cause your program to have undefined behavior.
A new call for a class does two things:
Allocates the requested memory, and
Calls the constructor for the class to initialize the object.
But note that in your case the type is int, and there is no default initialization for int.
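A minimal sketch of that difference (Widget is a made-up example type):
#include <iostream>
struct Widget {
    int x;
    Widget() : x(42) {} // the constructor runs on new Widget
};
int main() {
    Widget* w = new Widget; // allocation + constructor: w->x is 42
    int* a = new int;       // allocation only: *a is indeterminate
    int* b = new int();     // value-initialized: *b is 0
    std::cout << w->x << " " << *b << std::endl;
    delete w;
    delete a;
    delete b;
    return 0;
}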
new only allocates a memory block; it doesn't initialize the allocated memory.
To initialize an array you can use memset() or do it manually.
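For illustration, a minimal sketch of both, plus value-initializing new[], which also yields an all-zero block:
#include <cstring>
int main() {
    const int n = 1000000;

    int* a = new int[n];                   // contents are indeterminate
    std::memset(a, 0, sizeof(int) * n);    // option 1: memset

    int* b = new int[n];
    for (int i = 0; i < n; ++i) b[i] = 0;  // option 2: manual loop

    int* c = new int[n]();                 // value-initialization zeroes the array

    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}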
A good compiler will optimize all 4 approaches into one call to memset. Also, what's the difference between the 3rd and 4th approach?
You can also do
int array[TEST_SIZE] = {};
gain readability and save 1 line of code.
I would resort to memset in this case. fill is generic, but the platform can offer some really nice tricks in its implementation of memset. This is possible because the function is unambiguous in what it does and dumb enough:
It could employ (S)DMA for the actual memory modification, which may have a faster interface to the memories. Also, while the transfer runs, the CPU can do something else.
When it knows it has to sequentially write a contiguous memory region, it can do something preventive about cache invalidation.
The implementation on ARM-based embedded systems can benefit from burst mode; it is realized with special assembler instructions (STMFD, STMFA, etc.), and in this mode three writes take about as long as two normal writes.
I am allocating two same-size arrays, one on the stack and one on the heap, then iterating over them with a trivial assignment.
The executable is compiled to allocate 40 MB for the main thread's stack.
This code has only been tested to compile in VC++ with the /STACK:41943040 linker flag.
#include "stdafx.h"
#include <string>
#include <iostream>
#include <malloc.h>
#include <windows.h>
#include <ctime>
using namespace std;
size_t stackavail()
{
    static unsigned StackPtr;                       // top of stack ptr
    __asm mov [StackPtr],esp                        // mov pointer to top of stack
    static MEMORY_BASIC_INFORMATION mbi;            // page range
    VirtualQuery((PVOID)StackPtr,&mbi,sizeof(mbi)); // get range
    return StackPtr-(unsigned)mbi.AllocationBase;   // subtract from top (stack grows downward on win)
}

int _tmain(int argc, _TCHAR* argv[])
{
    string input;

    cout << "Allocating 22mb on stack." << endl;
    unsigned int start = clock();
    char eathalfastack[23068672]; // approx 22mb
    auto length = sizeof(eathalfastack)/sizeof(char);
    cout << "Time taken in ms: " << clock()-start << endl;

    cout << "Setting through array." << endl;
    start = clock();
    for( int i = 0; i < length; i++ ){
        eathalfastack[i] = i;
    }
    cout << "Time taken in ms: " << clock()-start << endl;
    cout << "Free stack space: " << stackavail() << endl;

    cout << "Allocating 22mb on heap." << endl;
    start = clock();
    // auto* heaparr = new int[23068672]; // corrected
    auto* heaparr = new byte[23068672];
    cout << "Time taken in ms: " << clock()-start << endl;

    start = clock();
    cout << "Setting through array." << endl;
    for( int i = 0; i < length; i++ ){
        heaparr[i] = i;
    }
    cout << "Time taken in ms: " << clock()-start << endl;

    delete[] heaparr;
    getline(cin, input);
}
The output is this:
Allocating 22mb on stack.
Time taken in ms: 0
Setting through array.
Time taken in ms: 45
Free stack space: 18872076
Allocating 22mb on heap.
Time taken in ms: 20
Setting through array.
Time taken in ms: 35
Why is iterating over the stack array slower than doing the same thing on the heap?
EDIT:
nneonneo caught my error.
Now the output is identical:
Allocating 22mb on stack.
Time taken in ms: 0
Setting through array.
Time taken in ms: 42
Free stack space: 18871952
Allocating 22mb on heap.
Time taken in ms: 4
Setting through array.
Time taken in ms: 41
Release build per Öö Tiib's answer below:
Allocating 22mb on stack.
Time taken in ms: 0
Setting through array.
Time taken in ms: 5
Free stack space: 18873508
Allocating 22mb on heap.
Time taken in ms: 0
Setting through array.
Time taken in ms: 10
Your arrays are not the same size; sizeof(char[23068672]) != sizeof(int[23068672]), and the elements are of different types.
Something is wrong with your PC; on my ages-old Pentium 4 it takes 15 ms to assign such a stack-based char array. Did you try with a debug build or something?
There are two parts to your question :
Allocating space on the stack vs heap
Accessing a memory location on stack vs globally visible
Allocating space
First, let's look at allocating space on the stack. The stack, as we know, grows downwards on the x86 architecture, so to allocate space on the stack all you have to do is decrement the stack pointer. That is just one assembly instruction (sub sp, #amount), and it is always present in the prologue of a function (the function set-up code). So, as far as I know, allocating space on the stack takes essentially no time: the cost is just the decrement-sp operation, and on a modern super-scalar machine its execution will be overlapped with other instructions.
Allocating space on the heap, on the other hand, requires a library call to new/malloc. The library call first checks whether there is free space on the heap. If yes, it simply returns a pointer to the first available address. If space is not available on the heap, it uses a brk system call to request that the kernel modify the page-table entries for the additional pages. A system call is a costly operation: it causes a pipeline flush, TLB pollution, etc. So, the cost of allocating space on the heap = (function call + computation for space + (brk system call)?). Allocating space on the heap definitely seems an order of magnitude slower than on the stack.
Accessing an element
The addressing modes of the x86 ISA allow a memory operand to be addressed using direct addressing (temp = mem[addr]) to access a global variable, while variables on the stack are generally accessed using indexed addressing (temp = mem[stack-pointer + offset-on-stack]). My assumption is that both memory operands should take almost the same time; however, direct addressing seems definitely faster than indexed addressing. Regarding the memory access of an array, we have two operands to access any element: the base address of the array and the index variable. When we are accessing an array on the stack, we add one more operand: the stack pointer. The x86 addressing modes have a provision for such addresses: base + scale*index + offset. So, a stack array element access is temp = mem[sp + base-address + iterator*element-size], and a heap array access is temp = mem[base-address + iterator*element-size]. Clearly, the stack access must be costlier than the heap array access.
Now, coming to the generic case of iteration: if the iteration is slower for the stack, the addressing mode may be (I am not completely sure) the bottleneck, and if allocating the space is slow, the system call may be the bottleneck.
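For illustration, a rough timing sketch along these lines (the size and the volatile accesses are arbitrary choices to keep the compiler from optimizing the work away; treat the numbers as indicative only):
#include <chrono>
#include <cstddef>
#include <iostream>
int main() {
    using clock = std::chrono::steady_clock;
    constexpr std::size_t kSize = 64 * 1024; // small enough for any default stack

    auto t0 = clock::now();
    volatile char stackArr[kSize]; // "allocation" is just a stack-pointer bump
    stackArr[0] = 1;
    auto t1 = clock::now();

    char* heapArr = new char[kSize]; // goes through malloc and, ultimately, brk/mmap
    *(volatile char*)heapArr = 1;
    auto t2 = clock::now();

    std::cout << "stack alloc+touch: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() << " ns\n"
              << "heap  alloc+touch: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count() << " ns\n";
    delete[] heapArr;
    return 0;
}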