Linux: Why do the values in smaps increase continuously? - c++

I sum up the values of the mappings of the current process, repeat this over a period of time, save the results to a file and then plot them. What I find a little odd is that the values for the different fields of smaps seem to increase more or less linearly. I also allocated some memory with new in C++ and then freed it, but there was no recognizable difference. I was expecting some up-and-down movement in the plots of the fields, but there wasn't any.
Is this behaviour normal, or did I perhaps do something wrong? I am fairly sure my parser works, because I checked it against pmap: my parser and pmap return the same result for the same process.

Allocating memory from the OS is fairly expensive, so the allocator requests large blocks of heap at a time. new tries to find free space in this pre-allocated heap, and only when there is none does it request another block from the OS. Returning memory from this pre-allocated heap to the OS is likewise done only in large blocks. (See the mallopt manual page for how to tune this behaviour, including via environment variables. Note that all allocations from the OS are done in pages, usually 4 KiB each.)
This applies to small allocations. Large allocations (by default 128 KiB or more, again tuneable with mallopt) are done with an anonymous mmap and are returned to the OS when freed.
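To see the effect of this tuning, here is a minimal sketch assuming glibc (the threshold values are arbitrary examples, and the exact effect of mallopt and malloc_trim is an implementation detail):

#include <malloc.h>   // glibc-specific: mallopt, malloc_trim
#include <cstdlib>

int main() {
    // Lower the mmap threshold so allocations of 64 KiB or more are served
    // by anonymous mmap and handed back to the OS as soon as they are freed.
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    // Small blocks come from the pre-allocated heap and normally stay in the
    // process after free(), which is why smaps/pmap totals do not shrink.
    char* p = static_cast<char*>(std::malloc(32 * 1024));
    std::free(p);

    // Ask the allocator to return free memory at the top of the heap to the OS.
    malloc_trim(0);
    return 0;
}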

Related

Allocating aligned memory for larger arrays

In my program I want to allocate 32-byte aligned memory to use SSE/AVX. The amount I want to allocate is somewhere around 2000*1300*17*17*4 (large data set). I tried using the functions _aligned_malloc() and _mm_malloc(), but for larger sizes they fail to allocate memory and the result is an access violation exception. If the amount allocated is small, around 512*320*4*17*17 (small data set), then the code works fine.
These functions return a null pointer when the allocation is done for the large data set, but work fine when the input data size is small. Also, if I just use an unaligned allocation with new, the code works fine for the large data set too.
Finally, can someone tell me whether there are any significant performance gains from using aligned memory for AVX?
Edit: After some research, according to this post, new allocates memory from the free store and malloc() allocates memory from the heap. I am exceeding the maximum heap size here, as _aligned_malloc() returns errno 12, which means ENOMEM. In that case, can someone tell me a workaround for this?
On memory allocation:
It seems you are actually trying to allocate 2000*1300*17*17*4 elements of 32 bytes each. That means you are trying to allocate 96 GB while your system has only 12 GB of memory.
Since new works but malloc does not, your local implementation of new is apparently able to allocate huge amounts of virtual memory. malloc allocates from the heap, which means it is usually limited to roughly the physical amount of memory you have; that is why it fails.
As the data set is bigger than your main memory, you might want to allocate the memory using mmap, which maps a file into virtual memory, making it accessible as if it were in physical memory (although it will only be partially cached in RAM). I'm not sure it is guaranteed, but mmap usually aligns on an optimal page-size boundary (almost always 4096 bytes).
Either way you will suffer a huge performance loss, because your disk is far slower than your RAM. This is serious enough that using AVX will probably not speed anything up at all.
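A minimal sketch of that file-backed mmap approach (POSIX, assuming a 64-bit system; the file name and size are placeholders and error handling is kept to a bare minimum):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t bytes = 96ULL * 1024 * 1024 * 1024;  // ~96 GB backing store (sparse file)
    int fd = open("backing.bin", O_RDWR | O_CREAT, 0600);  // placeholder file name
    if (fd < 0 || ftruncate(fd, bytes) != 0) { perror("backing file"); return 1; }

    // The mapping is page aligned (typically 4096 bytes), which also satisfies
    // the 32-byte alignment needed for AVX loads/stores.
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    float* data = static_cast<float*>(p);
    data[0] = 1.0f;  // pages are faulted in (and written back) on demand

    munmap(p, bytes);
    close(fd);
    return 0;
}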
On the performance loss of using unaligned memory:
On modern hardware (say, Intel Haswell onwards) this depends on your access patterns. Unaligned access should have almost no overhead when iterating over the array in memory order (each cache line is still loaded only once). If you access it in random order, you will often cross a 64-byte cache line boundary, which means your processor has to load two lines into the cache and evict two lines instead of one. While this can be a serious problem in some situations, in your case the disk slows things down so much that you will barely notice it.
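For reference, a minimal sketch contrasting aligned and unaligned AVX loads (assuming an AVX-capable CPU and compilation with -mavx or equivalent):

#include <immintrin.h>

int main() {
    alignas(32) float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};  // 32-byte aligned buffer
    float u[9] = {};                                     // u + 1 is (usually) not 32-byte aligned

    __m256 va = _mm256_load_ps(a);       // requires 32-byte alignment
    __m256 vu = _mm256_loadu_ps(u + 1);  // tolerates any alignment
    __m256 sum = _mm256_add_ps(va, vu);

    alignas(32) float out[8];
    _mm256_store_ps(out, sum);
    return static_cast<int>(out[0]);     // keep the result observable
}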
Additional tips (or a shot in the dark):
The way you gave the size of the array (2000*1300*17*17*4) suggests that you are using a multidimensional array (e.g. auto x = new __m256[2000][1300][17][17][4]). So some tips on that:
Iterate through it mostly sequentially.
Check whether it is sparse (meaning some of the memory will never be accessed) and shrink it if possible.
You could try to flatten the array and do the more complex index calculation yourself in order to reduce the amount of memory needed (see the sketch after these notes). If you get it to fit completely into your RAM, you can start to optimise your code (using AVX and/or aligned memory).
"Total paging file size for all drives is 15247MB" suggests that you are actually using only part of those 96 GB, so there might be a way to reduce your usage further.
In that case you might also want to ask another question on how to reduce the memory usage, with more information on what you are doing.
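As mentioned above, a minimal sketch of the flattening idea (the extents are scaled down so the example actually runs; in the real program they would be the sizes from the question):

#include <cstddef>
#include <vector>

// Extents of the 5-D array, scaled down for the example.
const std::size_t A = 20, B = 13, C = 17, D = 17, E = 4;

// Flatten a 5-D index (a, b, c, d, e) into a single row-major offset.
inline std::size_t flat(std::size_t a, std::size_t b, std::size_t c,
                        std::size_t d, std::size_t e) {
    return (((a * B + b) * C + c) * D + d) * E + e;
}

int main() {
    std::vector<float> data(A * B * C * D * E);  // one contiguous block
    data[flat(1, 2, 3, 4, 0)] = 42.0f;           // instead of data[1][2][3][4][0]
    return 0;
}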

Windows heap manager and heap segments

I found the following sentence in a book :
Whenever the heap manager runs out of committed space in the heap segment, it
explicitly commits more memory and divides the newly committed space into blocks
as more and more allocations are requested
Does this mean that when a block is allocated in the segment, the virtual memory used by the user data and the metadata is no longer considered committed?
This is from the Advanced Windows Debugging book, I take it. I'm not sure exactly what you mean, as the question gets a bit vague towards the end, but what the passage basically means is as follows:
When you allocate heap space, the contents of that space are not necessarily predetermined, so you can use the allocated space as you see fit. For example, I allocate 1 megabyte of heap memory and then decide to populate that space with only 512 KB of data; that means I have committed half of my allocated heap, leaving a further 512 KB free. That memory will still show as being utilised to the OS, because I have explicitly set the heap allocation to 1024 KB, but the next time I use that same space I could use more or less than the 512 KB I used last time, up to the amount I have allocated. The amount you use at a given point is the commit; the amount you have set aside is the allocation.
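To make the reserve/commit distinction concrete, here is a minimal sketch that uses the Win32 virtual memory API directly (the heap manager performs similar commits on your behalf inside a heap segment; the sizes here are placeholders):

#include <windows.h>

int main() {
    const SIZE_T reserved  = 1024 * 1024;  // 1 MiB of address space, reserved only
    const SIZE_T committed = 512 * 1024;   // commit half of it up front

    // Reserve address space without backing it with physical storage.
    void* base = VirtualAlloc(nullptr, reserved, MEM_RESERVE, PAGE_NOACCESS);
    if (!base) return 1;

    // Commit the first 512 KiB so it can actually be read and written.
    void* usable = VirtualAlloc(base, committed, MEM_COMMIT, PAGE_READWRITE);
    if (!usable) return 1;

    static_cast<char*>(usable)[0] = 'x';  // touching committed memory is fine
    // Touching base + committed here would fault: it is reserved, not committed.

    VirtualFree(base, 0, MEM_RELEASE);    // release both the commit and the reservation
    return 0;
}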
This is all much much simplified, and I would recommend reading such sources as:
stack-memory-vs-heap-memory from here
the-stack-and-the-heap from learn CPP
Memory_Stack_vs_Heap from CBootCamp
As good sources to get you started on memory and its usage in C++.
If there is anything specific or more detail you can think of (your question is a bit unclear) then let me know and I will get back to you as soon as possible.
No. Allocated blocks are part of committed memory.

C++: Does this look like memory fragmentation?

SUMMARY:
I have an application which consumes way more memory than it should (roughly 250% of the expected amount), but I can't seem to find any memory leaks. Calling the same function (which does a lot of allocations) repeatedly keeps increasing memory usage up to a certain point, after which it stops changing and stays there.
PROGRAM DETAILS:
The application uses a quadtree data structure to store 'Points'. It is possible to specify the maximum number of points to be stored in memory (cache size). The 'Points' are stored in 'PointBuckets' (arrays of points linked to the leaf nodes of the quadtree) which, if the maximum total number of points in the quadtree is reached, are serialized and saved to temporary files, to be retrieved when needed. This all seems to work fine.
Now when a file is loaded a new Quadtree is created and the old one is deleted if it exists, then points are read from the file and inserted into the quadtree one by one. A lot of memory allocations take place as buckets are being created and deleted during node splitting etc.
SYMPTOMS:
If I load a file that is expected to use 300MB of memory once, I get the expected amount of memory consumed. All good. If I keep loading the same file over and over again, the memory usage keeps growing (I'm looking at the RES column in top, on Linux) until about 700MB. That could indicate a memory leak. However, if I then keep loading the file further, memory consumption just stays at 700MB.
Another thing: when I use valgrind massif and look at the memory usage, it always stays within the expected limit. For example, if I specify the cache size to be 1.5 GB and run my program alone, it will eventually consume 4GB of memory. If I run it under massif, it stays below 2GB the whole time, and in the produced graphs I can see that it in fact never allocated more than the expected 1.5GB. My naive assumption is that this happens because massif uses a custom memory pool which somehow prevents fragmentation.
So what do you think is going on here? What kind of solution should I look for to solve this issue, if it is memory fragmentation?
I'd put it down more to simple allocator and OS caching behaviour: they retain memory you allocated instead of freeing it, so that it can be returned to you more promptly the next time you request it. However, 250% does sound like a lot for this kind of effect, so you could be looking at fragmentation problems.
Try swapping your allocator for a fragmentation-free allocator such as an object pool or memory arena, as sketched below.
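A minimal object-pool sketch for fixed-size objects (the Point type and the slab size are placeholders; this is one possible design, not a drop-in replacement for the questioner's allocator):

#include <cstddef>
#include <memory>
#include <vector>

struct Point { double x, y; };  // placeholder for the real Point type

// Fixed-size pool: objects are carved out of large slabs and recycled through
// a free list, so repeated create/destroy cycles do not fragment the heap.
class PointPool {
public:
    Point* create(double x, double y) {
        if (!free_list_.empty()) {                      // reuse a released slot
            Point* p = free_list_.back();
            free_list_.pop_back();
            *p = Point{x, y};
            return p;
        }
        if (slabs_.empty() || next_ == kSlabSize) {     // current slab exhausted
            slabs_.emplace_back(new Point[kSlabSize]);
            next_ = 0;
        }
        Point* p = slabs_.back().get() + next_++;
        *p = Point{x, y};
        return p;
    }
    void destroy(Point* p) { free_list_.push_back(p); } // no delete: just recycle

private:
    static constexpr std::size_t kSlabSize = 4096;
    std::vector<std::unique_ptr<Point[]>> slabs_;
    std::vector<Point*> free_list_;
    std::size_t next_ = 0;
};

The quadtree would then obtain and return points or buckets through the pool instead of calling new and delete directly, so repeated load cycles reuse the same slabs instead of scattering small blocks across the heap.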

Vector of 20,000 small objects vs vector of 20,000 object pointers to 20,000 heap objects

While developing a 32-bit C++/Carbon app under OS X Snow Leopard, I ran into a problem where an STL vector of approximately 20,000 small objects (72 bytes each) was failing during a reallocation. It seems the vector, which was several megabytes in size, couldn't expand into a contiguous piece of memory, which at the point of failure was only 1.2 MB in size.
GuardMalloc[Appname-33692]: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
GuardMalloc[Appname-35026]: Failed to VM allocate 894752 bytes
GuardMalloc[ Appname-35026]: Explicitly trapping into debugger!!!
#0 0x00a30da8 in GMmalloc_zone_malloc_internal
#1 0x00a31710 in GMmalloc
#2 0x94a54617 in operator new
#3 0x0026f1d3 in __gnu_cxx::new_allocator<DataRecord>::allocate at new_allocator.h:88
#4 0x0026f1f8 in std::_Vector_base<DataRecord, std::allocator<DataRecord> >::_M_allocate at stl_vector.h:117
#5 0x0026f373 in std::vector<DataRecord, std::allocator<DataRecord> >::_M_insert_aux at vector.tcc:275
#6 0x0026f5a6 in std::vector<DataRecord, std::allocator<DataRecord> >::push_back at stl_vector.h:610
I can think of several strategies:
1) Reserve() a really, really big vector as soon as the app launches. However, this assumes the user won't load additional files that contribute to this vector, pushing it beyond the pre-allocated limit and possibly landing back in the same situation.
2) Change the vector of objects/memory allocations into a vector of pointers to objects/memory allocations. That clearly makes the vector itself a more manageable size, but it then creates 20,000 small objects (which could eventually become more like 50,000, depending on what additional files the user loads). Does this create a gigantic overhead problem?
3) Change from a vector to a list, which may have its own overhead issues.
The vector is being constantly iterated through, and generally only appended to.
Any sage thoughts on these issues?
===============
ADDITIONAL NOTE: this particular vector just holds all imported records, so they can be indexed and sorted by ANOTHER vector that contains a sort order. Once an item is put into this vector, it stays there for the lifetime of the app (this also helps support undo operations by making sure the index into the vector always remains the same for that particular object).
I think a std::deque would be more suitable than a std::list or a std::vector in your case. std::list is not efficient for iteration or random indexing, while std::vector is slow to resize (as you have observed). A std::deque does not need a single large contiguous block when it grows, at the cost of slightly slower random indexing than a vector.
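A minimal sketch of the suggestion (DataRecord here is a stand-in for the 72-byte record in the question):

#include <deque>

struct DataRecord { char payload[72]; };  // placeholder for the real record type

int main() {
    std::deque<DataRecord> records;       // grows in fixed-size chunks,
    for (int i = 0; i < 20000; ++i)       // so it never needs one huge contiguous block
        records.push_back(DataRecord{});
    return records.size() == 20000 ? 0 : 1;
}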
Don't use contiguous storage like a vector. Go for a deque or a list and the reallocations will not fail any more.
If you really need high performance, consider writing your own container (i.e. an ArrayList-style structure).
Out of your three options, 1 doesn't seem like a guaranteed solution, while 2 adds complexity and the vector still has to grow.
Option 3 seems somewhat reasonable (or possibly use a deque as mentioned in another answer) because while it's semantically similar to option 2, it provides a more normalized method of allocating a new data object. Of course this assumes that you only append data and don't need random access.
However all that said I find it incredible that your program has fragmented memory so badly that on reasonably modern hardware it can't allocate 1.2MB of memory. Far more likely is that there's some undefined behavior lurking (or possibly a memory leak) in your program causing it to behave in this way, failing to allocate the memory. You could use valgrind to help hunt down what may be going on. Does the same problem happen when you use the builtin new and delete rather than GMalloc?
Is your program being ulimited to only have access to a small amount of memory?
Finally, if valgrind finds nothing and your program really is fragmenting memory horribly, I would consider stepping back and reconsidering your approach. You may want to evaluate an alternate approach that doesn't rely on millions(?) of allocations (I just can't see a small number of allocations fragmenting the heap that much).
If even the heap does not have enough contiguous space, use a deque.
A deque does not allocate contiguous space; it allocates new blocks only as they are needed, so it can cope with the 1.2 MB limit.
A deque is made up of a number of separate blocks of memory rather than one contiguous area, which is why it can work. It is not a total guarantee, though, because you do not control the deque's internal behaviour.
See this article about memory fragmentation (the following is copied from the web page):
http://www.design-reuse.com/articles/25090/dynamic-memory-allocation-fragmentation-c.html
Memory Fragmentation
The best way to understand memory fragmentation is to look at an example. For this example, it is assumed that there is a 10K heap. First, an area of 3K is requested, thus:
#define K (1024)
char *p1;
p1 = malloc(3*K);
Then, a further 4K is requested:
p2 = malloc(4*K);
3K of memory is now free.
Some time later, the first memory allocation, pointed to by p1, is de-allocated:
free(p1);
This leaves 6K of memory free in two 3K chunks. A further request for a 4K allocation is issued:
p1 = malloc(4*K);
This results in a failure – NULL is returned into p1 – because, even though 6K of memory is available, there is not a 4K contiguous block available. This is memory fragmentation.
This is an issue even for kernels that use virtual memory, such as OS X.

Can't track down source of huge memory use

I've been trying to track down a memory problem for a couple of days - my program is using around 3GB of memory, when it should be using around 200MB-300MB. Valgrind is actually reporting that it is using ~300MB at its peak, and is not reporting any memory leaks.
The program reads an input file, and stores every unique word in that file. It is multi-threaded, and I've been running it using 4 threads. My major sources of data are:
Constant-size array of wchar_t (4MB total)
Map between words and a list of associated values. This grows with the size of input. If there are 1,000,000 unique words in the input file, there will be 1,000,000 entries in the tree.
I am doing a huge number of allocations and deallocations (using new and delete) -- at least two per unique word. Is it possible that memory I free is not being reused for some reason, causing the program to keep acquiring more and more memory? It consistently grabs more as it continues to run.
In general, any ideas about where I should go from here?
Edit 1 (based on advice from Graham):
One path I'll try is minimizing allocation. I'll work with a single string per thread (which may grow occasionally if a word is longer than this string is), but if I remember my code correctly this will eliminate a huge number of new/delete calls. If all goes well I'll be left with: one-time allocation of input buffer, one-time allocation of string-per-thread (with some reallocs), two allocs per map entry (one for key, one for value).
Thanks!
It's likely to be heap fragmentation. Because you are allocating and releasing small blocks in such huge quantities, it's probable that there are loads of small free chunks which are too small to be reused by subsequent allocations. Since these chunks are effectively wasted, the process has to keep grabbing more and more memory from the system to honour new allocations.
You may be able to mitigate the effect by first reserving a sufficiently large default capacity in each string with string::reserve(), and then clearing strings to empty when you're finished with them (rather than deleting). Then, keep a list of empty strings to be reused instead of allocating new ones all the time.
EDIT: the above suggestion assumes the objects being allocated are std::strings. If they're not, then you can probably still apply the general technique of keeping old empty objects around for reuse.
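A minimal sketch of the reuse technique described above, assuming the objects are std::strings (the default capacity is a placeholder):

#include <string>
#include <utility>
#include <vector>

// Hands out pre-reserved strings and takes them back for reuse, so the
// program stops churning through small heap blocks for every word.
class StringPool {
public:
    std::string acquire() {
        if (free_.empty()) {
            std::string s;
            s.reserve(64);               // placeholder default capacity
            return s;
        }
        std::string s = std::move(free_.back());
        free_.pop_back();
        s.clear();                       // clear() keeps the existing capacity
        return s;
    }
    void release(std::string s) { free_.push_back(std::move(s)); }

private:
    std::vector<std::string> free_;
};

int main() {
    StringPool pool;
    std::string word = pool.acquire();
    word = "example";
    pool.release(std::move(word));       // the buffer is kept for the next word
    return 0;
}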
Memory your program frees should be returned to the heap where it can be allocated again.
However, that does not mean it is freed back to the operating system. Often, the app will continue to "own" memory that has been allocated and freed.
Is this a Windows app? How are you allocating and freeing the memory? And how are you determining how much memory the app is using?
You should try wrapping the resource allocations into a class if you can. Call new in the constructor, and delete in the destructor. Try and take advantage of scope so memory management is done more automatically.
http://en.wikipedia.org/wiki/RAII
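A minimal RAII sketch of that idea (WordBuffer is a hypothetical name; in practice std::vector, std::string, or std::unique_ptr already give you this behaviour):

#include <cstddef>

// Hypothetical RAII wrapper: the buffer is allocated in the constructor and
// released in the destructor, so it cannot leak once the object goes out of scope.
class WordBuffer {
public:
    explicit WordBuffer(std::size_t n) : data_(new wchar_t[n]), size_(n) {}
    ~WordBuffer() { delete[] data_; }

    WordBuffer(const WordBuffer&) = delete;             // no accidental double-delete
    WordBuffer& operator=(const WordBuffer&) = delete;

    wchar_t* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    wchar_t* data_;
    std::size_t size_;
};

int main() {
    WordBuffer buf(1024);   // allocated here
    buf.data()[0] = L'a';
    return 0;
}                           // automatically freed here, even on early return or exception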