How do you benchmark memory consumption? - c++

I would like to know if there is an efficient way to measure the actual memory consumption of a particular C data structure.
The goal being to make benchmarks based on how the memory usage changes after specific operations on those data structures.
I do not seek a way to count the number of objects in use; I do want to know exactly how big the memory usage of an object put under stress can get.
Is there a standard way to do that, either in C code, or from outside? (Some equivalent of the time(1) utility would be a start.)
Obviously, I could track down every single pointer and do a sum of all sizeofs. If this is the only way, please do tell me. I wonder whether there is a simpler way. Or maybe a library to do it for me.

If you want to monitor the memory usage of the program on a global level you can replace new/delete in C++ or malloc/free in C with your own functions and log the memory usage.
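A minimal sketch of that replacement idea for the C++ side (the counter and function names are mine; a full version would also replace the array, nothrow, and aligned overloads):
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <new>

// Bytes currently live through operator new, tracked by the replacements below.
static std::atomic<std::size_t> g_liveBytes{0};

void* operator new(std::size_t size)
{
    // Over-allocate by one max_align_t slot so the size can be stashed in
    // front of the block without breaking alignment.
    constexpr std::size_t header = alignof(std::max_align_t);
    void* p = std::malloc(size + header);
    if (!p) throw std::bad_alloc();
    *static_cast<std::size_t*>(p) = size;
    g_liveBytes += size;
    return static_cast<char*>(p) + header;
}

void operator delete(void* p) noexcept
{
    if (!p) return;
    constexpr std::size_t header = alignof(std::max_align_t);
    char* base = static_cast<char*>(p) - header;
    g_liveBytes -= *reinterpret_cast<std::size_t*>(base);
    std::free(base);
}

void operator delete(void* p, std::size_t) noexcept { operator delete(p); }

std::size_t liveBytes() { return g_liveBytes.load(); }
Calling liveBytes() before and after an operation on the data structure then gives the change in heap usage attributable to it.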

On Unix, for memory consumption you can use Valgrind with the Massif tool (plus any visualization tool), but I don't know if it is suited to your problem, since it will give you a detailed view of all the memory consumption of your program.

Yep, cnicutar, on Linux you have pmap or maybe even pstat.
On the MS side there are myriad profiling tools for Visual Studio, depending on your contribution to the MS machine (even free ones for command-line use). Call me a greenhorn, but I don't have issues with memory leaks.

Related

C++: Measuring memory usage from within the program, Windows and Linux [duplicate]

(Marked as a duplicate of: How to determine CPU and memory consumption from inside a process.)
See, I wanted to measure memory usage of my C++ program. From inside the program, without profilers or process viewers, etc.
Why from inside the program?
Measurements will be done thousands of times—must be automated; therefore, having an eye on Task Manager, top, whatever, will not do
Measurements are to be done during production runs—performance degradation, which may be caused by profilers, is not acceptable since the run times are non-negligible already (several hours for large problem instances)
Note: why measure at all? The only reason to measure used memory (as reported by the OS), as opposed to calculating “expected” usage in advance, is the fact that I cannot directly, analytically “sizeof” how much memory my principal data structure uses. The structure itself is
unordered_map<bitset, map<uint16_t, int64_t> >
these are packed into a vector for all I care (a list would actually suffice as well, I only ever need to access the “neighbouring” structures; without details on memory usage, I can hardly decide which to choose)
vector< unordered_map<bitset, map<uint16_t, int64_t> > >
so if anybody knows how to “sizeof” the memory occupied by such a structure, that would also solve the issue (though I'd probably have to fork the question or something).
Environment: It may be assumed that the program runs all alone on the given machine (along with the OS, etc. of course; either a PC or a supercomputer's node); it is certain to be the only one program requiring large (say > 512 MiB) amounts of memory—computational experiment environment. The program is either run on my home PC (16GiB RAM; Windows 7 or Linux Mint 18.1) or the institution supercomputer's node (circa 100GiB RAM, CentOS 7), and the program may want to consume all that RAM. Note that the supercomputer effectively prohibits disk swapping of user processes, and my home PC has a smallish page file.
Memory usage pattern. The program can be said to sequentially fill a sort of table, each row wherein is the vector<...> as specified above. Say the prime data structure is called supp. Then, for each integer k, to fill supp[k], the data from supp[k-1] is required. As supp[k] is filled it is used to initialize supp[k+1]. Thus, at each time, this, prev, and next “table rows” must be readily accessible. After the table is filled, the program does a relatively quick (compared with “initializing” and filling the table), non-exhaustive search in the table, through which a solution is obtained. Note that the memory is only allocated through the STL containers, I never explicitly new() or malloc() myself.
Questions. Wishful thinking.
What is the appropriate way to measure total memory usage (including swapped to disk) of a process from inside its source code (one for Windows, one for Linux)?
Should probably be another question, or rather a good googling session, but still: what is the proper (or just easy) way to explicitly control (say encourage or discourage) swapping to disk? A pointer to an authoritative book on the subject would be very welcome. Again, forgive my ignorance; I'd like a means to say something along the lines of “NEVER swap supp” or
“swap supp[10]”; then, when I need it, “unswap supp[10]”, all from the program's code. I thought I'd have to resort to serializing the data structures and explicitly storing them as a binary file, then reversing the transformation.
On Linux, it appeared easiest to just grab the heap pointer through sbrk(0), cast it to a 64-bit unsigned integer, and compute the difference after the memory gets allocated; this approach produced plausible results (I have not done more rigorous tests yet).
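A minimal sketch of that sbrk(0) trick, as I understand it (Linux-only; the helper name is mine, and note that sbrk(0) only sees allocations served from the program break, not the large blocks glibc satisfies with mmap):
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

static std::uintptr_t programBreak() {
    return reinterpret_cast<std::uintptr_t>(sbrk(0));
}

int main() {
    const std::uintptr_t before = programBreak();

    // Many small blocks, so the allocator serves them from the main heap (brk)
    // rather than via mmap, which sbrk(0) would not see.
    std::vector<char*> blocks;
    for (int i = 0; i < 100000; ++i)
        blocks.push_back(new char[64]);

    const std::uintptr_t after = programBreak();
    std::printf("program break grew by %zu bytes\n",
                static_cast<std::size_t>(after - before));

    for (char* b : blocks) delete[] b;
}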
edit 5. Removed reference to HeapAlloc wrangling—irrelevant.
edit 4. Windows solution
This bit of code reports the working set that matches the one in Task Manager; that's about all I wanted. Tested on Windows 10 x64 with allocations like new uint8_t[1024*1024], or rather new uint8_t[1ULL << howMuch]; not in my “production” yet.
On Linux, I'd try getrusage or something to get the equivalent.
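For reference, a hedged sketch of what that Linux equivalent might look like with getrusage (my assumption, not something tested for this post; on Linux ru_maxrss is the peak resident set size in kilobytes):
#include <sys/resource.h>

// Peak resident set size of this process, in KiB (Linux), or -1 on failure.
long peakResidentSetKiB() {
    rusage ru{};
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        return ru.ru_maxrss;
    return -1;
}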
The principal element is GetProcessMemoryInfo, as suggested by @IInspectable and @conio.
#include <cstddef>
#include <Windows.h>
#include <Psapi.h>   // may require linking Psapi.lib, depending on SDK/PSAPI_VERSION

// Returns this process' working set in bytes (the figure Task Manager shows),
// or 0 on failure. (Wrapped in a function here so the fragment compiles.)
std::size_t workingSetBytes()
{
    // get the handle to this process
    const HANDLE myHandle = GetCurrentProcess();
    // to fill in the process' memory usage details
    PROCESS_MEMORY_COUNTERS pmc{};
    // return the usage (bytes), if I may
    if (GetProcessMemoryInfo(myHandle, &pmc, sizeof(pmc)))
        return pmc.WorkingSetSize;
    return 0;
}
edit 5. Removed reference to GetProcessWorkingSetSize as irrelevant. Thanks @conio.
On Windows, the GlobalMemoryStatusEx function gives you useful information both about your process and the whole system.
Based on this table you might want to look at MEMORYSTATUSEX.ullAvailPhys to answer "Am I getting close to hitting swapping overhead?" and changes in (MEMORYSTATUSEX.ullTotalVirtual – MEMORYSTATUSEX.ullAvailVirtual) to answer "How much RAM is my process allocating?"
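A minimal sketch of those two checks (the function and variable names are mine; GlobalMemoryStatusEx and the MEMORYSTATUSEX fields are as documented):
#include <Windows.h>
#include <cstdio>

void reportMemoryStatus()
{
    MEMORYSTATUSEX ms{};
    ms.dwLength = sizeof(ms);              // must be set before the call
    if (GlobalMemoryStatusEx(&ms)) {
        // "Am I getting close to swapping?": free physical RAM, system-wide.
        std::printf("available physical RAM: %llu MiB\n", ms.ullAvailPhys >> 20);
        // "How much is my process allocating?": watch how this figure changes.
        std::printf("virtual address space in use by this process: %llu MiB\n",
                    (ms.ullTotalVirtual - ms.ullAvailVirtual) >> 20);
    }
}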
To know how much physical memory your process takes you need to query the process working set or, more likely, the private working set. The working set is (more or less) the number of physical pages in RAM your process uses. The private working set excludes shared memory.
See
What is private bytes, virtual bytes, working set?
How to interpret Windows Task Manager?
https://blogs.msdn.microsoft.com/tims/2010/10/29/pdc10-mysteries-of-windows-memory-management-revealed-part-two/
for terminology and a little bit more details.
There are performance counters for both metrics.
(You can also use QueryWorkingSet(Ex) and calculate that on your own, but that's just nasty in my opinion. You can get the (non-private) working set with GetProcessMemoryInfo.)
But the more interesting question is whether or not this helps your program to make useful decisions. If nobody's asking for memory or using it, the mere fact that you're using most of the physical memory is of no interest. Or are you worried about your program alone using too much memory?
You haven't said anything about the algorithms it employs or its memory usage patterns. If it uses lots of memory, but does this mostly sequentially, and comes back to old memory relatively rarely, it might not be a problem. Windows writes "old" pages to disk eagerly, before paging out resident pages becomes strictly necessary to satisfy demand for physical memory. If everything goes well, reusing those already-written pages for something else is really cheap.
If your real concern is memory thrashing ("virtual memory will be of no use due to swapping overhead"), then this is what you should be looking for, rather than trying to infer (or guess...) that from the amount of physical memory used. A more useful metric would be page faults per unit of time. It just so happens that there are performance counters for this too. See, for example Evaluating Memory and Cache Usage.
I suspect this to be a better metric to base your decision on.
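As a rough, hedged stand-in for those performance counters, one could also sample the cumulative fault count that GetProcessMemoryInfo already reports and turn it into a rate (my sketch, not part of the original answer; note PageFaultCount includes both soft and hard faults, which the real counters can separate):
#include <Windows.h>
#include <Psapi.h>
#include <cstdio>

void samplePageFaultRate()
{
    PROCESS_MEMORY_COUNTERS a{}, b{};
    GetProcessMemoryInfo(GetCurrentProcess(), &a, sizeof(a));
    Sleep(1000);                               // the workload runs on other threads meanwhile
    GetProcessMemoryInfo(GetCurrentProcess(), &b, sizeof(b));
    std::printf("page faults in the last second: %lu\n",
                b.PageFaultCount - a.PageFaultCount);
}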

how can I see how much memory my program is eating?

My program is running and creating variables; I need to know the total number of bytes these variables take.
I don't want to know how much physical memory the system gives my program to execute; I know I can open the process manager and find that out.
Nor do I want to sprinkle my code with sizeof calls and aggregations to compute the total size of the variable pool (let's say the code is too complex to be modified like that).
Finally, I'm using Microsoft VC++ 2010 Express; I just want to know if there is a view that monitors that kind of information.
Thanks in advance.
Check this out: Memory Performance Information. There are a few metrics of a running process you might be interested in; you will primarily want private bytes, and this data is available both programmatically and through tools like Performance Monitor. You can also enumerate heaps of the process with GetProcessHeaps (and even HeapWalk if you need details) and check heap allocation sizes directly.
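A hedged sketch of that heap-enumeration idea (my code, under my own naming): sum the busy blocks of every heap in the process. Walking heaps is slow and is best done while other threads are not allocating.
#include <Windows.h>
#include <vector>

SIZE_T totalHeapBusyBytes()
{
    DWORD count = GetProcessHeaps(0, nullptr);          // ask how many heaps exist
    std::vector<HANDLE> heaps(count);
    count = GetProcessHeaps(count, heaps.data());

    SIZE_T total = 0;
    for (DWORD i = 0; i < count; ++i) {
        PROCESS_HEAP_ENTRY entry{};
        entry.lpData = nullptr;                          // start the walk from the beginning
        HeapLock(heaps[i]);
        while (HeapWalk(heaps[i], &entry)) {
            if (entry.wFlags & PROCESS_HEAP_ENTRY_BUSY)  // allocated (not free) block
                total += entry.cbData;
        }
        HeapUnlock(heaps[i]);
    }
    return total;
}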
The Valgrind Massif profiler is a great tool (see here), but only for Unix/Linux I think. In your case, on Windows, I think Insure++ or softwareverify are good choices (they are commercial tools).
A free alternative is Google's tcmalloc, which provides a heap profiler here.

how to find allocated memory in linux

Good afternoon all,
What I'm trying to accomplish: I'd like to implement an extension to a C++ unit test fixture to detect if the test allocates memory and doesn't free it. My idea was to record allocation levels or free memory levels before and after the test. If they don't match then you're leaking memory.
What I've tried so far: I've written a routine to read /proc/self/stat to get the vm size and resident set size. Resident set size seems like what I need but it's obviously not right. It changes between successive calls to the function with no memory allocation. I believe it's returning the cached memory used not what's allocated. It also changes in 4k increments so it's too coarse to be of any real use.
I can get the stack size by allocating a local and saving its address. Are there any problems with doing this?
Is there a way to get real free or allocated memory on linux?
Thanks
Your best bet may actually be to use a tool specifically designed for the job of finding memory leaks. I have personal experience with Electric Fence, which is easy to use and seems to do the job nicely (not sure how well it will handle C++). Also recommended by others is Dmalloc.
For sure though, everyone seems to like Valgrind, which can do just about anything and even has front-ends (though anything that has a front-end built for it means that it probably isn't the simplest thing in the world). If the KDE folks can recommend it, it must be able to handle just about anything. (I'm not saying anything bad about KDE, just that it is a very large C++ codebase, so if Valgrind can handle KDE software, it must have something going for it. Though I don't have personal experience with it as Electric Fence was always enough for me)
I'd have to agree with those suggesting Valgrind and similar, but if the run-time overhead is too great, one option may be to use the mallinfo() call to retrieve statistics on currently allocated memory, and check whether uordblks is nonzero.
Note that this will have to be run before global destructors are called - so if you have any allocations that are cleaned up there, this will register a false positive. It also won't tell you where the allocation is made - but it's a good first pass to figure out which test cases need work.
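A minimal sketch of that check, assuming a glibc system (the helper name is mine; newer glibc deprecates mallinfo() in favour of mallinfo2(), whose fields are 64-bit):
#include <malloc.h>

// Bytes currently allocated through malloc/new, per glibc's bookkeeping.
int bytesCurrentlyAllocated()
{
    struct mallinfo mi = mallinfo();
    return mi.uordblks;
}
In a test fixture one would record this before and after the test body and treat any difference as a suspected leak, subject to the false positives described above.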
Don't look at the OS to get allocation info. The C library manages memory internally and only asks the OS for more RAM in chunks (4 KB in your case). In most cases it's never released back to the OS, so you can't really check anything there.
You'll have to patch malloc() and free() to get the info you need.
Or, use Valgrind.
Not a direct answer, but you could redefine the global ::operator new and ::operator delete and, internally (via a singleton or global object), keep track of the allocated and deallocated memory.
Edit: If this is a personal, DIY project then cool. But if it's for something critical you can always jump onto one of the many leak-detection libraries/programs available; a quick google search should suffice.
google-perftools can be used in your test code.

Will Garbage Collected C be Faster Than C++?

I have been wondering for quite some time how to manage memory in my next project, which is writing a DSL in C/C++.
It can be done in any of these three ways.
Reference counted C or C++.
Garbage collected C.
In C++, copying classes and structures from stack to stack and managing strings separately with some kind of GC.
The community probably already has a lot of experience on each of these methods. Which one will be faster? What are the pros and cons for each?
A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on the OS to do this job better and faster than we can ourselves.
It all depends! That's a pretty open question. It needs an essay to answer it!
Hey.. here's one somebody prepared earlier:
http://lambda-the-ultimate.org/node/2552
http://www.hpl.hp.com/personal/Hans_Boehm/gc/issues.html
It depends how big your objects are, how many of them there are, how fast they're being allocated and discarded, and how much time you want to invest in optimizing and tweaking. If you know the limits of how much memory you need, then for fast performance I would think you can't really beat grabbing all the memory you need from the OS up front and then managing it yourself.
The reason it can be slow to allocate memory from the OS is that it deals with lots of processes and memory on disk and in RAM, so to get memory it's got to decide if there is enough. Possibly, it might have to page another process's memory out from RAM to disk so it can give you enough. There's lots going on. So managing it yourself (or with a GC-collected heap) can be far quicker than going to the OS for each request. Also, the OS usually deals with bigger chunks of memory, so it might round up the size of requests you make, meaning you could waste memory.
Have you got a real hard requirement for going super quick? A lot of DSL applications don't need raw performance. I'd suggest going with whatever's simplest to code. You could spend a lifetime writing memory management systems and worrying which is best.
Why would garbage collected C be faster than C++? The only garbage collectors available for C are pretty inefficient things, more designed to plug memory leaks than to actually improve the quality of your code.
In any case, C++ has the potential for reaching better performance with less code (note that it's only a potential. It's also very possible to write C++ code that is far slower than the equivalent C).
Considering the current state of both languages, GC's are not currently going to improve performance in your code. GC's can be made very efficient in languages designed for it. C/C++ are not among those. ;)
Apart from that, it's impossible to say. Languages don't have a speed. It doesn't make sense to ask which language is faster. It depends on 1) the specific code, 2) the compiler that compiles it, and 3) the system it's running on (hardware as well as OS).
malloc is a fairly slow operation, far slower than the .NET equivalents, so yes, if you are performing a lot of small allocations, you may be better off allocating a large pool of memory once, and then using chunks of that.
The reason is that the OS has to find a free chunk of memory, basically by following a linked list of all free memory areas. In .NET, a new() call is basically nothing more than moving the heap pointer as many bytes as required by the allocation.
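To make the “just move the heap pointer” point concrete, here is a toy bump-pointer arena (purely illustrative, not how any particular runtime is implemented; individual frees are deliberately unsupported, since a GC or arena reset would handle reclamation):
#include <cstddef>
#include <cstdlib>
#include <new>

class BumpArena {
public:
    explicit BumpArena(std::size_t bytes)
        : base_(static_cast<char*>(std::malloc(bytes))),
          top_(base_), end_(base_ ? base_ + bytes : nullptr)
    {
        if (!base_) throw std::bad_alloc();
    }
    ~BumpArena() { std::free(base_); }

    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t{15};                 // keep 16-byte alignment
        if (n > static_cast<std::size_t>(end_ - top_))   // out of arena space
            throw std::bad_alloc();
        void* p = top_;
        top_ += n;                                       // the entire "allocation"
        return p;
    }

private:
    char* base_;
    char* top_;
    char* end_;
};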
Uh... It depends how you write the garbage collection system for your DSL. Neither C nor C++ comes with a garbage collection facility built in, but either could be used to write a very efficient or a very inefficient garbage collector. Writing such a thing, by the way, is a non-trivial task.
DSLs are often written in higher level languages such as Ruby or Python specifically because the language writer can leverage the garbage collection and other facilities of the language. C and C++ are great for writing full, industrial strength languages but you certainly need to know what you are doing to use them - knowledge of yacc and lex is especially useful here but a good understanding of dynamic memory management is important also, as you say. You could also check out keykit, an open source music DSL written in C, if you still like the idea of a DSL in C/C++.
With most garbage collection implementations, allocation can see a speed improvement, but then you have the additional cost of the collection phase which can be triggered at any point in your program's execution, leading to a sudden (seemingly random) delay.
As for your second question, it depends on your memory management algorithms. You'd be safe sticking with your library's default malloc implementation, but there are alternatives which boast better performance.
A related side question. Will malloc/free be slower than allocating a big chunk at the beginning of the program and running my own memory manager over it? .NET seems to do it. But I am confused why we can't count on the OS to do this job better and faster than we can ourselves.
The problem with letting the OS handle memory allocation is that it introduces nondeterministic behaviour. There's no way for the programmer to know how long the OS will take to return a new chunk of memory; an allocation may be quite costly if memory has to be paged out to disk.
Preallocating therefore might be a good idea, especially when using a copying garbage collector. It'll increase memory consumption, but allocation will be fast because in most cases it'll just be a pointer increment.
As people have pointed out - GC is faster to allocate (because it just gives you the next block on its list), but slower overall (because it has to compact the heap regularly, in order for allocs to be fast).
so - go for the compromise solution (which is actually pretty damn good):
You create your own heaps, one for each size of object you generally allocate (or 4-byte, 8-byte, 16-byte, 32-byte, etc.); then, when you want a new piece of memory, you grab the last 'block' on the appropriate heap. Because you pre-allocate from these heaps, all you need to do when allocating is grab the next free block. This works better than the standard allocator because you are happily wasting memory: if you want to allocate 12 bytes, you'll give up a whole 16-byte block from the 16-byte heap. You keep a bitmap of used vs free blocks so you can allocate quickly without wasting loads of memory or needing to compact.
Also, because you're running several heaps, highly parallel systems work much better, as you don't need to lock so often (i.e. you have a separate lock per heap, so you don't get contention nearly as much).
Try it - we used it to replace the standard heap on a very intensive application, performance went up by quite a lot.
BTW, the reason the standard allocators are slow is that they try not to waste memory. So if you allocate a 5-byte, a 7-byte, and a 32-byte block from the standard heap, it'll keep those 'boundaries'. Next time you need to allocate, it'll walk through those looking for enough space to give you what you asked for. That worked well for low-memory systems, but you only have to look at how much memory most apps use today to see that GC systems go the other way, and try to make allocations as fast as possible whilst caring nothing for how much memory is wasted.
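A sketch of that per-size-class pool idea (my code, under my own naming; the answer keeps a bitmap of used vs free blocks, whereas this version threads a free list through the free blocks, which is the same “grab the next free block” scheme with less bookkeeping):
#include <cstddef>
#include <cstdlib>
#include <new>

class FixedBlockPool {
public:
    FixedBlockPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(blockSize < sizeof(void*) ? sizeof(void*) : blockSize),
          storage_(static_cast<char*>(std::malloc(blockSize_ * blockCount)))
    {
        if (!storage_) throw std::bad_alloc();
        // Pre-link every block into the free list.
        for (std::size_t i = 0; i < blockCount; ++i)
            release(storage_ + i * blockSize_);
    }
    ~FixedBlockPool() { std::free(storage_); }

    void* acquire() {                 // O(1): pop the head of the free list
        if (!freeList_) throw std::bad_alloc();
        void* p = freeList_;
        freeList_ = *static_cast<void**>(freeList_);
        return p;
    }

    void release(void* p) {           // O(1): push the block back onto the free list
        *static_cast<void**>(p) = freeList_;
        freeList_ = p;
    }

private:
    std::size_t blockSize_;
    char* storage_;
    void* freeList_ = nullptr;
};
A 12-byte request would come out of a 16-byte pool, trading the 4 wasted bytes for constant-time allocation, exactly the trade-off described above.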
The problem has a lot of variables, but if your application is written with garbage collection in mind, and if you exploit the special features of the Boehm collector, such as different allocation calls for blocks that don't contain pointers, then as a general rule your application
- Will have simpler interfaces
- Will run somewhat faster
- Will require from 1.2x to 2x the space
than a similar application using explicit memory management.
For documentation and evidence supporting these claims, you can see the information on Boehm's web site, and also Ben Zorn's several papers on the measured cost of conservative garbage collection.
Most importantly you'll save a ton of effort and won't have to worry about a significant class of memory-management bugs.
The issue of C vs C++ is orthogonal, but GC will definitely be faster than reference counting, especially when there's no compiler support for reference counting.
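For illustration, a hedged sketch of the Boehm collector usage alluded to above (assumes libgc is installed and linked; GC_MALLOC_ATOMIC is the "block contains no pointers" allocation call mentioned earlier):
#include <gc.h>        // Boehm-Demers-Weiser collector; typically linked with -lgc
#include <cstring>

int main()
{
    GC_INIT();                                            // once, at startup

    // Pointer-containing block: the collector scans it for references.
    int** table = static_cast<int**>(GC_MALLOC(100 * sizeof(int*)));

    // Pointer-free data (strings, buffers, ...): GC_MALLOC_ATOMIC tells the
    // collector not to scan it, which is the speed/space win referred to above.
    char* text = static_cast<char*>(GC_MALLOC_ATOMIC(64));
    std::strcpy(text, "no explicit free needed");

    (void)table;
    // No free/delete calls: unreachable blocks are reclaimed by the collector.
    return 0;
}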
Neither C nor C++ will give you garbage collection for free. What they will give you is memory allocation libraries (which provide malloc/free, etc.). There are many online resources on algorithms for writing garbage collection libraries. A good start is link text.
Most non-GC languages allocate and deallocate memory as it is needed and no longer needed. GC'd languages usually allocate large chunks of memory beforehand and only free the memory when idle, not in the middle of an intensive task, so I am going to say yes, if the GC kicks in at the correct time.
The D programming language is a garbage-collected language, ABI compatible with C and partly ABI compatible with C++. This page shows some benchmarks of string performance in C++ and D.
I suggest that if you have written a program where memory allocation and deallocation (explicitly or GC'ed) is the bottleneck, then you should re-think your architecture, design and implementation.
If you don't want to explicitly manage memory, don't use C/C++. There are plenty of languages with either reference counting or compiler-supported garbage collectors that will probably work much better for you.
C/C++ were designed in an environment where the programmer manages their own memory. Trying to retrofit GC or ref counting onto them may help some, but you'll find that you either have to compromise the performance of the GC (because it doesn't have any compiler hinting as to where pointers might be), or you'll find new and fascinating ways to screw up the reference counts or the GC or whatever.
I know it sounds like a good idea, but really, you should just grab a language more suited to the task.

Reducing memory footprint of large unfamiliar codebase

Suppose you have a fairly large (~2.2 MLOC), fairly old (started more than 10 years ago) Windows desktop application in C/C++. About 10% of modules are external and don't have sources, only debug symbols.
How would you go about reducing application's memory footprint in half? At least, what would you do to find out where memory is consumed?
Override malloc()/free() and new()/delete() with wrappers that keep track of how big the allocations are and (by recording the callstack and later resolving it against the symbol table) where they are made from. On shutdown, have your wrapper display any memory still allocated.
This should enable you both to work out where the largest allocations are and to catch any leaks.
This is a description/skeleton of the memory tracing application I used to reduce the memory consumption of our game by 20%. It helped me track many allocations done by external modules.
It's not an easy task. Begin by chasing down any memory leaks you can find (a good tool would be Rational Purify). Skim the source code and try to optimize data structures and/or algorithms.
Sorry if this may sound pessimistic, but cutting down memory usage by 50% doesn't sound realistic.
There is a chance you can find some significant inefficiencies very fast. First you should check what the memory is used for. A tool I have found very handy for this is Memory Validator.
Once you have this "memory usage map", you can check for low-hanging fruit. Are there any data structures consuming a lot of memory that could be represented in a more compact form? This is often possible, especially when the data access is well encapsulated and when you have spare CPU power you can dedicate to compressing/decompressing them on each access.
I don't think your question is well posed.
The size of the source code is not directly related to the memory footprint. Sure, the compiled code will occupy some memory, but the application will have memory requirements of its own, both static (the variables declared in the code) and dynamic (the objects the application creates).
I would suggest you profile program execution and study the code carefully.
First places to start for me would be:
Does the application preallocate a lot of memory to be used later? Does this memory often sit around unused, never handed out? Consider switching to newing/deleting (or better, a smart pointer) as needed.
Does the code use a static array such as
Object arrayOfObjs[MAX_THAT_WILL_EVER_BE_USED];
and hand out objs in this array? If so, consider manually managing this memory.
One of the tools for memory usage analysis is LeakDiag, available for free download from Microsoft. It apparently lets you hook all user-mode allocators down to VirtualAlloc and dump process allocation snapshots to XML at any time. These snapshots can then be used to determine which call stacks allocate the most memory and which call stacks are leaking. It lacks a pretty frontend for snapshot analysis (unless you can get LDParser/LDGrapher via Microsoft Premier Support), but all the data is there.
One more thing to note: you may get false leak positives from the BSTR allocator due to caching; see "Hey, why am I leaking all my BSTR's?"