Can massif measure global/static data cost? - profiling

I see massif can measure heap use, and also stack use with some options. Does it also report global data consumption (the data defined as global or static variables)?

Does it also report global data consumption (the data defined as global or static variables)?
No, Massif is a heap-only tool: it does not measure the .data and .bss sections, nor memory that is directly mmap-ed (though it can optionally measure the stack, which is used to store local variables and alloca allocations):
http://valgrind.org/docs/manual/ms-manual.html
Massif is a heap profiler. It measures how much heap memory your program uses. This includes both the useful space, and the extra bytes allocated for book-keeping and alignment purposes. It can also measure the size of your program's stack(s), although it does not do so by default. ...
9.2.8. Measuring All Memory in a Process
It is worth emphasising that by default Massif measures only heap memory, i.e. memory allocated with malloc, calloc, realloc, memalign, new, new[], and a few other, similar functions. (And it can optionally measure stack memory, of course.) This means it does not directly measure memory allocated with lower-level system calls such as mmap, mremap, and brk. ...
--stacks=<yes|no> [default: no]
Specifies whether stack profiling should be done. This option slows Massif down greatly, and so is off by default. Note that Massif assumes that the main stack has size zero at start-up. This is not true, but doing otherwise accurately is difficult. Furthermore, starting at zero better indicates the size of the part of the main stack that a user program actually has control over.
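For reference, a typical invocation (assuming your program is ./myprog; the flags are the ones documented in the manual quoted above) looks like:

    valgrind --tool=massif --stacks=yes ./myprog
    ms_print massif.out.<pid>

Section 9.2.8 also describes --pages-as-heap=yes, which makes Massif profile at the page level instead of the allocator level, so memory obtained via mmap/brk (and hence the process's code and static data mappings) is counted as well, at the cost of losing per-allocation detail.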


How does GLIBC decide segment for malloc

I'm looking at a Linux system with glibc (2.25) and I see that when code calls malloc, sometimes the buffer is allocated in the heap segment and sometimes in an anonymous segment. It doesn't seem to be related to size; I can see all the segments in /proc/PID/maps.
I thought the heap segment was related to malloc and anonymous segments to mmap. So why, for the same size, does glibc sometimes use the heap and sometimes mmap?
I also saw that sometimes when I call malloc in one thread the memory is allocated in the heap segment, but when I switch to another thread (using GDB) the memory is allocated in an anonymous segment.
glibc's malloc implementation will sometimes use brk or sbrk (what you're calling the heap -- it shows up as 'heap' in /proc/PID/maps) and sometimes use mmap. Which one it uses depends on some tradeoffs, but generally:
if a process only needs a small amount of heap space, brk/sbrk is better
if a process needs a lot of heap space and/or very large blocks, mmap is better.
So glibc's malloc implementation has a bunch of heuristics to decide what is 'small' and what is 'large', and looks at what calls have been made so far to malloc/free in order to decide which method to use to get more memory from the system when it needs it. The per-thread behaviour you observed comes from glibc's arenas: the main arena grows with brk, but the additional arenas created for other threads are themselves obtained with mmap, so allocations served from them show up as anonymous mappings.
There's a function mallopt you can call that affects this tuning -- there's a bunch of info on the man page about it.
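As a rough illustration (the exact thresholds are internal to glibc and tuned dynamically), here is a minimal sketch that uses mallopt to move the crossover point yourself, forcing mmap for anything over 4 KiB:

    #include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD */
    #include <stdlib.h>

    int main(void)
    {
        /* Ask glibc to satisfy requests larger than 4 KiB with mmap instead of
           growing the brk heap; the usual starting default is 128 KiB. */
        mallopt(M_MMAP_THRESHOLD, 4 * 1024);

        void *small = malloc(1024);       /* likely from the brk heap / an arena */
        void *large = malloc(64 * 1024);  /* now served by an anonymous mmap     */

        free(large);                      /* mmap'd blocks are unmapped immediately */
        free(small);
        return 0;
    }

Running this under strace, or inspecting /proc/PID/maps before and after the calls, shows the large block appearing as its own anonymous mapping.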

Why Stack and Heap Size are not defined in User Manual of microcontroller?

I am quite new to embedded programming, so this may be an easy question for you.
I have seen the linker script/linker configuration files of different SDKs (e.g. IAR EWARM, Tasking) in which the sizes of the stack and heap are defined.
The size/range of RAM and flash of every microcontroller is also defined in the linker file. These are usually taken from the memory map in the user manual (the address ranges are provided in the user manual).
My question is: how are these stack and heap sizes calculated?
Can I pick any value for the stack/heap size, or are there criteria for that?
These are not defined in the microcontroller user manual because they are not hardware defined constraints. Rather they are application defined. It is a software dependent partitioning of memory, not hardware dependent.
Local, non-static variables, function arguments and call return addresses are generally stored on the stack; so the required stack size depends on the call depth and the number and size of local-variables and parameters for each function in a call-tree. The stack usage is dynamic, but there will be some worst-case path where the combination of variables and call-depth causes a peak usage.
On top of that, on many architectures you also have to account for interrupt handler stack usage, which is generally less deterministic, but still has a "worst case" of interrupt nesting and call depth. For this reason ISRs should generally be short, deterministic and use few variables.
Further, if you have a multi-threaded environment such as an RTOS scheduler, each thread will have a separate stack. Typically these thread stacks are statically allocated arrays or dynamically (heap) allocated, rather than defined by the linker script. The linker script normally defines only the system stack for the main() thread and interrupt/exception handlers.
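As a minimal sketch of such a statically allocated thread stack (the rtos_thread_create call here is hypothetical; real kernels such as FreeRTOS or Zephyr have their own equivalents):

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical RTOS call: create a thread running fn, using the caller-supplied stack. */
    extern void rtos_thread_create(void (*fn)(void *), void *arg,
                                   void *stack, size_t stack_size);

    #define WORKER_STACK_SIZE 1024u                 /* bytes; sized from worst-case call depth */

    static uint8_t worker_stack[WORKER_STACK_SIZE]; /* lives in .bss, not in the linker's stack region */

    static void worker_task(void *arg)
    {
        (void)arg;
        for (;;) {
            /* periodic work */
        }
    }

    void start_worker(void)
    {
        rtos_thread_create(worker_task, NULL, worker_stack, sizeof worker_stack);
    }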
Estimating the required stack usage is not always easy, but methods for doing so exist, using either static or dynamic analysis. Some examples (partly toolchain specific) at:
https://www.keil.com/support/man/docs/armclang_intro/armclang_intro_hla1474359990839.htm
https://www.keil.com/appnotes/docs/apnt_316.asp
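With GCC-based toolchains, a simple starting point (assuming your cross-compiler supports it) is the compiler's own per-function stack accounting:

    arm-none-eabi-gcc -c -fstack-usage main.c

This emits a main.su file listing the stack bytes each function needs; combined with a call graph it gives a rough worst-case estimate, though it cannot by itself account for recursion, calls through function pointers or interrupt nesting.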
Many default linker scripts automatically expand the heap to fill all remaining space available after static data and stack allocation. One notable exception is the Keil ARM-MDK toolchain, which requires you to explicitly set a heap size.
A linker script may also reserve memory regions for other purposes, especially if the memory is not homogeneous - for example, on-chip MCU memory will typically be faster to access than external RAM, and may itself be subdivided across different buses, so there might be a small segment useful for DMA on a separate bus, avoiding bus contention and yielding more deterministic execution.
The use of dynamic memory (heap) allocation in embedded systems needs to be carefully considered (or even banned, as @Lundin would suggest, though not all embedded systems are subject to the same constraints). There are a number of issues to consider, including:
Memory constraints - many embedded systems have very small memories; you have to consider the response, safety and functionality of the system in the event that an allocation request cannot be satisfied.
Memory leaks - your own code, your team colleagues' code and third-party code may not be of as high a quality as you would hope; you need to be certain that the entire code base is free of memory leaks (failing to deallocate/free memory appropriately).
Determinism - most heap allocators take a variable and non-deterministic length of time to allocate memory, and even freeing can be non-deterministic if it involves block consolidation.
Heap corruption - the owner of an allocated block can easily under/overrun an allocation and corrupt adjacent memory. Typically such memory contains the heap-management meta-data for that block or other blocks, and the actual data of other allocations. Corrupting this data has non-deterministic effects on other code, most often unrelated to the code that caused the error, such that it is common for the failure to occur some time later and in code unrelated to the event that caused it. Such bugs are hard to spot and resolve. If the heap meta-data is corrupted, the error is often only detected when a later heap operation (alloc/free) fails.
Efficiency - heap allocations made by malloc() et al. are normally 8-byte aligned and carry a block of prepended meta-data. Some implementations may add a "buffer" region to help detect overruns (especially in debug builds). As such, making numerous allocations of very small blocks can be a remarkably inefficient use of a scarce resource.
Common strategies in embedded system to deal with these issues include:
Disallowing any dynamic memory allocations. This is common in safety critical and MISRA compliant applications for example.
Allowing dynamic memory allocation only during initialisation, and disallowing free(). This may seem counterintuitive, but can be useful where an application itself is "dynamic" and perhaps in some configurations not all tasks or device drivers etc. are started, where static allocation might lead to a great deal of unused/unusable memory.
Replacing the default heap with a deterministic memory allocation scheme such as a fixed-block allocator (a minimal sketch follows this list). Often these have a separate API rather than overriding malloc/free, so they are not strictly a replacement; just a different solution.
Disallowing dynamic memory allocation in hard-real-time critical code. This addresses only the determinism issue, but in systems with large memories, carefully designed code, and perhaps MMU protection of allocations, there may be mitigations for the other issues.
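A minimal sketch of the fixed-block idea (names such as pool_alloc are made up for illustration; this is not a drop-in malloc replacement): a free list threaded through an array of equal-sized blocks gives O(1), fragmentation-free allocation:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE   32u    /* every allocation returns exactly this many bytes */
    #define BLOCK_COUNT  64u

    typedef union block {
        union block *next;              /* used while the block is on the free list */
        uint8_t      data[BLOCK_SIZE];  /* used while the block is allocated        */
    } block_t;

    static block_t  pool[BLOCK_COUNT];
    static block_t *free_list;

    void pool_init(void)
    {
        free_list = NULL;
        for (size_t i = 0; i < BLOCK_COUNT; ++i) {  /* chain every block onto the free list */
            pool[i].next = free_list;
            free_list = &pool[i];
        }
    }

    void *pool_alloc(void)          /* O(1); fails cleanly when the pool is exhausted */
    {
        block_t *b = free_list;
        if (b != NULL)
            free_list = b->next;
        return b;
    }

    void pool_free(void *p)         /* O(1); caller must pass a pointer from pool_alloc */
    {
        block_t *b = (block_t *)p;
        b->next = free_list;
        free_list = b;
    }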
Basically the stack size is picked depending on expected program size. For larger and more complex programs, you will want more stack size. It also depends on architecture, 32 bitters will generally consume slightly more memory than 8 and 16 bitters. The exact value is picked based on experience, though once you know exactly how much RAM your program actually uses, you can increase the stack size to use most of the unused memory.
It's also customary to map the stack so that it grows into a harmless area upon overflow, such as non-mapped memory or flash - ideally so that you get a hardware exception, "software interrupt" or similar when a stack overflow happens. You should never map it so that it grows into .data/.bss and overwrites other variables there.
As for the heap, the size is almost always picked to 0 and the segment is removed completely from the linker script. Heap allocation is banned in almost every microcontroller application.
Stack and heap are part of your program itself. Their sizes depend on how your program is structured and written and on how much memory it takes up; the remaining free memory can serve as stack or heap depending on how you set it up.
In the linker script you can define these values.

Understanding Memory Pools

To my understanding, a memory pool is a block, or multiple blocks, of memory allocated on the stack before runtime.
By contrast, to my understanding, dynamic memory is requested from the operating system and then allocated on the heap during run time.
// EDIT //
Memory pools are evidently not necessarily allocated on the stack, i.e. a memory pool can be used with dynamic memory.
Non dynamic memory is evidently also not necessarily allocated on the stack, as per the answer to this question.
The topics of 'dynamic vs. static memory' and 'memory pools' are thus not really related although the answer is still relevant.
From what I can tell, the purpose of a memory pool is to provide manual management of RAM, where the memory must be tracked and reused by the programmer.
This is theoretically advantageous for performance for a number of reasons:
Dynamic memory becomes fragmented over time
The CPU can parse static blocks of memory faster than dynamic blocks
When the programmer has control over memory, they can choose to free and rebuild data when it is best to do so, according to the specific program.
When multithreading, separate pools allow separate threads to operate independently without waiting for the shared heap (Davislor)
Is my understanding of memory pools correct? If so, why does it seem like memory pools are not used very often?
It seems this question is fraught with the XY problem and premature optimisation.
You should focus on writing legible code, then using a profiler to perform optimisations if necessary.
Is my understanding of memory pools correct?
Not quite.
... on the stack ...
... on the heap ...
Storage duration is orthogonal to the concept of pools; pools can be allocated to have any of the four storage durations (they are: static, thread, automatic and dynamic storage duration).
The C++ standard doesn't require that any of these go into a stack or a heap; it might be useful to think of all of them as though they go into the same place... after all, they all (commonly) go onto silicon chips!
... allocate ... before runtime ...
What matters is that the allocation of multiple objects occurs before (or at least less often than) those objects are first used; this saves having to allocate each object separately. I assume this is what you meant by "before runtime". When choosing the size of the allocation, the closer you get to the total number of objects required at any given time the less waste from excessive allocation and the less waste from excessive resizing.
If your OS isn't prehistoric, however, the advantages of pools will quickly diminish. You'd probably see this if you used a profiler before and after conducting your optimisation!
Dynamic memory becomes fragmented over time
This may be true for a naive operating system such as Windows 1.0. However, in this day and age objects with allocated storage duration are commonly stored in virtual memory, which periodically gets written to, and read back from disk (this is called paging). As a consequence, fragmented memory can be defragmented and objects, functions and methods that are more commonly used might even end up being united into common pages.
That is, paging forms an implicit pool (and cache prediction) for you!
The CPU can parse static blocks of memory faster than dynamic blocks
While objects allocated with static storage duration commonly are located on the stack, that's not mandated by the C++ standard. It's entirely possible that a C++ implementation may exist where-by static blocks of memory are allocated on the heap, instead.
A cache hit on a dynamic object will be just as fast as a cache hit on a static object. It just so happens that the stack is commonly kept in cache; you should try programming without the stack some time, and you might find that the cache has more room for the heap!
BEFORE you optimise you should ALWAYS use a profiler to measure the most significant bottleneck! Then you should perform the optimisation, and then run the profiler again to make sure the optimisation was a success!
This is not a machine-independent process! You need to optimise per-implementation! An optimisation for one implementation is likely a pessimisation for another.
If so, why does it seem like memory pools are not used very often?
The virtual memory abstraction described above, in conjunction with eliminating guess-work using cache profilers virtually eliminates the usefulness of pools in all but the least-informed (i.e. use a profiler) scenarios.
A customized allocator can help performance since the default allocator is optimized for a specific use case, which is infrequently allocating large chunks of memory.
But let's say for example in a simulator or game, you may have a lot of stuff happening in one frame, allocating and freeing memory very frequently. In this case the default allocator is not as good.
A simple solution can be allocating a block of memory for all the throwaway stuff happening during a frame. This block of memory can be overwritten over and over again, and the deletion can be deferred to a later time. e.g: end of a game level or whatever.
Memory pools are used to implement custom allocators.
One commonly used pool allocator is a linear allocator. It only keeps a pointer separating allocated from free memory. Allocating with it is just a matter of incrementing the pointer by the N bytes requested and returning its previous value, and deallocation is done by resetting the pointer to the start of the pool.
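A minimal sketch of such a linear (arena/frame) allocator, assuming a fixed backing buffer and rounding every request up to the maximum fundamental alignment:

    #include <cstddef>
    #include <cstdint>

    class LinearAllocator {
    public:
        LinearAllocator(void *buffer, std::size_t size)
            : begin_(static_cast<std::uint8_t *>(buffer)), offset_(0), size_(size) {}

        // Bump the offset by n bytes (rounded up to max_align_t) and return the old position.
        void *allocate(std::size_t n) {
            const std::size_t aligned = (n + alignof(std::max_align_t) - 1)
                                        & ~(alignof(std::max_align_t) - 1);
            if (offset_ + aligned > size_)
                return nullptr;                 // pool exhausted
            void *p = begin_ + offset_;
            offset_ += aligned;
            return p;
        }

        // "Free" everything at once, e.g. at the end of a frame or level.
        void reset() { offset_ = 0; }

    private:
        std::uint8_t *begin_;
        std::size_t   offset_;
        std::size_t   size_;
    };

    // Usage: carve per-frame scratch allocations out of one static buffer.
    static std::uint8_t frame_buffer[64 * 1024];
    static LinearAllocator frame_arena(frame_buffer, sizeof frame_buffer);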

Memory stability of a C++ application in Linux

I want to verify the memory stability of a C++ application I wrote and compiled for Linux.
It is a network application that responds to remote clients connecting at a rate of 10-20 connections per second.
Over a long run, memory rises to about 50 MB, even though the app is making calls to delete...
Investigation shows that Linux does not immediately free memory. So here are my questions :
How can I force Linux to free the memory I actually freed? At least I want to do this once to verify memory stability.
Otherwise, is there any reliable memory indicator that can report memory my app is actually holding?
What you are seeing is most likely not a memory leak at all. Operating systems and malloc/new heaps both do very complex accounting of memory these days. This is, in general, a very good thing. Chances are any attempt on your part to force the OS to free the memory will only hurt both your application performance and overall system performance.
To illustrate:
The Heap reserves several areas of virtual memory for use. None of it is actually committed (backed by physical memory) until malloc'd.
You allocate memory. The Heap grows accordingly. You see this in task manager.
You allocate more memory on the Heap. It grows more.
You free memory allocated in Step 2. The Heap cannot shrink, however, because the memory in #3 is still allocated, and Heaps are unable to compact memory (it would invalidate your pointers).
You malloc/new more stuff. This may get tacked on after memory allocated in step #3, because it cannot fit in the area left open by free'ing #2, or because it would be inefficient for the Heap manager to scour the heap for the block left open by #2. (depends on the Heap implementation and the chunk size of memory being allocated/free'd)
So is that memory at step #2 now dead to the world? Not necessarily. For one thing, it will probably get reused eventually, once it becomes efficient to do so. In cases where it isn't reused, the Operating System itself may be able to use the CPU's Virtual Memory features (the TLB) to "remap" the unused memory right out from under your application, and assign it to another application -- on the fly. The Heap is aware of this and usually manages things in a way to help improve the OS's ability to remap pages.
These are valuable memory management techniques that have the unmitigated side effect of rendering fine-grained memory-leak detection via Process Explorer mostly useless. If you want to detect small memory leaks in the heap, then you'll need to use runtime heap leak-detection tools. Since you mentioned that you're able to build on Windows as well, I will note that Microsoft's CRT has adequate leak-checking tools built-in. Instructions for use found here:
http://msdn.microsoft.com/en-us/library/974tc9t1(v=vs.100).aspx
There are also open-source replacements for malloc available for use with GCC/Clang toolchains, though I have no direct experience with them. I think on Linux Valgrind is the preferred and more reliable method for leak-detection anyway. (and in my experience easier to use than MSVCRT Debug).
I would suggest using valgrind with the memcheck tool, or any other profiling tool, to look for memory leaks.
from Valgrind's page:
Memcheck
detects memory-management problems, and is aimed primarily at
C and C++ programs. When a program is run under Memcheck's
supervision, all reads and writes of memory are checked, and calls to
malloc/new/free/delete are intercepted. As a result, Memcheck can
detect if your program:
Accesses memory it shouldn't (areas not yet allocated, areas that have been freed, areas past the end of heap blocks, inaccessible areas
of the stack).
Uses uninitialised values in dangerous ways.
Leaks memory.
Does bad frees of heap blocks (double frees, mismatched frees).
Passes overlapping source and destination memory blocks to memcpy() and related functions.
Memcheck reports these errors as soon as they occur, giving the source
line number at which it occurred, and also a stack trace of the
functions called to reach that line. Memcheck tracks addressability at
the byte-level, and initialisation of values at the bit-level. As a
result, it can detect the use of single uninitialised bits, and does
not report spurious errors on bitfield operations. Memcheck runs
programs about 10--30x slower than normal.
Massif
Massif is a heap profiler. It performs detailed heap profiling by
taking regular snapshots of a program's heap. It produces a graph
showing heap usage over time, including information about which parts
of the program are responsible for the most memory allocations. The
graph is supplemented by a text or HTML file that includes more
information for determining where the most memory is being allocated.
Massif runs programs about 20x slower than normal.
Using valgrind is as simple as running your application with the desired switches under valgrind:
valgrind --tool=memcheck ./myapplication -f foo -b bar
I very much doubt that anything beyond wrapping malloc and free [or new and delete ] with another function can actually get you anything other than very rough estimates.
One of the problems is that the memory that is freed can only be released if there is a long contiguous chunk of memory. What typically happens is that there are "little bits" of memory that are used all over the heap, and you can't find a large chunk that can be freed.
It's highly unlikely that you will be able to fix this in any simple way.
And by the way, your application is probably going to need those 50MB later on when you have more load again, so it's just wasted effort to free it.
(If the memory that you are not using is needed for something else, it will get swapped out, and pages that aren't touched for a long time are prime candidates, so if the system runs low on memory for some other tasks, it will still reuse the RAM in your machine for that space, so it's not sitting there wasted - it's just you can't use 'ps' or some such to figure out how much ram your program uses!)
As suggested in a comment: you can also write your own memory allocator, using mmap() to create a "chunk" to dole out portions from. If you have a section of code that does a lot of memory allocations, all of which will definitely be freed later, you can allocate all of those from a separate lump of memory, and when it has all been freed, put the mmap'd region back onto a "free mmap list"; when the list grows sufficiently large, free up some of the mmap allocations [this is an attempt to avoid calling mmap LOTS of times and then munmap again a few milliseconds later]. However, if you EVER let one of those memory allocations "escape" from your fenced-in area, your application will probably crash (or worse, not crash, but use memory belonging to some other part of the application, and you get a very strange result somewhere, such as one user getting to see network content that was supposed to be for another user!)
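A minimal sketch of that idea (Linux-specific; error handling and the "free mmap list" reuse are omitted, and the region_* names are made up for illustration): grab a chunk with mmap, bump-allocate from it, then release the whole thing with munmap once every allocation in it is finished with:

    #include <sys/mman.h>
    #include <stddef.h>

    struct region {
        unsigned char *base;
        size_t         size;
        size_t         used;
    };

    int region_create(struct region *r, size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return -1;
        r->base = (unsigned char *)p;
        r->size = size;
        r->used = 0;
        return 0;
    }

    void *region_alloc(struct region *r, size_t n)
    {
        n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment   */
        if (r->used + n > r->size)
            return NULL;                       /* region exhausted         */
        void *p = r->base + r->used;
        r->used += n;
        return p;
    }

    void region_destroy(struct region *r)      /* frees every allocation at once, */
    {                                          /* returning the pages to the kernel */
        munmap(r->base, r->size);
        r->base = NULL;
    }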
Use valgrind to find memory leaks : valgrind ./your_application
It will list where you allocated memory and did not free it.
I don't think it's a linux problem, but in your application. If you monitor the memory usage with « top » you won't get very precise usages. Try using massif (a tool of valgrind) : valgrind --tool=massif ./your_application to know the real memory usage.
As a more general rule to avoid leaks in C++ : use smart pointers instead of normal pointers.
Also in many situations, you can use RAII (http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization) instead of allocating memory with "new".
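For example (a trivial sketch), std::unique_ptr ties the lifetime of an allocation to a scope, so the delete happens automatically even on early returns or exceptions:

    #include <cstddef>
    #include <memory>
    #include <vector>

    struct Connection {
        std::vector<char> buffer;
        explicit Connection(std::size_t n) : buffer(n) {}
    };

    void handle_client()
    {
        // RAII: the Connection (and its buffer) is destroyed automatically when
        // ptr goes out of scope -- no explicit delete, no leak on early return.
        auto ptr = std::make_unique<Connection>(64 * 1024);

        // ... use ptr->buffer ...
    }   // memory released here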
It is not typical for an OS to release memory when you call free or delete. This memory goes back to the heap manager in the runtime library.
If you want to actually release memory, you can use brk. But that opens up a very large can of memory-management worms. If you directly call brk, you had better not call malloc. For C++, you can override new to use brk directly.
Not an easy task.
The latest dlmalloc() has a concept called an mspace (others call it a region). You can call malloc() and free() against an mspace. Or you can delete the mspace to free all memory allocated from the mspace at once. Deleting an mspace will free memory from the process.
If you create an mspace with a connection, allocate all memory for the connection from that mspace, and delete the mspace when the connection closes, you would have no process growth.
If you have a pointer in one mspace pointing to memory in another mspace, and you delete the second mspace, then as the language lawyers say "the results are undefined".
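Roughly, the per-connection pattern would look like the sketch below (this assumes dlmalloc is built with MSPACES enabled; the header name varies between distributions, so consult dlmalloc's own source for the exact declarations):

    #include "dlmalloc.h"   /* header name varies; dlmalloc must be built with MSPACES enabled */

    void handle_connection(void)
    {
        /* One private heap for this connection; 0 = default initial capacity,
           0 = no locking (this mspace is used from a single thread). */
        mspace msp = create_mspace(0, 0);

        void *request  = mspace_malloc(msp, 4096);
        void *response = mspace_malloc(msp, 16384);

        /* ... serve the connection, allocating only from msp ... */
        (void)request;
        (void)response;

        /* Frees every allocation made from msp and returns the memory to the OS. */
        destroy_mspace(msp);
    }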

Dynamic allocation in uClinux

I'm new to embedded development, and the big difference I see between traditional Linux and uClinux is that uClinux lacks an MMU.
From this article:
Without VM, each process must be located at a place in memory where it can be run. In the simplest case, this area of memory must be contiguous. Generally, it cannot be expanded as there may be other processes above and below it. This means that a process in uClinux cannot increase the size of its available memory at runtime as a traditional Linux process would.
To me, this sounds like all data must reside on the stack, and that heap allocation is impossible, meaning malloc() and/or "new" are out of the question... is that accurate? Perhaps there are techniques/libraries which allow for managing a "static heap" (i.e. a stack based area from which "dynamic" allocations can be requested)?
Or am I over thinking it? Or over simplifying it?
Under regular Linux, the programmer does not need to deal with physical resources. The kernel takes care of this, and a user space process sees only its own address space. As the stack grows, or malloc-type requests are made, the kernel will map free memory into the process's virtual address space.
In uClinux, the programmer must be more concerned with physical memory. The MMU and VM are not available, and all address space is shared with the kernel. When a user space program is loaded, the process is allocated physical memory pages for the text, stack, and variables. The process's program counter, stack pointer, and data/bss table pointers are set to physical memory addresses. Heap allocations (via malloc-type calls) are made from the same pool.
You will not have to get rid of heap allocation in programs. You will need to be concerned with some new issues. Since the stack cannot grow via virtual memory, you must size it correctly during linking to prevent stack overflows. Memory fragmentation becomes an issue because there's no MMU to consolidate smaller free pages. Errant pointers become more dangerous because they can now cause unintended writes to anywhere in physical memory.
It's been a while since I've worked with uCLinux (it was before it was integrated into the main tree), but I thought malloc was still available as part of the c library. There was a lot higher chance of doing Very Bad Things (tm) in memory since the heap wasn't isolated, but it was possible.
Yes, you can use malloc in user-space applications on uClinux, but then you may have to increase the stack size of the user-space application (before running the program, since the stack size is static), so that when malloc runs it will get the space it needs.
For example, with uClinux on an ARM Cortex, the ARM toolchain provides a command to inspect and change the stack size used by the binary of a user application; you can then transfer the binary to your embedded system and run it:
arm-uclinuxeabi-flthdr
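For example (the option names below are an assumption based on the elf2flt tools' flthdr; check your toolchain's flthdr help output to confirm them):

    arm-uclinuxeabi-flthdr -p myapp          (prints the current bFLT header, including the stack size)
    arm-uclinuxeabi-flthdr -s 16384 myapp    (sets the reserved stack size to 16384 bytes)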