mmap: enforce 64K alignment


I'm porting a project written (by me) for Windows to mobile platforms.
I need an equivalent of VirtualAlloc (+friends), and the natural one is mmap. However, there are two significant differences.
1. Addresses returned by VirtualAlloc are guaranteed to be multiples of the so-called allocation granularity (dwAllocationGranularity). Not to be confused with the page size, this number is arbitrary, and on most Windows systems it is 64K. In contrast, the address returned by mmap is only guaranteed to be page-aligned.
2. The reserved/allocated region may be freed at once by a call to VirtualFree, and there's no need to pass the allocation size (that is, the size used in VirtualAlloc). In contrast, munmap must be given the exact size of the region to unmap, i.e. it frees the given number of memory pages regardless of how they were allocated.
This poses problems for me. While I could live with (2), (1) is a real problem. I don't want to get into details, but assuming a much smaller allocation granularity, such as 4K, would lead to a serious efficiency degradation. This is because my code needs to put some information at every granularity boundary within the allocated regions, which imposes "gaps" within the contiguous memory region.
I need to solve this. I can think of fairly naive methods, such as allocating oversized regions so that they can be 64K-aligned and still have adequate size, or alternatively reserving huge regions of virtual address space and then allocating properly aligned memory regions from them (i.e. implementing a sort of aligned heap). But I wonder if there are alternatives, such as special APIs, flags, secret system calls or whatever.

(1) is actually quite easy to solve. As you note, munmap takes a size parameter, and this is because munmap is capable of partial deallocation. You can thus allocate a chunk of memory that is bigger than what you need, then deallocate the parts that aren't aligned.

If possible, use posix_memalign (which despite the name allocates rather than aligns memory); this allows you to specify an alignment, and the memory allocated can be released using free. You would need to check whether posix_memalign is provided by your target platform.

Related

Writing a memory manager and defragmenting memory

The idea is to write a memory manager that allocates a bunch of memory at a time to minimize malloc and free calls. I've tried writing this myself twice, but both times I ran into the problem of defragmenting memory.
You could just check every so often whether a block is empty, and delete it if it is. But let's say your blocks are 100 bytes each. First you allocate 20 bytes of memory; this creates a new 100-byte block because no blocks exist yet. Then you allocate 80 bytes, which fills the first block. Then you allocate another 20 bytes, which creates a second block because the first one is full. Finally you free the second allocation (80 bytes). That leaves you with two blocks of which only the first 20 bytes are used, so 100 bytes could be released by moving the 20 bytes from the second block into the first block and deleting the second block.
These are the problems I ran into:
you can't move the memory around because this means all pointers to that memory will have to be updated, and for that to happen you need to know their addresses, which you don't;
100 bytes is a very small block size. What if I want to store a very low-res (64,64) ARGB image in memory? This will use 16KB of memory, and moving all of that might be even slower than not writing a memory manager at all.
Is it even worth it writing a custom memory manager after all that?

Is it even worth it writing a custom memory manager after all that?
This is asking for an opinion, but I'll try to give a factual answer.
The memory allocators that come with most operating systems and language support libraries are generally very high quality and are designed to address the types of problems you encountered (fragmentation and performance) as well as others. They are about as good as general purpose memory allocators can be.
You can do (a little) better than the provided memory allocator if your application has a particular allocation pattern that can be exploited. That's rare, but you can generally take advantage of it by making something substantially simpler than a general purpose memory manager.
you can't move the memory around
True. Most modern systems don't even try to move memory around--they try to avoid fragmentation to begin with (typically by clustering similarly sized allocations).
Old systems (ones without virtual memory managers) sometimes used memory managers that had an extra layer of indirection. Instead of returning a pointer to the allocated memory, the allocator would return a "handle", which could be as simple as an index into a table maintained by the memory manager. When the user wanted to actually access the memory, they would "lock" it. The memory manager was free to move around memory that wasn't locked (e.g., to eliminate fragmentation) because the handles gave an extra level of indirection.
what if I want to store a very low-res (64,64) ARGB image
Most memory managers provide a range of sizes so a large allocation wouldn't be split across n smaller blocks. Most will punt very large allocations to the system allocator, which, on a virtual memory operating system, can generally solve the problem unless the process address space is overly fragmented.

Sequential memory allocation

I'm planning an application that allocates a lot of variables in memory. Unlike a "regular" application, I want this memory to be allocated in specific memory blocks of 4096 bytes. My allocated variables must be placed in memory sequentially, one after another, in order to fill the whole allocated memory.
For example, I allocate a region (4096 bytes) in memory and this region is ready for my further use. From then on, each time my application creates a new variable in memory (which a "regular" application would probably do with malloc), this variable is placed in the free space of my memory region.
This sequential memory allocation is similar to how array allocation works. But in my case I need an array that can contain many types of data (string, byte, int, ...).
One possible way to achieve this is pointer arithmetic. I want to avoid this method, since it may introduce a lot of bugs into my application.
Has anyone solved this problem before?
Thank you!

malloc() by no means guarantees that subsequently allocated blocks are at sequential memory addresses. Even worse, most implementations use a small number of bytes before and/or after the allocated block for 'housekeeping'. This means that, even if you're lucky and the addresses are sequential, there will be small gaps between the blocks, because the actual allocated blocks are slightly bigger to make space for those 'housekeeping' bytes.
As you suggest, you'll need to write some code yourself: a few wrapper functions built on malloc(), realloc(), ... You can hide all the logic in these functions, so the application code that uses them should be no more complex than it would be with a malloc() that did what you wanted.
Important questions: Why do you need to have these blocks adjacent to each other? What about freeing blocks?

How does a computer 'know' what memory is allocated?

When memory is allocated in a computer, how does it know which bytes are already occupied and can't be overwritten?
So if these are some bytes of memory that aren't being used:
[0|0|0|0]
How does the computer know whether they are or not? They could just be an integer that equals zero. Or it could be empty memory. How does it know?

That depends on the way the allocation is performed, but it generally involves manipulation of data belonging to the allocation mechanism.
When you allocate some variable in a function, the allocation is performed by decrementing the stack pointer. Via the stack pointer, your program knows that anything below the stack pointer is not allocated to the stack, while anything above the stack pointer is allocated.
When you allocate something via malloc() etc. on the heap, things are similar, but more complicated: all these allocators have some internal data structures which they never expose to the calling application, but which allow them to select which memory addresses to return on an allocation request. Some malloc() implementations, for instance, use a number of memory pools for small objects of fixed size, and maintain a linked list of free objects for each fixed size they track. That way, they can quickly pop one memory region off that list, only doing more expensive computations when they run out of regions to satisfy a certain request size.
In any case, each of the allocators has to request memory from the system kernel from time to time. This mechanism always works on complete memory pages (usually 4 KiB), and works via the syscalls brk() and mmap(). Again, the kernel keeps track of which pages are visible in which processes, and at which addresses they are mapped, so there is additional memory allocated inside the kernel for this.
These mappings are made available to the processor via the page tables, which the processor uses to resolve virtual memory addresses to physical addresses. So here, finally, you have some hardware involved in the process, but that is really far, far down in the guts of the mechanics, much below anything that a userspace process is ever able to see. Still, even the page tables are managed by the software of the kernel, not by the hardware; the hardware only interprets what the software writes into the page tables.

First of all, I have the impression that you believe there is some unoccupied memory that doesn't hold any value. That's wrong. You can imagine memory as a very large array in which each box contains a value, whether or not someone has put something in it. If a memory location was never written, it contains an arbitrary value.
Now to answer your question: it's not the computer (meaning the hardware) but the operating system. It holds, somewhere in its memory, tables recording which parts of the memory are used. Also, any byte of memory can be overwritten.

In general, you cannot tell by looking at content of memory at some location whether that portion of memory is used or not. Memory value '0' does not mean the memory is not used.
To tell what portions of memory are used you need some structure to tell you this. For example, you can divide memory into chunks and keep track of which chunks are used and which are not.

There are memory blocks, and each one is either occupied or not occupied. On the heap, there are very complex data structures which organise this. But the question is too broad to answer fully.

Should I also check realloc() when shrinking the allocated size of memory?

When you call realloc() you should check whether the function failed before assigning the returned pointer to the pointer passed as a parameter to the function...
I've always followed this rule.
Now is it necessary to follow this rule when you know for sure the memory will be truncated and not increased?
I've never ever seen it fail. Just wondered if I could save a couple instructions.

realloc may, at its discretion, copy the block to a new address regardless of whether the new size is larger or smaller. This may be necessary if the malloc implementation requires a new allocation to "shrink" a memory block (e.g. if the new size requires placing the memory block in a different allocation pool). This is noted in the glibc documentation:
In several allocation implementations, making a block smaller sometimes necessitates copying it, so it can fail if no other space is available.
Therefore, you must always check the result of realloc, even when shrinking. It is possible that realloc has failed to shrink the block because it cannot simultaneously allocate a new, smaller block.

Even if you realloc to a smaller size (please read realloc(3) and the POSIX description of realloc carefully), the underlying implementation may be doing the equivalent of malloc (of the new smaller size), followed by a memcpy (from the old zone to the new one), then free (of the old zone). Or it may do nothing (e.g. because some crude malloc implementations maintain a limited set of sizes, such as powers of two or 3 times a power of two, and the old and new size requirements fit in the same size class).
That malloc can fail. So realloc can still fail.
Actually, I usually don't recommend using realloc for that reason: just do the malloc, memcpy, free yourself.
Indeed, dynamic heap memory functions like malloc rarely fail. But when they do, chaos may ensue if you don't handle that. On Linux and some other POSIX systems you could use setrlimit(2) with RLIMIT_AS (e.g. via the bash ulimit builtin) to lower the limits for testing purposes.
You might want to study the source code of C memory management implementations. For example, musl libc (for Linux) is very readable code. On Linux, malloc is often built on top of mmap(2) (the C library may allocate a large chunk of memory using mmap and then manage smaller used and freed memory zones inside it).

Why is memory not reusable after allocating/deallocating a number of small objects?

While investigating a memory leak in one of our projects, I've run into a strange issue. Somehow, the memory allocated for objects (vector of shared_ptr to object, see below) is not fully reclaimed when the parent container goes out of scope and can't be used except for small objects.
The minimal example: when the program starts, I can allocate a single continuous block of 1.5 GB without problem. After I use the memory somewhat (by creating and destroying a number of small objects), I can no longer do big block allocation.
Test program:
#include <iostream>
#include <memory>
#include <vector>
using namespace std;
class BigClass
{
private:
    double a[10000];  // roughly 80 KB per object
};
// Creates and destroys 10000 BigClass objects (about 800 MB in total).
void TestMemory() {
    cout << "Performing TestMemory" << endl;
    vector<shared_ptr<BigClass>> list;
    for (int i = 0; i < 10000; i++) {
        shared_ptr<BigClass> p(new BigClass());
        list.push_back(p);
    }
}
// Allocates and immediately frees one contiguous 1.5 GB block.
void TestBigBlock() {
    cout << "Performing TestBigBlock" << endl;
    char* bigBlock = new char[1024*1024*1536];
    delete[] bigBlock;
}
int main() {
    TestBigBlock();
    TestMemory();
    TestBigBlock();
}
The problem also occurs when using plain pointers with new/delete or malloc/free in a loop, instead of shared_ptr.
The culprit seems to be that after TestMemory(), the application's virtual memory stays at 827125760 (regardless of the number of times I call it). As a consequence, there's no free VM region big enough to hold 1.5 GB. But I'm not sure why, since I'm definitely freeing the memory I used. Is it some "performance optimization" the CRT does to minimize OS calls?
Environment is Windows 7 x64 + VS2012 + 32-bit app without LAA

Sorry for posting yet another answer since I am unable to comment; I believe many of the others are quite close to the answer really :-)
Anyway, the culprit is most likely address space fragmentation. I gather you are using Visual C++ on Windows.
The C / C++ runtime memory allocator (invoked by malloc or new) uses the Windows heap to allocate memory. The Windows heap manager has an optimization in which it will hold on to blocks under a certain size limit, in order to be able to reuse them if the application requests a block of similar size later. For larger blocks (I can't remember the exact value, but I guess it's around a megabyte) it will use VirtualAlloc outright.
Other long-running 32-bit applications with a pattern of many small allocations have this problem too; the one that made me aware of the issue is MATLAB - I was using the 'cell array' feature to basically allocate millions of 300-400 byte blocks, causing exactly this issue of address space fragmentation even after freeing them.
A workaround is to use the Windows heap functions (HeapCreate() etc.) to create a private heap, allocate your memory through that (passing a custom C++ allocator to your container classes as needed), and then destroy that heap when you want the memory back. This also has the happy side-effect of being very fast compared with deleting a zillion blocks in a loop.
Re. "what is remaining in memory" to cause the issue in the first place: Nothing is remaining 'in memory' per se, it's more a case of the freed blocks being marked as free but not coalesced. The heap manager has a table/map of the address space, and it won't allow you to allocate anything which would force it to consolidate the free space into one contiguous block (presumably a performance heuristic).

There is absolutely no memory leak in your C++ program. The real culprit is memory fragmentation.
Just to be sure about the memory leak point, I ran this program under Valgrind, and the report did not show any leaks.
//Valgrind Report
mantosh#mantosh4u:~/practice$ valgrind ./basic
==3227== HEAP SUMMARY:
==3227== in use at exit: 0 bytes in 0 blocks
==3227== total heap usage: 20,017 allocs, 20,017 frees, 4,021,989,744 bytes allocated
==3227==
==3227== All heap blocks were freed -- no leaks are possible
Below are my responses to the points raised in the original question.
The culprit seems to be that after TestMemory(), the application's virtual memory stays at 827125760 (regardless of number of times I call it).
Yes, the real culprit is the hidden fragmentation caused during the TestMemory() function. To explain fragmentation, I have taken this snippet from Wikipedia:
"
when free memory is separated into small blocks and is interspersed by allocated memory. It is a weakness of certain storage allocation algorithms, when they fail to order memory used by programs efficiently. The result is that, although free storage is available, it is effectively unusable because it is divided into pieces that are too small individually to satisfy the demands of the application.
For example, consider a situation wherein a program allocates 3 continuous blocks of memory and then frees the middle block. The memory allocator can use this free block of memory for future allocations. However, it cannot use this block if the memory to be allocated is larger in size than this free block."
The paragraph above explains memory fragmentation very nicely. Some allocation patterns (such as frequent allocation and deallocation) lead to memory fragmentation, but the end impact (i.e. the 1.5 GB allocation failing) varies greatly between systems, because different OSes and heap managers have different strategies and implementations.
As an example, your program ran perfectly fine on my machine (Linux), whereas you encountered the memory allocation failure.
Regarding your observation that the VM size remains constant: the VM size shown in Task Manager is not directly proportional to the memory allocation calls we make. It mainly depends on how many bytes are in the committed state. When you allocate dynamic memory (using new/malloc) and do not write to or initialize those memory regions, they would not enter the committed state, and hence the VM size would not be affected. The VM size depends on many other factors and is somewhat complicated, so we should not rely on it completely when trying to understand our program's dynamic memory allocation.
As a consequence, there's no free VM region big enough to hold 1.5 GB.
Yes, due to fragmentation, there is no contiguous 1.5 GB region. Note that the total remaining (free) memory may well be more than 1.5 GB, but it is in a fragmented state, so no single contiguous region is big enough.
But I'm not sure why - since I'm definitely freeing the memory I used. Is it some "performance optimization" CRT does to minimize OS calls?
I have explained above why this may happen even though you have freed all your memory. To fulfil the program's request, the heap memory manager asks the OS virtual memory manager for additional memory. Whether that additional memory can be obtained depends on many other complex factors which are not easy to understand.
Possible Resolution of Memory Fragmentation
We should try to reuse allocated memory rather than allocating and freeing frequently. There may be patterns (such as allocating particular request sizes in a particular order) that drive the overall memory into a fragmented state. Reducing fragmentation may require a substantial design change in your program. This is a complex topic, and understanding the complete root cause requires knowledge of the memory manager's internals.
There are tools on Windows for this, although I am not very familiar with them. I did find one excellent SO post about which tools (on Windows) can be used to inspect the fragmentation status of your program yourself.
https://stackoverflow.com/a/1684521/2724703

This is not a memory leak. The memory you used was allocated by the C/C++ runtime. The runtime requests a large chunk of memory from the OS once, and each new you call is then served from that chunk. When you delete an object, the runtime does not return the memory to the OS immediately; it may hold on to it for performance.

There is nothing here which indicates a genuine "leak". The pattern of memory you describe is not unexpected. Here are a few points which might help to understand. What happens is highly OS dependent.
1. A program often has a single heap which can be extended or shrunk in length. It is however one contiguous memory area, so changing the size is just changing where the end of the heap is. This makes it very difficult to ever "return" memory to the OS, since even one tiny object in that space will prevent its shrinking. On Linux you can look up the function 'brk' (I know you're on Windows, but I presume it does something similar).
2. Large allocations are often done with a different strategy. Rather than putting them in the general purpose heap, an extra block of memory is created. When it is deleted this memory can actually be "returned" to the OS, since it's guaranteed that nothing is using it.
3. Large blocks of unused memory don't tend to consume a lot of resources. If you generally aren't using the memory any more, they might just get paged to disk. Don't presume that because some API function says you're using memory, you are actually consuming significant resources.
4. APIs don't always report what you think. Due to a variety of optimizations and strategies it may not actually be possible to determine how much memory is in use and/or available on a system at a particular moment. Unless you have intimate details of the OS you won't know for sure what those values mean.
The first two points can explain why a bunch of small blocks and one large block result in different memory patterns. The latter points indicate why this approach to detecting leaks is not useful. To detect genuine object-based "leaks" you generally need a dedicated profiling tool which tracks allocations.
For example, in the code provided:
1. TestBigBlock allocates and deletes an array. Assume this uses a special memory block, so the memory is returned to the OS.
2. TestMemory extends the heap for all the small objects, and never returns any heap to the OS. Here the heap is entirely available from the application's point of view, but from the OS's point of view it is assigned to the application.
3. TestBigBlock now fails: although it would use a special memory block, it shares the overall address space with the heap, and there just isn't enough left after step 2 is complete.
