I'm trying to implement a free list allocator that uses a red-black tree to optimize the best-fit search down to O(log N).
My strategy is that when a block is allocated, it's allocated with a Header where
struct Header {
std::size_t m_Size;
};
thus sizeof(Header) == sizeof(std::size_t)
This is done so that when deallocating I know how many bytes were allocated, and can give them back as a free node.
Now there's a problem with this solution: I have to align both the Header and the allocated block to the requested alignment. That means padding between the Header and the start of the allocated block, and padding between the end of the allocated block and the start of the next Header (so the next block's Header is already aligned).
So to illustrate the problem better, here's a red-black tree with nodes indicating free block sizes minus sizeof(Header).
Now let's assume a user allocates a block of size 16 with an alignment of 16:
allocate(16, 16);
Now best fit will yield us node 17.
But we can't count on it: let's assume node 17 is at address 0x8 and we're on x32, so sizeof(Header) == 4.
The Header will occupy 0x8-0xC. We now need padding so the block is aligned to 16 as requested; that padding is 4 bytes, so the allocated block starts at 0x10, which is aligned to 16. No padding is needed at the end of the block, since 0x10 + 16 is already aligned for the next block's Header.
The padding between the end of the allocated block and the start of the next block is easy to calculate beforehand, like so:
std::size_t headerPadding = size % sizeof(Header) != 0 ? sizeof(Header) - size % sizeof(Header) : 0;
So it's not dependent on the address of the free node.
But the padding between the end of the Header and the start of the allocated block IS dependent on the address of the free node, as I've demonstrated.
And for our example, the total needed size for this specific node is 4 (padding between Header and allocated block) + 16 (allocated block size) + 0 (padding for the next Header's alignment) = 20.
Obviously node 17 doesn't match.
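The arithmetic of this example can be written down as compile-time checks (the constants mirror the walkthrough: x32, sizeof(Header) == 4, free node at 0x8):

```cpp
#include <cstddef>

// Re-deriving the example's numbers at compile time for allocate(16, 16).
constexpr std::size_t kHeaderSize = 4;    // sizeof(Header) on the x32 example
constexpr std::size_t kNodeAddr   = 0x8;  // address of the free node
constexpr std::size_t kSize = 16, kAlignment = 16;

// Padding between the end of the Header (0xC) and the 16-aligned payload (0x10).
constexpr std::size_t kPayloadStart = kNodeAddr + kHeaderSize;
constexpr std::size_t kSizePadding =
    (kAlignment - kPayloadStart % kAlignment) % kAlignment;

// Padding after the payload so the next Header lands on a sizeof(Header) boundary.
constexpr std::size_t kHeaderPadding =
    kSize % kHeaderSize != 0 ? kHeaderSize - kSize % kHeaderSize : 0;

static_assert(kSizePadding == 4 && kHeaderPadding == 0, "matches the walkthrough");
static_assert(kSizePadding + kSize + kHeaderPadding == 20, "node 17 is too small");
```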
Now my strategy to fix this is as follows:
- Find the best fit.
- See if the best fit matches the size requirements as described.
- If yes, we're done.
- If not, get its successor and check if it matches the size requirements as described.
- If yes, we're done.
- If not, continue from the successor's parent, until we reach a node that satisfies the size requirements or we reach the original best fit again.
Here's a code describing the process:
void FreeTreeAllocator::Find(const std::size_t size, const std::size_t alignment, std::size_t& sizePadding, std::size_t& headerPadding, RBTree::Node*& curr)
{
    headerPadding = size % sizeof(Header) != 0 ? sizeof(Header) - size % sizeof(Header) : 0;
    RBTree::Node* best = m_Tree.SearchBest(m_Tree.m_Root, size + headerPadding);
    RBTree::Node* origin = best;
    std::vector<RBTree::Node*> visited; // rejected nodes; store Node* directly instead of casting to std::size_t
    while ((visited.empty() || visited.back() != origin) &&
           !IsNodeBigEnough(size, alignment, sizePadding, headerPadding, best))
    {
        RBTree::Node* successor = m_Tree.Successor(best);
        if (IsNodeBigEnough(size, alignment, sizePadding, headerPadding, successor))
        {
            best = successor;
            break;
        }
        else
        {
            // Climb from the successor's parent to the first ancestor we haven't rejected yet.
            best = successor->m_Parent;
            while (std::find(visited.begin(), visited.end(), best) != visited.end())
                best = best->m_Parent;
        }
        visited.push_back(best);
    }
    curr = best; // write the result back (the original version never did)
}
bool FreeTreeAllocator::IsNodeBigEnough(const std::size_t size, const std::size_t alignment, std::size_t& sizePadding, std::size_t& headerPadding, RBTree::Node* curr)
{
    if (curr == m_Tree.m_Nil)
        return false;
    // The payload would start right after a Header placed at this node's address.
    void* currentAddress = reinterpret_cast<char*>(curr) + sizeof(Header);
    std::size_t space = curr->m_Value;
    // std::align (from <memory>) returns nullptr and leaves its arguments
    // untouched when the aligned block doesn't fit, so check the result.
    if (std::align(alignment, size, currentAddress, space) == nullptr)
        return false;
    sizePadding = reinterpret_cast<char*>(currentAddress) - reinterpret_cast<char*>(curr) - sizeof(Header);
    return sizePadding + size + headerPadding <= curr->m_Value;
}
Now for the given allocation request:
allocate(16, 16);
and the example tree from the picture, then following the algorithm described, the search path will be:
17 -> 21 -> 22 -> 23 -> 25 -> 27
In the worst case this is O(log N + M), where M is the size of the right subtree of the original best-fit node.
One way to solve this is to make sizeof(Header) == sizeof(std::max_align_t). Then the padding between the Header and the start of the allocated block is always 0, because every request is already aligned without it, so we could really just do:
RBTree::Node* FreeTreeAllocator::Find(const std::size_t size, std::size_t& headerPadding)
{
    headerPadding = size % sizeof(Header) != 0 ? sizeof(Header) - size % sizeof(Header) : 0;
    // A plain best-fit search now suffices; no per-node alignment check is needed.
    return m_Tree.SearchBest(m_Tree.m_Root, size + headerPadding);
}
But that will waste a lot of memory compared to my proposed idea, where I settle for an O(log N + M) best-fit search.
Now why do I ask this?
Because I see the red-black tree presented as an optimization for a free list allocator, reducing the best-fit search to O(log N), yet I can't seem to make it truly O(log N). The flaw in my design, I guess, is that there has to be a Header for bookkeeping: how many bytes to give back to a free block when deallocating. I don't see a way to do without it. If I had no Header at all, or could avoid the per-node alignment padding it causes (by making sizeof(Header) == sizeof(std::max_align_t), or even sizeof(Header) == 1), then this could be solved with a simple O(log N) search.
I'm looking for ideas on how to solve this issue: how do other implementations do this in O(log N) while keeping internal fragmentation as low as possible?
UPDATE:
I ended up aligning the node addresses to alignof(std::max_align_t) - sizeof(Header). That keeps the Header aligned whether you're on x32 or x64 (remember that the Header is a single std::size_t), and whether alignof(std::max_align_t) is 8 or 16. The allocated payload then starts at an address aligned to alignof(std::max_align_t), just like malloc, so whatever is allocated is always aligned to the maximum alignment and no padding is required between Header and payload.
The only padding required is after the payload, to reach the next address aligned to alignof(std::max_align_t) - sizeof(Header), plus any padding needed to make the allocated block at least sizeof(RBTree::Node) bytes big (counting sizeof(Header) in the equation), so that on deallocation we can safely store an RBTree::Node without overwriting other data.
Without the padding between Header and payload, and with the padding that aligns the next block to alignof(std::max_align_t) - sizeof(Header), we can use the default O(log N) RBTree::Search, since the padding can be calculated beforehand from the block size alone, removing the start address of a particular node from the equation.
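As a sketch (C++14, with a stand-in constant for the node size the real allocator knows), the whole block size can now be computed from the request alone, independent of any node address:

```cpp
#include <cstddef>

// Sketch of the address-independent size computation described above.
// kNodeSize is a stand-in for sizeof(RBTree::Node).
constexpr std::size_t kMaxAlign   = alignof(std::max_align_t);
constexpr std::size_t kHeaderSize = sizeof(std::size_t);  // Header is one size_t
constexpr std::size_t kNodeSize   = 40;                   // stand-in value

constexpr std::size_t BlockSize(std::size_t payload)
{
    // Every block starts at kMaxAlign * x - kHeaderSize, so the total block
    // size must be a multiple of kMaxAlign for the next Header to line up.
    std::size_t total = kHeaderSize + payload;
    total = (total + kMaxAlign - 1) / kMaxAlign * kMaxAlign;
    // A freed block must be able to hold an RBTree::Node in place.
    if (total < kNodeSize)
        total = (kNodeSize + kMaxAlign - 1) / kMaxAlign * kMaxAlign;
    return total;
}
```

Since BlockSize depends only on the requested size, a single SearchBest(m_Root, BlockSize(size)) is a plain O(log N) tree search.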
The only remaining problem with optimizing this free list allocator to O(log N) is the deallocation part, more precisely the coalescing part.
What I can't solve yet is O(1) coalescing. I rearranged the RBTree::Node struct so that m_Parent comes first, and its LSB is always set to 1 (every function relying on m_Parent goes through a getter that masks it off). Then, for the block following the one being deallocated (reachable via the size in the Header), I check whether the first sizeof(std::size_t) bytes have their LSB set: if so it's a free node, otherwise it's a busy block with a Header (the Header's m_Size LSB is always 0 because sizes are padded for std::max_align_t alignment). The remaining problem is how to get to the previous memory block and know whether it's free or busy; that I cannot figure out yet and would love to hear suggestions.
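A sketch of that LSB test (the layout and field semantics follow the description above; this is illustrative, not the actual allocator code):

```cpp
#include <cstddef>
#include <cstring>

// The first sizeof(std::size_t) bytes of any block are either a busy Header's
// m_Size (LSB 0, since sizes are padded for std::max_align_t alignment) or a
// free RBTree::Node's tagged m_Parent pointer (LSB forced to 1).
bool IsFreeBlock(const void* block)
{
    std::size_t firstWord;
    std::memcpy(&firstWord, block, sizeof firstWord); // avoids strict-aliasing issues
    return (firstWord & 0x1) != 0;
}
```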
For the padding issue:
Make sure the size of your free list nodes is a power of 2 -- either 16 or 32 bytes -- and make sure the addresses of your free list nodes are all aligned at node_size * x - sizeof(Header) bytes.
Now all of your allocations will automatically be aligned at multiples of the node size, with no padding required.
Allocations requiring larger alignments will be rare, so it might be reasonable just to find the left-most block of appropriate size, and walk forward in the tree until you find a block that works.
If you need to optimize large-alignment allocations, though, then you can sort the blocks first by size, and then break ties by sorting on the number of zeros at the right of each node's allocation address (node address + sizeof(Header)).
Then, a single search in the tree will either find an exactly-fitting block that works, or a larger block. There is a good chance you'll be able to split the larger block in a way that satisfies the alignment requirement, but if not, then you can again skip ahead in the tree to find a block of that size that works, or an even larger block, etc.
The resulting search is faster, but still not guaranteed O(log N). To fix that, you can just give up after a limited number of skips forward, and revert to finding a block of requested_size + requested_alignment. If you find one of those, then it is guaranteed that you'll be able to split it up to satisfy your alignment constraint.
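The final guarantee can be sketched as follows: a block of requested_size + requested_alignment bytes always contains an aligned sub-block of the requested size (assuming a power-of-two alignment; names here are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Why size + alignment always works: rounding the block start up to the
// alignment skips fewer than `alignment` bytes, so at least `size` bytes
// remain after the aligned address.
char* AlignWithin(char* blockStart, std::size_t blockSize,
                  std::size_t size, std::size_t alignment)
{
    assert(blockSize >= size + alignment);
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(blockStart);
    std::uintptr_t aligned = (p + alignment - 1) & ~std::uintptr_t(alignment - 1);
    // aligned - p < alignment, hence aligned + size <= p + blockSize.
    return reinterpret_cast<char*>(aligned);
}
```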
there need to be a Header for book-keeping on how much bytes to give back to a free block when deallocating
On a 64-bit platform, one way to eliminate the header is to have your allocator manage arenas of power-of-2 object sizes: each arena serves one object size, and all arenas are the same size. Then map (reserve only) one large chunk of virtual memory, aligned to its own size (also a power of 2). That way, your object pointers are structured: the low-order bits are the object's offset within its arena, and the next bits are the arena number. Each arena maintains its own free list and allocated-object count, but the free list must initially cover only one page or one object (whichever is bigger), so that it doesn't commit page frames for the entire reserved virtual memory, which would run out of memory immediately.
For example, if you have 8 GiB arenas for objects of power-of-2 sizes from 8 to 65536 bytes, then the lower bits [0:32] are the object offset within the arena, and bits [33:36] are the arena number, which is also the log2 of the object size (arenas 0-2 are unused because those objects are too small to hold a free-list next pointer).
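Under that layout, decoding a pointer is pure bit arithmetic (a sketch with assumed names; `base` stands for the 2^33-aligned base address of the whole reservation):

```cpp
#include <cstddef>
#include <cstdint>

// 8 GiB arenas (2^33 bytes); arena k holds objects of size 2^k, so no
// per-object header is needed.
constexpr std::uintptr_t kArenaBits = 33; // log2(8 GiB)
constexpr std::uintptr_t kArenaMask = (std::uintptr_t(1) << kArenaBits) - 1;

std::uintptr_t ArenaIndex(std::uintptr_t p, std::uintptr_t base)
{
    return (p - base) >> kArenaBits;      // doubles as log2 of the object size
}
std::uintptr_t ObjectOffset(std::uintptr_t p, std::uintptr_t base)
{
    return (p - base) & kArenaMask;
}
std::size_t ObjectSize(std::uintptr_t p, std::uintptr_t base)
{
    return std::size_t(1) << ArenaIndex(p, base);
}
```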
The complete answer is my OP update and this answer.
I found a solution for coalescence in O(1).
My OP update describes how to achieve coalescing with the next block in O(1), but not with the previous block.
To do this I store an additional std::size_t m_PrevSize in both busy block Header struct and RBTree::Node struct as the first member.
When a block is allocated and becomes busy (either by a simple allocation or by block splitting), I move to the next block using the Header's m_Size property and set its first std::size_t bytes to 0. This tells the next memory block, whether busy or free, that its previous block is busy and there is nothing to merge with.
When a block is deallocated and I convert it to a free block, I do the same thing but set the next block's first std::size_t bytes to the RBTree::Node's m_Value property, which is basically how many bytes this free block has. Then, when deallocating, I check my own m_PrevSize property, and if it's not 0, go backward m_PrevSize bytes and do the merge.
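The backward merge described here can be sketched as follows (the struct stands in for the common prefix that the busy Header and RBTree::Node share, per the description above):

```cpp
#include <cstddef>

// Sketch of the O(1) backward lookup for coalescing.
struct BlockPrefix
{
    std::size_t m_PrevSize; // 0 if the previous block is busy
    std::size_t m_Size;     // this block's size (Header::m_Size / Node::m_Value)
};

BlockPrefix* PrevFreeBlock(BlockPrefix* block)
{
    if (block->m_PrevSize == 0)
        return nullptr; // previous block is busy: nothing to merge with
    // m_PrevSize holds the previous free block's size, so its start is
    // exactly that many bytes back.
    return reinterpret_cast<BlockPrefix*>(
        reinterpret_cast<char*>(block) - block->m_PrevSize);
}
```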
For example, say we have 10^7 32-bit integers. The memory needed to store these integers in an array is 32 * 10^7 / 8 = 40 MB. However, inserting 10^7 32-bit integers into a set takes more than 300 MB of memory. Code:
#include <iostream>
#include <set>
int main(int argc, const char * argv[]) {
std::set<int> aa;
for (int i = 0; i < 10000000; i++)
aa.insert(i);
return 0;
}
Other containers like map and unordered_set take even more memory in similar tests. I know that set is implemented with a red-black tree, but the data structure itself does not explain the high memory usage.
I am wondering about the reason behind this 5-8x memory usage relative to the original data, and about workarounds/alternatives for a more memory-efficient set.
Let's examine the std::set implementation in GCC (other compilers are not much different). std::set is implemented as a red-black tree in GCC. Each node has a pointer to the parent, left, and right nodes, plus a color enumerator (_S_red and _S_black). This means that besides the int (probably 4 bytes) there are 3 pointers (8 * 3 = 24 bytes on a 64-bit system) and one enumerator (since it comes before the pointers in _Rb_tree_node_base, it is padded to an 8-byte boundary, so it effectively takes an extra 8 bytes).
So far I have counted 24 + 8 + 4 = 36 bytes for each integer in the set. But since the node has to be aligned to 8 bytes, it has to be padded so that it is divisible by 8. Which means each node takes 40 bytes (10 times bigger than int).
But this is not all. Each such node is allocated by std::allocator. This allocator uses new to allocate each node. Since delete can't know how much memory to free, each node also has some heap-related meta-data. The meta-data should at least contain the size of the allocated block, which usually takes 8 bytes (in theory, it is possible to use some kind of Huffman coding and store only 1 byte in most cases, but I never saw anybody do that).
Considering everything, the total for each int node is 48 bytes. This is 12 times more than int. Every int in the set takes 12 times more than it would have taken in an array or a vector.
Your numbers suggest that you are on a 32-bit system, since your data takes only 300 MB. On a 32-bit system, pointers take 4 bytes. This makes 3 * 4 + 4 = 16 bytes of red-black-tree-related data per node, + 4 for the int + 4 for metadata. That totals 24 bytes for each int instead of 4, making it 6 times larger than a vector for a big set. The numbers actually suggest that the heap metadata takes 8 bytes rather than just 4 (maybe due to an alignment constraint).
So on your system, instead of 40MB (had it been std::vector), it is expected to take 280MB.
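The 40-byte figure from the 64-bit analysis can be double-checked with a mock of the node layout (illustrative only; this mirrors the fields described above, not GCC's actual _Rb_tree_node):

```cpp
#include <cstddef>

// A mock of the red-black tree node layout for std::set<int> on LP64.
struct MockRbNode
{
    int   color;  // the _S_red/_S_black enum; padded up to pointer alignment
    void* parent;
    void* left;
    void* right;
    int   value;  // the stored int; the struct is tail-padded to 8 bytes
};

static_assert(sizeof(void*) != 8 || sizeof(MockRbNode) == 40,
              "on LP64 each node body is 40 bytes, before heap metadata");
```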
If you want to save some peanuts, you can use a non-standard allocator for your sets. You can avoid the metadata overhead by using boost's Segregated storage node allocators. But that is not such a big win in terms of memory. It could boost your performance, though, since the allocators are simpler than the code in new and delete.
I would like to allocate some char buffers0, to be passed to an external non-C++ function, that have a specific alignment requirement.
The requirement is that the buffer be aligned to an N-byte1 boundary, but not to a 2N boundary. For example, if N is 64, then the pointer to this buffer p should satisfy ((uintptr_t)p) % 64 == 0 and ((uintptr_t)p) % 128 != 0 - at least on platforms where pointers have the usual interpretation as a plain address when cast to uintptr_t.
Is there a reasonable way to do this with the standard facilities of C++11?
If not, is there is a reasonable way to do this outside the standard facilities2 which works in practice for modern compilers and platforms?
The buffer will be passed to an outside routine (adhering to the C ABI but written in asm). The required alignment will usually be greater than 16, but less than 8192.
Over-allocation or any other minor wasted-resource issues are totally fine. I'm more interested in correctness and portability than wasting a few bytes or milliseconds.
Something that works on both the heap and stack is ideal, but anything that works on either is still pretty good (with a preference towards heap allocation).
0 This could be with operator new[] or malloc or perhaps some other method that is alignment-aware: whatever makes sense.
1 As usual, N is a power of two.
2 Yes, I understand an answer of this type causes language-lawyers to become apoplectic, so if that's you just ignore this part.
Logically, to satisfy "aligned to N, but not 2N", we align to 2N then add N to the pointer. Note that this will over-allocate N bytes.
So, assuming we want to allocate B bytes, if you just want stack space, alignas would work, perhaps.
alignas(N*2) char buffer[B+N];
char *p = buffer + N;
If you want a named type for such buffers, std::aligned_storage might do (note that the declaration below still has automatic storage, and heap-allocating an over-aligned type is not guaranteed to honor the alignment before C++17's aligned operator new):
typedef std::aligned_storage<B+N,N*2>::type ALIGNED_CHAR;
ALIGNED_CHAR buffer;
char *p = reinterpret_cast<char *>(&buffer) + N;
I've not tested either out, but the documentation suggests it should be OK.
You can use _aligned_malloc(nbytes,alignment) (in MSVC) or _mm_malloc(nbytes,alignment) (on other compilers) to allocate (on the heap) nbytes of memory aligned to alignment bytes, which must be an integer power of two.
Then you can use the trick from Ken's answer to avoid alignment to 2N:
void*ptr_alloc = _mm_malloc(nbytes+N,2*N);
void*ptr = static_cast<void*>(static_cast<char*>(ptr_alloc) + N);
/* do your number crunching */
_mm_free(ptr_alloc);
We must ensure to keep the pointer returned by _mm_malloc() for later de-allocation, which must be done via _mm_free().
I am now reading the source code of OPENCV, a computer vision open source library. I am confused with this function:
#define CV_MALLOC_ALIGN 16
void* fastMalloc( size_t size )
{
uchar* udata = (uchar*)malloc(size + sizeof(void*) + CV_MALLOC_ALIGN);
if(!udata)
return OutOfMemoryError(size);
uchar** adata = alignPtr((uchar**)udata + 1, CV_MALLOC_ALIGN);
adata[-1] = udata;
return adata;
}
/*!
Aligns pointer by the certain number of bytes
This small inline function aligns the pointer by the certain number of bytes by
shifting it forward by 0 or a positive offset.
*/
template<typename _Tp> static inline _Tp* alignPtr(_Tp* ptr, int n=(int)sizeof(_Tp))
{
return (_Tp*)(((size_t)ptr + n-1) & -n);
}
fastMalloc is used to allocate memory for a pointer; it invokes malloc and then alignPtr. I cannot understand well why alignPtr is called after the memory is allocated. My basic understanding is that it makes it faster for the machine to access the data. Can some references on this issue be found on the internet? For modern computers, is it still necessary to perform this operation? Any ideas will be appreciated.
Some platforms require certain types of data to appear on certain byte boundaries (e.g. some compilers require pointers to be stored on 4-byte boundaries).
This is called alignment, and it calls for extra padding within, and possibly at the end of, the object's data.
The compiler might reject improperly aligned data, or there could be a performance penalty when reading it (two blocks would have to be read to get at the same data).
EDITED IN RESPONSE TO COMMENT:-
A memory request by a program is generally handled by a memory allocator. With that background, let me try to explain what's going on here:
uchar* udata = (uchar*)malloc(size + sizeof(void*) + CV_MALLOC_ALIGN);
This allocates the requested amount plus extra room: sizeof(void*) bytes to stash the original pointer returned by malloc, and CV_MALLOC_ALIGN bytes of slack so that a properly aligned address can always be found inside the block.
uchar** adata = alignPtr((uchar**)udata + 1, CV_MALLOC_ALIGN);
This skips past the slot reserved for the stashed pointer and rounds the result up to the CV_MALLOC_ALIGN boundary, as explained above.
It allocates a block a bit bigger than it was asked for.
Then it sets adata to the next properly aligned address (skip one pointer-sized slot, then round up to the next aligned address).
Then it stores the original pointer just before the new address. This is later used to free the originally allocated block.
And then we return the new address.
This only makes sense if CV_MALLOC_ALIGN is a stricter alignment than malloc guarantees - perhaps a cache line?
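For completeness, the deallocation side can be sketched from the same layout (fastMallocSketch reproduces the question's fastMalloc so the pair is self-contained; fastFreeSketch is reconstructed from the layout, not copied from OpenCV, whose actual counterpart is fastFree):

```cpp
#include <cstddef>
#include <cstdlib>

typedef unsigned char uchar;

template <typename T>
static inline T* alignPtr(T* ptr, int n = (int)sizeof(T))
{
    return (T*)(((std::size_t)ptr + n - 1) & -n);
}

// fastMalloc from the question, with CV_MALLOC_ALIGN == 16.
void* fastMallocSketch(std::size_t size)
{
    uchar* udata = (uchar*)std::malloc(size + sizeof(void*) + 16);
    if (!udata)
        return nullptr;
    uchar** adata = alignPtr((uchar**)udata + 1, 16);
    adata[-1] = udata; // stash the original pointer one slot before adata
    return adata;
}

// The matching free: recover the stashed pointer and release the whole block.
void fastFreeSketch(void* ptr)
{
    if (ptr)
        std::free(((uchar**)ptr)[-1]);
}
```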
I want to allocate a vector of size 1765880295, so I used resize(1765880295), but the program stops running. A related symptom is that Code::Blocks stops responding.
What is wrong?
Although max_size() gives 4294967295, which is greater than 1765880295, the problem is the same even without resizing the vector.
Depending on what is stored in the vector -- for example, a 32-bit pointer or uint32, the size of the vector (number of elements * size of each element) will exceed the maximum addressable space on a 32-bit system.
The max_size is dependent on the implementation (some may have 1073741823 as their max_size). But even if your implementation supports a bigger number, the program will fail if there is not enough memory.
For example: if you have a vector<int>, and the sizeof(int) == 4 // bytes, we do the math, and...
1765880295 * 4 bytes = 7063521180 bytes ≈ 6.578 gibibytes
So you would need around 6.6 GiB of free memory to allocate that enormous vector.
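That arithmetic can be verified at compile time (assuming sizeof(int) == 4, as in the example):

```cpp
#include <cstdint>

// 1,765,880,295 ints at 4 bytes each.
constexpr std::uint64_t kBytes = 1765880295ULL * 4;
static_assert(kBytes == 7063521180ULL, "about 6.58 GiB");
```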
Let's say we have:
std::array <int,5> STDarr;
std::vector <int> VEC(5);
int RAWarr[5];
I tried to get their sizes as:
std::cout << sizeof(STDarr) + sizeof(int) * STDarr.max_size() << std::endl;
std::cout << sizeof(VEC) + sizeof(int) * VEC.capacity() << std::endl;
std::cout << sizeof(RAWarr) << std::endl;
The outputs are,
40
20
40
Are these calculations correct? Considering that I don't have enough memory for std::vector and no way of escaping dynamic allocation, what should I use? If I knew that std::array results in a lower memory requirement, I could change the program to make the array static.
These numbers are wrong. Moreover, I don't think they represent what you think they represent, either. Let me explain.
First the part about them being wrong. You, unfortunately, don't show the value of sizeof(int) so we must derive it. On the system you are using the size of an int can be computed as
size_t sizeof_int = sizeof(RAWarr) / 5; // => sizeof(int) == 8
because this is essentially the definition of sizeof(T): the number of bytes between the starts of two adjacent objects of type T in an array. This happens to be inconsistent with the number printed for STDarr: the class template std::array<T, n> is specified to have an array of n objects of type T embedded in it. Moreover, std::array<T, n>::max_size() is a constant expression yielding n. That is, we have:
40 // is identical to
sizeof(STDarr) + sizeof(int) * STDarr.max_size() // is bigger or equal to
sizeof(RAWarr) + sizeof_int * 5 // is identical to
40 + 40 // is identical to
80
That is 40 >= 80 - a contradiction.
Similarly, the second computation is also inconsistent with the third: the std::vector<int> holds at least 5 elements, its capacity() has to be at least its size(), and sizeof(std::vector<int>) itself is non-zero. That is, the following always has to be true:
sizeof(RAWarr) < sizeof(VEC) + sizeof(int) * VEC.capacity()
Anyway, all this is pretty much irrelevant to what your actual question seems to be: What is the overhead of representing n objects of type T using a built-in array of T, an std::array<T, n>, and an std::vector<T>? The answer to this question is:
A built-in array T[n] uses sizeof(T) * n.
An std::array<T, n> uses the same size as a T[n].
A std::vector<T>(n) needs some control data (the size, the capacity, and possibly an allocator) plus at least n * sizeof(T) bytes to represent its actual data. It may also choose a capacity() bigger than n.
In addition to these numbers, actually using any of these data structures may require addition memory:
All objects are aligned at an appropriate address, which may require additional bytes in front of the object.
When the object is allocated on the heap, the memory management system may add a couple of bytes on top of the memory made available. This may be just a word with the size, or whatever the allocation mechanism fancies. This memory may also live somewhere other than the allocated block, e.g. in a hash table.
OK, I hope this provided some insight. However, here comes the important message: if std::vector<T> isn't capable of holding the amount of data you have there are two situations:
You have extremely low memory and most of this discussion is futile because you need entirely different approaches to cope with the few bytes you have. This would be the case if you are working on extremely resource constrained embedded systems.
You have too much data, and using T[n] or std::array<T, n> won't be of much help, because the overhead we are talking about is typically less than 32 bytes.
Maybe you can describe what you are actually trying to do and why std::vector<T> is not an option.