I am currently reading the source code of OpenCV, an open-source computer vision library, and I am confused by this function:
#define CV_MALLOC_ALIGN 16
void* fastMalloc( size_t size )
{
    uchar* udata = (uchar*)malloc(size + sizeof(void*) + CV_MALLOC_ALIGN);
    if(!udata)
        return OutOfMemoryError(size);
    uchar** adata = alignPtr((uchar**)udata + 1, CV_MALLOC_ALIGN);
    adata[-1] = udata;
    return adata;
}
/*!
Aligns pointer by the certain number of bytes
This small inline function aligns the pointer by the certain number of bytes by
shifting it forward by 0 or a positive offset.
*/
template<typename _Tp> static inline _Tp* alignPtr(_Tp* ptr, int n=(int)sizeof(_Tp))
{
    return (_Tp*)(((size_t)ptr + n-1) & -n);
}
fastMalloc is used to allocate memory for a pointer: it invokes malloc and then alignPtr. I cannot understand why alignPtr is called after the memory has been allocated. My basic understanding is that doing so makes it faster for the machine to access the data through the pointer. Are there any references on this topic on the internet? Is this operation still necessary on modern computers? Any ideas will be appreciated.
Some platforms require certain types of data to appear on certain byte boundaries (e.g. some compilers require pointers to be stored on 4-byte boundaries).
This is called alignment, and it calls for extra padding within, and possibly at the end of, the object's data.
If the alignment requirement is not met, the program may fault on some platforms, or at best pay a performance penalty when reading that data (two memory blocks have to be read to fetch a single value).
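For example, you can query what your own platform demands for a few common types; the exact values are implementation-defined, and a typical 64-bit desktop reports 4, 8, 8 and 16:
#include <cstddef>
#include <cstdio>

int main()
{
    // Alignment requirements are implementation-defined; this just prints
    // what the current compiler/platform demands for a few common types.
    std::printf("alignof(int)              = %zu\n", alignof(int));
    std::printf("alignof(double)           = %zu\n", alignof(double));
    std::printf("alignof(void*)            = %zu\n", alignof(void*));
    std::printf("alignof(std::max_align_t) = %zu\n", alignof(std::max_align_t));
}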
EDITED IN RESPONSE TO COMMENT:
Memory requests made by a program are generally handled by a memory allocator. One such allocator is a fixed-size allocator, which returns chunks of a specified size even if the requested amount is smaller. With that background, let me try to explain what's going on here:
uchar* udata = (uchar*)malloc(size + sizeof(void*) + CV_MALLOC_ALIGN);
This allocates the requested size plus some extra bytes: room for one pointer (sizeof(void*)) and up to CV_MALLOC_ALIGN bytes of slack, so that a suitably aligned address can always be found inside the block.
uchar** adata = alignPtr((uchar**)udata + 1, CV_MALLOC_ALIGN);
This aligns the pointer to the specified boundary, as explained above.
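To make the arithmetic concrete, here is a standalone sketch of the same rounding trick that alignPtr uses (the function name here is mine):
#include <cstdio>
#include <cstdint>

// Standalone illustration of the rounding trick used by alignPtr:
// round an address up to the next multiple of n, where n is a power of two.
static void* align_up(void* p, std::uintptr_t n)
{
    return (void*)(((std::uintptr_t)p + n - 1) & ~(n - 1)); // ~(n-1) == -n for power-of-two n
}

int main()
{
    unsigned char buf[64];
    void* raw = buf + 3;                  // a deliberately misaligned address
    void* aligned = align_up(raw, 16);    // next 16-byte boundary at or after raw
    std::printf("raw = %p, aligned = %p\n", raw, aligned);
}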
It allocates a block a bit bigger than it was asked for.
Then it sets adata to the next properly aligned address after leaving room for one pointer (advance by sizeof(void*), then round up to the next properly aligned address).
Then it stores the original pointer just before the new address. I assume this is later used to free the originally allocated block.
And then it returns the new address.
This only makes sense if CV_MALLOC_ALIGN is a stricter alignment than malloc guarantees - perhaps a cache line?
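And indeed the original pointer is recovered from the slot stored at adata[-1] when the block is released. A sketch of that free path (OpenCV's fastFree is essentially this, minus its debug assertions):
#include <cstdlib>

// Sketch of the matching free path: the aligned pointer handed out by
// fastMalloc has the original malloc() result stored in the void* slot
// just before it, so it can be recovered and passed back to free().
void fastFreeSketch(void* ptr)
{
    if (ptr)
    {
        unsigned char* udata = ((unsigned char**)ptr)[-1]; // pointer stored at adata[-1]
        std::free(udata);                                  // free the whole original block
    }
}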
I'm trying to implement a free list allocator that uses a Red-Black Tree to optimize the best-fit search to O(log N).
My strategy is that when a block is allocated, it's allocated with a Header where
struct Header {
    std::size_t m_Size;
};
thus sizeof(Header) == sizeof(std::size_t)
This is done so that when deallocating I know how many bytes were allocated and can give them back as a free node.
Now there's a problem with this solution: I have to align the Header itself plus the allocated block to the requested alignment, so padding is needed both between the Header and the start of the allocated block, and between the end of the allocated block and the start of the next Header (so that the next block's Header is already aligned).
So to illustrate the problem better here's a Red Black Tree with nodes indicating free block sizes minus sizeof(Header)
Now let's assume a user allocates a block of size 16 with an alignment of 16:
allocate(16, 16);
Now best fit will yield us node 17.
But we can't count on it; let's assume node 17 is at address 0x8 and we're on x32, so sizeof(Header) = 4.
The Header will occupy 0x8-0xC. Now we need to add padding so that our block is aligned to 16 as requested; this padding is 4 bytes, so our allocated block will start at 0x10, which is aligned to 16. No padding is needed at the end of the block, since 0x10 + 16 (decimal) already lands on an aligned address for the next block's Header.
The padding between the end of the allocated block and the start of the next block is easy to calculate beforehand, like so:
std::size_t headerPadding = size % sizeof(Header) != 0 ? sizeof(Header) - size % sizeof(Header) : 0;
So it's not dependent on the address of the free node.
But the padding between the end of the Header and the start of the allocated block IS dependent on the address of the free node, as I've demonstrated.
And for our example, the total size needed for this specific node is 4 (padding between Header and allocated block) + 16 (allocated block size) + 0 (padding needed for the next free block's Header alignment) = 20.
Obviously node 17 isn't big enough.
Now my strategy to fix this is as follows:
- Find best fit
- See if best fit matches the size requirements as described
- If yes we're done
- If not, get its successor and check whether it satisfies the size requirements as described
- If yes, we're done
- If not, start over from the successor's parent until we reach a node that satisfies the size requirements or we reach the original best fit again
Here's a code describing the process:
// Needs <vector> and <algorithm> for std::vector and std::find.
void FreeTreeAllocator::Find(const std::size_t size, const std::size_t alignment, std::size_t& sizePadding, std::size_t& headerPadding, RBTree::Node*& curr)
{
    headerPadding = size % sizeof(Header) != 0 ? sizeof(Header) - size % sizeof(Header) : 0;
    RBTree::Node* best = m_Tree.SearchBest(m_Tree.m_Root, size + headerPadding);
    RBTree::Node* origin = best;
    std::vector<std::size_t> visited;
    // Keep trying successors until a node fits or we wrap around to the original best fit.
    while ((visited.empty() || visited.back() != (std::size_t)origin)
           && !IsNodeBigEnough(size, alignment, sizePadding, headerPadding, best))
    {
        RBTree::Node* successor = m_Tree.Successor(best);
        if (IsNodeBigEnough(size, alignment, sizePadding, headerPadding, successor))
        {
            best = successor;
            break;
        }
        else
        {
            // Climb from the successor's parent past any already-visited ancestors.
            best = successor->m_Parent;
            while (std::find(visited.begin(), visited.end(), (std::size_t)best) != visited.end())
                best = best->m_Parent;
        }
        visited.push_back((std::size_t)best);
    }
    curr = best;
}
bool FreeTreeAllocator::IsNodeBigEnough(const std::size_t size, const std::size_t alignment, std::size_t& sizePadding, std::size_t& headerPadding, RBTree::Node* curr)
{
    if (curr == m_Tree.m_Nil)
        return false;
    // Let std::align (from <memory>) compute where the payload would start in this node.
    void* currentAddress = reinterpret_cast<char*>(curr) + sizeof(Header);
    std::size_t space = curr->m_Value;
    std::align(alignment, size, currentAddress, space);
    sizePadding = reinterpret_cast<char*>(currentAddress) - reinterpret_cast<char*>(curr) - sizeof(Header);
    return sizePadding + size + headerPadding <= curr->m_Value;
}
Now for the given allocation request:
allocate(16, 16);
and the given example tree from the picture, following the algorithm described, the search path will be:
17 -> 21 -> 22 -> 23 -> 25 -> 27
In the worst case this is O(log N + M), where M is the size of the right sub-tree of the original best-fit node.
Now, one way this could be solved is if I make sizeof(Header) = sizeof(std::max_align_t). This way the padding between the Header and the start of the allocated block will always be 0, so we won't need that padding anymore, because every request will be aligned without it, and we could really just do:
void FreeTreeAllocator::Find(const std::size_t size, std::size_t& headerPadding, RBTree::Node*& curr)
{
    headerPadding = size % sizeof(Header) != 0 ? sizeof(Header) - size % sizeof(Header) : 0;
    curr = m_Tree.SearchBest(m_Tree.m_Root, size + headerPadding);
}
But that will waste a lot of memory compared to my proposed idea, where I settle for an O(log N + M) best-fit search.
Now why do I ask this?
Because I see the Red-Black Tree presented as an optimization that reduces the free list allocator's best-fit search to O(log N), yet I can't seem to make it truly O(log N). The flaw in my design, I guess, is that there has to be a Header for book-keeping (how many bytes to give back as a free block when deallocating), and I don't see a way to do without it. If I could avoid having a Header at all, or avoid the per-node alignment padding it causes (by making sizeof(Header) = sizeof(std::max_align_t), or even sizeof(Header) = 1), then this could be solved with a simple O(log N) search.
I'm looking for ideas on how to solve this issue. How do other implementations do this in O(log N) while keeping internal fragmentation as low as possible?
UPDATE:
I ended up aligning the node addresses to alignof(std::max_align_t) - sizeof(Header). That keeps the Header aligned whether you're on x32 or x64 (remember that the Header holds a single std::size_t) and whether alignof(std::max_align_t) is 8 or 16. It also makes the allocated payload start at an address aligned to alignof(std::max_align_t), just like malloc, so whatever is allocated is always aligned to the maximum alignment and no padding is needed between the Header and the payload.
The only padding required is after the payload: enough to reach the next address that is aligned to alignof(std::max_align_t) - sizeof(Header), plus whatever is needed to make the whole block (sizeof(Header) included in the equation) at least sizeof(RBTree::Node) bytes, so that on deallocation we can safely store an RBTree::Node without overwriting other data.
Without padding between the Header and the payload, and with the trailing padding that aligns the next block to alignof(std::max_align_t) - sizeof(Header), we can simply use the default O(log N) RBTree::Search, because the padding can be calculated beforehand from the size of the block alone, leaving the start address of a particular node out of the equation.
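A minimal sketch of that computation (the helper name and the minBlock parameter are mine, not the allocator's): the whole block, Header plus payload plus trailing padding, just has to be a multiple of alignof(std::max_align_t) and at least big enough to later hold an RBTree::Node.
#include <algorithm>
#include <cstddef>

// Hypothetical helper: total bytes to carve out of a free block so that the
// next block's Header again lands at a max_align_t boundary minus sizeof(Header).
std::size_t totalBlockSize(std::size_t payload, std::size_t headerSize,
                           std::size_t minBlock /* >= sizeof(RBTree::Node), multiple of A */)
{
    const std::size_t A = alignof(std::max_align_t);
    std::size_t total = headerSize + payload;
    total = (total + A - 1) / A * A;   // round up to a multiple of A
    return std::max(total, minBlock);  // keep room to store a free-list node on deallocation
}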
The only remaining problem in optimizing this free list allocator to O(log N) is the deallocation part, more precisely the coalescing part.
What I can't solve now is how to do O(1) coalescing. I rearranged the RBTree::Node struct so that m_Parent is the first member and I always keep its LSB set to 1 (every function relying on m_Parent goes through a getter that clears the bit). Then, for the block that follows the one being deallocated (reachable via the size stored in the Header), I can check whether its first sizeof(std::size_t) bytes have the LSB set: if so it's a free node, if not it's a busy block with a Header (a Header's m_Size always has LSB 0 because sizes are padded for std::max_align_t alignment). The remaining problem is how to get to the previous memory block and know whether it's free or busy; that I cannot figure out yet and would love to hear suggestions.
For the padding issue:
Make sure the size of your free list nodes is a power of 2 -- either 16 or 32 bytes -- and make sure the addresses of your free list nodes are all aligned at node_size * x - sizeof(Header) bytes.
Now all of your allocations will automatically be aligned at multiples of the node size, with no padding required.
Allocations requiring larger alignments will be rare, so it might be reasonable just to find the left-most block of appropriate size, and walk forward in the tree until you find a block that works.
If you need to optimize large-alignment allocations, though, then you can sort the blocks first by size, and then break ties by sorting on the number of zeros at the right of each node's allocation address (node address + sizeof(Header)).
Then, a single search in the tree will either find an exactly-fitting block that works, or a larger block. There is a good chance you'll be able to split the larger block in a way that satisfies the alignment requirement, but if not, then you can again skip ahead in the tree to find a block of that size that works, or an even larger block, etc.
The resulting search is faster, but still not guaranteed O(log N). To fix that, you can just give up after a limited number of skips forward, and revert to finding a block of requested_size + requested_alignment. If you find one of those, then it is guaranteed that you'll be able to split it up to satisfy your alignment constraint.
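A sketch of such a tie-breaking key (using C++20's std::countr_zero; a shift loop works on older standards, and the helper name is mine):
#include <bit>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical ordering key: primary key is the block size, secondary key is
// the number of trailing zero bits of the usable address (node + header size),
// i.e. how strictly that particular block happens to be aligned.
std::pair<std::size_t, int> blockKey(std::size_t blockSize, const void* node,
                                     std::size_t headerSize)
{
    std::uintptr_t usable = reinterpret_cast<std::uintptr_t>(node) + headerSize;
    return { blockSize, std::countr_zero(usable) };
}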
there needs to be a Header for book-keeping on how many bytes to give back as a free block when deallocating
On a 64-bit platform, one way to eliminate the header is to make your allocator manage arenas of power-of-2 object sizes. Each arena serves one object size and all arenas are of the same size. Then reserve (map without committing) one large chunk of virtual memory in such a way that it is aligned to its own size (which is also a power of 2). That way your pointers to objects are structured: the lower-order bits are the offset of the object within its arena, and the next bits are the arena number. For each arena you need to maintain a free list and an allocated-object count, but the free list must initially cover only one page or one object (whichever is bigger), so that page frames are not committed for the entire reserved virtual memory, which would run out of memory immediately.
For example, if you have 8 GiB arenas for objects of power-of-2 sizes from 8 to 65536 bytes, then the lower bits [0:32] are the object offset within the arena and bits [33:36] are the arena number, which is also the log2 of the object size (arenas 0 through 2 are unused because those object sizes are not big enough to hold a free-list next pointer).
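As an illustration of why no per-object header is needed with this layout, the object size can be read straight out of a pointer (the base address and names below are assumptions for the sketch):
#include <cstddef>
#include <cstdint>

constexpr unsigned kArenaShift = 33;   // 2^33 bytes = 8 GiB per arena
constexpr std::uintptr_t kOffsetMask = (std::uintptr_t(1) << kArenaShift) - 1;

// base = start of the first arena (the whole reservation is aligned to its own size).
std::size_t objectSizeOf(const void* p, std::uintptr_t base)
{
    std::uintptr_t v = reinterpret_cast<std::uintptr_t>(p) - base;
    unsigned arenaIndex = static_cast<unsigned>(v >> kArenaShift); // arena number
    return std::size_t(1) << arenaIndex;  // arena i serves objects of 2^i bytes
}

std::size_t offsetInArena(const void* p, std::uintptr_t base)
{
    return static_cast<std::size_t>((reinterpret_cast<std::uintptr_t>(p) - base) & kOffsetMask);
}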
The complete answer is my OP update and this answer.
I found a solution for coalescence in O(1).
My OP update describes how we can achieve coalescing with the next block in O(1), but not how to achieve O(1) coalescing with the previous block.
To do this I store an additional std::size_t m_PrevSize as the first member of both the busy block's Header struct and the RBTree::Node struct.
When a block is allocated and becomes busy (either by a simple allocation or by block splitting), I simply move to the next block using the Header's m_Size property and set its first std::size_t bytes to 0. This tells the next memory block, whether busy or free, that the previous block is busy and there is nothing to merge with.
When a block is deallocated and converted to a free block, I do the same thing but set those first std::size_t bytes to the RBTree::Node's m_Value property, which is basically how many bytes this free block has. Then, when deallocating, I can check the block's own m_PrevSize property, and if it's not 0, go backward m_PrevSize bytes and do the merge.
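A rough sketch of that backward merge (the field and function names here are schematic, not the allocator's actual code):
#include <cstddef>

// Both a busy block's Header and a free block's RBTree::Node begin with the
// same std::size_t field: 0 if the previous block is busy, otherwise the
// previous (free) block's size, so the previous block is reachable in O(1).
struct BlockPrefix
{
    std::size_t m_PrevSize;
};

void coalesceBackward(void*& block, std::size_t& blockSize)
{
    auto* prefix = static_cast<BlockPrefix*>(block);
    if (prefix->m_PrevSize != 0)  // previous block is free: merge with it
    {
        block = static_cast<char*>(block) - prefix->m_PrevSize; // jump to its start
        blockSize += prefix->m_PrevSize;
        // ...the previous block's node must also be removed from the tree
        // before the merged block is reinserted.
    }
}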
I would like to allocate some char buffers0, to be passed to an external non-C++ function, that have a specific alignment requirement.
The requirement is that the buffer be aligned to an N-byte1 boundary, but not to a 2N boundary. For example, if N is 64, then the pointer to this buffer p should satisfy ((uintptr_t)p) % 64 == 0 and ((uintptr_t)p) % 128 != 0 - at least on platforms where pointers have the usual interpretation as a plain address when cast to uintptr_t.
Is there a reasonable way to do this with the standard facilities of C++11?
If not, is there a reasonable way to do this outside the standard facilities2 which works in practice for modern compilers and platforms?
The buffer will be passed to an outside routine (adhering to the C ABI but written in asm). The required alignment will usually be greater than 16, but less than 8192.
Over-allocation or any other minor wasted-resource issues are totally fine. I'm more interested in correctness and portability than wasting a few bytes or milliseconds.
Something that works on both the heap and stack is ideal, but anything that works on either is still pretty good (with a preference towards heap allocation).
0 This could be with operator new[] or malloc or perhaps some other method that is alignment-aware: whatever makes sense.
1 As usual, N is a power of two.
2 Yes, I understand an answer of this type causes language-lawyers to become apoplectic, so if that's you just ignore this part.
Logically, to satisfy "aligned to N, but not 2N", we align to 2N then add N to the pointer. Note that this will over-allocate N bytes.
So, assuming we want to allocate B bytes, if you just want stack space, alignas would work, perhaps.
alignas(N*2) char buffer[B+N];
char *p = buffer + N;
If you want heap space, std::aligned_storage might do:
typedef std::aligned_storage<B+N,N*2>::type ALIGNED_CHAR;
ALIGNED_CHAR buffer;
char *p = reinterpret_cast<char *>(&buffer) + N;
I've not tested either out, but the documentation suggests it should be OK.
You can use _aligned_malloc(nbytes,alignment) (in MSVC) or _mm_malloc(nbytes,alignment) (on other compilers) to allocate (on the heap) nbytes of memory aligned to alignment bytes, which must be an integer power of two.
Then you can use the trick from Ken's answer to avoid alignment to 2N:
void*ptr_alloc = _mm_malloc(nbytes+N,2*N);
void*ptr = static_cast<void*>(static_cast<char*>(ptr_alloc) + N);
/* do your number crunching */
_mm_free(ptr_alloc);
We must ensure to keep the pointer returned by _mm_malloc() for later de-allocation, which must be done via _mm_free().
float* tempBuf = new float[maxVoices]();
Will the above result in
1) memory that is 16-byte aligned?
2) memory that is confirmed to be contiguous?
What I want is the following:
float tempBuf[maxVoices] __attribute__ ((aligned));
but as heap memory, that will be effective for Apple Accelerate framework.
Thanks.
The memory will be aligned for float, but not necessarily for CPU-specific SIMD instructions. I strongly suspect on your system sizeof(float) < 16, though, which means it's not as aligned as you want. The memory will be contiguous: &A[i] == &A[0] + i.
If you need something more specific, new std::aligned_storage<Length, Alignment>::type will return a suitable region of memory, presuming of course that you did in fact pass a more specific alignment.
Another alternative is struct alignas(16) FourFloats { float floats[4]; }; - this may map more naturally to the framework. You'd now need to do new FourFloats[(maxVoices+3)/4].
Yes, new returns contiguous memory.
As for alignment, no such alignment guarantee is provided. Try this:
template<class T, size_t A>
T* over_aligned(size_t N){
    static_assert(A <= alignof(std::max_align_t),
        "Over-alignment is implementation-defined."
    );
    static_assert( std::is_trivially_destructible<T>{},
        "Function does not store number of elements to destroy"
    );
    using Helper=std::aligned_storage_t<sizeof(T), A>;
    // Allocate enough Helper elements to cover N*sizeof(T) bytes.
    auto* ptr = new Helper[(N*sizeof(T)+sizeof(Helper)-1)/sizeof(Helper)];
    return new(ptr) T[N];
}
Use:
float* f = over_aligned<float,16>(37);
Makes an array of 37 floats, with the buffer aligned to 16 bytes. Or it fails to compile.
If the assert fails, it can still work. Test and consult your compiler documentation. Once convinced, put compiler-specific version guards around the static assert, so when you change compilers you can test all over again (yay).
If you want real portability, you have to fall back to std::align, manage the resource separately from the data pointer, and, if and only if T has a non-trivial destructor, count the number of T and store that count "before" the start of your buffer. It gets pretty silly.
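For what it's worth, here is a rough sketch of that std::align fallback for a trivially destructible T (names are mine; the raw pointer must be kept and handed to std::free later):
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>

// Over-allocate, let std::align pick an aligned start inside the buffer, and
// construct the elements there. The caller owns `raw` and must std::free it.
template <class T>
T* over_aligned_fallback(std::size_t n, std::size_t alignment, void*& raw)
{
    static_assert(std::is_trivially_destructible<T>::value,
                  "sketch does not track an element count for destruction");
    std::size_t bytes = n * sizeof(T);
    std::size_t space = bytes + alignment;          // slack for the alignment shift
    raw = std::malloc(space);
    if (!raw) return nullptr;
    void* p = raw;
    if (!std::align(alignment, bytes, p, space)) { std::free(raw); return nullptr; }
    T* first = static_cast<T*>(p);
    for (std::size_t i = 0; i < n; ++i) new (first + i) T(); // placement-new each element
    return first;
}
Usage would look like void* raw; float* f = over_aligned_fallback<float>(37, 16, raw); followed by std::free(raw) when done.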
It's guaranteed to be aligned properly with respect to the type you're allocating. So if it's an array of 4 floats (each supposedly 4 bytes), it's guaranteed to provide a usable sequence of floats. It's not guaranteed to be aligned to 16 bytes.
Yes, it's guaranteed to be contiguous (otherwise what would be the meaning of a single pointer?)
If you want it to be aligned to some K bytes, you can do it manually with std::align. See MSalter's answer for a more efficient way of doing this.
If tempBuf is not nullptr, then the C++ standard guarantees that tempBuf points to the zeroth element of at least maxVoices contiguous floats.
(Don't forget to call delete[] tempBuf once you're done with it.)
I have a few questions:
1) Why, when I create more than two dynamically allocated variables, is the difference between their memory addresses 16 bytes? (I thought one of the advantages of using dynamic variables is saving memory, since you can delete an unused variable and free its memory.) But if the difference between two dynamic variables is 16 bytes even for a short integer, then there is a lot of memory I won't benefit from.
2) Creating a dynamically allocated variable using the new operator:
int x;
cin >> x;
int* a = new int(3);
int y = 4;
int z = 1;
In the example above, what is the flow of execution of this program? Is it going to store all the variables like x, a, y and z on the stack and then store the value 3 at the address that a points to?
3) Creating a dynamically allocated array:
int x;
cin >> x;
int* array = new int[x];
int y = 4;
int z = 1;
And the same question here.
4) Does the size of the heap (free store) depend on how much memory I'm using in the code area, the stack area, and the global area?
Storing small values like integers on the heap is fairly pointless because you use the same or more memory to store the pointer. The 16 byte alignment is just so the CPU can access the memory as efficiently as possible.
Yes, although the stack variables might be allocated to registers; that is up to the compiler.
Same as 2.
The size of the heap is controlled by the operating system and expanded as necessary as you allocate more memory.
Yes, in the examples, a and array are both "stack" variables. The data they point to is not.
I put stack in quotes because we are not going to concern ourselves with hardware detail here, but just the semantics. They have the semantics of stack variables.
The chunks of heap memory which you allocate need to store some housekeeping data so that the allocator (the code working behind new) can do its job. This data usually includes the chunk length and the address of the next allocated chunk, among other things, depending on the actual allocator.
In your case, the housekeeping data is stored directly in front of (and maybe also behind) the actual allocated chunk. This, plus likely some alignment, is the reason for the 16-byte gap you observe.
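You can observe this yourself; the exact gap is implementation-defined, but 16 bytes is common on 64-bit desktop platforms:
#include <iostream>

int main()
{
    // Each allocation carries allocator bookkeeping and alignment padding, so
    // even two heap-allocated shorts typically end up 16 or 32 bytes apart.
    short* a = new short(1);
    short* b = new short(2);
    std::cout << static_cast<void*>(a) << '\n' << static_cast<void*>(b) << '\n';
    delete a;
    delete b;
}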
I'm programming a system which has a massive amount of redundant data that needs to be kept in memory and accessed with as little latency as possible (uncompressed, the data is guaranteed to take at least 1 GB of memory).
One such method I thought of is creating a container class like the following:
class Chunk{
public:
    Chunk(){ ... };
    ~Chunk() { /*carefully delete elements according to mask*/ };
    getElement(int index);
    setElement(int index);
private:
    unsigned char mask; // on bit == data is not-redundant, array is 8x8, 64 elements
    union{
        Uint32 redundant; // all 8 elements are this value if mask bit == 0
        Uint32 * ptr;     // pointer to 8 allocated elements if mask bit == 1
    }array[8];
};
My question is: are there any unseen consequences of using a union to switch between a Uint32 primitive and a Uint32* pointer?
This approach should be safe on all C++ implementations.
Note, however, that if you know your platform's memory alignment requirements, you may be able to do better than this. In particular, if you know that memory allocations are aligned to 2 bytes or greater (many platforms use 8 or 16 bytes), you can use the lower bit of the pointer as a flag:
class Chunk {
    //...
    uintptr_t ptr;
};

// In your get function:
if ( (ptr & 1) == 0 ) {
    return ((uint32_t *)ptr)[index];
} else {
    return *((uint32_t *)(ptr & ~(uintptr_t)1)); // clear the tag bit before dereferencing
}
You can further reduce space usage by using a custom allocation method (with placement new) and placing the pointer immediately after the class, in a single memory allocation (ie, you'll allocate room for Chunk and either the mask or the array, and have ptr point immediately after Chunk). Or, if you know most of your data will have the low bit off, you can use the ptr field directly as the fill-in value:
} else {
    return (uint32_t)(ptr & ~(uintptr_t)1); // clear the tag bit; the remaining bits are the value
}
If it's the high bit that's usually unused, a bit of bit shifting will work:
} else {
return ptr >> 1;
}
Note that this approach of tagging pointers is not portable. It is only safe if you can ensure your memory allocations will be properly aligned. On most desktop OSes, this will not be a problem - malloc already ensures some degree of alignment; on Unixes, you can be absolutely sure by using posix_memalign. If you can obtain such a guarantee for your platform, though, this approach can be quite effective.
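For instance, on a POSIX system you can request the alignment explicitly, so the spare low bit is guaranteed (a sketch; the function name and the 16-byte figure are mine):
#include <cstdint>
#include <cstdlib>   // posix_memalign is a POSIX extension declared here

// Allocate `count` uint32_t values at a 16-byte boundary; the low bit of the
// returned address is therefore guaranteed to be zero and usable as a tag.
std::uint32_t* alloc_taggable_block(std::size_t count)
{
    void* p = nullptr;
    if (posix_memalign(&p, 16, count * sizeof(std::uint32_t)) != 0)
        return nullptr;
    return static_cast<std::uint32_t*>(p);   // release later with std::free(p)
}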
If space is at a premium you may be wasting memory. The union will allocate enough space for its largest member, which in this case could be 64 bits for the pointer.
If you stick to 32-bit architectures you should not have problems with the cast.