I'm implementing a naive memory manager for Vulkan device memory, and would like to make sure that I understand the alignment requirements for memory and how to satisfy them.
So, assuming that I've allocated a 'pool' of memory using vkAllocateMemory and wish to sub-allocate blocks of memory in this pool to individual resources (based on a VkMemoryRequirements struct), will the following pseudocode be able to allocate a section of this memory with the correct size and alignment requirements?
Request memory with RequiredSize and RequiredAlignment
Iterate over blocks in the pool looking for one that is free and has size > RequiredSize
If the offset in memory of the current block is NOT divisible by RequiredAlignment, figure out the difference between the alignment and the remainder
If the size of the current block minus the difference is less than RequiredSize, skip to the next block in the pool
If the difference is more than 0, insert a padding block with size equal to the difference, and adjust the current unallocated block size and offset
Allocate RequiredSize bytes from the start of the current unallocated block (now aligned), adjust the Size and Offset of the unallocated block accordingly
Return vkDeviceMemory handle (of pool), size and offset (of new allocated block)
If we reach the end of the block list instead, this pool cannot allocate the memory
In other words, do we just need to make sure that Offset is a multiple of RequiredAlignment?
Yes: making Offset a multiple of RequiredAlignment is nearly sufficient.
In vkBindBufferMemory, one of the valid usage requirements is:
memoryOffset must be an integer multiple of the alignment member of the VkMemoryRequirements structure returned from a call to vkGetBufferMemoryRequirements with buffer
and there is a parallel statement in the valid usage requirements of vkBindImageMemory:
memoryOffset must be an integer multiple of the alignment member of the VkMemoryRequirements structure returned from a call to vkGetImageMemoryRequirements with image
If the previous block contains a non-linear resource while the current one is linear, or vice versa, then the alignment requirement is the maximum of VkMemoryRequirements.alignment and the device's bufferImageGranularity. The same check is needed at the end of the memory block.
However, you also need to take into account that the pool's memory type index must be set in the memoryTypeBits mask of the resource's VkMemoryRequirements.
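The two extra checks above can be sketched as small helpers. The function names here are made up; memoryTypeBits and bufferImageGranularity are the real Vulkan fields they stand in for.

```cpp
#include <cstdint>

// VkMemoryRequirements::memoryTypeBits is a bitmask of acceptable memory
// type indices; the pool's type index must have its bit set.
bool PoolTypeIsCompatible(uint32_t poolMemoryTypeIndex, uint32_t memoryTypeBits) {
    return (memoryTypeBits & (1u << poolMemoryTypeIndex)) != 0;
}

// When a linear resource neighbors a non-linear one (or vice versa), pad
// the alignment up to the device's bufferImageGranularity.
uint64_t EffectiveAlignment(uint64_t requirementAlignment,
                            uint64_t bufferImageGranularity,
                            bool neighborIsDifferentKind) {
    if (neighborIsDifferentKind)
        return requirementAlignment > bufferImageGranularity
                   ? requirementAlignment : bufferImageGranularity;
    return requirementAlignment;
}
```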
In https://en.cppreference.com/w/cpp/memory/align, there's a space parameter that's the "buffer size." What is meant by buffer size here?
Is it like the amount of space you have to use to create the designed alignment? If it is, why is it needed?
It is an input/output parameter, so it does two things:
Tell the function how much space is available, so if alignment would overrun the buffer, the function fails:
The function modifies the pointer only if it would be possible to fit the wanted number of bytes aligned by the given alignment into the buffer. If the buffer is too small, the function does nothing and returns nullptr.
The function can report how much space is left after alignment, so you can string calls together. This is useful if you are writing some sort of aligned allocator.
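Both behaviors can be seen in a short sketch. The `demoAlign` helper name is invented; note that std::align only deducts the adjustment from space, so the caller advances the pointer past the carved-out object itself.

```cpp
#include <cstddef>
#include <memory>

// Returns the number of successful aligned carve-outs from a 64-byte buffer.
int demoAlign() {
    alignas(64) char buffer[64];
    void* p = buffer;
    std::size_t space = sizeof(buffer);
    int ok = 0;

    // First request: 8 bytes aligned to 8. std::align adjusts p and space.
    if (std::align(8, 8, p, space)) {
        ++ok;
        p = static_cast<char*>(p) + 8;  // consume the object ourselves...
        space -= 8;                     // ...std::align only pays for the adjustment
    }
    // Second request strings together off the updated p/space.
    if (std::align(16, 16, p, space)) ++ok;
    // This one cannot fit: std::align returns nullptr and leaves p/space alone.
    if (std::align(16, 1024, p, space)) ++ok;
    return ok;
}
```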
You have to consider why you would want to align a pointer.
Think about a case where you have a range of memory that has been allocated for you to create objects into. This is called a memory buffer. The size of the buffer is the number of bytes from the start of the range to its end. Each object has a type. Each type has an alignment requirement. Objects of that type can only be created in addresses aligned to the required byte boundary.
Let's say the first address of the memory range isn't aligned to the byte boundary that is required by the type of the object that you want to create. In such case, you cannot create the object at the beginning of the memory range. That's where you need std::align. It adjusts the given pointer to the next address that is aligned, which is the first address where the object can be created.
To do that, you only need to know the address and the alignment. But you also need to know whether the object can still fit inside your memory range after the alignment. For example, say you have 16 bytes of memory and want to create a 16-byte object, but the first address isn't aligned to the required 4-byte boundary: creating the object at the adjusted (aligned) address would overflow the memory range by the number of adjustment bytes. So we also pass the size of the object and the size of the memory space; if the object won't fit, std::align returns null.
We may also want to create more than one object in that memory buffer, so we need to know how much the pointer was adjusted in order to find out where the next object can go. That is why space is a non-const reference argument: if the object fits, the function deducts the number of bytes used for the adjustment from space.
Recently I was asked a question to implement a very simple malloc with the following restrictions and initial conditions.
#include <cstdlib>

#define HEAP_SIZE 2048

void* privateHeap; // backing store for the custom allocator

int main()
{
    privateHeap = malloc(HEAP_SIZE + 256); // extra 256 bytes for heap metadata
    void* ptr = mymalloc(750);
    myfree(ptr);
    return 0;
}
I need to implement mymalloc and myfree here using the exact space provided. 256 bytes is nicely mapping to 2048 bits, and I can have a bit array storing if a byte is allocated or if it is free. But when I make a myfree call with ptr, I cannot tell how much size was allocated to begin with. I cannot use any extra bits.
I don't see a way around this, but I've been told repeatedly that it can be done. Any suggestions?
EDIT 1:
Alignment restrictions don't exist; I assumed I don't need to align anything.
There was a demo program that did a series of mallocs and frees to test this, and it didn't have any memory blocks that were small. But that doesn't guarantee anything.
EDIT 2:
The guidelines from the documentation:
Certain Guidelines on your code:
Manage the heap metadata in the private heap; do not create extra linked lists outside of the provided private heap;
Design mymalloc, myrealloc, myFree to work for all possible inputs.
myrealloc should behave like realloc in the C library:
void* myrealloc( void* reallocThis, size_t newSize ):
If newSize is bigger than the size of the chunk at reallocThis:
It should first try to allocate a chunk of size newSize in place so that new chunk's base pointer also is reallocThis;
If there is no free space available to do in place allocation, it should allocate a chunk of requested size in a different region;
and then it should copy the contents from the previous chunk.
If the function fails to allocate the requested block of memory, a NULL pointer is returned, and the memory block pointed to by argument reallocThis is left unchanged.
If newSize is smaller, realloc should shrink the size of the chunk and should always succeed.
If newSize is 0, it should work like free.
If reallocThis is NULL, it should work like malloc.
If reallocThis is a pointer that was already freed, it should fail gracefully by returning NULL.
myFree should not crash when it is passed a pointer that has already been freed.
A common way malloc implementations keep track of the size of memory allocations, so that free knows how big they are, is to store the size in the bytes just before the pointer returned by malloc. Say you only need two bytes to store the length: when the caller of malloc requests n bytes of memory, you actually allocate n + 2 bytes. You then store the length in the first two bytes and return a pointer to the byte just past where you stored the size.
As for your algorithm generally, a simple and naive implementation is to keep track of unallocated memory with a linked list of free memory blocks that are kept in order of their location in memory. To allocate space you search for a free block that's big enough. You then modify the free list to exclude that allocation. To free a block you add it back to the free list, coalescing adjacent free blocks.
This isn't a good malloc implementation by modern standards, but a lot of old memory allocators worked this way.
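The size-header idea can be sketched as a bump allocator over a hypothetical fixed backing store (allocation only, no free list, to keep it short; the names my_malloc and allocation_size are made up):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical fixed backing store standing in for the private heap.
static unsigned char heap[2048];
static std::size_t   heapUsed = 0;

// Bump allocator that stores a 2-byte length header just before the
// pointer it returns, as described above.
void* my_malloc(std::size_t n) {
    if (heapUsed + 2 + n > sizeof(heap)) return nullptr;
    unsigned char* base = heap + heapUsed;
    uint16_t len = static_cast<uint16_t>(n);
    std::memcpy(base, &len, 2);  // header: size of this allocation
    heapUsed += 2 + n;
    return base + 2;             // user pointer is just past the header
}

// free/realloc can recover the size by reading the header back.
std::size_t allocation_size(void* p) {
    uint16_t len;
    std::memcpy(&len, static_cast<unsigned char*>(p) - 2, 2);
    return len;
}
```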
You seem to be thinking of the 256 bytes of meta-data as a bit-map to track free/in-use on a byte-by-byte basis.
I'd consider the following as only one possible alternative:
I'd start by treating the 2048-byte heap as 1024 "chunks" of 2 bytes each. This gives you 2 bits of metadata for each chunk: the first can signify whether that chunk is in use, and the second whether the following chunk is part of the same logical block as the current one.
When your free function is called, you use the passed address to find the correct starting point in your bitmap. You then walk forward, marking each chunk as free, until you reach one whose second bit is 0, indicating the end of the current logical block (i.e., that the next 2-byte chunk is not part of it).
[Oops: just noticed that Ross Ridge already suggested nearly the same basic idea in a comment.]
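A sketch of that 2-bits-per-chunk bitmap (names are invented; a real solution would wire this into mymalloc/myFree and keep the 256 metadata bytes inside the private heap):

```cpp
#include <cstdint>

// 1024 chunks of 2 bytes each; 2 bits of metadata per chunk packed into
// 256 bytes. Bit 0 of a chunk's pair: in use. Bit 1: the next chunk
// belongs to the same logical block.
static uint8_t meta[256];

static void setBits(int chunk, bool inUse, bool continues) {
    int byte = chunk / 4, shift = (chunk % 4) * 2;
    meta[byte] &= static_cast<uint8_t>(~(0x3u << shift));
    meta[byte] |= static_cast<uint8_t>(
        (uint8_t(inUse) | (uint8_t(continues) << 1)) << shift);
}
static int getBits(int chunk) {
    return (meta[chunk / 4] >> ((chunk % 4) * 2)) & 0x3;
}

// Mark 'count' consecutive chunks starting at 'start' as one block.
void markAllocated(int start, int count) {
    for (int i = 0; i < count; ++i)
        setBits(start + i, true, i + 1 < count);
}

// Walk forward from 'start', clearing chunks until the continuation bit
// is 0 -- this is all free() needs, with no stored size.
int markFree(int start) {
    int n = 0;
    for (int c = start;; ++c) {
        bool continues = (getBits(c) & 0x2) != 0;
        setBits(c, false, false);
        ++n;
        if (!continues) break;
    }
    return n;  // number of chunks released
}
```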
The allocation function attempts to allocate the requested amount of storage. If it is successful, it shall return the address of the start of a block of storage whose length in bytes shall be at least as large as the requested size.
What does that constraint mean? Could you give an example that violates it?
It seems, my question is unclear.
UPD:
Why "at least"? What is the point of allocating more than the requested size? Could you give a suitable example?
The allowance for allocating "more than required" is there to allow:
Good alignment of the next block of data.
Fewer restrictions on which platforms are able to run code compiled from C and C++.
Flexibility in the design of the memory allocation functionality.
An example of point one is:
char *p1 = new char[1];
int *p2 = new int[1];
If we allocate exactly 1 byte at address 0x1000 for the first allocation, and follow it immediately with a second allocation of 4 bytes for an int, the int will start at address 0x1001. This is "valid" on some architectures but often leads to a slower load of the value; on other architectures it will directly lead to a crash, because an int is not accessible at an address that isn't an even multiple of 4. Since the implementation of new doesn't actually know what the memory is eventually going to be used for, it's best to allocate at "the highest alignment", which on most architectures means 8 or 16 bytes. (If the memory is used, for example, to store SSE data, it will need an alignment of 16 bytes.)
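This "highest alignment" behavior can be observed directly: operator new is required to return storage aligned for any object of fundamental alignment, regardless of the requested size. A small check (the helper name newIsAligned is made up):

```cpp
#include <cstddef>
#include <cstdint>

// True if new'ed storage observes the expected alignments: even a 1-byte
// allocation must be aligned for any fundamental type.
bool newIsAligned() {
    char* p1 = new char[1];
    int*  p2 = new int[1];
    bool ok =
        reinterpret_cast<std::uintptr_t>(p1) % alignof(std::max_align_t) == 0 &&
        reinterpret_cast<std::uintptr_t>(p2) % alignof(int) == 0;
    delete[] p1;
    delete[] p2;
    return ok;
}
```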
The second case would be where "pointers can only point to whole blocks of 32-bit words". There have been architectures like that in the past. In this case, even if we ignore the above problem with alignment, the memory location specified by a generic pointer is two parts, one for the actual address, and one for the "which byte within that word". In the memory allocator, since typical allocations are much larger than a single byte, we decide to only use the "full word" pointer, so all allocations are by design always rounded up to whole words.
The third case, for example, would be a "pre-sized block" allocator. Some real-time OSes, for example, have a fixed number of predefined sizes that they allocate: say 16, 32, 64, 256, 1024, 16384, 65536, 1M, 16M bytes. Allocations are then rounded up to the nearest equal or larger size, so an allocation for 257 bytes would be served from the 1024 size. The idea here is to provide fast allocation, by keeping track of free blocks of each size rather than the traditional model of searching through a large number of blocks of any size to find one big enough. It also helps against fragmentation, where lots of memory is "free" but of the wrong size and so can't be used: for example, if you run a loop that allocates 64-byte blocks until the system is out of memory, then free every other one and try to allocate a 128-byte block, there is not a single 128-byte block free, because ALL of the memory is carved up into little 64-byte sections.
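The rounding step of such a pre-sized allocator is a simple lookup. A sketch, using the hypothetical size classes from the example above (roundToClass is a made-up name):

```cpp
#include <cstddef>

// Hypothetical size classes, as in the "pre-sized block" allocator above.
static const std::size_t kClasses[] = {16, 32, 64, 256, 1024, 16384, 65536};

// Round a request up to the nearest equal-or-larger class; returns 0 if
// the request is too large for any class.
std::size_t roundToClass(std::size_t n) {
    for (std::size_t c : kClasses)
        if (n <= c) return c;
    return 0;
}
```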
It means that the allocation function shall return an address of a block of memory whose size is at least the size you requested.
Most allocation functions, however, will return the address of a memory block whose size is bigger than the one you requested, and subsequent allocations will return addresses inside this block until it is exhausted.
The main reasons for this behavior are:
Minimize the number of new memory block allocations (each block can contain several allocations), which are expensive in terms of time complexity.
Specific alignment issues.
The two most common reasons why an allocation returns a block larger than requested are:
Alignment
Bookkeeping
It may be so because in modern operating systems it is much more efficient to allocate a whole memory page, which can be e.g. 512 KB. The internal malloc machinery can allocate that page, fill its beginning with service information (how many sub-blocks it is divided into, their sizes, and so on), and only then return an address suitable for your needs. The next call to malloc will return another portion of this allocated page, for instance. The block of memory isn't restricted to the size you requested; in fact you can overflow this buffer on the heap, because there is no safety mechanism to prevent it. Also consider the alignment issues that other responders have stated above. There are plenty of memory management strategies; you can look up Best Fit, First Fit, and others if you are interested (correct me if I'm wrong).
I have a very large (fixed at runtime, around 10 - 30 million) number of arrays. Each array is of between 0 and 128 elements that are each 6 bytes.
I need to store all the arrays in mmap'ed memory (so I can't use malloc), and the arrays need to be able to grow dynamically (up to 128 elements, and the arrays never shrink).
I implemented a naive approach of having an int array representing the state of each block of 6 bytes in the mmap'ed memory. A value of 0xffffffff at an offset represents the corresponding offset in the mmap'ed memory being free, any other value is the id of the array (which is needed for defragmentation of the blocks in my current implementation, blocks can't be moved without knowing the id of their array to update other data structures). On allocation and when an array outgrows its allocation it would simply iterate until it found enough free blocks, and insert at the corresponding offset.
This is what the allocation array and mmap'ed memory look like, kind of:
| 0xffffffff | 0xffffffff |    1234    |    1234    | 0xffffffff | ...
---------------------------------------------------------------------
|    free    |    free    |array1234[0]|array1234[1]|    free    | ...
This approach, though, has a memory overhead of the offset of the furthest used block in the mmap'ed memory × 4 (4 bytes per int).
What better approaches are there for this specific case?
My ideal requirements for this are:
Memory overhead (any allocation tables + unused space) <= 1.5 bits per element + 4*6 bytes per array
O(1) allocation and growing of arrays
Boost.Interprocess seems to have a neat implementation of managed memory-mapped files, with provisions similar to malloc/free but for mapped files (i.e. you have a handle to a suitably-large memory-mapped file and you can ask the library to sub-allocate an unused part of the file for something, like an array). From the documentation:
Boost.Interprocess offers some basic classes to create shared memory
objects and file mappings and map those mappable classes to the
process' address space.
However, managing those memory segments is not easy for
non-trivial tasks. A mapped region is a fixed-length memory buffer and
creating and destroying objects of any type dynamically, requires a
lot of work, since it would require programming a memory management
algorithm to allocate portions of that segment. Many times, we also
want to associate names to objects created in shared memory, so all
the processes can find the object using the name.
Boost.Interprocess offers 4 managed memory segment classes:
To manage a shared memory mapped region (basic_managed_shared_memory class).
To manage a memory mapped file (basic_managed_mapped_file).
To manage a heap allocated (operator new) memory buffer (basic_managed_heap_memory class).
To manage a user provided fixed size buffer (basic_managed_external_buffer class).
The most important services of a managed memory segment are:
Dynamic allocation of portions of the memory segment.
Construction of C++ objects in the memory segment. These objects can be anonymous or we can associate a name to them.
Searching capabilities for named objects.
Customization of many features: memory allocation algorithm, index types or character types.
Atomic constructions and destructions, so that if the segment is shared between two processes it's impossible to create two objects associated with the same name, simplifying synchronization.
How many mmap'ed areas can you afford? If 128 is OK, then I'd create 128 areas corresponding to all the possible sizes of your arrays, and ideally a linked list of free entries for each area. This gives you a fixed record size within each area. Growing an array from N to N + 1 elements is then an operation of moving the data from area[N] to the end of area[N + 1] (if the free list for N + 1 is empty) or into an empty slot (if not). For area[N], the removed slot is added to its list of free entries.
UPDATE: The linked list can be embedded in the main structures, so no extra allocation is needed: the first field (an int) inside every record (from size 1 to 128) can be an index to the next free entry. For allocated entries it is always void (0xffffffff), but when an entry is free this index becomes a member of the corresponding linked chain.
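A minimal sketch of such an embedded free list (the Area struct and function names are invented for illustration; here a record is measured in uint32_t units, and the first field of a free record links to the next free one):

```cpp
#include <cstdint>

// 0xffffffff marks the end of the chain; it is also stored in allocated
// records, where the link field is unused.
static const uint32_t kNil = 0xffffffffu;

struct Area {
    uint32_t* records;        // backing storage for fixed-size records
    uint32_t  stride;         // record size in uint32_t units
    uint32_t  freeHead = kNil;
};

// Return a slot to the area's free list, linking it at the head.
void pushFree(Area& a, uint32_t slot) {
    a.records[slot * a.stride] = a.freeHead;
    a.freeHead = slot;
}

// Take a free slot from the head of the list, or kNil if none remain.
uint32_t popFree(Area& a) {
    uint32_t slot = a.freeHead;
    if (slot != kNil) {
        a.freeHead = a.records[slot * a.stride];
        a.records[slot * a.stride] = kNil;  // field unused while allocated
    }
    return slot;
}
```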
I devised and ultimately went with a memory allocation algorithm that just about lives up to my requirements, with O(1) amortised, very little fragmentation and very little overhead. Feel free to comment and I'll detail it when I get a chance.
As the title says, I want to know whether, in C++, the memory allocated by one new operation is consecutive...
BYTE* data = new BYTE[size];
In this code, whatever size is given, the returned memory region is consecutive. If the heap manager can't allocate a consecutive region of that size, it fails: an exception is thrown (or NULL is returned, in the case of malloc).
Programmers will always see the illusion of consecutive (and yes, infinite :-) memory in a process's address space. This is what virtual memory provides to programmers.
Note that programmers (other than on a few embedded systems) always see virtual memory. However, virtually consecutive memory can be mapped to physical memory in an arbitrary fashion, at the granularity of the page size (typically 4KB). You can't see that mapping, and mostly you don't need to understand it (except for very specific page-level optimizations).
What about this?
BYTE* data1 = new BYTE[size1];
BYTE* data2 = new BYTE[size2];
Sure, you can't say anything about the relative addresses of data1 and data2. They are generally non-deterministic: they depend on the heap manager's policies (new is often just a wrapper around malloc) and the current heap state when the request was made.
The memory allocated in your process's address space will be contiguous.
How those bytes are mapped into physical memory is implementation-specific; if you allocate a very large block of memory, it is likely to be mapped to different parts of physical memory.
Edit: Since someone disagrees that the bytes are guaranteed to be contiguous, the standard says (3.7.3.1):
The allocation function attempts to allocate the requested amount of storage. If it is successful, it shall return the address of the start of a block of storage whose length in bytes shall be at least as large as the requested size.
Case 1:
Using "new" to allocate an array, as in
int* foo = new int[10];
In this case, each element of foo will be in contiguous virtual memory.
Case 2:
Using separate "new" operations, as in
int* foo = new int;
int* bar = new int;
In this case, there is never a guarantee that the memory allocated between calls to "new" will be adjacent in virtual memory.
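Case 1 can be verified directly with pointer arithmetic; a small sketch (the helper name spanOfNewArray is made up). Case 2 cannot be checked this way, precisely because nothing is guaranteed about the relative placement of separate allocations.

```cpp
#include <cstddef>

// Byte distance between the first and last element of one new[] allocation:
// for a contiguous array of 10 ints this is exactly 9 * sizeof(int).
std::ptrdiff_t spanOfNewArray() {
    int* foo = new int[10];
    std::ptrdiff_t d = reinterpret_cast<char*>(&foo[9]) -
                       reinterpret_cast<char*>(&foo[0]);
    delete[] foo;
    return d;
}
```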
The virtual addresses of the allocated bytes will be contiguous. They will also be physically contiguous within the resident pages backing your process's address space. The mapping of physical pages to regions of the process's virtual space is very OS- and platform-specific, but in general you cannot assume a physically contiguous range larger than a page, or one that isn't page-aligned.
If by your question you mean "Will successive (in time) new() operations return adjacent chunks of memory, with no gaps in between?", this old programmer will suggest, very politely, that you should not rely on it.
The only reason that question would come up was if you intended to walk a pointer "out" of one data object and "into" the next one. This is a really bad idea, since you have no guarantee that the next object in the address space is of anything remotely resembling the same type as the previous one.
Yes.
Don't bother about the "virtual memory" issue: apart from the fact that there are systems with no virtual memory support at all, from your point of view you get a consecutive memory chunk. That's all.
Physical memory is not necessarily contiguous; it is the logical (virtual) memory that is contiguous.