Allocating memory for multiple buffers including alignment - c++

I'd like to allocate memory for contiguous arrays of several different types. That is, like this, but with dynamic sizes:
template <typename T0, std::size_t N0, typename T1, std::size_t N1, ...>
struct {
T0 t0s[N0];
T1 t1s[N1];
...
};
Naively, the size is just sizeof(T0) * N0 + sizeof(T1) * N1 + ... but alignment makes it tricky, I think. You'd align the whole thing on alignof(T0), then after N0 T0s, you have to figure out the alignof(T1) spot to start the T1s. There's std::align for this, but it's geared toward finding where to put them after you already have memory allocated, so it would take a little effort to use it before allocating memory.
Is there some easy off-the-shelf way to calculate the size and offsets of a dynamic struct like this? I'm picturing a function like
// Return the offsets in bytes of the ends of each array
template <typename... Ts>
std::array<std::size_t, sizeof...(Ts)>
calculateOffsets(const std::array<std::size_t, sizeof...(Ts)> sizes);
for example:
auto offsets = calculateOffets<char, float>({3, 2});
// Offsets is now {4, 12} because the layout is [cccXffffFFFF] where c is a char, X is padding, f and F are the two floats.
// Now we can allocate a buffer of size offsets.back() with alignment alignof(char) and placement-new 3 chars starting at buf + 0, and 2 floats at buf + offsets.front()
It's funny because clearly the compiler has this logic built in, since at compile time it knows the layout of the above struct when the sizes are static.
(So far I'm only interested in POD types; to work fully generically I'd additionally need to deal with exception-safe construction and destruction.)

There are no built-in facilities for this -- but this is a solvable problem in C++.
Since what you're effectively looking for is a "packed buffer" of aligned contiguous components, I would recommend doing this by allocating the sum of all buffers at the alignment of the most aligned object. This makes calculation of the length of the buffer in bytes easy -- but requires remapping of the order of the data.
The thing to remember with alignment is that a 16-byte boundary is also aligned to an 8 byte boundary (and so on). On the other hand, it is not true that 4 bytes from an 1-byte boundary will be on an 4-byte boundary1. It may be, but this depends on the initial pointer.
Example
As an example, we want the following in the buffer:
3 chars (c)
2 floats (f)
2 shorts (s)
We assume the following is true:
alignof(float) >= alignof(short) >= alignof(char)
sizeof(float) is 4
sizeof(short) is 2
sizeof(char) is 1
For our needs, we would want a buffer that is sizeof(float) * 2 + sizeof(short) * 2 + sizeof(char) * 3 = 15 bytes.
A buffer that has been sorted by the alignment would look like the following, when packed:
ffff ffff ssss ccc
As long as the initial allocation is aligned to alignof(float), the rest of the bytes in this buffer are guaranteed to be suitably aligned as well.
It would be ideal if you could just arrange your data to always be in descending order of alignment; but if this is not guaranteed to be the case, you could always use a pre-canned template-metaprogramming solution to sort a list of T... types based on the alignof(T) for each object.
1 The issue with your suggestion of aligning in order of the buffers is that a situation like alignof(char), which is 1 byte, does not guarantee that 3 chars later and 1 padding byte from the initial alignment is aligned to a 4 byte alignment. A pointer of 0x00FFFFFE01 is "1-byte aligned", but 4 bytes later is 0x00FFFFFE05 -- which is still unaligned.
Although this may work with some underlying allocators, like new/std::malloc which have a default alignment of alignof(std::max_align_t) -- this is not true for more precise allocation mechanisms such as those found in std::polymorphic_allocator.

Is there some easy off-the-shelf way to calculate the size and offsets of a dynamic struct like this?
There is no such function in the standard library. You can probably implement it using std::align in a loop.

Related

Align struct while minimizing cache line waste

There are many great threads on how to align structs to the cache line (e.g., Aligning to cache line and knowing the cache line size).
Imagine you have a system with 256B cache line size, and a struct of size 17B (e.g., a tightly packed struct with two uint64_t and one uint8_t). If you align the struct to cache line size, you will have exactly one cache line load per struct instance.
For machines with a cache line size of 32B or maybe even 64B, this will be good for performance, because we avoid having to fetch 2 caches lines as we do definitely not cross CL boundaries.
However, on the 256B machine, this wastes lots of memory and results in unnecessary loads when iterating through an array/vector of this struct. In fact, you could store 15 instances of the struct in a single cacheline.
My question is two-fold:
In C++17 and above, using alignas, I can align to cache line size. However, it is unclear to me how I can force alignment in a way that is similar to "put as many instances in a cache line as possible without crossing the cache line boundary, then start at the next cache line". So something like this:
where the upper box is a cache line and the other boxes are instances of our small struct.
Do I actually want this? I cannot really wrap my head around this. Usually, we say if we align our struct to the cache line size, access will be faster, as we just have to load a single cache line. However, seeing my example, I wonder if this is actually true. Wouldn't it be faster to not be aligned, but instead store many more instances in a single cache line?
Thank you so much for your input here. It is much appreciated.
To address (2), it is unclear whether the extra overhead of using packed structs (e.g., unaligned 64-bit accesses) and the extra math to access array elements will be worth it. But if you want to try it, you can create a new struct to pack your struct elements appropriately, then create a small wrapper class to access the elements like you would an array:
#include <array>
#include <iostream>
using namespace std;
template <typename T, size_t BlockAlignment>
struct __attribute__((packed)) Packer
{
static constexpr size_t NUM_ELEMS = BlockAlignment / sizeof(T);
static_assert( NUM_ELEMS > 0, "BlockAlignment too small for one object." );
T &operator[]( size_t index ) { return packed[index]; }
T packed[ NUM_ELEMS ];
uint8_t padding[ BlockAlignment - sizeof(T)*NUM_ELEMS ];
};
template <typename T, size_t NumElements, size_t BlockAlignment>
struct alignas(BlockAlignment) PackedAlignedArray
{
typedef Packer<T, BlockAlignment> PackerType;
std::array< PackerType, NumElements / PackerType::NUM_ELEMS + 1 > packers;
T &operator[]( size_t index ) {
return packers[ index / PackerType::NUM_ELEMS ][ index % PackerType::NUM_ELEMS ];
}
};
struct __attribute__((packed)) Foo
{
uint64_t a;
uint64_t b;
uint8_t c;
};
int main()
{
static_assert( sizeof(Foo) == 17, "Struct not packed for test" );
constexpr size_t NUM_ELEMENTS = 10;
constexpr size_t BLOCK_ALIGNMENT = 64;
PackedAlignedArray<Foo, NUM_ELEMENTS, BLOCK_ALIGNMENT> theArray;
for ( size_t i=0; i<NUM_ELEMENTS; ++i )
{
// Display the memory offset between the current
// element and the start of the array
cout << reinterpret_cast<std::ptrdiff_t>(&theArray[i]) -
reinterpret_cast<std::ptrdiff_t>(&theArray[0]) << std::endl;
}
return 0;
}
The output of the program shows the byte offsets of the addresses in memory of the the 17-byte elements, automatically resetting to a multiple of 64 every four elements:
0
17
34
64
81
98
128
145
162
192
You could pack into a cache line by declaring the struct itself as under-aligned, with GNU C __attribute__((packed)) or something, so it has sizeof(struct) = 17 instead of the usual padding you'd get to make the struct size a multiple of 8 bytes (the alignment it would normally have because of having a uint64_t member, assuming alignof(uint64_t) == 8).
Then put it in an alignas(256) T array[], so only the start of the array is aligned.
Alignment to boundaries wider than the struct object itself is only possible in terms of a larger object containing multiple structs; ISO C++'s alignment system can't specify that an object can only go in containers which start at a certain alignment boundary; that would lead to nonsensical situations like T *arr being the start of a valid array, but arr + 1 no being a valid array, even though it has the same type.
Unless you're worried about false sharing between two threads, you should at most naturally-align your struct, e.g. to 32 bytes. A naturally-aligned object (alignof=sizeof) can't span an alignment boundary larger than itself. But that wastes a lot of space, so probably better to just let the compiler align it by 8, inheriting alignof(struct) = max (alignof(members)).
See also How do I organize members in a struct to waste the least space on alignment? for more detail on how space inside a single struct works.
Depending on how your data is accessed, one option would be to store pairs of uint64_t in one array, and the corresponding byte in a separate array. That way you have no padding bytes and everything is a power of 2 size and alignment. But random access to all 3 member that go with each other could cost 2 cache misses instead of 1.
But for iterating over your data sequentially, that's excellent.

Aligning buffer to an N-byte boundary but not a 2N-byte one?

I would like to allocate some char buffers0, to be passed to an external non-C++ function, that have a specific alignment requirement.
The requirement is that the buffer be aligned to a N-byte1 boundary, but not to a 2N boundary. For example, if N is 64, then an the pointer to this buffer p should satisfy ((uintptr_t)p) % 64 == 0 and ((uintptr_t)p) % 128 != 0 - at least on platforms where pointers have the usual interpretation as a plain address when cast to uintptr_t.
Is there a reasonable way to do this with the standard facilities of C++11?
If not, is there is a reasonable way to do this outside the standard facilities2 which works in practice for modern compilers and platforms?
The buffer will be passed to an outside routine (adhering to the C ABI but written in asm). The required alignment will usually be greater than 16, but less than 8192.
Over-allocation or any other minor wasted-resource issues are totally fine. I'm more interested in correctness and portability than wasting a few bytes or milliseconds.
Something that works on both the heap and stack is ideal, but anything that works on either is still pretty good (with a preference towards heap allocation).
0 This could be with operator new[] or malloc or perhaps some other method that is alignment-aware: whatever makes sense.
1 As usual, N is a power of two.
2 Yes, I understand an answer of this type causes language-lawyers to become apoplectic, so if that's you just ignore this part.
Logically, to satisfy "aligned to N, but not 2N", we align to 2N then add N to the pointer. Note that this will over-allocate N bytes.
So, assuming we want to allocate B bytes, if you just want stack space, alignas would work, perhaps.
alignas(N*2) char buffer[B+N];
char *p = buffer + N;
If you want heap space, std::aligned_storage might do:
typedef std::aligned_storage<B+N,N*2>::type ALIGNED_CHAR;
ALIGNED_CHAR buffer;
char *p = reinterpret_cast<char *>(&buffer) + N;
I've not tested either out, but the documentation suggests it should be OK.
You can use _aligned_malloc(nbytes,alignment) (in MSVC) or _mm_malloc(nbytes,alignment) (on other compilers) to allocate (on the heap) nbytes of memory aligned to alignment bytes, which must be an integer power of two.
Then you can use the trick from Ken's answer to avoid alignment to 2N:
void*ptr_alloc = _mm_malloc(nbytes+N,2*N);
void*ptr = static_cast<void*>(static_cast<char*>(ptr_alloc) + N);
/* do your number crunching */
_mm_free(ptr_alloc);
We must ensure to keep the pointer returned by _mm_malloc() for later de-allocation, which must be done via _mm_free().

C++11 : Does new return contiguous memory?

float* tempBuf = new float[maxVoices]();
Will the above result in
1) memory that is 16-byte aligned?
2) memory that is confirmed to be contiguous?
What I want is the following:
float tempBuf[maxVoices] __attribute__ ((aligned));
but as heap memory, that will be effective for Apple Accelerate framework.
Thanks.
The memory will be aligned for float, but not necessarily for CPU-specific SIMD instructions. I strongly suspect on your system sizeof(float) < 16, though, which means it's not as aligned as you want. The memory will be contiguous: &A[i] == &A[0] + i.
If you need something more specific, new std::aligned_storage<Length, Alignment> will return a suitable region of memory, presuming of course that you did in fact pass a more specific alignment.
Another alternative is struct FourFloats alignas(16) {float[4] floats;}; - this may map more naturally to the framework. You'd now need to do new FourFloats[(maxVoices+3)/4].
Yes, new returns contiguous memory.
As for alignment, no such alignment guarantee is provided. Try this:
template<class T, size_t A>
T* over_aligned(size_t N){
static_assert(A <= alignof(std::max_align_t),
"Over-alignment is implementation-defined."
);
static_assert( std::is_trivially_destructible<T>{},
"Function does not store number of elements to destroy"
);
using Helper=std::aligned_storage_t<sizeof(T), A>;
auto* ptr = new Helper[(N+sizeof(Helper)-1)/sizeof(Helper)];
return new(ptr) T[N];
}
Use:
float* f = over_aligned<float,16>(37);
Makes an array of 37 floats, with the buffer aligned to 16 bytes. Or it fails to compile.
If the assert fails, it can still work. Test and consult your compiler documentation. Once convinced, put compiler-specific version guards around the static assert, so when you change compilers you can test all over again (yay).
If you want real portability, you have to have a fall back to std::align and manage resources separately from data pointers and count the number of T if and only if T has a non-trivial destructor, then store the number of T "before" the start of your buffer. It gets pretty silly.
It's guaranteed to be aligned properly with respect to the type you're allocating. So if it's an array of 4 floats (each supposedly 4 bytes), it's guaranteed to provide a usable sequence of floats. It's not guaranteed to be aligned to 16 bytes.
Yes, it's guaranteed to be contiguous (otherwise what would be the meaning of a single pointer?)
If you want it to be aligned to some K bytes, you can do it manually with std::align. See MSalter's answer for a more efficient way of doing this.
If tempBuf is not nullptr then the C++ standard guarantees that tempBuf points to the zeroth element of least maxVoices contiguous floats.
(Don't forget to call delete[] tempBuf once you're done with it.)

Size of std::array, std::vector and raw array

Lets we have,
std::array <int,5> STDarr;
std::vector <int> VEC(5);
int RAWarr[5];
I tried to get size of them as,
std::cout << sizeof(STDarr) + sizeof(int) * STDarr.max_size() << std::endl;
std::cout << sizeof(VEC) + sizeof(int) * VEC.capacity() << std::endl;
std::cout << sizeof(RAWarr) << std::endl;
The outputs are,
40
20
40
Are these calculations correct? Considering I don't have enough memory for std::vector and no way of escaping dynamic allocation, what should I use? If I would know that std::arrayresults in lower memory requirement I could change the program to make array static.
These numbers are wrong. Moreover, I don't think they represent what you think they represent, either. Let me explain.
First the part about them being wrong. You, unfortunately, don't show the value of sizeof(int) so we must derive it. On the system you are using the size of an int can be computed as
size_t sizeof_int = sizeof(RAWarr) / 5; // => sizeof(int) == 8
because this is essentially the definition of sizeof(T): it is the number of bytes between the start of two adjacent objects of type T in an array. This happens to be inconsistent with the number print for STDarr: the class template std::array<T, n> is specified to have an array of n objects of type T embedded into it. Moreover, std::array<T, n>::max_size() is a constant expression yielding n. That is, we have:
40 // is identical to
sizeof(STDarr) + sizeof(int) * STDarr.max_size() // is bigger or equal to
sizeof(RAWarr) + sizeof_int * 5 // is identical to
40 + 40 // is identical to
80
That is 40 >= 80 - a contradication.
Similarily, the second computation is also inconsistent with the third computation: the std::vector<int> holds at least 5 elements and the capacity() has to be bigger than than the size(). Moreover, the std::vector<int>'s size is non-zero. That is, the following always has to be true:
sizeof(RAWarr) < sizeof(VEC) + sizeof(int) * VEC.capacity()
Anyway, all this is pretty much irrelevant to what your actual question seems to be: What is the overhead of representing n objects of type T using a built-in array of T, an std::array<T, n>, and an std::vector<T>? The answer to this question is:
A built-in array T[n] uses sizeof(T) * n.
An std::array<T, n> uses the same size as a T[n].
A std::vector<T>(n) has needs some control data (the size, the capacity, and possibly and possibly an allocator) plus at least 'n * sizeof(T)' bytes to represent its actual data. It may choose to also have a capacity() which is bigger than n.
In addition to these numbers, actually using any of these data structures may require addition memory:
All objects are aligned at an appropriate address. For this there may be additional byte in front of the object.
When the object is allocated on the heap, the memory management system my include a couple of bytes in addition to the memory made avaiable. This may be just a word with the size but it may be whatever the allocation mechanism fancies. Also, this memory may live somewhere else than the allocate memory, e.g. in a hash table somewhere.
OK, I hope this provided some insight. However, here comes the important message: if std::vector<T> isn't capable of holding the amount of data you have there are two situations:
You have extremely low memory and most of this discussion is futile because you need entirely different approaches to cope with the few bytes you have. This would be the case if you are working on extremely resource constrained embedded systems.
You have too much data an using T[n] or std::array<T, n> won't be of much help because the overhead we are talking of is typically less than 32 bytes.
Maybe you can describe what you are actually trying to do and why std::vector<T> is not an option.

Understanding sizeof(char) in 32 bit C compilers

(sizeof) char always returns 1 in 32 bit GCC compiler.
But since the basic block size in 32 bit compiler is 4, How does char occupy a single byte when the basic size is 4 bytes???
Considering the following :
struct st
{
int a;
char c;
};
sizeof(st) returns as 8 as agreed with the default block size of 4 bytes (since 2 blocks are allotted)
I can never understand why sizeof(char) returns as 1 when it is allotted a block of size 4.
Can someone pls explain this???
I would be very thankful for any replies explaining it!!!
EDIT : The typo of 'bits' has been changed to 'bytes'. I ask Sorry to the person who made the first edit. I rollbacked the EDIT since I did not notice the change U made.
Thanks to all those who made it a point that It must be changed especially #Mike Burton for downvoting the question and to #jalf who seemed to jump to conclusions over my understanding of concepts!!
sizeof(char) is always 1. Always. The 'block size' you're talking about is just the native word size of the machine - usually the size that will result in most efficient operation. Your computer can still address each byte individually - that's what the sizeof operator is telling you about. When you do sizeof(int), it returns 4 to tell you that an int is 4 bytes on your machine. Likewise, your structure is 8 bytes long. There is no information from sizeof about how many bits there are in a byte.
The reason your structure is 8 bytes long rather than 5 (as you might expect), is that the compiler is adding padding to the structure in order to keep everything nicely aligned to that native word length, again for greater efficiency. Most compilers give you the option to pack a structure, either with a #pragma directive or some other compiler extension, in which case you can force your structure to take minimum size, regardless of your machine's word length.
char is size 1, since that's the smallest access size your computer can handle - for most machines an 8-bit value. The sizeof operator gives you the size of all other quantities in units of how many char objects would be the same size as whatever you asked about. The padding (see link below) is added by the compiler to your data structure for performance reasons, so it is larger in practice than you might think from just looking at the structure definition.
There is a wikipedia article called Data structure alignment which has a good explanation and examples.
It is structure alignment with padding. c uses 1 byte, 3 bytes are non used. More here
Sample code demonstrating structure alignment:
struct st
{
int a;
char c;
};
struct stb
{
int a;
char c;
char d;
char e;
char f;
};
struct stc
{
int a;
char c;
char d;
char e;
char f;
char g;
};
std::cout<<sizeof(st) << std::endl; //8
std::cout<<sizeof(stb) << std::endl; //8
std::cout<<sizeof(stc) << std::endl; //12
The size of the struct is bigger than the sum of its individual components, since it was set to be divisible by 4 bytes by the 32 bit compiler. These results may be different on different compilers, especially if they are on a 64 bit compiler.
First of all, sizeof returns a number of bytes, not bits. sizeof(char) == 1 tells you that a char is eight bits (one byte) long. All of the fundamental data types in C are at least one byte long.
Your structure returns a size of 8. This is a sum of three things: the size of the int, the size of the char (which we know is 1), and the size of any extra padding that the compiler added to the structure. Since many implementations use a 4-byte int, this would imply that your compiler is adding 3 bytes of padding to your structure. Most likely this is added after the char in order to make the size of the structure a multiple of 4 (a 32-bit CPU access data most efficiently in 32-bit chunks, and 32 bits is four bytes).
Edit: Just because the block size is four bytes doesn't mean that a data type can't be smaller than four bytes. When the CPU loads a one-byte char into a 32-bit register, the value will be sign-extended automatically (by the hardware) to make it fill the register. The CPU is smart enough to handle data in N-byte increments (where N is a power of 2), as long as it isn't larger than the register. When storing the data on disk or in memory, there is no reason to store every char as four bytes. The char in your structure happened to look like it was four bytes long because of the padding added after it. If you changed your structure to have two char variables instead of one, you should see that the size of the structure is the same (you added an extra byte of data, and the compiler added one fewer byte of padding).
All object sizes in C and C++ are defined in terms of bytes, not bits. A byte is the smallest addressable unit of memory on the computer. A bit is a single binary digit, a 0 or a 1.
On most computers, a byte is 8 bits (so a byte can store values from 0 to 256), although computers exist with other byte sizes.
A memory address identifies a byte, even on 32-bit machines. Addresses N and N+1 point to two subsequent bytes.
An int, which is typically 32 bits covers 4 bytes, meaning that 4 different memory addresses exist that each point to part of the int.
In a 32-bit machine, all the 32 actually means is that the CPU is designed to work efficiently with 32-bit values, and that an address is 32 bits long. It doesn't mean that memory can only be addressed in blocks of 32 bits.
The CPU can still address individual bytes, which is useful when dealing with chars, for example.
As for your example:
struct st
{
int a;
char c;
};
sizeof(st) returns 8 not because all structs have a size divisible by 4, but because of alignment. For the CPU to efficiently read an integer, its must be located on an address that is divisible by the size of the integer (4 bytes). So an int can be placed on address 8, 12 or 16, but not on address 11.
A char only requires its address to be divisible by the size of a char (1), so it can be placed on any address.
So in theory, the compiler could have given your struct a size of 5 bytes... Except that this wouldn't work if you created an array of st objects.
In an array, each object is placed immediately after the previous one, with no padding. So if the first object in the array is placed at an address divisible by 4, then the next object would be placed at a 5 bytes higher address, which would not be divisible by 4, and so the second struct in the array would not be properly aligned.
To solve this, the compiler inserts padding inside the struct, so its size becomes a multiple of its alignment requirement.
Not because it is impossible to create objects that don't have a size that is a multiple of 4, but because one of the members of your st struct requires 4-byte alignment, and so every time the compiler places an int in memory, it has to make sure it is placed at an address that is divisible by 4.
If you create a struct of two chars, it won't get a size of 4. It will usually get a size of 2, because when it contains only chars, the object can be placed at any address, and so alignment is not an issue.
Sizeof returns the value in bytes. You were talking about bits. 32 bit architectures are word aligned and byte referenced. It is irrelevant how the architecture stores a char, but to compiler, you must reference chars 1 byte at a time, even if they use up less than 1 byte.
This is why sizeof(char) is 1.
ints are 32 bit, hence sizeof(int)= 4, doubles are 64 bit, hence sizeof(double) = 8, etc.
Because of optimisation padding is added so size of an object is 1, 2 or n*4 bytes (or something like that, talking about x86). That's why there is added padding to 5-byte object and to 1-byte not. Single char doesn't have to be padded, it can be allocated on 1 byte, we can store it on space allocated with malloc(1). st cannot be stored on space allocated with malloc(5) because when st struct is being copied whole 8 bytes are being copied.
It works the same way as using half a piece of paper. You use one part for a char and the other part for something else. The compiler will hide this from you since loading and storing a char into a 32bit processor register depends on the processor.
Some processors have instructions to load and store only parts of the 32bit others have to use binary operations to extract the value of a char.
Addressing a char works as it is AFAIR by definition the smallest addressable memory. On a 32bit system pointers to two different ints will be at least 4 address points apart, char addresses will be only 1 apart.