Why is dynamically allocated memory always 16 bytes aligned? - c++

I wrote a simple example:
#include <cstdlib>   // for malloc
#include <iostream>

int main() {
    void* byte1 = ::operator new(1);
    void* byte2 = ::operator new(1);
    void* byte3 = malloc(1);
    std::cout << "byte1: " << byte1 << std::endl;
    std::cout << "byte2: " << byte2 << std::endl;
    std::cout << "byte3: " << byte3 << std::endl;
    return 0;
}
Running the example, I get the following results:
byte1: 0x1f53e70
byte2: 0x1f53e90
byte3: 0x1f53eb0
Each time I allocate a single byte of memory, the result is always 16-byte aligned. Why does this happen?
I tested this code on GCC 5.4.0 as well as GCC 7.4.0, and got the same results.

Why does this happen?
Because the standard says so. More specifically, it says that dynamic allocations [1] are aligned to at least the maximum fundamental [2] alignment (they may be aligned even more strictly). There is a pre-defined macro (since C++17) just for the purpose of telling you exactly what this guaranteed alignment is: __STDCPP_DEFAULT_NEW_ALIGNMENT__. Why this might be 16 in your example... that is a choice of the language implementation, constrained by what the target hardware architecture allows.
This is (was) a necessary design, considering that there is (was) no way to pass information about the needed alignment to the allocation function (until C++17 which introduced aligned-new syntax for the purpose of allocating "over-aligned" memory).
malloc doesn't know anything about the types of objects you intend to create in that memory. One might think that new could in principle deduce the alignment, since it is given a type... but what if you wanted to reuse that memory for other objects with stricter alignment, for example in the implementation of std::vector? And once you look at the API of operator new, void* operator new(std::size_t count), you can see that neither the type nor its alignment is an argument that could affect the alignment of the allocation.
[1] Made by the default allocator or the malloc family of functions.
[2] The maximum fundamental alignment is alignof(std::max_align_t). No fundamental type (arithmetic types, pointers) has a stricter alignment than this.
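For reference, a quick way to print both of these guarantees on your own toolchain (a minimal sketch; compile as C++17 or later for the macro):
#include <cstddef>
#include <iostream>

int main() {
    // Alignment guaranteed for ordinary (non-over-aligned) uses of operator new:
    std::cout << __STDCPP_DEFAULT_NEW_ALIGNMENT__ << '\n';
    // Maximum fundamental alignment, which is also the guarantee for malloc:
    std::cout << alignof(std::max_align_t) << '\n';
}
On x86-64 Linux with GCC, both commonly print 16, matching the spacing seen in the question's output.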

There are actually two reasons. The first is that there are alignment requirements for some kinds of objects. Usually these requirements are soft: a misaligned access is "just" slower (possibly by orders of magnitude). They can also be hard: on the PPC, for instance, you simply could not access a vector in memory if that vector was not aligned to 16 bytes. Alignment is not something optional; it is something that must be considered when allocating memory. Always.
Note that there is no way to specify an alignment to malloc(). There's simply no argument for it. As such, malloc() must be implemented to provide a pointer that is correctly aligned for any purposes on the platform. The ::operator new() in C++ follows the same principle.
How much alignment is needed is fully platform dependent. On a PPC, there is no way you can get away with less than 16-byte alignment. x86 is a bit more lenient here, afaik.
The second reason is the inner workings of an allocator function. Typical implementations have an allocator overhead of at least 2 pointers: Whenever you request a byte from malloc() it will usually need to allocate space for at least two additional pointers to do its own bookkeeping (the exact amount depends on the implementation). On a 64 bit architecture, that's 16 bytes. As such, it is not sensible for malloc() to think in terms of bytes, it's more efficient to think in terms of 16 byte blocks. At least. You see that with your example code: The resulting pointers are actually 32 bytes apart. Each memory block occupies 16 bytes payload + 16 bytes internal bookkeeping memory.
Since the allocators request entire memory pages from the kernel (4096 bytes, 4096 bytes aligned!), the resulting memory blocks are naturally 16 bytes aligned on a 64 bit platform. It's simply not practical to provide less aligned memory allocations.
So, taking these two reasons together, it is both practical and required to provide strongly aligned memory blocks from an allocator function. The exact amount of alignment depends on the platform, but will usually not be less than the size of two pointers.

It's probably down to the way the memory allocator gets the necessary information to the deallocation function: the problem with a deallocation function (like free or the general, global operator delete) is that it takes exactly one argument, the pointer to the allocated memory, with no indication of the size of the block that was requested (or the size that was allocated, if it's larger), so that indication (and much more) needs to be provided in some other form to the deallocation function.
The simplest yet efficient approach is to allocate room for that additional information plus the requested bytes, and return a pointer to the end of the information block; let's call it IB. The size and alignment of IB automatically align the address returned by either malloc or operator new, even if you allocate a minuscule amount: the real amount allocated by malloc(s) is sizeof(IB) + s.
For such small allocations this approach is relatively wasteful, and other strategies might be used, but having multiple allocation methods complicates deallocation, as the function must first determine which method was used.
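A minimal sketch of that information-block scheme (hypothetical names; real allocators keep more bookkeeping and handle over-aligned requests differently):
#include <cstddef>
#include <cstdlib>

namespace sketch {

// Hypothetical information block kept immediately before the user's bytes.
struct alignas(std::max_align_t) IB {
    std::size_t size;   // the size that was requested
};

void* my_alloc(std::size_t s) {
    // Allocate room for the info block plus the payload. Because IB is padded
    // to the maximum fundamental alignment, the payload right after it is
    // automatically aligned as well.
    void* raw = std::malloc(sizeof(IB) + s);
    if (!raw) return nullptr;
    static_cast<IB*>(raw)->size = s;
    return static_cast<IB*>(raw) + 1;    // pointer just past the info block
}

void my_free(void* p) {
    if (!p) return;
    IB* ib = static_cast<IB*>(p) - 1;    // step back to the info block
    // ib->size is what the deallocation logic needs to know
    std::free(ib);
}

} // namespace sketch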

Why does this happen?
Because in the general case the library does not know what kind of data you are going to store in that memory, so it has to be aligned for the largest data type on that platform. If you store data unaligned you will take a significant hardware performance penalty, and on some platforms you will even get a segfault if you try to access the data unaligned.

It's due to the platform. On x86 it isn't necessary, but it improves the performance of the operations. As far as I know it doesn't make a difference on newer models, but the compiler goes for the optimum. When data is not aligned properly, for example a long that is not 4-byte aligned on an m68k processor, the program will crash.

It isn't. It depends on the OS/CPU requirements. In the case of the 32-bit versions of Linux/Win32, the allocated memory is always 8-byte aligned. In the case of the 64-bit versions of Linux/Win32, since all 64-bit CPUs have SSE2 at a minimum, it kinda made sense at the time to align all memory to 16 bytes (because working with SSE2 was less efficient when using unaligned memory). With the latest AVX-based CPUs this performance penalty for unaligned memory has been removed, so they really could allocate on any boundary.
If you think about it, aligning memory allocations to 16 bytes leaves the bottom 4 bits of the pointer address zero. This may be useful internally for storing some additional flags (e.g. readable, writable, executable, etc.).
At the end of the day, the reasoning is entirely dictated by the OS and/or hardware requirements. It's nothing to do with the language.
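As an illustration of that low-bits trick (a sketch with hypothetical helper functions; this is not part of any allocator API, and the tag must be stripped before the pointer is dereferenced):
#include <cstdint>

// With 16-byte alignment the low 4 bits of every allocation address are zero,
// so they can temporarily carry small flags.
constexpr std::uintptr_t kFlagMask = 0xF;

inline void* tag(void* p, unsigned flags) {        // flags must fit in 4 bits
    return reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(p) | (flags & kFlagMask));
}

inline void* untag(void* p) {                      // recover the real address
    return reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(p) & ~kFlagMask);
}

inline unsigned flags_of(void* p) {                // read back the stored flags
    return static_cast<unsigned>(
        reinterpret_cast<std::uintptr_t>(p) & kFlagMask);
}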

Related

Memory alignment, structs and malloc

It's a bit hard to formulate what I want to know in a single question, so I'll try to break it down.
For example purposes, let's say we have the following struct:
struct X {
    uint8_t  a;
    uint16_t b;
    uint32_t c;
};
Is it true that the compiler is guaranteed to never rearrange the order of X's members, only add padding where necessary? In other words, is it always true that offsetof(X, a) < offsetof(X, c)?
Is it true that the compiler will pick the largest alignment among X's members and use it to align objects of type X (i.e. addresses of X instances will be divisible by the largest alignment among X's members)?
Since malloc does not know anything about the type of objects that we're going to store when we allocate a buffer, how does it choose an alignment for the returned address? Does it simply return an address that is divisible by the largest alignment possible (in which case, no matter what structure we put in the buffer, the memory accesses will always be aligned)?
Yes
No, the compiler will use its knowledge of the target host hardware to select the best alignment.
See question 2.
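A quick way to check these answers on your own platform (a minimal sketch; the values mentioned in the comments assume a typical x86-64 ABI):
#include <cstddef>   // offsetof
#include <cstdint>
#include <iostream>

struct X {
    uint8_t  a;
    uint16_t b;
    uint32_t c;
};

int main() {
    // Member order is preserved; padding is inserted after 'a' (offsets 0, 2, 4 here).
    std::cout << offsetof(X, a) << ' ' << offsetof(X, b) << ' '
              << offsetof(X, c) << '\n';
    // On this ABI the struct's alignment matches its strictest member (4),
    // and sizeof(X) is rounded up to a multiple of that (8).
    std::cout << alignof(X) << ' ' << sizeof(X) << '\n';
}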
Since malloc does not know anything about the type of objects that we're going to store when we allocate a buffer, how does it choose an alignment for the returned address?
malloc(3) returns "memory that is suitably aligned for any kind of variable."
Does it simply return an address that is divisible by the largest alignment possible (in which case, no matter what structure we put in the buffer, the memory accesses will always be aligned)?
Yes, but watch your compliance with the strict aliasing rule.
The compiler will do whatever is most beneficial on that computer in the largest number of circumstances. On most platforms, loading bus-wide values at bus-width offsets is fastest.
That means that generally on 32-bit computers compilers will chose to align 32-bit numbers on 4 byte offsets. On 64-bit computers 64-bit values are aligned on 8 byte offsets.
On most computers smaller values like 8-bit and 16-bit values are slower to load. Probably all 4 or 8 bytes around it are loaded and the byte or two bytes you need are masked off.
When you have special circumstances you can override the compiler by specifying the alignment and the padding. You might do this when you know fast loading is not important, but you really want to pack the data tightly. Or when you are playing very subtle tricks with casting and unions.
Memory allocation routines on almost any modern computer will always return memory that is aligned on at least the bus width of the platform (e.g. 4 or 8 bytes) - or even more - like 16-byte alignment.
When you call "malloc" you are responsible for knowing the size of the structures you need. Luckily the compiler will tell you the size of any structure with "sizeof". That means that if you pack a structure to save memory, sizeof will return a smaller value than an unpacked structure. So you really will save memory - if you are allocating small structures in large arrays of them.
If you allocate small packed structures one at a time - then yes - whether you pack them or not won't make any difference. That is because when you allocate some odd small piece of memory, the allocator will actually use significantly more memory than that. It will allocate a conveniently sized block of memory for you, and then an additional block of memory for itself to keep track of your allocation.
Which is why, if you care about memory use and want to pack your structures, you definitely don't want to allocate them one at a time.
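As a compiler-specific illustration of the sizeof difference packing makes (a sketch using the GCC/Clang packed attribute; MSVC would use #pragma pack instead, and the exact numbers depend on the ABI):
#include <cstdint>
#include <iostream>

struct Unpacked {
    uint8_t  a;
    uint16_t b;
    uint32_t c;
};

// GCC/Clang extension; removes the padding between members.
struct __attribute__((packed)) Packed {
    uint8_t  a;
    uint16_t b;
    uint32_t c;
};

int main() {
    // Typically prints 8 and 7: packing saves memory in large arrays,
    // at the cost of unaligned member access.
    std::cout << sizeof(Unpacked) << ' ' << sizeof(Packed) << '\n';
}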

Why does Malloc() care about boundary alignments?

I've heard that malloc() aligns memory based on the type that is being allocated. For example, from the book Understanding and Using C Pointers:
The memory allocated will be aligned according to the pointer's data type. For example, a four-byte integer would be allocated on an address boundary evenly divisible by four.
If I follow, this means that
int *integer=malloc(sizeof(int)); will be allocated on an address boundary evenly divisible by four. Even without casting (int *) on malloc.
I was working on a chat server; I read of a similar effect with structs.
And I have to ask: logically, why does it matter what the address boundary itself is divisible on? What's wrong with allocating a group of memory to the tune of n*sizeof(int) using an integer on address 129?
I know how pointer arithmetic works *(integer+1), but I can't work out the importance of boundaries...
The memory allocated will be aligned according to the pointer's data type.
If you are talking about malloc, this is false. malloc doesn't care what you do with the data and will allocate memory aligned to fit the most stringent native type of the implementation.
From the standard:
The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated)
And:
Logically, why does it matter what the address boundary itself is divisible on
Due to the workings of the underlying machine, accessing unaligned data might be more expensive (e.g. x86) or illegal (e.g. arm). This lets the hardware take shortcuts that improve performance / simplify implementation.
In many processors, data that isn't aligned will cause a "trap" or "exception" (this is a different form of exception than those understood by the C++ compiler). Even on processors that don't trap when data isn't aligned, it is typically slower (twice as slow, for example) when the data is not correctly aligned. So it's in the compiler's/runtime library's best interest to ensure that things are nicely aligned.
And by the way, malloc (typically) doesn't know what you are allocating. Instead, malloc will align ALL data, no matter what size it is, to some suitable boundary that is "good enough" for general data access - typically 8 or 16 bytes in modern OS/processor combinations, 4 bytes in older systems.
This is because malloc won't know if you do char* p = malloc(1000); or double* p = malloc(1000);, so it has to assume you are storing double or whatever is the item with the largest alignment requirement.
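If you want to see that guarantee on your own system, a small check (a sketch; it demonstrates the alignment, it doesn't prove the general rule):
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>

int main() {
    for (std::size_t n = 1; n <= 8; ++n) {
        void* p = std::malloc(n);
        // Every returned address should be a multiple of the maximum
        // fundamental alignment, regardless of the requested size.
        std::cout << p << " % " << alignof(std::max_align_t) << " = "
                  << reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t)
                  << '\n';
        std::free(p);
    }
}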
The importance of alignment is not a language issue but a hardware issue. Some machines are incapable of reading a data value that is not properly aligned. Others can do it but do so less efficiently, e.g., requiring two reads to read one misaligned value.
The book quote is wrong; the memory returned by malloc is guaranteed to be aligned correctly for any type. Even if you write char *ch = malloc(37);, it is still aligned for int or any other type.
You seem to be asking "What is alignment?" If so, there are several questions on SO about this already, e.g. here, or a good explanation from IBM here.
It depends on the hardware. Even assuming int is 32 bits, malloc(sizeof(int)) could return an address divisible by 1, 2, or 4. Different processors handle unaligned access differently.
Processors don't read directly from RAM any more, that's too slow (it takes hundreds of cycles). So when they do grab RAM, they grab it in big chunks, like 64 bytes at a time. If your address isn't aligned, the 4-byte integer might straddle two 64-byte cache lines, so your processor has to do two loads and fix up the result. Or maybe the engineers decided that building the hardware to fix up unaligned loads isn't necessary, so the processor signals an exception: either your program crashes, or the operating system catches the exception and fixes up the operation (hundreds of wasted cycles).
Aligning addresses means your program plays nicely with hardware.
Because it's faster; most processors like data that is aligned. Some processors even CANNOT access data that is not aligned! (If you try to access such data, the processor may raise a fault.)

Is there a memory overhead associated with heap memory allocations (eg markers in the heap)?

Thinking in particular of C++ on Windows using a recent Visual Studio C++ compiler, I am wondering about the heap implementation:
Assuming that I'm using the release compiler, and I'm not concerned with memory fragmentation/packing issues, is there a memory overhead associated with allocating memory on the heap? If so, roughly how many bytes per allocation might this be?
Would it be larger in 64-bit code than 32-bit?
I don't really know a lot about modern heap implementations, but am wondering whether there are markers written into the heap with each allocation, or whether some kind of table is maintained (like a file allocation table).
On a related point (because I'm primarily thinking about standard-library features like 'map'), does the Microsoft standard-library implementation ever use its own allocator (for things like tree nodes) in order to optimize heap usage?
Yes, absolutely.
Every block of memory allocated will have a constant overhead of a "header", as well as a small variable part (typically at the end). Exactly how much that is depends on the exact C runtime library used. In the past, I've experimentally found it to be around 32-64 bytes per allocation. The variable part is to cope with alignment - each block of memory will be aligned to some nice even 2^n base-address - typically 8 or 16 bytes.
I'm not familiar with how the internal design of std::map or similar works, but I very much doubt they have special optimisations there.
You can quite easily test the overhead by:
#include <cstddef>
#include <iostream>

int main() {
    char* a = new char;
    char* b = new char;
    std::ptrdiff_t diff = a - b;   // see the note on undefined behaviour below
    // Cast to void* so the addresses are printed rather than read as C strings.
    std::cout << "a=" << (void*)a << " b=" << (void*)b << " diff=" << diff << '\n';
}
[Note to the pedants, which is probably most of the regulars here: the above a-b expression invokes undefined behaviour, since subtracting the address of one piece of allocated memory from the address of another is undefined behaviour. This is to cope with machines that don't have linear memory addresses, e.g. segmented memory or "different types of data are stored in locations based on their type". The above should definitely work on any x86-based OS that doesn't use a segmented memory model with multiple data segments for the heap - which means it works for Windows and Linux in 32- and 64-bit mode for sure.]
You may want to run it with varying types - just bear in mind that the diff is in "number of elements of the type", so if you make it int *a, *b the diff will be in four-byte units. You could use reinterpret_cast<char*>(a) - reinterpret_cast<char*>(b); to get the difference in bytes.
[diff may be negative, and if you run this in a loop (without deleting a and b), you may find sudden jumps where one large section of memory is exhausted and the runtime library allocates another large block]
Visual C++ embeds control information (links/sizes and possibly some checksums) near the boundaries of allocated buffers. That also helps to catch some buffer overflows during memory allocation and deallocation.
On top of that you should remember that malloc() needs to return pointers suitably aligned for all fundamental types (char, int, long long, double, void*, void(*)()) and that alignment is typically of the size of the largest type, so it could be 8 or even 16 bytes. If you allocate a single byte, 7 to 15 bytes can be lost to alignment only. I'm not sure if operator new has the same behavior, but it may very well be the case.
This should give you an idea. The precise memory waste can only be determined from the documentation (if any) or testing. The language standard does not define it in any terms.
Yes. All practical dynamic memory allocators have a minimal granularity [1]. For example, if the granularity is 16 bytes and you request only 1 byte, the whole 16 bytes is allocated nonetheless. If you ask for 17 bytes, a block whose size is 32 bytes is allocated, etc.
There is also a (related) issue of alignment. [2]
Quite a few allocators seem to be a combination of a size map and free lists - they split potential allocation sizes into "buckets" and keep a separate free list for each of them. Take a look at Doug Lea's malloc. There are many other allocation techniques with various tradeoffs, but that goes beyond the scope here...
[1] Typically 8 or 16 bytes. If the allocator uses a free list then it must encode two pointers inside every free slot, so a free slot cannot be smaller than 8 bytes (on 32-bit) or 16 bytes (on 64-bit). For example, if the allocator tried to split an 8-byte slot to satisfy a 4-byte request, the remaining 4 bytes would not have enough room to encode the free-list pointers.
[2] For example, if the long long on your platform is 8 bytes, then even if the allocator's internal data structures can handle blocks smaller than that, actually allocating the smaller block might push the next 8-byte allocation to an unaligned memory address.
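The rounding to such a granularity is typically a single masking step (a sketch, assuming the granularity is a power of two such as 16):
#include <cstddef>

// Round a request up to the allocator's granularity.  Only valid when
// 'granularity' is a power of two.
constexpr std::size_t round_up(std::size_t size, std::size_t granularity) {
    return (size + granularity - 1) & ~(granularity - 1);
}

static_assert(round_up(1, 16) == 16, "1 byte occupies a whole 16-byte block");
static_assert(round_up(17, 16) == 32, "17 bytes need two blocks");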

Smaller per allocation overhead in allocators library

I'm currently writing a memory management library for C++ that is based around the concept of allocators. It's relatively simple; for now, all allocators implement these two member functions:
virtual void * alloc( std::size_t size ) = 0;
virtual void dealloc( void * ptr ) = 0;
As you can see, I do not support alignment in the interface but that's actually my next step :) and the reason why I'm asking this question.
I want the allocators to be responsible for alignment because each one can be specialized. For example, the block allocator can only return block-sized aligned memory so it can handle failure and return NULL if a different alignment is asked for.
Some of my allocators are in fact sub-allocators. For example, one of them is a linear/sequential allocator that just pointer-bumps on allocation. This allocator is constructed by passing in a char * pBegin and char * pEnd and it allocates from within that region in memory. For now, it works great but I get stuff that is 1-byte aligned. It works on x86 but I heard it can be disastrous on other CPUs (consoles?). It's also somewhat slower for reads and writes on x86.
The only sane way I know of implementing aligned memory management is to allocate an extra sizeof(void*) + (alignment - 1) bytes and do pointer bit-masking to return the aligned address while keeping the original allocated address in the bytes before the user data (the void* bytes, see above).
OK, my question...
That overhead, per allocation, seems big to me. For 4-byte alignment, I would have 7 bytes of overhead on a 32-bit CPU and 11 bytes on a 64-bit one. That seems like a lot.
First, is it a lot? Am I on par with other memory management libs you might have used in the past or are currently using? I've looked into malloc and it seems to have a minimum of 16-byte overhead, is that right?
Do you know of a better way, smaller overhead, of returning aligned memory to my lib's users?
You could store an offset, rather than a pointer, which would only need to be large enough to store the largest supported alignment. A byte might even be sufficient if you only support smallish alignments.
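A sketch of that offset-based scheme (hypothetical helper names; it assumes a power-of-two alignment small enough for the offset to fit in one byte):
#include <cstdint>
#include <cstdlib>

// Over-allocate, align the result, and record how far we moved in the byte
// just before the returned address, so the deallocation can undo the shift.
void* aligned_alloc_offset(std::size_t size, std::size_t alignment) {
    unsigned char* raw =
        static_cast<unsigned char*>(std::malloc(size + alignment));
    if (!raw) return nullptr;
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw) + 1;   // reserve one byte
    std::uintptr_t aligned = (addr + alignment - 1) & ~(std::uintptr_t(alignment) - 1);
    unsigned char offset =
        static_cast<unsigned char>(aligned - reinterpret_cast<std::uintptr_t>(raw));
    reinterpret_cast<unsigned char*>(aligned)[-1] = offset;            // stash the offset
    return reinterpret_cast<void*>(aligned);
}

void aligned_dealloc_offset(void* p) {
    unsigned char* user = static_cast<unsigned char*>(p);
    std::free(user - user[-1]);   // step back by the recorded offset
}
With this layout the worst-case per-allocation overhead drops from sizeof(void*) + alignment - 1 bytes to alignment bytes, since the stored offset lives inside the padding itself.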
How about you implement a buddy system which can be x-byte aligned depending on your requirement.
General Idea:
When your lib is initialized, allocate a big chunk of memory. For our example let's assume 16B. (Only this block needs to be aligned; the algorithm will not require you to align any other block.)
Maintain lists for memory chunks of powers of 2, i.e. 4B, 8B, 16B, ... 64KB, ... 1MB, 2MB, ... 512MB.
If a user asks for 8B of data, check the list for 8B; if none is available, take a block from the 16B list and split it into two 8B blocks. Give one back to the user and append the other to the 8B list (a sketch of this size-to-list mapping follows the pros and cons below).
If the user asks for 16B, check whether you have at least two 8B blocks available. If yes, combine them and give the result back to the user. If not, the system does not have enough memory.
Pros:
No internal or external fragmentation.
No alignment required.
Fast access to memory chunks as they are pre-allocated.
If the list is an array, direct access to memory chunks of different size
Cons:
Overhead for list of memory.
If the list is a linked-list, traversal would be slow.
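A sketch of the size-to-list mapping mentioned above (assuming the smallest block is 4 bytes and all block sizes are powers of two):
#include <cstddef>

// Map a request to the index of the free list whose blocks are big enough:
// list 0 holds 4-byte blocks, list 1 holds 8-byte blocks, and so on.
std::size_t bucket_index(std::size_t size) {
    std::size_t block = 4;      // smallest supported block size
    std::size_t index = 0;
    while (block < size) {      // round up to the next power of two
        block *= 2;
        ++index;
    }
    return index;
}
// bucket_index(1)  == 0  (4-byte block)
// bucket_index(8)  == 1  (8-byte block)
// bucket_index(17) == 3  (32-byte block)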

C++ Memory alignment in custom stack allocator

Usually data is aligned at power of two addresses depending on its size.
How should I align a struct or class with size of 20 bytes or another non-power-of-two size?
I'm creating a custom stack allocator, so I guess that the compiler won't align data for me since I'm working with a contiguous block of memory.
Some more context:
I have an Allocator class that uses malloc() to allocate a large amount of data.
Then I use a void* allocate(U32 size_of_object) method to return a pointer to where I can store whatever objects I need to store.
This way all objects are stored in the same region of memory, and it will hopefully fit in the cache, reducing cache misses.
C++11 has the alignof operator specifically for this purpose. Don't use any of the tricks mentioned in other posts, as they all have edge cases or may fail for certain compiler optimisations. The alignof operator is implemented by the compiler and knows the exact alignment being used.
See this description of c++11's new alignof operator
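For instance, a small sketch with a hypothetical 20-byte type shows that the alignment the compiler actually requires is normally much smaller than the object's size:
#include <iostream>

struct S20 {
    int v[5];   // 20 bytes on platforms with 4-byte int
};

int main() {
    // alignof reports what the compiler requires for S20, which is typically
    // the alignment of its strictest member (4 here), not 20.
    std::cout << sizeof(S20) << ' ' << alignof(S20) << '\n';
}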
Although the compiler (or interpreter) normally allocates individual data items on aligned boundaries, data structures often have members with different alignment requirements. To maintain proper alignment the translator normally inserts additional unnamed data members so that each member is properly aligned. In addition the data structure as a whole may be padded with a final unnamed member. This allows each member of an array of structures to be properly aligned. (http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86)
This says that the compiler takes care of it for you, 99.9% of the time. As for how to force an object to align a specific way, that is compiler specific, and only works in certain circumstances.
MSVC: http://msdn.microsoft.com/en-us/library/83ythb65.aspx
// MSVC only accepts power-of-two alignments, so 20 has to be rounded up to 32:
__declspec(align(32))
struct S { int a, b, c, d; };
GCC: http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Type-Attributes.html
// GCC likewise requires a power-of-two alignment:
struct S { int a, b, c, d; }
__attribute__ ((aligned (32)));
I don't know of a cross-platform way (including macros!) to do this, but there's probably a neat macro somewhere.
Unless you want to access memory directly, or squeeze maximum data into a block of memory, you don't need to worry about alignment -- the compiler takes care of that for you.
Due to the way processor data buses work, what you want to avoid is 'mis-aligned' access. Usually you can read a 32 bit value in a single access from addresses which are multiples of four; if you try to read it from an address that's not such a multiple, the CPU may have to grab it in two or more pieces. So if you're really worrying about things at this level of detail, what you need to be concerned about is not so much the overall struct, as the pieces within it. You'll find that compilers will frequently pad out structures with dummy bytes to ensure aligned access, unless you specifically force them not to with a pragma.
Since you've now added that you actually want to write your own allocator, the answer is straight-forward: Simply ensure that your allocator returns a pointer whose value is a multiple of the requested size. The object's size itself will already come suitably adjusted (via internal padding) so that all member objects themselves are properly aligned, so if you request sizeof(T) bytes, all your allocator needs to do is to return a pointer whose value is divisible by sizeof(T).
If your object does indeed have size 20 (as reported by sizeof), then you have nothing further to worry about. (On a 64-bit platform, the object would probably be padded to 24 bytes.)
Update: In fact, as I only now came to realize, strictly speaking you only need to ensure that the pointer is aligned, recursively, for the largest member of your type. That may be more efficient, but aligning to the size of the entire type is definitely not getting it wrong.
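A minimal sketch of a bump (linear) allocator that takes the alignment explicitly (hypothetical interface; a real one would also need a rewind/reset operation and a policy for running out of memory):
#include <cstddef>
#include <cstdint>

class LinearAllocator {
public:
    LinearAllocator(char* begin, char* end) : cur_(begin), end_(end) {}

    // 'alignment' must be a power of two, e.g. alignof(T) for the object
    // that will be constructed in the returned storage.
    void* allocate(std::size_t size, std::size_t alignment) {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(cur_);
        std::uintptr_t aligned = (p + alignment - 1) & ~(std::uintptr_t(alignment) - 1);
        if (aligned + size > reinterpret_cast<std::uintptr_t>(end_))
            return nullptr;                        // region exhausted
        cur_ = reinterpret_cast<char*>(aligned + size);
        return reinterpret_cast<void*>(aligned);
    }

private:
    char* cur_;
    char* end_;
};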
How should I align a struct or class with size of 20 bytes or another non-power-of-two size?
Alignment is CPU-specific, so there is no answer to this question without, at least, knowing the target CPU.
Generally speaking, alignment isn't something that you have to worry about; your compiler will have the rules implemented for you. It does come up once in a while, like when writing an allocator. The classic solution is discussed in The C Programming Language (K&R): use the worst possible alignment. malloc does this, although it's phrased as, "the pointer returned if the allocation succeeds shall be suitably aligned so that it may be assigned to a pointer to any type of object."
The way to do that is to use a union (the elements of a union are all allocated at the union's base address, and the union must therefore be aligned in such a way that each element could exist at that address; i.e., the union's alignment will be the same as the alignment of the element with the strictest rules):
typedef long Align;

union header {
    // the inner struct has the important bookkeeping info
    struct {
        unsigned size;
        header* next;
    } s;
    // the align member only exists to make sure headers are always allocated
    // using the alignment of a long, which is probably the worst alignment
    // for the target architecture ("worst" == "strictest"; something that meets
    // the worst alignment will also meet all less strict alignment requirements)
    Align align;
};
Memory is allocated by creating an array (using something like sbrk()) of headers large enough to satisfy the request, plus one additional header element that actually contains the bookkeeping information. If the array is called arry, the bookkeeping information is at arry[0], while the pointer returned points at arry[1] (the next member is meant for walking the free list).
This works, but can lead to wasted space ("In Sun's HotSpot JVM, object storage is aligned to the nearest 64-bit boundary"). I'm aware of a better approach that tries to get a type-specific alignment instead of "the alignment that will work for anything."
Compilers also often have compiler-specific commands. They aren't standard, and they require that you know the correct alignment requirements for the types in question. I would avoid them.