Why does Malloc() care about boundary alignments?

Why does Malloc() care about boundary alignments? - c++

I've heard that malloc() aligns memory based on the type that is being allocated. For example, from the book Understanding and Using C Pointers:
The memory allocated will be aligned according to the pointer's data type. Fore example, a four-byte integer would be allocated on an address boundary evenly divisible by four.
If I follow, this means that
int *integer=malloc(sizeof(int)); will be allocated on an address boundary evenly divisible by four. Even without casting (int *) on malloc.
I was working on a chat server; I read of a similar effect with structs.
And I have to ask: logically, why does it matter what the address boundary itself is divisible on? What's wrong with allocating a group of memory to the tune of n*sizeof(int) using an integer on address 129?
I know how pointer arithmetic works *(integer+1), but I can't work out the importance of boundaries...

The memory allocated will be aligned according to the pointer's data
type.
If you are talking about malloc, this is false. malloc doesn't care what you do with the data and will allocate memory aligned to fit the most stringent native type of the implementation.
From the standard:
The pointer returned if the allocation succeeds is suitably aligned so
that it may be assigned to a pointer to any type of object with a
fundamental alignment requirement and then used to access such an
object or an array of such objects in the space allocated (until the
space is explicitly deallocated)
And:
Logically, why does it matter what the address boundary itself is
divisible on
Due to the workings of the underlying machine, accessing unaligned data might be more expensive (e.g. x86) or illegal (e.g. arm). This lets the hardware take shortcuts that improve performance / simplify implementation.

In many processors, data that isn't aligned will cause a "trap" or "exception" (this is a different form of exception than those understood by the C++ compiler. Even on processors that don't trap when data isn't aligned, it is typically slower (twice as slow, for example) when the data is not correctly aligned. So it's in the compiler's/runtime library's best interest to ensure that things are nicely aligned.
And by the way, malloc (typically) doesn't know what you are allocating. Insteat, malloc will align ALL data, no matter what size it is, to some suitable boundary that is "good enough" for general data-access - typically 8 or 16 bytes in modern OS/processor combinations, 4 bytes in older systems.
This is because malloc won't know if you do char* p = malloc(1000); or double* p = malloc(1000);, so it has to assume you are storing double or whatever is the item with the largest alignment requirement.

The importance of alignment is not a language issue but a hardware issue. Some machines are incapable of reading a data value that is not properly aligned. Others can do it but do so less efficiently, e.g., requiring two reads to read one misaligned value.

The book quote is wrong; the memory returned by malloc is guaranteed to be aligned correctly for any type. Even if you write char *ch = malloc(37);, it is still aligned for int or any other type.
You seem to be asking "What is alignment?" If so, there are several questions on SO about this already, e.g. here, or a good explanation from IBM here.

It depends on the hardware. Even assuming int is 32 bits, malloc(sizeof(int)) could return an address divisible by 1, 2, or 4. Different processors handle unaligned access differently.
Processors don't read directly from RAM any more, that's too slow (it takes hundreds of cycles). So when they do grab RAM, they grab it in big chunks, like 64 bytes at a time. If your address isn't aligned, the 4-byte integer might straddle two 64-byte cache lines, so your processor has to do two loads and fix up the result. Or maybe the engineers decided that building the hardware to fix up unaligned loads isn't necessary, so the processor signals an exception: either your program crashes, or the operating system catches the exception and fixes up the operation (hundreds of wasted cycles).
Aligning addresses means your program plays nicely with hardware.

Because it's more fast; Most processor likes data which is aligned. Even, Some processor CANNOT access data which is not aligned! (If you try to access this data, processor may occur fault)

Related

Why is dynamically allocated memory always 16 bytes aligned?

I wrote a simple example:
#include <iostream>
int main() {
void* byte1 = ::operator new(1);
void* byte2 = ::operator new(1);
void* byte3 = malloc(1);
std::cout << "byte1: " << byte1 << std::endl;
std::cout << "byte2: " << byte2 << std::endl;
std::cout << "byte3: " << byte3 << std::endl;
return 0;
}
Running the example, I get the following results:
byte1: 0x1f53e70
byte2: 0x1f53e90
byte3: 0x1f53eb0
Each time I allocate a single byte of memory, it's always 16 bytes aligned. Why does this happen?
I tested this code on GCC 5.4.0 as well as GCC 7.4.0, and got the same results.

Why does this happen?
Because the standard says so. More specifically, it says that the dynamic allocations1 are aligned to at least the maximum fundamental2 alignment (it may have stricter alignment). There is a pre-defined macro (since C++17) just for the purpose of telling you exactly what this guaranteed alignment is: __STDCPP_DEFAULT_NEW_ALIGNMENT__. Why this might be 16 in your example... that is a choice of the language implementation, restricted by what is allowed by the target hardware architecture.
This is (was) a necessary design, considering that there is (was) no way to pass information about the needed alignment to the allocation function (until C++17 which introduced aligned-new syntax for the purpose of allocating "over-aligned" memory).
malloc doesn't know anything about the types of objects that you intend to create into the memory. One might think that new could in theory deduce the alignment since it is given a type... but what if you wanted to reuse that memory for other objects with stricter alignment, like for example in implementation of std::vector? And once you know the API of the operator new: void* operator new ( std::size_t count ), you can see that the type or its alignment are not an argument that could affect the alignment of the allocation.
1 Made by the default allocator, or malloc family of functions.
2 The maximum fundamental alignment is alignof(std::max_align_t). No fundamental type (arithmetic types, pointers) has stricter alignment than this.

There are actually two reasons. The first reason is, that there are some alignment requirements for some kinds of objects. Usually, these alignment requirements are soft: A misaligned access is "just" slower (possibly by orders of magnitude). They can also be hard: On the PPC, for instance, you simply could not access a vector in memory if that vector was not aligned to 16 bytes. Alignment is not something optional, it is something that must be considered when allocating memory. Always.
Note that there is no way to specify an alignment to malloc(). There's simply no argument for it. As such, malloc() must be implemented to provide a pointer that is correctly aligned for any purposes on the platform. The ::operator new() in C++ follows the same principle.
How much alignment is needed is fully platform dependent. On a PPC, there is no way that you can get away with less than 16 bytes alignment. X86 is a bit more lenient in this, afaik.
The second reason is the inner workings of an allocator function. Typical implementations have an allocator overhead of at least 2 pointers: Whenever you request a byte from malloc() it will usually need to allocate space for at least two additional pointers to do its own bookkeeping (the exact amount depends on the implementation). On a 64 bit architecture, that's 16 bytes. As such, it is not sensible for malloc() to think in terms of bytes, it's more efficient to think in terms of 16 byte blocks. At least. You see that with your example code: The resulting pointers are actually 32 bytes apart. Each memory block occupies 16 bytes payload + 16 bytes internal bookkeeping memory.
Since the allocators request entire memory pages from the kernel (4096 bytes, 4096 bytes aligned!), the resulting memory blocks are naturally 16 bytes aligned on a 64 bit platform. It's simply not practical to provide less aligned memory allocations.
So, taken these two reasons together, it is both practical and required to provide seriously aligned memory blocks from an allocator function. The exact amount of alignment depends on the platform, but will usually not be less than the size of two pointers.

It's probably the way the memory allocator manages to get the necessary information to the deallocation function: the issue of the deallocation function (like free or the general, global operator delete) is that there is exactly one argument, the pointer to the allocated memory and no indication of the size of the block that was requested (or the size that was allocated if it's larger), so that indication (and much more) needs to be provided in some other form to the deallocation function.
The most simple yet efficient approach is to allocate room for that additional information plus the requested bytes, and return a pointer to the end of the information block, let's call it IB. The size and alignment of IB automatically aligns the address returned by either malloc or operator new, even if you allocate a minuscule amount: the real amount allocated by malloc(s) is sizeof(IB)+s.
For such small allocations the approach is relatively wasteful and other strategies might be used, but having multiple allocation methods complicate deallocation as the function must first determine which method was used.

Why does this happens?
Because in general case library does not know what kind of data you are going to store in that memory so it has to be aligned to the biggest data type on that platform. And if you store data unaligned you will get significant penalty of hardware performance. On some platforms you will even get segfault if you try to access data unaligned.

Due to the platform. On X86 it isn't necessary but gains performance of the operations. As I know on newer models it doesn't make a difference but compiler goes for the optimum. When not aligned properly for example a long not aligned 4 byte on a m68k processor will crash.

It isn't. It depends on the OS/CPU requirements. In the case of 32bit version of linux/win32, the allocated memory is always 8 byte aligned. In the case of 64bit versions of linux/win32, since all 64bit CPUs have SSE2 at a minimum, it kinda made sense at the time to align all memory to 16bytes (because working with SSE2 was less efficient when using unaligned memory). With the latest AVX based CPUs, this performance penalty for unaligned memory has been removed, so really they could allocate on any boundary.
If you think of it, aligning the addresses for memory allocations to 16bytes gives you 4bits of blank space in the pointer address. This may be useful internally for storing some additional flags (e.g. readable, writable, executable, etc).
At the end of the day, the reasoning is entirely dictated by the OS and/or hardware requirements. It's nothing to do with the language.

Memory alignment, structs and malloc

It's a bit hard to formulate what I want to know in a single question, so I'll try to break it down.
For example purposes, let's say we have the following struct:
struct X {
uint8_t a;
uint16_t b;
uint32_t c;
};
Is it true that the compiler is guaranteed to never rearrange the order of X's members, only add padding where necessary? In other words, is it always true that offsetof(X, a) < offsetof(X, c)?
Is it true that the compiler will pick the largest alignment among X's members and use it to align objects of type X (i.e. addresses of X instances will be divisible by the largest alignment among X's members)?
Since malloc does not know anything about the type of objects that we're going to store when we allocate a buffer, how does it choose an alignment for the returned address? Does it simply return an adress that is divisible by the largest alignment possible (in which case, no matter what structure we put in the buffer, the memory accesses will always be aligned)?

Yes
No, the compiler will use its knowledge of the target host hardware to select the best alignment.
See question 2.

Since malloc does not know anything about the type of objects that we're going to store when we allocate a buffer, how does it choose an alignment for the returned address?
malloc(3) returns "memory that is suitably aligned for any kind of variable."
Does it simply return an adress that is divisible by the largest alignment possible (in which case, no matter what structure we put in the buffer, the memory accesses will always be aligned)?
Yes, but watch your compliance with the strict aliasing rule.

The compiler will do whatever is most beneficial on that computer in the largest number of circumstances. On most platforms loading bus wide values on bus width offsets is fastest.
That means that generally on 32-bit computers compilers will chose to align 32-bit numbers on 4 byte offsets. On 64-bit computers 64-bit values are aligned on 8 byte offsets.
On most computers smaller values like 8-bit and 16-bit values are slower to load. Probably all 4 or 8 bytes around it are loaded and the byte or two bytes you need are masked off.
When you have special circumstances you can override the compiler by specifying the alignment and the padding. You might do this when you know fast loading is not important, but you really want to pack the data tightly. Or when you are playing very subtle tricks with casting and unions.
Memory allocation routines on almost any modern computer will always return memory that is aligned on at least the bus width of the platform ( e.g. 4 or 8 bytes ) - or even more - like 16 byte alignment.
When you call "malloc" you are responsible for knowing the size of the structures you need. Luckily the compiler will tell you the size of any structure with "sizeof". That means that if you pack a structure to save memory, sizeof will return a smaller value than an unpacked structure. So you really will save memory - if you are allocating small structures in large arrays of them.
If you allocate small packed structures one at a time - then yes - if you pack them or not it won't make any difference. That is because when you allocate some odd small piece of memory - the allocator will actually use significantly more memory than that. It will allocate an convenient sized block of memory for you, and then an additional block of memory for itself to keep track of you allocation.
Which is why if you care about memory use and want to pack your structures - you definitely don't want to allocate them one at time.

Address of start block storage

The allocation function attempts to allocate the requested amount of
storage.
If it is successful, it shall return the address of the start of a
block of storage whose length in bytes shall be at least as large as
the requested size.
What does that constraint mean? Could you get an example it violates?
It seems, my question is unclear.
UPD:
Why "at least"? What is the point of allocation more than requested size? Could you get suitable example?

The allowance for allocating "more than required" is there to allow:
Good alignment of the next block of data.
Reduce restrictions on what platforms are able to run code compiled from C and C++.
Flexibility in the design of the memory allocation functionality.
An example of point one is:
char *p1 = new char[1];
int *p2 = new int[1];
If we allocate exactly 1 byte at address 0x1000 for the first allocation, and follow that exactly with a second allocation of 4 bytes for an int, the int will start at address 0x1001. This is "valid" on some architectures, but often leads to a "slower load of the value", on other architectures it will directly lead to a crash, because an int is not accessible on an address that isn't an even multiple of 4. Since the underlying architecture of new doesn't actually know what the memory is eventually going to be used for, it's best to allocate it at "the highest alignment", which in most architectures means 8 or 16 bytes. (If the memory is used for example to store SSE data, it will need an alignment of 16 bytes)
The second case would be where "pointers can only point to whole blocks of 32-bit words". There have been architectures like that in the past. In this case, even if we ignore the above problem with alignment, the memory location specified by a generic pointer is two parts, one for the actual address, and one for the "which byte within that word". In the memory allocator, since typical allocations are much larger than a single byte, we decide to only use the "full word" pointer, so all allocations are by design always rounded up to whole words.
The third case, for example, would be to use a "pre-sized block" allocator. Some real-time OS's for example will have a fixed number of predefined sizes that they allocate - for example 16, 32, 64, 256, 1024, 16384, 65536, 1M, 16M bytes. Allocations are then rounded up to the nearest equal or larger size, so an allocation for 257 bytes would be allocated from the 1024 size. The idea here is to a) provide fast allocation, by keeping track of free blocks in each size, rather than the traditional model of having a large number of blocks in any size to search through to see if there is a big enough block. It also helps against fragmentation (when lots of memory is "free", but the wrong size, so can't be used - for example, if run a loop until the system is out of memory that allocates blocks of 64 bytes, then free every other, and try to allocate a 128 byte block, there is not a single 128 byte block free, because ALL of the memory is carved up into little 64-byte sections).

It means that the allocation function shall return an address of a block of memory whose size is at least the size you requested.
Most of the allocation functions, however, shall return an address of a memory block whose size is bigger than the one you requested and the next allocations will return an address inside this block until it reaches the end of it.
The main reasons for this behavior are:
Minimize the number of new memory block allocations (each block can contain several allocations), which are expensive in terms of time complexity.
Specific alignment issues.

The two most common reasons why allocation
returns a block larger than requested is
Alignment
Bookkeeping

It may be such because in modern operating systems it is much effective to allocate memory page that can be of 512 kb. So internal malloc functionality can just alloc this page of memory, fill its beginning with some service information as to in how many sub-blocks it's divided into, theirs sizes, etc. And only after that it returns to you an address suitable for your needs. Next call to malloc will return another portion of this allocated page for instance. It doesn't matter that the block of memory is restricted to the size you requested. In fact you can heap-overflow this buffer because of no safe mechanism to prevent this sort of activity. Also you can consider alignment questions that other responders stated above. There are enough of types of memory management. You can google it if you are interested enough(Best Fit, First Fit, Last Fit, correct me if I'm wrong).

Is there a memory overhead associated with heap memory allocations (eg markers in the heap)?

Thinking in particular of C++ on Windows using a recent Visual Studio C++ compiler, I am wondering about the heap implementation:
Assuming that I'm using the release compiler, and I'm not concerned with memory fragmentation/packing issues, is there a memory overhead associated with allocating memory on the heap? If so, roughly how many bytes per allocation might this be?
Would it be larger in 64-bit code than 32-bit?
I don't really know a lot about modern heap implementations, but am wondering whether there are markers written into the heap with each allocation, or whether some kind of table is maintained (like a file allocation table).
On a related point (because I'm primarily thinking about standard-library features like 'map'), does the Microsoft standard-library implementation ever use its own allocator (for things like tree nodes) in order to optimize heap usage?

Yes, absolutely.
Every block of memory allocated will have a constant overhead of a "header", as well as a small variable part (typically at the end). Exactly how much that is depends on the exact C runtime library used. In the past, I've experimentally found it to be around 32-64 bytes per allocation. The variable part is to cope with alignment - each block of memory will be aligned to some nice even 2^n base-address - typically 8 or 16 bytes.
I'm not familiar with how the internal design of std::map or similar works, but I very much doubt they have special optimisations there.
You can quite easily test the overhead by:
char *a, *b;
a = new char;
b = new char;
ptrdiff_t diff = a - b;
cout << "a=" << a << " b=" << b << " diff=" << diff;
[Note to the pedants, which is probably most of the regulars here, the above a-b expression invokes undefined behaviour, since subtracting the address of one piece of allocated and the address of another, is undefined behaviour. This is to cope with machines that don't have linear memory addresses, e.g. segmented memory or "different types of data is stored in locations based on their type". The above should definitely work on any x86-based OS that doesn't use a segmented memory model with multiple data segments in for the heap - which means it works for Windows and Linux in 32- and 64-bit mode for sure].
You may want to run it with varying types - just bear in mind that the diff is in "number of the type, so if you make it int *a, *b will be in "four bytes units". You could make a reinterpret_cast<char*>(a) - reinterpret_cast<char *>(b);
[diff may be negative, and if you run this in a loop (without deleting a and b), you may find sudden jumps where one large section of memory is exhausted, and the runtime library allocated another large block]

Visual C++ embeds control information (links/sizes and possibly some checksums) near the boundaries of allocated buffers. That also helps to catch some buffer overflows during memory allocation and deallocation.
On top of that you should remember that malloc() needs to return pointers suitably aligned for all fundamental types (char, int, long long, double, void*, void(*)()) and that alignment is typically of the size of the largest type, so it could be 8 or even 16 bytes. If you allocate a single byte, 7 to 15 bytes can be lost to alignment only. I'm not sure if operator new has the same behavior, but it may very well be the case.
This should give you an idea. The precise memory waste can only be determined from the documentation (if any) or testing. The language standard does not define it in any terms.

Yes. All practical dynamic memory allocators have a minimal granularity1. For example, if the granularity is 16 bytes and you request only 1 byte, the whole 16 bytes is allocated nonetheless. If you ask for 17 bytes, a block whose size is 32 bytes is allocated etc...
There is also a (related) issue of alignment.2
Quite a few allocators seem to be a combination of a size map and free lists - they split potential allocation sizes to "buckets" and keep a separate free list for each of them. Take a look at Doug Lea's malloc. There are many other allocation techniques with various tradeoffs but that goes beyond the scope here...
1 Typically 8 or 16 bytes. If the allocator uses a free list then it must encode two pointers inside every free slot, so a free slot cannot be smaller than 8 bytes (on 32-bit) or 16 byte (on 16-bit). For example, if allocator tried to split a 8-byte slot to satisfy a 4-byte request, the remaining 4 bytes would not have enough room to encode the free list pointers.
2 For example, if the long long on your platform is 8 bytes, then even if the allocator's internal data structures can handle blocks smaller than that, actually allocating the smaller block might push the next 8-byte allocation to an unaligned memory address.

why does size of the struct need to be a multiple of the largest alignment of any struct member

I understand the padding that takes place between the members of a struct to ensure correct alignment of individual types. However, why does the data structure have to be a multiple of alignment of largest member? I don't understand the padding is needed at the end.
Reference:
http://en.wikipedia.org/wiki/Data_structure_alignment

Good question. Consider this hypothetical type:
struct A {
int n;
bool flag;
};
So, an object of type A should take five bytes (four for the int plus one for the bool), but in fact it takes eight. Why?
The answer is seen if you use the type like this:
const size_t N = 100;
A a[N];
If each A were only five bytes, then a[0] would align but a[1], a[2] and most of the other elements would not.
But why does alignment even matter? There are several reasons, all hardware-related. One reason is that recently/frequently used memory is cached in cache lines on the CPU silicon for rapid access. An aligned object smaller than a cache line always fits in a single line (but see the interesting comments appended below), but an unaligned object may straddle two lines, wasting cache.
There are actually even more fundamental hardware reasons, having to do with the way byte-addressable data is transferred down a 32- or 64-bit data bus, quite apart from cache lines. Not only will misalignment clog the bus with extra fetches (due as before to straddling), but it will also force registers to shift bytes as they come in. Even worse, misalignment tends to confuse optimization logic (at least, Intel's optimization manual says that it does, though I have no personal knowledge of this last point). So, misalignment is very bad from a performance standpoint.
It usually is worth it to waste the padding bytes for these reasons.
Update: The comments below are all useful. I recommend them.

Depending on the hardware, alignment might be necessary or just help speeding up execution.
There is a certain number of processors (ARM I believe) in which an unaligned access leads to a hardware exception. Plain and simple.
Even though typical x86 processors are more lenient, there is still a penalty in accessing unaligned fundamental types, as the processor has to do more work to bring the bits into the register before being able to operate on it. Compilers usually offer specific attributes/pragmas when packing is desirable nonetheless.

Because of virtual addressing.
"...aligning a page on a page-sized boundary lets the
hardware map a virtual address to a physical address by substituting
the higher bits in the address, rather than doing complex arithmetic."
By the way, I found the Wikipedia page on this quite well written.

If the register size of the CPU is 32 bits, then it can grab memory that is on 32 bit boundaries with a single assembly instruction. It is slower to grab 32 bits, and then get the byte that starts at bit 8.
BTW: There doesn't have to be padding. You can ask that structures be packed.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js