What does the alignment parameter for Boost aligned_allocator mean?

There is a Boost tutorial giving approximately the following code, slightly modified for my question:
#include <boost/align/aligned_allocator.hpp>
#include <vector>

int main()
{
    std::vector<int, boost::alignment::aligned_allocator<int, 16> > v(100);
}
In this example, an alignment parameter of 16 is given. Does this indicate 16 bytes of alignment, or 16*sizeof(int) bytes of alignment?

It would represent 16 bytes of alignment.
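To make the byte interpretation concrete, here is a minimal standard-library-only sketch (it assumes C++17 for aligned operator new). `ByteAlignedAllocator`, `is_aligned` and `vector_storage_is_16_byte_aligned` are hypothetical names invented for this illustration. This is not how Boost.Align is implemented; it only mirrors the meaning of the second template argument as a byte count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical helper: true if the address p is a multiple of `bytes`.
inline bool is_aligned(const void* p, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}

// Hypothetical minimal allocator. Alignment is a byte count (not a count
// of T objects), like the second template argument in the question.
template <class T, std::size_t Alignment>
struct ByteAlignedAllocator {
    static_assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0,
                  "alignment must be a power of two");

    using value_type = T;

    // Explicit rebind is needed because the template has a non-type parameter.
    template <class U>
    struct rebind { using other = ByteAlignedAllocator<U, Alignment>; };

    ByteAlignedAllocator() = default;
    template <class U>
    ByteAlignedAllocator(const ByteAlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // C++17 aligned operator new: the alignment is given in bytes.
        void* p = ::operator new(n * sizeof(T), std::align_val_t{Alignment});
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }
};

template <class T, class U, std::size_t A>
bool operator==(const ByteAlignedAllocator<T, A>&,
                const ByteAlignedAllocator<U, A>&) noexcept { return true; }

template <class T, class U, std::size_t A>
bool operator!=(const ByteAlignedAllocator<T, A>&,
                const ByteAlignedAllocator<U, A>&) noexcept { return false; }

// Builds a vector like the one in the question and checks its storage.
inline bool vector_storage_is_16_byte_aligned() {
    std::vector<int, ByteAlignedAllocator<int, 16>> v(100);
    assert(v.size() == 100);
    return is_aligned(v.data(), 16);  // 16 bytes, not 16 * sizeof(int)
}
```

Calling `vector_storage_is_16_byte_aligned()` from `main` should return true on a conforming C++17 implementation, since the request is for a 16-byte boundary regardless of `sizeof(int)`.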
On some processors, access to a non-aligned memory address can result in an exception. On others, a non-aligned memory access might work, but may be suboptimal, possibly requiring extra reads of memory at aligned addresses. The actual alignment needed or desired varies depending on context.
For example, on a 32-bit x86 processor a 32-bit (4 byte) non-aligned access can result in two aligned memory accesses. If a 4 byte read was done at address 1, the processor may need to read bytes 0..3, followed by a read of bytes 4..7, and then combine bytes 1..4 into the result, discarding the extra data read.
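The arithmetic in that example can be written down as a small sketch. `aligned_words_touched` is a hypothetical helper for illustration only; it counts how many aligned units of a given width an access overlaps and says nothing about what a particular CPU actually does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of aligned `word`-byte units overlapped by an
// access of `size` bytes that starts at byte address `addr`.
inline std::uint64_t aligned_words_touched(std::uint64_t addr,
                                           std::uint64_t size,
                                           std::uint64_t word) {
    assert(size > 0 && word > 0);
    std::uint64_t first = addr / word;              // unit holding the first byte
    std::uint64_t last  = (addr + size - 1) / word; // unit holding the last byte
    return last - first + 1;
}
```

For the case in the text, `aligned_words_touched(1, 4, 4)` is 2 (bytes 0..3 and 4..7), while an aligned read such as `aligned_words_touched(4, 4, 4)` is 1.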
For SIMD instructions the alignment is greater. A 64-bit MMX instruction should access memory that is 64-bit (8 byte) aligned. A 128-bit XMM instruction should access memory that is 128-bit (16 byte) aligned.
On a SPARC processor an unaligned memory access would result in a processor exception. I believe ARM also generates exceptions for unaligned access. On x86 you can also get exceptions in some cases. In particular, a stack that is not properly aligned can crash the program, although the compiler usually takes care of that detail.

The number 16 refers to the number of bytes. From the Boost.Align documentation (which uses the same terminology as the C++ Standard):
[basic.align]
Object types have alignment requirements which place
restrictions on the addresses at which an object of that type may be
allocated. An alignment is an implementation-defined integer value
representing the number of bytes between successive addresses at which
a given object can be allocated. An object type imposes an alignment
requirement on every object of that type; stricter alignment can be
requested using the alignment specifier.
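As a small illustration of the quoted wording, the sketch below uses the alignment specifier (`alignas`) and queries the result with `alignof`; both work in bytes. `Float4` and `float4_stride` are hypothetical names for this example, and the sketch assumes a platform where `float` is 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical type: the alignment specifier requests 16-byte alignment,
// stricter than what a float needs on its own.
struct alignas(16) Float4 {
    float v[4];
};

static_assert(alignof(Float4) == 16, "alignas(16) is a byte count");
static_assert(sizeof(Float4) % alignof(Float4) == 0,
              "size is a multiple of the alignment");

// Distance in bytes between two successive array elements.
inline std::size_t float4_stride() {
    Float4 arr[2] = {};
    auto a = reinterpret_cast<std::uintptr_t>(&arr[0]);
    auto b = reinterpret_cast<std::uintptr_t>(&arr[1]);
    assert(a % alignof(Float4) == 0);  // first element is aligned
    return static_cast<std::size_t>(b - a);
}
```

The stride between successive elements is a multiple of 16, which is the "number of bytes between successive addresses" the quote refers to.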

Related

Why is dynamically allocated memory always 16 bytes aligned?

I wrote a simple example:
#include <cstdlib>   // malloc
#include <iostream>

int main() {
    void* byte1 = ::operator new(1);
    void* byte2 = ::operator new(1);
    void* byte3 = malloc(1);
    std::cout << "byte1: " << byte1 << std::endl;
    std::cout << "byte2: " << byte2 << std::endl;
    std::cout << "byte3: " << byte3 << std::endl;
    return 0;
}
Running the example, I get the following results:
byte1: 0x1f53e70
byte2: 0x1f53e90
byte3: 0x1f53eb0
Each time I allocate a single byte of memory, it's always 16 bytes aligned. Why does this happen?
I tested this code on GCC 5.4.0 as well as GCC 7.4.0, and got the same results.

Why does this happen?
Because the standard says so. More specifically, it says that dynamic allocations [1] are aligned to at least the maximum fundamental [2] alignment (they may have stricter alignment). There is a predefined macro (since C++17) just for the purpose of telling you exactly what this guaranteed alignment is: __STDCPP_DEFAULT_NEW_ALIGNMENT__. Why this might be 16 in your example is a choice of the language implementation, restricted by what the target hardware architecture allows.
This is (was) a necessary design, considering that there is (was) no way to pass information about the needed alignment to the allocation function (until C++17, which introduced aligned-new syntax for allocating "over-aligned" memory).
malloc doesn't know anything about the types of objects that you intend to create in the memory. One might think that new could in theory deduce the alignment since it is given a type, but what if you wanted to reuse that memory for other objects with stricter alignment, as in an implementation of std::vector? And once you know the signature of operator new, void* operator new(std::size_t count), you can see that neither the type nor its alignment is an argument that could affect the alignment of the allocation.
[1] Made by the default allocator, or the malloc family of functions.
[2] The maximum fundamental alignment is alignof(std::max_align_t). No fundamental type (arithmetic types, pointers) has stricter alignment than this.
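A hedged sketch of how these guarantees can be checked at run time, assuming a C++17 compiler. The names `kNewAlignment`, `is_multiple_of`, `default_allocations_are_aligned`, `CacheLine` and `over_aligned_new_is_aligned` are invented for this example, and the fallback used when the macro is missing is an assumption of the sketch, not something the standard promises.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Alignment guaranteed by plain operator new. The macro exists since C++17;
// the fallback for older compilers is an assumption of this sketch.
#if defined(__STDCPP_DEFAULT_NEW_ALIGNMENT__)
constexpr std::size_t kNewAlignment = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
#else
constexpr std::size_t kNewAlignment = alignof(std::max_align_t);
#endif

inline bool is_multiple_of(const void* p, std::size_t bytes) {
    return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}

// Checks that plain operator new and malloc meet their guaranteed alignment.
inline bool default_allocations_are_aligned() {
    // Request at least one full alignment unit so the guarantee applies fully.
    void* a = ::operator new(kNewAlignment);
    void* b = std::malloc(sizeof(std::max_align_t));
    assert(b != nullptr);
    bool ok = is_multiple_of(a, kNewAlignment) &&
              is_multiple_of(b, alignof(std::max_align_t));
    std::free(b);
    ::operator delete(a);
    return ok;
}

// Hypothetical over-aligned type: since C++17 a new-expression passes its
// alignment to the aligned overload of operator new.
struct alignas(64) CacheLine {
    unsigned char bytes[64];
};

inline bool over_aligned_new_is_aligned() {
    CacheLine* p = new CacheLine{};
    bool ok = is_multiple_of(p, alignof(CacheLine));
    delete p;
    return ok;
}
```

Printing `kNewAlignment` on the asker's platform would be one way to see where the 16 comes from; the exact value is implementation-specific.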

There are actually two reasons. The first is that some kinds of objects have alignment requirements. Usually, these requirements are soft: a misaligned access is "just" slower (possibly by orders of magnitude). They can also be hard: on the PPC, for instance, you simply could not access a vector in memory if that vector was not aligned to 16 bytes. Alignment is not optional; it must be considered whenever memory is allocated. Always.
Note that there is no way to specify an alignment to malloc(). There's simply no argument for it. As such, malloc() must be implemented to provide a pointer that is correctly aligned for any purposes on the platform. The ::operator new() in C++ follows the same principle.
How much alignment is needed is fully platform dependent. On a PPC, there is no way that you can get away with less than 16 bytes alignment. X86 is a bit more lenient in this, afaik.
The second reason is the inner workings of an allocator function. Typical implementations have an allocator overhead of at least 2 pointers: Whenever you request a byte from malloc() it will usually need to allocate space for at least two additional pointers to do its own bookkeeping (the exact amount depends on the implementation). On a 64 bit architecture, that's 16 bytes. As such, it is not sensible for malloc() to think in terms of bytes, it's more efficient to think in terms of 16 byte blocks. At least. You see that with your example code: The resulting pointers are actually 32 bytes apart. Each memory block occupies 16 bytes payload + 16 bytes internal bookkeeping memory.
Since the allocators request entire memory pages from the kernel (4096 bytes, 4096 bytes aligned!), the resulting memory blocks are naturally 16 bytes aligned on a 64 bit platform. It's simply not practical to provide less aligned memory allocations.
Taken together, these two reasons make it both practical and required for an allocator function to provide seriously aligned memory blocks. The exact amount of alignment depends on the platform, but it will usually not be less than the size of two pointers.
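The bookkeeping arithmetic can be sketched as follows. This is a simplified model of the description above, not the layout of any real allocator: `round_up`, `block_size` and `observed_stride` are hypothetical helpers, and the 16-byte header and 16-byte granule are the values assumed in the answer for a 64-bit platform.

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to the next multiple of `granule` (assumed a power of two).
constexpr std::size_t round_up(std::size_t n, std::size_t granule) {
    return (n + granule - 1) & ~(granule - 1);
}

// Simplified model: each block holds `header` bytes of bookkeeping plus the
// payload, rounded up to a multiple of `granule`.
constexpr std::size_t block_size(std::size_t request,
                                 std::size_t header = 16,
                                 std::size_t granule = 16) {
    return round_up(header + request, granule);
}

// Stride between the first two addresses printed in the question.
inline std::size_t observed_stride() {
    const std::size_t byte1 = 0x1f53e70;
    const std::size_t byte2 = 0x1f53e90;
    assert(byte1 % 16 == 0 && byte2 % 16 == 0);  // both are 16-byte aligned
    return byte2 - byte1;
}
```

Under this model a 1-byte request occupies `block_size(1)` = 32 bytes, which matches the 32-byte distance between `byte1` and `byte2` in the question's output.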

It's probably the way the memory allocator manages to get the necessary information to the deallocation function: the issue of the deallocation function (like free or the general, global operator delete) is that there is exactly one argument, the pointer to the allocated memory and no indication of the size of the block that was requested (or the size that was allocated if it's larger), so that indication (and much more) needs to be provided in some other form to the deallocation function.
The most simple yet efficient approach is to allocate room for that additional information plus the requested bytes, and return a pointer to the end of the information block, let's call it IB. The size and alignment of IB automatically aligns the address returned by either malloc or operator new, even if you allocate a minuscule amount: the real amount allocated by malloc(s) is sizeof(IB)+s.
For such small allocations the approach is relatively wasteful and other strategies might be used, but having multiple allocation methods complicates deallocation, as the function must first determine which method was used.
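The information-block idea can be sketched as a toy wrapper around malloc. Everything here (`InfoBlock`, `toy_alloc`, `info_of`, `toy_size`, `toy_free`) is hypothetical and only illustrates the scheme described above; real allocators store more data and lay it out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical information block ("IB") stored in front of the payload.
// alignas keeps its size a multiple of the strictest fundamental alignment,
// so the address right after it is aligned as well.
struct alignas(std::max_align_t) InfoBlock {
    std::size_t size;  // number of bytes the caller asked for
};

// Allocates sizeof(InfoBlock) + s bytes and returns the address just past IB.
inline void* toy_alloc(std::size_t s) {
    void* raw = std::malloc(sizeof(InfoBlock) + s);
    if (raw == nullptr) return nullptr;
    InfoBlock* ib = static_cast<InfoBlock*>(raw);
    ib->size = s;
    return ib + 1;
}

// Recovers IB from the user pointer: it sits immediately before it.
inline InfoBlock* info_of(void* p) {
    assert(p != nullptr);
    return static_cast<InfoBlock*>(p) - 1;
}

inline std::size_t toy_size(void* p) {
    return info_of(p)->size;
}

// Needs only the pointer, like free(): the size is found in IB.
inline void toy_free(void* p) {
    if (p != nullptr) std::free(info_of(p));
}
```

With this layout `toy_alloc(5)` really allocates `sizeof(InfoBlock) + 5` bytes, and the returned pointer inherits the alignment of the information block.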

Why does this happen?
Because in the general case the library does not know what kind of data you are going to store in that memory, so it has to align it for the biggest data type on that platform. If you store data unaligned, you pay a significant hardware performance penalty. On some platforms you will even get a segfault if you try to access unaligned data.

It is due to the platform. On x86 alignment isn't necessary, but it improves the performance of the operations. As far as I know it makes no difference on newer models, but the compiler still goes for the optimum. On other platforms it matters more: for example, accessing a long that is not properly aligned on an m68k processor will crash.

It isn't. It depends on the OS/CPU requirements. In the 32-bit versions of Linux/Win32, allocated memory is always 8-byte aligned. In the 64-bit versions of Linux/Win32, since all 64-bit CPUs have SSE2 at a minimum, it made sense at the time to align all memory to 16 bytes (because working with SSE2 was less efficient when using unaligned memory). With the latest AVX-based CPUs, this performance penalty for unaligned memory has been removed, so really they could allocate on any boundary.
If you think about it, aligning the addresses of memory allocations to 16 bytes gives you 4 bits of blank space in the pointer address. This may be useful internally for storing some additional flags (e.g. readable, writable, executable, etc.).
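The spare-bits idea can be sketched like this. `kTagMask`, `tag_pointer`, `pointer_of` and `flags_of` are hypothetical names for illustration; whether a given allocator or OS actually stores flags this way is not something this sketch claims.

```cpp
#include <cassert>
#include <cstdint>

// The low 4 bits of a 16-byte-aligned address are always zero.
constexpr std::uintptr_t kTagMask = 0xF;

// Packs up to 4 flag bits into the low bits of a 16-byte-aligned pointer.
inline std::uintptr_t tag_pointer(void* p, unsigned flags) {
    auto bits = reinterpret_cast<std::uintptr_t>(p);
    assert((bits & kTagMask) == 0);  // pointer must be 16-byte aligned
    assert(flags <= kTagMask);       // only 4 bits are available
    return bits | flags;
}

// Strips the flags to recover the original pointer.
inline void* pointer_of(std::uintptr_t tagged) {
    return reinterpret_cast<void*>(tagged & ~kTagMask);
}

// Extracts the 4 flag bits.
inline unsigned flags_of(std::uintptr_t tagged) {
    return static_cast<unsigned>(tagged & kTagMask);
}
```

A tagged value must always go through `pointer_of` before it is dereferenced, since the flag bits make the raw value an invalid address.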
At the end of the day, the reasoning is entirely dictated by the OS and/or hardware requirements. It's nothing to do with the language.

Memory alignment, structs and malloc

It's a bit hard to formulate what I want to know in a single question, so I'll try to break it down.
For example purposes, let's say we have the following struct:
struct X {
    uint8_t a;
    uint16_t b;
    uint32_t c;
};
1. Is it true that the compiler is guaranteed to never rearrange the order of X's members, only add padding where necessary? In other words, is it always true that offsetof(X, a) < offsetof(X, c)?
2. Is it true that the compiler will pick the largest alignment among X's members and use it to align objects of type X (i.e. addresses of X instances will be divisible by the largest alignment among X's members)?
3. Since malloc does not know anything about the type of objects that we're going to store when we allocate a buffer, how does it choose an alignment for the returned address? Does it simply return an address that is divisible by the largest alignment possible (in which case, no matter what structure we put in the buffer, the memory accesses will always be aligned)?

1. Yes
2. No, the compiler will use its knowledge of the target host hardware to select the best alignment.
3. See question 2.
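The first two answers can be checked on a given compiler with `offsetof` and `alignof`. The sketch below only asserts what holds on any conforming implementation; the exact padding is implementation-specific, and `padding_in_x` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct X {
    std::uint8_t  a;
    std::uint16_t b;
    std::uint32_t c;
};

// Members keep their declaration order; only padding may be inserted.
static_assert(offsetof(X, a) < offsetof(X, b), "a precedes b");
static_assert(offsetof(X, b) < offsetof(X, c), "b precedes c");

// X is at least as strictly aligned as its most-aligned member; the compiler
// may choose something stricter.
static_assert(alignof(X) >= alignof(std::uint32_t), "X can hold its members");

// Bytes of padding the compiler inserted (implementation-specific value).
inline std::size_t padding_in_x() {
    const std::size_t members = sizeof(std::uint8_t) + sizeof(std::uint16_t) +
                                sizeof(std::uint32_t);
    assert(sizeof(X) >= members);
    return sizeof(X) - members;
}
```

On many common platforms the offsets come out as 0, 2 and 4 with one padding byte after `a`, but that is an observation about typical ABIs rather than a guarantee.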

Since malloc does not know anything about the type of objects that we're going to store when we allocate a buffer, how does it choose an alignment for the returned address?
malloc(3) returns "memory that is suitably aligned for any kind of variable."
Does it simply return an address that is divisible by the largest alignment possible (in which case, no matter what structure we put in the buffer, the memory accesses will always be aligned)?
Yes, but watch your compliance with the strict aliasing rule.

The compiler will do whatever is most beneficial on that computer in the largest number of circumstances. On most platforms loading bus wide values on bus width offsets is fastest.
That means that generally on 32-bit computers compilers will choose to align 32-bit numbers on 4-byte offsets. On 64-bit computers 64-bit values are aligned on 8-byte offsets.
On most computers smaller values like 8-bit and 16-bit values are slower to load. Probably all 4 or 8 bytes around it are loaded and the byte or two bytes you need are masked off.
When you have special circumstances you can override the compiler by specifying the alignment and the padding. You might do this when you know fast loading is not important, but you really want to pack the data tightly. Or when you are playing very subtle tricks with casting and unions.
Memory allocation routines on almost any modern computer will always return memory that is aligned on at least the bus width of the platform ( e.g. 4 or 8 bytes ) - or even more - like 16 byte alignment.
When you call "malloc" you are responsible for knowing the size of the structures you need. Luckily the compiler will tell you the size of any structure with "sizeof". That means that if you pack a structure to save memory, sizeof will return a smaller value than an unpacked structure. So you really will save memory - if you are allocating small structures in large arrays of them.
If you allocate small packed structures one at a time, then packing them won't make any difference. That is because when you allocate some odd small piece of memory, the allocator will actually use significantly more memory than that. It will allocate a conveniently sized block of memory for you, and then an additional block of memory for itself to keep track of your allocation.
This is why, if you care about memory use and want to pack your structures, you definitely don't want to allocate them one at a time.
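A short sketch of the packing trade-off described above. `#pragma pack` is a compiler extension (GCC, Clang and MSVC accept it), not standard C++, and the names `Padded`, `Packed` and `bytes_saved` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Default layout: the compiler may insert padding for alignment.
struct Padded {
    std::uint8_t  a;
    std::uint16_t b;
    std::uint32_t c;
};

// Packed layout: members are placed back to back with 1-byte alignment.
#pragma pack(push, 1)
struct Packed {
    std::uint8_t  a;
    std::uint16_t b;
    std::uint32_t c;
};
#pragma pack(pop)

// Bytes saved by storing n elements in one packed array instead of a
// padded one. Single heap allocations would not see this saving.
inline std::size_t bytes_saved(std::size_t n) {
    assert(sizeof(Packed) <= sizeof(Padded));
    return n * (sizeof(Padded) - sizeof(Packed));
}
```

Here `sizeof(Packed)` is 7, so the saving only shows up when many elements share one allocation, which is the point made above. Members of a packed struct may be misaligned, so taking their address or binding references to them is best avoided.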

Why does Malloc() care about boundary alignments?

I've heard that malloc() aligns memory based on the type that is being allocated. For example, from the book Understanding and Using C Pointers:
The memory allocated will be aligned according to the pointer's data type. For example, a four-byte integer would be allocated on an address boundary evenly divisible by four.
If I follow, this means that
int *integer=malloc(sizeof(int)); will be allocated on an address boundary evenly divisible by four. Even without casting (int *) on malloc.
I was working on a chat server; I read of a similar effect with structs.
And I have to ask: logically, why does it matter what the address boundary itself is divisible on? What's wrong with allocating a group of memory to the tune of n*sizeof(int) using an integer on address 129?
I know how pointer arithmetic works *(integer+1), but I can't work out the importance of boundaries...

The memory allocated will be aligned according to the pointer's data
type.
If you are talking about malloc, this is false. malloc doesn't care what you do with the data and will allocate memory aligned to fit the most stringent native type of the implementation.
From the standard:
The pointer returned if the allocation succeeds is suitably aligned so
that it may be assigned to a pointer to any type of object with a
fundamental alignment requirement and then used to access such an
object or an array of such objects in the space allocated (until the
space is explicitly deallocated)
And:
Logically, why does it matter what the address boundary itself is
divisible on
Due to the workings of the underlying machine, accessing unaligned data might be more expensive (e.g. x86) or illegal (e.g. ARM). Requiring alignment lets the hardware take shortcuts that improve performance or simplify the implementation.

In many processors, data that isn't aligned will cause a "trap" or "exception" (this is a different form of exception than those understood by the C++ compiler). Even on processors that don't trap when data isn't aligned, access is typically slower (twice as slow, for example) when the data is not correctly aligned. So it's in the compiler's/runtime library's best interest to ensure that things are nicely aligned.
And by the way, malloc (typically) doesn't know what you are allocating. Instead, malloc will align ALL data, no matter what size it is, to some suitable boundary that is "good enough" for general data access: typically 8 or 16 bytes in modern OS/processor combinations, 4 bytes in older systems.
This is because malloc won't know if you do char* p = malloc(1000); or double* p = malloc(1000);, so it has to assume you are storing double or whatever is the item with the largest alignment requirement.

The importance of alignment is not a language issue but a hardware issue. Some machines are incapable of reading a data value that is not properly aligned. Others can do it but do so less efficiently, e.g., requiring two reads to read one misaligned value.

The book quote is wrong; the memory returned by malloc is guaranteed to be aligned correctly for any type. Even if you write char *ch = malloc(37);, it is still aligned for int or any other type.
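To illustrate that claim, the sketch below requests 37 plain bytes and then constructs an object of another type at the start of the buffer. `store_in_malloced_bytes` is a hypothetical helper, and the sketch assumes T is a small, trivially destructible type with a fundamental alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Stores a T at the start of a buffer that was requested as plain bytes and
// reads it back. Assumes T is trivially destructible.
template <class T>
bool store_in_malloced_bytes(T value) {
    static_assert(sizeof(T) <= 37, "T must fit into the buffer");
    static_assert(alignof(T) <= alignof(std::max_align_t),
                  "T must have a fundamental alignment");

    // malloc receives only a size, no type information.
    char* ch = static_cast<char*>(std::malloc(37));
    if (ch == nullptr) return false;

    // The buffer is aligned for any fundamental type, so this always holds.
    assert(reinterpret_cast<std::uintptr_t>(ch) % alignof(T) == 0);

    T* obj = new (ch) T(value);  // placement new at the buffer start
    bool ok = (*obj == value);
    std::free(ch);
    return ok;
}
```

The same buffer works for `int`, `double` or `long double`, which is why the pointer type on the left-hand side of the assignment cannot influence the alignment.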
You seem to be asking "What is alignment?" If so, there are several questions on SO about this already, e.g. here, or a good explanation from IBM here.

It depends on the hardware. Even assuming int is 32 bits, malloc(sizeof(int)) could return an address divisible by 1, 2, or 4. Different processors handle unaligned access differently.
Processors don't read directly from RAM any more, that's too slow (it takes hundreds of cycles). So when they do grab RAM, they grab it in big chunks, like 64 bytes at a time. If your address isn't aligned, the 4-byte integer might straddle two 64-byte cache lines, so your processor has to do two loads and fix up the result. Or maybe the engineers decided that building the hardware to fix up unaligned loads isn't necessary, so the processor signals an exception: either your program crashes, or the operating system catches the exception and fixes up the operation (hundreds of wasted cycles).
Aligning addresses means your program plays nicely with hardware.

Because it's faster: most processors prefer aligned data. Some processors even CANNOT access data that is not aligned! (If you try to access such data, the processor may raise a fault.)

Why does the size of a struct need to be a multiple of the largest alignment of any struct member?

I understand the padding that takes place between the members of a struct to ensure correct alignment of individual types. However, why does the size of the data structure have to be a multiple of the alignment of its largest member? I don't understand why padding is needed at the end.
Reference:
http://en.wikipedia.org/wiki/Data_structure_alignment

Good question. Consider this hypothetical type:
struct A {
    int n;
    bool flag;
};
So, an object of type A should take five bytes (four for the int plus one for the bool), but in fact it takes eight. Why?
The answer is seen if you use the type like this:
const size_t N = 100;
A a[N];
If each A were only five bytes, then a[0] would align but a[1], a[2] and most of the other elements would not.
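The array argument can be verified directly. This is a minimal sketch; `all_elements_aligned` is a hypothetical helper, and the exact value of `sizeof(A)` (8 on many platforms) is implementation-specific, so the code only checks the properties that follow from tail padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct A {
    int  n;
    bool flag;
};

// Tail padding makes sizeof(A) a multiple of alignof(A) ...
static_assert(sizeof(A) % alignof(A) == 0, "size is a multiple of alignment");

// ... so every element of an array starts on a suitably aligned address.
inline bool all_elements_aligned() {
    const std::size_t N = 100;
    A a[N] = {};

    // Array elements are contiguous, exactly sizeof(A) bytes apart.
    assert(reinterpret_cast<char*>(&a[1]) - reinterpret_cast<char*>(&a[0]) ==
           static_cast<std::ptrdiff_t>(sizeof(A)));

    for (std::size_t i = 0; i < N; ++i) {
        if (reinterpret_cast<std::uintptr_t>(&a[i].n) % alignof(int) != 0) {
            return false;
        }
    }
    return true;
}
```

If the stride were 5 bytes instead, `a[1].n` would sit at offset 5, which is not a multiple of 4 on a platform where `alignof(int)` is 4.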
But why does alignment even matter? There are several reasons, all hardware-related. One reason is that recently/frequently used memory is cached in cache lines on the CPU silicon for rapid access. An aligned object smaller than a cache line always fits in a single line (but see the interesting comments appended below), but an unaligned object may straddle two lines, wasting cache.
There are actually even more fundamental hardware reasons, having to do with the way byte-addressable data is transferred down a 32- or 64-bit data bus, quite apart from cache lines. Not only will misalignment clog the bus with extra fetches (due as before to straddling), but it will also force registers to shift bytes as they come in. Even worse, misalignment tends to confuse optimization logic (at least, Intel's optimization manual says that it does, though I have no personal knowledge of this last point). So, misalignment is very bad from a performance standpoint.
It usually is worth it to waste the padding bytes for these reasons.
Update: The comments below are all useful. I recommend them.

Depending on the hardware, alignment might be necessary or just help speeding up execution.
There is a certain number of processors (ARM I believe) in which an unaligned access leads to a hardware exception. Plain and simple.
Even though typical x86 processors are more lenient, there is still a penalty in accessing unaligned fundamental types, as the processor has to do more work to bring the bits into the register before being able to operate on it. Compilers usually offer specific attributes/pragmas when packing is desirable nonetheless.

Because of virtual addressing.
"...aligning a page on a page-sized boundary lets the
hardware map a virtual address to a physical address by substituting
the higher bits in the address, rather than doing complex arithmetic."
By the way, I found the Wikipedia page on this quite well written.

If the register size of the CPU is 32 bits, then it can grab memory that is on 32 bit boundaries with a single assembly instruction. It is slower to grab 32 bits, and then get the byte that starts at bit 8.
BTW: There doesn't have to be padding. You can ask that structures be packed.

Virtual memory and alignment - how do they factor together?

I think I understand memory alignment, but what confuses me is that the address of a pointer on some systems is going to be in virtual memory, right? Most of the alignment checking I have seen just uses the pointer address. Is it not possible that the physical memory address will not be aligned? Isn't that problematic for things like SSE?

The physical address will be aligned because virtual memory only maps aligned pages to physical memory (and the pages are typically 4KB).
So unless you need alignment > page size, the physical memory will be aligned as per your requirements.
In the specific case of SSE, everything works fine because you only need 16 byte alignment.

I am not aware of any actual system in which an aligned virtual memory address can result in a misaligned physical memory address.
Typically, all alignments on a given platform will be powers of two. For example, on x86 32-bit integers have a natural alignment of 4 bytes (2^2). The page size, which defines how fine a block you can map in physical memory, is generally a large power of two. On x86, the smallest allowable page size is 4096 bytes (2^12). The largest datatype that might need alignment on x86 is 32 bytes (for AVX), i.e. 2^5; XMM registers and CMPXCHG16B need only 128 bits (16 bytes). Since 2^12 is divisible by 2^5, you'll find that everything aligns right at the start of a page, and since pages are aligned both in virtual and physical memory, a virtual-aligned address will always be physical-aligned.
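The divisibility argument can be sketched with a simplified model of address translation. `translate` and `same_alignment` are hypothetical helpers, and the model assumes plain 4096-byte pages where the page offset is copied unchanged; it ignores everything else a real MMU does.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPageSize   = 4096;           // 2^12, smallest x86 page
constexpr std::uint64_t kOffsetMask = kPageSize - 1;  // low 12 bits

// Simplified model of paging: the page offset is kept as is and only the
// page number is replaced by a page-aligned physical frame base.
inline std::uint64_t translate(std::uint64_t vaddr, std::uint64_t frame_base) {
    assert((frame_base & kOffsetMask) == 0);  // frames are page aligned
    return frame_base | (vaddr & kOffsetMask);
}

// For any alignment that divides the page size, the virtual and physical
// addresses leave the same remainder.
inline bool same_alignment(std::uint64_t vaddr, std::uint64_t paddr,
                           std::uint64_t align) {
    assert(align != 0 && kPageSize % align == 0);
    return vaddr % align == paddr % align;
}
```

Because the low 12 bits survive translation, a virtual address that is a multiple of 16 or 32 maps to a physical address that is a multiple of 16 or 32 as well.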
On a more practical level, allowing aligned virtual addresses to map to unaligned physical addresses not only would make it really hard to generate code, it would also make the CPU architecture more complex than simply allowing any alignment (since now we have odd-sized pages and other weirdness...)
Note that you may have reason to ask for larger alignments than a page from time to time. Typically, for user space coding, it doesn't matter if this is aligned in physical RAM (for that matter, if you're requesting multiple pages, it's unlikely to be even contiguous!). Problems here only arise if you're writing a device driver and need a large, aligned, contiguous block for DMA. But even then usually the device isn't a stickler about larger-than-page-size alignment.
