Virtual memory and alignment - how do they factor together?

I think I understand memory alignment, but what confuses me is that the address held in a pointer on some systems is going to be a virtual memory address, right? Most of the checking/ensuring of alignment I have seen seems to just use the pointer address. Is it not possible that the physical memory address will not be aligned? Isn't that problematic for things like SSE?

The physical address will be aligned because virtual memory only maps aligned pages to physical memory (and the pages are typically 4KB).
So unless you need alignment > page size, the physical memory will be aligned as per your requirements.
In the specific case of SSE, everything works fine because you only need 16 byte alignment.
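The pointer-based check the question refers to can be sketched as follows. This is only an illustration, not code from the question: the helper name is_aligned and the buffer sse_buffer are made up here. It inspects the virtual address only, which is enough for any alignment up to the page size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
// `alignment` must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    return (addr & (alignment - 1)) == 0;
}

// A buffer with the 16-byte alignment that aligned SSE loads/stores expect.
alignas(16) static float sse_buffer[4] = {1.0f, 2.0f, 3.0f, 4.0f};
```

Calling is_aligned(sse_buffer, 16) should return true, while the same call on an address one byte further along returns false.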

I am not aware of any actual system in which an aligned virtual memory address can result in a misaligned physical memory address.
Typically, all alignments on a given platform will be powers of two. For example, on x86 32-bit integers have a natural alignment of 4 bytes (2^2). The page size - which defines how fine a block you can map in physical memory - is generally a large power of two. On x86, the smallest allowable page size is 4096 bytes (2^12). The largest alignment considered here for x86 data is 32 bytes (2^5) for AVX, while XMM registers and CMPXCHG16B need 16 bytes (128 bits). Since 2^12 is divisible by 2^5, you'll find that everything aligns right at the start of a page, and since pages are aligned both in virtual and physical memory, a virtual-aligned address will always be physical-aligned.
On a more practical level, allowing aligned virtual addresses to map to unaligned physical addresses not only would make it really hard to generate code, it would also make the CPU architecture more complex than simply allowing any alignment (since now we have odd-sized pages and other weirdness...)
Note that you may have reason to ask for larger alignments than a page from time to time. Typically, for user space coding, it doesn't matter if this is aligned in physical RAM (for that matter, if you're requesting multiple pages, it's unlikely to be even contiguous!). Problems here only arise if you're writing a device driver and need a large, aligned, contiguous block for DMA. But even then usually the device isn't a stickler about larger-than-page-size alignment.
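The argument above can be made concrete with a toy model of address translation. This is a sketch under stated assumptions, not how any particular MMU is programmed: translate, same_alignment and the frame_number parameter are hypothetical names, and the page table lookup is replaced by a number passed in by the caller. The point is that only the bits above the page offset change, so any power-of-two alignment up to the page size survives translation.

```cpp
#include <cassert>
#include <cstdint>

const std::uint64_t kPageSize = 4096;            // 2^12, smallest x86 page
const std::uint64_t kOffsetMask = kPageSize - 1; // low 12 bits

// Toy translation: the virtual page number is replaced by a physical frame
// number, and the offset within the page passes through unchanged.
// `frame_number` stands in for whatever the page table would return.
inline std::uint64_t translate(std::uint64_t virtual_addr,
                               std::uint64_t frame_number) {
    return frame_number * kPageSize + (virtual_addr & kOffsetMask);
}

// True if both addresses have the same remainder modulo `alignment`,
// where `alignment` is a power of two.
inline bool same_alignment(std::uint64_t a, std::uint64_t b,
                           std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (a & (alignment - 1)) == (b & (alignment - 1));
}
```

For example, translating the virtual address 0x7ffd1230 into frame 0x1a2b3 gives 0x1a2b3230; both addresses end in 0x230, so they agree modulo 16, 32 and 4096.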

Related

Will process load into memory with 4 or 8 alignment rule

I just learnt about 4 or 8 memory alignment and came about this question.
Will memory alignment happen in the virtual memory space or in absolute addresses?
I guess the answer is the virtual memory space, and the OS will load the process at a position where the absolute address ends with '0X00' or '0X0'.
If not, please show me why. Thanks a lot. XD

Both virtual and actual addresses will be word-aligned to the CPU's native word-size where appropriate(*). (The reason for that is that the virtual-to-physical mapping is done on a per-page basis, and the size of a memory page is always a whole multiple of the CPU's native word-size).
(*) the exception would be for items that are smaller than a word, and are packed together consecutively to save memory; e.g. many of the individual elements inside of char and uint8_t arrays will necessarily not be word-aligned.

How does std::alignas optimize the performance of a program?

On a 32-bit machine, one memory read cycle gets 4 bytes of data.
So it should take 32 read cycles to read the 128-byte buffer below.
char buffer[128];
Now suppose I align this buffer as shown below. How will that make it faster to read?
alignas(128) char buffer[128];
I am assuming the memory read cycle will remain 4 bytes only.

The size of the registers used for memory access is only one part of the story, the other part is the size of the cache-line.
If a cache-line is 64 bytes and your char[128] has only its natural alignment (which is 1 for char), it can start in the middle of a cache-line, so the CPU generally needs to manipulate three different cache-lines. With alignas(64) or alignas(128), only two cache-lines need to be touched.
If you are working with memory mapped file, or under swapping conditions, the next level of alignment kicks in: the size of a memory page. This would call for 4096 or 8192 byte alignments.
However, I seriously doubt that alignas() has any significant positive effect if the specified alignment is larger than the natural alignment that the compiler uses anyway: It significantly increases memory consumption, which may be enough to trigger more cache-lines/memory pages being touched in the first place. It's only the small misalignments that need to be avoided because they may trigger huge slowdowns on some CPUs, or might be downright illegal/impossible on others.
Thus, truth is only in measurement: If you need all the speedup you can get, try it, measure the runtime difference, and see whether it works out.
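The "three cache-lines versus two" count can be checked with a little arithmetic. The sketch below is an illustration only: cache_lines_touched is a made-up helper, and the 64-byte line size is an assumption passed as a default parameter rather than something queried from the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of cache lines overlapped by `size` bytes starting at `addr`.
// `line_size` must be a power of two; 64 is assumed, not detected.
inline std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t size,
                                       std::size_t line_size = 64) {
    assert(size > 0);
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    std::uintptr_t first = addr / line_size;              // line of first byte
    std::uintptr_t last = (addr + size - 1) / line_size;  // line of last byte
    return static_cast<std::size_t>(last - first + 1);
}
```

A 128-byte buffer starting at 0x1000 overlaps two lines, while the same buffer starting at 0x1001 overlaps three. It says nothing about actual run time, which still has to be measured.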

On a 32-bit machine, one memory read cycle gets 4 bytes of data.
It's not that simple. Just the term "32 bit machine" is already too broad and can mean many things. 32b registers (GP registers? ALU registers? Address registers?)? 32b address bus? 32b data bus? 32b instruction word size?
And "memory read" by whom. CPU? Cache? DMA chip?
If you have a HW platform where memory is read 4 bytes at a time (aligned to 4) in a single cycle and without any cache, then alignas(128) will make no difference compared to alignas(4).

Why does malloc() care about boundary alignments?

I've heard that malloc() aligns memory based on the type that is being allocated. For example, from the book Understanding and Using C Pointers:
The memory allocated will be aligned according to the pointer's data type. For example, a four-byte integer would be allocated on an address boundary evenly divisible by four.
If I follow, this means that
int *integer=malloc(sizeof(int)); will be allocated on an address boundary evenly divisible by four. Even without casting (int *) on malloc.
I was working on a chat server; I read of a similar effect with structs.
And I have to ask: logically, why does it matter what the address boundary itself is divisible on? What's wrong with allocating a group of memory to the tune of n*sizeof(int) using an integer on address 129?
I know how pointer arithmetic works *(integer+1), but I can't work out the importance of boundaries...

The memory allocated will be aligned according to the pointer's data
type.
If you are talking about malloc, this is false. malloc doesn't care what you do with the data and will allocate memory aligned to fit the most stringent native type of the implementation.
From the standard:
The pointer returned if the allocation succeeds is suitably aligned so
that it may be assigned to a pointer to any type of object with a
fundamental alignment requirement and then used to access such an
object or an array of such objects in the space allocated (until the
space is explicitly deallocated)
And:
Logically, why does it matter what the address boundary itself is
divisible on
Due to the workings of the underlying machine, accessing unaligned data might be more expensive (e.g. x86) or illegal (e.g. arm). This lets the hardware take shortcuts that improve performance / simplify implementation.

In many processors, data that isn't aligned will cause a "trap" or "exception" (this is a different form of exception than those understood by the C++ compiler). Even on processors that don't trap when data isn't aligned, access is typically slower (twice as slow, for example) when the data is not correctly aligned. So it's in the compiler's/runtime library's best interest to ensure that things are nicely aligned.
And by the way, malloc (typically) doesn't know what you are allocating. Instead, malloc will align ALL data, no matter what size it is, to some suitable boundary that is "good enough" for general data-access - typically 8 or 16 bytes in modern OS/processor combinations, 4 bytes in older systems.
This is because malloc won't know if you do char* p = malloc(1000); or double* p = malloc(1000);, so it has to assume you are storing double or whatever is the item with the largest alignment requirement.
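A quick way to observe this is to inspect the address malloc returns. The following is a sketch, with malloc_is_max_aligned as a made-up helper name; it compares against alignof(std::max_align_t), the alignment of the most strictly aligned fundamental type, on the assumption that the requested size is at least that large.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocates `size` bytes with std::malloc and reports whether the result is
// aligned for std::max_align_t. Returns false if the allocation fails.
// Assumes size >= sizeof(std::max_align_t).
inline bool malloc_is_max_aligned(std::size_t size) {
    assert(size >= sizeof(std::max_align_t));
    void* p = std::malloc(size);
    if (p == nullptr) return false;
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    bool aligned = (addr % alignof(std::max_align_t)) == 0;
    std::free(p);
    return aligned;
}
```

The function takes only a byte count, just like malloc itself: nothing in the call says whether the block will hold char, int or double.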

The importance of alignment is not a language issue but a hardware issue. Some machines are incapable of reading a data value that is not properly aligned. Others can do it but do so less efficiently, e.g., requiring two reads to read one misaligned value.

The book quote is wrong; the memory returned by malloc is guaranteed to be aligned correctly for any type. Even if you write char *ch = malloc(37);, it is still aligned for int or any other type.
You seem to be asking "What is alignment?" If so, there are several questions on SO about this already, e.g. here, or a good explanation from IBM here.

It depends on the hardware. Even assuming int is 32 bits, malloc(sizeof(int)) could return an address divisible by 1, 2, or 4. Different processors handle unaligned access differently.
Processors don't read directly from RAM any more, that's too slow (it takes hundreds of cycles). So when they do grab RAM, they grab it in big chunks, like 64 bytes at a time. If your address isn't aligned, the 4-byte integer might straddle two 64-byte cache lines, so your processor has to do two loads and fix up the result. Or maybe the engineers decided that building the hardware to fix up unaligned loads isn't necessary, so the processor signals an exception: either your program crashes, or the operating system catches the exception and fixes up the operation (hundreds of wasted cycles).
Aligning addresses means your program plays nicely with hardware.
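When data really does sit at an unaligned address (for instance inside a packed network buffer), a portable way to read it is to copy the bytes rather than dereference a misaligned pointer. This is a sketch of that common idiom; the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads a 32-bit value from an address that may not be 4-byte aligned.
// std::memcpy has no alignment requirement, so this stays well defined even
// on hardware where a misaligned 32-bit load would trap.
inline std::uint32_t load_u32_unaligned(const unsigned char* p) {
    assert(p != nullptr);
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Writes a 32-bit value to a possibly misaligned address.
inline void store_u32_unaligned(unsigned char* p, std::uint32_t value) {
    assert(p != nullptr);
    std::memcpy(p, &value, sizeof value);
}
```

Compilers commonly turn such a fixed-size memcpy into whatever load sequence the target supports, so the cost is paid only where the hardware demands it.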

Because it's faster: most processors prefer data that is aligned. Some processors CANNOT access unaligned data at all! (If you try to access such data, the processor may raise a fault.)

Can a pointer point to an address after 4GB?

If we compile and execute the code below:
int *p;
printf("%d\n", (int)sizeof(p));
it seems that the size of a pointer, whatever type it points to, is 4 bytes, which means 32 bits, so 2^32 addresses can be stored in a pointer. Since every address is associated with 1 byte, 2^32 bytes give 4 GB.
So, how can a pointer point to the address after 4 GB of memory? And how can a program use more than 4 GB of memory?

By principle, if you can't represent an address which goes over 2^X-1 then you can't address more than 2^X bytes of memory.
This is true for x86, even though some workarounds (like PAE) have been implemented and used to allow more physical memory, with limitations that come from these being more hacks than real solutions to the problem.
With a 64 bit architecture the standard size of a pointer is doubled, so you don't have to worry anymore.
Mind that, in any case, virtual memory translates addresses from the process space to the physical space so it's easy to see that a hardware could support more memory even if the maximum addressable memory from the process point of view is still limited by the size of a pointer.
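The arithmetic behind the 4 GB limit can be sketched as below. The helper name addressable_bytes is made up for the example, and the 36-bit figure is the physical address width of the original PAE design, stated here as background rather than taken from the answer above.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct byte addresses that fit in `address_bits` bits.
// Valid for 1..63 bits; the count for a full 64 bits would overflow.
inline std::uint64_t addressable_bytes(unsigned address_bits) {
    assert(address_bits >= 1 && address_bits <= 63);
    return std::uint64_t{1} << address_bits;
}

const unsigned kPointerBits32 = 32;    // virtual address width of a 32-bit process
const unsigned kPaePhysicalBits = 36;  // physical address width with original PAE
```

addressable_bytes(kPointerBits32) is 4294967296 (4 GB), and addressable_bytes(kPaePhysicalBits) is 16 times that, which is why the machine can hold more RAM than a single 32-bit process can address.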

"How can a pointer point to the address after 4GB of memory?"
There is a difference between the physical memory available to the processor and the "virtual memory" seen by the process. A 32 bit process (which has a pointer of size 4 bytes) is limited to 4GB. However, the processor maintains a mapping (controlled by the OS) that lets each process have its own memory space, up to 4GB each.
That way 8GB of memory could be used on a 32 bit system, if there were two processes each using 4GB.

To access >4GB of address space you can do one of the following:
Compile in x86_64 (64 bit) on a 64 bit OS. This is the easiest.
Use AWE memory. AWE allows mapping a window of memory which (usually) resides above 4GB. The window address can be mapped and remapped again and again. It was used in large database applications and RAM drives in the 32 bit era.
Note that a memory address where the MSB is 1 is reserved for the kernel. Under certain conditions Windows allows a process to use up to 3GB; the top 1GB is always reserved for the kernel.
By default a 32 bit process has 2GB of user mode address space. It's possible to get 3GB via a special linker flag (in VS: /LARGEADDRESSAWARE).

Why does the size of a struct need to be a multiple of the largest alignment of any struct member?

I understand the padding that takes place between the members of a struct to ensure correct alignment of individual types. However, why does the size of the data structure have to be a multiple of the alignment of its largest member? I don't understand why padding is needed at the end.
Reference:
http://en.wikipedia.org/wiki/Data_structure_alignment

Good question. Consider this hypothetical type:
struct A {
    int n;
    bool flag;
};
So, an object of type A should take five bytes (four for the int plus one for the bool), but in fact it takes eight. Why?
The answer is seen if you use the type like this:
const size_t N = 100;
A a[N];
If each A were only five bytes, then a[0] would align but a[1], a[2] and most of the other elements would not.
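The array argument can be checked directly. The sketch below repeats the struct so that it is self-contained; all_elements_aligned is a made-up helper. The exact size of A (8 on typical platforms with a 4-byte int) is an assumption about the target, so the code only asserts the property that holds everywhere: the size is a multiple of the alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct A {
    int n;
    bool flag;
};

// The size of an object type is always a multiple of its alignment; the tail
// padding exists to make this true, so array elements stay aligned.
static_assert(sizeof(A) % alignof(A) == 0, "size is a multiple of alignment");

// True if every element of a[0..count) sits on an alignof(A) boundary.
inline bool all_elements_aligned(const A* a, std::size_t count) {
    assert(a != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        auto addr = reinterpret_cast<std::uintptr_t>(&a[i]);
        if (addr % alignof(A) != 0) return false;
    }
    return true;
}
```

Because consecutive elements are exactly sizeof(A) bytes apart, a[1] is aligned whenever a[0] is, and so on down the array.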
But why does alignment even matter? There are several reasons, all hardware-related. One reason is that recently/frequently used memory is cached in cache lines on the CPU silicon for rapid access. An aligned object smaller than a cache line always fits in a single line (but see the interesting comments appended below), but an unaligned object may straddle two lines, wasting cache.
There are actually even more fundamental hardware reasons, having to do with the way byte-addressable data is transferred down a 32- or 64-bit data bus, quite apart from cache lines. Not only will misalignment clog the bus with extra fetches (due as before to straddling), but it will also force registers to shift bytes as they come in. Even worse, misalignment tends to confuse optimization logic (at least, Intel's optimization manual says that it does, though I have no personal knowledge of this last point). So, misalignment is very bad from a performance standpoint.
It usually is worth it to waste the padding bytes for these reasons.
Update: The comments below are all useful. I recommend them.

Depending on the hardware, alignment might be necessary or just help speeding up execution.
There is a certain number of processors (ARM I believe) in which an unaligned access leads to a hardware exception. Plain and simple.
Even though typical x86 processors are more lenient, there is still a penalty in accessing unaligned fundamental types, as the processor has to do more work to bring the bits into the register before being able to operate on it. Compilers usually offer specific attributes/pragmas when packing is desirable nonetheless.

Because of virtual addressing.
"...aligning a page on a page-sized boundary lets the
hardware map a virtual address to a physical address by substituting
the higher bits in the address, rather than doing complex arithmetic."
By the way, I found the Wikipedia page on this quite well written.

If the register size of the CPU is 32 bits, then it can grab memory that is on 32 bit boundaries with a single assembly instruction. It is slower to grab 32 bits, and then get the byte that starts at bit 8.
BTW: There doesn't have to be padding. You can ask that structures be packed.
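Packing is requested through compiler extensions rather than standard C++. The sketch below uses #pragma pack, which GCC, Clang and MSVC accept; the struct names Padded and Packed are made up for the comparison, and the sizes in the comments assume a 4-byte int and a 1-byte bool.

```cpp
#include <cassert>
#include <cstddef>

struct Padded {
    int n;
    bool flag;
};  // typically 8 bytes: 5 bytes of data plus 3 bytes of tail padding

// Compiler extension, not standard C++: with a packing of 1 the members are
// laid out back to back and the struct itself only needs byte alignment.
#pragma pack(push, 1)
struct Packed {
    int n;
    bool flag;
};  // typically 5 bytes
#pragma pack(pop)

static_assert(sizeof(Packed) == sizeof(int) + sizeof(bool),
              "packed struct has no padding");
```

The price is that in an array of Packed most of the int members are misaligned, so accesses to them may be slower or need special handling on hardware that is strict about alignment.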
