What I don't understand is why we have to align data in memory on boundaries larger than 4 bytes, since all the other boundaries are multiples of 4. Assuming a CPU can read 4 bytes in a cycle, there would be essentially no performance difference if data that is 8 bytes large were aligned on a 4-byte, 8-byte, or 16-byte boundary.
When an x86 CPU reads a double, it reads 8 bytes in a cycle. When it reads an SSE vector, it reads 16 bytes. When it reads an AVX vector, it reads 32.
When the CPU fetches a cache line from memory, it also reads at least 32 bytes.
Your assumption that the CPU reads 4 bytes per cycle is false.
First: x86 CPUs don't read data only 4 bytes at a time; they can read 8 bytes in a cycle, or even more with SIMD extensions.
But to answer your question "why are there alignment boundaries larger than 4?": assuming a generic architecture (you didn't specify one, and you wrote that x86 was just an example), I'll present a specific case: GPUs.
NVIDIA GPU memory can only be accessed (store/load) if the address is aligned on a multiple of the access size (see the PTX ISA ld/st instructions). There are different kinds of loads, and the most performant ones happen when the address is aligned to a multiple of the access size, so if you're trying to load a double from memory (8 bytes) you would have (pseudocode):
ld.double [48dec] // Works, 8 bytes aligned
ld.double [17dec] // Fails, not 8 bytes aligned
In the above case, trying to access (read/write) memory that is not properly aligned will actually cause an error. If you want speed, you have to provide some safety guarantees.
That might answer your question on why alignment boundaries larger than 4 exist in the first place. On such an architecture an access size of 1 is always safe (every address is aligned to 1); that isn't true for any n > 1.
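Expressed in C++ rather than PTX, the rule boils down to a simple modulo check; the addresses 48 and 17 below are the ones from the pseudocode above, and is_naturally_aligned is just an illustrative helper:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// True if 'addr' may be used for an access of 'size' bytes under the
// "address must be a multiple of the access size" rule.
bool is_naturally_aligned(std::uintptr_t addr, std::size_t size)
{
    return addr % size == 0;
}

int main()
{
    std::printf("%d\n", is_naturally_aligned(48, 8)); // 1: the ld.double at 48 works
    std::printf("%d\n", is_naturally_aligned(17, 8)); // 0: the ld.double at 17 faults
    std::printf("%d\n", is_naturally_aligned(17, 1)); // 1: an access size of 1 is always safe
}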
Why is the value of an address in C and C++ always even?
For example: I declare a variable int x, and x has the memory address 0x6ffe1c (in hexadecimal). No matter what, that value is never an odd number; it is always even. Why is that so?
Computer memory is composed of bits. The bits are organized into groups. A computer may have a gigabyte of memory, which is over 1,000,000,000 bytes or 8,000,000,000 bits, but the physical connections to memory cannot simply get any one particular bit from that memory.
When the processor wants data from memory, it puts a signal on a bus that asks for a particular word of memory. A bus is largely a set of wires that connects different parts of the computer. A word of memory is a group of bits of some size particular to that hardware, perhaps 32 bits. When the memory device sees a request for a word, it gets those bits and puts them on the bus, all at once. (The bus for that will have 32 or more wires, so it can carry all the data for one word at one time.)
Let’s continue with the example of 32-bit words. Since memory is grouped into words of 32 bits, each word has a memory address that is a multiple of 32 bits, or four bytes. And every address that is a multiple of four (0, 4, 8, 12, 16, … 4096, 4100, 4104,…) is the address of a word. The processor always reads or writes memory in units of words—that is the only interaction the hardware can do; the processor cannot read individual bytes from memory. If your int is in a single word, then the processor can get it from memory by asking for that word.
On the other hand, suppose your int starts at address 99. Then one byte of it is in the word that starts at address 96 (addresses 96 to 99), and three bytes of it are in the word that starts at address 100 (addresses 100 to 103). In order to get your int, the processor has to read two words and then stitch together bytes from them to make one int.
First, that is a waste of time. Doing two reads from memory takes longer than doing one read. Second, if the processor has to have extra wires and circuits for doing that, it makes the processor more expensive and use more energy, and it takes resources away from other things the processor could be doing, like adding or multiplying.
So processors are designed to prefer aligned data. They may have components for handling unaligned data, but using those components may take extra time or resources. So compilers are designed to align objects in ways that are preferable for the target architecture.
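To make that concrete, here is a small sketch that computes which 4-byte words the int starting at address 99 from the example occupies (the arithmetic is the point; the numbers are just the ones used above):

#include <cstdio>

int main()
{
    const unsigned addr = 99; // start of the 4-byte int from the example
    const unsigned word = 4;  // 32-bit (4-byte) words

    unsigned first_word = addr / word * word;           // 96
    unsigned last_word  = (addr + 4 - 1) / word * word; // 100
    unsigned reads      = (last_word - first_word) / word + 1; // 2 word reads needed

    std::printf("first word %u, last word %u, reads needed %u\n",
                first_word, last_word, reads);
}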
Why is the value of an address in C and C++ always even?
This is not true in general. For example, if you have an array char[2], you'll find that exactly one of those elements has an even address, and the other must have an odd address.
I declare a variable int x and x has a memory address 0x6ffe1c
Every type in C and C++ has an alignment requirement. That is, objects of the type must be stored at an address that is divisible by that alignment. An alignment requirement is an integer that is always a power of two.
There is exactly one power of two that is not even: 2^0 == 1. Objects with an alignment requirement of 1 can be stored at odd addresses. char always has a size and alignment of 1 byte. int typically has a higher alignment requirement, in which case it will be stored at an even address.
The reason alignment is important is that there are CPU instruction sets which only allow reading and writing memory addresses that are aligned to the width of the CPU word. Other CPU instruction sets may support operations on addresses aligned to some fraction of the word size. Further, some CPU instruction sets, such as x86, support operating on entirely unaligned addresses, but (at least on older models) such operations may be much slower.
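You can query the per-type requirements of your own implementation with alignof; the values noted in the comments are typical for x86-64 but are implementation-defined, not guaranteed:

#include <iostream>

int main()
{
    // Alignment requirements are implementation-defined powers of two.
    std::cout << alignof(char)   << '\n'  // 1: a char may live at an odd address
              << alignof(int)    << '\n'  // typically 4
              << alignof(double) << '\n'; // typically 8
}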
On a 32-bit machine, one memory read cycle gets 4 bytes of data.
So it should take 32 read cycles to read the 128-byte buffer below:
char buffer[128];
Now, suppose I have aligned this buffer as shown below. How will that make it faster to read?
alignas(128) char buffer[128];
I am assuming a memory read cycle will still fetch only 4 bytes.
The size of the registers used for memory access is only one part of the story; the other part is the size of the cache line.
If a cache line is 64 bytes and your char[128] only has its natural alignment (which is 1 byte), the CPU generally needs to touch three different cache lines. With alignas(64) or alignas(128), only two cache lines need to be touched.
If you are working with a memory-mapped file, or under swapping conditions, the next level of alignment kicks in: the size of a memory page. This would call for 4096 or 8192 byte alignments.
However, I seriously doubt that alignas() has any significant positive effect if the specified alignment is larger than the natural alignment that the compiler uses anyway: It significantly increases memory consumption, which may be enough to trigger more cache-lines/memory pages being touched in the first place. It's only the small misalignments that need to be avoided because they may trigger huge slowdowns on some CPUs, or might be downright illegal/impossible on others.
Thus, truth is only in measurement: If you need all the speedup you can get, try it, measure the runtime difference, and see whether it works out.
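As a rough illustration, here is a sketch that counts how many 64-byte cache lines a 128-byte buffer spans; the 64-byte line size is an assumption about the target CPU, not something the language guarantees:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// How many 64-byte cache lines a buffer of 'size' bytes starting at 'addr' touches.
std::size_t lines_touched(std::uintptr_t addr, std::size_t size)
{
    const std::uintptr_t line = 64;
    return (addr + size - 1) / line - addr / line + 1;
}

int main()
{
    alignas(64) char aligned[128];
    char plain[128]; // only 1-byte alignment is guaranteed

    std::printf("alignas(64) buffer: %zu lines\n",
                lines_touched(reinterpret_cast<std::uintptr_t>(aligned), sizeof aligned)); // always 2
    std::printf("plain buffer:       %zu lines\n",
                lines_touched(reinterpret_cast<std::uintptr_t>(plain), sizeof plain));     // 2 or 3
}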
On a 32-bit machine, one memory read cycle gets 4 bytes of data.
It's not that simple. Just the term "32 bit machine" is already too broad and can mean many things. 32b registers (GP registers? ALU registers? Address registers?)? 32b address bus? 32b data bus? 32b instruction word size?
And "memory read" by whom. CPU? Cache? DMA chip?
If you have a HW platform where memory is read 4 bytes at a time (aligned by 4) in a single cycle and without any cache, then alignas(128) will make no difference compared to alignas(4).
There is a Boost tutorial giving approximately the following code, slightly modified for my question:
#include <boost/align/aligned_allocator.hpp>
#include <vector>

int main()
{
    std::vector<int, boost::alignment::aligned_allocator<int, 16> > v(100);
}
In this example, an alignment parameter of 16 is given. Does this indicate 16 bytes of alignment, or 16*sizeof(int) bytes of alignment?
It would represent 16 bytes of alignment.
On some processors, access to a non-aligned memory address can result in an exception. On others, a non-aligned memory access might work, but may be suboptimal, possibly requiring extra reads of memory at aligned addresses. The actual alignment needed or desired varies depending on context.
For example, on a 32-bit x86 processor a 32-bit (4 byte) non-aligned access can result in two aligned memory accesses. If a 4 byte read was done at address 1, the processor may need to read bytes 0..3, followed by a read of bytes 4..7, and then combine bytes 1..4 into the result, discarding the extra data read.
For SIMD instructions the alignment is greater. A 64-bit MMX instruction should access memory that is 64-bit (8 byte) aligned. A 128-bit XMM instruction should access memory that is 128-bit (16 byte) aligned.
On a SPARC processor an unaligned memory access results in a processor exception. I believe ARM also generates exceptions for unaligned access. On x86 you can also get exceptions in some cases; in particular, if the stack is not properly aligned, it can cause a program crash, a detail that is usually handled by the compiler.
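If you want to convince yourself that the allocator from the question really hands back 16-byte-aligned storage, a quick check (assuming Boost.Align is installed) is:

#include <boost/align/aligned_allocator.hpp>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int, boost::alignment::aligned_allocator<int, 16> > v(100);

    // The allocator promises storage whose address is a multiple of 16 bytes.
    std::cout << (reinterpret_cast<std::uintptr_t>(v.data()) % 16 == 0) << '\n'; // prints 1
}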
The number 16 refers to the number of bytes. From the Boost.Align documentation (that uses the same terminology as the C++ Standard)
[basic.align]
Object types have alignment requirements which place restrictions on the addresses at which an object of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated. An object type imposes an alignment requirement on every object of that type; stricter alignment can be requested using the alignment specifier.
I've read about coalesced memory access (In CUDA, what is memory coalescing, and how is it achieved?) and its performance importance. However, I don't know what a typical GPU does when a non-coalesced memory access occurs. When a thread "asks" for a byte at position P and the other threads ask for something far away, does the GPU fetch a complete block of 128 bytes for that thread? If the read is aligned, can I read the other 127 bytes for "free"?
General rules:
memory access instructions are issued warp-wide, just like any other instruction
each thread in a warp provides an address to read from
assuming these addresses don't "hit" in any of the caches, the memory controller collects all addresses and determines how many "segments" (roughly analogous to a cacheline) are required from DRAM. A "segment" is either 32 bytes or 128 bytes, depending on cache and device specifics.
the memory controller then requests those lines/segments from DRAM
If a single thread generates an address that is not near any of the other addresses generated in the warp, then the memory controller will need to request a whole line/segment from DRAM, which may be either 32 bytes or 128 bytes, depending on device and which caches are involved (i.e. what type of "miss" occurred) just to satisfy that one address from that one thread. Therefore regardless of whether that thread is requesting a minimum of 1 byte or up to the maximum of 16 bytes possible in a single thread read transaction, the memory controller must read either 32 bytes or 128 bytes from DRAM to satisfy the read originating from that thread. Similar logic will apply to every other address emanating from that particular "warp read".
This type of scattered or isolated access pattern is "uncoalesced", because no other thread in the warp needs an address close enough so that it can fulfill its needs from the same segment/line.
When a thread "asks" for a byte at position P and the other threads ask for something far away, does the GPU fetch a complete block of 128 bytes for that thread?
Yes, either 32 bytes or 128 bytes is the minimum granularity of request that can be made from DRAM.
If the read is aligned, can I read the other 127 bytes for "free"?
Whether you need it or not, and regardless of alignment of requests within the line/segment, you will get either 32 bytes or 128 bytes from any DRAM read transaction.
This doesn't cover every case, but a general breakdown of the 32-byte/128-byte difference is as follows:
cc2.x devices have an enabled L1 cache, and so a cache "miss" will generally trigger a read of 128 bytes
cc3.x devices have only L2 cache enabled (for global memory transactions) and the L2 cacheline size is 32 bytes. A "miss" here will require a 32-byte load from DRAM, but a fully coalesced read across a warp will still ultimately require a load of 128 bytes (for int or float, for example) so ultimately four L2 cachelines will still be needed. (There is no free lunch.)
cc5.x devices once again have the L1 enabled, so should be back to needing a full 128 byte load on a "miss"
This presentation will be instructive. In particular, slide 17 shows one example of "perfect" coalescing, whereas slide 25 shows an example of a "fully uncoalesced" load.
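If it helps to see the arithmetic, here is a small host-side C++ sketch (not a CUDA kernel) that counts how many 32-byte segments a warp of 32 threads would touch for two access patterns; the 32-byte segment size and the 4-byte-per-thread read are the assumptions discussed above:

#include <cstddef>
#include <cstdio>
#include <set>

// Count the 32-byte DRAM segments touched by a warp of 32 threads,
// assuming thread 'tid' reads 4 bytes at byte offset base + tid * stride.
std::size_t segments_touched(std::size_t base, std::size_t stride)
{
    std::set<std::size_t> segments;
    for (std::size_t tid = 0; tid < 32; ++tid)
        segments.insert((base + tid * stride) / 32);
    return segments.size();
}

int main()
{
    std::printf("coalesced, stride 4 bytes:    %zu segments\n", segments_touched(0, 4));    // 4
    std::printf("scattered, stride 4096 bytes: %zu segments\n", segments_touched(0, 4096)); // 32
}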
I am using a hardware interface to send data that requires me to set up a DMA buffer, which needs to be aligned on 64-bit boundaries.
The DMA engine expects buffers to be aligned on at least 32-bit boundaries (4 bytes). For optimal performance the buffer should be aligned on 64-bit boundaries (8 bytes). The transfer size must be a multiple of 4 bytes.
I create buffers using posix_memalign, as demonstrated in the snippet below:
posix_memalign((void**)&pPattern, 0x1000, DmaBufferSizeinInt32s * sizeof(int));
pPattern is a pointer to an int, and is the start of my buffer which is DmaBufferSizeinInt32s deep.
Is my buffer aligned on 64 bits?
Yes, your buffer IS aligned on 64 bits. It's ALSO aligned on a 4 KB boundary (hence the 0x1000). If you don't want the 4 KB alignment, then pass 0x8 instead of 0x1000.
Edit: I would also note that usually when writing DMA chains you are writing them through uncached memory or through some kind of non-cache based write queue. If this is the case you want to align your DMA chains to the cache line size as well to prevent a cache write-back overwriting the start or end of your DMA chain.
As Goz pointed out, but (imo) a bit less clearly: you're asking for alignment by 0x1000 bytes (the second argument), which is much more than 64 bits.
You could change the call to just:
posix_memalign((void**)&pPattern, 8, DmaBufferSizeinInt32s * sizeof(int));
This might make the call cheaper (less wasted memory), and in any case is clearer, since you ask for something that more closely matches what you actually want.
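For reference, a self-contained version of that cheaper call with error checking might look like this (POSIX only; the buffer size below is just a placeholder for whatever DmaBufferSizeinInt32s is in the real code):

#include <cstdio>
#include <stdlib.h> // posix_memalign (POSIX), free

int main()
{
    // Placeholder size; substitute the real DmaBufferSizeinInt32s.
    const size_t DmaBufferSizeinInt32s = 1024;

    int* pPattern = nullptr;
    // 8-byte alignment satisfies the 64-bit requirement; posix_memalign also
    // requires the alignment to be a power of two and a multiple of sizeof(void*).
    int rc = posix_memalign(reinterpret_cast<void**>(&pPattern), 8,
                            DmaBufferSizeinInt32s * sizeof(int));
    if (rc != 0) {
        std::fprintf(stderr, "posix_memalign failed: %d\n", rc);
        return 1;
    }

    // ... fill the buffer and hand it to the DMA engine ...

    free(pPattern);
}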
I don't know your hardware and I don't know how you are getting your pPattern pointer, but this seems risky all around. Most DMA I am familiar with requires physically contiguous RAM. The operating system only provides virtually contiguous RAM to user programs. That means a memory allocation of 1 MB might be composed of up to 256 unconnected 4K RAM pages.
Much of the time, memory allocations will be made of contiguous physical pieces, which can lead to things working most of the time but not always. You need a kernel device driver to provide safe DMA.
I wonder about this because if your pPattern pointer is coming from a device driver, then why do you need to align it more?