Coalesced memory access performance - opengl

I've read about coalesced memory access (In CUDA, what is memory coalescing, and how is it achieved?) and its importance for performance. However, I don't know what a typical GPU does when a non-coalesced memory access occurs. When a thread "asks" for a byte at position P and the other threads ask for something far away, does the GPU fetch a complete 128-byte block for that thread? If the read is aligned, can I read the other 127 bytes for "free"?

General rules:
memory access instructions are issued warp-wide, just like any other instruction
each thread in a warp provides an address to read from
assuming these addresses don't "hit" in any of the caches, the memory controller collects all addresses and determines how many "segments" (roughly analogous to a cacheline) are required from DRAM. A "segment" is either 32 bytes or 128 bytes, depending on cache and device specifics.
the memory controller then requests those lines/segments from DRAM
If a single thread generates an address that is not near any of the other addresses generated in the warp, then the memory controller will need to request a whole line/segment from DRAM, which may be either 32 bytes or 128 bytes depending on the device and which caches are involved (i.e. what type of "miss" occurred), just to satisfy that one address from that one thread. Therefore, regardless of whether that thread is requesting a minimum of 1 byte or the maximum of 16 bytes possible in a single-thread read transaction, the memory controller must read either 32 bytes or 128 bytes from DRAM to satisfy the read originating from that thread. Similar logic applies to every other address emanating from that particular "warp read".
This type of scattered or isolated access pattern is "uncoalesced", because no other thread in the warp needs an address close enough so that it can fulfill its needs from the same segment/line.
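The bookkeeping the memory controller does can be sketched host-side: collect the warp's addresses and count how many distinct segments they fall into. This is an illustrative C++ sketch, not a CUDA API; `segmentsTouched` is a hypothetical helper, and the 32-byte segment size is the smaller of the two granularities discussed above.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Count how many 32-byte DRAM segments a warp's addresses touch.
// A coalesced warp (consecutive 4-byte words) touches few segments;
// a scattered warp needs one segment per thread.
std::size_t segmentsTouched(const std::vector<std::uint64_t>& addrs,
                            std::uint64_t segSize = 32) {
    std::set<std::uint64_t> segs;
    for (auto a : addrs)
        segs.insert(a / segSize);  // index of the segment holding this address
    return segs.size();
}
```

For example, 32 threads reading consecutive floats starting at a 128-byte-aligned base touch 128 / 32 = 4 segments, while the same threads striding 1024 bytes apart touch 32 segments, one per thread.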
When a thread "asks" for a byte at position P and the other threads ask for something far away, does the GPU fetch a complete 128-byte block for that thread?
Yes, either 32 bytes or 128 bytes is the minimum granularity of request that can be made from DRAM.
If the read is aligned, can I read the other 127 bytes for "free"?
Whether you need it or not, and regardless of alignment of requests within the line/segment, you will get either 32 bytes or 128 bytes from any DRAM read transaction.
This doesn't cover every case, but a general breakdown of the 32-byte/128-byte difference is as follows:
cc2.x devices have an enabled L1 cache, and so a cache "miss" will generally trigger a read of 128 bytes
cc3.x devices have only L2 cache enabled (for global memory transactions) and the L2 cacheline size is 32 bytes. A "miss" here will require a 32-byte load from DRAM, but a fully coalesced read across a warp will still ultimately require a load of 128 bytes (for int or float, for example) so ultimately four L2 cachelines will still be needed. (There is no free lunch.)
cc5.x devices once again have the L1 enabled, so should be back to needing a full 128 byte load on a "miss"
This presentation will be instructive. In particular, slide 17 shows one example of "perfect" coalescing, whereas slide 25 shows an example of a "fully uncoalesced" load.


Why is the value of an address in C and C++ always even?
For example: I declare a variable int x and x has the memory address 0x6ffe1c (in hexadecimal). No matter what, that value is never an odd number; it is always even. Why is that so?
Computer memory is composed of bits. The bits are organized into groups. A computer may have a gigabyte of memory, which is over 1,000,000,000 bytes or 8,000,000,000 bits, but the physical connections to memory cannot simply get any one particular bit from that memory.
When the processor wants data from memory, it puts a signal on a bus that asks for a particular word of memory. A bus is largely a set of wires that connects different parts of the computer. A word of memory is a group of bits of some size particular to that hardware, perhaps 32 bits. When the memory device sees a request for a word, it gets those bits and puts them on the bus, all at once. (The bus for that will have 32 or more wires, so it can carry all the data for one word at one time.)
Let’s continue with the example of 32-bit words. Since memory is grouped into words of 32 bits, each word starts at a memory address that is a multiple of four bytes (32 bits). And every address that is a multiple of four (0, 4, 8, 12, 16, … 4096, 4100, 4104, …) is the address of a word. The processor always reads or writes memory in units of words—that is the only interaction the hardware can do; the processor cannot read individual bytes from memory. If your int is in a single word, then the processor can get it from memory by asking for that word.
On the other hand, suppose your int starts at address 99. Then one byte of it is in the word that starts at address 96 (addresses 96 to 99), and three bytes of it are in the word that starts at address 100 (addresses 100 to 103). In order to get your int, the processor has to read two words and then stitch together bytes from them to make one int.
First, that is a waste of time. Doing two reads from memory takes longer than doing one read. Second, if the processor has to have extra wires and circuits for doing that, it makes the processor more expensive and use more energy, and it takes resources away from other things the processor could be doing, like adding or multiplying.
So processors are designed to prefer aligned data. They may have components for handling unaligned data, but using those components may take extra time or resources. So compilers are designed to align objects in ways that are preferable for the target architecture.
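The two-word case above can be computed directly: given an object's address and size, find the word-aligned addresses of the first and last bytes it covers. `wordsSpanned` is a hypothetical helper written for this illustration, assuming 4-byte words as in the answer.

```cpp
#include <cstdint>
#include <utility>

// Which 4-byte words does an object at [addr, addr + size) touch?
// Returns the word-aligned start addresses of the first and last word covered.
std::pair<std::uint64_t, std::uint64_t>
wordsSpanned(std::uint64_t addr, std::uint64_t size, std::uint64_t word = 4) {
    std::uint64_t first = addr / word * word;              // round addr down
    std::uint64_t last = (addr + size - 1) / word * word;  // word holding the last byte
    return {first, last};
}
```

An int at address 99 spans the words at 96 and 100 (two reads plus stitching), while an int at address 96 sits entirely inside the word at 96 (one read).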
Why is the value of the address in C and C++ always even?
This is not true in general. For example, if you have an array char[2], you'll find that exactly one of those elements has an even address, and the other must have an odd address.
I declare a variable int x and x has a memory address 0x6ffe1c
Every type in C and C++ has some alignment requirement. That is, objects of the type must be stored at an address that is divisible by that alignment. The alignment requirement is an integer that is always a power of two.
There is exactly one power of two that is not even: 2^0 == 1. Objects with an alignment requirement of 1 can be stored at odd addresses. char always has a size and alignment of 1 byte. int typically has a higher alignment requirement, in which case it will be stored at an even address.
The reason why alignment is important is that there are CPU instruction sets which only allow reading and writing to memory addresses that are aligned to the width of the CPU word. Other CPU instruction sets may support operations on addresses aligned to some fraction of the word size. Further, some CPU instruction sets, such as x86, support operating on entirely unaligned addresses, but (at least on older models) such operations may be much slower.
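These rules are directly observable in C++ via `alignof` and `offsetof`. A minimal sketch (the struct name `Mixed` is invented for the illustration):

```cpp
#include <cstddef>

// alignof reports a type's alignment requirement; the language guarantees
// it is a power of two. char can sit at any address, even or odd.
static_assert(alignof(char) == 1, "char has no alignment restriction");
static_assert((alignof(int) & (alignof(int) - 1)) == 0,
              "alignment requirements are powers of two");

struct Mixed { char c; int i; };
// The compiler inserts padding after 'c' so that 'i' lands on an
// int-aligned offset within the struct:
static_assert(offsetof(Mixed, i) % alignof(int) == 0,
              "members are placed at aligned offsets");
```

On a typical platform `alignof(int)` is 4, so `i` is placed at offset 4 and `sizeof(Mixed)` is 8 rather than 5.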

Cache mapping techniques

I'm trying to understand hardware caches. I have a slight idea, but I would like to ask here whether my understanding is correct or not.
So I understand that there are 3 types of cache mapping: direct, fully associative, and set associative.
I would like to know: is the type of mapping implemented with logic gates in hardware and specific to a given computer system, so that in order to change the mapping one would be required to change the electrical connections?
My current understanding is that in RAM, there exists a memory address to refer to each block of memory. Within a block are words, and each word contains a number of bytes. We can represent the number of options with a number of bits.
So, for example, with 4096 memory locations, each containing 16 bytes: if we want to refer to each byte, then 2^12 * 2^4 = 2^16, so a
16-bit memory address would be required to refer to each byte.
The cache also has a memory address, a valid bit, a tag, and some data capable of storing a block of main memory of n words and thus m bytes, where m = n*i (i bytes per word).
For example, direct mapping:
One block of main memory can only be at one particular location in the cache. When the CPU requests some data using a 16-bit memory address of RAM, it checks the cache first.
How does it know that this particular 16-bit memory address can only be in a few places?
My thoughts are that there could be some electrical connection between every RAM address and a cache address. The 16-bit address could then be split into parts; for example, compare only the left 8 bits with every cache memory address, then if they match, compare the byte bits, then the tag bits, then the valid bit.
Is my understanding correct? Thank you!
Really do appreciate if someone read this long post
You may want to read 3.3.1 Associativity in What Every Programmer Should Know About Memory from Ulrich Drepper.
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf#subsubsection.3.3.1
The title is a little bit catchy, but it explains everything you ask in detail.
In short:
the problem with caches is the number of comparisons. If your cache holds 100 blocks, you need to perform 100 comparisons in one cycle. You can reduce this number with the introduction of sets. If a specific memory region can only be placed in slots 1-10, you reduce the number of comparisons to 10.
The sets are addressed by an additional bit field inside the memory address called the index.
So, for instance, your 16 bits (from your example) could be split into:
[15:6] block-address; stored in the `cache` as the `tag` to identify the block
[5:4] index bits; 2 bits --> 4 sets
[3:0] block-offset; byte position inside the block
so the choice of method depends on the availability of hardware resources and the access time you want to achieve. It's pretty much hardwired, since you want to reduce the comparison logic.
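The bit-field split above can be sketched in a few lines of C++. This is an illustration of the [15:6] / [5:4] / [3:0] layout from the answer, not real cache hardware; `CacheAddr` and `split` are names invented here.

```cpp
#include <cstdint>

// Decompose a 16-bit address per the layout above:
// [15:6] tag, [5:4] index (2 bits -> 4 sets), [3:0] offset (16-byte blocks).
struct CacheAddr {
    std::uint16_t tag;
    std::uint16_t index;
    std::uint16_t offset;
};

CacheAddr split(std::uint16_t addr) {
    return {
        static_cast<std::uint16_t>(addr >> 6),         // top 10 bits
        static_cast<std::uint16_t>((addr >> 4) & 0x3), // middle 2 bits
        static_cast<std::uint16_t>(addr & 0xF)         // low 4 bits
    };
}
```

The hardware uses the index bits to select a set, then compares only the stored tags in that set against the address's tag bits, which is exactly the reduction in comparisons described above.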
There are a few mapping functions used to map cache lines to main memory:
Direct Mapping
Associative Mapping
Set-Associative Mapping
You should have at least a slight idea about these three mapping functions.

How does std::alignas optimize the performance of a program?

On a 32-bit machine, one memory read cycle gets 4 bytes of data.
So it should take 32 read cycles to read the 128-byte buffer below.
char buffer[128];
Now, suppose I have aligned this buffer as shown below; please let me know how that will make it faster to read.
alignas(128) char buffer[128];
I am assuming the memory read cycle will remain 4 bytes only.
The size of the registers used for memory access is only one part of the story, the other part is the size of the cache-line.
If a cache line is 64 bytes and your char[128] is only naturally aligned, the CPU generally needs to touch three different cache lines. With alignas(64) or alignas(128), only two cache lines need to be touched.
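The three-versus-two cache-line claim is easy to check arithmetically. `linesTouched` is a hypothetical helper written for this illustration, assuming 64-byte cache lines:

```cpp
#include <cstdint>

// How many cache lines does a buffer [addr, addr + len) touch?
std::uint64_t linesTouched(std::uint64_t addr, std::uint64_t len,
                           std::uint64_t line = 64) {
    std::uint64_t first = addr / line;            // line holding the first byte
    std::uint64_t last = (addr + len - 1) / line; // line holding the last byte
    return last - first + 1;
}
```

A 128-byte buffer starting at address 1 spans lines 0, 1, and 2 (three lines), while the same buffer starting at a 64-byte boundary spans exactly two.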
If you are working with memory mapped file, or under swapping conditions, the next level of alignment kicks in: the size of a memory page. This would call for 4096 or 8192 byte alignments.
However, I seriously doubt that alignas() has any significant positive effect if the specified alignment is larger than the natural alignment that the compiler uses anyway: It significantly increases memory consumption, which may be enough to trigger more cache-lines/memory pages being touched in the first place. It's only the small misalignments that need to be avoided because they may trigger huge slowdowns on some CPUs, or might be downright illegal/impossible on others.
Thus, truth is only in measurement: If you need all the speedup you can get, try it, measure the runtime difference, and see whether it works out.
On a 32-bit machine, one memory read cycle gets 4 bytes of data.
It's not that simple. Just the term "32 bit machine" is already too broad and can mean many things. 32b registers (GP registers? ALU registers? Address registers?)? 32b address bus? 32b data bus? 32b instruction word size?
And "memory read" by whom. CPU? Cache? DMA chip?
If you have a HW platform where memory is read 4 bytes at a time (aligned to 4) in a single cycle and without any cache, then alignas(128) will make no difference (compared to alignas(4)).

Why are there alignment boundaries larger than 4?

What I don't understand is why we have to align data in memory on boundaries larger than 4 bytes, since all the other boundaries are multiples of 4. Assuming a CPU can read 4 bytes in a cycle, there will be basically no difference in performance whether 8-byte data is aligned on a 4-byte, 8-byte, or 16-byte boundary.
When an x86 CPU reads a double, it reads 8 bytes in a cycle. When it reads an SSE vector, it reads 16 bytes. When it reads an AVX vector, it reads 32.
When the CPU fetches a cache line from memory, it also reads at least 32 bytes.
Your assumption that the CPU reads 4 bytes per cycle is false.
First: x86 CPUs don't read stuff in 4-byte units only; they can read 8 bytes in a cycle, or even more with SIMD extensions.
But to answer your question "why are there alignment boundaries larger than 4?", assuming a generic architecture (you didn't specify one, and you wrote that x86 was just an example), I'll present a specific case: GPUs.
NVIDIA GPU memory can only be accessed (store/load) if the address is aligned on a multiple of the access size (PTX ISA ld/st). There are different kinds of loads and the most performant ones happen when the address is aligned to a multiple of the access size so if you're trying to load a double from memory (8 bytes) you would have (pseudocode):
ld.double [48dec] // Works, 8 bytes aligned
ld.double [17dec] // Fails, not 8 bytes aligned
In the above case, trying to access (r/w) memory that is not properly aligned will actually cause an error. If you want speed, you'll have to provide some safety guarantees.
That might answer your question on why alignment boundaries larger than 4 exist in the first place. On such an architecture an access size of 1 is always safe (every address is aligned to 1). That isn't true for every n>1.
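The PTX rule above ("the address must be a multiple of the access size") can be mimicked with a one-line check. `loadAllowed` is an invented name for this sketch, not part of any GPU API:

```cpp
#include <cstdint>

// A load of 'size' bytes at 'addr' is legal only when addr is a
// multiple of size (access sizes are powers of two: 1, 2, 4, 8, 16).
bool loadAllowed(std::uint64_t addr, std::uint64_t size) {
    return addr % size == 0;
}
```

This matches the ld.double examples: address 48 is fine for an 8-byte load, address 17 is not, and a 1-byte access is allowed at any address, which is why an access size of 1 is always safe.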

How do I ensure buffer memory is aligned?

I am using a hardware interface to send data that requires me to set up a DMA buffer, which needs to be aligned on 64-bit boundaries.
The DMA engine expects buffers to be aligned on at least 32-bit boundaries (4 bytes). For optimal performance the buffer should be aligned on 64-bit boundaries (8 bytes). The transfer size must be a multiple of 4 bytes.
I create buffers using posix_memalign, as demonstrated in the snippet below.
posix_memalign ((void**)&pPattern, 0x1000, DmaBufferSizeinInt32s * sizeof(int))
pPattern is a pointer to an int, and is the start of my buffer which is DmaBufferSizeinInt32s deep.
Is my buffer aligned on 64bits?
Yes, your buffer IS aligned on 64 bits. It's ALSO aligned on a 4 KByte boundary (hence the 0x1000). If you don't want the 4 KB alignment, then pass 0x8 instead of 0x1000.
Edit: I would also note that usually when writing DMA chains you are writing them through uncached memory or through some kind of non-cache based write queue. If this is the case you want to align your DMA chains to the cache line size as well to prevent a cache write-back overwriting the start or end of your DMA chain.
As Goz pointed out, but (imo) a bit less clearly: you're asking for alignment by 0x1000 bytes (the second argument), which is much more than 64 bits.
You could change the call to just:
posix_memalign((void**)&pPattern, 8, DmaBufferSizeinInt32s * sizeof(int));
This might make the call cheaper (less wasted memory), and in any case is clearer, since you ask for something that more closely matches what you actually want.
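Put together, an 8-byte-aligned allocation might look like the sketch below. `allocDmaBuffer` is a hypothetical wrapper written for this illustration; note that posix_memalign requires the alignment to be a power of two and a multiple of sizeof(void*), which 8 satisfies on common platforms.

```cpp
#include <cstdlib>
#include <cstdint>

// Allocate nInts ints aligned on an 8-byte (64-bit) boundary, as the
// DMA engine prefers. Returns nullptr on failure; free with std::free.
int* allocDmaBuffer(std::size_t nInts) {
    void* p = nullptr;
    if (posix_memalign(&p, 8, nInts * sizeof(int)) != 0)
        return nullptr;
    return static_cast<int*>(p);
}
```

The returned pointer's numeric value is then guaranteed to be divisible by 8, which you can verify by casting it to an integer and checking the remainder.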
I don't know your hardware and I don't know how you are getting your pPattern pointer, but this seems risky all around. Most DMA I am familiar with requires physically contiguous RAM. The operating system only provides virtually contiguous RAM to user programs. That means a memory allocation of 1 MB might be composed of up to 256 unconnected 4K RAM pages.
Much of the time, memory allocations will be made of contiguous physical pieces, which can lead to things working most of the time but not always. You need a kernel device driver to provide safe DMA.
I wonder about this because if your pPattern pointer is coming from a device driver, then why do you need to align it more?