How do I ensure buffer memory is aligned? - c++

I am using a hardware interface to send data that requires me to set up a DMA buffer, which needs to be aligned on 64 bits boundaries.
The DMA engine expects buffers to be aligned on at least 32 bits boundaries (4 bytes). For optimal performance the buffer should be aligned on 64 bits boundaries (8 bytes). The transfer size must be a multiple of 4 bytes.
I create buffers using posix_memalign, as demonstrated in the snippet bellow.
posix_memalign ((void**)&pPattern, 0x1000, DmaBufferSizeinInt32s * sizeof(int))
pPattern is a pointer to an int, and is the start of my buffer which is DmaBufferSizeinInt32s deep.
Is my buffer aligned on 64bits?

Yes, your buffer IS aligned on 64-bits. It's ALSO aligned on a 4 KByte boundary (hence the 0x1000). If you don't want the 4 KB alignement then pass 0x8 instead of 0x1000 ...
Edit: I would also note that usually when writing DMA chains you are writing them through uncached memory or through some kind of non-cache based write queue. If this is the case you want to align your DMA chains to the cache line size as well to prevent a cache write-back overwriting the start or end of your DMA chain.

As Goz pointed out, but (imo) a bit less clearly: you're asking for alignment by 0x1000 bytes (the second argument), which is much more than 64 bits.
You could change the call to just:
posix_memalign ((void**)&pPattern, 8, DmaBufferSizeinInt32s * sizeof(int)))
This might make the call cheaper (less wasted memory), and in any case is clearer, since you ask for something that more closely matches what you actually want.

I don't know your hardware and I don't know how you are getting your pPattern pointer, but this seems risky all around. Most DMA I am familiar with requires physical continuous RAM. The operating system only provides virtually continuous RAM to user programs. That means that a memory allocation of 1 MB might be composed of up to 256 unconnected 4K RAM pages.
Much of the time memory allocations will be made of continuous physical pieces which can lead to things working most of the time but not always. You need a kernel device driver to provide safe DMA.
I wonder about this because if your pPattern pointer is coming from a device driver, then why do you need to align it more?

Related

Allocate small struct to 32 bit aligned in 64 bit system

The problem: I'm implementing a non-blocking data structure, where threads alter a shared pointer using a CAS operation. As pointers can be recycled, we have the ABA issue. To avoid this, I want to attach a version to each pointer. This is called a versioned pointer. A CAS128 is considered more expensive than a CAS64, so I'm trying to avoid going above 64 bits.
I'm trying to implement a versioned pointer. In a 32b system, the versioned pointer is a 64b struct, where the top 32 bits are the pointer and the bottom 32 is its version. This allows me to use CAS64 to atomically alter the pointer.
I'm having issues with a 64b system. In this case, I still want to use CAS64 instead of CAS128, so I'm trying to allocate a pointer aligned to 4GB (i.e., 32 zeros). I can then use masks to infer the pointer and version.
The solutions I've tried using alligned_malloc, padding, and std::align, but these involve allocating very large amounts of memory, e.g., alligned_malloc(1LL << 32, (1LL << 32)* sizeof(void*)) allocates 4GB of memory. Another solution is using a memory mapped file, but this involves synchronization that we're trying to avoid.
Is there a way to allocate 8B of memory aligned to 4GB that I'm missing?
First off, a non-portable solution that limits the code complexity creep to the point of allocation (see below for another approach that makes point of use more complicated, but should be portable); it only works on POSIX systems (not Windows), but you could reduce your overhead to the size of a page (not 8 bytes, but in the context of a 64 bit system, wasting 4088 bytes isn't too bad if you're not doing it too often; obviously, the nature of your problem means that you can't possibly waste more than sysconf(_SC_PAGESIZE) - 8 bytes per 4 GB, so that's not too bad) by the following mechanism:
mmap 4 GB of memory anonymously (not file-backed; pass fd of -1 and include the MAP_ANONYMOUS flag)
Compute the address of the 4 GB aligned pointer within that block
munmap the memory preceding that address, and the memory beginning sysconf(_SC_PAGE_SIZE) bytes after that address
This works because memory mappings aren't monolithic; they can be unmapped piecemeal, individual pages can be remapped without error, etc.
Note that if you're short on swap space, the brief request for 4 GB might cause problems (e.g. on a Linux system with heuristic overcommit disabled, it might fail to allocate the memory if it can't back it with swap, even though you never use most of it). You can experiment with passing MAP_NORESERVE to the original request, then performing the unmapping, then remapping that single page with MAP_FIXED (without MAP_NORESERVE) to ensure the allocation can be used without triggering a SIGSEGV at time of writing.
If you can't use POSIX mmap, should it really be impossible to use CAS128, you may want to consider a segmented memory model like the old x86 scheme for these pointers. You block allocate 4 GB segments (they don't need any special alignment) up front, and have your "pointers" be 32 bit offsets from the base address of the segment; you can't use the whole 64 bit address space (unless you allow for multiple selectors, possibly by repurposing part of the version number field for example; you can probably make do with a few million versions rather than four billion after all), but if you don't need to do so, this lets you have a base address that never changes after allocation (so no atomics needed), with offsets that fit within your desired 32 bit field. So instead of getting your data via:
data = *mystruct.pointer;
you have a segment pointer like this initialized early:
char *const base_address = new char[1ULL << 32]; // Or use smart pointer of your choosing
wrap it in a suballocator to partition the space, and now lookup is instead:
data = *reinterpret_cast<REAL_TYPE_HERE*>(&base_address[mystruct.pointer]);
I'm sure there are nifty ways to wrap this up better with custom allocators, custom operator news, what have you, but I've never had to do this in C++ (I've done similar magic in C, where there are no facilities to make it "pretty"), and I'd probably get it wrong, so I'll leave that as an exercise.

How does std::alignas optimize the performance of a program?

In 32-bit machine, One memory read cycle gets 4 bytes of data.
So for reading below buffer, It should take 32 read-cycle to read a buffer of 128 bytes mentioned below.
char buffer[128];
Now, Suppose if I have aligned this buffer as mentioned below then please let me know how will it make it faster to read?
alignas(128) char buffer[128];
I am assuming the memory read cycle will remain 4 bytes only.
The size of the registers used for memory access is only one part of the story, the other part is the size of the cache-line.
If a cache-line is 64 bytes and your char[128] is naturally aligned, the CPU generally needs to manipulate three different cache-lines. With alignas(64) or alignas(128), only two cache-lines need to be touched.
If you are working with memory mapped file, or under swapping conditions, the next level of alignment kicks in: the size of a memory page. This would call for 4096 or 8192 byte alignments.
However, I seriously doubt that alignas() has any significant positive effect if the specified alignment is larger than the natural alignment that the compiler uses anyway: It significantly increases memory consumption, which may be enough to trigger more cache-lines/memory pages being touched in the first place. It's only the small misalignments that need to be avoided because they may trigger huge slowdowns on some CPUs, or might be downright illegal/impossible on others.
Thus, truth is only in measurement: If you need all the speedup you can get, try it, measure the runtime difference, and see whether it works out.
In 32 bit machine, One memory read cycle gets 4 bytes of data.
It's not that simple. Just the term "32 bit machine" is already too broad and can mean many things. 32b registers (GP registers? ALU registers? Address registers?)? 32b address bus? 32b data bus? 32b instruction word size?
And "memory read" by whom. CPU? Cache? DMA chip?
If you have a HW platform where memory is read by 4 bytes (aligned by 4) in single cycle and without any cache, then alignas(128) will do no difference (than alignas(4)).

Coalesced memory access performance

I've read about coalesced memory access(In CUDA, what is memory coalescing, and how is it achieved?) and its performance importance. However I don't know what a typical GPU does when a non coalesced memory access occur. When a thread "asks" for a byte in position P and the other threads asks for something far away the GPU gets a complete block of 128 bytes for that thread? If the reading is aligned can I read the other 127 bytes for "free"?
General rules:
memory access instructions are issued warp-wide, just like any other instruction
each thread in a warp provides an address to read from
assuming these addresses don't "hit" in any of the caches, the memory controller collects all addresses and determines how many "segments" (roughly analogous to a cacheline) are required from DRAM. A "segment" is either 32 bytes or 128 bytes, depending on cache and device specifics.
the memory controller then requests those lines/segments from DRAM
If a single thread generates an address that is not near any of the other addresses generated in the warp, then the memory controller will need to request a whole line/segment from DRAM, which may be either 32 bytes or 128 bytes, depending on device and which caches are involved (i.e. what type of "miss" occurred) just to satisfy that one address from that one thread. Therefore regardless of whether that thread is requesting a minimum of 1 byte or up to the maximum of 16 bytes possible in a single thread read transaction, the memory controller must read either 32 bytes or 128 bytes from DRAM to satisfy the read originating from that thread. Similar logic will apply to every other address emanating from that particular "warp read".
This type of scattered or isolated access pattern is "uncoalesced", because no other thread in the warp needs an address close enough so that it can fulfill its needs from the same segment/line.
When a thread "asks" for a byte in position P and the other threads asks for something far away the GPU gets a complete block of 128 bytes for that thread?
Yes, either 32 bytes or 128 bytes is the minimum granularity of request that can be made from DRAM.
If the reading is aligned can I read the other 127 bytes for "free"?
Whether you need it or not, and regardless of alignment of requests within the line/segment, you will get either 32 bytes or 128 bytes from any DRAM read transaction.
This doesn't cover every case, but a general breakdown of the 32byte/128byte difference is as follows:
cc2.x devices have an enabled L1 cache, and so a cache "miss" will generally trigger a read of 128 bytes
cc3.x devices have only L2 cache enabled (for global memory transactions) and the L2 cacheline size is 32 bytes. A "miss" here will require a 32-byte load from DRAM, but a fully coalesced read across a warp will still ultimately require a load of 128 bytes (for int or float, for example) so ultimately four L2 cachelines will still be needed. (There is no free lunch.)
cc5.x devices once again have the L1 enabled, so should be back to needing a full 128 byte load on a "miss"
This presentation will be instructive. In particular, slide 17 shows one example of "perfect" coalescing, whereas slide 25 shows an example of a "fully uncoalesced" load.

Why are there alignment boundaries larger than 4?

What i don't understand, is why we have to align data in memory on boundaries larger than 4 bytes since all the other boundaries are multiples of 4. Assuming a CPU can read 4 bytes in a cycle, it will be basically no difference in performance if that data is 8 bytes large and is aligned on a 4 byte / 8 byte / 16 byte, etc.
When an x86 CPU reads a double, it reads 8 bytes in a cycle. When it reads an SSE vector, it reads 16 bytes. When it reads an AVX vector, it reads 32.
When the CPU fetches a cache line from memory, it also reads at least 32 bytes.
Your assumption that the CPU reads 4 bytes per cycle is false.
First: x86 CPUs don't read stuff in 4 bytes only, they can read 8 bytes in a cycle or even more with SIMD extensions.
But to answer your question "why are there alignment boundaries multiple than 4?", assuming a generic architecture (you didn't specify one and you wrote that x86 was just an example) I'll present a specific case: GPUs.
NVIDIA GPU memory can only be accessed (store/load) if the address is aligned on a multiple of the access size (PTX ISA ld/st). There are different kinds of loads and the most performant ones happen when the address is aligned to a multiple of the access size so if you're trying to load a double from memory (8 bytes) you would have (pseudocode):
ld.double [48dec] // Works, 8 bytes aligned
ld.double [17dec] // Fails, not 8 bytes aligned
in the above case when trying to access (r/w) memory that is not properly aligned the process will actually cause an error. If you want speed you'll have to provide some safety guarantees.
That might answer your question on why alignment boundaries larger than 4 exist in the first place. On such an architecture an access size of 1 is always safe (every address is aligned to 1). That isn't true for every n>1.

Virtual memory and alignment - how do they factor together?

I think I understand memory alignment, but what confuses me is that the address of a pointer on some systems is going to be in virtual memory, right? So most of the checking/ensuring of alignment I have seen seem to just use the pointer address. Is it not possible that the physical memory address will not be aligned? Isn't that problematic for things like SSE?
The physical address will be aligned because virtual memory only maps aligned pages to physical memory (and the pages are typically 4KB).
So unless you need alignment > page size, the physical memory will be aligned as per your requirements.
In the specific case of SSE, everything works fine because you only need 16 byte alignment.
I am not aware of any actual system in which an aligned virtual memory address can result in a misaligned physical memory address.
Typically, all alignments on a given platform will be powers of two. For example, on x86 32-bit integers have a natural alignment of 4 bytes (2^2). The page size - which defines how fine a block you can map in physical memory - is generally a large power of two. On x86, the smallest allowable page size is 4096 bytes (2^12). The largest datatype that might need alignment on x86 is 128 bits (for XMM registers and CMPXCHG16B) 32 bytes (for AVX) - 2^5. Since 2^12 is divisible by 2^5, you'll find that everything aligns right at the start of a page, and since pages are aligned both in virtual and physical memory, a virtual-aligned address will always be physical-aligned.
On a more practical level, allowing aligned virtual addresses to map to unaligned physical addresses not only would make it really hard to generate code, it would also make the CPU architecture more complex than simply allowing any alignment (since now we have odd-sized pages and other weirdness...)
Note that you may have reason to ask for larger alignments than a page from time to time. Typically, for user space coding, it doesn't matter if this is aligned in physical RAM (for that matter, if you're requesting multiple pages, it's unlikely to be even contiguous!). Problems here only arise if you're writing a device driver and need a large, aligned, contiguous block for DMA. But even then usually the device isn't a stickler about larger-than-page-size alignment.