How to use this macro to test if memory is aligned? - c++

I'm a SIMD beginner, and I've read this article about the topic (since I'm using an AVX2-compatible machine).
Now, I've read in this question to check if your pointer is aligned.
I'm testing it with this toy example main.cpp:
#include <iostream>
#include <immintrin.h>
#define is_aligned(POINTER, BYTE_COUNT) \
(((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)
int main()
{
    float a[8];
    for(int i=0; i<8; i++){
        a[i]=i;
    }
    __m256 evens = _mm256_set_ps(2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0);
    std::cout<<is_aligned(a, 16)<<" "<<is_aligned(&evens, 16)<<std::endl;
    std::cout<<is_aligned(a, 32)<<" "<<is_aligned(&evens, 32)<<std::endl;
}
And compile it with icpc -std=c++11 -o main main.cpp.
The resulting printing is:
1 1
1 1
However, if I add these 3 lines before the prints:
for(int i=0; i<8; i++)
    std::cout<<a[i]<<" ";
std::cout<<std::endl;
This is the result:
0 1 2 3 4 5 6 7
1 1
0 1
In particular, I don't understand that last 0. Why is it different from the previous output? What am I missing?

Your is_aligned (which is a macro, not a function) determines whether the object has been aligned to a particular boundary. It does not determine the alignment requirement of the type of the object.
For a float array, the compiler will guarantee that it is aligned to at least the alignment requirement of float - which is typically 4. 32 is not a factor of 4, so there is no guarantee that the array is aligned to a 32-byte boundary. However, there are many memory addresses that are divisible by both 4 and 32, so it is possible that an address on a 4-byte boundary also happens to be on a 32-byte boundary. This is what happened in your first test, but as explained, there is no guarantee that it would happen. In your latter test you added some code before the prints, and the array ended up at a different memory location. It so happened that that location wasn't on a 32-byte boundary.
To request a stricter alignment that may be required by SIMD instructions, you can use the alignas specifier:
alignas(32) float a[8];
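For illustration, here is a minimal sketch of the same test with alignas (the second, unannotated array is added here only for contrast; which value its 32-byte check prints depends on the stack layout):

#include <iostream>
#include <cstdint>

#define is_aligned(POINTER, BYTE_COUNT) \
    (((uintptr_t)(const void *)(POINTER)) % (BYTE_COUNT) == 0)

int main()
{
    alignas(32) float a[8];  // 32-byte alignment requested explicitly
    float b[8];              // only guaranteed alignof(float), typically 4

    std::cout << is_aligned(a, 32) << std::endl;  // always 1
    std::cout << is_aligned(b, 32) << std::endl;  // 0 or 1, depending on where b lands
}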

Related

Custom data size for memory alignment

Each datatype has a certain range, based on the hardware. For example, on a 32-bit machine an int has the range -2147483648 to 2147483647.
C++ compilers 'pad' object memory to fit into certain sizes. I'm pretty sure it's 2, 4, 8, 16, 32, 64 etc. This also probably depends on the machine.
I want to manually align my objects to meet padding requirements. Is there a way to:
Determine what machine a program is running on
Determine padding sizes
Set custom data-type based on bitsize
I've used bitsets before in Java, but I'm not familiar with C++. As for machine requirements, I know programs for different hardware are usually compiled differently in C++, so I'm wondering if it's even possible at all.
Example->
/* getHardwarePackSize obviously doesn't exist, just here to explain. What I'm trying to get
   here would be the minimum alignment size for the machine the program is running on */
#define PACK_SIZE getHardwarePackSize()
#define MONTHS 12

class date {
private:
    // Pseudo code that represents making a custom type
    customType monthType = MONTHS / PACK_SIZE;
    monthType.remainder = MONTHS % PACK_SIZE;
    monthType months = 12;
};
The idea is to be able to fit every variable into the minimum bit size and track how many bits are left over.
Theoretically, it would be possible to make use of every unused bit and improve memory efficiency. Obviously it would never work exactly like this, but the example is just to explain the concept.
This is a lot more complex than what you are trying to describe, as there are requirements for alignment on objects and items within objects. For example, if the compiler decides that an integer item is 16 bytes into a struct or class, it may well decide that "ah, I can use an aligned SSE instruction to load this data, because it is aligned at 16 bytes" (or something similar in ARM, PowerPC, etc). So if you do not satisfy AT LEAST that alignment in your code, you will cause the program to go wrong (crash or misread the data, depending on the architecture).
Typically, the alignment used and given by the compiler will be "right" for whatever architecture the compiler is targeting. Changing it will often lead to worse performance. Not always, of course, but you'd better know exactly what you are doing before you fiddle with it. And measure the performance before/after, and test thoroughly that nothing has been broken.
The padding is typically just to the next "minimum alignment for the largest type" - e.g. if a struct contains only an int and a couple of char variables, it will be padded to a multiple of 4 bytes [inside the struct and at the end, as required]. For double, padding to 8 bytes is done to ensure alignment, but three doubles will, typically, take up 8 * 3 = 24 bytes with no further padding.
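As a small illustration of that rule (the struct names are made up, and the exact numbers are implementation-defined, but these are the typical values on common 64-bit ABIs):

#include <iostream>

struct S1 {
    int  i;   // 4 bytes, alignment 4
    char c1;  // 1 byte
    char c2;  // 1 byte
};            // 6 bytes of data, padded to a multiple of 4 -> sizeof is typically 8

struct S2 {
    double d[3];  // 3 * 8 = 24 bytes, alignment 8, no further padding needed
};

int main()
{
    std::cout << sizeof(S1) << " " << alignof(S1) << "\n";  // typically "8 4"
    std::cout << sizeof(S2) << " " << alignof(S2) << "\n";  // typically "24 8"
}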
Also, determining what hardware you are executing on (or will execute on) is probably better done during compilation, than during runtime. At runtime, your code will have been generated, and the code is already loaded. You can't really change the offsets and alignments of things at this point.
If you are using the gcc or clang compilers, you can use __attribute__((aligned(n))), e.g. int x[4] __attribute__((aligned(32))); would create a 16-byte array that is aligned to 32 bytes. This can be done inside structures or classes as well as for any variable you are using. But it is a compile-time option; it cannot be changed at runtime.
It is also possible, in C++11 onwards, to find out the alignment of a type or variable with alignof.
Note that it gives the alignment required for the type, so if you do something daft like:
int x;
char buf[4 * sizeof(int)];
int *p = (int *)(buf + 7);
std::cout << alignof(*p) << std::endl;
the code will print 4 (the alignment requirement of int), even though buf + 7 is almost certainly not 4-byte aligned (it is 3 bytes past a 4-byte boundary if buf itself is 4-byte aligned).
Types cannot be chosen at runtime. C++ is a statically typed language: the type of something is determined at compile time - sure, objects of classes that derive from a base class can be created at runtime, but any given object has ONE TYPE, always and forever, until it is no longer allocated.
It is better to make such choices at compile time, as it makes the code much more straightforward for the compiler and allows better optimisation than if the choices are made at runtime, since you would then need a runtime decision to use branch A or branch B of some piece of code.
As an example of aligned vs. unaligned access:
#include <cstdio>
#include <cstdlib>
#include <vector>

#define LOOP_COUNT 1000

unsigned long long rdtscl(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

struct A
{
    long a;
    long b;
    long d;
    char c;
};

struct B
{
    long a;
    long b;
    long d;
    char c;
} __attribute__((packed));

std::vector<A> arr1(LOOP_COUNT);
std::vector<B> arr2(LOOP_COUNT);

int main()
{
    for (int i = 0; i < LOOP_COUNT; i++)
    {
        arr1[i].a = arr2[i].a = rand();
        arr1[i].b = arr2[i].b = rand();
        arr1[i].c = arr2[i].c = rand();
        arr1[i].d = arr2[i].d = rand();
    }

    printf("align A %zd, size %zd\n", alignof(A), sizeof(A));
    printf("align B %zd, size %zd\n", alignof(B), sizeof(B));

    for (int loops = 0; loops < 10; loops++)
    {
        printf("Run %d\n", loops);
        size_t sum = 0;
        size_t sum2 = 0;

        unsigned long long before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum += arr1[i].a + arr1[i].b + arr1[i].c + arr1[i].d;
        unsigned long long after = rdtscl();
        printf("ARR1 %lld sum=%zd\n", (after - before), sum);

        before = rdtscl();
        for (int i = 0; i < LOOP_COUNT; i++)
            sum2 += arr2[i].a + arr2[i].b + arr2[i].c + arr2[i].d;
        after = rdtscl();
        printf("ARR2 %lld sum=%zd\n", (after - before), sum2);
    }
}
[Part of that code is taken from another project, so it's perhaps not the neatest C++ code ever written, but it saved me writing, from scratch, code that isn't relevant to the question.]
Then the results:
$ ./a.out
align A 8, size 32
align B 1, size 25
Run 0
ARR1 5091 sum=3218410893518
ARR2 5051 sum=3218410893518
Run 1
ARR1 3922 sum=3218410893518
ARR2 4258 sum=3218410893518
Run 2
ARR1 3898 sum=3218410893518
ARR2 4241 sum=3218410893518
Run 3
ARR1 3876 sum=3218410893518
ARR2 4184 sum=3218410893518
Run 4
ARR1 3875 sum=3218410893518
ARR2 4191 sum=3218410893518
Run 5
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
Run 6
ARR1 3875 sum=3218410893518
ARR2 4189 sum=3218410893518
Run 7
ARR1 3925 sum=3218410893518
ARR2 4229 sum=3218410893518
Run 8
ARR1 3884 sum=3218410893518
ARR2 4210 sum=3218410893518
Run 9
ARR1 3876 sum=3218410893518
ARR2 4186 sum=3218410893518
As you can see, the aligned code, using arr1, takes around 3900 clock cycles, and the one using arr2 takes around 4200 cycles. So that's 300 cycles out of roughly 4000, some 7.5% if my mental arithmetic works correctly.
Of course, like so many other things, it really depends on the exact situation: how the objects are used, what the cache size is, exactly what processor it is, and how much other code and data around it is also using cache space. The only way to be certain is to experiment with YOUR code.
[I ran the code several times, and although I didn't always get the same results, I always got similar proportional results]

How to quickly replicate a 6-byte unsigned integer into a memory region?

I need to replicate a 6-byte integer value into a memory region, starting with its beginning and as quickly as possible. If such an operation is supported in hardware, I'd like to use it (I'm on x64 processors now, compiler is GCC 4.6.3).
The memset doesn't suit the job, because it can only replicate single bytes. std::fill isn't good either, because I can't even define an iterator that jumps between 6-byte-wide positions in the memory region.
So, I'd like to have a function:
void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num)
This looks like memset, but there is an additional argument width defining how many bytes from value to replicate. If something like that could be expressed in C++, that would be even better.
I already know about the obvious myMemset implementation, which would call memcpy in a loop with the last argument (bytes to copy) equal to width. Also, I know that I can define a temporary memory region of 6 * 8 = 48 bytes, fill it with 6-byte integers, and then memcpy it to the destination area.
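For reference, a sketch of that obvious implementation (my own interpretation of it; the name myMemsetNaive is made up, and it assumes a little-endian machine so that the first width bytes of value are its low bytes):

#include <cstring>
#include <cstdint>
#include <cstddef>

// One memcpy of `width` bytes per element - simple, but not fast.
void myMemsetNaive(void *ptr, uint64_t value, uint8_t width, size_t num)
{
    char *p = static_cast<char *>(ptr);
    for (size_t i = 0; i < num; i++) {
        std::memcpy(p, &value, width);  // low `width` bytes of value, on little-endian
        p += width;
    }
}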
Can we do better?
Something along the lines of @Mark Ransom's comment:
Copy 6 bytes, then 6, 12, 24, 48, 96, etc.
#include <string.h>
#include <stddef.h>

void memcpy6(void *dest, const void *src, size_t n /* number of 6 byte blocks */) {
    if (n-- == 0) {
        return;
    }
    memcpy(dest, src, 6);
    size_t width = 1;
    while (n >= width) {
        memcpy(&((char *) dest)[width * 6], dest, width * 6);
        n -= width;
        width <<= 1; // double width
    }
    if (n > 0) {
        memcpy(&((char *) dest)[width * 6], dest, n * 6);
    }
}
Optimization: scale n and width by 6.
[Edit]
Corrected destination (thanks @SchighSchagh).
Added the (char *) cast.
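A usage sketch (the pattern bytes and buffer size here are made up for illustration; memcpy6 is the function above):

#include <stdio.h>
#include <stddef.h>

void memcpy6(void *dest, const void *src, size_t n);  // as defined above

int main(void)
{
    const unsigned char pattern[6] = {0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF};
    unsigned char buf[6 * 100];

    memcpy6(buf, pattern, 100);  // replicate the 6-byte pattern 100 times

    printf("%02X %02X\n", buf[6], buf[599]);  // expected: AA FF
    return 0;
}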
Determine the most efficient write size that the CPU supports; then find the smallest number that can be evenly divided by both 6 and that write size and call that "block size".
Now split the memory region up into blocks of that size. Each block will be identical and all writes will be correctly aligned (assuming the memory region itself is correctly aligned).
For example, if the most efficient write size that the CPU supports is 4 bytes (e.g. ancient 80486) then the "size of block" would be 12 bytes. You'd set 3 general purpose registers and do 3 stores per block.
For another example, if the most efficient write size that the CPU supports is 16 bytes (e.g. SSE) then the "size of block" would be 48 bytes. You'd set 3 SSE registers and do 3 stores per block.
Also, I'd recommend rounding the size of the memory region up to ensure it is a multiple of the block size (with some "not strictly necessary" padding). A few unnecessary writes are less expensive than code to fill a "partial block".
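A sketch of the SSE variant of that idea (the function name is mine; it assumes, as recommended above, that the destination is 16-byte aligned and that its size has been rounded up to a multiple of the 48-byte block):

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstring>
#include <cstddef>

// Fill `bytes` bytes at `dst` with a repeating 6-byte pattern,
// one 48-byte block (3 * 16) at a time: three aligned stores per block.
void fill6_sse(void *dst, const unsigned char pattern[6], size_t bytes)
{
    alignas(16) unsigned char block[48];
    for (size_t i = 0; i < 48; i += 6)
        std::memcpy(block + i, pattern, 6);

    __m128i r0 = _mm_load_si128((const __m128i *)(block + 0));
    __m128i r1 = _mm_load_si128((const __m128i *)(block + 16));
    __m128i r2 = _mm_load_si128((const __m128i *)(block + 32));

    unsigned char *p = (unsigned char *)dst;
    for (size_t off = 0; off + 48 <= bytes; off += 48) {
        _mm_store_si128((__m128i *)(p + off + 0),  r0);
        _mm_store_si128((__m128i *)(p + off + 16), r1);
        _mm_store_si128((__m128i *)(p + off + 32), r2);
    }
}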
The second most efficient method might be to use a memory copy (but not memcpy() or memmove()). In this case you'd write the initial 6 bytes (or 12 bytes, or 48 bytes, or whatever), then copy from (e.g.) &area[0] to &area[6], working from lowest to highest, until you reach the end. memmove() will not work for this because it will notice that the areas overlap and copy from highest to lowest instead; and memcpy() will not work because it assumes the source and destination do not overlap; so you'd have to write your own memory copy to suit (a sketch follows). The main problem with this is that you double the number of memory accesses - "reading and writing" is slower than "writing alone".
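A sketch of that self-overlapping, lowest-to-highest copy (the function name is mine; a plain byte loop is used for clarity):

#include <cstring>
#include <cstddef>

// Write the 6-byte pattern once, then copy forward from 6 bytes behind the
// write position, so the pattern propagates through the whole buffer.
// memcpy()/memmove() cannot be used here: the regions overlap on purpose and
// we specifically need lowest-to-highest copy order.
void fill6_overlap(unsigned char *area, const unsigned char pattern[6], size_t bytes)
{
    if (bytes < 6) {
        std::memcpy(area, pattern, bytes);
        return;
    }
    std::memcpy(area, pattern, 6);
    for (size_t i = 6; i < bytes; i++)
        area[i] = area[i - 6];  // forward copy, overlapping by design
}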
If your num is large enough, you can try using the AVX vector instructions that handle 32 bytes at a time (_mm256_load_si256/_mm256_store_si256 or their unaligned variants).
As 32 is not a multiple of 6, you will first have to replicate the 6-byte pattern 16 times (96 bytes, the least common multiple of 6 and 32), using short memcpy's or 32/64-bit moves:
ABCDEF
ABCDEF|ABCDEF
ABCD EFAB CDEF|ABCD EFAB CDEF
ABCDEFAB CDEFABCD EFABCDEF|ABCDEFAB CDEFABCD EFABCDEF
ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF|ABCDEFABCDEFABCD EFABCDEFABCDEFAB CDEFABCDEFABCDEF
You will also finish with a short memcpy.
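A sketch of that approach (the function name and the exact doubling loop are mine; note that _mm256_loadu_si256/_mm256_storeu_si256 are AVX, not AVX2, instructions):

#include <immintrin.h>
#include <cstring>
#include <cstddef>

// Build a 96-byte block (lcm of 6 and 32) by doubling, then write it out with
// three 32-byte unaligned AVX stores per block; finish with a short memcpy.
void fill6_avx(unsigned char *dst, const unsigned char pattern[6], size_t bytes)
{
    unsigned char block[96];
    std::memcpy(block, pattern, 6);
    for (size_t have = 6; have < 96; have *= 2)  // 6 -> 12 -> 24 -> 48 -> 96 bytes built
        std::memcpy(block + have, block, have);

    __m256i r0 = _mm256_loadu_si256((const __m256i *)(block + 0));
    __m256i r1 = _mm256_loadu_si256((const __m256i *)(block + 32));
    __m256i r2 = _mm256_loadu_si256((const __m256i *)(block + 64));

    size_t off = 0;
    for (; off + 96 <= bytes; off += 96) {
        _mm256_storeu_si256((__m256i *)(dst + off + 0),  r0);
        _mm256_storeu_si256((__m256i *)(dst + off + 32), r1);
        _mm256_storeu_si256((__m256i *)(dst + off + 64), r2);
    }
    std::memcpy(dst + off, block, bytes - off);  // the final short memcpy
}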
Try the __movsq intrinsic (x64 only; in assembly, rep movsq) that will move 8 bytes at a time, with a suitable repetition factor, and setting the destination address 6 bytes after the source. Check that overlapping addresses are handled smartly.
Write 8 bytes at a time.
On a 64-bit machine, the generated code can certainly operate well with 8-byte writes. After dealing with some set-up issues, the tight loop writes 8 bytes per iteration, about num times. Assumptions apply - see the code.
#include <assert.h>
#include <stdint.h>
#include <string.h>

// assume little endian
void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num) {
    assert(width > 0 && width <= 8);
    uint64_t *ptr64 = (uint64_t *) ptr;
    // # to stop early to prevent writing past array end
    static const unsigned stop_early[8 + 1] = { 0, 8, 3, 2, 1, 1, 1, 1, 0 };
    size_t se = stop_early[width];
    if (num > se) {
        num -= se;
        // assume no bus-fault with a 64-bit write at `ptr64, ptr64+1, ... ptr64+7`
        while (num > 0) { // tight loop
            num--;
            *ptr64 = value;
            ptr64 = (uint64_t *) ((char *) ptr64 + width);
        }
        ptr = ptr64;
        num = se;
    }
    // Cope with the last few writes
    while (num-- > 0) {
        memcpy(ptr, &value, width);
        ptr = (char *) ptr + width;
    }
}
Further optimization could include writing 2 blocks at a time when width == 3 or 4, 4 blocks at a time when width == 2, and 8 blocks at a time when width == 1.
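A hypothetical usage sketch (assuming little-endian, as above; only the low 6 bytes of value are used):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

void myMemset(void* ptr, uint64_t value, uint8_t width, size_t num);  // as defined above

int main(void)
{
    unsigned char buf[6 * 1000];
    uint64_t value = 0x0000CCBBAA998877ULL;  // low 6 bytes: 77 88 99 AA BB CC

    myMemset(buf, value, 6, 1000);  // replicate those 6 bytes 1000 times

    printf("%02X %02X %02X\n", buf[0], buf[5], buf[6]);  // expected: 77 CC 77
    return 0;
}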

AVX alignment in array

I'm using MSVC12 (Visual Studio 2013 Express) and I'm trying to implement a fast multiplication of 8*8 float values. The problem is the alignment: the vector actually has 9*n values, but I always need just the first 8 of each set, so e.g. for n=0 the 32-byte alignment is guaranteed (when I use _mm_malloc), but for n=1 the "first" value is aligned at 4*9 = 36 bytes.
for(unsigned i = 0; i < n; i++) {
    float *coeff_set = (float *)_mm_malloc(909 * 100 * sizeof(float), 32);
    // this works for n=0, not n=1, n=2, ...
    __m256 coefficients = _mm256_load_ps(&coeff_set[9 * i]);
    __m256 result = _mm256_mul_ps(coefficients, coefficients);
    ...
}
Is there any way to solve this? I would like to keep the structure of my data, but if that's not possible, I would change it. One solution I found was to copy the 8 floats into an aligned array first and then load that, but the performance loss was way too high.
You have two choices:
Pad each set of coefficients to 16 values to maintain alignment
Use the _mm256_loadu_ps intrinsic for unaligned accesses
The first choice is more speed-efficient, while the second is more space-efficient.
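A sketch of the second option, applied to the loop from the question (the function name and the results parameter are mine, since the original "..." hides what is done with result):

#include <immintrin.h>

// Same loop as in the question, but with the unaligned load intrinsic, so the
// 9-float stride no longer has to land on a 32-byte boundary.
void square_coefficients(const float *coeff_set, __m256 *results, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        __m256 coefficients = _mm256_loadu_ps(&coeff_set[9 * i]);  // unaligned load
        results[i] = _mm256_mul_ps(coefficients, coefficients);
    }
}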

Is there a way to simulate integer bitwise operations for _m256 types on AVX?

I have a boolean expression that I have managed to implement in SSE2. Now I would like to try implementing it in AVX, exploiting an additional factor of 2 in parallelism (going from 128-bit to 256-bit SIMD). However, AVX does not support integer operations (AVX2 does, but I am working on a Sandy Bridge processor, so it is not an option currently). Since there are AVX intrinsics for bitwise operations, I figured I could give it a try by just converting my integer types to float types and seeing if it works.
First test was a success:
__m256 ones = _mm256_set_ps(1,1,1,1,1,1,1,1);
__m256 twos = _mm256_set_ps(2,2,2,2,2,2,2,2);
__m256 result = _mm256_and_ps(ones, twos);
I'm getting all 0's, as I'm supposed to. Similarly, AND'ing the twos instead, I get a result of 2. But when trying 11 XOR 4 accordingly:
__m256 elevens = _mm256_set_ps(11,11,11,11,11,11,11,11);
__m256 fours = _mm256_set_ps(4,4,4,4,4,4,4,4);
__m256 result2 = _mm256_xor_ps(elevens, fours);
The result is 6.46e-46 (i.e. close to 0) and not 15. Similarly, doing 11 OR 4 gives me a value of 22, and not 15 as it should be. I don't understand why this is. Is it a bug or some configuration I am missing?
I was actually expecting my hypothesis of working with floats as if they were integers not to work, since an integer converted to a float might not be exactly the same value but only a close approximation. But even then, I am surprised by the results I get.
Does anyone have a solution to this problem, or must I upgrade my CPU to get AVX2 support to enable this?
The first test worked by accident.
1 as a float is 0x3f800000, 2 is 0x40000000. In general, it wouldn't work that way.
But you can absolutely do it, you just have to make sure that you're working with the right bit pattern. Don't convert your integers to floats - reinterpret-cast them. That corresponds to intrinsics such as _mm256_castsi256_ps, or to storing your ints to memory and reading them as floats (that won't change them; in general only math operations care about what the floats mean, the rest work with the raw bit patterns - check the list of exceptions an instruction can raise to make sure).
You don't need AVX2 to use the AVX integer load and store operations: see the Intel Intrinsics Guide. So you can load your integers using AVX, reinterpret-cast to float, use float bitwise operations, and then reinterpret-cast back to int. The reinterpret-casts don't generate any instructions, they just make the compiler happy. Try this:
//compiled and ran on an Ivy Bridge system with AVX but without AVX2
#include <stdio.h>
#include <immintrin.h>

int main() {
    int a[8] = {0, 2, 4, 6, 8, 10, 12, 14};
    int b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    int c[8];

    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);
    __m256i c8 = _mm256_castps_si256(
        _mm256_or_ps(_mm256_castsi256_ps(a8), _mm256_castsi256_ps(b8)));
    _mm256_storeu_si256((__m256i*)c, c8);

    for(int i=0; i<8; i++) printf("%d ", c[i]);
    printf("\n");
    //output: 1 3 5 7 9 11 13 15
}
Of course, as Mystical pointed out, this might not be worth doing, but that does not mean you can't do it.

Autovectorization alignment

From Intel's Compiler Autovectorization Guide there's an example related to alignment that I don't understand. The code is
double a[N], b[N];
...
for(i = 0; i < N; i++)
    a[i+1] = b[i] * 3;
And it says
If the first element of both arrays is aligned at a 16-byte boundary,
then either an unaligned load of elements from b or an unaligned
store of elements into a, has to be used after vectorization.
However, the programmer can enforce the alignment shown below, which
will result in two aligned access patterns after vectorization
(assuming an 8-byte size for doubles)
__declspec(align(16, 8)) double a[N];
__declspec(align(16, 0)) double b[N];
How can I see where the misalignment comes from after vectorization? Wouldn't the alignment depend on the size of the arrays?
Hans Passant essentially covers all the right ideas, but let me explain a bit more:
Say a and b are both aligned to 16 bytes; say they have addresses 0x100 and 0x300, for the sake of example.
Now, let's see what the code does with i=3 (odd) and i=6 (even)...
a[i+1] = b[i] * 3; will do [0x120] = [0x318] * 3 (i=3, sizeof double is 8)
or
a[i+1] = b[i] * 3; will do [0x138] = [0x330] * 3
In both cases, either the left-hand side or the right-hand side is aligned while the other one is misaligned (here, 16-byte-aligned accesses always end in 0 in hex, misaligned ones in something else).
Now... let's purposefully misalign a to an address that is 8 modulo 16 (say 0x108, to keep our example).
Let's see what the code does with i=3 (odd) and i=6 (even)...
a[i+1] = b[i] * 3; will do [0x128] = [0x318] * 3 (i=3, sizeof double is 8)
or
a[i+1] = b[i] * 3; will do [0x140] = [0x330] * 3
Now both accesses in the same iteration have the same alignment: for odd i they are both 8 bytes past a 16-byte boundary, and for even i they are both on a 16-byte boundary. That is what lets the vectorizer, which handles two doubles per iteration, use aligned access patterns for both the load from b and the store to a.
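A toy sketch to see this parity in code (the actual addresses will differ from the 0x100/0x300 in the example above, but the pattern of the remainders mod 16 is the same):

#include <cstdio>
#include <cstdint>

int main()
{
    alignas(16) double storage[18];
    double *a_aligned = storage;      // 16-byte aligned, like a at 0x100
    double *a_offset  = storage + 1;  // 8 modulo 16, like a at 0x108
    alignas(16) double b[16];         // 16-byte aligned, like b at 0x300

    for (int i = 3; i <= 6; i += 3) { // i = 3 (odd) and i = 6 (even)
        printf("i=%d  aligned a[i+1]: %u  offset a[i+1]: %u  b[i]: %u  (addresses mod 16)\n",
               i,
               (unsigned)((uintptr_t)&a_aligned[i + 1] % 16),
               (unsigned)((uintptr_t)&a_offset[i + 1] % 16),
               (unsigned)((uintptr_t)&b[i] % 16));
    }
    // With a aligned, a[i+1] and b[i] have opposite remainders (0 vs 8) in every
    // iteration; with a offset by 8, they always match, so the vectorized loop
    // can use aligned accesses for both the load and the store.
}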