Autovectorization alignment - c++

From Intel's Compiler Autovectorization Guide there's an example related to alignment that I don't understand. The code is
double a[N], b[N];
...
for(i = 0; i < N; i++)
a[i+1] = b[i] * 3;
And it says
If the first element of both arrays is aligned at a 16-byte boundary,
then either an unaligned load of elements from b or an unaligned
store of elements into a, has to be used after vectorization.
However, the programmer can enforce the alignment shown below, which
will result in two aligned access patterns after vectorization
(assuming an 8-byte size for doubles)
_declspec(align(16, 8)) double a[N];
_declspec(align(16, 0)) double b[N];
How can I see where the misalignment comes from after vectorization? Wouldn't the alignment depend on the size of the arrays?

Hans Passant essentially covers all the right ideas, but let me explain a bit more:
Say a and b are both aligned to 16 bytes; for the sake of example, say they have addresses 0x100 and 0x300.
Now, let's see what the code looks like with i=3 (odd) and i=6 (even)...
a[i+1] = b[i] * 3; will do [0x120] = [0x318] * 3 (i=3, sizeof double is 8)
or
a[i+1] = b[i] * 3; will do [0x138] = [0x330] * 3 (i=6)
In both cases, either the left hand side or the right hand side is aligned, while the other one is misaligned (aligned accesses would always end in 0 in hex, misaligned something else).
Now... Let's purposefully misalign a to an address that is 8 modulo 16 (say 0x108, to keep our example).
Let's see what the code looks like with i=3 (odd) and i=6 (even)...
a[i+1] = b[i] * 3; will do [0x128] = [0x318] * 3 (i=3, sizeof double is 8)
or
a[i+1] = b[i] * 3; will do [0x140] = [0x330] * 3 (i=6)
In both cases the load and the store are now either aligned together (i=6) or misaligned together (i=3). A vectorized loop that handles two doubles per iteration starting at an even i can therefore use aligned accesses for both a and b.
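The address arithmetic above can be checked mechanically. The following is only a sketch: the helper names offset_in_vector and same_phase are made up for illustration, and the base addresses are the example values from the answer (0x100 or 0x108 for a, and 0x300 for b, which is what the worked addresses 0x318 and 0x330 imply).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

static_assert(sizeof(double) == 8, "the example assumes 8-byte doubles");

// Offset (0 or 8) of element `index` from the previous 16-byte boundary,
// for an array of doubles that starts at address `base`.
inline std::uintptr_t offset_in_vector(std::uintptr_t base, std::size_t index) {
    assert(base % sizeof(double) == 0);  // doubles are at least 8-byte aligned
    return (base + index * sizeof(double)) % 16;
}

// True when the store to a[i+1] and the load from b[i] sit at the same
// offset within a 16-byte vector, so both can be aligned at the same time.
inline bool same_phase(std::uintptr_t a_base, std::uintptr_t b_base, std::size_t i) {
    return offset_in_vector(a_base, i + 1) == offset_in_vector(b_base, i);
}
```

With a at 0x100 the two accesses never share an offset, whatever i is; with a at 0x108 they always do. The array size does not enter into it, only the base addresses and the index shift of one element.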

Related

Optimal Manipulation of Long Bitwise Structures [duplicate]

Is there a better (faster/more efficient) way to perform a bitwise operation on a large memory block than using a for loop? After looking into the options I noticed that the standard library has std::bitset, and was also wondering if it would be better (or even possible) to convert a large region of memory into a bitset without changing its values, then perform the operations, and then switch its type back to normal?
Edit / update: I think union might apply here, such that the memory block is allocated a new array of int or something and then manipulated as a large bitset. Operations seem to be able to be done over the entire set based on what is said here: http://www.cplusplus.com/reference/bitset/bitset/operators/ .
In general, there is no magical way faster than a for loop. However, you can make it easier for the compiler to optimize the loop by keeping a few things in mind:
1. Load from memory using the largest available integer type. However, you need to be careful if your buffer has a length which does not divide evenly by the size of that integer type.
2. If possible, operate on multiple values in one loop iteration - this should make vectorization much simpler for the compiler. Again, you need to be careful about the buffer length.
3. If the loop is to be run many times on short buffers, use a loop index that counts downwards to zero rather than upwards, and subtract it from the array length - this can make it easier for the CPU's branch predictor to figure out what's going on.
4. You can use explicit vector extensions provided by the compiler, but this will make your code less portable.
5. Ultimately, you can write the loop in assembly and use vector instructions provided by your CPU, but this is completely unportable.
[edit] Additionally, you can use OpenMP or a similar API to divide the loop between multiple threads, but this will only cause an improvement if you are performing the operation on a very large amount of memory.
C99 example of XORing memory with a constant byte, using 64-bit words, assuming the start of the buffer is aligned to 8 bytes, and without considering point 3. Bitwise operations on two memory buffers are very similar.
#include <stddef.h>
#include <stdint.h>

size_t len = ...;
char *buffer = ...;
size_t const loads_per_i = 4;
size_t words = len / sizeof(uint64_t);
size_t iters = words / loads_per_i;
uint64_t *ptr = (uint64_t *) buffer;
uint64_t xorvalue = 0x5e5e5e5e5e5e5e5eULL;
// run in multiple threads if there are more than 2 MB to xor
#pragma omp parallel for if(iters > 65536)
for (size_t i = 0; i < iters; ++i) {
    size_t j = loads_per_i*i;
    ptr[j  ] ^= xorvalue;
    ptr[j+1] ^= xorvalue;
    ptr[j+2] ^= xorvalue;
    ptr[j+3] ^= xorvalue;
}
// finish the words which don't fill a group of 4
for (size_t i = iters * loads_per_i; i < words; ++i) {
    ptr[i] ^= xorvalue;
}
// finish the trailing bytes which don't fill a whole word
for (size_t i = words * sizeof(uint64_t); i < len; ++i) {
    buffer[i] ^= (char) xorvalue;
}
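For comparison, here is a sketch of the same word-at-a-time idea written as a self-contained C++ function. It swaps the pointer cast for memcpy, which removes the alignment assumption; the function name xor_buffer is hypothetical, and whether the compiler vectorizes the loop depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// XOR every byte of buf[0..len) with `value`, eight bytes at a time.
// memcpy performs the word loads and stores, so the buffer needs no
// particular alignment; compilers usually lower these copies to plain moves.
inline void xor_buffer(unsigned char *buf, std::size_t len, unsigned char value) {
    assert(buf != nullptr || len == 0);
    // Replicate the byte into all eight byte lanes of a 64-bit word.
    const std::uint64_t pattern = 0x0101010101010101ULL * value;
    std::size_t i = 0;
    for (; i + sizeof(std::uint64_t) <= len; i += sizeof(std::uint64_t)) {
        std::uint64_t word;
        std::memcpy(&word, buf + i, sizeof word);
        word ^= pattern;
        std::memcpy(buf + i, &word, sizeof word);
    }
    // Handle the trailing bytes that do not fill a whole word.
    for (; i < len; ++i) {
        buf[i] ^= value;
    }
}
```

The tail loop plays the same role as the last loop in the C99 example: it covers buffer lengths that are not a multiple of the word size.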

NEON emulation of VNNI instructions

There are new AVX-512 VNNI instructions in Intel's Cascade Lake CPUs which can accelerate inference of neural networks on the CPU.
I integrated them into Simd Library to accelerate Synet (my small framework for inference of neural networks) and obtained a significant performance boost.
In fact I used only one instruction, _mm512_dpbusd_epi32 (vpdpbusd), which multiplies unsigned 8-bit integers by signed 8-bit integers and accumulates the products into 32-bit integer accumulators.
It would be great to perform analogous optimizations for NEON (ARM platform).
So here is the question:
Is there a NEON instruction analogous to vpdpbusd? If there is no such instruction, what is the best way to emulate it?
There is a scalar implementation below (to show what the function must do):
inline void pdpbusd(int32x4_t& sum, uint8x16_t input, int8x16_t weight)
{
for (size_t i = 0; i < 4; ++i)
for (size_t j = 0; j < 4; ++j)
sum[i] += int32_t(input[i * 4 + j]) * int32_t(weight[i * 4 + j]);
}
The most straightforward implementation of that requires 3 instructions: vmovl.s8 and vmovl.u8 to extend the signed and unsigned 8-bit values to 16 bits, followed by vmlal.s16 to do a signed lengthening 16-bit multiplication, accumulated into a 32-bit register. And as vmlal.s16 only handles 4 elements, you'd need a second vmlal.s16 to multiply and accumulate the following 4 elements - so 4 instructions for 8 elements.
For aarch64 syntax, the corresponding instructions are sxtl, uxtl and smlal.
Edit:
If the output elements should be aggregated horizontally, one can't use the fused multiply-accumulate instructions vmlal. Then the solution would be vmovl.s8 and vmovl.u8, followed by vmul.i16 (for 8 input elements), vpaddl.s16 (to aggregate two elements horizontally), followed by another vpadd.i32 to get the sum of 4 elements horizontally. So 5 instructions for 8 input elements, or 10 instructions for a full 128 bit vector, followed by one final vadd.s32 to accumulate the final result to the accumulator. On AArch64, the equivalent of vpadd.i32, addp, can handle 128 bit vectors, so it's one instruction less there.
If you're using intrinsics, the implementation could look something like this:
int32x4_t vpdpbusd(int32x4_t sum, uint8x16_t input, int8x16_t weight) {
int16x8_t i1 = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(input)));
int16x8_t i2 = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(input)));
int16x8_t w1 = vmovl_s8(vget_low_s8(weight));
int16x8_t w2 = vmovl_s8(vget_high_s8(weight));
int16x8_t p1 = vmulq_s16(i1, w1);
int16x8_t p2 = vmulq_s16(i2, w2);
int32x4_t s1 = vpaddlq_s16(p1);
int32x4_t s2 = vpaddlq_s16(p2);
#if defined(__aarch64__)
int32x4_t s3 = vpaddq_s32(s1, s2);
#else
int32x4_t s3 = vcombine_s32(
vpadd_s32(vget_low_s32(s1), vget_high_s32(s1)),
vpadd_s32(vget_low_s32(s2), vget_high_s32(s2))
);
#endif
sum = vaddq_s32(sum, s3);
return sum;
}
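When testing an intrinsic version like the one above, a portable scalar model is handy as a reference. The sketch below mirrors the scalar loop from the question but uses std::array instead of the NEON vector types, so it builds on any platform; the name dpbusd_reference is made up for illustration, and the overflow check reflects an assumption that the accumulators stay within the int32_t range.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Scalar model of the dot-product step: each of the four 32-bit lanes
// accumulates four products of an unsigned 8-bit input and a signed
// 8-bit weight.
inline std::array<std::int32_t, 4> dpbusd_reference(
        std::array<std::int32_t, 4> sum,
        const std::array<std::uint8_t, 16> &input,
        const std::array<std::int8_t, 16> &weight) {
    for (std::size_t i = 0; i < 4; ++i) {
        // Four products of at most 255 * 128 in magnitude fit easily in 64 bits.
        std::int64_t wide = sum[i];
        for (std::size_t j = 0; j < 4; ++j) {
            wide += std::int64_t(input[i * 4 + j]) * std::int64_t(weight[i * 4 + j]);
        }
        // Assumption of this sketch: the accumulator never leaves the int32_t range.
        assert(wide >= std::numeric_limits<std::int32_t>::min());
        assert(wide <= std::numeric_limits<std::int32_t>::max());
        sum[i] = static_cast<std::int32_t>(wide);
    }
    return sum;
}
```

Each individual product lies between 255 * -128 = -32640 and 255 * 127 = 32385, which is why the 16-bit vmulq_s16 in the intrinsic version cannot overflow.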

What's the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32] all of whose elements have value either 0 or 1. I want to set the single variable unsigned int b so that ((b >> i) & 1) == a[i] will hold for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much I guess).
What's the fastest way to do this in C?
The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.
I fear SIMD intrinsics are not really my thing, but something along these lines ought to work if you've got an AVX2-equipped processor:
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>
uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    // set the top bit of every byte that holds a 1, then gather the top bits
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return _mm256_movemask_epi8(tmp);
}
This assumes sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
If sizeof(bool) == 1 then you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) with a single multiplication, which pays off on a machine with a fast multiplier:
inline uint32_t pack8b(bool* a)
{
    uint64_t t = *((uint64_t*)a);
    return (0x8040201008040201ULL*t >> 56) & 0xFF;
}
uint32_t pack32b(bool* a)
{
    // unsigned result, so that shifting left by 24 cannot overflow a signed int
    return (pack8b(a + 0) << 24) | (pack8b(a + 8) << 16) |
           (pack8b(a + 16) << 8) | (pack8b(a + 24) << 0);
}
Explanation:
Suppose the bools a[0] to a[7] have their least significant bits named a-h respectively. Loading those 8 consecutive bools as one 64-bit word on a little-endian machine puts a[0] in the lowest byte, so the bits appear in reversed order when the word is written out. Now we do a multiplication (here dots are zero bits):
  |  a7  ||  a6  ||  a5  ||  a4  ||  a3  ||  a2  ||  a1  ||  a0  |
  .......h.......g.......f.......e.......d.......c.......b.......a
× 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
  ↑....f...↑...e....↑..d.....↑.c......↑b.......a
+ ↑...e....↑..d.....↑.c......↑b.......a
  ↑..d.....↑.c......↑b.......a
  ↑.c......↑b.......a
  ↑b.......a
  a
  ────────────────────────────────────────────────────────────────
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The arrows are added so it's easier to see the position of the set bits in the magic number. At this point the 8 least significant bits have been moved into the top byte, with a[0] in the most significant position; we just need to shift them down and mask the remaining bits out.
So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we get the above code.
Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, for example by shifting only once instead of shifting right by 56 bits each time.
Sorry, I overlooked the question, saw doynax's bool array, misread "32 0/1 values" and thought they were 32 bools. Of course the same technique can also be used to pack multiple uint32_t or uint16_t values (or another distribution of bits) at the same time, but it's a lot less efficient than packing bytes.
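The multiplication trick puts a[0] in the most significant bit of each byte, whereas the question asks for bit i of the result to equal a[i]. A sketch of a variant with that ordering follows. It is not from the answer: it uses a different multiplier, 0x0102040810204080, which moves bit 8*i of the loaded word to bit 56+i, it assumes a little-endian machine, and the function names are hypothetical. memcpy replaces the pointer cast so no alignment is required.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Pack 8 bytes, each holding 0 or 1, so that bit i of the result is a[i].
// Assumes a little-endian machine, where a[i] lands in bit 8*i of the word.
inline std::uint32_t pack8_lsb_first(const std::uint8_t *a) {
    std::uint64_t t;
    std::memcpy(&t, a, sizeof t);  // load without any alignment requirement
    assert((t & ~0x0101010101010101ULL) == 0);  // every byte must be 0 or 1
    // The multiplier has set bits at 7, 14, ..., 56. No two partial products
    // land on the same bit, so there are no carries, and bit 8*i of t ends
    // up at bit 56+i of the product.
    return static_cast<std::uint32_t>((t * 0x0102040810204080ULL) >> 56);
}

// Pack 32 such bytes into one 32-bit value with bit i equal to a[i].
inline std::uint32_t pack32_lsb_first(const std::uint8_t *a) {
    return pack8_lsb_first(a) |
           (pack8_lsb_first(a + 8) << 8) |
           (pack8_lsb_first(a + 16) << 16) |
           (pack8_lsb_first(a + 24) << 24);
}
```

As with the original, this handles one-byte elements only; the question's unsigned int a[32] would first have to be narrowed to bytes.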
On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with
_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);
And to pack 2 uint32_t as the question requires use
_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
Other answers contain an obvious loop implementation.
Here's a first variant:
unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
result = (result<<1) + a[i];
On modern x86 CPUs, I think a register shift takes constant time regardless of the distance, in which case this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts: it does 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others shifts by a distance equal to the loop index, for a total shift distance of about 500 bits (0 + 1 + ... + 31 = 496). (See @Jongware's measurements of differences in comments; apparently long shifts on x86 are not unit time.)
Let us try something more radical.
Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two variables i1 and i2 each containing m such packed bits.
Then the following expression packs m*2 booleans into an int (the parentheses are needed because + binds tighter than <<):
(i1<<m)+i2
Using this we can pack 2^n bits as follows:
unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer
a2[0]=(a1[0]<<1)+a1[1]; // the original bits are a1[k]; can be scalar variables or ints
a2[1]=(a1[2]<<1)+a1[3]; // yes, you can use "|" instead of "+"
...
a2[15]=(a1[30]<<1)+a1[31];
a4[0]=(a2[0]<<2)+a2[1];
a4[1]=(a2[2]<<2)+a2[3];
...
a4[7]=(a2[14]<<2)+a2[15];
a8[0]=(a4[0]<<4)+a4[1];
a8[1]=(a4[2]<<4)+a4[3];
a8[2]=(a4[4]<<4)+a4[5];
a8[3]=(a4[6]<<4)+a4[7];
a16[0]=(a8[0]<<8)+a8[1];
a16[1]=(a8[2]<<8)+a8[3];
a32[0]=(a16[0]<<16)+a16[1];
Assuming our friendly compiler resolves aN[k] into a (scalar) direct memory access (if not, you can simply replace the variable aN[k] with aN_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits.)
On modern x86 CPUs, I think a register shift takes constant time regardless of the distance. If not, this code keeps the total shift distance small: each of the five levels shifts by 16 bit positions in total, so 80 bit positions across all 31 shifts.
On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect all the rest of the scalars to be schedulable by the compiler to fit in the registers, thus 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.
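The pairwise merging can also be written compactly with loops, which is convenient for checking it against the simple shift-and-add loop. This is only a sketch: the point of the listing above is that it has no loop, so this version gives that advantage up unless the compiler unrolls it, and the name pack_tree is made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack 32 values, each 0 or 1, by repeatedly merging neighbouring groups.
// After the round with width m, every entry holds 2*m packed bits.
// a[0] ends up in the most significant bit, as in the listing above.
inline std::uint32_t pack_tree(const std::array<std::uint32_t, 32> &a) {
    std::array<std::uint32_t, 32> work = a;
    for (std::uint32_t v : work) {
        assert(v <= 1);  // inputs must be single bits
    }
    std::size_t count = 32;
    for (unsigned m = 1; m < 32; m *= 2) {
        count /= 2;
        // Entry k is read from entries 2k and 2k+1, which are never below k,
        // so merging in place in ascending order is safe.
        for (std::size_t k = 0; k < count; ++k) {
            work[k] = (work[2 * k] << m) | work[2 * k + 1];
        }
    }
    return work[0];
}
```

The five rounds use shift widths 1, 2, 4, 8 and 16 and perform 16 + 8 + 4 + 2 + 1 = 31 merges, matching the 31 shifts and 31 adds counted above.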
If the starting booleans are really in an array, with each bit occupying the bottom bit of an otherwise zeroed byte:
bool a1[32];
then we can abuse our knowledge of memory layout to fetch several at a time:
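a4[0]=((unsigned int *)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int *)a1)[1];
...
a4[7]=((unsigned int *)a1)[7];
a8[0]=(a4[0]<<1)+a4[1];
a8[1]=(a4[2]<<1)+a4[3];
a8[2]=(a4[4]<<1)+a4[5];
a8[3]=(a4[6]<<1)+a4[7];
a16[0]=(a8[0]<<2)+a8[1];
a16[1]=(a8[2]<<2)+a8[3];
a32[0]=(a16[0]<<4)+a16[1];
// note: on a little-endian machine byte k of a32[0] holds a1[k], a1[4+k], ..., a1[28+k],
// so all 32 bits are packed, but interleaved rather than in index order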
Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).
To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).
As always, these solutions need to be performance tested.
I would probably go for this:
#include <stdio.h>

unsigned a[32] =
{
    1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
    , 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};
int main()
{
    unsigned b = 0;
    for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
        b |= a[i] << i;
    printf("b: %u\n", b);
}
Compiler optimization may well unroll that but just in case you can always try:
int main()
{
unsigned b = 0;
b |= a[0];
b |= a[1] << 1;
b |= a[2] << 2;
b |= a[3] << 3;
// ... etc
b |= a[31] << 31;
printf("b: %u\n", b);
}
To determine what the fastest way is, time all of the various suggestions. Here is one that well may end up as "the" fastest (using standard C, no processor dependent SSE or the likes):
unsigned int bits[32][2] = {
{0,0x80000000},{0,0x40000000},{0,0x20000000},{0,0x10000000},
{0,0x8000000},{0,0x4000000},{0,0x2000000},{0,0x1000000},
{0,0x800000},{0,0x400000},{0,0x200000},{0,0x100000},
{0,0x80000},{0,0x40000},{0,0x20000},{0,0x10000},
{0,0x8000},{0,0x4000},{0,0x2000},{0,0x1000},
{0,0x800},{0,0x400},{0,0x200},{0,0x100},
{0,0x80},{0,0x40},{0,0x20},{0,0x10},
{0,8},{0,4},{0,2},{0,1}
};
unsigned int b = 0;
for (unsigned int i = 0; i < 32; i++)
    b |= bits[i][a[i]];
The first value in the array is to be the leftmost bit: the highest possible value.
Testing a proof of concept with some rough timings shows this is indeed not magnitudes better than the straightforward loop with b |= (a[i]<<(31-i)):
Ira 3618 ticks
naive, unrolled 5620 ticks
Ira, 1-shifted 10044 ticks
Galik 10265 ticks
Jongware, using adds 12536 ticks
Jongware 12682 ticks
naive 13373 ticks
(Relative timings, with the same compiler options.)
(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)
unsigned b=0;
for(int i=31; i>=0; --i){
b<<=1;
b|=a[i];
}
Your problem is a good opportunity to use -->, also called the downto operator:
unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
b += b + a[i];
}
The advantage of using --> is it works with both signed and unsigned loop index types.
This approach is portable and readable, it might not produce the fastest code, but clang does unroll the loop and produce decent performance, see https://godbolt.org/g/6xgwLJ

AVX alignment in array

I'm using MSVC12 (Visual Studio 2013 Express) and I'm trying to implement a fast multiplication of 8*8 float values. The problem is the alignment: the vector actually has 9*n values, but I always need just the first 8 of each group of 9. So for n=0 the 32-byte alignment is guaranteed (when I use _mm_malloc), but for n=1 the "first" value sits at an offset of 4*9 = 36 bytes.
for(unsigned i = 0; i < n; i++) {
float *coeff_set = (float *)_mm_malloc(909 * 100 *sizeof(float), 32);
// this works for n=0, not n=1, n=2, ...
__m256 coefficients = _mm256_load_ps(&coeff_set[9 * i]);
__m256 result = _mm256_mul_ps(coefficients, coefficients);
...
}
Is there any possibility to solve this? I would like to keep the structure of my data, but if not possible, I would change it. One solution I found was to copy the 8 floats first in an aligned array, and then load it, but the performance-loss is way too high then.
You have two choices:
Pad each set of coefficients to 16 values to maintain alignment
Use the _mm256_loadu_ps intrinsic for unaligned accesses
The first choice is more speed-efficient, while the second is more space-efficient.
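To see why padding to 16 values restores alignment, it helps to compute the byte offset of each set. The following is a small sketch with made-up helper names (set_offset_bytes, set_is_aligned); it assumes 4-byte floats and a buffer whose start is 32-byte aligned, as _mm_malloc(..., 32) provides.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "the example assumes 4-byte floats");

constexpr std::size_t kValuesPerSet = 9;   // values stored per coefficient set
constexpr std::size_t kPaddedStride = 16;  // set size after padding, in floats

// Byte offset of set i from the start of the buffer, for a stride in floats.
inline std::size_t set_offset_bytes(std::size_t i, std::size_t stride) {
    assert(stride >= kValuesPerSet);  // a set must fit within its stride
    return i * stride * sizeof(float);
}

// True when set i starts on a 32-byte boundary, given a 32-byte aligned buffer.
inline bool set_is_aligned(std::size_t i, std::size_t stride) {
    return set_offset_bytes(i, stride) % 32 == 0;
}
```

With the original stride of 9 floats only every eighth set is aligned (offsets 0, 288, ...), while a stride of 16 floats gives offsets that are multiples of 64 bytes, so _mm256_load_ps is valid for every set.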
