My initial attempt looked like this (suppose we want to multiply):
__m128 mat[n]; /* rows */
__m128 vec[n] = {1,1,1,1};
float outvector[n];
for (int row=0;row<n;row++) {
for(int k =3; k < 8; k = k+ 4)
{
__m128 mrow = mat[k];
__m128 v = vec[row];
__m128 sum = _mm_mul_ps(mrow,v);
sum= _mm_hadd_ps(sum,sum); /* adds adjacent-two floats */
}
_mm_store_ss(&outvector[row],_mm_hadd_ps(sum,sum));
}
But this clearly doesn't work. How do I approach this?
I should load 4 at a time....
The other question is: if my array is very big (say n = 1000), how can I make it 16-byte aligned? Is that even possible?
OK... I'll use a row-major matrix convention. Each row of [m] requires two __m128 elements to hold its 8 floats. The 8x1 vector v is a column vector. Since you're using the haddps instruction, I'll assume SSE3 is available. Computing r = [m] * v:
void mul (__m128 r[2], const __m128 m[8][2], const __m128 v[2])
{
__m128 t0, t1, t2, t3, r0, r1, r2, r3;
t0 = _mm_mul_ps(m[0][0], v[0]);
t1 = _mm_mul_ps(m[1][0], v[0]);
t2 = _mm_mul_ps(m[2][0], v[0]);
t3 = _mm_mul_ps(m[3][0], v[0]);
t0 = _mm_hadd_ps(t0, t1);
t2 = _mm_hadd_ps(t2, t3);
r0 = _mm_hadd_ps(t0, t2);
t0 = _mm_mul_ps(m[0][1], v[1]);
t1 = _mm_mul_ps(m[1][1], v[1]);
t2 = _mm_mul_ps(m[2][1], v[1]);
t3 = _mm_mul_ps(m[3][1], v[1]);
t0 = _mm_hadd_ps(t0, t1);
t2 = _mm_hadd_ps(t2, t3);
r1 = _mm_hadd_ps(t0, t2);
t0 = _mm_mul_ps(m[4][0], v[0]);
t1 = _mm_mul_ps(m[5][0], v[0]);
t2 = _mm_mul_ps(m[6][0], v[0]);
t3 = _mm_mul_ps(m[7][0], v[0]);
t0 = _mm_hadd_ps(t0, t1);
t2 = _mm_hadd_ps(t2, t3);
r2 = _mm_hadd_ps(t0, t2);
t0 = _mm_mul_ps(m[4][1], v[1]);
t1 = _mm_mul_ps(m[5][1], v[1]);
t2 = _mm_mul_ps(m[6][1], v[1]);
t3 = _mm_mul_ps(m[7][1], v[1]);
t0 = _mm_hadd_ps(t0, t1);
t2 = _mm_hadd_ps(t2, t3);
r3 = _mm_hadd_ps(t0, t2);
r[0] = _mm_add_ps(r0, r1);
r[1] = _mm_add_ps(r2, r3);
}
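For completeness, one way to drive this from plain float arrays might look like the sketch below. The wrapper function matvec8 is mine, not part of the answer; it uses unaligned loads so it works regardless of how the data is allocated.
#include <pmmintrin.h> /* SSE3, needed for the haddps used inside mul() */

void matvec8(float r_out[8], const float m_in[8][8], const float v_in[8])
{
    __m128 m[8][2], v[2], r[2];
    for (int i = 0; i < 8; ++i) {
        m[i][0] = _mm_loadu_ps(&m_in[i][0]); /* columns 0..3 of row i */
        m[i][1] = _mm_loadu_ps(&m_in[i][4]); /* columns 4..7 of row i */
    }
    v[0] = _mm_loadu_ps(&v_in[0]);
    v[1] = _mm_loadu_ps(&v_in[4]);
    mul(r, m, v); /* add an explicit cast to const __m128 (*)[2] if your compiler complains */
    _mm_storeu_ps(&r_out[0], r[0]);
    _mm_storeu_ps(&r_out[4], r[1]);
}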
As for alignment, a variable of type __m128 should be automatically aligned on the stack. With dynamic memory, this is not a safe assumption: some malloc / new implementations may only return memory guaranteed to be 8-byte aligned.
The intrinsics header provides _mm_malloc and _mm_free; the align parameter should be 16 in this case.
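For example, a minimal sketch (the buffer name is mine; depending on the compiler, _mm_malloc is declared in xmmintrin.h or malloc.h):
#include <xmmintrin.h>

float *big = (float*)_mm_malloc(n * sizeof(float), 16); /* 16-byte aligned */
if (big) {
    /* ... use big with aligned loads/stores ... */
    _mm_free(big); /* must be released with _mm_free, not free() */
}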
Intel has developed a Small Matrix Library for matrices with sizes ranging from 1×1 to 6×6. Application Note AP-930 Streaming SIMD Extensions - Matrix Multiplication describes in detail the algorithm for multiplying two 6×6 matrices. This should be adaptable to other size matrices with some effort.
Related
I'm writing code that essentially takes advantage of SSE2 to optimize this code:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pC[sampleIndex] = exp((mMin + std::clamp(pA[sampleIndex] + pB[sampleIndex], 0.0, 1.0) * mRange) * ln2per12);
}
into this:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
// SSE2
__m128d bound_lower = _mm_set1_pd(0.0);
__m128d bound_upper = _mm_set1_pd(1.0);
__m128d rangeLn2per12 = _mm_set1_pd(mRange * ln2per12);
__m128d minLn2per12 = _mm_set1_pd(mMin * ln2per12);
__m128d loaded_a = _mm_load_pd(pA);
__m128d loaded_b = _mm_load_pd(pB);
__m128d result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
double *pCEnd = pC + roundintup8(blockSize);
for (; pC < pCEnd; pA += 8, pB += 8, pC += 8) {
_mm_store_pd(pC, result);
loaded_a = _mm_load_pd(pA + 2);
loaded_b = _mm_load_pd(pB + 2);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 2, result);
loaded_a = _mm_load_pd(pA + 4);
loaded_b = _mm_load_pd(pB + 4);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 4, result);
loaded_a = _mm_load_pd(pA + 6);
loaded_b = _mm_load_pd(pB + 6);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 6, result);
loaded_a = _mm_load_pd(pA + 8);
loaded_b = _mm_load_pd(pB + 8);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
}
And I would say it works pretty well. But I can't find any exp function for SSE2 to complete the chain of operations.
Reading this, it seems I need to call the standard exp() from the library?
Really? Isn't that penalizing? Are there other ways? A different built-in function?
I'm on MSVC, /arch:SSE2, /O2, producing 32-bit code.
The simplest way is to use an exponent approximation. One possibility is based on the limit exp(x) = lim (n -> infinity) of (1 + x/n)^n.
For n = 256 = 2^8:
__m128d fastExp1(__m128d x)
{
__m128d ret = _mm_mul_pd(_mm_set1_pd(1.0 / 256), x);
ret = _mm_add_pd(_mm_set1_pd(1.0), ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
return ret;
}
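For instance, in the SSE2 loop from the question this would simply wrap the already-computed argument right before each store (a sketch; the accuracy limits discussed below apply):
result = _mm_add_pd(minLn2per12, result); /* argument of exp, as in the question's loop */
_mm_store_pd(pC, fastExp1(result));       /* store exp(result) instead of result */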
The other idea is polynomial expansion, in particular a Taylor series expansion:
__m128d fastExp2(__m128d x)
{
const __m128d a0 = _mm_set1_pd(1.0);
const __m128d a1 = _mm_set1_pd(1.0);
const __m128d a2 = _mm_set1_pd(1.0 / 2);
const __m128d a3 = _mm_set1_pd(1.0 / 2 / 3);
const __m128d a4 = _mm_set1_pd(1.0 / 2 / 3 / 4);
const __m128d a5 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5);
const __m128d a6 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6);
const __m128d a7 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6 / 7);
__m128d ret = _mm_fmadd_pd(a7, x, a6);
ret = _mm_fmadd_pd(ret, x, a5);
// If the FMA extension is not present, use
// ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
ret = _mm_fmadd_pd(ret, x, a4);
ret = _mm_fmadd_pd(ret, x, a3);
ret = _mm_fmadd_pd(ret, x, a2);
ret = _mm_fmadd_pd(ret, x, a1);
ret = _mm_fmadd_pd(ret, x, a0);
return ret;
}
Note that with the same number of expansion terms, you can get a better approximation if you fit the function over the specific x range of interest, for example with the least-squares method.
All of these methods work in a very limited x range, but with continuous derivatives, which may be important in some cases.
There is a trick to approximate an exponent over a very wide range, but with noticeable piecewise-linear regions. It is based on reinterpreting integers as floating-point numbers. For a more detailed description, I recommend these references:
Piecewise linear approximation to exponential and logarithm
A Fast, Compact Approximation of the Exponential Function
A possible implementation of this approach:
__m128d fastExp3(__m128d x)
{
const __m128d a = _mm_set1_pd(1.0 / M_LN2);
const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
__m128d t = _mm_fmadd_pd(x, a, b);
return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(t), 11));
}
Despite the simplicity and wide x range of this method, be careful when using it in mathematics. Over small regions it gives a piecewise-linear approximation, which can disrupt sensitive algorithms, especially those using differentiation.
To compare the accuracy of the different methods, look at the graphs. The first graph covers the range x = [0..1). As you can see, the best approximation in this case is given by fastExp2(x); fastExp1(x) is slightly worse but acceptable. The worst approximation is provided by fastExp3(x): the piecewise structure is noticeable and discontinuities of the first derivative are present.
In the range x = [0..10), fastExp3(x) provides the best approximation; a bit worse is the approximation given by fastExp1(x), which with the same number of calculations holds up over a wider range than fastExp2(x).
The next step is to improve the accuracy of the fastExp3(x) algorithm. The easiest way to significantly increase accuracy is to use the identity exp(x) = exp(x/2)/exp(-x/2). Although it increases the amount of computation, it greatly reduces the error thanks to mutual error compensation in the division.
__m128d fastExp5(__m128d x)
{
const __m128d ap = _mm_set1_pd(0.5 / M_LN2);
const __m128d an = _mm_set1_pd(-0.5 / M_LN2);
const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
__m128d tp = _mm_fmadd_pd(x, ap, b);
__m128d tn = _mm_fmadd_pd(x, an, b);
tp = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tp), 11));
tn = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tn), 11));
return _mm_div_pd(tp, tn);
}
Even greater accuracy can be achieved by combining the fastExp1(x) or fastExp2(x) methods with the fastExp3(x) algorithm using the identity exp(x+dx) = exp(x)*exp(dx). As shown above, the first factor can be computed with an approach similar to fastExp3(x), while the fastExp1(x) or fastExp2(x) method can be used for the second factor. Finding the optimal combination is quite a difficult task, and I would recommend looking at the implementations in the libraries proposed in the other answers. A rough sketch of such a combination follows.
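A rough sketch of such a combination (my own code, not taken from any library): split x = k*ln2 + r with |r| <= ln2/2, build 2^k exactly from the exponent bits (a cleaner form of the fastExp3(x) trick) and approximate exp(r) with a short Taylor polynomial as in fastExp2(x). Pure SSE2, no FMA; it assumes the default round-to-nearest MXCSR mode and does not handle overflow, underflow or NaN.
#include <emmintrin.h>

__m128d fastExpCombined(__m128d x)
{
    const __m128d log2e = _mm_set1_pd(1.4426950408889634); /* 1/ln(2) */
    const __m128d ln2   = _mm_set1_pd(0.6931471805599453);

    /* k = round(x / ln2); cvtpd_epi32 rounds to nearest under the default MXCSR */
    __m128i ki = _mm_cvtpd_epi32(_mm_mul_pd(x, log2e));
    __m128d kd = _mm_cvtepi32_pd(ki);

    /* r = x - k*ln2, so |r| <= ln2/2 */
    __m128d r = _mm_sub_pd(x, _mm_mul_pd(kd, ln2));

    /* exp(r) ~= 1 + r + r^2/2 + r^3/6 + r^4/24 + r^5/120, evaluated in Horner form */
    __m128d p = _mm_set1_pd(1.0 / 120);
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0 / 24));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0 / 6));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(0.5));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0));
    p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(1.0));

    /* 2^k: place (k + 1023) into the exponent field of each double */
    __m128i biased = _mm_add_epi32(ki, _mm_set1_epi32(1023));
    __m128i expo   = _mm_slli_epi64(_mm_unpacklo_epi32(biased, _mm_setzero_si128()), 52);
    __m128d scale  = _mm_castsi128_pd(expo);

    return _mm_mul_pd(p, scale);
}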
There are several libraries that provide a vectorized exponential, with more or less accuracy.
SVML, provided with the Intel compiler (it provides intrinsics as well, so if you have a licence, you can use them), has different levels of precision (and speed).
You mentioned IPP, also from Intel, which also provides this functionality.
MKL also provides some interface for this computation (for this one, fixing the ISA can be done through macros, for instance if you need reproducibility or precision)
fmath is another option; you can pull the code out of its vectorized exp to integrate it inside your loop.
From experience, all of these are faster and more precise than a custom Padé approximation (not even talking about the unstable Taylor expansion that would give you negative numbers VERY quickly).
For SVML, IPP and MKL, I would check which is better: calling from inside your loop, or calling exp with one call for your full array (as the libraries could use AVX512 instead of just SSE2).
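As a minimal sketch of the whole-array variant with MKL's VML function vdExp (the buffer name expArgs is mine; it is assumed to hold the blockSize already-computed exponent arguments, and MKL must be installed and linked):
#include <mkl.h>

/* Sketch only: one call applies exp element-wise over the whole block. */
vdExp((MKL_INT)blockSize, expArgs, pC);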
There is no SSE2 implementation of exp so if you don't want to roll your own as suggested above, one option is to use AVX512 instructions on some hardware that supports ERI (Exponential and Reciprocal Instructions). See https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_exponential_and_reciprocal
I think that currently limits you to the Xeon Phi (as pointed out by Peter Cordes; I did find one claim about it being on Skylake and Cannon Lake but can't corroborate it), and bear in mind as well that the code won't work at all (i.e. will crash) on other architectures.
I am trying to rewrite C++ source code that uses SSE instructions into plain C++ code. I know I will lose performance, but it's an experiment I am trying to perform.
I was wondering if there is a C++ equivalent for doing the same as _mm_unpackhi_pd and _mm_unpacklo_pd. I have zero knowledge of SSE.
Below is a snippet of the code I am trying to convert, for reference. Any knowledge or tips would be helpful. Thank you.
for (unsigned chunk = 0; chunk < chunks; chunk++)
{
unsigned start = chunk * chunksize;
unsigned end =
std::min((chunk + 1) * chunksize, (unsigned)2 * w);
__m128d a2b2 =
_mm_load_pd(d_origx +
((2 * init_G_offset + start) & n2_m_1));
unsigned i2_mod_B = 0;
for (unsigned i = start; i < end; i += 2)
{
__m128d ab = a2b2;
a2b2 =
_mm_load_pd(d_origx +
((origx_offset + i) & n2_m_1));
__m128d cd = _mm_load_pd(d_filter + i);
__m128d cc = _mm_unpacklo_pd(cd, cd);
__m128d dd = _mm_unpackhi_pd(cd, cd);
__m128d a0a1 = _mm_unpacklo_pd(ab, a2b2);
__m128d b0b1 = _mm_unpackhi_pd(ab, a2b2);
__m128d ac = _mm_mul_pd(cc, a0a1);
__m128d ad = _mm_mul_pd(dd, a0a1);
__m128d bc = _mm_mul_pd(cc, b0b1);
__m128d bd = _mm_mul_pd(dd, b0b1);
__m128d ac_m_bd = _mm_sub_pd(ac, bd);
__m128d ad_p_bc = _mm_add_pd(ad, bc);
__m128d ab_times_cd = _mm_unpacklo_pd(ac_m_bd, ad_p_bc);
__m128d a2b2_times_cd =
_mm_unpackhi_pd(ac_m_bd, ad_p_bc);
__m128d xy = _mm_load_pd(d_x_sampt + i2_mod_B);
__m128d x2y2 = _mm_load_pd(d_x_sampt + i2_mod_B + 2);
__m128d st = _mm_add_pd(xy, ab_times_cd);
__m128d s2t2 = _mm_add_pd(x2y2, a2b2_times_cd);
_mm_store_pd(d_x_sampt + i2_mod_B, st);
_mm_store_pd(d_x_sampt + i2_mod_B + 2, s2t2);
i2_mod_B += 4;
}
}
Below you'll find the description of the two functions; I've also linked each function to its reference page. The whole reference is available here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
_mm_unpackhi_pd
__m128d _mm_unpackhi_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the high half of a and b, and store the results in dst.
_mm_unpacklo_pd
__m128d _mm_unpacklo_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the low half of a and b, and store the results in dst.
Exactly how to implement it depends on your representation, but basically you return a new value composed of the high (or low) half of a concatenated with the high (or low) half of b. For example:
typedef std::array<double, 2> __m128d; // plain-C++ stand-in for the SSE type; requires #include <array>
__m128d _mm_unpackhi_pd(__m128d a, __m128d b) {
    __m128d res;
    res[0] = a[1]; // high element of a
    res[1] = b[1]; // high element of b
    return res;
}
__m128d _mm_unpacklo_pd(__m128d a, __m128d b) {
    __m128d res;
    res[0] = a[0]; // low element of a
    res[1] = b[0]; // low element of b
    return res;
}
Weird timing on this question… I found this issue while implementing this function for SIMDe, and it's only 17 days old. If you want to use SIMDe as a reference, these functions are in sse2.h along with a lot of others. The code in SIMDe is a bit more complex than what's above, but that's mostly just to match the implementations of the other _mm_unpack* functions.
So I've come across another problem when dealing with AVX code. I have a case where I have 4 ymm registers that need to be split vertically into 4 other ymm registers
(i.e. ymm0(ABCD) -> ymm4(A...), ymm5(B...), ymm6(C...), ymm7(D...)).
Here is an example:
// a, b, c, d are __m256 structs with [] operators to access xyzw
__m256d A = _mm256_setr_pd(a[0], b[0], c[0], d[0]);
__m256d B = _mm256_setr_pd(a[1], b[1], c[1], d[1]);
__m256d C = _mm256_setr_pd(a[2], b[2], c[2], d[2]);
__m256d D = _mm256_setr_pd(a[3], b[3], c[3], d[3]);
Just putting Paul's comment into an answer:
What I need is a matrix transposition, which is easily done in AVX as indicated by the link he provided.
Here's my implementation for those who come across here:
void Transpose(__m256d* A, __m256d* T)
{
__m256d t0 = _mm256_shuffle_pd(A[0], A[1], 0b0000);
__m256d t1 = _mm256_shuffle_pd(A[0], A[1], 0b1111);
__m256d t2 = _mm256_shuffle_pd(A[2], A[3], 0b0000);
__m256d t3 = _mm256_shuffle_pd(A[2], A[3], 0b1111);
T[0] = _mm256_permute2f128_pd(t0, t2, 0b0100000);
T[1] = _mm256_permute2f128_pd(t1, t3, 0b0100000);
T[2] = _mm256_permute2f128_pd(t0, t2, 0b0110001);
T[3] = _mm256_permute2f128_pd(t1, t3, 0b0110001);
}
This function cuts the number of instructions roughly in half at full optimization compared to my previous attempt.
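For reference, reproducing the original _mm256_setr_pd construction with it looks roughly like this (a sketch; it assumes a, b, c, d are available as, or convert to, plain __m256d values):
__m256d in[4] = { a, b, c, d };
__m256d out[4];
Transpose(in, out);
/* out[0..3] now hold what A, B, C, D held in the _mm256_setr_pd version */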
What I am trying to do ultimately is multiplying two complex numbers like this:
z1 = R1 + I1*j
z2 = R2 + I2*j
z3 = z1 * z2 = (R1*R2 - I1*I2) + (R1*I2 + R2*I1)*j;
But what I have are two separate vectors for the real and imaginary parts of both of those complex numbers. So something like this:
v1 = [R1, R2, R3, R4 ... Rn] of z1
v2 = [I1, I2, I3, I4 ... In] of z1
v1 = [R1, R2, R3, R4 ... Rn] of z2
v2 = [I1, I2, I3, I4 ... In] of z2
So when I am trying to calculate z3 now, I do this:
void foo (std::vector<double> real1, std::vector<double> imag1,
std::vector<double> real2, std::vector<double> imag2)
{
std::vector<double> realResult;
std::vector<double> imagResult;
for (size_t i = 0; i < real1.size(); i++)
{
realResult.push_back(real1[i]*real2[i] - imag1[i]*imag2[i]);
imagResult.push_back(real1[i]*imag2[i] + real2[i]*imag1[i]);
}
//And so on
}
Now, this function is eating a lot of time. Surely there is another way of doing this; can you think of something I can use?
You might be able to make use of std::complex. It probably implements the operations you require at least close to as well as they can be implemented.
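A minimal sketch of that suggestion (it assumes the data is already stored as std::complex; the function name is mine):
#include <complex>
#include <cstddef>
#include <vector>

void mulComplex(const std::vector<std::complex<double>>& z1,
                const std::vector<std::complex<double>>& z2,
                std::vector<std::complex<double>>& z3)
{
    z3.resize(z1.size());
    for (std::size_t i = 0; i < z1.size(); ++i)
        z3[i] = z1[i] * z2[i]; /* (R1*R2 - I1*I2) + (R1*I2 + R2*I1)*j per element */
}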
EDIT (In reply to comment):
I would do this:
size_t num_items = real1.size();
std::vector<double> realResult;
realResult.reserve(num_items);
std::vector<double> imagResult;
imagResult.reserve(num_items);
for (size_t i = 0; i < num_items; ++i) {
// lalala not resizing any vectors, yay!
realResult.push_back(real1[i] * real2[i] - imag1[i] * imag2[i]);
imagResult.push_back(real1[i] * imag2[i] + real2[i] * imag1[i]);
}
Otherwise, if you have a large input array and you are doing a lot of multiplications on doubles, I'm afraid it might just be slow. The best you can do is arrange things contiguously in memory for bonus cache points. It's impossible to say what might work best without profiling the code.
Pass the parameters as const std::vector<double>& to avoid unnecessary copies.
You may also consider computing the multiplications in parallel; if N is big enough, the overhead of parallel computing is worthwhile.
Use a std::valarray of std::complex. It is simple and optimized for arithmetic operations.
void foo(std::valarray<std::complex<double>> & z1,
std::valarray<std::complex<double>> & z2)
{
auto z3 = z1 * z2; // applies to each element of two valarrays, or a valarray and a value
// . . .
}
EDIT: To convert the vectors to valarrays:
std::valarray<std::complex<double>> z1(real1.size());
for (size_t i = 0; i < z1.size(); ++i)
z1[i] = std::complex<double>(real1[i], imag1[i]);
I want to compute, for k = 0 to k = 100:
A[j][k] = ((A[j][k] - con*A[r][k]) % 2);
For that I am storing con*A[r][k] in some int temp[5],
and then doing A[j][k] - temp[] in SIMD. What is wrong with the code below? It gives a segmentation fault at the line __m128i m5 = _mm_sub_epi32(*m3, *m4);
while((k+4)<100)
{
__m128i *m3 = (__m128i*)A[j+k];
temp[0]=con*A[r][k];
temp[1]=con*A[r][k+1];
temp[2]=con*A[r][k+2];
temp[3]=con*A[r][k+3];
__m128i *m4 = (__m128i*)temp;
__m128i m5 =_mm_sub_epi32(*m3,*m4);
(temp_ptr)=(int*)&m5;
printf("%ld,%d,%ld\n",A[j][k],con,A[r][k]);
A[j][k] =temp_ptr[0]%2;
A[j][k+1]=temp_ptr[1]%2;
A[j][k+2]=temp_ptr[2]%2;
A[j][k+3]=temp_ptr[3]%2;
k=k+4;
}
Most likely, you didn't take care of the alignment. SIMD instructions require 16-byte alignment (see this article). Otherwise, your program will crash.
Either it is an alignment problem, or you have a wrong index somewhere and are accessing the wrong memory.
Without knowing the possible values of j, k, and r it's hard to tell why, but most likely you are over-indexing one of your arrays.
If you want to implement:
for (k = 0; k < 100; k += 4)
{
A[j][k] = (A[j][k] - con * A[r][k]) % 2;
}
and you want to see some benefit from SIMD, then you need to do it all in SIMD, i.e. don't mix SIMD and scalar code.
For example (untested):
const __m128i vcon = _mm_set1_epi32(con);
const __m128i vk1 = _mm_set1_epi32(1);
for (k = 0; k < 100; k += 4)
{
__m128i v1 = _mm_loadu_si128((const __m128i *)&A[j][k]); // load v1 from A[j][k..k+3] (misaligned)
__m128i v2 = _mm_loadu_si128((const __m128i *)&A[r][k]); // load v2 from A[r][k..k+3] (misaligned)
v2 = _mm_mullo_epi32(v2, vcon); // v2 = con * A[r][k..k+3] (note: _mm_mullo_epi32 requires SSE4.1)
v1 = _mm_sub_epi32(v1, v2); // v1 = A[j][k..k+3] - con * A[r][k..k+3]
v1 = _mm_and_si128(v1, vk1); // v1 = (A[j][k..k+3] - con * A[r][k..k+3]) % 2
_mm_storeu_si128((__m128i *)&A[j][k], v1); // store v1 back to A[j][k..k+3] (misaligned)
}
Note: if you can guarantee that each row of A is 16-byte aligned then you can change the misaligned loads/stores (_mm_loadu_si128/_mm_storeu_si128) to aligned loads/stores (_mm_load_si128/_mm_store_si128); this will help performance somewhat, depending on what CPU you are targeting.
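For example, one way to get that guarantee for a fixed-size matrix is shown below (a sketch; ROWS is a placeholder, it assumes the elements are 32-bit ints as the epi32 intrinsics imply, and it uses C++11 alignas — in C11, _Alignas or a compiler-specific attribute would be used instead):
#include <cstdint>

/* 100 columns * 4 bytes = 400 bytes per row, a multiple of 16, so aligning
   the whole array also leaves every row 16-byte aligned. */
alignas(16) int32_t A[ROWS][100];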