SSE to C++ code

I am trying to rewrite code from a C++ source that uses SSE intrinsics into plain C++. I know I will lose performance, but it's an experiment I want to perform.
I was wondering if there is a C++ equivalent for doing the same as _mm_unpackhi_pd and _mm_unpacklo_pd. I have zero knowledge of SSE.
A snippet of the code I am trying to convert is below for reference. Any knowledge or tips would be helpful. Thank you.
for (unsigned chunk = 0; chunk < chunks; chunk++)
{
    unsigned start = chunk * chunksize;
    unsigned end = std::min((chunk + 1) * chunksize, (unsigned)2 * w);
    __m128d a2b2 = _mm_load_pd(d_origx + ((2 * init_G_offset + start) & n2_m_1));
    unsigned i2_mod_B = 0;
    for (unsigned i = start; i < end; i += 2)
    {
        __m128d ab = a2b2;
        a2b2 = _mm_load_pd(d_origx + ((origx_offset + i) & n2_m_1));
        __m128d cd = _mm_load_pd(d_filter + i);
        __m128d cc = _mm_unpacklo_pd(cd, cd);
        __m128d dd = _mm_unpackhi_pd(cd, cd);
        __m128d a0a1 = _mm_unpacklo_pd(ab, a2b2);
        __m128d b0b1 = _mm_unpackhi_pd(ab, a2b2);
        __m128d ac = _mm_mul_pd(cc, a0a1);
        __m128d ad = _mm_mul_pd(dd, a0a1);
        __m128d bc = _mm_mul_pd(cc, b0b1);
        __m128d bd = _mm_mul_pd(dd, b0b1);
        __m128d ac_m_bd = _mm_sub_pd(ac, bd);
        __m128d ad_p_bc = _mm_add_pd(ad, bc);
        __m128d ab_times_cd = _mm_unpacklo_pd(ac_m_bd, ad_p_bc);
        __m128d a2b2_times_cd = _mm_unpackhi_pd(ac_m_bd, ad_p_bc);
        __m128d xy = _mm_load_pd(d_x_sampt + i2_mod_B);
        __m128d x2y2 = _mm_load_pd(d_x_sampt + i2_mod_B + 2);
        __m128d st = _mm_add_pd(xy, ab_times_cd);
        __m128d s2t2 = _mm_add_pd(x2y2, a2b2_times_cd);
        _mm_store_pd(d_x_sampt + i2_mod_B, st);
        _mm_store_pd(d_x_sampt + i2_mod_B + 2, s2t2);
        i2_mod_B += 4;
    }
}

Below you'll find the descriptions of the two functions, taken from their reference pages. The whole reference is available here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/
_mm_unpackhi_pd
__m128d _mm_unpackhi_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the high half of a and b, and store the results in dst.
_mm_unpacklo_pd
__m128d _mm_unpacklo_pd (__m128d a, __m128d b)
Unpack and interleave double-precision (64-bit) floating-point
elements from the low half of a and b, and store the results in dst.

Exactly how to implement it depends on your representation, but basically you return a new value composed of the high (or low) half of a concatenated with the high (or low) half of b. For example:
#include <array>
typedef std::array<double, 2> __m128d;   // one possible plain-C++ representation
__m128d _mm_unpackhi_pd(__m128d a, __m128d b) {
    __m128d res;
    res[0] = a[1];   // high element of a
    res[1] = b[1];   // high element of b
    return res;
}
__m128d _mm_unpacklo_pd(__m128d a, __m128d b) {
    __m128d res;
    res[0] = a[0];   // low element of a
    res[1] = b[0];   // low element of b
    return res;
}
Weird timing on this question… I ran into this while implementing these functions for SIMDe, and that issue is only 17 days old. If you want to use SIMDe as a reference, these functions are in sse2.h along with a lot of others. The code in SIMDe is a bit more complex than what's above, but that's mostly just to match the implementations of the other _mm_unpack* functions.
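In case it helps with the rest of the conversion: the unpack/mul/sub/add block in your loop amounts to two complex multiplications by a single filter coefficient, with the real/imaginary results accumulated into d_x_sampt. A scalar sketch of that core operation, using std::complex and a hypothetical helper name (not your variable layout), might look like this:
#include <complex>
// Hedged scalar sketch: each pair of doubles is an interleaved (real, imag) value.
// cmul_accumulate is a hypothetical helper, not part of the original code.
inline void cmul_accumulate(const double x[2], const double filt[2], double out[2]) {
    std::complex<double> a(x[0], x[1]);        // (a, b)
    std::complex<double> c(filt[0], filt[1]);  // (c, d)
    std::complex<double> r = a * c;            // (a*c - b*d) + i(a*d + b*c)
    out[0] += r.real();
    out[1] += r.imag();
}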

Related

Fsolve equivalent in C++

I am trying to replicate MATLAB's fsolve, as my project is in C++ solving an implicit RK4 scheme. I am using the NLopt library with the NLOPT_LD_MMA algorithm. I have run the required section in MATLAB and it is considerably faster. I was wondering whether anyone has any ideas for a better fsolve equivalent in C++. Another reason is that I would like f1 and f2 to both tend to zero, and it seems suboptimal to combine them into the L2 norm, since NLopt seems to only allow a scalar return value from the objective function. Does anyone have ideas for an alternative library, or perhaps a different algorithm/constraints, to more closely replicate the default fsolve?
Would it perhaps be better (faster) to call Python's scipy.optimize.fsolve from C++?
double implicitRK4(double time, double V, double dt, double I, double O, double C, double R){
    const int number_of_parameters = 2;
    double lb[number_of_parameters];
    double ub[number_of_parameters];
    lb[0] = -999; // k1 lb
    lb[1] = -999; // k2 lb
    ub[0] = 999;  // k1 ub
    ub[1] = 999;  // k2 ub
    double k[number_of_parameters];
    k[0] = 0.01;
    k[1] = 0.01;
    kOptData addData(time,V,dt,I,O,C,R);
    nlopt_opt opt; //NLOPT_LN_MMA NLOPT_LN_COBYLA
    opt = nlopt_create(NLOPT_LD_MMA, number_of_parameters);
    nlopt_set_lower_bounds(opt, lb);
    nlopt_set_upper_bounds(opt, ub);
    nlopt_remove_inequality_constraints(opt);
    // nlopt_remove_equality_constraints(opt);
    nlopt_set_min_objective(opt, solveKs, &addData);
    double minf;
    if (nlopt_optimize(opt, k, &minf) < 0) {
        printf("nlopt failed!\n");
    }
    else {
        printf("found minimum at f(%g,%g) = %0.10g\n", k[0], k[1], minf);
    }
    nlopt_destroy(opt);
    return V + (1/2)*dt*k[0] + (1/2)*dt*k[1];
}
double solveKs(unsigned n, const double *x, double *grad, void *my_func_data){
    kOptData *unpackdata = (kOptData*) my_func_data;
    double t1,y1,t2,y2;
    double f1,f2;
    t1 = unpackdata->time + ((1/2)-(1/6)*sqrt(3));
    y1 = unpackdata->V + (1/4)*unpackdata->dt*x[0] + ((1/4)-(1/6)*sqrt(3))*unpackdata->dt*x[1];
    t2 = unpackdata->time + ((1/2)+(1/6)*sqrt(3));
    y2 = unpackdata->V + ((1/4)+(1/6)*sqrt(3))*unpackdata->dt*x[0] + (1/4)*unpackdata->dt*x[1];
    f1 = x[0] - stateDeriv_implicit(t1,y1,unpackdata->dt,unpackdata->I,unpackdata->O,unpackdata->C,unpackdata->R);
    f2 = x[1] - stateDeriv_implicit(t2,y2,unpackdata->dt,unpackdata->I,unpackdata->O,unpackdata->C,unpackdata->R);
    return sqrt(pow(f1,2) + pow(f2,2));
}
My MATLAB version below seems a lot simpler, but I would prefer the whole code in C++!
k1 = 0.01;
k2 = 0.01;
x0 = [k1,k2];
fun = @(x)solveKs(x,t,z,h,I,OCV1,Cap,Rct,static);
options = optimoptions('fsolve','Display','none');
k = fsolve(fun,x0,options);
% Calculate the next state vector from the previous one using RungeKutta
% update equation
znext = z + (1/2)*h*k(1) + (1/2)*h*k(2);
function [F] = solveKs(x,t,z,h,I,O,C,R,static)
t1 = t + ((1/2)-(1/6)*sqrt(3));
y1 = z + (1/4)*h*x(1) + ((1/4)-(1/6)*sqrt(3))*h*x(2);
t2 = t + ((1/2)+(1/6)*sqrt(3));
y2 = z + ((1/4)+(1/6)*sqrt(3))*h*x(1) + (1/4)*h*x(2);
F(1) = x(1) - stateDeriv_implicit(t1,y1,h,I,O,C,R,static);
F(2) = x(2) - stateDeriv_implicit(t2,y2,h,I,O,C,R,static);
end
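For what it's worth: since this is only two equations in two unknowns, a small Newton iteration with a finite-difference Jacobian is one self-contained way to drive both f1 and f2 to zero without going through a scalar objective. The sketch below is illustrative only, with a placeholder residual function rather than the RK4 equations; it is not a drop-in fsolve replacement.
#include <cmath>
#include <cstdio>
// Illustrative 2x2 Newton solver with a forward-difference Jacobian.
// residual() is a placeholder for the two k-equations; replace with your own.
static void residual(const double x[2], double f[2]) {
    f[0] = x[0]*x[0] + x[1] - 2.0;   // example system, not the RK4 equations
    f[1] = x[0] + x[1]*x[1] - 2.0;
}
static bool newton2(double x[2], int max_iter = 50, double tol = 1e-12) {
    for (int it = 0; it < max_iter; ++it) {
        double f[2];
        residual(x, f);
        if (std::sqrt(f[0]*f[0] + f[1]*f[1]) < tol) return true;
        const double h = 1e-8;           // forward-difference step
        double J[2][2];
        for (int j = 0; j < 2; ++j) {
            double xp[2] = { x[0], x[1] };
            xp[j] += h;
            double fp[2];
            residual(xp, fp);
            J[0][j] = (fp[0] - f[0]) / h;
            J[1][j] = (fp[1] - f[1]) / h;
        }
        // solve J * dx = -f by Cramer's rule (2x2)
        double det = J[0][0]*J[1][1] - J[0][1]*J[1][0];
        if (det == 0.0) return false;
        x[0] += (-f[0]*J[1][1] + f[1]*J[0][1]) / det;
        x[1] += (-f[1]*J[0][0] + f[0]*J[1][0]) / det;
    }
    return false;
}
int main() {
    double k[2] = { 0.01, 0.01 };    // same initial guess as in the question
    if (newton2(k)) std::printf("k = %g, %g\n", k[0], k[1]);
}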

How do you process exp() with SSE2?

I'm writing code that essentially uses SSE2 to optimize this code:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pC[sampleIndex] = exp((mMin + std::clamp(pA[sampleIndex] + pB[sampleIndex], 0.0, 1.0) * mRange) * ln2per12);
}
in this:
double *pA = a;
double *pB = b[voiceIndex];
double *pC = c[voiceIndex];
// SSE2
__m128d bound_lower = _mm_set1_pd(0.0);
__m128d bound_upper = _mm_set1_pd(1.0);
__m128d rangeLn2per12 = _mm_set1_pd(mRange * ln2per12);
__m128d minLn2per12 = _mm_set1_pd(mMin * ln2per12);
__m128d loaded_a = _mm_load_pd(pA);
__m128d loaded_b = _mm_load_pd(pB);
__m128d result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
double *pCEnd = pC + roundintup8(blockSize);
for (; pC < pCEnd; pA += 8, pB += 8, pC += 8) {
_mm_store_pd(pC, result);
loaded_a = _mm_load_pd(pA + 2);
loaded_b = _mm_load_pd(pB + 2);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 2, result);
loaded_a = _mm_load_pd(pA + 4);
loaded_b = _mm_load_pd(pB + 4);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 4, result);
loaded_a = _mm_load_pd(pA + 6);
loaded_b = _mm_load_pd(pB + 6);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
_mm_store_pd(pC + 6, result);
loaded_a = _mm_load_pd(pA + 8);
loaded_b = _mm_load_pd(pB + 8);
result = _mm_add_pd(loaded_a, loaded_b);
result = _mm_max_pd(bound_lower, result);
result = _mm_min_pd(bound_upper, result);
result = _mm_mul_pd(rangeLn2per12, result);
result = _mm_add_pd(minLn2per12, result);
}
And I would say it works pretty well. BUT, I can't find any exp function for SSE2 to complete the chain of operations.
Reading this, it seems I need to call the standard exp() from the library?
Really? Isn't that a penalty? Are there other ways? A different builtin function?
I'm on MSVC, /arch:SSE2, /O2, producing 32-bit code.
The simplest way is to use an exponent approximation. One possible approach is based on the limit exp(x) = lim (n -> infinity) of (1 + x/n)^n.
For n = 256 = 2^8:
__m128d fastExp1(__m128d x)
{
__m128d ret = _mm_mul_pd(_mm_set1_pd(1.0 / 256), x);
ret = _mm_add_pd(_mm_set1_pd(1.0), ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
ret = _mm_mul_pd(ret, ret);
return ret;
}
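For reference, the scalar computation behind this is just eight repeated squarings of (1 + x/256):
// Scalar equivalent of fastExp1: (1 + x/256) squared eight times gives (1 + x/256)^256
double fastExp1_scalar(double x)
{
    double r = 1.0 + x / 256.0;
    for (int k = 0; k < 8; ++k)
        r *= r;   // eight squarings -> exponent 2^8 = 256
    return r;
}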
The other idea is polynomial expansion; in particular, a Taylor series expansion:
__m128d fastExp2(__m128d x)
{
const __m128d a0 = _mm_set1_pd(1.0);
const __m128d a1 = _mm_set1_pd(1.0);
const __m128d a2 = _mm_set1_pd(1.0 / 2);
const __m128d a3 = _mm_set1_pd(1.0 / 2 / 3);
const __m128d a4 = _mm_set1_pd(1.0 / 2 / 3 / 4);
const __m128d a5 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5);
const __m128d a6 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6);
const __m128d a7 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6 / 7);
__m128d ret = _mm_fmadd_pd(a7, x, a6);
ret = _mm_fmadd_pd(ret, x, a5);
// If the FMA extension is not present, use
// ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
ret = _mm_fmadd_pd(ret, x, a4);
ret = _mm_fmadd_pd(ret, x, a3);
ret = _mm_fmadd_pd(ret, x, a2);
ret = _mm_fmadd_pd(ret, x, a1);
ret = _mm_fmadd_pd(ret, x, a0);
return ret;
}
Note that with the same number of expansion terms, you can get a better approximation if you approximate the function for a specific x range, using for example the least-squares method.
All of these methods work in a very limited x range, but with continuous derivatives, which may be important in some cases.
There is a trick to approximate an exponent over a very wide range, but with noticeable piecewise-linear regions. It is based on reinterpreting integers as floating-point numbers. For a more accurate description, I recommend these references:
Piecewise linear approximation to exponential and logarithm
A Fast, Compact Approximation of the Exponential Function
A possible implementation of this approach:
__m128d fastExp3(__m128d x)
{
const __m128d a = _mm_set1_pd(1.0 / M_LN2);
const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
__m128d t = _mm_fmadd_pd(x, a, b);
return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(t), 11));
}
Despite the simplicity and wide x range of this method, be careful when using it in math. In small regions it gives a piecewise approximation, which can disrupt sensitive algorithms, especially those using differentiation.
To compare the accuracy of the different methods, look at the graphs. The first graph is made for the range x = [0..1). As you can see, the best approximation in this case is given by fastExp2(x); slightly worse but acceptable is fastExp1(x). The worst approximation is provided by fastExp3(x): the piecewise structure is noticeable and discontinuities of the first derivative are present.
In the range x = [0..10) the fastExp3(x) method provides the best approximation; a bit worse is the approximation given by fastExp1(x), which with the same number of calculations is more accurate than fastExp2(x).
The next step is to improve the accuracy of the fastExp3(x) algorithm. The easiest way to significantly increase accuracy is to use the equality exp(x) = exp(x/2)/exp(-x/2). Although it increases the amount of computation, it greatly reduces the error thanks to mutual error compensation when dividing.
__m128d fastExp5(__m128d x)
{
const __m128d ap = _mm_set1_pd(0.5 / M_LN2);
const __m128d an = _mm_set1_pd(-0.5 / M_LN2);
const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
__m128d tp = _mm_fmadd_pd(x, ap, b);
__m128d tn = _mm_fmadd_pd(x, an, b);
tp = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tp), 11));
tn = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tn), 11));
return _mm_div_pd(tp, tn);
}
Even greater accuracy can be achieved by combining the methods from fastExp1(x) or fastExp2(x) with the fastExp3(x) algorithm, using the equality exp(x+dx) = exp(x)*exp(dx). As shown above, the first factor can be computed with the fastExp3(x) approach, while for the second factor the fastExp1(x) or fastExp2(x) method can be used. Finding the optimal solution in this case is quite a difficult task, and I would recommend looking at the implementations in the libraries proposed in the other answers.
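However the vectorized exp is computed, it drops straight into the loop from the question. A minimal sketch using fastExp1 from above (assuming its accuracy is acceptable over the bounded argument range, that blockSize is a multiple of 2, and that the pointers and constants pA, pB, pC, bound_lower, bound_upper, rangeLn2per12, minLn2per12 are set up exactly as in the question):
// Sketch only: same setup as the question's loop, two doubles per iteration.
for (int i = 0; i < blockSize; i += 2, pA += 2, pB += 2, pC += 2) {
    __m128d arg = _mm_add_pd(_mm_load_pd(pA), _mm_load_pd(pB));
    arg = _mm_min_pd(bound_upper, _mm_max_pd(bound_lower, arg));
    arg = _mm_add_pd(minLn2per12, _mm_mul_pd(rangeLn2per12, arg));
    _mm_store_pd(pC, fastExp1(arg));   // vectorized approximation instead of scalar exp()
}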
There are several libraries that provide a vectorized exponential, with more or less accuracy.
SVML, provided with the Intel compiler (it provides intrinsics as well, so if you have a licence you can use them), has different levels of precision (and speed).
You mentioned IPP, also from Intel, which also provides some of this functionality.
MKL also provides an interface for this computation (for this one, fixing the ISA can be done through macros, for instance if you need reproducibility or precision).
fmath is another option; you can tear the code out of its vectorized exp to integrate it inside your loop.
From experience, all of these are faster and more precise than a custom Padé approximation (not even talking about the unstable Taylor expansion, which would give you negative numbers VERY quickly).
For SVML, IPP and MKL, I would check which is better: calling from inside your loop, or calling exp once over your full array (as the libraries could use AVX512 instead of just SSE2).
There is no SSE2 implementation of exp, so if you don't want to roll your own as suggested above, one option is to use AVX512 instructions on hardware that supports ERI (Exponential and Reciprocal Instructions). See https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_exponential_and_reciprocal
I think that currently limits you to the Xeon Phi (as pointed out by Peter Cordes - I did find one claim about it being on Skylake and Cannonlake but can't corroborate it), and bear in mind as well that the code won't work at all (i.e. it will crash) on other architectures.

Iteration causes crash

What is wrong with this iteration?
This particular piece of code is causing my program to crash. When I disable the code it works, but of course gives wrong results. It's supposed to compare sigma with sigma_last until they agree to 14 decimal places (1e-14).
This is what I tried first:
long double sigma_last = NULL;
do{
    if(sigma_last != NULL){
        sigma = sigma_last;
    }
    sigma1 = atan( tan(beta1) / cos(A1) );
    sigmaM = (2*sigma1 + sigma) / 2;
    d_sigma = B*sin(sigma)*(cos(2*sigmaM)+(1/4)*B*(cos(sigma)
        *(-1+2*pow(cos(2*sigmaM),2)))-(1/6)*B*cos(2*sigmaM)
        *(-3+4*pow(sin(sigma),2))*(-3+4*pow(cos(2*sigmaM),2)));
    sigma_last = sigma + d_sigma;
}
while(set_precision_14(sigma)<= set_precision_14(sigma_last) || set_precision_14(sigma)>= set_precision_14(sigma_last));
Then I tried using a pointer (desperately):
long double *sigma_last;
*sigma_last = NULL;
do{
    if(*sigma_last != NULL){
        sigma = *sigma_last;
    }
    sigma1 = atan( tan(beta1) / cos(A1) );
    sigmaM = (2*sigma1 + sigma) / 2;
    d_sigma = B*sin(sigma)*(cos(2*sigmaM)+(1/4)*B*(cos(sigma)
        *(-1+2*pow(cos(2*sigmaM),2)))-(1/6)*B*cos(2*sigmaM)
        *(-3+4*pow(sin(sigma),2))*(-3+4*pow(cos(2*sigmaM),2)));
    *sigma_last = sigma + d_sigma;
}
while(set_precision_14(sigma)<= set_precision_14(*sigma_last) || set_precision_14(sigma)>= set_precision_14(*sigma_last));
Finding the source of the error in the entire code and trying to solve it took me hours, and I cannot really come up with another "maybe this?". Feel free to smite me.
Here's a github link to my full code if anyone out there's interested.
On your first (and only) iteration, sigma_last is an uninitialized pointer, and dereferencing it crashes:
*sigma_last = NULL; // <-- dereferencing uninitialized ptr here
if(*sigma_last != NULL) { // <-- dereferencing uninitialized ptr here too
and even if that were fixed, the same problem occurs here:
*sigma_last = sigma + d_sigma;
This is because you have not set sigma_last to point to valid floating-point storage in memory. There doesn't seem to be any point to using a pointer in this particular case, so if I were you, I'd drop it and use a normal long double instead, as in your first attempt.
In your first example you assign NULL, which is really the value zero, to sigma_last. If zero is not what you're intending, you could either go with a value that will most certainly be out of range (say 1e20, and then compare against, say, < 1e19) or keep a separate boolean for the job. I personally prefer the first option:
long double sigma_last = 1e20;
...
if(sigma_last < 1e19){
sigma = sigma_last;
}
A better way still would be to use an infinite, or finite, loop and then break out at a certain condition. This will make the code easier to read.
Logic
Finally, you seem to have a problem with your logic in the while, since the comparison sigma <= sigma_last || sigma >= sigma_last is always true. It's always smaller, bigger, or equal.
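A minimal, self-contained sketch of the loop structure suggested above (iterate until successive values agree to within 1e-14), with a placeholder update in place of the geodesic formula:
#include <cmath>
// update() stands in for the sigma -> sigma + d_sigma computation from the question.
long double update(long double sigma)
{
    return 0.5L * sigma + 1.0L;   // placeholder, not the real formula
}
long double iterate_until_converged(long double sigma0)
{
    long double sigma = sigma0, sigma_prev;
    do {
        sigma_prev = sigma;
        sigma = update(sigma_prev);
    } while (std::fabs(sigma - sigma_prev) > 1e-14L);
    return sigma;
}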
sigma_last does not need to be a pointer. You just need to flag its value somehow, so you know whether it has been set yet or not. From your code I am not sure we can use zero for this purpose, but we can use some other constant (the long double minimum value), like this one:
#include <float.h>
const long double invalid_constant = LDBL_MIN;
Try this:
long double DESTINATION_CALCULATION_plusplus ( double phi, double lambda, double S, double azimuth,
        double a, double b, double *phi2, double* lambda2, double* azimuth2){
    phi = phi*M_PI/180;
    lambda = lambda*M_PI/180;
    double A1;
    double eu2 = (pow(a, 2) - pow(b, 2)) / pow(b, 2); //second eccentricity
    double c = pow(a,2) / b;
    double v = sqrt(1 + (eu2 * pow(cos(phi) , 2)));
    double beta1 = tan(phi) / v;
    double Aeq = asin( cos(beta1) * sin(azimuth) );
    double f = (a - b) / a; //flattening
    double beta = atan((1-f)*tan(phi));
    double u2 = pow(cos(Aeq),2)*eu2;
    //////////////////////////////----------------------------------------------
    long double sigma1 = atan( tan(beta1)/ cos(azimuth) );
    long double A = 1 + u2*(4096 + u2*(-768+u2*(320-175*u2))) / 16384;
    long double B = u2*(256 + u2*(-128+u2*(74-47*u2)))/1024;
    long double sigma = S / (b*A);
    long double sigmaM = (2*sigma1 + sigma) /2;
    long double d_w;
    long double d_sigma;
    ////////////////////////////------------------------------------------------
    double C;
    double d_lambda;
    long double sigma_last=invalid_constant;
    do{
        if(sigma_last != invalid_constant){
            sigma = sigma_last;
        }
        sigma1 = atan( tan(beta1) / cos(A1) );
        sigmaM = (2*sigma1 + sigma) / 2;
        d_sigma = B*sin(sigma)*(cos(2*sigmaM)+(1/4)*B*(cos(sigma)
            *(-1+2*pow(cos(2*sigmaM),2)))-(1/6)*B*cos(2*sigmaM)
            *(-3+4*pow(sin(sigma),2))*(-3+4*pow(cos(2*sigmaM),2)));
        sigma_last = sigma + d_sigma;
    }
    while(set_precision_14(sigma)<= set_precision_14(sigma_last) || set_precision_14(sigma)>= set_precision_14(sigma_last));
    sigma = sigma_last;
    *phi2 = atan((sin(beta1)*cos(sigma)+cos(beta1)*sin(sigma)*cos(azimuth))/((1-f)
        *sqrt(pow(sin(Aeq),2)+pow((sin(beta1)*sin(sigma)-cos(beta1)*cos(sigma)*cos(azimuth)),2))));
    d_w = (sin(sigma)*sin(azimuth))/(cos(beta1)*cos(sigma) - sin(beta1)* sin(sigma)*cos(azimuth));
    C = (f/16)*pow(cos(Aeq),2)*(4+f*(4-3*pow(cos(Aeq),2)));
    d_lambda = d_w - (1-C)*f*sin(azimuth)*(sigma + C*sin(sigma)*
        (cos(2*sigmaM)+C*cos(sigma)*(-1+2*pow(cos(2*sigmaM),2))));
    *lambda2 = lambda + d_lambda;
    *azimuth2 = sin(Aeq) / (-sin(beta1)*sin(sigma)+cos(beta1)*cos(sigma)*cos(azimuth));
    *azimuth2 = *azimuth2 * 180/M_PI;
    *lambda2 = *lambda2 * 180/M_PI;
    *phi2 = *phi2 * 180/M_PI;
}

Difference between ldexp(1, x) and exp2(x)

It seems if the floating-point representation has radix 2 (i.e. FLT_RADIX == 2) both std::ldexp(1, x) and std::exp2(x) raise 2 to the given power x.
Does the standard define or mention any expected behavioral and/or performance difference between them? What is the practical experience over different compilers?
exp2(x) and ldexp(x,i) perform two different operations. The former computes 2^x, where x is a floating-point number, while the latter computes x*2^i, where i is an integer. For integer values of x, exp2(x) and ldexp(1,int(x)) would be equivalent, provided the conversion of x to integer doesn't overflow.
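A small self-contained illustration of that distinction, using only standard <cmath> calls:
#include <cmath>
#include <cstdio>
int main (void)
{
    double a = std::exp2(0.5);       // 2^0.5, fractional exponents are allowed
    double b = std::ldexp(3.0, 4);   // 3 * 2^4 = 48, the exponent is an int
    double c = std::ldexp(1.0, 10);  // 1 * 2^10 = 1024, equals exp2(10.0) for integer powers
    std::printf("%g %g %g\n", a, b, c);
    return 0;
}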
The question about the relative efficiency of these two functions doesn't have a clear-cut answer. It will depend on the capabilities of the hardware platform and the details of the library implementation. While conceptually, ldexpf() looks like simple manipulation of the exponent part of a floating-point operand, it is actually a bit more complicated than that, once one considers overflow and gradual underflow via denormals. The latter case involves the rounding of the significand (mantissa) part of the floating-point number.
As ldexp() is generally an infrequently used function, it is in my experience fairly common that less of an optimization effort is applied to it by math library writers than to other math functions.
On some platforms, ldexp(), or a faster (custom) version of it, will be used as a building block in the software implementation of exp2(). The following code provides an exemplary implementation of this approach for float arguments:
#include <cmath>
/* Compute exponential base 2. Maximum ulp error = 0.86770 */
float my_exp2f (float a)
{
const float cvt = 12582912.0f; // 0x1.8p23
const float large = 1.70141184e38f; // 0x1.0p127
float f, r;
int i;
// exp2(a) = exp2(i + f); i = rint (a)
r = (a + cvt) - cvt;
f = a - r;
i = (int)r;
// approximate exp2(f) on interval [-0.5,+0.5]
r = 1.53720379e-4f; // 0x1.426000p-13f
r = fmaf (r, f, 1.33903872e-3f); // 0x1.5f055ep-10f
r = fmaf (r, f, 9.61817801e-3f); // 0x1.3b2b20p-07f
r = fmaf (r, f, 5.55036031e-2f); // 0x1.c6af7ep-05f
r = fmaf (r, f, 2.40226522e-1f); // 0x1.ebfbe2p-03f
r = fmaf (r, f, 6.93147182e-1f); // 0x1.62e430p-01f
r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+00f
// exp2(a) = 2**i * exp2(f);
r = ldexpf (r, i);
if (!(fabsf (a) < 150.0f)) {
r = a + a; // handle NaNs
if (a < 0.0f) r = 0.0f;
if (a > 0.0f) r = large * large; // + INF
}
return r;
}
Most real-life implementations of exp2() do not invoke ldexp(), but a custom version, for example when fast bit-wise transfer between integer and floating-point data is supported, here represented by internal functions __float_as_int() and __int_as_float() that re-interpret an IEEE-754 binary32 as an int32 and vice versa:
/* For a in [0.5, 4), compute a * 2**i, -250 < i < 250 */
float fast_ldexpf (float a, int i)
{
int ia = (i << 23) + __float_as_int (a); // scale by 2**i
a = __int_as_float (ia);
if ((unsigned int)(i + 125) > 250) { // |i| > 125
i = (i ^ (125 << 23)) - i; // ((i < 0) ? -125 : 125) << 23
a = __int_as_float (ia - i); // scale by 2**(+/-125)
a = a * __int_as_float ((127 << 23) + i); // scale by 2**(+/-(i%125))
}
return a;
}
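For readers unfamiliar with the bit-level trick: for binary32 operands in the normal range, adding i to the stored exponent field multiplies the value by 2^i, which is what the (i << 23) addition above does. A scalar illustration in standard C++, using memcpy in place of the internal reinterpretation functions (valid only when input and result are both normal numbers):
#include <cstdint>
#include <cstring>
#include <cstdio>
/* Scalar illustration of the exponent-field trick used in fast_ldexpf above. */
float scale_by_pow2 (float a, int i)
{
    std::uint32_t bits;
    std::memcpy (&bits, &a, sizeof bits);           // reinterpret float as uint32
    bits += (std::uint32_t)i << 23;                 // bump the exponent field by i
    std::memcpy (&a, &bits, sizeof a);
    return a;
}
int main (void)
{
    std::printf ("%f\n", scale_by_pow2 (1.5f, 4));  // prints 24.000000 (1.5 * 2^4)
    return 0;
}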
On other platforms, the hardware provides a single-precision version of exp2() as a fast hardware instruction. Internal to the processor these are typically implemented by a table lookup with linear or quadratic interpolation. On such hardware platforms, ldexp(float) may be implemented in terms of exp2(float), for example:
float my_ldexpf (float x, int i)
{
float r, fi, fh, fq, t;
fi = (float)i;
/* NaN, Inf, zero require argument pass-through per ISO standard */
if (!(fabsf (x) <= 3.40282347e+38f) || (x == 0.0f) || (i == 0)) {
r = x;
} else if (abs (i) <= 126) {
r = x * exp2f (fi);
} else if (abs (i) <= 252) {
fh = (float)(i / 2);
r = x * exp2f (fh) * exp2f (fi - fh);
} else {
fq = (float)(i / 4);
t = exp2f (fq);
r = x * t * t * t * exp2f (fi - 3.0f * fq);
}
return r;
}
Lastly, there are platforms that basically provide both exp2() and ldexp() functionality in hardware, such as the x87 instructions F2XM1 and FSCALE on x86 processors.

How can I add together two SSE registers

I have two SSE registers (one register is 128 bits) and I want to add them. I know how to add the corresponding words in them; for example, I can do it with _mm_add_epi16 if I use 16-bit words in the registers. But what I want is something like _mm_add_epi128 (which does not exist), which would treat the register as one big word.
Is there any way to perform this operation, even if multiple instructions are needed?
I was thinking about using _mm_add_epi64, detecting overflow in the right word and then adding 1 to the left word of the register if needed, but I would also like this approach to work for 256-bit registers (AVX2), and it seems too complicated for that.
To add two 128-bit numbers x and y to give z with SSE you can do it like this
z = _mm_add_epi64(x,y);
c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
z = _mm_sub_epi64(z,c);
This is based on this link how-can-i-add-and-subtract-128-bit-integers-in-c-or-c.
The function unsigned_lessthan is defined below. It's complicated without AMD XOP (actually I found a simpler version for SSE4.2 if XOP is not available - see the end of my answer). Probably some of the other people here can suggest a better method. Here is some code showing this works.
#include <stdint.h>
#include <x86intrin.h>
#include <stdio.h>
inline __m128i unsigned_lessthan(__m128i a, __m128i b) {
#ifdef __XOP__ // AMD XOP instruction set
    return _mm_comgt_epu64(b,a);
#else // SSE2 instruction set
    __m128i sign32 = _mm_set1_epi32(0x80000000); // sign bit of each dword
    __m128i bflip = _mm_xor_si128(b,sign32); // b with sign bits flipped
    __m128i aflip = _mm_xor_si128(a,sign32); // a with sign bits flipped
    __m128i equal = _mm_cmpeq_epi32(b,a); // a == b, dwords
    __m128i bigger = _mm_cmpgt_epi32(bflip,aflip); // b > a (unsigned), dwords
    __m128i biggerl = _mm_shuffle_epi32(bigger,0xA0); // low dwords copied to high dwords
    __m128i eqbig = _mm_and_si128(equal,biggerl); // high dwords equal and low dwords bigger
    __m128i hibig = _mm_or_si128(bigger,eqbig); // high dwords bigger, or equal with low dwords bigger
    __m128i big = _mm_shuffle_epi32(hibig,0xF5); // result copied to low dwords
    return big;
#endif
}
int main() {
    __m128i x,y,z,c;
    x = _mm_set_epi64x(3,0xffffffffffffffffll);
    y = _mm_set_epi64x(1,0x2ll);
    z = _mm_add_epi64(x,y);
    c = _mm_unpacklo_epi64(_mm_setzero_si128(), unsigned_lessthan(z,x));
    z = _mm_sub_epi64(z,c);
    int out[4];
    //int64_t out[2];
    _mm_storeu_si128((__m128i*)out, z);
    printf("%d %d\n", out[2], out[0]);
}
Edit:
The only potentially efficient way to add 128-bit or 256-bit numbers with SSE is with XOP. The only option with AVX would be XOP2, which does not exist yet. And even if you had XOP, it may only be efficient to add two 128-bit or 256-bit numbers in parallel (you could do four with AVX if XOP2 existed) to avoid horizontal instructions such as mm_unpacklo_epi64.
The best solution in general is to push the registers onto the stack and use scalar arithmetic. Assuming you have two 256-bit registers x4 and y4, you can add them like this:
__m256i x4, y4, z4;
uint64_t x[4], y[4], z[4];
_mm256_storeu_si256((__m256i*)x, x4);
_mm256_storeu_si256((__m256i*)y, y4);
add_u256(x,y,z);
z4 = _mm256_loadu_si256((__m256i*)z);
void add_u256(uint64_t x[4], uint64_t y[4], uint64_t z[4]) {
    uint64_t c1 = 0, c2 = 0, tmp;
    //add low 128-bits
    z[0] = x[0] + y[0];
    z[1] = x[1] + y[1];
    c1 += z[1]<x[1];
    tmp = z[1];
    z[1] += z[0]<x[0];
    c1 += z[1]<tmp;
    //add high 128-bits + carry from low 128-bits
    z[2] = x[2] + y[2];
    c2 += z[2]<x[2];
    tmp = z[2];
    z[2] += c1;
    c2 += z[2]<tmp;
    z[3] = x[3] + y[3] + c2;
}
int main() {
    uint64_t x[4], y[4], z[4];
    x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    y[0] = 1; y[1] = 1; y[2] = 1; y[3] = 1;
    //z = x + y, (z3,z2,z1,z0) = (2,3,1,0)
    //x[0] = -1; x[1] = -1; x[2] = 1; x[3] = 1;
    //y[0] = 1; y[1] = 0; y[2] = 1; y[3] = 1;
    //z = x + y, (z3,z2,z1,z0) = (2,3,0,0)
    add_u256(x,y,z);
    for(int i=3; i>=0; i--) printf("%llu ", (unsigned long long)z[i]); printf("\n");
}
Edit: based on a comment by Stephen Canon at saturated-substraction-avx-or-sse4-2 I discovered there is a more efficient way to compare unsigned 64-bit numbers with SSE4.2 if XOP is not available.
__m128i a,b;
__m128i sign64 = _mm_set1_epi64x(0x8000000000000000L);
__m128i aflip = _mm_xor_si128(a, sign64);
__m128i bflip = _mm_xor_si128(b, sign64);
__m128i cmp = _mm_cmpgt_epi64(aflip,bflip);
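Wrapped into the same interface as unsigned_lessthan above, that comparison might look like the following sketch (assuming SSE4.2 is available for _mm_cmpgt_epi64, e.g. compiling with -msse4.2 or equivalent):
#include <immintrin.h>
// Sketch: unsigned 64-bit "a < b" per lane, via the SSE4.2 signed compare
// after biasing both operands into the signed range.
static inline __m128i unsigned_lessthan_sse42(__m128i a, __m128i b) {
    __m128i sign64 = _mm_set1_epi64x((long long)0x8000000000000000ULL);
    __m128i aflip = _mm_xor_si128(a, sign64);
    __m128i bflip = _mm_xor_si128(b, sign64);
    return _mm_cmpgt_epi64(bflip, aflip);   // b > a (unsigned)  <=>  a < b
}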