Why doesn't g++ -O3 change division to multiplication?

I thought g++ -O3 would change division to multiplication automatically. But according to this code:
#include <iostream>
#include <sys/time.h>
double compute0(int i) {
double d_2 = i * i;
double ret = 0;
for (int j = 0; j < 1000000; j++) {
ret += j;
}
return ret;
}
double compute1(int i) {
double d_2 = i * i;
double ret = 0;
for (int j = 0; j < 1000000; j++) {
ret += j / d_2;
}
return ret;
}
double compute2(int i) {
double d_2 = i * i;
double d_2_inv = 1.0 / d_2;
double ret = 0;
for (int j = 0; j < 1000000; j++) {
ret += j * d_2_inv;
}
return ret;
}
double tik() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec * 1e-6;
}
int main() {
{
double begin = tik();
double ret = 0;
for(int i = 1; i < 100; i++)
ret += compute0(i);
double end = tik();
std::cout << "cost time: " << end - begin << " ret: " << ret << std::endl;
}
{
double begin = tik();
double ret = 0;
for(int i = 1; i < 100; i++)
ret += compute1(i);
double end = tik();
std::cout << "cost time: " << end - begin << " ret: " << ret << std::endl;
}
{
double begin = tik();
double ret = 0;
for(int i = 1; i < 100; i++)
ret += compute2(i);
double end = tik();
std::cout << "cost time: " << end - begin << " ret: " << ret << std::endl;
}
return 0;
}
the output is:
cost time: 0.105436 ret: 4.95e+13
cost time: 0.453676 ret: 8.17441e+11
cost time: 0.203873 ret: 8.17441e+11
WHY?

Compilers usually try to follow IEEE 754. In this standard, division is defined exactly: for every a/b there is a bit-exact answer. If the compiler rewrote this as a*(1/b), the result could differ slightly (you might see this effect if you print your doubles with 16-17 significant digits).
Compilers usually have an option to relax this. GCC has -ffast-math, VC has /fp:fast.
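A minimal sketch illustrating the point above (the program and sample values are my own, not from the question): it prints a / b and a * (1.0 / b) with 17 significant digits; for some inputs the two can differ in the last digit.
#include <iostream>
#include <iomanip>
int main() {
    double a = 1.0, b = 49.0;          // arbitrary sample values
    double q1 = a / b;                 // IEEE 754 division, correctly rounded
    double q2 = a * (1.0 / b);         // division rewritten as multiplication by the reciprocal
    std::cout << std::setprecision(17) << q1 << "\n" << q2 << "\n"
              << (q1 == q2 ? "equal" : "different") << "\n";
    return 0;
}
Built with plain g++ -O3 the division stays a division; adding -ffast-math permits the reciprocal rewrite (among other relaxations).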

Related

How to solve a weighted completion time minimization problem with CPLEX/C++?

This is a scheduling problem that minimizes a weighted completion time.
I would like to translate the problem shown in the image into C++ and solve it with the CPLEX solver.
Problem:
Min ∑_i w_i C_i
x_i ∈ {0, 1}
t = Δ + ∑_i x_i p_i1
C_i1 = ∑_{j=1}^{i} x_j p_j1
C_i2 = t + ∑_{j=1}^{i} (1 − x_j) p_j2
C_i ≥ C_i1
C_i ≥ C_i2 − Ω x_i
Ω = Δ + ∑_i max(p_i1, p_i2)
The goal is to minimize the weighted completion time on a single machine with at most one machine reconfiguration. I have a boolean variable x_i that tells me whether job i is done before or after the reconfiguration. The variable t is the instant after the reconfiguration (delta is the time to reconfigure the machine). p_i1 is the processing time in configuration 1 and p_i2 is the processing time in configuration 2. In addition, I have constraints on C_i (the completion time of job i).
The code I wrote is not working. I know I have a problem with how I build the C variable (but I don't know how to fix it).
#include <iostream>
#include <ilcplex/ilocplex.h>
#include <ilcp/cp.h>
#include <vector>
#include <algorithm>
using namespace std;
int main()
{
int nbJob = 4;
int delta = 2;
int nbConf = 2;
vector<int> w_job;
w_job.push_back(1); w_job.push_back(1); w_job.push_back(1); w_job.push_back(1);
vector<vector<int>>p_job;
p_job.resize(nbJob);
p_job[0].push_back(6); p_job[0].push_back(3);
p_job[1].push_back(1); p_job[1].push_back(2);
p_job[2].push_back(10); p_job[2].push_back(2);
p_job[3].push_back(1); p_job[3].push_back(9);
int max_p = 0;
int aux;
for (size_t i = 0; i < nbJob - 1; i++) {
aux = max(p_job[i][0], p_job[i][1]);
max_p = max(max_p, aux);
}
cout << max_p << endl;
int max_w = 0;
aux = 0;
for (size_t i = 0; i < nbJob - 1; i++) {
aux = max(w_job[i], w_job[i + 1]);
max_w = max(aux, max_w);
}
cout << max_w << endl;
int omega = 0;
for (size_t i = 0; i < nbJob; i++) {
omega += max(p_job[i][0], p_job[i][1]);
}
omega += delta;
cout << omega << endl;
try {
IloEnv env;
IloModel model(env);
IloBoolVarArray x(env, nbJob);
for (size_t i = 0; i < nbJob; i++) {
IloBoolVar xi(env, 0, 1, "xi");
x[i] = xi;
}
IloArray<IloExprArray> C(env, nbJob);
for (size_t i = 0; i < nbJob; i++) {
IloExprArray Ci(env);
for (size_t j = 0; j < nbConf; j++) {
IloExpr Cij(env);
Ci.add(Cij);
}
C[i] = Ci;
}
IloNumVarArray C_final(env, nbJob);
for (size_t i = 0; i < nbJob; i++) {
IloNumVar C_finali(env, 0, IloInfinity, ILOINT); //
C_final[i] = C_finali;
}
IloExpr t(env);
for (int i = 0; i < nbJob; i++) {
t += x[i] * p_job[i][0];
}
t += delta;
for (size_t i = 0; i < nbJob; i++) {
for (size_t j = 0; j < nbJob; j++) {
if (j <= i) {
C[i][0] += x[j] * p_job[j][0];
}
}
}
for (size_t i = 0; i < nbJob; i++) {
for (size_t j = 0; j < nbJob; j++) {
if (j <= i) {
C[i][1] += ((1 - x[j]) * p_job[j][1]);
}
}
C[i][1] += t;
}
for (size_t i = 0; i < nbJob; i++) {
model.add(C_final[i] >= C[i][0]);
}
for (size_t i = 0; i < nbJob; i++) {
model.add(C_final[i] >= C[i][1] - (omega * x[i]));
}
IloExpr wiCi(env);
for (size_t i = 0; i < nbJob; i++) {
wiCi += w_job[i] * C_final[i];
}
model.add(IloMinimize(env, wiCi));
IloCplex solver(model);
solver.solve();
for (int i = 0; i < nbJob; i++) {
cout << "C " << i + 1 << " = " << solver.getValue(C_final[i]) << " x" << i+1 << " = " << solver.getValue(x[i]) << endl;
}
cout << endl;
cout << "t = " << solver.getValue(t) << endl;
cout << "wiCi = " << solver.getObjValue() << endl << endl;
}
catch (IloException e) {
cout << e.getMessage();
}
}

Reliable comparison of double

I have an admittedly very basic problem: I need to compare two numbers of type double for >=. For some reason, however, my code evaluates to true for values I know to be less than the threshold.
EDIT: My code (the error occurs in the countTrig() method of the Antenna class):
#define k 0.0000000000000000000000138064852 // Boltzmann's constant
class Antenna{
vector<vector<double> > output;
int channels, smplrate, smpldur, samples, timethld;
double resistance, temp, bandwidth, lnanoise, lnagain, RMS;
public:
Antenna(
const int _channels, const int _smplrate, const int _smpldur,
const double _resistance, const double _temp, const double _bandwidth,
const double _lnanoise, const double _lnagain
){
channels = _channels; smplrate = _smplrate; smpldur = _smpldur;
resistance = _resistance; temp = _temp; bandwidth = _bandwidth;
lnanoise = _lnanoise; lnagain = _lnagain;
RMS = 2 * sqrt(4 * k * resistance * temp * bandwidth);
RMS *= lnagain * pow(10,(lnanoise/10));
samples = smplrate/smpldur;
timethld = 508; //= (1/smplrate) * 0.127;
}
void genThrml(int units);
void plotTrig(int timethld, double voltsthld);
void plotThrml();
int countTrig(double snrthld, int iter);
};
double fabs(double val){ if(val < 0){ val *= -1; } return val; }
void Antenna::genThrml(int units){
output.resize(samples, vector<double>(channels));
samples *= units;
gRandom->SetSeed(time(NULL));
for(int i = 0; i < samples; ++i){
for(int j = 0; j < channels; ++j){
output[i][j] = gRandom->Gaus(0,RMS);
}
}
}
void Antenna::plotThrml(){
//Filler
}
int Antenna::countTrig(double snrthld, int iter){
int count = 0;
int high = iter + timethld;
int low = iter - timethld;
if(low < 0){ low = 0; }
if(high > samples){ high = samples; }
for(int i = low; i < high; ++i){
for(int j = 0; j < channels; ++j){
if(output[i][j] >= snrthld) count++; std::cout << output[i][j] << " " << snrthld << "\n";
}
}
if(iter >= 3) return 1;
else return 0;
}
void Antenna::plotTrig(int timethld, double voltsthld){
double snrthld = voltsthld / RMS;
for(int i = 0; i < samples; ++i){
for(int j = 0; j < channels; ++j){
countTrig(snrthld, i);
}
}
}
int main(){
Antenna test(20,4000,1,50,290,500000000,1.5,60);
test.genThrml(1);
test.plotTrig(400,0.0005);
return 0;
}
With a threshold of 0.147417, I get output like this:
0.0014238
-0.00187276
I believe I understand the problem (unless there's some obvious mistake I've made and not caught), and I understand the reasoning behind floating point errors, precision, etc. I don't, however, know what the best practice is here. What is a good solution? How can I reliably compare values of type double? This will be used in an application where it is very important that values be precise and comparisons be reliable.
EDIT: A smaller example:
int countTrig(double snrthld, int iter, vector<vector<double> > output, int timethld){
int count = 0;
int high = iter + timethld;
int low = iter - timethld;
if(low < 0){ low = 0; }
if(high > 3){ high = 3; }
for(int i = low; i < high; ++i){
for(int j = 0; j < 3; ++j){
if(fabs(output[i][j]) >= snrthld) count++; std::cout << output[i][j] << " " << snrthld << "\n";
}
}
if(iter >= 3) return 1;
else return 0;
}
void plotTrig(int timethld, double snrthld){
vector<vector<double> > output = {{0.000028382, -0.0028348329, -0.00008573829},
{0.183849939, 0.9283829020, -0.92838200021},
{-0.00292889, 0.2399229929, -0.00081009189}};
for(int i = 0; i < 3; ++i){
for(int j = 0; j < 3; ++j){
countTrig(snrthld, i, output, timethld);
}
}
}
int main(){
plotTrig(1,0.1);
return 0;
}
You have a typo.
if(output[i][j] >= snrthld) count++; std::cout << output[i][j] << " " << snrthld << "\n";
This line means:
if(output[i][j] >= snrthld)
count++;
std::cout << output[i][j] << " " << snrthld << "\n";
aka
if(output[i][j] >= snrthld)
{
count++;
}
std::cout << output[i][j] << " " << snrthld << "\n";
and you want:
if(output[i][j] >= snrthld)
{
count++;
std::cout << output[i][j] << " " << snrthld << "\n";
}

Measuring time with chrono changes after printing

I want to measure the execution time of a program in ns in C++. For that purpose I am using the chrono library.
int main() {
const int ROWS = 200;
const int COLS = 200;
double input[ROWS][COLS];
int i,j;
auto start = std::chrono::steady_clock::now();
for (i = 0; i < ROWS; i++) {
for (j = 0; j < COLS; j++)
input[i][j] = i + j;
}
auto end = std::chrono::steady_clock::now();
auto res=std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Elapsed time in nanoseconds : "
<< res
<< " ns" << std::endl;
return 0;
}
I measured the time and it executed in 90 ns. However, when I add printing afterwards, the time changes.
int main() {
const int ROWS = 200;
const int COLS = 200;
double input[ROWS][COLS];
int i,j;
auto start = std::chrono::steady_clock::now();
for (i = 0; i < ROWS; i++) {
for (j = 0; j < COLS; j++)
input[i][j] = i + j;
}
auto end = std::chrono::steady_clock::now();
auto res=std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Elapsed time in nanoseconds : "
<< res
<< " ns" << std::endl;
for (i = 0; i < ROWS; i++) {
for (j = 0; j < COLS; j++)
std::cout<<input[i][j];
}
return 0;
}
The time changes to 89700 ns. What could be the problem? I only want to measure the execution time of the for loop.
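A hedged illustration, not part of the original question: if nothing reads input after the timed loop, the optimizer may drop the loop entirely, which would explain a 90 ns reading; consuming the result outside the measured region keeps the work while leaving the printing out of the timing. The checksum loop below is my own addition.
#include <chrono>
#include <iostream>
int main() {
    const int ROWS = 200;
    const int COLS = 200;
    static double input[ROWS][COLS];   // static: avoids a ~320 KB stack frame
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            input[i][j] = i + j;
    auto end = std::chrono::steady_clock::now();
    // Consume the data after timing so the compiler cannot discard the loop,
    // without putting I/O inside the measured region.
    double sink = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sink += input[i][j];
    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : " << res
              << " ns (checksum " << sink << ")" << std::endl;
    return 0;
}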

Smith Waterman for C++ (Visual Studio 14.0)

I would like to kill two birds with one stone, as the questions are very similar:
1:
I followed this code on GitHub, Smith Waterman Alignment, to create the Smith-Waterman algorithm in C++. After some research I understood that declaring
double H[N_a+1][N_b+1]; is not possible (anymore) in "newer" C++ versions, since the dimensions are not compile-time constants. So I changed this line to a dynamic allocation:
double **H = new double*[nReal + 1];
for (int i = 0; i < nReal + 1; i++)
H[i] = new double[nSynth + 1];
and applied the same scheme to int I_i[N_a+1][N_b+1], I_j[N_a+1][N_b+1]; and so on (everywhere a two-dimensional array occurs). Now I'm getting the exception:
Unhandled exception at 0x00007FFF7B413C58 in Smith-Waterman.exe: Microsoft C++ exception: std::bad_alloc at location 0x0000008FF4F9FA50.
What is wrong here? I have already debugged it, and the program throws the exception at the for (int i = 0; i < nReal + 1; i++) line.
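As an aside, a hedged sketch of an alternative to the raw new[] allocation (not from the question; allocateScoringTables is a hypothetical helper, and nReal/nSynth are the sequence lengths used in the code further down). std::vector keeps the H[i][j] indexing and frees the memory automatically:
#include <vector>
void allocateScoringTables(int nReal, int nSynth) {
    // (nReal + 1) x (nSynth + 1) tables, zero-initialized, no matching delete[] needed.
    std::vector<std::vector<double>> H(nReal + 1, std::vector<double>(nSynth + 1, 0.0));
    std::vector<std::vector<int>> Ii(nReal + 1, std::vector<int>(nSynth + 1, 0));
    std::vector<std::vector<int>> Ij(nReal + 1, std::vector<int>(nSynth + 1, 0));
    // ... fill H, Ii, Ij exactly as in the new[]-based code ...
}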
2: This code uses std::strings as parameters. Would it also be possible to create a Smith-Waterman algorithm for cv::Mat?
For maybe more clarification, my full code looks like this:
#include "BinaryAlignment.h"
#include "WallMapping.h"
//using declarations
using namespace cv;
using namespace std;
//global variables
std::string bin;
cv::Mat temp;
std::stringstream sstrMat;
const int maxMismatch = 2;
const float mu = 0.33f;
const float delta = 1.33;
int ind;
BinaryAlignment::BinaryAlignment() { }
BinaryAlignment::~BinaryAlignment() { }
/**
*** Convert matrix to binary sequence
**/
std::string BinaryAlignment::matToBin(cv::Mat src, std::experimental::filesystem::path path) {
cv::Mat linesMat = WallMapping::wallMapping(src, path);
for (int i = 0; i < linesMat.size().height; i++) {
for (int j = 0; j < linesMat.size().width; j++) {
if (linesMat.at<Vec3b>(i, j)[0] == 0
&& linesMat.at<Vec3b>(i, j)[1] == 0
&& linesMat.at<Vec3b>(i, j)[2] == 255) {
src.at<int>(i, j) = 1;
}
else {
src.at<int>(i, j) = 0;
}
sstrMat << src.at<int>(i, j);
}
}
bin = sstrMat.str();
return bin;
}
double BinaryAlignment::similarityScore(char a, char b) {
double result;
if (a == b)
result = 1;
else
result = -mu;
return result;
}
double BinaryAlignment::findArrayMax(double array[], int length) {
double max = array[0];
ind = 0;
for (int i = 1; i < length; i++) {
if (array[i] > max) {
max = array[i];
ind = i;
}
}
return max;
}
/**
*** Smith-Waterman alignment for given sequences
**/
int BinaryAlignment::watermanAlign(std::string seqSynth, std::string seqReal, bool viableAlignment) {
const int nSynth = seqSynth.length(); //length of sequences
const int nReal = seqReal.length();
//H[nSynth + 1][nReal + 1]
double **H = new double*[nReal + 1];
for (int i = 0; i < nReal + 1; i++)
H[i] = new double[nSynth + 1];
cout << "passt";
for (int m = 0; m <= nSynth; m++)
for (int n = 0; n <= nReal; n++)
H[m][n] = 0;
double temp[4];
int **Ii = new int*[nReal + 1];
for (int i = 0; i < nReal + 1; i++)
Ii[i] = new int[nSynth + 1];
int **Ij = new int*[nReal + 1];
for (int i = 0; i < nReal + 1; i++)
Ij[i] = new int[nSynth + 1];
for (int i = 1; i <= nSynth; i++) {
for (int j = 1; j <= nReal; j++) {
temp[0] = H[i - 1][j - 1] + similarityScore(seqSynth[i - 1], seqReal[j - 1]);
temp[1] = H[i - 1][j] - delta;
temp[2] = H[i][j - 1] - delta;
temp[3] = 0;
H[i][j] = findArrayMax(temp, 4);
switch (ind) {
case 0: // score in (i,j) stems from a match/mismatch
Ii[i][j] = i - 1;
Ij[i][j] = j - 1;
break;
case 1: // score in (i,j) stems from a deletion in sequence A
Ii[i][j] = i - 1;
Ij[i][j] = j;
break;
case 2: // score in (i,j) stems from a deletion in sequence B
Ii[i][j] = i;
Ij[i][j] = j - 1;
break;
case 3: // (i,j) is the beginning of a subsequence
Ii[i][j] = i;
Ij[i][j] = j;
break;
}
}
}
//Print matrix H to console
std::cout << "**********************************************" << std::endl;
std::cout << "The scoring matrix is given by " << std::endl << std::endl;
for (int i = 1; i <= nSynth; i++) {
for (int j = 1; j <= nReal; j++) {
std::cout << H[i][j] << " ";
}
std::cout << std::endl;
}
//search H for the maximal score
double Hmax = 0;
int imax = 0, jmax = 0;
for (int i = 1; i <= nSynth; i++) {
for (int j = 1; j <= nReal; j++) {
if (H[i][j] > Hmax) {
Hmax = H[i][j];
imax = i;
jmax = j;
}
}
}
std::cout << Hmax << endl;
std::cout << nSynth << ", " << nReal << ", " << imax << ", " << jmax << std::endl;
std::cout << "max score: " << Hmax << std::endl;
std::cout << "alignment index: " << (imax - jmax) << std::endl;
//Backtracing from Hmax
int icurrent = imax, jcurrent = jmax;
int inext = Ii[icurrent][jcurrent];
int jnext = Ij[icurrent][jcurrent];
int tick = 0;
char *consensusSynth = new char[nSynth + nReal + 2];
char *consensusReal = new char[nSynth + nReal + 2];
while (((icurrent != inext) || (jcurrent != jnext)) && (jnext >= 0) && (inext >= 0)) {
if (inext == icurrent)
consensusSynth[tick] = '-'; //deletion in A
else
consensusSynth[tick] = seqSynth[icurrent - 1]; //match / mismatch in A
if (jnext == jcurrent)
consensusReal[tick] = '-'; //deletion in B
else
consensusReal[tick] = seqReal[jcurrent - 1]; //match/mismatch in B
//fix for adding first character of the alignment.
if (inext == 0)
inext = -1;
else if (jnext == 0)
jnext = -1;
else
icurrent = inext;
jcurrent = jnext;
inext = Ii[icurrent][jcurrent];
jnext = Ij[icurrent][jcurrent];
tick++;
}
// Output of the consensus motif to the console
std::cout << std::endl << "***********************************************" << std::endl;
std::cout << "The alignment of the sequences" << std::endl << std::endl;
for (int i = 0; i < nSynth; i++) {
std::cout << seqSynth[i];
};
std::cout << " and" << std::endl;
for (int i = 0; i < nReal; i++) {
std::cout << seqReal[i];
};
std::cout << std::endl << std::endl;
std::cout << "is for the parameters mu = " << mu << " and delta = " << delta << " given by" << std::endl << std::endl;
for (int i = tick - 1; i >= 0; i--)
std::cout << consensusSynth[i];
std::cout << std::endl;
for (int j = tick - 1; j >= 0; j--)
std::cout << consensusReal[j];
std::cout << std::endl;
int numMismatches = 0;
for (int i = tick - 1; i >= 0; i--) {
if (consensusSynth[i] != consensusReal[i]) {
numMismatches++;
}
}
viableAlignment = numMismatches <= maxMismatch;
return imax - jmax;
}
Thanks!

AVX2 slower than SSE on Haswell

I have the following code (normal, SSE and AVX):
int testSSE(const aligned_vector & ghs, const aligned_vector & lhs) {
int result[4] __attribute__((aligned(16))) = {0};
__m128i vresult = _mm_set1_epi32(0);
__m128i v1, v2, vmax;
for (int k = 0; k < ghs.size(); k += 4) {
v1 = _mm_load_si128((__m128i *) & lhs[k]);
v2 = _mm_load_si128((__m128i *) & ghs[k]);
vmax = _mm_add_epi32(v1, v2);
vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *) result, vresult);
int mymax = result[0];
for (int k = 1; k < 4; k++) {
if (result[k] > mymax) {
mymax = result[k];
}
}
return mymax;
}
int testAVX(const aligned_vector & ghs, const aligned_vector & lhs) {
int result[8] __attribute__((aligned(32))) = {0};
__m256i vresult = _mm256_set1_epi32(0);
__m256i v1, v2, vmax;
for (int k = 0; k < ghs.size(); k += 8) {
v1 = _mm256_load_si256((__m256i *) & ghs[ k]);
v2 = _mm256_load_si256((__m256i *) & lhs[k]);
vmax = _mm256_add_epi32(v1, v2);
vresult = _mm256_max_epi32(vresult, vmax);
}
_mm256_store_si256((__m256i *) result, vresult);
int mymax = result[0];
for (int k = 1; k < 8; k++) {
if (result[k] > mymax) {
mymax = result[k];
}
}
return mymax;
}
int testNormal(const aligned_vector & ghs, const aligned_vector & lhs) {
int max = 0;
int tempMax;
for (int k = 0; k < ghs.size(); k++) {
tempMax = lhs[k] + ghs[k];
if (max < tempMax) {
max = tempMax;
}
}
return max;
}
All these functions are tested with the following code:
void alignTestSSE() {
aligned_vector lhs;
aligned_vector ghs;
int mySize = 4096;
int FinalResult;
int nofTestCases = 1000;
double time, time1, time2, time3;
vector<int> lhs2;
vector<int> ghs2;
lhs.resize(mySize);
ghs.resize(mySize);
lhs2.resize(mySize);
ghs2.resize(mySize);
srand(1);
for (int k = 0; k < mySize; k++) {
lhs[k] = randomNodeID(1000000);
lhs2[k] = lhs[k];
ghs[k] = randomNodeID(1000000);
ghs2[k] = ghs[k];
}
/* Warming UP */
for (int k = 0; k < nofTestCases; k++) {
FinalResult = testNormal(lhs, ghs);
}
for (int k = 0; k < nofTestCases; k++) {
FinalResult = testSSE(lhs, ghs);
}
for (int k = 0; k < nofTestCases; k++) {
FinalResult = testAVX(lhs, ghs);
}
cout << "===========================" << endl;
time = timestamp();
for (int k = 0; k < nofTestCases; k++) {
FinalResult = testSSE(lhs, ghs);
}
time = timestamp() - time;
time1 = time;
cout << "SSE took " << time << " s" << endl;
cout << "SSE Result: " << FinalResult << endl;
time = timestamp();
for (int k = 0; k < nofTestCases; k++) {
FinalResult = testAVX(lhs, ghs);
}
time = timestamp() - time;
time3 = time;
cout << "AVX took " << time << " s" << endl;
cout << "AVX Result: " << FinalResult << endl;
time = timestamp();
for (int k = 0; k < nofTestCases; k++) {
FinalResult = testNormal(lhs, ghs);
}
time = timestamp() - time;
cout << "Normal took " << time << " s" << endl;
cout << "Normal Result: " << FinalResult << endl;
cout << "SpeedUP SSE= " << time / time1 << " s" << endl;
cout << "SpeedUP AVX= " << time / time3 << " s" << endl;
cout << "===========================" << endl;
ghs.clear();
lhs.clear();
}
Where
inline double timestamp() {
struct timeval tp;
gettimeofday(&tp, NULL);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
And
typedef vector<int, aligned_allocator<int, sizeof (int)> > aligned_vector;
is an aligned vector using the AlignedAllocator of https://gist.github.com/donny-dont/1471329
I have an Intel i7-4771 (Haswell), the latest Ubuntu 14.04 64-bit, and gcc 4.8.2. Everything is up to date. I compiled with -march=native -mtune=native -O3 -m64.
Results are:
SSE took 0.000375986 s
SSE Result: 1982689
AVX took 0.000459909 s
AVX Result: 1982689
Normal took 0.00315714 s
Normal Result: 1982689
SpeedUP SSE= 8.39696 s
SpeedUP AVX= 6.8647 s
This shows that essentially the same code is 22% slower with AVX2 than with SSE. Am I doing something wrong, or is this normal behavior?
I converted your code to more vanilla C++ (plain arrays, no vectors, etc.), cleaned it up, and tested it with auto-vectorization disabled, getting reasonable results:
#include <iostream>
using namespace std;
#include <sys/time.h>
#include <cstdlib>
#include <cstdint>
#include <immintrin.h>
inline double timestamp() {
struct timeval tp;
gettimeofday(&tp, NULL);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
int testSSE(const int32_t * ghs, const int32_t * lhs, size_t n) {
int result[4] __attribute__((aligned(16))) = {0};
__m128i vresult = _mm_set1_epi32(0);
__m128i v1, v2, vmax;
for (int k = 0; k < n; k += 4) {
v1 = _mm_load_si128((__m128i *) & lhs[k]);
v2 = _mm_load_si128((__m128i *) & ghs[k]);
vmax = _mm_add_epi32(v1, v2);
vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *) result, vresult);
int mymax = result[0];
for (int k = 1; k < 4; k++) {
if (result[k] > mymax) {
mymax = result[k];
}
}
return mymax;
}
int testAVX(const int32_t * ghs, const int32_t * lhs, size_t n) {
int result[8] __attribute__((aligned(32))) = {0};
__m256i vresult = _mm256_set1_epi32(0);
__m256i v1, v2, vmax;
for (int k = 0; k < n; k += 8) {
v1 = _mm256_load_si256((__m256i *) & ghs[k]);
v2 = _mm256_load_si256((__m256i *) & lhs[k]);
vmax = _mm256_add_epi32(v1, v2);
vresult = _mm256_max_epi32(vresult, vmax);
}
_mm256_store_si256((__m256i *) result, vresult);
int mymax = result[0];
for (int k = 1; k < 8; k++) {
if (result[k] > mymax) {
mymax = result[k];
}
}
return mymax;
}
int testNormal(const int32_t * ghs, const int32_t * lhs, size_t n) {
int max = 0;
int tempMax;
for (int k = 0; k < n; k++) {
tempMax = lhs[k] + ghs[k];
if (max < tempMax) {
max = tempMax;
}
}
return max;
}
void alignTestSSE() {
int n = 4096;
int normalResult, sseResult, avxResult;
int nofTestCases = 1000;
double time, normalTime, sseTime, avxTime;
int lhs[n] __attribute__ ((aligned(32)));
int ghs[n] __attribute__ ((aligned(32)));
for (int k = 0; k < n; k++) {
lhs[k] = arc4random();
ghs[k] = arc4random();
}
/* Warming UP */
for (int k = 0; k < nofTestCases; k++) {
normalResult = testNormal(lhs, ghs, n);
}
for (int k = 0; k < nofTestCases; k++) {
sseResult = testSSE(lhs, ghs, n);
}
for (int k = 0; k < nofTestCases; k++) {
avxResult = testAVX(lhs, ghs, n);
}
time = timestamp();
for (int k = 0; k < nofTestCases; k++) {
normalResult = testNormal(lhs, ghs, n);
}
normalTime = timestamp() - time;
time = timestamp();
for (int k = 0; k < nofTestCases; k++) {
sseResult = testSSE(lhs, ghs, n);
}
sseTime = timestamp() - time;
time = timestamp();
for (int k = 0; k < nofTestCases; k++) {
avxResult = testAVX(lhs, ghs, n);
}
avxTime = timestamp() - time;
cout << "===========================" << endl;
cout << "Normal took " << normalTime << " s" << endl;
cout << "Normal Result: " << normalResult << endl;
cout << "SSE took " << sseTime << " s" << endl;
cout << "SSE Result: " << sseResult << endl;
cout << "AVX took " << avxTime << " s" << endl;
cout << "AVX Result: " << avxResult << endl;
cout << "SpeedUP SSE= " << normalTime / sseTime << endl;
cout << "SpeedUP AVX= " << normalTime / avxTime << endl;
cout << "===========================" << endl;
}
int main()
{
alignTestSSE();
return 0;
}
Test:
$ clang++ -Wall -mavx2 -O3 -fno-vectorize SO_avx.cpp && ./a.out
===========================
Normal took 0.00324106 s
Normal Result: 2143749391
SSE took 0.000527859 s
SSE Result: 2143749391
AVX took 0.000221968 s
AVX Result: 2143749391
SpeedUP SSE= 6.14002
SpeedUP AVX= 14.6015
===========================
I suggest you try the above code, with -fno-vectorize (or -fno-tree-vectorize if using g++), and see if you get similar results. If you do then you can work backwards towards your original code to see where the inconsistency might be coming from.
On my machine (Core i7-4900M), using the updated code from Paul R, built with g++ 4.8.2 and run with 100,000 iterations instead of 1,000, I get the following results:
g++ -Wall -mavx2 -O3 -std=c++11 test_avx.cpp && ./a.exe
SSE took 508,029 us
AVX took 1,308,075 us
Normal took 297,017 us
g++ -Wall -mavx2 -O3 -std=c++11 -fno-tree-vectorize test_avx.cpp && ./a.exe
SSE took 509,029 us
AVX took 1,307,075 us
Normal took 3,436,197 us
GCC is doing an amazing job optimizing the "Normal" code. Yet the slow performance of the "AVX" code can be explained by the lines below, which require a full 256-bit store (ouch!) followed by a scalar max search over 8 integers.
_mm256_store_si256((__m256i *) result, vresult);
int mymax = result[0];
for (int k = 1; k < 8; k++) {
if (result[k] > mymax) {
mymax = result[k];
}
}
return mymax;
It is better to stay with AVX intrinsics for the max over the 8 lanes. I can propose the following changes:
v1 = _mm256_permute2x128_si256(vresult,vresult,1); // from ABCD-EFGH to ????-ABCD
vresult = _mm256_max_epi32(vresult, v1);
v1 = _mm256_permute4x64_epi64(vresult,1); // from ????-ABCD to ????-??AB
vresult = _mm256_max_epi32(vresult, v1);
v1 = _mm256_shuffle_epi32(vresult,1); // from ????-??AB to ????-???A
vresult = _mm256_max_epi32(vresult, v1);
// no _mm256_extract_epi32 => need extra step
__m128i vres128 = _mm256_extracti128_si256(vresult,0);
return _mm_extract_epi32(vres128,0);
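A hedged alternative sketch (my own, not from the answer): reduce the 256-bit vector to 128 bits first and finish the horizontal max with SSE shuffles, which uses one cross-lane operation instead of three 256-bit permutes.
__m128i lo  = _mm256_castsi256_si128(vresult);            // lower 4 lanes
__m128i hi  = _mm256_extracti128_si256(vresult, 1);       // upper 4 lanes
__m128i m   = _mm_max_epi32(lo, hi);                      // 4 candidates left
m = _mm_max_epi32(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(1, 0, 3, 2)));  // 2 candidates
m = _mm_max_epi32(m, _mm_shuffle_epi32(m, _MM_SHUFFLE(2, 3, 0, 1)));  // 1 candidate in lane 0
return _mm_extract_epi32(m, 0);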
For a fair comparison, I also updated the SSE code accordingly. I then get:
SSE took 483,028 us
AVX took 258,015 us
Normal took 307,017 us
The AVX time has decreased by a factor of 5!
Manually unrolling the loop can speed up the SSE/AVX code above further (a sketch follows the timings below).
Original version on my i5-5300U:
Normal took 0.347 s
Normal Result: 2146591543
AVX took 0.409 s
AVX Result: 2146591543
SpeedUP AVX= 0.848411
After manual loop unrolling:
Normal took 0.375 s
Normal Result: 2146591543
AVX took 0.297 s
AVX Result: 2146591543
SpeedUP AVX= 1.26263
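A hedged sketch of the kind of manual unrolling referred to above (my own, under the assumption that n is a multiple of 16 and both arrays are 32-byte aligned, as in the test harness): two independent accumulators shorten the dependency chain on the running max.
#include <cstdint>
#include <immintrin.h>
int testAVX_unrolled(const int32_t * ghs, const int32_t * lhs, size_t n) {
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    for (size_t k = 0; k < n; k += 16) {
        // Two 8-lane add/max steps per iteration, feeding separate accumulators.
        __m256i a0 = _mm256_add_epi32(_mm256_load_si256((const __m256i *) &ghs[k]),
                                      _mm256_load_si256((const __m256i *) &lhs[k]));
        __m256i a1 = _mm256_add_epi32(_mm256_load_si256((const __m256i *) &ghs[k + 8]),
                                      _mm256_load_si256((const __m256i *) &lhs[k + 8]));
        acc0 = _mm256_max_epi32(acc0, a0);
        acc1 = _mm256_max_epi32(acc1, a1);
    }
    // Combine the accumulators, then reduce the 8 lanes to a single scalar max.
    __m256i acc = _mm256_max_epi32(acc0, acc1);
    int result[8] __attribute__((aligned(32)));
    _mm256_store_si256((__m256i *) result, acc);
    int mymax = result[0];
    for (int k = 1; k < 8; k++)
        if (result[k] > mymax) mymax = result[k];
    return mymax;
}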