Vectorizing a program increases runtime - c++

I am asked to vectorize a larger program. Before I started with the big program I wanted to see the effect of vectorization in isolated case. For this I created two programs that should show the idea of the outstanding transformation. One with an array of structs (no vec) and struct of arrays (with vec). I expected that the soa would outperform the aos by far, but it doesn't.
measured program loop A
for (int i = 0; i < NUM; i++) {
ptr[i].c = ptr[i].a + ptr[i].b;
}
full program:
#include <cstdlib>
#include <iostream>
#include <stdlib.h>
#include <chrono>
using namespace std;
using namespace std::chrono;
struct myStruct {
double a, b, c;
};
#define NUM 100000000
high_resolution_clock::time_point t1, t2, t3;
int main(int argc, char* argsv[]) {
struct myStruct *ptr = (struct myStruct *) malloc(NUM * sizeof(struct myStruct));
for (int i = 0; i < NUM; i++) {
ptr[i].a = i;
ptr[i].b = 2 * i;
}
t1 = high_resolution_clock::now();
for (int i = 0; i < NUM; i++) {
ptr[i].c = ptr[i].a + ptr[i].b;
}
t2 = high_resolution_clock::now();
long dur = duration_cast<microseconds>( t2 - t1 ).count();
cout << "took "<<dur << endl;
double sum = 0;
for (int i = 0; i < NUM; i++) {
sum += ptr[i].c;
}
cout << "sum is "<< sum << endl;
}
measured program loop B
#pragma simd
for (int i = 0; i < NUM; i++) {
C[i] = A[i] + B[i];
}
full program:
#include <cstdlib>
#include <iostream>
#include <stdlib.h>
#include <omp.h>
#include <chrono>
using namespace std;
using namespace std::chrono;
#define NUM 100000000
high_resolution_clock::time_point t1, t2, t3;
int main(int argc, char* argsv[]) {
double *A = (double *) malloc(NUM * sizeof(double));
double *B = (double *) malloc(NUM * sizeof(double));
double *C = (double *) malloc(NUM * sizeof(double));
for (int i = 0; i < NUM; i++) {
A[i] = i;
B[i] = 2 * i;
}
t1 = high_resolution_clock::now();
#pragma simd
for (int i = 0; i < NUM; i++) {
C[i] = A[i] + B[i];
}
t2 = high_resolution_clock::now();
long dur = duration_cast<microseconds>( t2 - t1 ).count();
cout << "Aos "<<dur << endl;
double sum = 0;
for (int i = 0; i < NUM; i++) {
sum += C[i];
}
cout << "sum "<<sum;
}
I compile with
icpc vectorization_aos.cpp -qopenmp --std=c++11 -cxxlib=/lrz/mnt/sys.x86_64/compilers/gcc/4.9.3/
icpc (v16)
compiled and executed on an Intel(R) Xeon(R) CPU E5-2697 v3 # 2.60GHz
in my test cases program A takes around 300ms, B 350ms. If I add unnecessary additional data to the struct in A it becomes increasingly slower (as more memory has to be loaded)
the -O3 flag does not have any impact on run-time
removing the #pragma simd directive does also not have impact. So either its auto vectorized or my vectorization does not work at all.
Questions:
am I missing something? Is this the way how one would vectorize a program?
Why is program 2 slower? Maybe the program is both times just memory bandwidth bound and I need to increase the computation density?
Are there programs/ code snippets that show the impact of vecotrization better and how can I verify that my program is actually executed vectorized.

Related

read/write to large array using large loop - execution time concerns

So recently I ran into a problem that I thought was interesting and I couldn't fully explain. I've highlighted the nature of the problem in the following code:
#include <cstring>
#include <chrono>
#include <iostream>
#define NLOOPS 10
void doWorkFast(int total, int *write, int *read)
{
for (int j = 0; j < NLOOPS; j++) {
for (int i = 0; i < total; i++) {
write[i] = read[i] + i;
}
}
}
void doWorkSlow(int total, int *write, int *read, int innerLoopSize)
{
for (int i = 0; i < NLOOPS; i++) {
for (int j = 0; j < total/innerLoopSize; j++) {
for (int k = 0; k < innerLoopSize; k++) {
write[j*k + k] = read[j*k + k] + j*k + k;
}
}
}
}
int main(int argc, char *argv[])
{
int n = 1000000000;
int *heapMemoryWrite = new int[n];
int *heapMemoryRead = new int[n];
for (int i = 0; i < n; i++)
{
heapMemoryRead[i] = 1;
}
std::memset(heapMemoryWrite, 0, n * sizeof(int));
auto start1 = std::chrono::high_resolution_clock::now();
doWorkFast(n,heapMemoryWrite, heapMemoryRead);
auto finish1 = std::chrono::high_resolution_clock::now();
auto duration1 = std::chrono::duration_cast<std::chrono::microseconds>(finish1 - start1);
for (int i = 0; i < n; i++)
{
heapMemoryRead[i] = 1;
}
std::memset(heapMemoryWrite, 0, n * sizeof(int));
auto start2 = std::chrono::high_resolution_clock::now();
doWorkSlow(n,heapMemoryWrite, heapMemoryRead, 10);
auto finish2 = std::chrono::high_resolution_clock::now();
auto duration2 = std::chrono::duration_cast<std::chrono::microseconds>(finish2 - start2);
std::cout << "Small inner loop:" << duration1.count() << " microseconds.\n" <<
"Large inner loop:" << duration2.count() << " microseconds." << std::endl;
delete[] heapMemoryWrite;
delete[] heapMemoryRead;
}
Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses. I understand that in the doWorkSlow implementation, we are doing one or two more operations to resolve j*k + k, however, I think it's reasonably safe to assume that relative to the time it takes to do the load/stores for memory read and write, the time contribution of these operations is negligible.
Nevertheless, doWorkSlow takes about twice as long (46.8s) compared to doWorkFast (25.5s) on my i7-3700 using g++ --version 7.5.0. While things like cache prefetching and branch prediction come to mind, I don't have a great explanation as to why doWorkFast is much faster than doWorkSlow. Does anyone have insight?
Thanks
Looking at the two doWork* functions, for every iteration, we are reading the same addresses adding the same value and writing to the same addresses.
This is not true!
In doWorkFast, you index each integer incrementally, as array[i].
array[0]
array[1]
array[2]
array[3]
In doWorkSlow, you index each integer as array[j*k + k], which jumps around and repeats.
When j is 10, for example, and you iterate k from 0 onwards, you are accessing
array[0] // 10*0+0
array[11] // 10*1+1
array[22] // 10*2+2
array[33] // 10*3+3
This will prevent your optimizer from using instructions that can operate on many adjacent integers at once.

Why is this vectorized code subject to vector size?

I compile the following code without vectorization (-O2) and compare the time with vectorization (-O3 -march=native) for three different vector lengths (determined by uncommenting the respective #define SIZE), obtaining 29::9, 247::145 and 4866::4884, for vector sizes 10000, 100000 and 1000000, respectively.
#include <iostream>
#include <random>
#include<chrono>
#include<cmath>
using namespace std;
using namespace std::chrono;
//#define SIZE (10000) // 29::9
//#define SIZE (100000) // 247::145
#define SIZE (1000000) // 4866::4884
void vector_op_2(int * __restrict__ v1, int * __restrict__ v2) {
for (unsigned i = 0; i < SIZE; i++)
v1[i] = 2 * v2[i];
}
int main() {
using namespace std;
int* v = new int[SIZE];
int* w = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
v[i] = i;
}
auto start = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
for (int k = 0; k < 5000; k++) {
vector_op_2(w, v);
}
auto end = duration_cast<milliseconds>(system_clock::now().time_since_epoch());
std::cout << "Time " << end.count() - start.count() << std::endl;
for (int i = 0; i < SIZE; i++) {
if (abs(w[i]-2*v[i])>0.01) {
throw 1;
}
}
delete v;
return 0;
}
Why does no speedup occur in the case of vector size 1000000?
What is the optimal length?
Why does this vector length issue not occur with the following example?
[shortened]
long vector_op_1(int v[SIZE]) throw()
{
long s = 0;
for (unsigned i=0; i<SIZE; i++) s += v[i];
return s;
}
[... I am using g++ 7 on Ubuntu 16.04 ...]
[... For short vector size 1000 I am achieving a 6:1 ratio! ...]

OpenMP reduction with Eigen::VectorXd

I am attempting to parallelize the below loop with an OpenMP reduction;
#define EIGEN_DONT_PARALLELIZE
#include <iostream>
#include <cmath>
#include <string>
#include <eigen3/Eigen/Dense>
#include <eigen3/Eigen/Eigenvalues>
#include <omp.h>
using namespace Eigen;
using namespace std;
VectorXd integrand(double E)
{
VectorXd answer(500000);
double f = 5.*E + 32.*E*E*E*E;
for (int j = 0; j !=50; j++)
answer[j] =j*f;
return answer;
}
int main()
{
omp_set_num_threads(4);
double start = 0.;
double end = 1.;
int n = 100;
double h = (end - start)/(2.*n);
VectorXd result(500000);
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel
{
#pragma omp for nowait
for (int j = 1; j <= n; j++){
E = start + (2*j - 1.)*h;
result = result + 4.*integrand(E);
if (j != n){
E = start + 2*j*h;
result = result + 2.*integrand(E);
}
}
}
for (int i=0; i <50 ; ++i)
cout<< i+1 << " , "<< result[i] << endl;
return 0;
}
This is definitely faster in parallel than without, but with all 4 threads, the results are hugely variable. When the number of threads is set to 1, the output is correct.
I would be most grateful if someone could assist me with this...
I am using the clang compiler with compile flags;
clang++-3.8 energy_integration.cpp -fopenmp=libiomp5
If this is a bust, then I'll have to learn to implement Boost::thread, or std::thread...
Your code does not define a custom reduction for OpenMP to reduce the Eigen objects. I'm not sure if clang supports user defined reductions (see OpenMP 4 spec, page 180). If so, you can declare a reduction and add reduction(+:result) to the #pragma omp for line. If not, you can do it yourself by changing your code as follows:
VectorXd result(500000); // This is the final result, not used by the threads
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel
{
// This is a private copy per thread. This resolves race conditions between threads
VectorXd resultPrivate(500000);
resultPrivate.fill(0.);
#pragma omp for nowait// reduction(+:result) // Assuming user-defined reductions aren't allowed
for (int j = 1; j <= n; j++) {
E = start + (2 * j - 1.)*h;
resultPrivate = resultPrivate + 4.*integrand(E);
if (j != n) {
E = start + 2 * j*h;
resultPrivate = resultPrivate + 2.*integrand(E);
}
}
#pragma omp critical
{
// Here we sum the results of each thread one at a time
result += resultPrivate;
}
}
The error you're getting (in your comment) seems to be due to a size mismatch. While there isn't a trivial one in your code itself, don't forget that when OpenMP starts each thread, it has to initialize a private VectorXd per thread. If none is supplied, the default would be VectorXd() (with a size of zero). When this object is the used, the size mismatch occurs. A "correct" usage of omp declare reduction would include the initializer part:
#pragma omp declare reduction (+: VectorXd: omp_out=omp_out+omp_in)\
initializer(omp_priv=VectorXd::Zero(omp_orig.size()))
omp_priv is the name of the private variable. It gets initialized by VectorXd::Zero(...). The size is specified using omp_orig. The standard
(page 182, lines 25-27) defines this as:
The special identifier omp_orig can also appear in the initializer-clause and it will refer to the storage of the original variable to be reduced.
In our case (see full example below), this is result. So result.size() is 500000 and the private variable is initialized to the correct size.
#include <iostream>
#include <string>
#include <Eigen/Core>
#include <omp.h>
using namespace Eigen;
using namespace std;
VectorXd integrand(double E)
{
VectorXd answer(500000);
double f = 5.*E + 32.*E*E*E*E;
for (int j = 0; j != 50; j++) answer[j] = j*f;
return answer;
}
#pragma omp declare reduction (+: Eigen::VectorXd: omp_out=omp_out+omp_in)\
initializer(omp_priv=VectorXd::Zero(omp_orig.size()))
int main()
{
omp_set_num_threads(4);
double start = 0.;
double end = 1.;
int n = 100;
double h = (end - start) / (2.*n);
VectorXd result(500000);
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel for reduction(+:result)
for (int j = 1; j <= n; j++) {
E = start + (2 * j - 1.)*h;
result += (4.*integrand(E)).eval();
if (j != n) {
E = start + 2 * j*h;
result += (2.*integrand(E)).eval();
}
}
for (int i = 0; i < 50; ++i)
cout << i + 1 << " , " << result[i] << endl;
return 0;
}

SSEx with a only 1.25x speed boost

a newbie in coding, really need your advice......
Recently I'm been trying some SSE coding to speed up simple calculations (addition and multiplication), I've been told there will be a 2x more speed boost with SSEx. But my result shows only a 1.25x boost, is there anything wrong with my code?
I've tried declaring the input arrays as global variables to maintain address continuity,not using local variables in SSE part, both in vain.
The following is the code,compiling with
g++ -mfpath=sse -mmmx -msse -msse2 -msse4.1 -O -Wall test.c
#define N 32768
#include<stdio.h>
#include<stdlib.h>
#include<stdint.h>
#include <smmintrin.h> //sse4.1
#include <emmintrin.h> //sse2
#include <xmmintrin.h> //sse
#include <mmintrin.h> //mmx
#include <time.h>
#include <string.h>
void init_with_rand(float *array);
float input1[N];
float input2[N];
float input3[N];
float output1[N];
float output2[N];
__m128 A,B,C,MUX,SUM;
int main(void)
{
clock_t t1, t2;
int i,j;
init_with_rand(input1);
init_with_rand(input2);
init_with_rand(input3);
t1 = clock();
for(j = 0; j < 1000000; j++){
for(i = 0; i < N; i++){
output1[i] = input1[i] * input2[i] + input3[i];
}
}
t1 = clock()-t1;
printf ("It took me %d clicks (%f seconds).\n",t1,((float)t1)/CLOCKS_PER_SEC);
/////////////////////////////////////////////////////////////////////////////////
t2 = clock();
for(j = 0; j < 1000000; j++){
for(i = 0; i < N; i+=4){
A = _mm_load_ps(input1+i);
B = _mm_load_ps(input2+i);
C = _mm_load_ps(input3+i);
MUX = _mm_mul_ps(A, B);
SUM = _mm_add_ps( MUX , C);
_mm_store_ps(output2+i, SUM);
}
}
t2 = clock()-t2;
printf ("It took me %d clicks (%f seconds).\n",t2,((float)t2)/CLOCKS_PER_SEC);
printf ("Performance is increased by %f times.\n",((float)t1/(float)t2));
if(!memcmp(output1,output2,N))
printf("Valid\n");
else if(memcmp(output1,output2,N))
printf("Invalid\n");
else
printf("Error\n");
return 0;
}
void init_with_rand(float *array)
{
int i;
for( i = 0; i < N; i++)
array[i] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX);
}
Thanks for any suggestion!

Avoid blas when involving temporary memory allocation?

I have a program that computes the matrix product x'Ay repeatedly. Is it better practice to compute this by making calls to MKL's blas, i.e. cblas_dgemv and cblas_ddot, which requires allocating memory to a temporary vector, or is better to simply take the sum of x_i * a_ij * y_j? In other words, does MKL's blas theoretically add any value?
I benchmarked this for my laptop. There was virtually no difference in each of the tests, other than g++_no_blas performed twice as poorly as the other tests (why?). There was also no difference between O2, O3 and Ofast.
g++_blas_static 57ms
g++_blas_dynamic 58ms
g++_no_blas 100ms
icpc_blas_static 57ms
icpc_blas_dynamic 58ms
icpc_no_blas 58ms
util.h
#ifndef UTIL_H
#define UTIL_H
#include <random>
#include <memory>
#include <iostream>
struct rng
{
rng() : unif(0.0, 1.0)
{
}
std::default_random_engine re;
std::uniform_real_distribution<double> unif;
double rand_double()
{
return unif(re);
}
std::unique_ptr<double[]> generate_square_matrix(const unsigned N)
{
std::unique_ptr<double[]> p (new double[N * N]);
for (unsigned i = 0; i < N; ++i)
{
for (unsigned j = 0; j < N; ++j)
{
p.get()[i*N + j] = rand_double();
}
}
return p;
}
std::unique_ptr<double[]> generate_vector(const unsigned N)
{
std::unique_ptr<double[]> p (new double[N]);
for (unsigned i = 0; i < N; ++i)
{
p.get()[i] = rand_double();
}
return p;
}
};
#endif // UTIL_H
main.cpp
#include <iostream>
#include <iomanip>
#include <memory>
#include <chrono>
#include "util.h"
#include "mkl.h"
double vtmv_blas(double* x, double* A, double* y, const unsigned n)
{
double temp[n];
cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, 1.0, A, n, y, 1, 0.0, temp, 1);
return cblas_ddot(n, temp, 1, x, 1);
}
double vtmv_non_blas(double* x, double* A, double* y, const unsigned n)
{
double r = 0;
for (unsigned i = 0; i < n; ++i)
{
for (unsigned j = 0; j < n; ++j)
{
r += x[i] * A[i*n + j] * y[j];
}
}
return r;
}
int main()
{
std::cout << std::fixed;
std::cout << std::setprecision(2);
constexpr unsigned N = 10000;
rng r;
std::unique_ptr<double[]> A = r.generate_square_matrix(N);
std::unique_ptr<double[]> x = r.generate_vector(N);
std::unique_ptr<double[]> y = r.generate_vector(N);
auto start = std::chrono::system_clock::now();
const double prod = vtmv_blas(x.get(), A.get(), y.get(), N);
auto end = std::chrono::system_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
end - start);
std::cout << "Result: " << prod << std::endl;
std::cout << "Time (ms): " << duration.count() << std::endl;
GCC no blas is poor because it does not use vectorized SMID instructions, while others all do. icpc will auto-vectorize you loop.
You don't show your matrix size, but generally gemv is memory bound. As the matrix is much larger than a temp vector, eliminating it may not be able to increase the performance a lot.