I want to optimize my application using vectorization. More specifically, I want to vectorize the mathematical operations on the std::complex<double> type. However, this seems to be quite difficult. Consider the following example:
#define TEST_LEN 100

#include <algorithm>
#include <complex>

typedef std::complex<double> cmplx;
using namespace std::complex_literals;

#pragma omp declare simd
cmplx add(cmplx a, cmplx b)
{
    return a + b;
}

#pragma omp declare simd
cmplx mult(cmplx a, cmplx b)
{
    return a * b;
}

void k(cmplx *x, cmplx *&y, int i0, int N)
{
#pragma omp for simd
    for (int i = i0; i < N; i++)
        y[i] = add(mult(-(1i + 1.0), x[i]), 1i);
}

int main(int argc, char **argv)
{
    cmplx *x = new cmplx[TEST_LEN];
    cmplx *y = new cmplx[TEST_LEN];

    for (int i = 0; i < TEST_LEN; i++)
        x[i] = 0;

    for (int i = 0; i < TEST_LEN; i++)
    {
        int N = std::min(4, TEST_LEN - i);
        k(x, y, i, N);
    }

    delete[] x;
    delete[] y;
    return 1;
}
I am using the g++ compiler. For this code the compiler gives the following warning:
warning: unsupported return type 'cmplx' {aka 'std::complex<double>'} for simd
for the lines containing the mult and add functions.
It seems like it is not possible to vectorize the std::complex<double> type like this.
Is there another way this can be achieved?
Not easily. SIMD works quite well when you have values in the next N steps that behave the same way. So consider for example an array of 2D vectors:
X Y X Y X Y X Y
If we were to do a vector addition operation here,
X Y X Y X Y X Y
+ + + + + + + +
X Y X Y X Y X Y
The compiler will nicely vectorise that operation. If, however, we wanted to do something different for the X and Y values, the memory layout becomes problematic for SIMD:
X Y X Y X Y X Y
+ / + / + / + /
X Y X Y X Y X Y
If you consider for example the multiplication case:
(a + bi)(c + di) = (ac - bd) + (ad + bc)i
Suddenly the operations are jumping between SIMD lanes, which is pretty much going to kill any decent vectorization.
Take a quick look at this godbolt: https://godbolt.org/z/rnVVgl
Addition boils down to some vaddps instructions (working on 8 floats at a time).
Multiply ends up using vfmadd231ss and vmulss (which both work on 1 float at a time).
The only easy way to automatically vectorise your complex code would be to separate out the real and imaginary parts into 2 arrays:
struct ComplexArray {
    float* real;
    float* imaginary;
};
Within this godbolt you can see that the compiler is now using vfmadd213ps instructions (so again back to working on 8 floats at a time).
https://godbolt.org/z/Ostaax
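For illustration, here is a minimal sketch of what an element-wise multiply over that layout could look like (the function and loop are mine, not taken from the godbolt; it assumes out does not alias a or b):

// Element-wise complex multiply on split real/imaginary arrays:
// (a + bi)(c + di) = (ac - bd) + (ad + bc)i
void multiply(const ComplexArray& a, const ComplexArray& b,
              ComplexArray& out, int n)
{
    for (int i = 0; i < n; ++i)
    {
        // Each operation reads and writes only index i, so nothing
        // jumps between SIMD lanes and the compiler can vectorise it.
        out.real[i]      = a.real[i] * b.real[i] - a.imaginary[i] * b.imaginary[i];
        out.imaginary[i] = a.real[i] * b.imaginary[i] + a.imaginary[i] * b.real[i];
    }
}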
I'm new to OpenMP and cannot for the life of me utilise multiple threads. I have my environment variable set. Here is a snippet of code which should simply iterate through the Mandelbrot set:
#include <omp.h>
#include <limits>
#define WIDTH 10000
#define HEIGHT 10000
#define INFINITY 2.0f
#define ITERATIONS 1000
using namespace std;
int main()
{
    #pragma omp parallel for
    for (size_t py = 0; py < HEIGHT; py++) {
        for (size_t px = 0; px < WIDTH; px++) {
            float x0 = -2.5f + (px * (1.0f - -2.5f) / WIDTH);
            float y0 = 1.0f + (py * (-1.0f - 1.0f) / HEIGHT);
            unsigned short iteration;
            float x = 0.0f;
            float y = 0.0f;
            for (iteration = 0; iteration < ITERATIONS; iteration++) {
                float xn = x * x - y * y + x0;
                y = 2 * x * y + y0;
                x = xn;
                if (x * x + y * y > INFINITY) {
                    break;
                }
            }
        }
    }
}
Whenever I run this, it never spawns additional threads. I feel I'm doing something horribly wrong. Any help would be appreciated, thanks.
I needed to add the flag -fopenmp to my compiler arguments. Now it works properly.
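For example (mandelbrot.cpp being a placeholder file name):

g++ -fopenmp mandelbrot.cpp -o mandelbrot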
To add to the missing -fopenmp flag: OpenMP supports different work-scheduling modes, and the simple equal (static) work distribution may not minimize run time for unbalanced workloads like Mandelbrot set generation.
#pragma omp parallel for schedule(dynamic)
for(...){}
can be faster when the work per pixel is not known in advance, with a speedup that depends on how unbalanced the work between pixels is.
At the moment I am trying to figure out why my naive matrix-matrix multiplication is slower (0.7 sec) when I use the overloaded parentheses operator (//first multiplication). If I don't use it (//second multiplication) and the multiplication directly accesses the class member array data_, it is about twice as fast (0.35 sec). I use my own matrix class as defined in Matrix.h.
Why is there such a significant difference in speed? Is there something wrong with my copy constructor? Is there so much "overhead" in calling the overloaded operator function that it justifies that kind of performance penalty?
There is one more question / weird behavior: when you exchange the two innermost loops (x and inner) with each other, the multiplication gets (of course) really slow, but both multiplications now take almost the SAME time (7 sec). Why do they take the same time in this case, when before there was a ~50% performance difference?
edit: The program is compiled the following way: g++ -c -std=c++0x -O3 -DNDEBUG
Thank you so much for your help!
My main function looks like this:
#include "Matrix.h"
int main(){
Matrix m1(1024,1024, 2.0);
Matrix m2(1024,1024, 2.5);
Matrix m3(1024,1024);
//first multiplication
for(int y = 0; y < 1024; ++y){
for(int inner = 0; inner < 1024; ++inner){
for(int x = 0; x < 1024; ++x){
m3(y,x) += m1(y, inner) * m2(inner, x);
}
}
}
//second multiplication
for(int y = 0; y < 1024; ++y){
for(int inner = 0; inner < 1024; ++inner){
for(int x = 0; x < 1024; ++x){
m3.data_[y*1024+x] += m1.data_[y*1024+inner]*m2.data_[inner*1024+inner];
}
}
}
}
And here is the relevant part of Matrix.h:

class Matrix{
public:
    Matrix();
    Matrix(int sizeY, int sizeX);
    Matrix(int sizeY, int sizeX, double init);
    Matrix(const Matrix & orig);
    ~Matrix(){delete[] data_;}

    double & operator() (int y, int x);
    double operator() (int y, int x) const;

    double * data_;

private:
    int sizeX_;
    int sizeY_;
};
And here is the implementation of Matrix.h:
Matrix::Matrix()
    : sizeX_(0),
      sizeY_(0),
      data_(nullptr)
{ }

Matrix::Matrix(int sizeY, int sizeX)
    : sizeX_(sizeX),
      sizeY_(sizeY),
      data_(new double[sizeX*sizeY]())
{
    assert( sizeX > 0 );
    assert( sizeY > 0 );
}

Matrix::Matrix(int sizeY, int sizeX, double init)
    : sizeX_(sizeX),
      sizeY_(sizeY)
{
    assert( sizeX > 0 );
    assert( sizeY > 0 );
    data_ = new double[sizeX*sizeY];
    std::fill(data_, data_+(sizeX_*sizeY_), init);
}

Matrix::Matrix(const Matrix & orig)
    : sizeX_(orig.sizeX_),
      sizeY_(orig.sizeY_)
{
    data_ = new double[orig.sizeY_*orig.sizeX_];
    std::copy(orig.data_, orig.data_+(sizeX_*sizeY_), data_);
}

double & Matrix::operator() (int y, int x){
    assert( x >= 0 && x < sizeX_);
    assert( y >= 0 && y < sizeY_);
    return data_[y*sizeX_ + x];
}

double Matrix::operator() (int y, int x) const {
    assert( x >= 0 && x < sizeX_);
    assert( y >= 0 && y < sizeY_);
    return data_[y*sizeX_ + x];
}
EDIT2: Turns out I used the wrong array access for the //second multiplication. I changed it to m3.data_[y*1024+x] += m1.data_[y*1024+inner]*m2.data_[inner*1024+x]; and now both multiplications take the same time.
Thank you very much for your help!
I think your two versions are not computing the same thing:
In the first you have:
m3(y,x) += m1(y, inner) * m2(inner, x);
But in the second you have
m3.data_[y*1024+x] += m1.data_[y*1024+inner]*m2.data_[inner*1024+inner];
The second one can factor inner out and instead do inner * (1024 + 1), which can be optimized in a number of ways that the first can't.
What are the outputs of the two versions? Do they match?
Edit: Another answerer is quite right to suggest that the dimensions in the class not being constant takes some optimizations off the table. In the first version the compiler doesn't know that the size is a power of two, so it uses a general-purpose multiplication; in the second version it knows that one of the operands is 1024 (not just constant, but a compile-time constant), so it can use fast multiplication (a left shift by 10, since 1024 = 2^10).
(Apologies for my earlier answer about NDEBUG: I had the page open for a while so didn't see your edit with the compilation line.)
I suspect the difference is that in the operator() version, sizeX_ is not const, and this may be preventing the compiler from optimizing, e.g. forcing it to reload sizeX_ from memory repeatedly instead of keeping it in a register. Try declaring sizeX_ and sizeY_ const in the class definition.
That, and you should inline the functions in the header, as has been suggested in the comments; a sketch of both changes follows below.
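A rough sketch of both suggestions combined, keeping the asker's names (just to illustrate the idea; note that const members also disable assignment for the class):

class Matrix{
public:
    Matrix(int sizeY, int sizeX);
    // Defined in the header so calls can be inlined:
    double & operator() (int y, int x)       { return data_[y*sizeX_ + x]; }
    double   operator() (int y, int x) const { return data_[y*sizeX_ + x]; }
    double * data_;
private:
    const int sizeX_; // const, so the compiler knows it cannot change
    const int sizeY_;
};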
I managed to get my sqrt function to run perfectly, but I'm second-guessing whether I wrote this code correctly based on the pseudo code I was given.
Here is the pseudo code:
x = 1
repeat 10 times: x = (x + n / x) / 2
return x.
The code I wrote:
#include <iostream>
#include <math.h>
using namespace std;
double my_sqrt_1(double n)
{
    double x= 1; x<10; ++x;
    return (x+n/x)/2;
}
No, your code is not following your pseudo-code. For example, you're not repeating anything in your code. You need to add a loop to do that:
#include <iostream>
#include <math.h>
using namespace std;
double my_sqrt_1(double n)
{
    double x = 1;
    for(int i = 0; i < 10; ++i) // repeat 10 times
        x = (x+n/x)/2;
    return x;
}
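For example, my_sqrt_1(2.0) now returns roughly 1.414214 after the 10 iterations, which matches sqrt(2).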
Let's analyze your code:
double x = 1;
// Ok, x set to 1
x < 10;
// This is true, as 1 is less than 10, but it is not used anywhere
++x;
// Increment x - now x == 2
return (x + n / x) / 2;
// return value is always (2 + n / 2) / 2
As you don't have any loop, function will always exit in the first "iteration" with the return value (2 + n / 2) / 2.
As another approach, you can use binary search; another pretty elegant solution is Newton's method.
Newton's method is a method for finding the roots of a function, making use of the function's derivative. At each step, a value is calculated as x_step = x_(step-1) - f(x_(step-1)) / f'(x_(step-1)) (see the Wikipedia article on Newton's method).
This might be faster than binary search. My implementation in C++:
double NewtonMethod(double n) {
    double eps = 0.0001; // the precision
    double x0 = 10;      // initial guess
    double x = x0 - (x0*x0 - n) / (2*x0);
    while (fabs(x - x0) > eps) { // iterate until the guess stabilizes
        x0 = x;
        double a = x0*x0 - n;    // f(x0) for f(x) = x^2 - n
        double r = a / (2*x0);   // f(x0) / f'(x0)
        x = x0 - r;              // Newton update
    }
    return x;
}
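For example, NewtonMethod(2.0) converges to roughly 1.41421 within a handful of iterations.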
Since people are showing different approaches to calculating the square root, I couldn't resist ;)...
Below is the exact copy (with the original comments, but without preprocessor directives) of the inverse square root implementation from Quake III Arena:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the...?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
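Note that this computes 1/sqrt(number); to get the square root itself, multiply the result by the input, since x * (1/sqrt(x)) = sqrt(x) for x > 0.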
I am trying to calculate complex numbers for a 2D array in C++. The code is running very slowly and I have narrowed down the main cause to be the exp function (the program runs quickly when I comment out that line, even though I have 4 nested loops).
#include <cmath>
#include <complex>
#include <vector>
using namespace std;

int main() {
    typedef vector< complex<double> > complexVect;
    typedef vector<double> doubleVect;
    const int SIZE = 256;
    vector<doubleVect> phi_w(SIZE, doubleVect(SIZE));
    vector<complexVect> phi_k(SIZE, complexVect(SIZE));
    complex<double> i (0, 1), cmplx (0, 0);
    complex<double> temp;
    int x, y, t, k, w;
    double dk = 2.0*M_PI / (SIZE-1);
    double dt = M_PI / (SIZE-1);
    int xPos, yPos;
    double arg, arg2, arg4;
    complex<double> arg3;
    double angle;
    vector<complexVect> newImg(SIZE, complexVect(SIZE));

    for (x = 0; x < SIZE; ++x) {
        xPos = -127 + x;
        for (y = 0; y < SIZE; ++y) {
            yPos = -127 + y;
            for (t = 0; t < SIZE; ++t) {
                temp = cmplx;
                angle = dt * t;
                arg = xPos * cos(angle) + yPos * sin(angle);
                for (k = 0; k < SIZE; ++k) {
                    arg2 = -M_PI + dk*k;
                    arg3 = exp(-i * arg * arg2);
                    arg4 = abs(arg) * M_PI / (abs(arg) + M_PI);
                    temp = temp + arg4 * arg3 * phi_k[k][t];
                }
            }
            newImg[y][x] = temp;
        }
    }
}
Is there a way I can improve computation time? I have tried using the following helper function but it doesn't noticeably help.
complex<double> complexexp(double arg) {
    // Euler's formula: e^(i*arg) = cos(arg) + i*sin(arg)
    // (pass a negated argument for e^(-i*arg))
    complex<double> temp (cos(arg), sin(arg));
    return temp;
}
I am using clang++ to compile my code.
edit: I think the problem is the fact that I'm trying to calculate complex numbers. Would it be faster if I just used Euler's formula to calculate the real and imaginary parts in separate arrays and not have to deal with the complex class?
maybe this will work for you:
http://martin.ankerl.com/2007/02/11/optimized-exponential-functions-for-java/
I've had a look with callgrind. The only marginal improvement (~1.3% with size = 50) I could find was to change:
temp = temp + arg4 * arg3 * phi_k[k][t];
to
temp += arg4 * arg3 * phi_k[k][t];
The most costly function calls were sin()/cos(). I suspect that calling exp() with a complex number argument calls those functions in the background.
To retain precision, the function will compute very slowly and there doesn't seem to be a way around it. However, you could trade accuracy for speed, which seems to be what game developers would do: sin and cos are slow, is there an alternative?
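As a sketch of the idea from the question's edit (this rewrite of the inner-loop line is mine): since exp(-i*theta) = cos(theta) - i*sin(theta), the complex exp() call can be replaced by one real cos() and one real sin():

// Inside the k loop; exp(-i*theta) via Euler's formula:
double theta = arg * arg2;
complex<double> arg3(cos(theta), -sin(theta));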
You can define the number e as a constant and use the std::pow() function.
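A minimal sketch of that suggestion (note that std::exp(x) computes e^x directly):

const double e = 2.718281828459045; // Euler's number as a constant
double result = std::pow(e, x);     // e^x via std::pow, for some exponent x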
We have the following serial C code operating on
two vectors a[] and b[]:
double a[20000], b[20000], r = 0.9, errors = 0.0;
for (int i = 1; i <= 10000; ++i)
{
    a[i] = r*a[i] + (1-r)*b[i];
    errors = max(errors, fabs(a[i] - b[i]));
    b[i] = a[i];
}
Please tell us how this code can be ported to CUDA and cuBLAS.
It's also possible to implement this reduction in Thrust using thrust::transform_reduce. This solution fuses the entire operation, as talonmies suggests:
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
// this functor unpacks a tuple and then computes
// a weighted absolute difference of its members
struct weighted_absolute_difference
{
    double r;

    weighted_absolute_difference(const double r)
        : r(r)
    {}

    __host__ __device__
    double operator()(thrust::tuple<double,double> t)
    {
        double a = thrust::get<0>(t);
        double b = thrust::get<1>(t);

        a = r * a + (1.0 - r) * b;

        return fabs(a - b);
    }
};

int main()
{
    using namespace thrust;

    const std::size_t n = 20000;
    const double r = 0.9;
    device_vector<double> a(n), b(n);

    // initialize a & b
    ...

    // do the reduction
    double result =
        transform_reduce(make_zip_iterator(make_tuple(a.begin(), b.begin())),
                         make_zip_iterator(make_tuple(a.end(), b.end())),
                         weighted_absolute_difference(r),
                         -1.f,
                         maximum<double>());

    // note that this solution does not set
    // a[i] = r * a[i] + (1 - r) * b[i]

    return 0;
}
Note that we do not perform the assignment a[i] = r * a[i] + (1 - r) * b[i] in this solution, though it would be simple to do so after the reduction using thrust::transform. It is not safe to modify transform_reduce's arguments in either functor.
The second line in your loop:
errors = max(errors, fabs(a[i] - b[i]));
is known as a reduction. Fortunately there is reduction example code in the CUDA SDK - take a look at this and use it as a template for your algorithm.
You probably want to split this into two separate operations (possibly as two separate kernels) - one for the parallel part (calculating the updated a[] and b[] values) and a second for the reduction (calculating errors).
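As a hedged sketch, that first kernel could look something like this (the kernel name and launch configuration are mine, not from the question):

// One thread per element: a[i] = r*a[i] + (1-r)*b[i]; b[i] = a[i];
__global__ void update(double *a, double *b, double r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double v = r * a[i] + (1.0 - r) * b[i];
        a[i] = v;
        b[i] = v;
    }
}

// Launched e.g. as: update<<<(n + 255) / 256, 256>>>(d_a, d_b, 0.9, n);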