Improving computation speed for sine/cosine and large arrays - c++

For signal processing I need to compute relatively large C arrays, as shown in the code below. This works so far, but unfortunately the implementation is slow. The size of "calibdata" is around 150k, and the result needs to be calculated for different frequencies/phases. Is there a way to improve speed significantly? Doing the same with logical indexing in MATLAB is way faster.
What I tried already:
Using a Taylor approximation of sine: no significant improvement.
Using std::vector: also no significant improvement.
code:
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result = 0; // accumulator for the weighted cosine sum
for (int i = 0; i < size; i++)
result += calibdata[i] * cos((2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180) - (PI / 2)));
result = fabs(result / size);
return result;}
Best regards,
Thomas

When optimizing code for speed, step 1 is to enable compiler optimizations. I hope you've done that already.
Step 2 is to profile the code and see exactly how the time is being spent. Without profiling, you're just guessing, and you could end up trying to optimize the wrong thing.
For example, your guess seems to be that the cos function is the bottleneck. But the other possibility is that the calculation of the angle is the bottleneck. Here's how I would refactor the code to reduce the time spent calculating the angle.
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier)
{
double result = 0;
double angle = phase * (PI / 180) - (PI / 2);
double delta = 2 * PI * freqscale[currentcarrier] / fs;
for (int i = 0; i < size; i++)
{
result += calibdata[i] * cos( angle );
angle += delta;
}
return fabs(result / size);
}

Okay, I'm probably going to get flogged for this answer, but I would use the GPU for this. Because your array doesn't appear to be self-referential, the best speedup you're going to get for large arrays is through parallelization... by far. I don't use MATLAB, but I just did a quick search for GPU utilization on the MathWorks site:
http://www.mathworks.com/company/newsletters/articles/gpu-programming-in-matlab.html?requestedDomain=www.mathworks.com
Outside of MATLAB you could use OpenCL or CUDA yourself.

Your enemies in execution time are:
Division
Function calls (including implicit ones in loops)
Accessing data from different areas
Executing dissimilar instructions
You should read up on data-driven programming and on using the data cache effectively.
Division
Whether it is done with hardware or software support, division takes a long time by its very nature. Eliminate it if possible, either by changing the numeric base or by factoring it out of the loop.
Function Calls
The most efficient method of execution is sequential. Processors are optimized for this. A branch may require the processor to perform some additional calculation (branch prediction) or to reload the instruction cache / pipeline. That is time that could have been spent executing data instructions.
The optimization for this is to use techniques like loop unrolling and inlining of small functions. Also reduce the quantity of branches by simplifying expressions and using Boolean algebra.
Accessing data from different areas
Modern processors are optimized to operate on local data (data in one area). One example is loading an internal cache with data. Specifically, loading a cache line with data. For example, if the data from your arrays is in one location and the cosine data in another, this may cause the data cache to be reloaded, again wasting time.
A better solution is to place all data contiguously or to contiguously access all the data. Rather than making many discontiguous accesses to the cosine table, look up a batch of cosine values sequentially (without any other data accesses between).
Dissimilar Instructions
Modern processors are more efficient at processing a batch of similar instructions. For example the pattern load, add, store is more efficient for blocks when all the loading is performed, then all adding, then all storing.
Summary
Here's an example:
register double result = 0.0;
register unsigned int i = 0U;
for (i = 0; i < size; i += 2)
{
register double cos_angle1 = /* ... */;
register double cos_angle2 = /* ... */;
result += calibdata[i + 0] * cos_angle1;
result += calibdata[i + 1] * cos_angle2;
}
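The above loop is unrolled and like operations are performed in groups. Note that, as written, it assumes size is even; an odd size needs one extra step for the last element.
The register keyword was deprecated in C++11 and removed in C++17 (it is still valid in C). It is only a suggestion to the compiler to use dedicated registers (if possible).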
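As a minimal sketch of the unrolling idea above, here is a version that also handles an odd size. The function name unrolledPhaseSum and its parameters angle0 and delta are made up for illustration; they are not from the original code. It uses two independent accumulators, which is one common way to unroll a sum, and it still calls cos directly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: sums calibdata[i] * cos(angle0 + i * delta) with the
// loop unrolled by two. The two accumulators are independent of each other,
// and the tail step handles the last element when size is odd.
double unrolledPhaseSum(const double* calibdata, std::size_t size,
                        double angle0, double delta)
{
    assert(calibdata != nullptr || size == 0);
    double sum0 = 0.0;
    double sum1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < size; i += 2) {
        // Group the like operations: first both cosines, then both updates.
        const double c0 = std::cos(angle0 + static_cast<double>(i) * delta);
        const double c1 = std::cos(angle0 + static_cast<double>(i + 1) * delta);
        sum0 += calibdata[i] * c0;
        sum1 += calibdata[i + 1] * c1;
    }
    if (i < size) {
        // Leftover element when size is odd.
        sum0 += calibdata[i] * std::cos(angle0 + static_cast<double>(i) * delta);
    }
    return sum0 + sum1;
}
```

Whether this beats what the optimizer already produces is something only a profile on your machine can tell.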

You can try to use the definition of cosine based on the complex exponential:
cos(x) = (exp(j*x) + exp(-j*x)) / 2
where j^2=-1.
Store exp((2 * PI*freqscale[currentcarrier] / fs)*j) and exp(phase*j). Evaluating cos(...) then reduces to a couple of products and additions in the for loops, and sin(), cos() and exp() are only called a couple of times.
Here goes the implementation:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <time.h>
#define PI 3.141592653589
typedef struct cos_plan{
double complex* expo;
int size;
}cos_plan;
double phase_func(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier){
double result=0; //initialization
for (int i = 0; i < size; i++){
result += calibdata[i] * cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) );
//printf("i %d cos %g\n",i,cos ( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.) - (PI / 2.)) ));
}
result = fabs(result / size);
return result;
}
double phase_func2(double* calibdata, long size, double* freqscale, double fs, double phase, int currentcarrier, cos_plan* plan){
//first, let's compute the exponentials:
//double complex phaseexp=cos(phase*(PI / 180.) - (PI / 2.))+sin(phase*(PI / 180.) - (PI / 2.))*I;
//double complex phaseexpm=conj(phaseexp);
double phasesin=sin(phase*(PI / 180.) - (PI / 2.));
double phasecos=cos(phase*(PI / 180.) - (PI / 2.));
if (plan->size<size){
double complex *tmp=realloc(plan->expo,size*sizeof(double complex));
if(tmp==NULL){fprintf(stderr,"realloc failed\n");exit(1);}
plan->expo=tmp;
plan->size=size;
}
plan->expo[0]=1;
//plan->expo[1]=exp(2 *I* PI*freqscale[currentcarrier]/fs);
plan->expo[1]=cos(2 * PI*freqscale[currentcarrier]/fs)+sin(2 * PI*freqscale[currentcarrier]/fs)*I;
//printf("%g %g\n",creall(plan->expo[1]),cimagl(plan->expo[1]));
for(int i=2;i<size;i++){
if(i%2==0){
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2];
}else{
plan->expo[i]=plan->expo[i/2]*plan->expo[i/2+1];
}
}
//computing the result
double result=0; //initialization
for(int i=0;i<size;i++){
//double coss=0.5*creall(plan->expo[i]*phaseexp+conj(plan->expo[i])*phaseexpm);
double coss=creal(plan->expo[i])*phasecos-cimag(plan->expo[i])*phasesin; // creal/cimag are the double variants; creall/cimagl are for long double complex
//printf("i %d cos %g\n",i,coss);
result+=calibdata[i] *coss;
}
result = fabs(result / size);
return result;
}
int main(){
//the parameters
long n=100000000;
double* calibdata=malloc(n*sizeof(double));
if(calibdata==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
int freqnb=42;
double* freqscale=malloc(freqnb*sizeof(double));
if(freqscale==NULL){fprintf(stderr,"malloc failed\n");exit(1);}
for (int i = 0; i < freqnb; i++){
freqscale[i]=i*i*0.007+i;
}
double fs=n;
double phase=0.05;
//populate calibdata
for (int i = 0; i < n; i++){
calibdata[i]=i/((double)n);
calibdata[i]=calibdata[i]*calibdata[i]-calibdata[i]+0.007/(calibdata[i]+3.0);
}
//call to sample code
clock_t t;
t = clock();
double res=phase_func(calibdata,n, freqscale, fs, phase, 13);
t = clock() - t;
printf("first call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//initialize
cos_plan plan;
plan.expo=malloc(n*sizeof(double complex));
plan.size=n;
t = clock();
res=phase_func2(calibdata,n, freqscale, fs, phase, 13,&plan);
t = clock() - t;
printf("second call got %g in %g seconds.\n",res,((float)t)/CLOCKS_PER_SEC);
//cleaning
free(plan.expo);
free(calibdata);
free(freqscale);
return 0;
}
Compile with gcc main.c -o main -std=c99 -lm -Wall -O3. Using the code you provided, it takes 8 seconds with size=100000000 on my computer, while the proposed solution takes 1.5 seconds... It is not so impressive, but it is not negligible.
The solution that is presented does not involve any call to cos or sin in the for loops. Indeed, there are only multiplications and additions. The bottleneck is either the memory bandwidth or the tests and memory accesses in the exponentiation by squaring (most likely the first, since I had to use an additional array of complex numbers).
For complex numbers in C, see:
How to work with complex numbers in C?
Computing e^(-j) in C
If the problem is memory bandwidth, then parallelism is required... and directly computing cos would be easier. Additional simplifications could have been performed if freqscale[currentcarrier] / fs were an integer. Your problem is really close to the computation of the Discrete Cosine Transform, the present trick is close to the Discrete Fourier Transform, and the FFTW library is really good at computing these transforms.
Notice that the present code can produce inaccurate results due to loss of significance: result can be much larger than cos(...)*calibdata[] when size is large. Using partial sums can resolve the issue.
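To illustrate the partial-sums remark, here is a rough sketch that accumulates in fixed-size blocks, so that each running sum stays closer in magnitude to the terms added to it. The name phaseFuncBlocked, the scalar freq parameter (standing in for freqscale[currentcarrier]) and the default block size of 1024 are my own choices for the example. For brevity it calls std::cos directly rather than using the precomputed exponentials.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical variant of phase_func that sums block by block.
// freq stands in for freqscale[currentcarrier]; blockSize is an arbitrary choice.
double phaseFuncBlocked(const double* calibdata, std::size_t size,
                        double freq, double fs, double phaseDeg,
                        std::size_t blockSize = 1024)
{
    assert(calibdata != nullptr && size > 0 && blockSize > 0 && fs != 0.0);
    const double pi = 3.14159265358979323846;
    const double angle0 = phaseDeg * (pi / 180.0) - pi / 2.0;
    const double delta = 2.0 * pi * freq / fs;

    double total = 0.0;
    for (std::size_t start = 0; start < size; start += blockSize) {
        const std::size_t stop = (start + blockSize < size) ? start + blockSize : size;
        double partial = 0.0; // only ever holds the sum of one block
        for (std::size_t i = start; i < stop; ++i) {
            partial += calibdata[i] * std::cos(angle0 + delta * static_cast<double>(i));
        }
        total += partial; // one addition per block into the large sum
    }
    return std::fabs(total / static_cast<double>(size));
}
```

Pairwise or compensated (Kahan) summation are alternatives along the same lines if the blocked version is still not accurate enough.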

Simple trig identity to eliminate the - (PI / 2). This is also more accurate than attempting the subtraction which uses machine_PI. This is important when values are near π/2.
cosine(x - π/2) == sine(x)
Use of const and restrict: Good compilers can perform more optimizations with this knowledge. (See also @user3528438)
// double phase_func(double* calibdata, long size,
// double* freqscale, double fs, double phase, int currentcarrier) {
double phase_func(const double* restrict calibdata, long size,
const double* restrict freqscale, double fs, double phase, int currentcarrier) {
Some platforms perform faster calculations with float vs double with a tolerable loss of precision. YMMV. Profile code both ways.
// result += calibdata[i] * cos(...
result += calibdata[i] * cosf(...
Minimize recalculations.
double angle_delta = ...;
double angle_current = ...;
for (int i = 0; i < size; i++) {
result += calibdata[i] * cos(angle_current);
angle_current += angle_delta;
}
It is unclear why the code uses long size and int currentcarrier. I'd expect the same type for both, namely size_t, which is idiomatic for array indexing. @Daniel Jour
Reversing loops can allow a compare to 0 rather than a compare to a variable. Sometimes a modest performance gain.
Ensure compiler optimizations are well enabled.
All together
double phase_func2(const double* restrict calibdata, size_t size,
const double* restrict freqscale, double fs, double phase,
size_t currentcarrier) {
double result = 0.0;
double angle_delta = 2.0 * PI * freqscale[currentcarrier] / fs;
double angle_current = angle_delta * (size - 1) + phase * (PI / 180);
size_t i = size;
while (i) {
result += calibdata[--i] * sinf(angle_current); // cos(x - PI/2) == sin(x); the sign would not survive fabs() anyway
angle_current -= angle_delta;
}
result = fabs(result / size);
return result;
}

To leverage the cores you have without resorting to the GPU, use OpenMP. Testing with VS2015, the invariants are lifted out of the loop by the optimizer. I enabled AVX2 and OpenMP.
double phase_func3(double* calibdata, const int size, const double* freqscale,
const double fs, const double phase, const size_t currentcarrier)
{
double result{};
constexpr double PI = 3.141592653589;
#pragma omp parallel
#pragma omp for reduction(+: result)
for (int i = 0; i < size; ++i) {
result += calibdata[i] *
cos( (2 * PI*freqscale[currentcarrier] * i / fs) + (phase*(PI / 180.0) - (PI / 2.0)));
}
result = fabs(result / size);
return result;
}
The original version with AVX enabled took: ~1.4 seconds
and adding OpenMP brought it down to: ~0.51 seconds.
Pretty nice return for two pragmas and a compiler switch.

Related

study of FFT - Why it's not fast?

I am not sure if this is more of a math or a programming question. If it's math, please tell me.
I know there are a lot of free, ready-to-use FFT projects. But I am trying to understand the FFT method, just for fun and for study. So I implemented both algorithms, DFT and FFT, to compare them.
But I have a problem with my FFT. There seems to be no big difference in efficiency. My FFT is only a little bit faster than the DFT (in some cases it's two times faster, but that's the maximum speedup).
Most articles about FFT mention bit reversal. But I don't see the reason to use bit reversal. Probably that is the problem: I don't understand it. Please help me. What am I doing wrong?
This is my code (you can copy it here and see how it works - online compiler):
#include <complex>
#include <iostream>
#include <math.h>
#include <cmath>
#include <vector>
#include <chrono>
#include <ctime>
float _Pi = 3.14159265;
float sampleRate = 44100;
float resolution = 4;
float _SRrange = sampleRate / resolution; // I divide the sample rate to make the loop smaller,
//just to perform tests faster
float bufferSize = 512;
// Clock class is for measure time to execute whole loop:
class Clock
{
public:
Clock() { start = std::chrono::high_resolution_clock::now(); }
~Clock() {}
float secondsElapsed()
{
auto stop = std::chrono::high_resolution_clock::now();
// Note: despite the function name, this returns elapsed microseconds.
return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
void reset() { start = std::chrono::high_resolution_clock::now(); }
private:
std::chrono::time_point<std::chrono::high_resolution_clock> start;
};
// Function to calculate magnitude of complex number:
float _mag_Hf(std::complex<float> sf);
// Function to calculate exp(-j*2*PI*n*k / sampleRate) - where "j" is imaginary number:
std::complex<float> _Wnk_Nc(float n, float k);
// Function to calculate exp(-j*2*PI*k / sampleRate):
std::complex<float> _Wk_Nc(float k);
int main() {
float scaleFFT = 512; // divide and conquer - if it's "1" then the whole algorithm is simply a DFT
// I wonder what the maximum of that value is. I always thought it should be equal to
// buffer size (number of samples) but above some value it starts to work slower than the DFT
std::vector<float> inputSignal; // array of input signal
inputSignal.resize(bufferSize); // how many sample we will use to calculate Fourier Transform
std::vector<std::complex<float>> _Sf; // array to store Fourier Transform value for each measured frequency bin
_Sf.resize(scaleFFT); // resize it to size which we need.
std::vector<std::complex<float>> _Hf_Db_vect; //array to store magnitude (in logarithmic dB scale)
//for each measured frequency bin
_Hf_Db_vect.resize(_SRrange); //resize it to make it able to store value for each measured freq value
std::complex<float> _Sf_I_half; // complex to calculate first half of freq range
// from 1 to Nyquist (sampleRate/2)
std::complex<float> _Sf_II_half; // complex to calculate second half of freq range
//from Nyquist to sampleRate
for(int i=0; i<(int)_Sf.size(); i++)
inputSignal[i] = cosf((float)i/_Pi); // fill the input signal with some data, no matter
Clock _time; // Start measure time
for(int freqBinK=0; freqBinK < _SRrange/2; freqBinK++) // start calculating all freqs (divide by 2 for two halves)
{
for(int i=0; i<(int)_Sf.size(); i++) _Sf[i] = 0.0f; // clean all values, for next loop we need all values to be zero
for (int n=0; n<bufferSize/_Sf.size(); ++n) // Here I take all samples in buffer
{
std::complex<float> _W = _Wnk_Nc(_Sf.size()*(float)n, freqBinK);
for(int i=0; i<(int)_Sf.size(); i++) // Finally here is my divide and conquer
_Sf[i] += inputSignal[_Sf.size()*n +i] * _W; // And I see no reason to use any bit reversal, where should it be????
}
std::complex<float> _Wk = _Wk_Nc(freqBinK);
_Sf_I_half = 0.0f;
_Sf_II_half = 0.0f;
for(int z=0; z<(int)_Sf.size()/2; z++) // here I calculate Fourier transform for each freq
{
_Sf_I_half += _Wk_Nc(2.0f * (float)z * freqBinK) * (_Sf[2*z] + _Wk * _Sf[2*z+1]); // First half - to Nyquist
_Sf_II_half += _Wk_Nc(2.0f * (float)z *freqBinK) * (_Sf[2*z] - _Wk * _Sf[2*z+1]); // Second half - to SampleRate
// also don't see the need to use bit reversal, where should it be??? :)
}
// Calculate magnitude in dB scale
_Hf_Db_vect[freqBinK] = _mag_Hf(_Sf_I_half); // First half
_Hf_Db_vect[freqBinK + _SRrange/2] = _mag_Hf(_Sf_II_half); // Second half
}
std::cout << _time.secondsElapsed() << std::endl; // time measured after execution of the whole loop
}
float _mag_Hf(std::complex<float> sf)
{
float _Re_2;
float _Im_2;
_Re_2 = sf.real() * sf.real();
_Im_2 = sf.imag() * sf.imag();
return 20*log10(pow(_Re_2 + _Im_2, 0.5f)); //transform magnitude to logarithmic dB scale
}
std::complex<float> _Wnk_Nc(float n, float k)
{
std::complex<float> _Wnk_Ncomp;
_Wnk_Ncomp.real(cosf(-2.0f * _Pi * (float)n * k / sampleRate));
_Wnk_Ncomp.imag(sinf(-2.0f * _Pi * (float)n * k / sampleRate));
return _Wnk_Ncomp;
}
std::complex<float> _Wk_Nc(float k)
{
std::complex<float> _Wk_Ncomp;
_Wk_Ncomp.real(cosf(-2.0f * _Pi * k / sampleRate));
_Wk_Ncomp.imag(sinf(-2.0f * _Pi * k / sampleRate));
return _Wk_Ncomp;
}
One huge mistake you are making is calculating the butterfly weights (which involves sin and cos) on the fly (in _Wnk_Nc()). sin and cos typically cost 10s to 100s of clock cycles, whereas the other butterfly operations are just mul and add, which only take a few cycles, hence the need to factor these out. All fast FFT implementations do this as part of an initialisation step (usually called "plan creation" or similar). See e.g. FFTW and KissFFT.
Apart from the above-mentioned "pre-calculating butterfly weights" optimization, most FFT implementations also use SIMD instructions to vectorize the code.
// also don't see need to use reversal bit, where it shoul be?
The very first butterfly loop should be indexed in bit-reversed order. Those indexes are usually calculated inside the recursion, but in a loop-based solution calculating them is also costly, so it's better to pre-calculate them in the plan as well.
Combining those optimization approaches results in approximately a 100x speedup.
Most fast FFT implementations either use a lookup table of precomputed twiddle factors, or a simple recursion to rotate the twiddle factors on the fly, instead of calling trigonometric math library functions inside the FFT inner loop.
For large FFTs, using a trig recursion formula is less likely to thrash the data caches on contemporary processors.
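To make the "plan" idea from the answers above concrete, here is a minimal sketch of an iterative radix-2 FFT in which both the twiddle factors and the bit-reversed indexes are computed once, outside the transform. The names FftPlan, makePlan and fft are invented for this example and are not the API of FFTW or KissFFT. It only handles power-of-two sizes and uses double rather than float.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical "plan": everything that depends only on the transform size n.
struct FftPlan {
    std::size_t n;
    std::vector<std::size_t> bitrev;           // bit-reversed index of each position
    std::vector<std::complex<double>> twiddle; // exp(-2*pi*j*k/n) for k < n/2
};

FftPlan makePlan(std::size_t n)
{
    assert(n >= 2 && (n & (n - 1)) == 0); // radix-2 only: n must be a power of two
    FftPlan plan{n, std::vector<std::size_t>(n),
                 std::vector<std::complex<double>>(n / 2)};
    std::size_t bits = 0;
    while ((std::size_t{1} << bits) < n) ++bits;
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t r = 0;
        for (std::size_t b = 0; b < bits; ++b)
            if (i & (std::size_t{1} << b)) r |= std::size_t{1} << (bits - 1 - b);
        plan.bitrev[i] = r;
    }
    const double pi = 3.14159265358979323846;
    for (std::size_t k = 0; k < n / 2; ++k)
        plan.twiddle[k] = std::polar(1.0, -2.0 * pi * static_cast<double>(k)
                                              / static_cast<double>(n));
    return plan;
}

// In-place decimation-in-time FFT. There are no sin/cos calls in here:
// the butterflies only multiply and add, using the precomputed tables.
void fft(const FftPlan& plan, std::vector<std::complex<double>>& a)
{
    assert(a.size() == plan.n);
    // Reorder the input into bit-reversed order (the permutation is its own inverse).
    for (std::size_t i = 0; i < plan.n; ++i)
        if (i < plan.bitrev[i]) std::swap(a[i], a[plan.bitrev[i]]);
    // log2(n) stages of butterflies on blocks of length len.
    for (std::size_t len = 2; len <= plan.n; len <<= 1) {
        const std::size_t half = len / 2;
        const std::size_t step = plan.n / len; // stride into the twiddle table
        for (std::size_t base = 0; base < plan.n; base += len) {
            for (std::size_t k = 0; k < half; ++k) {
                const std::complex<double> t = plan.twiddle[k * step] * a[base + k + half];
                const std::complex<double> u = a[base + k];
                a[base + k] = u + t;
                a[base + k + half] = u - t;
            }
        }
    }
}
```

Create the plan once with makePlan(n) and reuse it for every buffer of that size; that reuse is where the saving over recomputing trig values comes from.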

Ineffective "Peel/Remainder" Loop in my code

I have this function:
bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
{
bool ret = false;
// input size (-1 for the safe bilinear interpolation)
const int width = im.cols-1;
const int height = im.rows-1;
// output size
const int halfWidth = res.cols >> 1;
const int halfHeight = res.rows >> 1;
float *out = res.ptr<float>(0);
const float *imptr = im.ptr<float>(0);
for (int j=-halfHeight; j<=halfHeight; ++j)
{
const float rx = ofsx + j * a12;
const float ry = ofsy + j * a22;
#pragma omp simd
for(int i=-halfWidth; i<=halfWidth; ++i, out++)
{
float wx = rx + i * a11;
float wy = ry + i * a21;
const int x = (int) floor(wx);
const int y = (int) floor(wy);
if (x >= 0 && y >= 0 && x < width && y < height)
{
// compute weights
wx -= x; wy -= y;
int rowOffset = y*im.cols;
int rowOffset1 = (y+1)*im.cols;
// bilinear interpolation
*out =
(1.0f - wy) * ((1.0f - wx) * imptr[rowOffset+x] + wx * imptr[rowOffset+x+1]) +
( wy) * ((1.0f - wx) * imptr[rowOffset1+x] + wx * imptr[rowOffset1+x+1]);
} else {
*out = 0;
ret = true; // touching boundary of the input
}
}
}
return ret;
}
halfWidth is very random: it can be 9, 84, 20, 95, 111...I'm only trying to optimize this code, I don't understand it in details.
As you can see, the inner for has been already vectorized, but Intel Advisor suggests this:
And this is the Trip Count analysis result:
To my understanding this means that:
Vector length is 8, so it means that 8 floats can be processed at the same time for each loop. This would mean (if I'm not wrong) that data are 32 bytes aligned (even though as I explain here it seems that the compiler thinks that data is not aligned).
On average, 2 cycles are totally vectorized, while 3 cycles are remainder loops. The same goes for Min and Max. Otherwise I don't understand what ; means.
Now my question is: how can I follow Intel Advisor's first suggestion? It says to "increase the size of objects and add iterations so the trip count is a multiple of vector length"...Ok, so it's simply sayin' "hey man do this so halfWidth*2+1 (since it goes from -halfWidth to +halfWidth) is a multiple of 8". But how can I do this? If I add random cycles, this would obviously break the algorithm!
The only solution that came to my mind is to add "fake" iterations like this:
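const int vectorLength = 8;
const int iterations = halfWidth*2+1;
const int remainder = iterations%vectorLength;
// round the trip count up to the next multiple of vectorLength
const int padded = iterations + (vectorLength-remainder)%vectorLength;
for(int i=0; i<padded; i++){
//this iteration was not supposed to exist, skip it!
if(i>=iterations)
continue;
}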
Of course this code would not work since it goes from -halfWidth to halfWidth, but it's to make you understand my strategy of "fake" iterations.
About the second option ("Increase the size of static and automatic objects, and use a compiler option to add data padding") I have no idea how to implement this.
First, you have to check the Vector Advisor Efficiency metric as well as the relative time spent in the Loop Remainder compared to the Loop Body (see the hotspots list in Advisor). If efficiency is close to 100% (or the time spent in the Remainder is very small), then it is not worth the effort (and money, as MSalters mentioned in the comments).
If it is << 100% (and there are no other penalties reported by the tool), then you can either refactor the code to "add fake iterations" (few users can afford it) or try #pragma loop_count with the most typical iteration counts (depending on the typical halfWidth value).
If halfWidth is totally random (no common or average values), then there is nothing you can really do about this issue.
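For reference, the arithmetic behind the "add fake iterations" option can be sketched as below. paddedTripCount is a made-up helper name, and the vector length of 8 comes from the Advisor report in the question. One thing the sketch makes visible: the trip count 2*halfWidth+1 is always odd, so it can never be a multiple of 8 on its own.

```cpp
#include <cassert>

// Hypothetical helper: the smallest multiple of vectorLength that is >= the
// real trip count of a loop running from -halfWidth to +halfWidth inclusive.
int paddedTripCount(int halfWidth, int vectorLength)
{
    assert(halfWidth >= 0 && vectorLength > 0);
    const int iterations = 2 * halfWidth + 1;
    const int remainder = iterations % vectorLength;
    return remainder == 0 ? iterations : iterations + (vectorLength - remainder);
}
```

For halfWidth = 9 this gives 24 instead of 19. The extra iterations have to write somewhere, so each output row would need storage for the padded count, and the padded results must be ignored afterwards. That is the refactoring cost the answer refers to.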

How to fast calculate the normalized l1 and l2 norm of a vector in C++?

I have a matrix X that has n column data vectors in d dimensional space.
Given a vector xj, v[j] is its l1 norm (the sum of all abs(xji)), w[j] is the square of its l2 norm (the sum of all xji^2), and pj[i] is the combination of entries divided by the l1 and l2 norms. Finally, I need the outputs pj, v, w for subsequent applications.
// X = new double [d*n]; is the input.
double alpha = 0.5;
double *pj = new double[d];
double *x_abs = new double[d];
double *x_2 = new double[d];
double *v = new double[n]();
double *w = new double[n]();
for (unsigned long j=0; j<n; ++j) {
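const unsigned long jd = j*d; // offset of column j in X
for (unsigned long i=0; i<d; ++i) {
x_abs[i] = std::fabs(X[i+jd]); // std::fabs (from <cmath>) always uses the floating-point version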
v[j] += x_abs[i];
x_2[i] = x_abs[i]*x_abs[i];
w[j] += x_2[i];
}
for (unsigned long i=0; i<d; ++i){
pj[i] = alpha*x_abs[i]/v[j]+(1-alpha)*x_2[i]/w[j];
}
// functionA(pj){ ... ...} for subsequent applications
}
// functionB(v, w){ ... ...} for subsequent applications
My above algorithm takes O(nd) flops/time complexity. Can anyone help me speed it up by using a built-in function or a new implementation in C++? Reducing the constant factor in O(nd) would also be very helpful.
Let me guess: since you have performance problems, the dimension of your vectors is quite large. If this is the case, then it is worth considering "CPU cache locality" - there is some interesting info on this in a cppcon14 presentation.
If the data is not available in the CPU caches, then the cost of abs-ing or squaring it once it is available is dwarfed by the time the CPU spends just waiting for the data.
With this in mind, you may want to try the following solution (with no guarantee that it will improve performance - the compiler may already apply these techniques when optimizing the code)
for (unsigned long j=0; j<n; ++j) {
// use pointer arithmetic - at > -O0 the compiler will do it anyway
double *start=X+j*d, *end=X+(j+1)*d;
// this part avoids as much as possible the competition
// on CPU caches between X and v/w.
// Don't store the norms in v/w as yet, keep them in registers
double l1norm=0, l2norm=0;
for(double *src=start; src!=end; src++) {
double val=*src;
l1norm+=std::fabs(val);
l2norm+= val*val;
}
// both terms are divided by their norm, as in the question's formula
double pl1=alpha/l1norm, pl2=(1-alpha)/l2norm;
for(double *src=start, *dst=pj; src!=end; src++, dst++) {
// Yes, recomputing abs/sqr may actually save time by not
// creating competition on CPU caches with x_abs and x_2
double val=*src;
*dst = pl1*std::fabs(val) + pl2*val*val;
}
// functionA(pj){ ... ...} for subsequent applications
// Think well if you really need v/w. If you really do,
// at least there are two values to be sent for storage into memory,
//meanwhile the CPU can actually load the next vector into cache
v[j]=l1norm; w[j]=l2norm;
}
// functionB(v, w){ ... ...} for subsequent applications
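As a self-contained sketch of the two-pass approach above for a single column, assuming the formula from the question (pj[i] = alpha*|x_i|/l1 + (1-alpha)*x_i^2/l2sq): the function name normalizedMix and its signature are invented for the example, and the handling of an all-zero column is my own assumption, since the question does not say what should happen there.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper for one column x of length d.
// Outputs: pj[0..d-1], l1 (sum of |x_i|) and l2sq (sum of x_i^2).
void normalizedMix(const double* x, std::size_t d, double alpha,
                   double* pj, double& l1, double& l2sq)
{
    assert(x != nullptr && pj != nullptr && d > 0);

    // Pass 1: accumulate both norms in local variables.
    l1 = 0.0;
    l2sq = 0.0;
    for (std::size_t i = 0; i < d; ++i) {
        const double a = std::fabs(x[i]);
        l1 += a;
        l2sq += a * a;
    }

    // Assumption: an all-zero column yields all-zero pj instead of dividing by zero.
    if (l1 == 0.0) {
        for (std::size_t i = 0; i < d; ++i) pj[i] = 0.0;
        return;
    }

    // Pass 2: two divisions in total, then only multiplications per element.
    const double s1 = alpha / l1;
    const double s2 = (1.0 - alpha) / l2sq;
    for (std::size_t i = 0; i < d; ++i) {
        const double a = std::fabs(x[i]);
        pj[i] = s1 * a + s2 * a * a;
    }
}
```

For column j of the matrix in the question, the call would be normalizedMix(X + j*d, d, alpha, pj, v[j], w[j]). A handy sanity check is that the entries of pj sum to 1 for any non-zero column.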

Unable to find simple sum of 1 to 100 numbers in CUDA?

I am working on an image processing algorithm using CUDA. In my algorithm I want to find the sum of all pixels of an image using a CUDA kernel. So I wrote a CUDA kernel to compute the sum of all pixels of a 16-bit grayscale image, but I got the wrong answer.
So I wrote a simple CUDA program to find the sum of the numbers 1 to 100; my code is below.
With this code I do not get the exact sum of the numbers 1 to 100 on the GPU, but I do get the exact sum on the CPU. So what did I do wrong in this code?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream> // needed for cout
#include <conio.h>
#include <malloc.h>
#include <limits>
#include <math.h>
using namespace std;
__global__ void computeMeanValue1(double *pixels,double *sum){
int x = threadIdx.x;
sum[0] = sum[0] + (pixels[(x)]);
__syncthreads();
}
int main(int argc, char **argv)
{
double *data;
double *dev_data;
double *dev_total;
double *total;
data=new double[100]; // new[] takes an element count, not a byte count
total=new double[1];
double cpuSum=0.0;
for(int i=0;i<100;i++){
data[i]=i+1;
cpuSum=cpuSum+data[i];
}
cout<<"CPU total = "<<cpuSum<<std::endl;
cudaMalloc( (void**)&dev_data, 100 * sizeof(double));
cudaMalloc( (void**)&dev_total, 1 * sizeof(double));
cudaMemcpy(dev_data, data, 100 * sizeof(double), cudaMemcpyHostToDevice);
computeMeanValue1<<<1,100>>>(dev_data,dev_total);
cudaDeviceSynchronize();
cudaMemcpy(total, dev_total, 1* sizeof(double), cudaMemcpyDeviceToHost);
cout<<"GPU total = "<<total[0]<<std::endl;
cudaFree(dev_data);
cudaFree(dev_total);
delete[] data; // memory from new[] must be released with delete[], not free()
delete[] total;
getch();
return 0;
}
All your threads are writing to the same memory location at the same time.
sum[0] = sum[0] + (pixels[(x)]);
You can't do this and expect to get the correct result. Your kernel needs to take a different approach to avoid writing to the same memory from different threads. The pattern usually employed for doing this is reduction. Simply put, with a reduction each thread is responsible for summing a block of elements within the array and then storing the result. By employing a series of these reduction operations it's possible to sum the entire contents of the array.
__global__ void block_sum(const float *input,
float *per_block_results,
const size_t n)
{
extern __shared__ float sdata[];
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
// load input into __shared__ memory
float x = 0;
if(i < n)
{
x = input[i];
}
sdata[threadIdx.x] = x;
__syncthreads();
// contiguous range pattern
for(int offset = blockDim.x / 2;
offset > 0;
offset >>= 1)
{
if(threadIdx.x < offset)
{
// add a partial sum upstream to our own
sdata[threadIdx.x] += sdata[threadIdx.x + offset];
}
// wait until all threads in the block have
// updated their partial sums
__syncthreads();
}
// thread 0 writes the final result
if(threadIdx.x == 0)
{
per_block_results[blockIdx.x] = sdata[0];
}
}
Each thread writes to a different location, sdata[threadIdx.x], so there is no race condition. Threads are free to access other elements in sdata because they only read from them, so there are no race conditions there. Note the use of __syncthreads() to ensure that the operations to load data into sdata are complete before the threads start to read the data, and the second call to __syncthreads() to ensure that all the summation operations have completed before copying the final result from sdata[0]. Note that only thread 0 writes its result to per_block_results[blockIdx.x], so there is no race condition there either.
You can find the complete sample code for the above on Google Code (I did not write this). This slide deck has a reasonable summary of reductions in CUDA. It includes diagrams which really help in understanding how the interleaved memory reads and writes do not conflict with each other.
You can find lots of other material on efficient implementations of reduction on GPUs. Ensuring that your implementation makes most efficient use of memory is key to getting the best performance out of a memory bound operation like reduction.
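If the halving pattern in block_sum is hard to picture, here is a plain C++ (CPU-only, not CUDA) walk-through of what one block does. The function treeReduce and its parameter blockDim are named after the kernel's concepts but are otherwise made up for this illustration. Each pass of the inner loop stands for the threads with tid < offset all running at once, followed by __syncthreads().

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side illustration of the reduction pattern used by block_sum.
// blockDim plays the role of the number of threads in the block.
double treeReduce(const std::vector<double>& input, std::size_t blockDim)
{
    // The halving only covers every element if blockDim is a power of two.
    assert(blockDim > 0 && (blockDim & (blockDim - 1)) == 0);
    assert(input.size() <= blockDim);

    // "Load into shared memory": slots past the end of the input hold 0.
    std::vector<double> sdata(blockDim, 0.0);
    for (std::size_t tid = 0; tid < input.size(); ++tid) sdata[tid] = input[tid];

    for (std::size_t offset = blockDim / 2; offset > 0; offset >>= 1) {
        // Every "thread" writes only its own slot and reads a slot in the
        // upper half, which nobody writes during this pass.
        for (std::size_t tid = 0; tid < offset; ++tid)
            sdata[tid] += sdata[tid + offset];
        // (on the GPU: __syncthreads() here)
    }
    return sdata[0]; // what thread 0 would store in per_block_results
}
```

With the values 1 to 100 and blockDim = 128 this returns 5050. Note the 128: with the question's launch size of 100 threads, the halving 50, 25, 12, ... would skip an element, so a block size that is a power of two is the safer choice for this kernel.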
In GPU code, we have multiple threads executing in parallel. If all of those threads attempt to update the same location in memory, we have undefined behavior, unless we use special operations, called atomics to do the update.
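In your case, since sum is updated by all threads, and sum is a double quantity, we can use the special custom atomic function described in the programming guide to accomplish this. (Devices of compute capability 6.0 and higher provide a built-in atomicAdd for double, so the custom function is only needed for older GPUs.)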
If I replace your kernel code with the following:
__device__ double atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull =
(unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed,
__double_as_longlong(val +
__longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}
__global__ void computeMeanValue1(double *pixels,double *sum){
int x = threadIdx.x;
atomicAdd(sum, pixels[x]);
}
And initialize the sum value to zero before the kernel:
double gpuSum = 0.0;
cudaMemcpy(dev_total, &gpuSum, sizeof(double), cudaMemcpyHostToDevice);
Then I think you'll get matching results.
As @AdeMiller pointed out, the faster way to perform parallel sums like this is via classical parallel reduction.
There is a CUDA sample code that demonstrates this and an accompanying presentation that covers the methodology.

What is faster on division? doubles / floats / UInt32 / UInt64 ? in C++/C

I did some speed testing to figure out what is the fastest, when doing multiplication or division on numbers. I had to really work hard to defeat the optimiser. I got nonsensical results such as a massive loop operating in 2 microseconds, or that multiplication was the same speed as division (if only that were true).
After I finally worked hard enough to defeat enough of the compiler optimisations, while still letting it optimise for speed, I got these speed results. They may be of interest to someone else?
If my test is STILL FLAWED, let me know, but be kind seeing as I just spent two hours writing this crap :P
64 time: 3826718 us
32 time: 2476484 us
D(mul) time: 936524 us
D(div) time: 3614857 us
S time: 1506020 us
"Multiplying to divide" using doubles seems the fastest way to do a division, followed by integer division. I did not test the accuracy of division. Could it be that "proper division" is more accurate? I have no desire to find out after these speed test results as I'll just be using integer division on a base 10 constant and letting my compiler optimise it for me ;) (and not defeating its optimisations either).
Here's the code I used to get the results:
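#include <iostream>
#include <cstdint> // uint32_t, uint64_t
#include <cstdio>  // printf
#include <ctime>   // clock, clock_t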
int Run(int bla, int div, int add, int minus) {
// these parameters are to force the compiler to not be able to optimise away the
// multiplications and divides :)
long LoopMax = 100000000;
uint32_t Origbla32 = 1000000000;
long i = 0;
uint32_t bla32 = Origbla32;
uint32_t div32 = div;
clock_t Time32 = clock();
for (i = 0; i < LoopMax; i++) {
div32 += add;
div32 -= minus;
bla32 = bla32 / div32;
bla32 += bla;
bla32 = bla32 * div32;
}
Time32 = clock() - Time32;
uint64_t bla64 = bla32;
clock_t Time64 = clock();
uint64_t div64 = div;
for (long i = 0; i < LoopMax; i++) {
div64 += add;
div64 -= minus;
bla64 = bla64 / div64;
bla64 += bla;
bla64 = bla64 * div64;
}
Time64 = clock() - Time64;
double blaDMul = Origbla32;
double multodiv = 1.0 / (double)div;
double multomul = div;
clock_t TimeDMul = clock();
for (i = 0; i < LoopMax; i++) {
multodiv += add;
multomul -= minus;
blaDMul = blaDMul * multodiv;
blaDMul += bla;
blaDMul = blaDMul * multomul;
}
TimeDMul = clock() - TimeDMul;
double blaDDiv = Origbla32;
clock_t TimeDDiv = clock();
for (i = 0; i < LoopMax; i++) {
multodiv += add;
multomul -= minus;
blaDDiv = blaDDiv / multomul;
blaDDiv += bla;
blaDDiv = blaDDiv / multodiv;
}
TimeDDiv = clock() - TimeDDiv;
float blaS = Origbla32;
float divS = div;
clock_t TimeS = clock();
for (i = 0; i < LoopMax; i++) {
divS += add;
divS -= minus;
blaS = blaS / divS;
blaS += bla;
blaS = blaS * divS;
}
TimeS = clock() - TimeS;
printf("64 time: %i us (%i)\n", (int)Time64, (int)bla64);
printf("32 time: %i us (%i)\n", (int)Time32, bla32);
printf("D(mul) time: %i us (%f)\n", (int)TimeDMul, blaDMul);
printf("D(div) time: %i us (%f)\n", (int)TimeDDiv, blaDDiv);
printf("S time: %i us (%f)\n", (int)TimeS, blaS);
return 0;
}
int main(int argc, char* const argv[]) {
Run(0, 10, 0, 0); // adds and minuses 0 so it doesn't affect the math, only kills the opts
return 0;
}
There are lots of ways to perform certain arithmetic, so there might not be a single answer (shifting, fractional multiplication, actual division, some round-trip through a logarithm unit, etc; these might all have different relative costs depending on the operands and resource allocation).
Let the compiler do its thing with the program and data flow information it has.
For some data applicable to assembly on x86, you might look at: "Instruction latencies and throughput for AMD and Intel x86 processors"
What is fastest will depend entirely on the target architecture. It looks here like you're interested only in the platform you happen to be on, which, guessing from your execution times, seems to be 64-bit x86, either Intel (Core2?) or AMD.
That said, floating-point multiplication by the inverse will be the fastest on many platforms, but is, as you speculate, usually less accurate than a floating-point divide (two roundings instead of one -- whether or not that matters for your usage is a separate question). In general, you are better off re-arranging your algorithm to use fewer divides than you are jumping through hoops to make division as efficient as possible (the fastest division is the one you don't do), and make sure to benchmark before you spend time optimizing at all, as algorithms that bottleneck on division are few and far between.
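To make the "two roundings instead of one" point concrete, here is a small sketch that counts how often the two approaches disagree. The function name `count_reciprocal_mismatches` is invented for this example, and it assumes ordinary IEEE 754 double arithmetic without aggressive flags such as -ffast-math.

```cpp
#include <cstddef>

// Hypothetical helper: counts how many integers x in [1, n] give a different
// double result for x * (1.0 / divisor) than for x / divisor.
std::size_t count_reciprocal_mismatches(double divisor, int n) {
    const double reciprocal = 1.0 / divisor;  // first rounding happens here
    std::size_t mismatches = 0;
    for (int x = 1; x <= n; ++x) {
        const double by_division = x / divisor;     // a single rounding
        const double by_multiply = x * reciprocal;  // a second rounding on top
        if (by_division != by_multiply) ++mismatches;
    }
    return mismatches;
}
```

With a divisor of 10, x = 3 is one of the mismatches: 3 * 0.1 evaluates to 0.30000000000000004 while 3 / 10.0 evaluates to 0.3. A power-of-two divisor such as 4 has an exactly representable reciprocal, so no mismatches show up there.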
Also, if you have integer sources and need an integer result, make sure to include the cost of conversion between integer and floating-point in your benchmarking.
Since you're interested in timings on a specific machine, you should be aware that Intel now publishes this information in their Optimization Reference Manual (pdf). Specifically, you will be interested in the tables of Appendix C section 3.1, "Latency and Throughput with Register Operands".
Be aware that integer divide timings depend strongly on the actual values involved. Based on the information in that guide, it seems that your timing routines still have a fair bit of overhead, as the performance ratios you measure don't match up with Intel's published information.
As Stephen mentioned, use the optimisation manual - but you should also be considering the use of SSE instructions. These can do 4 or 8 divisions / multiplications in a single instruction.
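A minimal sketch of what that looks like with SSE intrinsics, assuming an x86 target: `_mm_loadu_ps`, `_mm_div_ps` and `_mm_storeu_ps` come from `<xmmintrin.h>`, while the function name `divide_arrays` and the `SKETCH_HAVE_SSE` macro are made up for this example. On other targets the code falls back to the scalar loop.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

// Hypothetical helper: out[i] = a[i] / b[i], four floats per instruction where
// SSE is available, with a scalar loop for the tail (and as a fallback).
void divide_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);       // unaligned load of 4 floats
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_div_ps(va, vb));  // 4 divisions in one instruction
    }
#endif
    for (; i < n; ++i) {
        out[i] = a[i] / b[i];  // remaining elements, one at a time
    }
}
```

An optimising compiler will often vectorise the plain scalar loop on its own, so it is worth checking the generated assembly before writing intrinsics by hand.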
Also, it is fairly common for an arithmetic instruction to be issued every clock cycle or so even though its result is not available for several clock cycles (called latency): the next operation can begin during this time (overlapping with the first) as long as it does not require the result from the first. This is due to pipelining in the CPU, in the same way as you can wash more clothes while the previous load is still drying. Dividers are often only partially pipelined, though, so check the throughput figures for your CPU rather than assuming one division per cycle.
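One way to give the CPU independent work to overlap is to split a running total across several accumulators. The sketch below assumes nothing beyond standard C++; the name `sum_of_quotients` is hypothetical, and whether it actually helps depends on the CPU and on what the compiler already does.

```cpp
#include <cstddef>

// Hypothetical helper: sums data[i] / divisor using four accumulators. Each
// accumulator forms its own dependency chain, so work on one chain can
// overlap with work on the others instead of waiting for a single total.
double sum_of_quotients(const double* data, std::size_t n, double divisor) {
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += data[i] / divisor;
        acc1 += data[i + 1] / divisor;
        acc2 += data[i + 2] / divisor;
        acc3 += data[i + 3] / divisor;
    }
    for (; i < n; ++i) {
        acc0 += data[i] / divisor;  // leftover elements
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the additions happen in a different order than in a single-accumulator loop, the result can differ in the last bits for general floating-point input.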
Multiplying to divide is a common trick, and should be used wherever your divisor changes infrequently.
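A minimal sketch of the trick for a divisor that stays fixed across a whole array (the function name `scale_by_reciprocal` is made up for this example):

```cpp
#include <cstddef>

// Hypothetical helper: divides every element by the same divisor, paying for
// one division up front and one multiplication per element afterwards.
void scale_by_reciprocal(double* data, std::size_t n, double divisor) {
    const double reciprocal = 1.0 / divisor;  // the only division
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= reciprocal;  // may differ from data[i] / divisor in the last bit
    }
}
```

Compilers generally will not make this substitution for floating-point code on their own unless the reciprocal is exact or relaxed maths flags are enabled, since it can change the result.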
There is a very good chance that you will spend time and effort making the maths fast only to discover that it is the speed of memory access (as you navigate the input and write the output) that limits your final implementation.
I wrote a flawed test to do this on MSVC 2008
double i32Time = GetTime();
{
volatile __int32 i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 61;
count++;
}
}
i32Time = GetTime() - i32Time;
double i64Time = GetTime();
{
volatile __int64 i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 61;
count++;
}
}
i64Time = GetTime() - i64Time;
double fTime = GetTime();
{
volatile float i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 4.0f;
count++;
}
}
fTime = GetTime() - fTime;
double fmTime = GetTime();
{
volatile float i = 4;
const float div = 1.0f / 4.0f;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i *= div;
count++;
}
}
fmTime = GetTime() - fmTime;
double dTime = GetTime();
{
volatile double i = 4;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i /= 4.0f;
count++;
}
}
dTime = GetTime() - dTime;
double dmTime = GetTime();
{
volatile double i = 4;
const double div = 1.0f / 4.0f;
__int32 count = 0;
__int32 max = 1000000;
while( count < max )
{
i *= div;
count++;
}
}
dmTime = GetTime() - dmTime;
DebugOutput( _T( "%f\n" ), i32Time );
DebugOutput( _T( "%f\n" ), i64Time );
DebugOutput( _T( "%f\n" ), fTime );
DebugOutput( _T( "%f\n" ), fmTime );
DebugOutput( _T( "%f\n" ), dTime );
DebugOutput( _T( "%f\n" ), dmTime );
DebugBreak();
I then ran it on an AMD64 Turion 64 in 32-bit mode. The results I got were as follows:
0.006622
0.054654
0.006283
0.006353
0.006203
0.006161
The reason the test is flawed is the usage of volatile, which forces the compiler to re-load the variable from memory just in case it's changed. All in all, it shows there is precious little difference between any of the implementations on this machine (__int64 is obviously slow).
It also categorically shows that the MSVC compiler performs the multiply by reciprocal optimisation. I imagine GCC does the same if not better. If I change the float and double division checks to divide by "i" then it increases the time significantly. Though, while a lot of that could be the re-loading from memory, it is obvious the compiler can't optimise that away so easily.
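One possible way to avoid volatile in the loop is to make each iteration depend on the previous result and to hand the final value back to the caller. This is only a sketch: `TimedResult` and `time_u64_division` are names invented here, and the divisor and start value should come from something the compiler cannot see at compile time (command-line arguments, for instance).

```cpp
#include <chrono>
#include <cstdint>

// Result of one timing run: the elapsed time plus the computed value, which
// the caller should use (for example, print) so the work cannot be discarded.
struct TimedResult {
    double seconds;
    std::uint64_t value;
};

// Hypothetical harness: each iteration divides the previous result, so the
// loop cannot be dropped or collapsed without knowing the run-time inputs.
TimedResult time_u64_division(std::uint64_t start, std::uint64_t divisor, long iterations) {
    std::uint64_t x = start;
    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        x = x / divisor + start;  // depends on the previous x
    }
    const auto t1 = std::chrono::steady_clock::now();
    return {std::chrono::duration<double>(t1 - t0).count(), x};
}
```

Since every division waits on the one before it, this measures latency rather than throughput, and the addition is included in the timing.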
To understand such micro-optimisations try reading this pdf.
All in all, I'd argue that if you are worrying about such things you obviously haven't profiled your code. Profile and fix the problems as and when they actually ARE a problem.
Agner Fog has done some pretty detailed measurements himself, which can be found here. If you're really trying to optimize stuff, you should read the rest of the documents from his software optimization resources as well.
I would point out that, even if you are measuring non-vectorized floating point operations, the compiler has two options for the generated assembly: it can use the FPU instructions (fadd, fmul) or it can use SSE instructions while still manipulating one floating point value per instruction (addss, mulss). In my experience the SSE instructions are faster and have fewer inaccuracies, but compilers don't make it the default because it could break compatibility with code that relies on the old behavior. You can turn it on in gcc with the -mfpmath=sse flag.