I was trying to prove a point about OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to get high performance with OpenMP.
The Gauss-Seidel iteration is split into two separate sweeps, such that within each sweep every operation can be performed in any order and there is no dependency between tasks. So in theory no processor should ever have to wait for another one to perform any kind of synchronization.
The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower.
For many other linear parallelized routines I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that the operation I perform on the array is thread-safe, so that it cannot parallelize it effectively.
See the example below.
Does anyone have a clue how to make this more effective with OpenMP?
void redBlackSmooth(std::vector<double> const & b,
std::vector<double> & x,
double h)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
int const N = static_cast<int>(x.size());
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
}
Addition:
I have now also tried a raw-pointer implementation and it shows the same behavior as the STL container, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.
First of all, make sure that the x vector is aligned to cache boundaries. I ran some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of memory:
double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
Second, you can try to assign more computation to each thread (this way you can keep cache lines separated), but I suspect that OpenMP already does something like this under the hood, so it may not be worth it with large N.
In my case this implementation is much faster when x is not cache-aligned.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); //Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base , k, i;
#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
base = j * workGroupSize;
for (int k = 0; k < workGroupSize; k+=2)
{
i = base + k + (redSweep ? 1 : 0);
if ( i == 0 || i+1 == N) continue;
sigma = -invh2* ( x[i-1] + x[i+1] );
x[i] = ( h2/2.0 ) * ( b[i] - sigma );
}
}
In conclusion, you definitely have a problem of cache-fighting, but given the way openmp works (sadly I am not familiar with it) it should be enough to work with properly allocated buffers.
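To make the "properly allocated buffers" point concrete, here is a minimal sketch of an aligned allocation wrapped in a smart pointer so it cannot leak. The names allocateAligned, isAligned and AlignedBuffer are hypothetical helpers made up for this example; only posix_memalign comes from the snippet above, and it is POSIX rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <stdlib.h>  // posix_memalign and free (POSIX, not standard C++)

// Deleter so that std::unique_ptr releases the buffer with free().
struct FreeDeleter {
    void operator()(double* p) const { free(p); }
};
using AlignedBuffer = std::unique_ptr<double[], FreeDeleter>;

// Allocate n doubles whose start address is a multiple of `alignment`.
// posix_memalign requires a power of two that is a multiple of sizeof(void*).
AlignedBuffer allocateAligned(std::size_t n, std::size_t alignment) {
    assert(alignment % sizeof(void*) == 0);
    assert((alignment & (alignment - 1)) == 0);
    void* raw = nullptr;
    if (posix_memalign(&raw, alignment, n * sizeof(double)) != 0) {
        throw std::bad_alloc();
    }
    return AlignedBuffer(static_cast<double*>(raw));
}

// True if p is a multiple of `alignment`.
bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With this, `AlignedBuffer x = allocateAligned(N, 256);` gives a buffer that can be passed to the raw-pointer version of the smoother as `x.get()`. Whether 256 is the right alignment for a given machine is something you would have to check.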
I think the main problem is the type of array structure you are using. Let's compare results with vectors and arrays (arrays = C arrays allocated with operator new).
Vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to maintain runtime > 0.1secs.
Vector Time: 0.121007 Repeat: 1 MLUPS: 82.6399
Array Time: 0.164009 Repeat: 2 MLUPS: 121.945
MLUPS = ((N-2)*repeat/runtime)/1000000 (Million Lattice Points Update per second)
MFLOPS are misleading when it comes to grid calculations: a few changes to the basic equation can make the reported performance look much higher for the same runtime.
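For reference, the MLUPS formula above can be written as a small helper. This is only a sketch; the name mlups is made up here. Doing the arithmetic in double also avoids the int overflow that (N-2)*repeat can hit once repeat grows large.

```cpp
#include <cassert>
#include <cstddef>

// Million lattice-point updates per second:
// ((n - 2) * repeat / runtime) / 1e6, where the two boundary points
// are not updated and therefore not counted.
double mlups(std::size_t n, std::size_t repeat, double runtimeSeconds) {
    assert(n >= 2);
    assert(runtimeSeconds > 0.0);
    const double updates =
        static_cast<double>(n - 2) * static_cast<double>(repeat);
    return updates / runtimeSeconds / 1.0e6;
}
```

Plugging in the timings quoted above, `mlups(10000000, 1, 0.121007)` and `mlups(10000000, 2, 0.164009)`, reproduces the two MLUPS figures up to the rounding of the printed runtimes.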
The modified code:
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
double runtime(0.0), wcs, wce;
int repeat = 1;
timing(&wcs);
for(; runtime < 0.1; repeat*=2)
{
for(int r = 0; r < repeat; ++r)
{
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// cout << "In Array: " << r << endl;
}
if(x[0] != 0) dummy(x[0]);
timing(&wce);
runtime = (wce-wcs);
}
// cout << "Before division: " << repeat << endl;
repeat /= 2;
cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
<< "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;
return runtime;
}
I didn't change anything in the code other than the array type. For better cache access and blocking you should look into data alignment (_mm_malloc).
This is an optimized implementation of matrix multiplication. The routine performs the operation
C := C + A * B (where A, B, and C are n-by-n matrices stored in column-major format)
On exit, A and B maintain their input values.
void matmul_optimized(int n, int *A, int *B, int *C)
{
// to the effective bitwise calculation
// save the matrix as the different type
int i, j, k;
int cij;
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
cij = C[i + j * n]; // start from the current C entry and accumulate in a local variable
for (k = 0; k < n; ++k) {
cij ^= A[i + k * n] & B[k + j * n]; // multiplication of terms is done with &, addition with ^
}
C[i + j * n] = cij; // store the final result in C
}
}
}
How can I speed up the matrix multiplication further, based on the function above?
The function has been tested with matrices up to 2048 by 2048.
The function matmul_optimized is called from the following matmul driver.
#include <stdio.h>
#include <stdlib.h>
#include "cpucycles.c"
#include "helper_functions.c"
#include "matmul_reference.c"
#include "matmul_optimized.c"
int main()
{
int i, j;
int n = 1024; // Number of rows or columns in the square matrices
int *A, *B; // Input matrices
int *C1, *C2; // Output matrices from the reference and optimized implementations
// Performance and correctness measurement declarations
long int CLOCK_start, CLOCK_end, CLOCK_total, CLOCK_ref, CLOCK_opt;
long int COUNTER, REPEAT = 5;
int difference;
float speedup;
// Allocate memory for the matrices
A = malloc(n * n * sizeof(int));
B = malloc(n * n * sizeof(int));
C1 = malloc(n * n * sizeof(int));
C2 = malloc(n * n * sizeof(int));
// Fill bits in A, B, C1
fill(A, n * n);
fill(B, n * n);
fill(C1, n * n);
// Initialize C2 = C1
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
C2[i * n + j] = C1[i * n + j];
// Measure performance of the reference implementation
CLOCK_total = 0;
for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
{
CLOCK_start = cpucycles();
matmul_reference(n, A, B, C1);
CLOCK_end = cpucycles();
CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
}
CLOCK_ref = CLOCK_total / REPEAT;
printf("n=%d Avg cycle count for reference implementation = %ld\n", n, CLOCK_ref);
// Measure performance of the optimized implementation
CLOCK_total = 0;
for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
{
CLOCK_start = cpucycles();
matmul_optimized(n, A, B, C2);
CLOCK_end = cpucycles();
CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
}
CLOCK_opt = CLOCK_total / REPEAT;
printf("n=%d Avg cycle count for optimized implementation = %ld\n", n, CLOCK_opt);
speedup = (float)CLOCK_ref / (float)CLOCK_opt;
// Check correctness by comparing C1 and C2
difference = 0;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
difference = difference + C1[i * n + j] - C2[i * n + j];
if (difference == 0)
printf("Speedup factor = %.2f\n", speedup);
if (difference != 0)
printf("Reference and optimized implementations do not match\n");
//print(C2, n);
free(A);
free(B);
free(C1);
free(C2);
return 0;
}
You can try an algorithm like Strassen or Coppersmith-Winograd; here is also a good example.
Or maybe try parallel computing, for example with std::async or std::thread.
Optimizing matrix-matrix multiplication requires careful attention to be paid to a number of issues:
First, you need to be able to use vector instructions. Only vector instructions can access parallelism inherent in the architecture. So, either your compiler needs to be able to automatically map to vector instructions, or you have to do so by hand, for example by calling the vector intrinsic library for AVX-2 instructions (for x86 architectures).
Next, you need to pay careful attention to the memory hierarchy. Your performance can easily drop to less than 5% of peak if you don't do this.
Once you do this right, you will hopefully have broken the computation up into small enough computational chunks that you can also parallelize via OpenMP or pthreads.
A document that carefully steps through what is required can be found at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html. (This is very much a work in progress.) At the end of it all, you will have an implementation that gets close to the performance attained by high-performance libraries like Intel's Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS).
(And, actually, you CAN then also effectively incorporate Strassen's algorithm. But that is another story, told in Unit 3.5.3 of these notes.)
You may find the following thread relevant: How does BLAS get such extreme performance?
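As a small illustration of the first two points (vector instructions and memory hierarchy), here is a sketch that only reorders the loops of the question's kernel. Because the matrices are column-major, making i the innermost loop turns every access into a unit-stride access, which gives the compiler a chance to auto-vectorize. The function names are made up for this example, and this is nowhere near what a tuned library does; since ^ is associative and commutative, the result is identical to the original loop order.

```cpp
#include <cassert>
#include <cstddef>

// Same loop order (i, j, k) as the question's matmul_optimized.
void matmulXorAnd_ijk(int n, const int* A, const int* B, int* C) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int cij = C[i + j * n];
            for (int k = 0; k < n; ++k) {
                cij ^= A[i + k * n] & B[k + j * n];
            }
            C[i + j * n] = cij;
        }
    }
}

// Reordered (j, k, i): the inner loop walks down one column of A and one
// column of C with stride 1, and B[k + j * n] is loop-invariant.
void matmulXorAnd_jki(int n, const int* A, const int* B, int* C) {
    assert(n >= 0);
    const std::size_t stride = static_cast<std::size_t>(n);
    for (int j = 0; j < n; ++j) {
        int* cCol = C + j * stride;
        for (int k = 0; k < n; ++k) {
            const int b = B[k + j * stride];
            const int* aCol = A + k * stride;
            for (int i = 0; i < n; ++i) {
                cCol[i] ^= aCol[i] & b;
            }
        }
    }
}
```

Blocking for cache and threading would come on top of this; the document linked above walks through those steps.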
I've been doing a really really simple finite difference code that solves the 1D convection equation.
It seems to work pretty fine but if I increase the size of the arrays that I'm using I get a segmentation fault error. This happens when I reduce the timestep or if I increase the time interval.
The code is
#include <math.h>
#include <iostream>
#include <fstream>
#include <stdio.h>
#include <cmath>
using namespace std;
int main(){
double xi = 0.0;
double xf = 10.0;
double ti = 0.0;
double tf = 1.0;
// Time interval: if tf is equal to 1 the code works fine.
double x,t;
double dt = 0.1;
double dx = 0.1;
int nstep_x = (xf - xi)/dx;
int nstep_t = (tf - ti)/dt;
double f[nstep_x][nstep_t];
double ex[nstep_x][nstep_t];
// Parameters
const double v = 0.05;
const double D = 0.0001;
const double pi = 3.141592654;
ofstream salida;
salida.open("out");
for (int i = 0 ; i <= nstep_x; i++){
x = xi + dx*i;
f[i][0] = 0.5*sin(pi*x); //Initial conditions
salida << x << " " << 0 << " " << f[i][0] << endl;
}
salida << endl;
for (int n = 0; n <= nstep_t ; n++){
t = ti + n*dt;
for (int i = 1; i <= nstep_x; i++){
x = xi + dx*i;
f[i][n+1] = f[i][n] - ((v*dt)/(2*dx))*(f[i+1][n] - f[i-1][n]); //CONV|SOC
ex[i][n] = 0.5*sin(pi*x - v*t);
salida << x << " " << t << " " << ex[i][n] << " " << f[i][n] << endl;
}
salida << endl;
salida << endl;
}
}
I think that is not a problem of going out of the array bounds in the loops because the code works for "small" arrays.
I guess that I must be doing something wrong with the array handling but I can't find the error.
You have several issues with your code. One is that you're using variable-length arrays (VLAs), which are not standard C++.
double f[nstep_x][nstep_t];
double ex[nstep_x][nstep_t];
This is not valid C++, as an array must have its size known at compile time, not at run time.
The quick solution is to use std::vector<std::vector<double>>:
#include <vector>
//...
std::vector<std::vector<double>> f(nstep_x, std::vector<double>(nstep_t));
std::vector<std::vector<double>> ex = f; // use copy constructor to easily create a copy
The above code basically does what your original code did, but has several advantages:
1) The code is now standard C++, as it uses a standard C++ container class, std::vector.
2) You won't get into stack space issues if nstep_x and/or nstep_t are large values since std::vector gets the memory to store its items from the heap.
3) You can check boundary conditions by using std::vector::at() if it is suspected that you are accessing the vector out-of-bounds. You don't have this test if you're using VLA's (or just arrays in general).
It is item 3) that becomes important in attempting to find the errors.
If we take your code and change it to use std::vector, we see that there is a problem even with "small" arrays, contrary to what you assumed. If we look at this code:
for (int i = 0 ; i <= nstep_x; i++)
{
x = xi + dx*i;
f.at(i).at(0) = 0.5*sin(pi*x); // exception is thrown here
}
We see that there is an out-of-bounds condition. This was detected by using vector::at() instead of [ ] to access the elements in the vector. A std::out_of_range exception is thrown at the line where f[i][0] is being assigned to.
Here is a live example showing this error.
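A self-contained way to reproduce this is sketched below. The function firstOutOfRangeRow is a hypothetical helper written for this illustration: it runs the initialisation loop with the original `<=` bound and reports the index at which at() throws.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Runs the question's initialisation loop with its original `i <= nstep_x`
// bound and returns the row index at which at() throws, or -1 if it never does.
int firstOutOfRangeRow(int nstep_x, int nstep_t) {
    assert(nstep_x > 0 && nstep_t > 0);
    std::vector<std::vector<double>> f(nstep_x, std::vector<double>(nstep_t));
    int i = 0;
    try {
        for (i = 0; i <= nstep_x; ++i) {
            f.at(i).at(0) = 0.0;  // bounds-checked access
        }
    } catch (const std::out_of_range&) {
        return i;  // first invalid row index
    }
    return -1;
}
```

For any positive sizes the returned index equals nstep_x, i.e. exactly one past the last valid row, no matter how small the array is.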
How do you fix this? Simply don't go out of bounds by changing the loop:
for (int i = 0 ; i < nstep_x; i++)
You also have boundary condition issues in your other loops:
for (int n = 0; n <= nstep_t ; n++)
{
t = ti + n*dt;
for (int i = 1; i <= nstep_x; i++)
{
x = xi + dx*i;
f.at(i).at(n+1) = f[i][n] - ((v*dt)/(2*dx))*(f.at(i+1).at(n) - f[i-1][n]);
ex.at(i).at(n) = 0.5*sin(pi*x - v*t);
}
}
You will see using at() that you are accessing the f and ex vectors out-of-bounds, and thus can diagnose the issue correctly (as the other answers have done).
gdb would help you see where it actually crashes, but:
f[i+1][n] is accessed while i grows up to nstep_x, but f is allocated as f[nstep_x][nstep_t],
so you end up accessing f[nstep_x+1][n], while the largest valid index is f[nstep_x-1][n].
As noted in the comments, the reason is not the array size, but the for loops
for (int i = 0 ; i <= nstep_x; i++) {
// ...
f[i][0] = 0.5*sin(pi*x);
}
This is a classic off-by-one error, going one beyond the end of the array. The correct way is
for (int i = 0 ; i < nstep_x; i++) {
// ...
}
Note < vs <=.
for (int n = 0; n <= nstep_t ; n++) {
for (int i = 1; i <= nstep_x; i++) {
// ...
f[i][n+1] = f[i][n] - ((v*dt)/(2*dx))*(f[i+1][n] - f[i-1][n]);
}
}
Here too, you have <= instead of <. Additionally you access indexes at i + 1 and n + 1 respectively, which means, you are not one but two steps over the end of the array.
When calculating the second time-step, you're referring to f at x=xi and t=ti+dt which wasn't calculated in the first time-step, since i runs from 1. There is a similar problem at the other boundary.
You need to specify the spatial boundary conditions for x=xi and x=xf for all t as well as fixing the off-by-one errors noted in the other answers.
To clarify, the convection equation requires boundary conditions to be specified for f(x=xi,t) and f(x=xf, t). This is typically either a constant or a prescribed flow rate in the case of an 'insulated' boundary, but other types exist.
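Putting the boundary conditions and the index fixes together, one time step could look like the sketch below. It assumes fixed (Dirichlet) values at both ends, which is only one possible choice, and the name convectionStep and its parameters are invented for this example. The update formula itself is the one from the question, with c = v*dt/(2*dx).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit step f_new[i] = f[i] - c * (f[i+1] - f[i-1]) for the interior
// points, with prescribed values at the two boundaries.
std::vector<double> convectionStep(const std::vector<double>& f, double c,
                                   double leftValue, double rightValue) {
    assert(f.size() >= 2);
    const std::size_t n = f.size();
    std::vector<double> next(n);
    next[0] = leftValue;        // boundary condition at x = xi
    next[n - 1] = rightValue;   // boundary condition at x = xf
    for (std::size_t i = 1; i + 1 < n; ++i) {  // interior only: i-1 and i+1 stay valid
        next[i] = f[i] - c * (f[i + 1] - f[i - 1]);
    }
    return next;
}
```

Calling this in a loop over time (`f = convectionStep(f, c, left, right);`) keeps only one time level in memory at a time, so the storage no longer grows with the number of time steps.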
I have a problem with the following code:
int *chosen_pts = new int[k];
std::pair<float, int> *dist2 = new std::pair<float, int>[x.n];
// initialize dist2
for (int i = 0; i < x.n; ++i) {
dist2[i].first = std::numeric_limits<float>::max();
dist2[i].second = i;
}
// choose the first point randomly
int ndx = 1;
chosen_pts[ndx - 1] = rand() % x.n;
double begin, end;
double elapsed_secs;
while (ndx < k) {
float sum_distribution = 0.0;
// look for the point that is furthest from any center
begin = omp_get_wtime();
#pragma omp parallel for reduction(+:sum_distribution)
for (int i = 0; i < x.n; ++i) {
int example = dist2[i].second;
float d2 = 0.0, diff;
for (int j = 0; j < x.d; ++j) {
diff = x(example,j) - x(chosen_pts[ndx - 1],j);
d2 += diff * diff;
}
if (d2 < dist2[i].first) {
dist2[i].first = d2;
}
sum_distribution += dist2[i].first;
}
end = omp_get_wtime() - begin;
std::cout << "center assigning -- "
<< ndx << " of " << k << " = "
<< (float)ndx / k * 100
<< "% is done. Elasped time: "<< (float)end <<"\n";
/**/
bool unique = true;
do {
// choose a random interval according to the new distribution
float r = sum_distribution * (float)rand() / (float)RAND_MAX;
float sum_cdf = dist2[0].first;
int cdf_ndx = 0;
while (sum_cdf < r) {
sum_cdf += dist2[++cdf_ndx].first;
}
chosen_pts[ndx] = cdf_ndx;
for (int i = 0; i < ndx; ++i) {
unique = unique && (chosen_pts[ndx] != chosen_pts[i]);
}
} while (! unique);
++ndx;
}
As you can see, I use OpenMP to parallelize the for loop. It works fine and I can achieve a significant speed-up. However, if I increase the value of x.n over 20000000, the function stops working after 8-10 loops:
It doesn't produce any output (std::cout)
Only one core works
No error whatsoever
If I comment out the do-while loop, it works again as expected. All cores are busy, there is output after each iteration, and I can increase x.n over 100 million just as I need.
It's not OpenMP parallel for getting stuck, it's obviously in your serial do-while loop.
One particular issue that I see is that there are no array bounds checks in the inner while loop accessing dist2. In theory, out-of-bounds access should never happen; in practice it may, see below for why. So first of all I would rewrite the calculation of cdf_ndx to guarantee that the loop ends once all elements have been inspected:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
sum_cdf += dist2[cdf_ndx].first;
++cdf_ndx;
}
Now, how can it happen that sum_cdf does not reach r? It is due to the specifics of floating-point arithmetic and the fact that sum_distribution was computed in parallel, while sum_cdf is computed serially. The problem is that the contribution of one element to the sum can be below the precision of a float; in other words, when you add two float values that differ by more than ~8 orders of magnitude, the smaller one does not affect the sum.
So, with 20M floats, after some point it might happen that the next value to add is so small compared to the accumulated sum_cdf that adding it does not change the sum! On the other hand, sum_distribution was essentially computed as several independent partial sums (one per thread) that were then combined. Thus it is more accurate, and possibly bigger than sum_cdf can ever reach.
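The effect is easy to reproduce in isolation. The sketch below assumes IEEE-754 single precision and a build without -ffast-math; serialSum and blockedSum are names made up for this demonstration. Summing 20 million copies of 1.0f one by one gets stuck at 16777216 (2^24), because the next representable float after that is 2 away, while summing in blocks of 10000 (similar in spirit to the per-thread partial sums) gives exactly 20000000.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Plain left-to-right accumulation in a single float.
float serialSum(const std::vector<float>& v) {
    float sum = 0.0f;
    for (float x : v) {
        sum += x;
    }
    return sum;
}

// Accumulate each block separately, then add the block sums together.
float blockedSum(const std::vector<float>& v, std::size_t blockSize) {
    assert(blockSize > 0);
    float total = 0.0f;
    for (std::size_t start = 0; start < v.size(); start += blockSize) {
        const std::size_t end = std::min(start + blockSize, v.size());
        float block = 0.0f;
        for (std::size_t i = start; i < end; ++i) {
            block += v[i];
        }
        total += block;
    }
    return total;
}
```

With real distance values the gap will not be this clean, but the direction is the same: the serial sum can end up below the value that was computed from partial sums.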
A solution can be to compute sum_cdf in portions, having two nested loops. For example:
float sum_cdf = 0;
int cdf_ndx = 0;
while (sum_cdf < r && cdf_ndx < x.n ) {
float block_sum = 0;
int block_end = min(cdf_ndx+10000, x.n); // 10000 is arbitrary selected block size
for (int i=cdf_ndx; i<block_end; ++i ) {
block_sum += dist2[i].first;
if( sum_cdf+block_sum >=r ) {
block_end = i; // adjust to correctly compute cdf_ndx
break;
}
}
sum_cdf += block_sum;
cdf_ndx = block_end;
}
And after the loop you need to check that cdf_ndx < x.n, otherwise repeat with a new random interval.
I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
for (int k = 0; k < 128; k += 2) {
pplus = nodeL[indx2][k] - nodeL[indx1][k];
pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
As you can see, the function loops through the two row vectors, does some subtraction, and at the end returns a maximum. This function will be called a million times, so I was wondering if it can be accelerated with SSE instructions. I use Ubuntu 12.04 and gcc.
Of course it is a micro-optimization, but it would be helpful if you could provide some help, since I know nothing about SSE. Thanks in advance.
Benchmark:
int nofTestCases = 10000000;
vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);
for (int l = 0; l < nofTestCases; l++) {
nodeIds[l] = randomNodeID(18000000);
goalNodeIds[l] = randomNodeID(18000000);
}
double time, result;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}
/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
struct timeval tp;
gettimeofday(&tp, NULL);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>
int getDiff_SSE(int indx1, int indx2)
{
int result[4] __attribute__ ((aligned(16))) = { 0 };
const int * const p1 = &nodeL[indx1][0];
const int * const p2 = &nodeL[indx2][0];
const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);
__m128i vresult = _mm_set1_epi32(0);
for (int k = 0; k < 128; k += 4)
{
__m128i v1, v2, vmax;
v1 = _mm_loadu_si128((__m128i *)&p1[k]);
v2 = _mm_loadu_si128((__m128i *)&p2[k]);
v1 = _mm_xor_si128(v1, vke);
v2 = _mm_xor_si128(v2, vko);
v1 = _mm_sub_epi32(v1, vke);
v2 = _mm_sub_epi32(v2, vko);
vmax = _mm_add_epi32(v1, v2);
vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *)result, vresult);
return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You probably can get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than the memory, and a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.
The declaration of nodeL can have an effect on the performance so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benefit, and that's when you're doing more computation between memory reads, i.e. the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel, so that the CPU is kept busy whilst waiting for the data.
This will be faster. Double dereference of vector of vectors is expensive. Caching one of the dereferences will help. I know it's not answering the posted question but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
int result = 0;
int pplus, pminus, tmp;
const vector<int>& nodetemp1 = nodeL[indx1];
const vector<int>& nodetemp2 = nodeL[indx2];
for (int k = 0; k < 128; k += 2) {
pplus = nodetemp2[k] - nodetemp1[k];
pminus = nodetemp1[k + 1] - nodetemp2[k + 1];
tmp = max(pplus, pminus);
if (tmp > result) {
result = tmp;
}
}
return result;
}
A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.
I've tried to rewrite it using SSE instructions (AVX) using library here
The original code on my system ran in 11.5s
With Neil Kirk's optimisation, it went down to 10.5s
EDIT: Tested the code with a debugger rather than in my head!
int getDiff(std::vector<std::vector<int>>& nodeL,int row1, int row2) {
Vec4i result(0);
const std::vector<int>& nodetemp1 = nodeL[row1];
const std::vector<int>& nodetemp2 = nodeL[row2];
Vec8i mask(-1,0,-1,0,-1,0,-1,0);
for (int k = 0; k < 128; k += 8) {
Vec8i nodeA(nodetemp1[k],nodetemp1[k+1],nodetemp1[k+2],nodetemp1[k+3],nodetemp1[k+4],nodetemp1[k+5],nodetemp1[k+6],nodetemp1[k+7]);
Vec8i nodeB(nodetemp2[k],nodetemp2[k+1],nodetemp2[k+2],nodetemp2[k+3],nodetemp2[k+4],nodetemp2[k+5],nodetemp2[k+6],nodetemp2[k+7]);
Vec8i tmp = select(mask,nodeB-nodeA,nodeA-nodeB);
Vec4i tmp_a(tmp[0],tmp[2],tmp[4],tmp[6]);
Vec4i tmp_b(tmp[1],tmp[3],tmp[5],tmp[7]);
Vec4i max_tmp = max(tmp_a,tmp_b);
result = select(max_tmp > result,max_tmp,result);
}
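// result holds four per-lane maxima; reduce them with a maximum, not a sum
return std::max(std::max(result[0], result[1]), std::max(result[2], result[3]));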
}
The lack of branching speeds it up to 9.5s but still data is the biggest impact.
If you want to speed it up more, try to change the data structure to a single array/vector rather than a 2D one (a la std::vector of std::vector), as that will reduce cache pressure.
EDIT
I thought of something - you could add a custom allocator to ensure you allocate the 2*18M vectors in a contiguous block of memory which allows you to keep the data structure and still go through it quickly. But you'd need to profile it to be sure
EDIT 2: Tested the code with a debugger rather than in my head!
Sorry Alex, this should be better. Not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single array approach. Give this a go though.
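For what it's worth, the single-array approach could look roughly like this. It is only a sketch: flatten, getDiffFlat and kRowLength are names invented here, and it assumes every row has exactly 128 entries as in the question. The comparison logic is the same as in the scalar getDiff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kRowLength = 128;  // row width taken from the question

// Copy a vector of vectors into one contiguous row-major buffer.
std::vector<int> flatten(const std::vector<std::vector<int>>& rows) {
    std::vector<int> flat;
    flat.reserve(rows.size() * kRowLength);
    for (const auto& row : rows) {
        assert(row.size() == kRowLength);
        flat.insert(flat.end(), row.begin(), row.end());
    }
    return flat;
}

// Same computation as getDiff, but on the flat buffer: row r starts at
// offset r * kRowLength, so there is only one pointer to follow per row.
int getDiffFlat(const std::vector<int>& flat, std::size_t indx1, std::size_t indx2) {
    assert((indx1 + 1) * kRowLength <= flat.size());
    assert((indx2 + 1) * kRowLength <= flat.size());
    const int* p1 = flat.data() + indx1 * kRowLength;
    const int* p2 = flat.data() + indx2 * kRowLength;
    int result = 0;
    for (std::size_t k = 0; k < kRowLength; k += 2) {
        const int pplus = p2[k] - p1[k];
        const int pminus = p1[k + 1] - p2[k + 1];
        result = std::max(result, std::max(pplus, pminus));
    }
    return result;
}
```

In practice you would build the data directly in the flat layout rather than copying it; flatten is here only so the two layouts can be compared. As said above, profile before and after to see whether it actually helps on your data.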
I'm probably going to ask this incorrectly and make myself look very stupid, but here goes:
I'm trying to do some audio manipulation and processing on a .wav file. I am able to read all of the data (including the header), but I need the data in the frequency domain, and in order to do this I need to use an FFT.
I searched the internet high and low and found one; the example was taken from the "Numerical Recipes in C" book, but I amended it to use vectors instead of arrays. OK, so here's the problem:
I have been given (as an example to use) a series of numbers and a sampling rate:
X = {50, 206, -100, -65, -50, -6, 100, -135}
Sampling Rate : 8000
Number of Samples: 8
And should therefore answer this:
0Hz A=0 D=1.57079633
1000Hz A=50 D=1.57079633
2000HZ A=100 D=0
3000HZ A=100 D=0
4000HZ A=0 D=3.14159265
The code that I rewrote compiles; however, when I feed these numbers into the function I get a segmentation fault. Is there something wrong with my code, or is the sampling rate too high? (The algorithm doesn't segfault when using a much, much smaller sampling rate.) Here is the code:
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;
#define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr;
#define pi 3.14159
void ComplexFFT(vector<float> &realData, vector<float> &actualData, unsigned long sample_num, unsigned int sample_rate, int sign)
{
unsigned long n, mmax, m, j, istep, i;
double wtemp,wr,wpr,wpi,wi,theta,tempr,tempi;
// CHECK TO SEE IF VECTOR IS EMPTY;
actualData.resize(2*sample_rate, 0);
for(n=0; (n < sample_rate); n++)
{
if(n < sample_num)
{
actualData[2*n] = realData[n];
}else{
actualData[2*n] = 0;
actualData[2*n+1] = 0;
}
}
// Binary Inversion
n = sample_rate << 1;
j = 0;
for(i=0; (i< n /2); i+=2)
{
if(j > i)
{
SWAP(actualData[j], actualData[i]);
SWAP(actualData[j+1], actualData[i+1]);
if((j/2)<(n/4))
{
SWAP(actualData[(n-(i+2))], actualData[(n-(j+2))]);
SWAP(actualData[(n-(i+2))+1], actualData[(n-(j+2))+1]);
}
}
m = n >> 1;
while (m >= 2 && j >= m) {
j -= m;
m >>= 1;
}
j += m;
}
mmax=2;
while(n > mmax) {
istep = mmax << 1;
theta = sign * (2*pi/mmax);
wtemp = sin(0.5*theta);
wpr = -2.0*wtemp*wtemp;
wpi = sin(theta);
wr = 1.0;
wi = 0.0;
for(m=1; (m < mmax); m+=2) {
for(i=m; (i <= n); i += istep)
{
j = i*mmax;
tempr = wr*actualData[j-1]-wi*actualData[j];
tempi = wr*actualData[j]+wi*actualData[j-1];
actualData[j-1] = actualData[i-1] - tempr;
actualData[j] = actualData[i]-tempi;
actualData[i-1] += tempr;
actualData[i] += tempi;
}
wr = (wtemp=wr)*wpr-wi*wpi+wr;
wi = wi*wpr+wtemp*wpi+wi;
}
mmax = istep;
}
// determine if the fundamental frequency
int fundemental_frequency = 0;
for(i=2; (i <= sample_rate); i+=2)
{
if((pow(actualData[i], 2)+pow(actualData[i+1], 2)) > pow(actualData[fundemental_frequency], 2)+pow(actualData[fundemental_frequency+1], 2)) {
fundemental_frequency = i;
}
}
}
int main(int argc, char *argv[]) {
vector<float> numbers;
vector<float> realNumbers;
numbers.push_back(50);
numbers.push_back(206);
numbers.push_back(-100);
numbers.push_back(-65);
numbers.push_back(-50);
numbers.push_back(-6);
numbers.push_back(100);
numbers.push_back(-135);
ComplexFFT(numbers, realNumbers, 8, 8000, 0);
for(int i=0; (i < realNumbers.size()); i++)
{
cout << realNumbers[i] << "\n";
}
}
The other thing (I know this sounds stupid) is that I don't really know what is expected of the "int sign" that is being passed to the ComplexFFT function; this is where I could be going wrong.
Does anyone have any suggestions or solutions to this problem?
Thank you :)
I think the problem lies in errors in how you translated the algorithm.
Did you mean to initialize j to 1 rather than 0?
for(i = 0; (i < n/2); i += 2) should probably be for (i = 1; i < n; i += 2).
Your SWAPs should probably be
SWAP(actualData[j - 1], actualData[i - 1]);
SWAP(actualData[j], actualData[i]);
What are the following SWAPs for? I don't think they're needed.
if((j/2)<(n/4))
{
SWAP(actualData[(n-(i+2))], actualData[(n-(j+2))]);
SWAP(actualData[(n-(i+2))+1], actualData[(n-(j+2))+1]);
}
The j >= m in while (m >= 2 && j >= m) should probably be j > m if you intended to do bit reversal.
In the code implementing the Danielson-Lanczos section, are you sure j = i*mmax; was not supposed to be an addition, i.e. j = i + mmax;?
Apart from that, there are a lot of things you can do to simplify your code.
Using your SWAP macro should be discouraged when you can just use std::swap... I was going to suggest std::swap_ranges, but then I realized you only need to swap the real parts, since your data is all reals (your time-series imaginary parts are all 0):
std::swap(actualData[j - 1], actualData[i - 1]);
You can simplify the entire thing using std::complex, too.
I reckon it's down to the resizing of your vector.
One possibility: maybe resizing creates temporary objects on the stack before moving them back to the heap, I think.
The FFT in Numerical Recipes in C uses the Cooley-Tukey Algorithm, so in answer to your question at the end, the int sign being passed allows the same routine to be used to compute both the forward (sign=-1) and inverse (sign=1) FFT. This seems to be consistent with the way you are using sign when you define theta = sign * (2*pi/mmax).
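To show the role of sign without the index juggling, here is a compact recursive radix-2 sketch using std::complex, as suggested in one of the other answers. It is not the Numerical Recipes routine; the function name fft and the convention that sign = -1 means "forward" are assumptions made for this example, chosen to follow the description above. The input length must be a power of two, and the inverse still has to be divided by N afterwards.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// In-place recursive radix-2 FFT. sign = -1 uses exp(-2*pi*i*k/n) twiddles
// (treated as "forward" here), sign = +1 uses exp(+2*pi*i*k/n) ("inverse",
// unnormalised: divide by n afterwards).
void fft(std::vector<std::complex<double>>& a, int sign) {
    const std::size_t n = a.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // power of two
    assert(sign == 1 || sign == -1);
    if (n == 1) {
        return;
    }
    std::vector<std::complex<double>> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i] = a[2 * i + 1];
    }
    fft(even, sign);
    fft(odd, sign);
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle = sign * 2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const std::complex<double> t = std::polar(1.0, angle) * odd[k];
        a[k] = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
```

For the eight samples in the question, `fft(x, -1)` followed by `fft(x, +1)` and a division by 8 gives back the original samples, which is a handy sanity check for any FFT you write or port. Passing sign = 0, as in the question's main, would make theta zero and so would not compute a transform at all.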