Audio Processing C++ - FFT - c++

I'm probably going to ask this incorrectly and make myself look very stupid, but here goes:
I'm trying to do some audio manipulation and processing on a .wav file. I am able to read all of the data (including the header), but I need the data in the frequency domain, and in order to do this I need to use an FFT.
I searched the internet high and low and found one; the example was taken from the "Numerical Recipes in C" book, but I amended it to use vectors instead of arrays. OK, so here's the problem:
I have been given (as an example to use) a series of numbers and a sampling rate:
X = {50, 206, -100, -65, -50, -6, 100, -135}
Sampling Rate : 8000
Number of Samples: 8
The result should therefore be:
0Hz A=0 D=1.57079633
1000Hz A=50 D=1.57079633
2000Hz A=100 D=0
3000Hz A=100 D=0
4000Hz A=0 D=3.14159265
The code that I rewrote compiles; however, when I pass these numbers to the function I get a segmentation fault. Is there something wrong with my code, or is the sampling rate too high? (The algorithm doesn't segfault when using a much, much smaller sampling rate.) Here is the code:
#include <iostream>
#include <math.h>
#include <vector>
using namespace std;
#define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr;
#define pi 3.14159
void ComplexFFT(vector<float> &realData, vector<float> &actualData, unsigned long sample_num, unsigned int sample_rate, int sign)
{
unsigned long n, mmax, m, j, istep, i;
double wtemp,wr,wpr,wpi,wi,theta,tempr,tempi;
// CHECK TO SEE IF VECTOR IS EMPTY;
actualData.resize(2*sample_rate, 0);
for(n=0; (n < sample_rate); n++)
{
if(n < sample_num)
{
actualData[2*n] = realData[n];
}else{
actualData[2*n] = 0;
actualData[2*n+1] = 0;
}
}
// Binary Inversion
n = sample_rate << 1;
j = 0;
for(i=0; (i< n /2); i+=2)
{
if(j > i)
{
SWAP(actualData[j], actualData[i]);
SWAP(actualData[j+1], actualData[i+1]);
if((j/2)<(n/4))
{
SWAP(actualData[(n-(i+2))], actualData[(n-(j+2))]);
SWAP(actualData[(n-(i+2))+1], actualData[(n-(j+2))+1]);
}
}
m = n >> 1;
while (m >= 2 && j >= m) {
j -= m;
m >>= 1;
}
j += m;
}
mmax=2;
while(n > mmax) {
istep = mmax << 1;
theta = sign * (2*pi/mmax);
wtemp = sin(0.5*theta);
wpr = -2.0*wtemp*wtemp;
wpi = sin(theta);
wr = 1.0;
wi = 0.0;
for(m=1; (m < mmax); m+=2) {
for(i=m; (i <= n); i += istep)
{
j = i*mmax;
tempr = wr*actualData[j-1]-wi*actualData[j];
tempi = wr*actualData[j]+wi*actualData[j-1];
actualData[j-1] = actualData[i-1] - tempr;
actualData[j] = actualData[i]-tempi;
actualData[i-1] += tempr;
actualData[i] += tempi;
}
wr = (wtemp=wr)*wpr-wi*wpi+wr;
wi = wi*wpr+wtemp*wpi+wi;
}
mmax = istep;
}
// determine if the fundamental frequency
int fundemental_frequency = 0;
for(i=2; (i <= sample_rate); i+=2)
{
if((pow(actualData[i], 2)+pow(actualData[i+1], 2)) > pow(actualData[fundemental_frequency], 2)+pow(actualData[fundemental_frequency+1], 2)) {
fundemental_frequency = i;
}
}
}
int main(int argc, char *argv[]) {
vector<float> numbers;
vector<float> realNumbers;
numbers.push_back(50);
numbers.push_back(206);
numbers.push_back(-100);
numbers.push_back(-65);
numbers.push_back(-50);
numbers.push_back(-6);
numbers.push_back(100);
numbers.push_back(-135);
ComplexFFT(numbers, realNumbers, 8, 8000, 0);
for(int i=0; (i < realNumbers.size()); i++)
{
cout << realNumbers[i] << "\n";
}
}
The other thing (I know this sounds stupid) is that I don't really know what is expected of the "int sign" that is being passed to the ComplexFFT function; this is where I could be going wrong.
Does anyone have any suggestions or solutions to this problem?
Thank you :)

I think the problem lies in errors in how you translated the algorithm.
Did you mean to initialize j to 1 rather than 0?
for(i = 0; (i < n/2); i += 2) should probably be for (i = 1; i < n; i += 2).
Your SWAPs should probably be
SWAP(actualData[j - 1], actualData[i - 1]);
SWAP(actualData[j], actualData[i]);
What are the following SWAPs for? I don't think they're needed.
if((j/2)<(n/4))
{
SWAP(actualData[(n-(i+2))], actualData[(n-(j+2))]);
SWAP(actualData[(n-(i+2))+1], actualData[(n-(j+2))+1]);
}
The j >= m in while (m >= 2 && j >= m) should probably be j > m if you intended to do bit reversal.
In the code implementing the Danielson-Lanczos section, are you sure j = i*mmax; was not supposed to be an addition, i.e. j = i + mmax;?
Apart from that, there are a lot of things you can do to simplify your code.
Using your SWAP macro should be discouraged when you can just use std::swap... I was going to suggest std::swap_ranges, but then I realized you only need to swap the real parts, since your data is all reals (your time-series imaginary parts are all 0):
std::swap(actualData[j - 1], actualData[i - 1]);
You can simplify the entire thing using std::complex, too.
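To illustrate the std::complex suggestion, here is a minimal sketch of an in-place radix-2 FFT on a std::vector of std::complex<double>. It is not the Numerical Recipes routine: it uses 0-based indexing, assumes the length is a power of two, and takes sign = -1 to mean the forward transform (an assumed convention). The name fft and the Complex alias are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Complex = std::complex<double>;  // hypothetical alias for this sketch

// In-place iterative radix-2 Cooley-Tukey FFT (0-based indexing).
// sign = -1: forward transform; sign = +1: inverse without the 1/N scaling.
// Assumption: data.size() is a power of two.
void fft(std::vector<Complex>& data, int sign)
{
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    const double pi = std::acos(-1.0);

    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(data[i], data[j]);
    }

    // Danielson-Lanczos butterflies, doubling the sub-transform length each pass.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double theta = sign * 2.0 * pi / static_cast<double>(len);
        const Complex wlen(std::cos(theta), std::sin(theta));
        for (std::size_t i = 0; i < n; i += len) {
            Complex w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const Complex u = data[i + k];
                const Complex v = data[i + k + len / 2] * w;
                data[i + k] = u + v;
                data[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}
```

For the 8 samples from the question, bin k corresponds to k * 8000 / 8 = k * 1000 Hz. Calling fft(X, -1) on them gives 0 for bins 0 and 4 and a magnitude of 400 for bin 2, which scaled by 2/N is the expected amplitude of 100 at 2000 Hz. Note that the transform length here is the number of samples (8), not the sampling rate.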

I reckon it's down to the resizing of your vector.
One possibility: maybe resizing creates temporary objects on the stack before moving them back to the heap, I think.

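The FFT in Numerical Recipes in C uses the Cooley-Tukey algorithm, so in answer to your question at the end, the int sign being passed allows the same routine to be used to compute both the forward and the inverse FFT, depending on whether it is -1 or 1. Which value means "forward" depends on the sign convention chosen for the exponent: with the common negative-exponent convention, sign=-1 is the forward transform and sign=1 the inverse, while Numerical Recipes itself documents isign=1 as the forward transform. This is consistent with the way you are using sign when you define theta = sign * (2*pi/mmax). Note that your main passes 0, which makes theta zero in every stage, so every twiddle factor is 1 and the result is not a Fourier transform.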


How to optimize and speed up matrix multiplication in C++?

This is an optimized implementation of matrix multiplication. The routine performs the following matrix multiplication operation:
C := C + A * B (where A, B, and C are n-by-n matrices stored in column-major format)
On exit, A and B maintain their input values.
void matmul_optimized(int n, int *A, int *B, int *C)
{
// The calculation is done bitwise:
// "multiplication" of two terms is & and "addition" is ^
int i, j, k;
int cij;
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
cij = C[i + j * n]; // start from the existing entry of C and accumulate in a local variable
for (k = 0; k < n; ++k) {
cij ^= A[i + k * n] & B[k + j * n]; // terms are multiplied with & and added with ^
}
C[i + j * n] = cij; // store the final result into C
}
}
}
How can I speed up the matrix multiplication further, based on the above function?
The function has been tested on matrices up to 2048 by 2048.
The function matmul_optimized is run and timed by the following test program:
#include <stdio.h>
#include <stdlib.h>
#include "cpucycles.c"
#include "helper_functions.c"
#include "matmul_reference.c"
#include "matmul_optimized.c"
int main()
{
int i, j;
int n = 1024; // Number of rows or columns in the square matrices
int *A, *B; // Input matrices
int *C1, *C2; // Output matrices from the reference and optimized implementations
// Performance and correctness measurement declarations
long int CLOCK_start, CLOCK_end, CLOCK_total, CLOCK_ref, CLOCK_opt;
long int COUNTER, REPEAT = 5;
int difference;
float speedup;
// Allocate memory for the matrices
A = malloc(n * n * sizeof(int));
B = malloc(n * n * sizeof(int));
C1 = malloc(n * n * sizeof(int));
C2 = malloc(n * n * sizeof(int));
// Fill bits in A, B, C1
fill(A, n * n);
fill(B, n * n);
fill(C1, n * n);
// Initialize C2 = C1
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
C2[i * n + j] = C1[i * n + j];
// Measure performance of the reference implementation
CLOCK_total = 0;
for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
{
CLOCK_start = cpucycles();
matmul_reference(n, A, B, C1);
CLOCK_end = cpucycles();
CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
}
CLOCK_ref = CLOCK_total / REPEAT;
printf("n=%d Avg cycle count for reference implementation = %ld\n", n, CLOCK_ref);
// Measure performance of the optimized implementation
CLOCK_total = 0;
for (COUNTER = 0; COUNTER < REPEAT; COUNTER++)
{
CLOCK_start = cpucycles();
matmul_optimized(n, A, B, C2);
CLOCK_end = cpucycles();
CLOCK_total = CLOCK_total + CLOCK_end - CLOCK_start;
}
CLOCK_opt = CLOCK_total / REPEAT;
printf("n=%d Avg cycle count for optimized implementation = %ld\n", n, CLOCK_opt);
speedup = (float)CLOCK_ref / (float)CLOCK_opt;
// Check correctness by comparing C1 and C2
difference = 0;
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
difference = difference + C1[i * n + j] - C2[i * n + j];
if (difference == 0)
printf("Speedup factor = %.2f\n", speedup);
if (difference != 0)
printf("Reference and optimized implementations do not match\n");
//print(C2, n);
free(A);
free(B);
free(C1);
free(C2);
return 0;
}
You can try an algorithm like Strassen or Coppersmith-Winograd and here is also a good example.
Or maybe try parallel computing, for example with std::async or std::thread.
Optimizing matrix-matrix multiplication requires careful attention to be paid to a number of issues:
First, you need to be able to use vector instructions. Only vector instructions can access parallelism inherent in the architecture. So, either your compiler needs to be able to automatically map to vector instructions, or you have to do so by hand, for example by calling the vector intrinsic library for AVX-2 instructions (for x86 architectures).
Next, you need to pay careful attention to the memory hierarchy. Your performance can easily drop to less than 5% of peak if you don't do this.
Once you do this right, you will hopefully have broken the computation up into small enough computational chunks that you can also parallelize via OpenMP or pthreads.
A document that carefully steps through what is required can be found at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html. (This is very much a work in progress.) At the end of it all, you will have an implementation that gets close to the performance attained by high-performance libraries like Intel's Math Kernel Library (MKL) or the BLAS-like Library Instantiation Software (BLIS).
(And, actually, you CAN then also effectively incorporate Strassen's algorithm. But that is another story, told in Unit 3.5.3 of these notes.)
You may find the following thread relevant: How does BLAS get such extreme performance?
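As a small illustration of the memory-hierarchy point, the loops can be reordered so that the innermost loop walks down a column, which is contiguous in column-major storage. This is only a sketch of one step (no blocking, vector intrinsics or threads), and the function names are made up here. Because ^ is associative and commutative, the reordered version produces the same C as the i-j-k loop in the question; any speed-up has to be measured on the target machine.

```cpp
#include <cassert>

// Baseline: the i-j-k order from the question (column-major, & as multiply, ^ as add).
void matmul_xor_ijk(int n, const int* A, const int* B, int* C)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            int cij = C[i + j * n];
            for (int k = 0; k < n; ++k) {
                cij ^= A[i + k * n] & B[k + j * n];  // stride-n access to A
            }
            C[i + j * n] = cij;
        }
    }
}

// Reordered j-k-i variant: the inner loop reads A and updates C with unit stride.
void matmul_xor_jki(int n, const int* A, const int* B, int* C)
{
    assert(n >= 0);
    for (int j = 0; j < n; ++j) {
        for (int k = 0; k < n; ++k) {
            const int b = B[k + j * n];       // constant during the inner loop
            const int* a_col = A + k * n;     // column k of A
            int* c_col = C + j * n;           // column j of C
            for (int i = 0; i < n; ++i) {
                c_col[i] ^= a_col[i] & b;
            }
        }
    }
}
```

A loop of this shape is also easier for a compiler to auto-vectorize, since the inner iterations are independent and contiguous.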

Using Mean Squared error instead of SAD for image compensation

I have an assignment where image composition is done using SAD (sum of absolute differences). Another task is to use MSE (mean squared error) instead of SAD in the code. I'm struggling with it, so can anyone help me with this? Here is the code for SAD.
mvector find_motion(my_image_comp *ref, my_image_comp *tgt,
int start_row, int start_col, int block_width, int block_height)
/* This function finds the motion vector which best describes the motion
between the `ref' and `tgt' frames, over a specified block in the
`tgt' frame. Specifically, the block in the `tgt' frame commences
at the coordinates given by `start_row' and `start_col' and extends
over `block_width' columns and `block_height' rows. The function finds
the translational offset (the returned vector) which describes the
best matching block of the same size in the `ref' frame, where
the "best match" is interpreted as the one which minimizes the sum of
absolute differences (SAD) metric. */
{
mvector vec, best_vec;
int sad, best_sad=256*block_width*block_height;
for (vec.y=-8; vec.y <= 8; vec.y++)
for (vec.x=-8; vec.x <= 8; vec.x++)
{
int ref_row = start_row-vec.y;
int ref_col = start_col-vec.x;
if ((ref_row < 0) || (ref_col < 0) ||
((ref_row+block_height) > ref->height) ||
((ref_col+block_width) > ref->width))
continue; // Translated block not contained within reference frame
int r, c;
int *rp = ref->buf + ref_row*ref->stride + ref_col;
int *tp = tgt->buf + start_row*tgt->stride + start_col;
for (sad=0, r=block_height; r > 0; r--,
rp+=ref->stride, tp+=tgt->stride)
for (c=0; c < block_width; c++)
{
int diff = tp[c] - rp[c];
sad += (diff < 0)?(-diff):diff;
}
if (sad < best_sad)
{
best_sad = sad;
best_vec = vec;
}
}
return best_vec;
}
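I think I got the answer myself.
It is:
for (mse = 0, r = block_height; r > 0; r--,
rp+=ref->stride, tp+=tgt->stride)
for (c=0; c < block_width; c++)
{
int diff = tp[c] - rp[c];
mse += diff*diff; // accumulate the squared difference
}
// Divide once at the end: dividing each term by the block size in
// integer arithmetic would truncate small squared differences to 0.
mse /= (block_height*block_width);
// Note: best_mse must start above the largest possible MSE.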
if (mse < best_mse)
{
best_mse = mse;
best_vec = vec;
}
}
return best_vec;
}
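A self-contained sketch of the block metric on its own may make the difference from SAD clearer. The helper names block_ssd and block_mse are made up for this example and are not part of the assignment code; the pointers and strides play the same role as tp, rp and the stride fields above.

```cpp
#include <cassert>

// Sum of squared differences between two blocks of block_width x block_height
// samples. tp/rp point at the top-left sample of each block; the strides are
// the distances (in samples) between consecutive rows.
long long block_ssd(const int* tp, int t_stride,
                    const int* rp, int r_stride,
                    int block_width, int block_height)
{
    assert(block_width > 0 && block_height > 0);
    long long ssd = 0;
    for (int r = 0; r < block_height; ++r, tp += t_stride, rp += r_stride) {
        for (int c = 0; c < block_width; ++c) {
            const long long diff = tp[c] - rp[c];
            ssd += diff * diff;  // squared instead of absolute difference
        }
    }
    return ssd;
}

// Mean squared error: the sum above divided once by the number of samples.
double block_mse(const int* tp, int t_stride,
                 const int* rp, int r_stride,
                 int block_width, int block_height)
{
    const long long ssd =
        block_ssd(tp, t_stride, rp, r_stride, block_width, block_height);
    return static_cast<double>(ssd) /
           (static_cast<double>(block_width) * block_height);
}
```

Since the block size is the same for every candidate vector, comparing the sums of squared differences picks the same best vector as comparing the means, so the division can be skipped inside the search loop.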

How does this strange I/O method work?

For input and output in C++ I have only used scanf/printf and cin/cout. I recently came across this code, which does I/O in a strange fashion.
Also note that this I/O method is causing the code to run extremely fast, as this code uses almost the same algorithm as most of the other codes but it executes in a much smaller time. Why is this I/O so fast and how does this work in general?
edit: code
#include <bits/stdtr1c++.h>
#define MAXN 200010
#define MAXQ 200010
#define MAXV 1000010
#define clr(ar) memset(ar, 0, sizeof(ar))
#define read() freopen("lol.txt", "r", stdin)
using namespace std;
const int block_size = 633;
long long res, out[MAXQ]; int n, q, ar[MAXN], val[MAXN], freq[MAXV];
namespace fastio{
int ptr, ye;
char temp[25], str[8333667], out[8333669];
void init(){
ptr = 0, ye = 0;
fread(str, 1, 8333667, stdin);
}
inline int number(){
int i, j, val = 0;
while (str[ptr] < 45 || str[ptr] > 57) ptr++;
while (str[ptr] > 47 && str[ptr] < 58) val = (val * 10) + (str[ptr++] - 48);
return val;
}
inline void convert(long long x){
int i, d = 0;
for (; ;){
temp[++d] = (x % 10) + 48;
x /= 10;
if (!x) break;
}
for (i = d; i; i--) out[ye++] = temp[i];
out[ye++] = 10;
}
inline void print(){
fwrite(out, 1, ye, stdout);
} }
struct query{
int l, r, d, i;
inline query() {}
inline query(int a, int b, int c){
i = c;
l = a, r = b, d = l / block_size;
}
inline bool operator < (const query& other) const{
if (d != other.d) return (d < other.d);
return ((d & 1) ? (r < other.r) : (r > other.r));
} } Q[MAXQ];
void compress(int n, int* in, int* out){
unordered_map <int, int> mp;
for (int i = 0; i < n; i++) out[i] = mp.emplace(in[i], mp.size()).first->second; }
inline void insert(int i){
res += (long long)val[i] * (1 + 2 * freq[ar[i]]++); }
inline void erase(int i){
res -= (long long)val[i] * (1 + 2 * --freq[ar[i]]); }
inline void run(){
sort(Q, Q + q);
int i, l, r, a = 0, b = 0;
for (res = 0, i = 0; i < q; i++){
l = Q[i].l, r = Q[i].r;
while (a > l) insert(--a);
while (b <= r) insert(b++);
while (a < l) erase(a++);
while (b > (r + 1)) erase(--b);
out[Q[i].i] = res;
}
for (i = 0; i < q; i++) fastio::convert(out[i]); }
int main(){
fastio::init();
int n, i, j, k, a, b;
n = fastio::number();
q = fastio::number();
for (i = 0; i < n; i++) val[i] = fastio::number();
compress(n, val, ar);
for (i = 0; i < q; i++){
a = fastio::number();
b = fastio::number();
Q[i] = query(a - 1, b - 1, i);
}
run();
fastio::print();
return 0; }
This solution, http://codeforces.com/contest/86/submission/22526466 (624 ms, 32 MB of RAM used), uses a single fread and manual parsing of numbers from memory (so it uses more memory); many other solutions are slower and use scanf (http://codeforces.com/contest/86/submission/27561563, 1620 ms, 9 MB) or C++ iostream cin (http://codeforces.com/contest/86/submission/27558562, 3118 ms, 15 MB). Not all of the difference between the solutions comes from input/output and parsing (the solution methods differ too), but some of it does.
fread(str, 1, 8333667, stdin);
This code uses a single fread library call to read up to 8 MB, which is the whole file. The file may contain up to 2 (n, t) + 200000 (a_i) + 2*200000 (l, r) numbers of 6 or 7 digits each, separated by line breaks or by one (?) space, so around 8 characters at most per number (6 or 7 for the number, as 1000000 is allowed too, plus 1 for the space or \n); the maximum input file size is about 0.6 M * 8 bytes =~ 5 MB.
inline int number(){
int i, j, val = 0;
while (str[ptr] < 45 || str[ptr] > 57) ptr++;
while (str[ptr] > 47 && str[ptr] < 58) val = (val * 10) + (str[ptr++] - 48);
return val;
}
The code then parses decimal integers manually. According to the ASCII table, http://www.asciitable.com/, the decimal codes 48...57 are the decimal digits '0'...'9' (second while loop), so we can subtract 48 from the character code to get the digit, multiply the partially read val by 10 and add the current digit. The chr < 45 || chr > 57 test in the first while loop is meant to skip non-digits in the input. It is not quite correct, because it does not skip codes 45, 46, 47 = '-', '.', '/': when one of these characters is reached, the second loop reads no digits, ptr does not advance, and no number after that character will ever be read.
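For comparison, here is a sketch of a buffer parser that bounds-checks every access and treats '-' as a sign only when a digit follows it. The BufferReader class and its next_int method are hypothetical names for this example, not part of the submitted solution; in a real program the buffer would be filled with one fread call as above.

```cpp
#include <cassert>
#include <cstddef>

// Parses decimal integers out of an in-memory buffer.
class BufferReader {
public:
    BufferReader(const char* data, std::size_t size)
        : buf_(data), len_(size), pos_(0)
    {
        assert(data != nullptr);
    }

    // Stores the next integer in `out` and returns true,
    // or returns false when the buffer holds no further number.
    bool next_int(long long& out)
    {
        // Skip everything that cannot start a number.
        while (pos_ < len_) {
            const char c = buf_[pos_];
            if (is_digit(c)) break;
            if (c == '-' && pos_ + 1 < len_ && is_digit(buf_[pos_ + 1])) break;
            ++pos_;
        }
        if (pos_ >= len_) return false;

        bool negative = false;
        if (buf_[pos_] == '-') {
            negative = true;
            ++pos_;
        }
        long long value = 0;
        while (pos_ < len_ && is_digit(buf_[pos_])) {
            value = value * 10 + (buf_[pos_] - '0');
            ++pos_;
        }
        out = negative ? -value : value;
        return true;
    }

private:
    static bool is_digit(char c) { return c >= '0' && c <= '9'; }

    const char* buf_;
    std::size_t len_;
    std::size_t pos_;
};
```

The extra checks cost a few comparisons per character, which is still far less work than a scanf call per number. Overflow of very long digit runs is not handled in this sketch.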
n = fastio::number();
q = fastio::number();
for (i = 0; i < n; i++) val[i] = fastio::number();
for (i = 0; i < q; i++){
a = fastio::number();
b = fastio::number();
The actual reading uses this fastio::number() method, while other solutions call scanf or the iostream operator >> in a loop:
for (int i = 0; i < N; i++) {
scanf("%d", &(arr[i]));
add(arr[i]);
}
or
for (int i = 1; i <= n; ++i)
cin >> a[i];
Both methods are more universal, but each performs a library call, which reads some characters from an internal buffer (of about 4 KB) or makes an OS syscall to refill the buffer, and every function does many checks and has error reporting. For every input number, scanf re-parses the same format string given as its first argument and performs all the logic described in POSIX http://pubs.opengroup.org/onlinepubs/7908799/xsh/fscanf.html, including all the error checking. C++ iostream has no format string, but it is still more universal: https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/istream.tcc#L156 'operator>>(int& __n)'.
So standard library functions have more logic inside, more calls and more branching; they are more universal and much safer, and should be used in real-world programming. This "sport programming" contest allows users to solve the task with standard library functions, which are fast enough if you can come up with the algorithm. Authors of a task are required to write several solutions with standard I/O functions to check that the time limit of the task is correct and that the task can be solved. (The TopCoder system is better with I/O: you do not implement I/O yourself, as the data is already passed into your function in some language structs/collections.)
Sometimes tasks in sport programming have tight limits on memory: the input files are several times bigger than the allowed memory usage, and the programmer can't read the whole file into memory. For example: read the 20 million digits of a single very long number from the input file and add 1 to it, with a memory limit of 2 MB. You can't read the full input number from the file in the forward direction, and it is very hard to read correctly in chunks in the backward direction; you need to forget the standard method of addition (columnar addition) and build an FSM (finite-state machine) whose state counts sequences of 9s.

Julia Set rendering code

I am working on escape-time fractals as my 12th grade project, to be written in C++, using the simple graphics.h library, which is outdated but seems sufficient.
The code for generating the Mandelbrot set seems to work, and I assumed that Julia sets would be a variation of the same. Here is the code:
(Here, fx and fy are simply functions to convert the actual complex co-ordinates like (-0.003,0.05) to an actual value of a pixel on the screen.)
int p;
x0=0, y0=0;
long double r, i;
cout<<"Enter c"<<endl;
cin>>r>>i;
for(int i= fx(-2); i<=fx(2); i++)
{
for(int j= fy(-2); j>=fy(2); j--)
{
long double x=0.0, y= 0.0,t;
x= gx(i), y= gy(j);
int k= -1;
while(( x*x + y*y <4)&& k<it-1)
{
t= x*x - y*y + r;
y= 2*x*y + i ;
x=t;
k++;
}
p= k*pd;
setcolor(COLOR(colour[p][0],colour[p][1],colour[p][2]));
putpixel(i,j,getcolor());
}
}
But this does not seem to be the case. The output window shows the entire circle of radius 2 in the colour corresponding to an escape time of 1 iteration.
Also, while searching for a solution to this problem, I've seen that all the algorithms others have used initialize the initial co-ordinates somewhat like this:
x = (col - width/2)*4.0/width;
y = (row - height/2)*4.0/width;
Could somebody explain what I'm missing?
I guess that the main problem is that the variable i (the imaginary part of c) is unintentionally shadowed by the loop variable i. So the line
y= 2*x*y + i;
gives an incorrect result. This variable should be renamed to, say, im. The corrected version is attached below. Since I don't have graphics.h, I used the screen as the output.
#include <iostream>
using namespace std;
#define WIDTH 40
#define HEIGHT 60
/* real to screen */
#define fx(x) ((int) ((x + 2)/4.0 * WIDTH))
#define fy(y) ((int) ((2 - y)/4.0 * HEIGHT))
/* screen to real */
#define gx(i) ((i)*4.0/WIDTH - 2)
#define gy(j) ((j)*4.0/HEIGHT - 2)
static void julia(int it, int pd)
{
int p;
long double re = -0.75, im = 0;
long double x0 = 0, y0 = 0;
cout << "Enter c" << endl;
cin >> re >> im;
for (int i = fx(-2.0); i <= fx(2.0); i++)
{
for (int j = fy(-2.0); j >= fy(2.0); j--)
{
long double x = gx(i), y = gy(j), t;
int k = 0;
while (x*x + y*y < 4 && k < it)
{
t = x*x - y*y + re;
y = 2*x*y + im;
x = t;
k++;
}
p = (int) (k * pd);
//setcolor(COLOR(colour[p][0],colour[p][1],colour[p][2]));
//putpixel(i,j,getcolor());
cout << p; // for ASCII output
}
cout << endl; // for ASCII output
}
}
int main(void)
{
julia(9, 1);
return 0;
}
and the output with input -0.75 0 is given below.
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000001111111111111111111000000000000000000000
0000000000000000011111111111111111111111111100000000000000000
0000000000000001111111111111111111111111111111000000000000000
0000000000000111111111111122222222211111111111110000000000000
0000000000011111111111122222349432222211111111111100000000000
0000000001111111111112222233479743322222111111111111000000000
0000000011111111111222222334999994332222221111111111100000000
0000000111111111112222223345999995433222222111111111110000000
0000011111111111122222234479999999744322222211111111111100000
0000011111111111222222346899999999986432222221111111111100000
0000111111111111222223359999999999999533222221111111111110000
0001111111111112222233446999999999996443322222111111111111000
0011111111111112222233446999999999996443322222111111111111100
0011111111111122222333456899999999986543332222211111111111100
0111111111111122223334557999999999997554333222211111111111110
0111111111111122233345799999999999999975433322211111111111110
0111111111111122233457999999999999999997543322211111111111110
0111111111111122334469999999999999999999644332211111111111110
0111111111111122345999999999999999999999995432211111111111110
0111111111111122379999999999999999999999999732211111111111110
0111111111111122345999999999999999999999995432211111111111110
0111111111111122334469999999999999999999644332211111111111110
0111111111111122233457999999999999999997543322211111111111110
0111111111111122233345799999999999999975433322211111111111110
0111111111111122223334557999999999997554333222211111111111110
0011111111111122222333456899999999986543332222211111111111100
0011111111111112222233446999999999996443322222111111111111100
0001111111111112222233446999999999996443322222111111111111000
0000111111111111222223359999999999999533222221111111111110000
0000011111111111222222346899999999986432222221111111111100000
0000011111111111122222234479999999744322222211111111111100000
0000000111111111112222223345999995433222222111111111110000000
0000000011111111111222222334999994332222221111111111100000000
0000000001111111111112222233479743322222111111111111000000000
0000000000011111111111122222349432222211111111111100000000000
0000000000000111111111111122222222211111111111110000000000000
0000000000000001111111111111111111111111111111000000000000000
0000000000000000011111111111111111111111111100000000000000000
0000000000000000000001111111111111111111000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
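The shadowing problem described in this answer can be reproduced in isolation. The two functions below are hypothetical helpers written only for this illustration: both compute the new imaginary part 2*x*y + Im(c) for one iteration step, but in the first one the loop variable hides the outer i, so the pixel column is added instead.

```cpp
#include <cassert>

// Buggy variant: the outer `i` (imaginary part of c) is hidden inside the loop.
long double next_y_shadowed(long double x, long double y,
                            long double c_im, int column)
{
    assert(column >= 0);
    long double i = c_im;                 // intended: imaginary part of c
    long double next = i;
    for (int i = column; i <= column; ++i) {
        next = 2 * x * y + i;             // this `i` is the loop variable
    }
    return next;
}

// Fixed variant: the imaginary part has its own name, as in the answer.
long double next_y_fixed(long double x, long double y,
                         long double c_im, int column)
{
    assert(column >= 0);
    long double im = c_im;
    long double next = im;
    for (int i = column; i <= column; ++i) {
        next = 2 * x * y + im;            // no name clash
    }
    return next;
}
```

With x = 0.5, y = 0.25, c_im = 0.1 and column = 7, the first function returns 7.25 and the second returns 0.35. Compiling with a warning flag such as -Wshadow (GCC and Clang) reports this kind of clash.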
Would you please explain how to display the image using the graphics.h library?
//setcolor(COLOR(colour[p][0],colour[p][1],colour[p][2]));
//putpixel(i,j,getcolor());

Red-Black Gauss Seidel and OpenMP

I was trying to prove a point with OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to do some high-performance computing in OpenMP.
The Gauss-Seidel iteration is split into two separate sweeps, such that within each sweep every operation can be performed in any order and there should be no dependency between tasks. So in theory no processor should ever have to wait for another to perform any kind of synchronization.
The problem I am encountering is that, independent of problem size, I find there is only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower.
For many other linear parallelized routines I can obtain very good scaling, but this one is tricky.
My fear is that I am unable to "explain" to the compiler that the operation I perform on the array is thread-safe, so that it is unable to be really effective.
See the example below.
Does anyone have any clue how to make this more effective with OpenMP?
void redBlackSmooth(std::vector<double> const & b,
std::vector<double> & x,
double h)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
int const N = static_cast<int>(x.size());
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2/2.0)*(b[i] - sigma);
}
}
Addition:
I have now also tried a raw-pointer implementation and it has the same behavior as with the STL container, so it can be ruled out that it is some pseudo-critical behavior coming from the STL.
First of all, make sure that the x vector is aligned to cache boundaries. I did some tests, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of memory:
double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign( reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
Second, you can try to assign more computation to each thread (in this way you can keep cache-lines separated), but I suspect that openmp already does something like this under the hood, so it may be worthless with large N.
In my case this implementation is much faster when x is not cache-aligned.
const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); //Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;
int j, base , k, i;
#pragma omp parallel for shared(b, x) private(sigma, j, base, k, i)
for ( j = 0; j < workgroups; j++ ) {
base = j * workGroupSize;
for (int k = 0; k < workGroupSize; k+=2)
{
i = base + k + (redSweep ? 1 : 0);
if ( i == 0 || i+1 == N) continue;
sigma = -invh2* ( x[i-1] + x[i+1] );
x[i] = ( h2/2.0 ) * ( b[i] - sigma );
}
}
In conclusion, you definitely have a problem of cache-fighting, but given the way openmp works (sadly I am not familiar with it) it should be enough to work with properly allocated buffers.
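One more thing that could be tried on the OpenMP side, as a sketch rather than a measured fix: open a single parallel region and put both sweeps inside it, so the thread team is set up once per call instead of twice. The function name redBlackSmoothOneRegion is made up here, and schedule(static) is an assumption to give each thread one contiguous chunk. The implicit barrier at the end of the first omp for keeps the black sweep from starting before the red sweep has finished. Without OpenMP enabled the pragmas are ignored and the code runs serially.

```cpp
#include <cassert>
#include <vector>

// Red-black sweep with one parallel region containing both worksharing loops.
void redBlackSmoothOneRegion(std::vector<double> const& b,
                             std::vector<double>& x,
                             double h)
{
    assert(b.size() == x.size() && x.size() >= 2);
    double const h2 = h * h;
    double const invh2 = 1.0 / h2;
    int const N = static_cast<int>(x.size());
    double const* bp = b.data();
    double* xp = x.data();

    // Boundary conditions.
    xp[0] = 0.0;
    xp[N - 1] = 0.0;

    #pragma omp parallel
    {
        // Red sweep: odd indices only read even neighbours.
        #pragma omp for schedule(static)
        for (int i = 1; i < N - 1; i += 2) {
            double const sigma = -invh2 * (xp[i - 1] + xp[i + 1]);
            xp[i] = (h2 / 2.0) * (bp[i] - sigma);
        }
        // Implicit barrier here, then the black sweep on even indices.
        #pragma omp for schedule(static)
        for (int i = 2; i < N - 1; i += 2) {
            double const sigma = -invh2 * (xp[i - 1] + xp[i + 1]);
            xp[i] = (h2 / 2.0) * (bp[i] - sigma);
        }
    }
}
```

Declaring sigma inside the loop body makes it private automatically, so no private clause is needed. Whether this changes the scaling depends on the machine; each point does very little arithmetic per byte loaded, so the loop may simply be limited by memory bandwidth.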
I think the main problem is the type of array structure you are using. Let's compare the results with vectors and arrays (arrays = C arrays allocated with the new operator).
The vector and array sizes are N = 10000000. I force the smoothing function to repeat in order to keep the runtime > 0.1 s.
Vector Time: 0.121007 Repeat: 1 MLUPS: 82.6399
Array Time: 0.164009 Repeat: 2 MLUPS: 121.945
MLUPS = ((N-2)*repeat/runtime)/1000000 (million lattice-point updates per second)
MFLOPS are misleading when it comes to grid calculations. A few changes in the basic equation can produce a considerably higher MFLOPS figure for the same runtime.
The modified code:
double my_redBlackSmooth(double *b, double* x, double h, int N)
{
// Setup relevant constants.
double const invh2 = 1.0/(h*h);
double const h2 = (h*h);
double sigma = 0;
// Setup some boundary conditions.
x[0] = 0.0;
x[N-1] = 0.0;
double runtime(0.0), wcs, wce;
int repeat = 1;
timing(&wcs);
for(; runtime < 0.1; repeat*=2)
{
for(int r = 0; r < repeat; ++r)
{
// Red sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 1; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// Black sweep.
#pragma omp parallel for shared(b, x) private(sigma)
for (int i = 2; i < N-1; i+=2)
{
sigma = -invh2*(x[i-1] + x[i+1]);
x[i] = (h2*0.5)*(b[i] - sigma);
}
// cout << "In Array: " << r << endl;
}
if(x[0] != 0) dummy(x[0]);
timing(&wce);
runtime = (wce-wcs);
}
// cout << "Before division: " << repeat << endl;
repeat /= 2;
cout << "Array Time:\t" << runtime << "\t" << "Repeat:\t" << repeat
<< "\tMLUPS:\t" << ((N-2)*repeat/runtime)/1000000.0 << endl;
return runtime;
}
I didn't change anything in the code other than the array type. For better cache access and blocking you should look into data alignment (_mm_malloc).
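A minimal sketch of aligned allocation, using the POSIX posix_memalign call that also appears in the earlier answer rather than _mm_malloc. The helper names and the default alignment of 64 bytes are assumptions for this example (64 is a common cache-line size, but it is not universal), and the code only builds on POSIX systems.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Allocates `count` doubles whose start address is a multiple of `alignment`.
// Returns nullptr on failure; release the memory with free().
double* alloc_aligned_doubles(std::size_t count, std::size_t alignment = 64)
{
    // posix_memalign requires a power-of-two multiple of sizeof(void*).
    assert(alignment >= sizeof(void*) && (alignment & (alignment - 1)) == 0);
    void* p = nullptr;
    if (posix_memalign(&p, alignment, count * sizeof(double)) != 0) {
        return nullptr;
    }
    return static_cast<double*>(p);
}

// True if `p` is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

The returned pointer can be passed directly as b or x to my_redBlackSmooth above. Passing a void* temporary to posix_memalign and casting afterwards avoids the reinterpret_cast on the address of a double*.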