Can someone help me optimize this code? - c++

I am trying to perform a circular convolution (cconv) of two signals with a large number of samples. Can someone help me optimize this to run faster? Downsampling is not an option.
#include <iostream>
#include <cstdio>  /* printf */
#include <time.h>  /* clock_t, clock, CLOCKS_PER_SEC */
#include <math.h>  /* sqrt */
using namespace std;

void fillarray(double* x, int N)
{
    for (int i = 0; i < N; i++)
        x[i] = i + 1;
}

void circcon(double* x, double* y, double* u, int N)
{
    for (int m = 0; m < N; ++m)
        for (int n = 0; n < N; ++n) {
            if ((m - n) < 0)
                u[m] += x[n] * y[m - n + N];
            else
                u[m] += x[n] * y[m - n];
        }
}

int main(void)
{
    int N = 447650;
    double* x = new double[N];
    double* y = new double[N];
    double* u = new double[N];
    clock_t t = clock();
    fillarray(x, N);
    fillarray(y, N);
    for (int i = 0; i < N; i++)
        u[i] = 0.0;
    circcon(x, y, u, N);
    t = clock() - t;
    printf("It took me %ld clicks (%f seconds).\n", (long)t, ((float)t) / CLOCKS_PER_SEC);
    delete[] x;
    delete[] y;
    delete[] u;
    return 0;
}

It depends on by how much you need to improve the performance...
First, I would make y of size 2N, with the second half a copy of the first, so that instead of the if statement

if ((m - n) < 0) u[m] += x[n] * y[m - n + N]; else u[m] += x[n] * y[m - n];

one can simply write u[m] += x[n] * y[m - n + N]; (see the sketch below).
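A minimal sketch of that idea, as a drop-in replacement for circcon (the OpenMP pragma previews the multi-threading suggestion below and can be dropped; compile with -fopenmp to enable it):

// y2 has length 2N; both halves are copies of y, so y2[m - n + N] is always
// in range and the branch disappears.
void circcon_fast(const double* x, const double* y, double* u, int N)
{
    double* y2 = new double[2 * N];
    for (int i = 0; i < N; i++)
        y2[i] = y2[i + N] = y[i];
    #pragma omp parallel for // every m is independent
    for (int m = 0; m < N; ++m) {
        double sum = 0.0;
        for (int n = 0; n < N; ++n)
            sum += x[n] * y2[m - n + N];
        u[m] = sum;
    }
    delete[] y2;
}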
Then you could try making it multi-threaded; seek out tutorials on that, there are plenty.
Consider using SIMD instructions, though I believe that nowadays compilers use them automatically in simple enough cases.
However, the best solution would be to find an implementation of the FFT (Fast Fourier Transform). Then you could complete the convolution in O(n log n) operations instead of O(n^2). I just googled and found a library that does it:
http://www.alglib.net/fasttransforms/convolution.php
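For illustration, here is a minimal sketch of the FFT approach written against FFTW instead (my assumption; the linked ALGLIB routine has its own API): transform both signals, multiply the spectra pointwise, transform back, and divide by N.

#include <fftw3.h> // link with -lfftw3

// Circular convolution u = x (*) y in O(N log N) via the convolution theorem.
void circcon_fft(const double* x, const double* y, double* u, int N)
{
    const int M = N / 2 + 1; // length of a real-to-complex transform's output
    fftw_complex* X = fftw_alloc_complex(M);
    fftw_complex* Y = fftw_alloc_complex(M);
    fftw_plan px = fftw_plan_dft_r2c_1d(N, const_cast<double*>(x), X, FFTW_ESTIMATE);
    fftw_plan py = fftw_plan_dft_r2c_1d(N, const_cast<double*>(y), Y, FFTW_ESTIMATE);
    fftw_execute(px);
    fftw_execute(py);
    for (int k = 0; k < M; k++) { // pointwise complex multiplication X *= Y
        double re = X[k][0] * Y[k][0] - X[k][1] * Y[k][1];
        double im = X[k][0] * Y[k][1] + X[k][1] * Y[k][0];
        X[k][0] = re;
        X[k][1] = im;
    }
    fftw_plan pu = fftw_plan_dft_c2r_1d(N, X, u, FFTW_ESTIMATE);
    fftw_execute(pu);
    for (int i = 0; i < N; i++)
        u[i] /= N; // FFTW transforms are unnormalized
    fftw_destroy_plan(px); fftw_destroy_plan(py); fftw_destroy_plan(pu);
    fftw_free(X); fftw_free(Y);
}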
Edit: if you have MATLAB, it has had FFT for ages.

Related

Constructing distance matrix in parallel in C++11 using OpenMP

I would like to construct a distance matrix in parallel in C++11 using OpenMP. I have read various documentation, introductions, examples, etc. Yet, I still have a few questions. To facilitate answering this post, I state my questions as assumptions numbered 1 through 7. This way, you can quickly browse through them and point out which ones are correct and which ones are not.
Let us begin with a simple serially executed function computing a dense Armadillo matrix:
// [[Rcpp::export]]
arma::mat compute_dist_mat(arma::mat &coordinates, unsigned int n_points) {
    arma::mat dist_mat(n_points, n_points, arma::fill::zeros);
    double dist {};
    for (unsigned int i {0}; i < n_points; i++) {
        for (unsigned int j = i + 1; j < n_points; j++) {
            dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
            dist_mat.at(i, j) = dist;
            dist_mat.at(j, i) = dist;
        }
    }
    return dist_mat;
}
As a side note: this function is supposed to be called from R through the Rcpp interface - indicated by the // [[Rcpp::export]]. And accordingly the top of the file includes
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(cpp11)]]
#include <omp.h>
// [[Rcpp::plugins(openmp)]]
using namespace Rcpp;
using namespace arma;
However, the function should also work fine without the R interface.
In an attempt to parallelize the code, I replace the loops with
unsigned int i {};
unsigned int j {};
# pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
for (i = 0; i < n_points; i++) {
    for (j = i + 1; j < n_points; j++) {
        dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
        dist_mat.at(i, j) = dist;
        dist_mat.at(j, i) = dist;
    }
}
and add n_threads as an argument to the compute_dist_mat function.
1. This distributes the iterations of the outer loop across threads, with the iterations of the inner loop executed by the respective thread handling the outer loop.
2. The two loop levels cannot be combined because the inner loop depends on the outer one.
3. dist, i, and j are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops (see the sketch after this list).
4. The # pragma line does not have any effect when n_threads = 1, inducing a serial execution.
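As a minimal self-contained sketch of the alternative to assumption 3 (compute_dist here is a stand-in, since the real one is not shown in the post): variables declared inside the loop body are implicitly private to each thread, so no private() clause is needed.

#include <cmath>
#include <cstdio>
#include <vector>

static double compute_dist(double y1, double y2, double x1, double x2) {
    return std::sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
}

int main() {
    const int n_points = 4;
    const double xs[] = {0, 1, 2, 3}, ys[] = {0, 1, 0, 1};
    std::vector<double> dist_mat(n_points * n_points, 0.0);
    // i is private as the loop index; j and dist are private by scope.
    // Compile with -fopenmp; without it the pragma is ignored and the code runs serially.
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < n_points; i++) {
        for (int j = i + 1; j < n_points; j++) {
            double dist = compute_dist(ys[i], ys[j], xs[i], xs[j]);
            dist_mat[i * n_points + j] = dist;
            dist_mat[j * n_points + i] = dist;
        }
    }
    std::printf("dist(0, 3) = %f\n", dist_mat[3]);
    return 0;
}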
Extending the dense matrix application, the following code block illustrates the serial sparse matrix case with batch insertion. To motivate the use of sparse matrices here, I set distances below a certain threshold to zero.
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold) {
    std::vector<double> dists;
    std::vector<unsigned int> dist_i;
    std::vector<unsigned int> dist_j;
    double dist {};
    for (unsigned long int i {0}; i < n_points; i++) {
        for (unsigned long int j = i + 1; j < n_points; j++) {
            dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
            if (dist >= dist_threshold) {
                dists.push_back(dist);
                dist_i.push_back(i);
                dist_j.push_back(j);
            }
        }
    }
    unsigned int mat_size = dist_i.size();
    arma::umat index_mat(2, mat_size * 2);
    arma::vec dists_vec(mat_size * 2);
    unsigned int j {};
    for (unsigned int i {0}; i < mat_size; i++) {
        j = i * 2;
        index_mat.at(0, j) = dist_i[i];
        index_mat.at(1, j) = dist_j[i];
        index_mat.at(0, j + 1) = dist_j[i];
        index_mat.at(1, j + 1) = dist_i[i];
        dists_vec.at(j) = dists[i];
        dists_vec.at(j + 1) = dists[i];
    }
    arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
    return dist_mat;
}
Because the function does not know ex ante how many distances are above the threshold, it first stores the non-zero values in standard vectors and then constructs the Armadillo objects from them.
I parallelize the function as follows:
// [[Rcpp::export]]
arma::sp_mat compute_dist_spmat(arma::mat &coordinates, unsigned int n_points, double dist_threshold, unsigned short int n_threads) {
    std::vector<std::vector<double>> dists(n_points);
    std::vector<std::vector<unsigned int>> dist_j(n_points);
    double dist {};
    unsigned int i {};
    unsigned int j {};
    # pragma omp parallel for private(dist, i, j) num_threads(n_threads) if(n_threads > 1)
    for (i = 0; i < n_points; i++) {
        for (j = i + 1; j < n_points; j++) {
            dist = compute_dist(coordinates(i, 1), coordinates(j, 1), coordinates(i, 0), coordinates(j, 0));
            if (dist >= dist_threshold) {
                dists[i].push_back(dist);
                dist_j[i].push_back(j);
            }
        }
    }
    std::vector<unsigned int> vec_intervals(n_points + 1); // std::vector instead of a non-standard VLA
    vec_intervals[0] = 0;
    for (i = 0; i < n_points; i++) {
        vec_intervals[i + 1] = vec_intervals[i] + dist_j[i].size();
    }
    unsigned int mat_size {vec_intervals[n_points]};
    arma::umat index_mat(2, mat_size * 2);
    arma::vec dists_vec(mat_size * 2);
    unsigned int vec_begins_i {};
    unsigned int vec_length_i {};
    unsigned int k {};
    # pragma omp parallel for private(i, j, k, vec_begins_i, vec_length_i) num_threads(n_threads) if(n_threads > 1)
    for (i = 0; i < n_points; i++) {
        vec_begins_i = vec_intervals[i];
        vec_length_i = vec_intervals[i + 1] - vec_begins_i;
        for (j = 0; j < vec_length_i; j++) {
            k = (vec_begins_i + j) * 2;
            index_mat.at(0, k) = i;
            index_mat.at(1, k) = dist_j[i][j];
            index_mat.at(0, k + 1) = dist_j[i][j];
            index_mat.at(1, k + 1) = i;
            dists_vec.at(k) = dists[i][j];
            dists_vec.at(k + 1) = dists[i][j];
        }
    }
    arma::sp_mat dist_mat(index_mat, dists_vec, n_points, n_points);
    return dist_mat;
}
5. Using dynamic vectors in the loop is thread-safe.
6. dist, i, j, k, vec_begins_i, and vec_length_i are all to be initialized above the # pragma line and then declared private rather than initializing them in the loops.
7. Nothing has to be marked as a section.
Are any of the seven statements incorrect?
The following does not directly answer your question (it's just some dev code I copied from a personal GitHub repo), but it makes several points clear that may be of use in your application:
OpenMP automatically determines private members so long as you are not doing any dynamic memory allocation within the parallel loop
For sparse matrix distance calculations, it becomes important to move beyond a simple calculation of distance at each non-zero index and instead consider the structure of sparsity that is expected, and optimize for that. In the example below, I assume both matrices are very sparse and their intersection is less than their union. Thus, I "precondition" each distance calculation with squared column sums (for calculating Euclidean distance), and then adjust the calculation for the intersection only. This avoids complicated iterator structures and is very fast.
Using as few temporaries as possible is much to your benefit, and sparse matrix iterators do as good of a job of this as any alternative code anyone may ever write.
Eigen provides better vectorization than Armadillo (across the board, I might add) which means you want Eigen instead of Armadillo if those last 20% of performance gains are important to you.
This function calculates the Euclidean distance between all unique pairs of columns in an Eigen::SparseMatrix<double> object:
// sparse column-wise Euclidean distance between all columns
Eigen::MatrixXd distance(Eigen::SparseMatrix<double>& A) {
    Eigen::MatrixXd dists(A.cols(), A.cols());
    // zero-initialize before accumulating; VectorXd(n) leaves the values uninitialized
    Eigen::VectorXd sq_colsums = Eigen::VectorXd::Zero(A.cols());
    for (int col = 0; col < A.cols(); ++col)
        for (Eigen::SparseMatrix<double>::InnerIterator it(A, col); it; ++it)
            sq_colsums(col) += it.value() * it.value();
    #pragma omp parallel for
    for (int i = 0; i < (A.cols() - 1); ++i) {
        for (int j = (i + 1); j < A.cols(); ++j) {
            double dist = sq_colsums(i) + sq_colsums(j);
            Eigen::SparseMatrix<double>::InnerIterator it1(A, i), it2(A, j);
            while (it1 && it2) {
                if (it1.row() < it2.row()) ++it1;
                else if (it1.row() > it2.row()) ++it2;
                else {
                    dist -= it1.value() * it1.value();
                    dist -= it2.value() * it2.value();
                    dist += std::pow(it1.value() - it2.value(), 2);
                    ++it1; ++it2;
                }
            }
            dists(i, j) = std::sqrt(dist);
            dists(j, i) = dists(i, j);
        }
    }
    dists.diagonal().array() = 1;
    return dists;
}
As Dirk and others have said, there are packages out there (e.g. ParallelDist) that seem to do everything you're after (for dense matrices). Look at wordspace for fast cosine distance calculations. See here for some comparisons. Cosine distance is easy to calculate efficiently in R without Rcpp using crossprod operations (see the qlcMatrix::cosSparse source code for algorithmic inspiration).
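For a C++ analogue of that crossprod idea, here is a minimal sketch with dense Eigen (my substitution; the packages named above are R-side): normalize each column to unit length, and the Gram matrix of the result is the cosine similarity matrix.

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(5, 3); // columns are the vectors
    for (int c = 0; c < A.cols(); ++c)
        A.col(c).normalize(); // unit-length columns
    Eigen::MatrixXd cos_sim = A.transpose() * A; // crossprod analogue
    Eigen::MatrixXd cos_dist = (1.0 - cos_sim.array()).matrix(); // distance = 1 - similarity
    std::cout << cos_dist << "\n";
    return 0;
}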

Using Eigen class to sum certain numbers in a vector

I am new to C++ and I am using the Eigen library. I was wondering if there was a way to sum certain elements in a vector. For example, say I have a vector that is 100 by 1 and I just want to sum the first 10 elements. Is there a way of doing that using the Eigen library?
What I am trying to do is this: say I have a vector that is 1000 by 1 and I want to take the mean of the first 10 elements, then the next 10 elements, and so on, and store that in some vector. Hence I will have a vector of size 100 of the averages. Any thoughts or suggestions are greatly appreciated.
Here are the beginning steps I have in my code. I have an S_temp4 vector that is 1000 by 1. Now I initialize a new vector S_A that I want to be the vector of the means. Here is my messy, sloppy code so far (note that my question resides in the crudeMonteCarlo function):
#include <iostream>
#include <cmath>
#include <math.h>
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <random>
#include <time.h>
using namespace Eigen;
using namespace std;

void crudeMonteCarlo(int N, double K, double r, double S0, double sigma, double T, int n);
VectorXd time_vector(double min, double max, int n);
VectorXd call_payoff(VectorXd S, double K);

int main(){
    int N = 100;
    double K = 100;
    double r = 0.2;
    double S0 = 100;
    double sigma = 0.4;
    double T = 0.1;
    int n = 10;
    crudeMonteCarlo(N,K,r,S0,sigma,T,n);
    return 0;
}

VectorXd time_vector(double min, double max, int n){
    VectorXd m(n + 1);
    double delta = (max-min)/n;
    for(int i = 0; i <= n; i++){
        m(i) = min + i*delta;
    }
    return m;
}

MatrixXd generateGaussianNoise(int M, int N){
    MatrixXd Z(M,N);
    static random_device rd;
    static mt19937 e2(time(0));
    normal_distribution<double> dist(0.0, 1.0);
    for(int i = 0; i < M; i++){
        for(int j = 0; j < N; j++){
            Z(i,j) = dist(e2);
        }
    }
    return Z;
}

VectorXd call_payoff(VectorXd S, double K){
    VectorXd C(S.size());
    for(int i = 0; i < S.size(); i++){
        if(S(i) - K > 0){
            C(i) = S(i) - K;
        }else{
            C(i) = 0.0;
        }
    }
    return C;
}

void crudeMonteCarlo(int N, double K, double r, double S0, double sigma, double T, int n){
    // Create time vector
    VectorXd tt = time_vector(0.0,T,n);
    VectorXd t(n);
    double dt = T/n;
    for(int i = 0; i < n; i++){
        t(i) = tt(i+1);
    }
    // Generate standard normal Z matrix
    //MatrixXd Z = generateGaussianNoise(N,n);
    // Generate the log normal stock process N times to get S_A for crude Monte Carlo
    MatrixXd SS(N,n+1);
    MatrixXd Z = generateGaussianNoise(N,n);
    for(int i = 0; i < N; i++){
        SS(i,0) = S0;
        for(int j = 1; j <= n; j++){
            SS(i,j) = SS(i,j-1)*exp((double) (r - pow(sigma,2.0))*dt + sigma*sqrt(dt)*(double)Z(i,j-1));
        }
    }
    // This long bit of code gives me my S_A.....
    Map<RowVectorXd> S_temp1(SS.data(), SS.size());
    VectorXd S_temp2(S_temp1.size());
    for(int i = 0; i < S_temp2.size(); i++){
        S_temp2(i) = S_temp1(i);
    }
    VectorXd S_temp3(S_temp2.size() - N);
    int count = 0;
    for(int i = N; i < S_temp2.size(); i++){
        S_temp3(count) = S_temp2(i);
        count++;
    }
    VectorXd S_temp4(S_temp3.size());
    for(int i = 0; i < S_temp4.size(); i++){
        S_temp4(i) = S_temp3(i);
    }
    VectorXd S_A(N);
    S_A(0) = (S_temp4(0) + S_temp4(1) + S_temp4(2) + S_temp4(3) + S_temp4(4) + S_temp4(5) + S_temp4(6) + S_temp4(7) + S_temp4(8) + S_temp4(9))/(n);
    S_A(1) = (S_temp4(10) + S_temp4(11) + S_temp4(12) + S_temp4(13) + S_temp4(14) + S_temp4(15) + S_temp4(16) + S_temp4(17) + S_temp4(18) + S_temp4(19))/(n);
    int count1 = 0;
    for(int i = 0; i < S_temp4.size(); i++){
        S_A(count1) =   // incomplete: this is where I am stuck
    }
    // Calculate payoff of Asian option
    //VectorXd call_fun = call_payoff(S_A,K);
}
This question includes a lot of code, which makes it hard to understand the question you're trying to ask. Consider including only the code specific to your question.
In any case, you can use Eigen directly to do all of these things quite simply. In Eigen, Vectors are just matrices with 1 column, so all of the reasoning here is directly applicable to what you've written.
const Eigen::Matrix<double, 100, 1> v = Eigen::Matrix<double, 100, 1>::Random();
const int num_rows = 10;
const int num_cols = 1;
const int starting_row = 0;
const int starting_col = 0;
const double sum_of_first_ten = v.block(starting_row, starting_col, num_rows, num_cols).sum();
const double mean_of_first_ten = sum_of_first_ten / num_rows;
In summary: You can use .block to get a block object, .sum() to sum that block, and then conventional division to get the mean.
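Extending that to the goal stated in the question (a vector of block means), a minimal sketch assuming blocks of 10; .segment is the one-dimensional convenience form of .block:

#include <Eigen/Dense>
#include <iostream>

int main() {
    const int block = 10;
    Eigen::VectorXd v = Eigen::VectorXd::Random(1000);
    Eigen::VectorXd means(v.size() / block);
    for (int i = 0; i < means.size(); ++i)
        means(i) = v.segment(i * block, block).mean(); // mean of elements i*block .. i*block+9
    std::cout << means.size() << " block means, first = " << means(0) << "\n";
    return 0;
}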
You can reshape the input using Map and then do all sub-summations at once without any loop:
VectorXd A(1000); // input
Map<MatrixXd> B(A.data(), 10, A.size()/10); // reshaped version, no copy
VectorXd res = B.colwise().mean(); // partial reduction, you can also use .sum(), .minCoeff(), etc.
The Eigen documentation at https://eigen.tuxfamily.org/dox/group__TutorialBlockOperations.html says an Eigen block is a rectangular part of a matrix or array accessed by matrix.block(i,j,p,q), where i and j are the starting indices (e.g. 0 and 0) and p and q are the block size (e.g. 10 and 1). Presumably you would then iterate i in steps of 10, and use std::accumulate or perhaps an explicit summation to find the mean of matrix.block(i,0,10,1).

Bitonic sorting in CUDA misorders some values

I'm making a sorting algorithm on CUDA for a bigger project and I decided to implement a bitonic sort. The number of elements I'll be sorting will always be a power of two; in fact it will be 512. I need an array which will hold the final positions, because this method will be used for ordering an array that represents the quality matrix of another solution.
fitness is the array I'll sort, numElements is the number of elements, and orden is initially an empty array with numElements positions which will be filled at the very beginning in this way: orden[i] = i. Actually orden is not relevant for this issue, but I kept it.
My problem is that some values aren't sorted properly and so far I've been unable to figure out what the problem is.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <ctime>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#include <device_functions.h>
#include "float.h"
__global__ void sorting(int * orden, float * fitness, int numElements);
// Populating array with random values for testing purposes
__global__ void populate( curandState * state, float * fitness{
curandState localState = state[threadIdx.x];
int a = curand(&localState) % 500;
fitness[threadIdx.x] = a;
}
//Curand setup for the populate method
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
int id = threadIdx.x;
curand_init(seed, id, 0, &state[id]);
}
int main()
{
float * arrayx;
int numelements = 512;
int * orden;
float arrayCPU[512] = { 0 };
curandState * state;
cudaDeviceReset();
cudaSetDevice(0);
cudaMalloc(&state, numelements * sizeof(curandState));
cudaMalloc((void **)&arrayx, numelements*sizeof(float));
cudaMalloc((void **)&orden, numelements*sizeof(int));
setup_cuRand << <1, numelements >> >(state, unsigned(time(NULL)));
populate << <1, numelements >> > (state, arrayx);
cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < numelements; i++)
printf("fitness[%i] = %f\n", i, arrayCPU[i]);
sorting << <1, numelements >> >(orden, arrayx, numelements);
printf("\n\n");
cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);
for (int i = 0; i < numelements; i++)
printf("fitness[%i] = %f\n", i, arrayCPU[i]);
cudaDeviceReset();
return 0;
}
__device__ bool isValid(float n){
return !(isnan(n) || isinf(n) || n != n || n <= FLT_MIN || n >= FLT_MAX);
}
__global__ void sorting(int * orden, float * fitness, int numElements){
int i = 0;
int j = 0;
float f = 0.0;
int aux = 0;
//initial orden registered (1, 2, 3...)
orden[threadIdx.x] = threadIdx.x;
//Logarithm on base 2 of numElements
for (i = 2; i <= numElements; i = i * 2){
// descending from i reducing to half each iteration
for (j = i; j >= 2; j = j / 2){
if (threadIdx.x % j < j / 2){
__syncthreads();
// ascending or descending consideration using (threadIdx.x % (i*2) < i)
if ((threadIdx.x % (i * 2) < i) && (fitness[threadIdx.x] > fitness[threadIdx.x + j / 2] || !isValid(fitness[threadIdx.x])) ||
((threadIdx.x % (i * 2) >= i) && (fitness[threadIdx.x] <= fitness[threadIdx.x + j / 2] || !isValid(fitness[threadIdx.x + j / 2])))){
aux = orden[threadIdx.x];
orden[threadIdx.x] = orden[threadIdx.x + j / 2];
orden[threadIdx.x + j / 2] = aux;
//Se reubican los fitness
f = fitness[threadIdx.x];
fitness[threadIdx.x] = fitness[threadIdx.x + j / 2];
fitness[threadIdx.x + j / 2] = f;
}
}
}
}
}
For example, here is an output I got on a random execution:
[image: sample output of a random execution]
This is a representation of my bitonic sorting:
[image: bitonic sorting schema; the arrows point to where the worse of the two compared values goes]
Here are the issues I found:
In your posted code, this does not compile:
__global__ void populate( curandState * state, float * fitness{
^
missing close parenthesis
I added a close parenthesis there.
It's not necessary to take the address of the array in these cudaMemcpy statements:
cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);
....
cudaMemcpy(&arrayCPU, arrayx, numelements * sizeof(float), cudaMemcpyDeviceToHost);
the array name is already the address of the array, so I removed the ampersands. If you use a dynamically allocated array, such usage would be broken.
Your usage of __syncthreads() here is broken:
for (j = i; j >= 2; j = j / 2){
    if (threadIdx.x % j < j / 2){
        __syncthreads();
usage of __syncthreads() inside a conditional statement is generally incorrect unless the conditional statement evaluates uniformly across the threadblock. This is covered in the documentation. We can achieve the desired effect with a slight change:
for (j = i; j >= 2; j = j / 2){
    __syncthreads();
    if (threadIdx.x % j < j / 2){
With the above changes, your code appears to run correctly for me, for most cases.
Your usage of FLT_MIN in your validity check is also questionable if you intend 0 (or any negative values) to be sorted correctly. Speaking generally, FLT_MIN is a number that is very small, close to zero. If you were thinking that this is a large negative number, it is not. As a result, zero is a possible output of your random number generator, and it will not be sorted correctly. I'll leave this one to you to fix; it should be straightforward, but it will depend on what you ultimately want to achieve. (If you only want to sort positive, non-zero floating-point values, the test may be OK, but in this case your random number generator can return 0.)
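For instance, one possible fix, assuming the intent is to accept any finite value (including zero and negatives), is to drop the FLT_MIN/FLT_MAX comparisons entirely:

// Assumed intent: any finite value is sortable. isfinite already rejects
// NaN and +/-Inf, so the redundant n != n test goes away too.
__device__ bool isValid(float n){
    return isfinite(n);
}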

Equivalent of curand for OpenCL

I am looking at switching from NVIDIA to AMD for my compute card because I want double-precision support. Before doing this, I decided to learn OpenCL on my NVIDIA card to see if I like it. I want to convert the following code from CUDA to OpenCL. I am using the curand library to generate uniformly and normally distributed random numbers. Each thread needs to be able to create a different sequence of random numbers and generate a few million per thread. Here is the code. How would I go about this in OpenCL? Everything I have read online seems to imply that I should generate a buffer of random numbers and then use that on the GPU, but this is not practical for me.
template<int NArgs, typename OptimizationFunctor>
__global__
void statistical_solver_kernel(float* args_lbounds,
                               float* args_ubounds,
                               int trials,
                               int initial_temp,
                               unsigned long long seed,
                               float* results,
                               OptimizationFunctor f)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx >= trials)
        return;
    curandState rand;
    curand_init(seed, idx, 0, &rand);
    float x[NArgs];
    for(int i = 0; i < NArgs; i++)
    {
        x[i] = curand_uniform(&rand) * (args_ubounds[i] - args_lbounds[i]) + args_lbounds[i];
    }
    float y = f(x);
    for(int t = initial_temp - 1; t > 0; t--)
    {
        float t_percent = (float)t / initial_temp;
        float x_prime[NArgs];
        for(int i = 0; i < NArgs; i++)
        {
            x_prime[i] = curand_normal(&rand) * (args_ubounds[i] - args_lbounds[i]) * t_percent + x[i];
            x_prime[i] = fmaxf(args_lbounds[i], x_prime[i]);
            x_prime[i] = fminf(args_ubounds[i], x_prime[i]);
        }
        float y_prime = f(x_prime);
        if(y_prime < y || (y_prime - y) / y_prime < t_percent)
        {
            y = y_prime;
            for(int i = 0; i < NArgs; i++)
            {
                x[i] = x_prime[i];
            }
        }
    }
    float* rptr = results + idx * (NArgs + 1);
    rptr[0] = y;
    for(int i = 1; i <= NArgs; i++)
        rptr[i] = x[i - 1];
}
The VexCL library provides an implementation of counter-based generators. You can use those inside larger expressions, see this slide for an example.
EDIT: Take this with a grain of salt, as I am the author of VexCL :).
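For intuition, here is a minimal CPU-side sketch of the counter-based idea (this is not VexCL's API; the splitmix64 mixing function is my own choice for brevity): each (seed, work-item id, counter) tuple is hashed independently, so every thread can draw an arbitrarily long private stream with no stored state and no pre-generated buffer.

#include <cstdint>
#include <cstdio>

static uint64_t mix(uint64_t x) { // splitmix64 finalizer
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

static float uniform01(uint64_t seed, uint64_t tid, uint64_t counter) {
    // Combine the inputs and map the top 24 bits to [0, 1).
    uint64_t h = mix(seed ^ mix(tid ^ mix(counter)));
    return (h >> 40) * (1.0f / 16777216.0f);
}

int main() {
    for (int i = 0; i < 3; i++) // e.g. work-item 7's private stream
        std::printf("%f\n", uniform01(42, 7, i));
    return 0;
}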

Vectors and matrices in C++ for generating a spectrogram

This is my first attempt to generate a spectrogram of a sinusoidal signal with C++.
To generate the spectrogram:
1. I divided the real sinusoidal signal into B blocks.
2. Applied a Hanning window on each block (I assumed there is no overlap). This should give me the inputs for the FFT, in[j][k], where k is the block number.
3. Apply the FFT on in[j][k] for each block and store it.
Here is the script:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <fftw3.h>
#include <iostream>
#include <cmath>
#include <fstream>
#include <vector>
using namespace std;

int main(){
    int i;
    int N = 500; // number of sampled points
    int Windowsize = 100;
    double Fs = 200; // sampling frequency
    double T = 1 / Fs; // sample time
    double f = 50; // frequency
    double *in;
    fftw_complex *out;
    double t[N]; // time vector
    fftw_plan plan_forward;
    std::vector<double> signal(N);
    int B = N / Windowsize; // number of blocks
    in = (double*)fftw_malloc(sizeof(double) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    // Generating the signal
    for(int i = 0; i <= N; i++){
        t[i] = i * T;
        signal[i] = 0.7 * sin(2 * M_PI * f * t[i]); // generate sine waveform
    }
    // Applying the Hanning window function on each block B
    for(int k = 0; i <= B; k++){
        for(int j = 0; j <= Windowsize; j++){
            double multiplier = 0.5 * (1 - cos(2 * M_PI * j / (N-1))); // Hanning Window
            in[j][k] = multiplier * signal[j];
        }
        plan_forward = fftw_plan_dft_r2c_1d(Windowsize, in, out, FFTW_ESTIMATE);
        fftw_execute(plan_forward);
        v[j][k] = (20 * log(sqrt(out[i][0] * out[i][0] + out[i][1] * out[i][1]))) / N;
    }
    fftw_destroy_plan(plan_forward);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
So, the question is: what is the correct way to declare the in[j][k] and v[j][k] variables?
Update: I have declared my v[j][k] as a matrix, double v[5][249];, according to this site: http://www.cplusplus.com/doc/tutorial/arrays/. So now my script looks like:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <fftw3.h>
#include <iostream>
#include <cmath>
#include <fstream>
using namespace std;

int main()
{
    int i;
    double y;
    int N = 500; // number of points acquired inside the window
    double Fs = 200; // sampling frequency
    int windowsize = 100;
    double dF = Fs / N;
    double T = 1 / Fs; // sample time
    double f = 50; // frequency
    double *in;
    fftw_complex *out;
    double t[N]; // time vector
    double tt[5];
    double ff[N];
    fftw_plan plan_forward;
    double v[5][249];
    in = (double*) fftw_malloc(sizeof(double) * N);
    out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
    plan_forward = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);
    for (int i = 0; i <= N; i++)
    {
        t[i] = i * T;
        in[i] = 0.7 * sin(2 * M_PI * f * t[i]); // generate sine waveform
    }
    for (int k = 0; k < 5; k++){
        for (int i = 0; i < windowsize; i++){
            double multiplier = 0.5 * (1 - cos(2 * M_PI * i / (windowsize - 1))); // Hanning Window
            in[i] = multiplier * in[i + k * windowsize];
            fftw_execute(plan_forward);
            for (int i = 0; i <= (N / 2); i++)
            {
                v[k][i] = (20 * log10(sqrt(out[i][0] * out[i][0] + out[i][1] * out[i][1]))); // Here I have calculated the y axis of the spectrum in dB
            }
        }
    }
    for (int k = 0; k < 5; k++) // Center time for each block
    {
        tt[k] = (2 * k + 1) * T * (windowsize / 2);
    }
    fstream myfile;
    myfile.open("example2.txt", fstream::out);
    myfile << "plot '-' using 1:2" << std::endl;
    for (int k = 0; k < 5; k++){
        for (int i = 0; i <= ((N / 2) - 1); i++)
        {
            myfile << v[k][i] << " " << tt[k] << std::endl;
        }
    }
    myfile.close();
    fftw_destroy_plan(plan_forward);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
I do not get errors anymore but the spectrogram plot is not right.
As indicated in FFTW's documentation, the size of the output (out in your case) when using fftw_plan_dft_r2c_1d is not the same as the size of the input. More specifically for an input of N real samples, the output consists of N/2+1 complex values. You may then allocate out with:
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (N/2 + 1));
For the spectrogram output you will then similarly have (N/2+1) magnitudes for each of the B blocks, resulting in the 2D array:
double** v = new double*[B];
for (int i = 0; i < B; i++){
    v[i] = new double[(N/2+1)];
}
Also, note that you may reuse the input buffer in for each iteration (filling it with data for a new block). However since you have chosen to compute an N-point FFT and will be storing smaller blocks of Windowsize samples (in this case N=500 and Windowsize=100), make sure to initialize the remaining samples with zeros:
in = (double*)fftw_malloc(sizeof(double) * N);
for (int i = 0; i < N; i++){
    in[i] = 0;
}
Note that in addition to the declaration and allocation of the in and v variables, the code you posted suffers from a few additional issues:
When computing the Hanning window, you should divide by Windowsize-1, not N-1 (since in your case N corresponds to the FFT size).
You are taking the FFT of the same block of signal over and over again since you are always indexing with j in the [0,Windowsize] range. You would most likely want to add an offset each time you process a different block.
Since the FFT size does not change, you only need to create the plan once. At the very least if you are going to create your plan at every iteration, you should similarly destroy it (with fftw_destroy_plan) at every iteration.
And a few additional points which may require some thought:
Scaling the log-scaled magnitudes by dividing by N might not do what you think. You are much more likely to want to scale the linear-scale magnitudes (i.e. divide the magnitudes before taking the logarithm). Note that this will result in a constant offset of the spectrum curve (dividing the magnitude by N merely subtracts a constant 20*log10(N) decibels, whereas dividing the logarithm by N compresses the curve nonlinearly), which for many applications is not that significant. If the scaling is important for your application, you may have a look at another answer of mine for more details.
The common formula 20*log10(x) typically used to convert linear scale to decibels uses a base-10 logarithm instead of the natural log (base e~2.7182) function which you've used. This would result in a multiplicative scaling (stretching), which may or may not be significant depending on your application.
To summarize, the following code might be more in line with what you are trying to do:
// Allocate & initialize buffers
in = (double*)fftw_malloc(sizeof(double) * N);
for (int i = 0; i < N; i++){
    in[i] = 0;
}
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (N/2 + 1));
double** v = new double*[B];
for (int i = 0; i < B; i++){
    v[i] = new double[(N/2+1)];
}
// Generate the signal
...
// Create the plan once (an N-point FFT, matching the allocations above;
// the blocks of Windowsize samples are zero-padded to N)
plan_forward = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);
// Applying the Hanning window function on each block B
for(int k = 0; k < B; k++){
    for(int j = 0; j < Windowsize; j++){
        // Hanning Window
        double multiplier = 0.5 * (1 - cos(2 * M_PI * j / (Windowsize-1)));
        in[j] = multiplier * signal[j+k*Windowsize];
    }
    fftw_execute(plan_forward);
    for (int j = 0; j <= N/2; j++){
        // Factor of 2 is to account for the fact that we are only getting half
        // the spectrum (the other half is not returned by an R2C plan due to symmetry)
        v[k][j] = 2*(out[j][0] * out[j][0] + out[j][1] * out[j][1])/(N*N);
    }
    // The DC component and the value at the Nyquist frequency do not have a
    // corresponding symmetric value, so should not have been doubled up above.
    // Correct those special cases.
    v[k][0] *= 0.5;
    v[k][N/2] *= 0.5;
    // Convert to decibels
    const double epsilon = 1e-5; // small constant to avoid taking the log of 0
    for (int j = 0; j <= N/2; j++){
        // 20*log10(sqrt(x)) is equivalent to 10*log10(x)
        v[k][j] = 10 * log10(v[k][j] + epsilon);
    }
}
// Clean up
fftw_destroy_plan(plan_forward);
fftw_free(in);
fftw_free(out);
// Delete this last one after you've done something useful with the spectrogram
for (int i = 0; i < B; i++){
    delete[] v[i];
}
delete[] v;
Looks like you're missing the initial declaration for 'v' altogether, and 'in' is not declared properly.
See this page for a related question about creating 2D arrays in C++. As I understand it, fftw_malloc() is basically new or malloc(), but it aligns the allocation properly for the FFTW algorithm.
Since you're not passing 'v' to anything related to FFTW, you could use standard malloc() for it.