creating matrix with probabilities - c++

I want to generate an NxN matrix to test some code. Each row contains floats that must add up to 1 (i.e. each row is a set of probabilities).
The tricky part is that most of the elements should be 0, with only a few randomly chosen elements per row being nonzero. Each nonzero element should equal 1/m, where m is the number of nonzero elements in that row. I thought about writing the values straight to a file, but I need them in a C++ array: in the end I want to generate a Matrix Market file, and the implementation I found converts a C++ array to that format. The rest of my code takes a Matrix Market file as input, so that has to be the final form of output. The language does not really matter as long as I end up with the file (I also found mmwrite and mmread in Python).
Please help, I am stuck and not sure how to implement this.

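import random
N = 10
matrix = []
for j in range(N):
    # Mark each element as nonzero with probability 0.6.
    t = [int(random.random() < 0.6) for i in range(N)]
    ones = t.count(1)
    # Each nonzero element becomes 1/ones; an all-zero row is left unchanged.
    row = [float(x) / ones for x in t] if ones else t
    matrix.append(row)
for r in matrix:
    print(r)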

By C++ array, do you mean a C array or an STL vector<vector< > >? The latter would be cleaner, but here's an example using C arrays:
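#include <stdlib.h>
#include <stdio.h>
/* Returns an N*N row-major matrix in which each row has m randomly placed
   nonzero entries, each equal to 1/m. The caller must free() the result. */
float* makeProbabilityMatrix(int N, float zeroProbability)
{
    float* matrix = (float*)malloc(N*N*sizeof(float));
    if (matrix == NULL) return NULL;
    for (int ii = 0; ii < N; ii++)
    {
        int m = 0;
        for (int jj = 0; jj < N; jj++)
        {
            int val = (rand() / (RAND_MAX*1.0) < zeroProbability) ? 0 : 1;
            matrix[ii*N+jj] = val;
            m += val;
        }
        /* Avoid dividing by zero: an all-zero row gets one random nonzero entry. */
        if (m == 0)
        {
            matrix[ii*N + rand() % N] = 1.0f;
            m = 1;
        }
        for (int jj = 0; jj < N; jj++)
        {
            matrix[ii*N+jj] /= m;
        }
    }
    return matrix;
}
int main()
{
    srand(234);
    int N = 10;
    float* matrix = makeProbabilityMatrix(N, 0.70);
    if (matrix == NULL) return 1;
    for (int ii = 0; ii < N; ii++)
    {
        for (int jj = 0; jj < N; jj++)
        {
            printf("%.2f ", matrix[ii*N+jj]);
        }
        printf("\n");
    }
    free(matrix);
    return 0;
}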
Output:
0.00 0.20 0.20 0.00 0.00 0.00 0.00 0.20 0.20 0.20
0.25 0.00 0.00 0.00 0.00 0.25 0.00 0.25 0.25 0.00
0.00 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.50 0.00 0.00 0.50 0.00
0.25 0.25 0.00 0.00 0.00 0.00 0.25 0.00 0.25 0.00
0.00 0.25 0.00 0.00 0.00 0.25 0.25 0.00 0.25 0.00
0.00 0.00 0.33 0.00 0.33 0.00 0.00 0.00 0.33 0.00
0.00 0.20 0.20 0.20 0.20 0.00 0.00 0.20 0.00 0.00
0.20 0.00 0.20 0.00 0.00 0.00 0.00 0.20 0.20 0.20
0.00 0.00 0.00 0.00 0.00 0.50 0.00 0.50 0.00 0.00
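For the vector<vector< > > variant and the Matrix Market output the question asks for, a minimal sketch might look like the following. It is not the converter the asker found: makeProbabilityMatrix, toMatrixMarket and the seed parameter are illustrative names, std::mt19937 is swapped in for rand(), and forcing one nonzero into an otherwise all-zero row is an assumption about what is wanted.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <sstream>
#include <string>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // Matrix[row][col]

// Illustrative helper: builds an n x n matrix whose rows each hold m nonzero
// entries equal to 1/m, with the nonzero positions chosen at random.
Matrix makeProbabilityMatrix(std::size_t n, double zeroProbability, unsigned seed)
{
    assert(n > 0 && zeroProbability >= 0.0 && zeroProbability <= 1.0);
    std::mt19937 gen(seed);
    std::bernoulli_distribution isNonZero(1.0 - zeroProbability);
    std::uniform_int_distribution<std::size_t> anyColumn(0, n - 1);

    Matrix matrix(n, std::vector<double>(n, 0.0));
    for (auto& row : matrix) {
        std::size_t m = 0;
        for (auto& value : row) {
            if (isNonZero(gen)) { value = 1.0; ++m; }
        }
        // Assumption: an all-zero row gets one random nonzero so it still sums to 1.
        if (m == 0) { row[anyColumn(gen)] = 1.0; m = 1; }
        for (auto& value : row) value /= static_cast<double>(m);
    }
    return matrix;
}

// Serializes the nonzero entries in Matrix Market coordinate format,
// which uses 1-based row and column indices.
std::string toMatrixMarket(const Matrix& matrix)
{
    std::size_t nonZeros = 0;
    for (const auto& row : matrix)
        for (double value : row)
            if (value != 0.0) ++nonZeros;

    const std::size_t cols = matrix.empty() ? 0 : matrix.front().size();
    std::ostringstream out;
    out.precision(17);  // enough significant digits to round-trip a double
    out << "%%MatrixMarket matrix coordinate real general\n";
    out << matrix.size() << ' ' << cols << ' ' << nonZeros << '\n';
    for (std::size_t i = 0; i < matrix.size(); ++i)
        for (std::size_t j = 0; j < matrix[i].size(); ++j)
            if (matrix[i][j] != 0.0)
                out << (i + 1) << ' ' << (j + 1) << ' ' << matrix[i][j] << '\n';
    return out.str();
}
```

Writing the returned string to a std::ofstream gives a file that mmread-style readers should accept. The coordinate format suits this case because only the nonzero entries are stored.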

Related

why does my circle anti aliasing algorithm in C++ give asymmetrical results?

Some background:
So my plan is to create a stippling algorithm in C++. I plan on storing precomputed data for each circle radius and writing it onto a texture map in OpenGL. I'm not sure this is the right thing to do, but I feel like it would be quicker than computing each circle dynamically, especially if lots of circles are the same size. My plan is to write a function that generates a text document full of circles for radii up to a certain size. This data will be stored bitwise inside an array of longs, std::array <long> bit = {0x21, 0x0A etc... }, so that I can encode 4x4 arrays of values with 2 bits assigned to the anti-aliasing value of each pixel. However, to create this database of anti-aliased circles I need to write a function that I keep getting wrong.
The actual question:
This may seem lazy, but I promise I have tried everything to wrap my head around what I am getting wrong here. I have written this code to anti-alias by dividing each pixel into sub-pixels. However, it seems to be returning values greater than 1, which shouldn't be possible as I have divided each pixel into 100 sub-pixels of size 0.01:
float CircleConst::PixelAA(int I, int J)
{
    float aaValue = 0;
    for (float i = (float) I; i < I + 1; i += 0.1f)
    {
        for (float j = (float) J; j < J + 1; j += 0.1f)
        {
            if ((pow((i - center), 2) + pow((j - center), 2) < pow(rad, 2)))
                aaValue += 0.01f;
        }
    }
    return aaValue;
}
also here is the code that writes the actual circle
CircleConst::CircleConst(float Rad)
{
    rad = Rad;
    dataSize = (unsigned int) ceil(2 * rad);
    center = (float) dataSize/2;
    arrData.reserve((int) pow(dataSize, 2));
    for (int i = 0; i < dataSize; i++)
    {
        for (int j = 0; j < dataSize; j++)
        {
            if ( CircleBounds(i, j, rad-1) )
                arrData.push_back(1);
            else if (!CircleBounds(i, j, rad - 1) && CircleBounds(i, j, rad + 1))
            {
                arrData.push_back(PixelAA(i,j));
            }
            else
                arrData.push_back(0);
        }
    }
}
I noticed that without the anti-aliasing the circle is shifted over by one line. This can be fixed by changing the centre of the circle to dataSize/2 - 0.5f, but that causes problems later on, because the circle then becomes asymmetrical with the anti-aliasing. Here is an example with radius 3.5:
0.4 1.0 1.1 1.1 1.1 0.4 0.0
1.0 1.0 1.0 1.0 1.0 1.1 0.2
1.1 1.0 1.0 1.0 1.0 1.0 0.5
1.1 1.0 1.0 1.0 1.0 1.0 0.5
1.1 1.0 1.0 1.0 1.0 1.0 0.2
0.4 1.1 1.0 1.0 1.0 0.5 0.0
0.0 0.2 0.5 0.5 0.2 0.0 0.0
As you can see, some of the values are over 1.0, which should not be possible. I'm sure there is an obvious answer to why this is, but I'm completely missing it.
The problem lies with lines such as this one:
for (float i = (float) I; i < I + 1; i += 0.1f)
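Floating-point numbers cannot be stored or manipulated with infinite precision, and 0.1 has no exact binary representation. By repeatedly adding 0.1f to the loop variable, the rounding errors accumulate, so after ten additions i can still be slightly less than I + 1 and the loop runs an eleventh time. With 11 iterations in one or both loops, up to 110 or 121 samples of 0.01 are summed. This is why you're seeing values higher than 1.0.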
The solution is to iterate using an integer type and compute the desired floating point numbers. For example:
for (unsigned i = 0U; i < 10U; ++i)
{
    float x = 0.1F * static_cast<float>(i);
    printf("%f\n", x);
}
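To see the effect, one can count how often the original loop header actually iterates and compare it with an integer-driven loop. The following is only a sketch: floatLoopIterations and subSamplePositions are made-up names, and which start values (if any) produce an extra iteration depends on the platform's float arithmetic.

```cpp
#include <cassert>
#include <vector>

// Counts the iterations of the loop header from the question:
// a float counter starting at `start` and advanced by 0.1f.
int floatLoopIterations(int start)
{
    int count = 0;
    for (float i = static_cast<float>(start); i < start + 1; i += 0.1f)
        ++count;
    return count;
}

// Derives every sub-sample position from an integer index, so the number of
// samples is exactly `steps` and no rounding error accumulates between them.
// Positions are taken at sub-sample centres, hence the + 0.5f.
std::vector<float> subSamplePositions(int start, int steps)
{
    assert(steps > 0);
    std::vector<float> positions;
    for (int k = 0; k < steps; ++k)
        positions.push_back(static_cast<float>(start) + (k + 0.5f) / steps);
    return positions;
}
```

floatLoopIterations may return 11 instead of 10 for some start values, while subSamplePositions(start, 10) always yields 10 positions inside [start, start + 1).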
In addition to what @Yun indicates (the round-off error of floating-point numbers), you must also pay attention to the sampling point, which must be at the pixel center.
Here is your code, with some modifications and additions:
#include <iostream>
#include <vector>
#include <iomanip>
#include <math.h>
float rad, radSquared, center;
const int filterSize = 8;
const float invFilterSize = 1.0f / filterSize;
// Sample the circle returning 1 when inside, 0 otherwise.
int SampleCircle(int i, int j) {
    float di = (i + 0.5f) * invFilterSize - center;
    float dj = (j + 0.5f) * invFilterSize - center;
    return ((di * di + dj * dj) < radSquared) ? 1 : 0;
}
// NOTE: This sampling method works with any filter size.
float PixelAA(int I, int J)
{
    int aaValue = 0;
    for (int i = 0; i < filterSize; ++i)
        for (int j = 0; j < filterSize; ++j)
            aaValue += SampleCircle(I + i, J + j);
    return (float)aaValue / (float)(filterSize * filterSize);
}
// NOTE: This sampling method works only with filter sizes that are a power of two.
float PixelAAQuadTree(int i, int j, int filterSize)
{
    if (filterSize == 1)
        return (float)SampleCircle(i, j);
    // We sample the four corners of the filter. Note that on the right and bottom corners
    // 1 is subtracted to avoid sampling overlap.
    int topLeft = SampleCircle(i, j);
    int topRight = SampleCircle(i + filterSize - 1, j);
    int bottomLeft = SampleCircle(i, j + filterSize - 1);
    int bottomRight = SampleCircle(i + filterSize - 1, j + filterSize - 1);
    // If all samples have the same value we can stop here. All samples lie outside or inside the circle.
    if (topLeft == topRight && topLeft == bottomLeft && topLeft == bottomRight)
        return (float)topLeft;
    // Halve the filter dimension.
    filterSize /= 2;
    // Recurse.
    return (PixelAAQuadTree(i, j, filterSize) +
            PixelAAQuadTree(i + filterSize, j, filterSize) +
            PixelAAQuadTree(i, j + filterSize, filterSize) +
            PixelAAQuadTree(i + filterSize, j + filterSize, filterSize)) / 4.0f;
}
void CircleConst(float Rad, bool useQuadTree)
{
    rad = Rad;
    radSquared = rad * rad;
    center = Rad;
    int dataSize = (int)ceil(rad * 2);
    std::vector<float> arrData;
    arrData.reserve(dataSize * dataSize);
    if (useQuadTree)
    {
        for (int i = 0; i < dataSize; i++)
            for (int j = 0; j < dataSize; j++)
                arrData.push_back(PixelAAQuadTree(i * filterSize, j * filterSize, filterSize));
    }
    else
    {
        for (int i = 0; i < dataSize; i++)
            for (int j = 0; j < dataSize; j++)
                arrData.push_back(PixelAA(i * filterSize, j * filterSize));
    }
    for (int i = 0; i < dataSize; i++)
    {
        for (int j = 0; j < dataSize; j++)
            std::cout << std::fixed << std::setw(2) << std::setprecision(2)
                      << std::setfill('0') << arrData[i + j * dataSize] << " ";
        std::cout << std::endl;
    }
}
int main() {
    CircleConst(3.5f, false);
    std::cout << std::endl;
    CircleConst(4.0f, false);
    std::cout << std::endl;
    std::cout << std::endl;
    CircleConst(3.5f, true);
    std::cout << std::endl;
    CircleConst(4.0f, true);
    return 0;
}
Which gives these results (the second ones with use of quad-tree to optimize number of samples required to compute the AA value):
0.00 0.36 0.84 1.00 0.84 0.36 0.00
0.36 1.00 1.00 1.00 1.00 1.00 0.36
0.84 1.00 1.00 1.00 1.00 1.00 0.84
1.00 1.00 1.00 1.00 1.00 1.00 1.00
0.84 1.00 1.00 1.00 1.00 1.00 0.84
0.36 1.00 1.00 1.00 1.00 1.00 0.36
0.00 0.36 0.84 1.00 0.84 0.36 0.00
0.00 0.16 0.70 0.97 0.97 0.70 0.16 0.00
0.16 0.95 1.00 1.00 1.00 1.00 0.95 0.16
0.70 1.00 1.00 1.00 1.00 1.00 1.00 0.70
0.97 1.00 1.00 1.00 1.00 1.00 1.00 0.97
0.97 1.00 1.00 1.00 1.00 1.00 1.00 0.97
0.70 1.00 1.00 1.00 1.00 1.00 1.00 0.70
0.16 0.95 1.00 1.00 1.00 1.00 0.95 0.16
0.00 0.16 0.70 0.97 0.97 0.70 0.16 0.00
0.00 0.36 0.84 1.00 0.84 0.36 0.00
0.36 1.00 1.00 1.00 1.00 1.00 0.36
0.84 1.00 1.00 1.00 1.00 1.00 0.84
1.00 1.00 1.00 1.00 1.00 1.00 1.00
0.84 1.00 1.00 1.00 1.00 1.00 0.84
0.36 1.00 1.00 1.00 1.00 1.00 0.36
0.00 0.36 0.84 1.00 0.84 0.36 0.00
0.00 0.16 0.70 0.97 0.97 0.70 0.16 0.00
0.16 0.95 1.00 1.00 1.00 1.00 0.95 0.16
0.70 1.00 1.00 1.00 1.00 1.00 1.00 0.70
0.97 1.00 1.00 1.00 1.00 1.00 1.00 0.97
0.97 1.00 1.00 1.00 1.00 1.00 1.00 0.97
0.70 1.00 1.00 1.00 1.00 1.00 1.00 0.70
0.16 0.95 1.00 1.00 1.00 1.00 0.95 0.16
0.00 0.16 0.70 0.97 0.97 0.70 0.16 0.00
As further notes:
you can see how quad-trees work here https://en.wikipedia.org/wiki/Quadtree
you can further modify the code and implement fixed-point math (https://en.wikipedia.org/wiki/Fixed-point_arithmetic) which has no round-off problems like floats because numbers are always represented as integers
given that these data are part of a pre-calculation phase, favor simplicity of the code over performance
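The fixed-point idea from the notes can be sketched as follows. This is a hypothetical variant, not part of the code above: CoveredSamples and unitsPerPixel are invented names, and it assumes the radius and centre are multiples of 1 / (2 * filterSize) pixel so that they can be stored exactly as integers.

```cpp
#include <cassert>

// Lengths are stored as integers in units of 1 / (2 * filterSize) pixel.
// Sub-sample centres then sit at odd multiples of that unit and are exact.
const int filterSize = 8;
const int unitsPerPixel = 2 * filterSize;

// Returns how many of the filterSize * filterSize sub-samples of pixel (I, J)
// lie inside the circle. centerUnits and radUnits are given in the integer
// units described above (an assumption of this sketch).
int CoveredSamples(int I, int J, int centerUnits, int radUnits)
{
    assert(radUnits >= 0);
    const long long radSquared = 1LL * radUnits * radUnits;
    int covered = 0;
    for (int i = 0; i < filterSize; ++i) {
        for (int j = 0; j < filterSize; ++j) {
            // Centre of sub-sample (i, j) is at (I + (i + 0.5) / filterSize) pixels,
            // which is I * unitsPerPixel + 2 * i + 1 in integer units.
            const long long di = I * unitsPerPixel + 2 * i + 1 - centerUnits;
            const long long dj = J * unitsPerPixel + 2 * j + 1 - centerUnits;
            if (di * di + dj * dj < radSquared)
                ++covered;
        }
    }
    return covered;
}
```

For a radius of 3.5 both centerUnits and radUnits would be 56. Dividing the returned count by filterSize * filterSize gives the anti-aliasing value, and that division is the only floating-point step left.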

Null space basis from QR decomposition with GSL

I'm trying to get the basis for the null space of a relatively large matrix, A^T, using GSL. So far I've been extracting right-singular vectors of the SVD corresponding to vanishing singular values, but this is becoming too slow for the sizes of matrices I'm interested in.
I know that the nullspace can be extracted as the last m-r columns of the Q-matrix in the QR decomposition of A, where r is the rank of A, but I'm not sure how rank-revealing decompositions work.
Here's my first attempt using gsl_linalg_QR_decomp:
int m = 4;
int n = 3;
gsl_matrix* A = gsl_matrix_calloc(m,n);
gsl_matrix_set(A, 0,0, 3);gsl_matrix_set(A, 0,1, 6);gsl_matrix_set(A, 0,2, 1);
gsl_matrix_set(A, 1,0, 1);gsl_matrix_set(A, 1,1, 2);gsl_matrix_set(A, 1,2, 1);
gsl_matrix_set(A, 2,0, 1);gsl_matrix_set(A, 2,1, 2);gsl_matrix_set(A, 2,2, 1);
gsl_matrix_set(A, 3,0, 1);gsl_matrix_set(A, 3,1, 2);gsl_matrix_set(A, 3,2, 1);
std::cout<<"A:"<<endl;
for(int i=0;i<m;i++){ for(int j=0;j<n;j++) printf(" %5.2f",gsl_matrix_get(A,i,j)); std::cout<<std::endl;}
gsl_matrix* Q = gsl_matrix_alloc(m,m);
gsl_matrix* R = gsl_matrix_alloc(m,n);
gsl_vector* tau = gsl_vector_alloc(std::min(m,n));
gsl_linalg_QR_decomp(A, tau);
gsl_linalg_QR_unpack(A, tau, Q, R);
std::cout<<"Q:"<<endl;
for(int i=0;i<m;i++){ for(int j=0;j<m;j++) printf(" %5.2f",gsl_matrix_get(Q,i,j)); std::cout<<std::endl;}
std::cout<<"R:"<<endl;
for(int i=0;i<m;i++){ for(int j=0;j<n;j++) printf(" %5.2f",gsl_matrix_get(R,i,j)); std::cout<<std::endl;}
This outputs
A:
3.00 6.00 1.00
1.00 2.00 1.00
1.00 2.00 1.00
1.00 2.00 1.00
Q:
-0.87 -0.29 0.41 -0.00
-0.29 0.96 0.06 -0.00
-0.29 -0.04 -0.64 -0.71
-0.29 -0.04 -0.64 0.71
R:
-3.46 -6.93 -1.73
0.00 0.00 0.58
0.00 0.00 -0.82
0.00 0.00 0.00
but I'm not sure how to compute the rank, r, from this. My second attempt uses gsl_linalg_QRPT_decomp2, replacing the last part with
gsl_vector* tau = gsl_vector_alloc(std::min(m,n));
gsl_permutation* perm = gsl_permutation_alloc(n);
gsl_vector* norm = gsl_vector_alloc(n);
int* sign = new int(); *sign = 1;
gsl_linalg_QRPT_decomp2(A, Q, R, tau, perm, sign, norm );
std::cout<<"Q:"<<endl;
for(int i=0;i<m;i++){ for(int j=0;j<m;j++) printf(" %5.2f",gsl_matrix_get(Q,i,j)); std::cout<<std::endl;}
std::cout<<"R:"<<endl;
for(int i=0;i<m;i++){ for(int j=0;j<n;j++) printf(" %5.2f",gsl_matrix_get(R,i,j)); std::cout<<std::endl;}
std::cout<<"Perm:"<<endl;
for(int i=0;i<n;i++) std::cout<<" "<<gsl_permutation_get(perm,i);
which results in
Q:
-0.87 0.50 0.00 0.00
-0.29 -0.50 -0.58 -0.58
-0.29 -0.50 0.79 -0.21
-0.29 -0.50 -0.21 0.79
R:
-6.93 -1.73 -3.46
0.00 -1.00 0.00
0.00 0.00 0.00
0.00 0.00 0.00
Perm:
1 2 0
Here, I believe that the rank is the number of non-zero diagonal elements in R, but I'm not sure which elements to extract from Q. Which approach should I take?
For 4×3 A, the “null space” will consist of 3-dimensional vectors, whereas the QR decomposition on A only gives you 4-dimensional vectors. (And of course you can generalize this for A with size M×N where M > N.)
Therefore, take the QR decomposition of the transpose of A, whose Q is now 3×3.
Sketching the process using Python/Numpy in IPython (sorry, I can’t seem to figure out how to call gsl_linalg_QR_decomp using PyGSL):
In [16]: import numpy as np
In [17]: A = np.array([[3.0, 6, 1], [1.0, 2, 1], [1.0, 2, 1], [1.0, 2, 1]])
In [18]: Q, R = np.linalg.qr(A.T) # <---- A.T means transpose(A)
In [19]: np.diag(R)
Out[19]: array([ -6.78232998e+00, 6.59380473e-01, 2.50010468e-17])
In [20]: np.round(Q * 1000) / 1000 # <---- Q to 3 decimal places
Out[20]:
array([[-0.442, -0.066, -0.894],
[-0.885, -0.132, 0.447],
[-0.147, 0.989, 0. ]])
The 19th output (i.e., Out[19], result of np.diag(R)) tells us the column-rank of A is 2: only two diagonal entries of R are numerically nonzero, and the third (about 2.5e-17) is zero to within rounding error. And looking at the 3rd column of Out[20] (Q to three decimal places), we see that the right answer is returned: [-0.894, 0.447, 0] is proportional to [1, -0.5, 0], and we know this is right because the first two columns of A are linearly dependent (the second is twice the first).
Can you check with larger matrices that the QR decomposition of transpose(A) gives you the same null space as your current SVD method?
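The two steps above (count the nonzero diagonal entries of R, then take the remaining columns of Q) can be sketched in plain C++, independent of GSL. The function names and the relative tolerance are assumptions of this sketch. It also assumes the near-zero diagonal entries of R come last, which a column-pivoted factorization such as gsl_linalg_QRPT_decomp is designed to arrange but an unpivoted QR does not guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, Matrix[row][col]

// Counts the diagonal entries of R that are nonzero relative to the largest one.
// relTol is an assumed threshold, not a value taken from GSL or NumPy.
std::size_t numericalRank(const std::vector<double>& diagR, double relTol)
{
    double largest = 0.0;
    for (double d : diagR) largest = std::max(largest, std::fabs(d));
    std::size_t rank = 0;
    for (double d : diagR)
        if (std::fabs(d) > relTol * largest) ++rank;
    return rank;
}

// Returns the trailing (columns - rank) columns of Q, one vector per column.
std::vector<std::vector<double>> trailingColumns(const Matrix& Q, std::size_t rank)
{
    assert(!Q.empty() && rank <= Q.front().size());
    std::vector<std::vector<double>> basis;
    for (std::size_t col = rank; col < Q.front().size(); ++col) {
        std::vector<double> v;
        for (const auto& row : Q) v.push_back(row[col]);
        basis.push_back(v);
    }
    return basis;
}

// Largest absolute entry of A * v, to check that v lies in the null space of A.
double residual(const Matrix& A, const std::vector<double>& v)
{
    double worst = 0.0;
    for (const auto& row : A) {
        double dot = 0.0;
        for (std::size_t k = 0; k < v.size(); ++k) dot += row[k] * v[k];
        worst = std::max(worst, std::fabs(dot));
    }
    return worst;
}
```

Fed with the diagonal from Out[19] and the rounded Q from Out[20], numericalRank gives 2 and trailingColumns returns the single vector [-0.894, 0.447, 0]. Checking residual(A, v) against a small threshold is a cheap way to compare the result with the SVD-based null space.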

Inconsistency when profiling my code with gprof

I am using a relatively simple code parallelized with OpenMP to familiarize myself with gprof.
My code mainly consists of gathering data from input files, performing some array manipulations, and writing the new data to different output files. I placed some calls to the intrinsic subroutine CPU_TIME to see if gprof was being accurate:
PROGRAM main
USE global_variables
USE fileio, ONLY: read_old_restart, write_new_restart, output_slice, write_solution
USE change_vars
IMPLICIT NONE
REAL(dp) :: t0, t1
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL allocate_data
CALL CPU_TIME(t1)
PRINT*, "Allocate data =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL build_grid
CALL CPU_TIME(t1)
PRINT*, "Build grid =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL read_old_restart
CALL CPU_TIME(t1)
PRINT*, "Read restart =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL regroup_all
CALL CPU_TIME(t1)
PRINT*, "Regroup all =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL redistribute_all
CALL CPU_TIME(t1)
PRINT*, "Redistribute =", t1 - t0
!~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CALL CPU_TIME(t0)
CALL write_new_restart
CALL CPU_TIME(t1)
PRINT*, "Write restart =", t1 - t0
END PROGRAM main
Here is the output:
Allocate data = 1.000000000000000E-003
Build grid = 0.000000000000000E+000
Read restart = 10.7963590000000
Regroup all = 6.65998700000000
Redistribute = 14.3518180000000
Write restart = 53.5218640000000
Therefore, the write_new_restart subroutine is the most time consuming and takes about 62% of the total run time. However, according to gprof, the subroutine redistribute_vars, which is called multiple times by redistribute_all, is the most time consuming with about 74% of the total time. Here is the output from gprof:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
74.40 8.95 8.95 61 0.15 0.15 change_vars_mp_redistribute_vars_
19.12 11.25 2.30 60 0.04 0.04 change_vars_mp_regroup_vars_
6.23 12.00 0.75 63 0.01 0.01 change_vars_mp_fill_last_blocks_
0.08 12.01 0.01 1 0.01 2.31 change_vars_mp_regroup_all_
0.08 12.02 0.01 __intel_ssse3_rep_memcpy
0.08 12.03 0.01 for_open
0.00 12.03 0.00 1 0.00 12.01 MAIN__
0.00 12.03 0.00 1 0.00 0.00 change_vars_mp_build_grid_
0.00 12.03 0.00 1 0.00 9.70 change_vars_mp_redistribute_all_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_read_old_restart_
0.00 12.03 0.00 1 0.00 0.00 fileio_mp_write_new_restart_
0.00 12.03 0.00 1 0.00 0.00 global_variables_mp_allocate_data_
index % time self children called name
0.00 12.01 1/1 main [2]
[1] 99.8 0.00 12.01 1 MAIN__ [1]
0.00 9.70 1/1 change_vars_mp_redistribute_all_ [3]
0.01 2.30 1/1 change_vars_mp_regroup_all_ [5]
0.00 0.00 1/1 global_variables_mp_allocate_data_ [13]
0.00 0.00 1/1 change_vars_mp_build_grid_ [10]
0.00 0.00 1/1 fileio_mp_read_old_restart_ [11]
0.00 0.00 1/1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
<spontaneous>
[2] 99.8 0.00 12.01 main [2]
0.00 12.01 1/1 MAIN__ [1]
-----------------------------------------------
0.00 9.70 1/1 MAIN__ [1]
[3] 80.6 0.00 9.70 1 change_vars_mp_redistribute_all_ [3]
8.95 0.00 61/61 change_vars_mp_redistribute_vars_ [4]
0.75 0.00 63/63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
8.95 0.00 61/61 change_vars_mp_redistribute_all_ [3]
[4] 74.4 8.95 0.00 61 change_vars_mp_redistribute_vars_ [4]
-----------------------------------------------
0.01 2.30 1/1 MAIN__ [1]
[5] 19.2 0.01 2.30 1 change_vars_mp_regroup_all_ [5]
2.30 0.00 60/60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
2.30 0.00 60/60 change_vars_mp_regroup_all_ [5]
[6] 19.1 2.30 0.00 60 change_vars_mp_regroup_vars_ [6]
-----------------------------------------------
0.75 0.00 63/63 change_vars_mp_redistribute_all_ [3]
[7] 6.2 0.75 0.00 63 change_vars_mp_fill_last_blocks_ [7]
-----------------------------------------------
<spontaneous>
[8] 0.1 0.01 0.00 for_open [8]
-----------------------------------------------
<spontaneous>
[9] 0.1 0.01 0.00 __intel_ssse3_rep_memcpy [9]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[10] 0.0 0.00 0.00 1 change_vars_mp_build_grid_ [10]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[11] 0.0 0.00 0.00 1 fileio_mp_read_old_restart_ [11]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[12] 0.0 0.00 0.00 1 fileio_mp_write_new_restart_ [12]
-----------------------------------------------
0.00 0.00 1/1 MAIN__ [1]
[13] 0.0 0.00 0.00 1 global_variables_mp_allocate_data_ [13]
-----------------------------------------------
For your information, regroup_all calls regroup_vars multiple times and redistribute_all calls redistribute_vars and fill_last_blocks multiple times.
I am compiling my code with ifort with the -openmp -O2 -pg options.
QUESTION:
Why is gprof not seeing the time my file i/o subroutines take? (read_old_restart, write_new_restart)
gprof specifically does not include I/O time. It only tries to measure CPU time.
That's because it only does two things: 1) sample the program counter on a 1/100 second clock, and the program counter is meaningless during I/O, and 2) count the number of times any function B is called by any function A.
From the call-counts, it tries to guess how much of each function's CPU time can be attributed to each caller.
That's its whole advance over pre-existing profilers.
When you use gprof, you should understand what it does and what it doesn't do.
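The gap between CPU time and elapsed time is easy to reproduce. The sketch below is only an illustration with invented names (Timing, timeBlockingWait); it uses a sleep as a stand-in for a blocking read or write and compares std::clock with a wall clock. On POSIX systems std::clock reports processor time, while some platforms (notably Windows) report elapsed time instead, so the result is platform dependent.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wallSeconds;  // elapsed real time
    double cpuSeconds;   // processor time charged to this process
};

// Times a blocking wait, standing in for a slow read or write.
Timing timeBlockingWait(std::chrono::milliseconds wait)
{
    assert(wait.count() >= 0);
    const auto wallStart = std::chrono::steady_clock::now();
    const std::clock_t cpuStart = std::clock();

    std::this_thread::sleep_for(wait);  // the process is idle here, as it is while blocked on I/O

    const std::clock_t cpuEnd = std::clock();
    const auto wallEnd = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(wallEnd - wallStart).count(),
             static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC };
}
```

Where std::clock measures processor time, a 50 ms wait shows up almost entirely in wallSeconds and hardly at all in cpuSeconds. A profiler that samples the program counter only while the process is running sees the same thing.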

FFT output is blank when using FFTW_MEASURE, but works fine with FFTW_ESTIMATE

I'm having the following issue in my attempt to use fftw3. For some reason, whenever I do an FFT using FFTW_MEASURE instead of FFTW_ESTIMATE, I get blank output. Ultimately I'm trying to implement fft convolution, so my example below includes both the FFT and the inverse FFT.
Clearly I'm missing something... is anyone able to educate me? Thank you!
I'm on Linux (OpenSUSE Leap 42.1), using the version of fftw3 available from my package manager.
Minimum working example:
#include <iostream>
#include <iomanip>
#include <cmath>
#include <fftw3.h>
using namespace std;
int main(int argc, char ** argv)
{
    int width = 10;
    int height = 8;
    cout.setf(ios::fixed|ios::showpoint);
    cout << setprecision(2);
    double * inp = (double *) fftw_malloc(sizeof(double) * width * height);
    fftw_complex * cplx = (fftw_complex *)fftw_malloc(sizeof(fftw_complex) * height * (width/2 + 1));
    for(int i = 0; i < width * height; i++) inp[i] = sin(i);
    fftw_plan fft = fftw_plan_dft_r2c_2d(height, width, inp, cplx, FFTW_MEASURE );
    fftw_plan ifft = fftw_plan_dft_c2r_2d(height, width, cplx, inp, FFTW_MEASURE );
    fftw_execute(fft);
    for(int j = 0; j < height; j++)
    {
        for(int i = 0; i < (width/2 + 1); i++)
        {
            cout << cplx[i+width*j][0] << " ";
        }
        cout << endl;
    }
    cout << endl << endl;
    fftw_execute(ifft);
    for(int j = 0; j < height; j++)
    {
        for(int i = 0; i < width; i++)
        {
            cout << inp[i+width*j] << " ";
        }
        cout << endl;
    }
    fftw_destroy_plan(fft);
    fftw_destroy_plan(ifft);
    fftw_free(cplx);
    fftw_free(inp);
    return 0;
}
Just change between FFTW_ESTIMATE and FFTW_MEASURE.
Compiled with:
g++ *.cpp -lm -lfftw3 --std=c++11
Output with FFTW_ESTIMATE (first block is the real part of the FT, second block is after inverse FT):
1.51 2.24 -1.52 -0.05 0.15 0.19
0.23 0.15 1.77 1.19 0.54 0.41
1.97 -0.15 -1.32 -2.51 -1.20 -3.38
4.34 15.21 -24.82 -7.44 -4.16 -2.51
-0.43 -0.06 1.55 2.93 -2.81 -0.42
0.00 0.00 0.00 -nan 0.00 0.00
0.00 0.00 0.00 0.00 0.00 -nan
0.00 0.00 0.00 0.00 0.00 0.00
0.00 67.32 72.74 11.29 -60.54 -76.71 -22.35 52.56 79.15 32.97
-43.52 -80.00 -42.93 33.61 79.25 52.02 -23.03 -76.91 -60.08 11.99
73.04 66.93 -0.71 -67.70 -72.45 -10.59 61.00 76.51 21.67 -53.09
-79.04 -32.32 44.11 79.99 42.33 -34.25 -79.34 -51.48 23.71 77.10
59.61 -12.69 -73.32 -66.54 1.42 68.07 72.14 9.89 -61.46 -76.30
-20.99 53.62 78.93 31.67 -44.70 -79.98 -41.72 34.89 79.43 50.94
-24.38 -77.29 -59.13 13.39 73.60 66.15 -2.12 -68.44 -71.83 -9.18
61.91 76.08 20.31 -54.14 -78.81 -31.02 45.29 79.96 41.12 -35.53
Output with FFTW_MEASURE (first block is the real part of the FT, second block is after inverse FT):
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 -nan 0.00 0.00
0.00 0.00 0.00 0.00 0.00 -nan
0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The comment of @Paul_R is sufficient to solve the problem. The input array can be overwritten when fftw_plan_dft_r2c_2d() is called. Hence, the input array must be initialized after the creation of the FFTW plan.
The documentation of the planner flags of FFTW details what is happening. I am pretty sure that you have already guessed the reason why FFTW_ESTIMATE preserves the input array and FFTW_MEASURE modifies it.
Important: the planner overwrites the input array during planning unless a saved plan (see Wisdom) is available for that problem, so you should initialize your input data after creating the plan. The only exceptions to this are the FFTW_ESTIMATE and FFTW_WISDOM_ONLY flags, as mentioned below.
...
FFTW_ESTIMATE specifies that, instead of actual measurements of different algorithms, a simple heuristic is used to pick a (probably sub-optimal) plan quickly. With this flag, the input/output arrays are not overwritten during planning.
FFTW_MEASURE tells FFTW to find an optimized plan by actually computing several FFTs and measuring their execution time. Depending on your machine, this can take some time (often a few seconds). FFTW_MEASURE is the default planning option.
...
The documentation also tells us that the flag FFTW_ESTIMATE will preserve the input. Still, the best advice is to initialize the array once the plan is created.

Strange profiler behavior: same functions, different performances

I was learning to use gprof and got weird results for this code:
int one(int a, int b)
{
    int i, r = 0;
    for (i = 0; i < 1000; i++)
    {
        r += b / (a + 1);
    }
    return r;
}
int two(int a, int b)
{
    int i, r = 0;
    for (i = 0; i < 1000; i++)
    {
        r += b / (a + 1);
    }
    return r;
}
int main()
{
    for (int i = 1; i < 50000; i++)
    {
        one(i, i * 2);
        two(i, i * 2);
    }
    return 0;
}
and this is the profiler output
% cumulative self self total
time seconds seconds calls us/call us/call name
50.67 1.14 1.14 49999 22.80 22.80 two(int, int)
49.33 2.25 1.11 49999 22.20 22.20 one(int, int)
If I call them in the opposite order, the result is inverted: the function called second takes more time.
Both functions are the same, but the first call always takes less time than the second.
Why is that?
Note: The assembly code is exactly the same and the code is being compiled with no optimizations.
I'd guess it is some fluke in run-time optimisation - one uses a register and the other doesn't or something minor like that.
The system clock probably runs to a precision of 100nsec. The average call time 30nsec or 25nsec is less than one clock tick. A rounding error of 5% of a clock tick is pretty small. Both times are near enough zero.
My guess: it is an artifact of the way mcount data gets interpreted. The granularity for mcount (monitor.h) is on the order of a 32 bit longword - 4 bytes on my system. So you would not expect this: I get different reports from prof vs gprof on the EXACT same mon.out file.
solaris 9 -
prof
%Time Seconds Cumsecs #Calls msec/call Name
46.4 2.35 2.35 59999998 0.0000 .div
34.8 1.76 4.11 120000025 0.0000 _mcount
10.1 0.51 4.62 1 510. main
5.3 0.27 4.89 29999999 0.0000 one
3.4 0.17 5.06 29999999 0.0000 two
0.0 0.00 5.06 1 0. _fpsetsticky
0.0 0.00 5.06 1 0. _exithandle
0.0 0.00 5.06 1 0. _profil
0.0 0.00 5.06 20 0.0 _private_exit, _exit
0.0 0.00 5.06 1 0. exit
0.0 0.00 5.06 4 0. atexit
gprof
% cumulative self self total
time seconds seconds calls ms/call ms/call name
71.4 0.90 0.90 1 900.00 900.00 key_2_text <cycle 3> [2]
5.6 0.97 0.07 106889 0.00 0.00 _findbuf [9]
4.8 1.03 0.06 209587 0.00 0.00 _findiop [11]
4.0 1.08 0.05 __do_global_dtors_aux [12]
2.4 1.11 0.03 mem_init [13]
1.6 1.13 0.02 102678 0.00 0.00 _doprnt [3]
1.6 1.15 0.02 one [14]
1.6 1.17 0.02 two [15]
0.8 1.18 0.01 414943 0.00 0.00 realloc <cycle 3> [16]
0.8 1.19 0.01 102680 0.00 0.00 _textdomain_u <cycle 3> [21]
0.8 1.20 0.01 102677 0.00 0.00 get_mem [17]
0.8 1.21 0.01 $1 [18]
0.8 1.22 0.01 $2 [19]
0.8 1.23 0.01 _alloc_profil_buf [22]
0.8 1.24 0.01 _mcount (675)
Is it always the first one called that is slightly slower? If that's the case, I would guess it is a CPU cache doing its thing, or it could be lazy paging by the operating system.
BTW: what optimization flags are you compiling with?