Jumping ahead in parallelised PRNGs in C++ - c++

I am implementing a Monte Carlo simulation, where I need to run multiple realisations of some dynamics and then take an average over the end state for all the simulations. Since the number of realisation is large, I run them in parallel using OpenMP. Every realisation starts from the same initial conditions and then at each time step a process happens with a given probability, and to determine which process I draw a random number from a uniform distribution.
I want to make sure that all simulations are statistically independent and that there is no overlap in the random numbers that are being drawn.
I use OpenMP to parallelise the for loops, so the skeleton code looks like this:
vector<int> data(number_of_sims);
double t;
double r;
#pragma omp parallel for
for(int i = 0; i < number_of_sims; i++){
// run sim
t = 0;
while (t < T) {
r = draw_random_uniform();
if (r < p) do_something();
else do_something_else();
t += 1.0; // increment time
}
// some calculation
data[i] = calculate();
}
So every time I want a random number, I would call a function which used the Mersenne Twister seeded with random device.
double draw_random_uniform(){
static thread_local auto seed = std::random_device{}();
static thread_local mt19937 mt(seed);
std::uniform_real_distribution<double> distribution(0.0, 1.0);
double r = distribution(mt);
return r;
}
However, since I ultimately want to run this code on a high power computing cluster I want to avoid using std::random_device() as it is risky for systems with little entropy.
So instead I want to create an initial random number generator and then jump it forward a large amount for each of the threads. I have been attempting to do this with the Xoroshiro256+ PRNG (I found some good implementation here: https://github.com/Reputeless/Xoshiro-cpp). Something like this for example:
XoshiroCpp::Xoshiro256Plus prng(42); // properly seeded prng
#pragma omp parallel num_threads()
{
static thread_local XoshiroCpp::Xoshiro256Plus lprng(prng); // thread local copy
lprng.longJump(); // jump ahead
// code as before, except use lprng to generate random numbers
# pragma omp for
....
}
However, I cannot get such an implementation to work. I suspect because of the double OpenMP for loops. I had the thought of pre-generating all of the PNRGs and storing in a container, then accessing the relevant one by using omp_get_thread_num() inside the parallelised for loop.
I am unsure if this is the best way to go about doing all this. Any advice is appreciated.

Coordinating random number generators with long jump can be tricky. Alternatively there is a much simpler method.
Here is a quote from the authors website:
It is however important that the period is long enough.
Moreover, if you run n independent
computations starting at random seeds, the sequences used by each
computation should not overlap.
Now, given a generator with period P, the probability that
subsequences of length L starting at random points in the state space
overlap is bounded by n² L/P. If your generator has period 2^256 and you run on 2^64
cores (you will never have them) a computation using 2^64 pseudorandom
numbers (you will never have the time) the probability of overlap
would be less than 2^-64.
So instead of trying to coordinate, you could in each thread just randomly seed a new generator from std::random_device{}. The period is so large that it will not collide.
While this sounds like a very add-hock approach, this random-seeding method is actually a widely used and classic method.
You just need to make sure the seeds are different. Depending on the platform usually different random seeds are proposed.
Using a truly random source
Having an atomic int that is incremented and some hashing
Using another pseudo random number generator to generate a seed sequence
Using a combination of thread id and time to create a seed
If repeatability is not needed, seeds from a random source is the most easiest and safest solution.
The paper from L'Ecuyer et. al. from 2017 gives a good overview of methods for generating parallel streams. He calls this approach "RNG with a “random” seed for each stream` under chapter 4.
vector<int> data(number_of_sims);
double t;
double r;
#pragma omp parallel for
for(int i = 0; i < number_of_sims; i++){
// random 128 bit seed
auto rd = std::random_device{};
auto seed = std::seed_seq {rd(), rd(), rd(), rd()};
auto mt = std::mt19937 {seed};
// run sim
t = 0;
while (t < T) {
r = draw_random_uniform(mt);
if (r < p) do_something();
else do_something_else();
t += 1.0; // increment time
}
// some calculation
data[i] = calculate();
}
and
double draw_random_uniform(mt19937 &mt){
std::uniform_real_distribution<double> distribution(0.0, 1.0);
return distribution(mt);
}
If number_of_sims is not extremely large there is no need for static or thread_local initialization.

You should read "Parallel Random Numbers, as easy as one, two three"
http://www.thesalmons.org/john/random123/papers/random123sc11.pdf
This paper explicitly addresses your forward stepping issues.
You can now find implementations of this generator in maths libraries (such as Intel's MKL, which uses the specialized encryption instructions, so will be hard to beat by hand!)

Related

Optimal way to handle array indexing in parallel?

I have the following situation: I have a list of particles in a box of size L, where L is the length of one of the sides.
Next, I split the box into cells, where L/cell_dim = 7. So there are 7*7*7 cells.
Finally, I read through all the particles, note their position, and calculate which cell they are in.
I accomplish the above in an openMP parallel for loop. However, I need to capture the information in a thread safe fashion such that I don't have to loop through all the particles for each cell. So I need some way to record an arbitrary subset of the particles into each cell, in parallel.
The method I have right now makes use of the OpenMP critical code block. I have an array size [7][7][7][max_particles], where max_particles is the highest number of particles per cell, but which is much less than the total number of particles. I record the index of the last particle added in a counter array size [7][7][7], and update the cell array according to the latest count in my parallel loop:
int cube[7][7][7][10];
int cube_counts[7][7][7]={0};
#pragma omp parallel for num_threads(a lot)
for (int i = 0; i < num_particles; i++){
cell_x = //cell calculation;
cell_y = //ditto;
cell_z = //...;
#pragma omp critical
{
cube_counts[cell_x][cell_y][cell_z] += 1;
// for readability
int index = cube_counts[cell_x][cell_y][cell_z];
cube[cell_x][cell_y][cell_z][index] = i;
}
}
// rest in pseudo code:
foreach cell:
adjacent_cell = cell2
particle_countA = cube_counts[cellx][celly][cellz]
particle_countB = cube_counts[cell2x][cell2y][cell2z]
// these two for loops will cover ~2-4 particles,
// so super small...as a result of the cell analysis above.
for particle in cell:
for particle in cell2:
...do stuff
Although this works, it increases in speed by a factor of more than 2 when I am able to eliminate the critical block (I am on an intel coprocessor with 60 physical, 240 logical).
How would I accomplish this without need for the critical block? I thought of doing a big array...but then I lose everything I gained when I iterate through the 7*7*7*257 (where 257 is the particle count) array. Linked lists still have the race conditions.
Maybe some kind of unordered, thread safe list...?
Using a lock instead of the critical section can be driven further:
You may use atomic increment and atomic assignment pseudo calls ("intrinsics") that the compiler will translate to the correct x86 specific assembler instructions. This is however platform or even compiler dependent.
If your use a modern c++ compiler (C++11) then std::atomic_* might be the best way to do it.

OpenMP - Poor performance when solving system of linear equations

I am trying to use OpenMP to parallelize a simple c++ code that solves a system of linear equations by Gauss elimination.
The relevant part of my code is:
#include <iostream>
#include <time.h>
using namespace std;
#define nl "\n"
void LinearSolve(double **& M, double *& V, const int N, bool parallel, int threads){
//...
for (int i=0;i<N;i++){
#pragma omp parallel for num_threads(threads) if(parallel)
for (int j=i+1;j<N;j++){
double aux, * Mi=M[i], * Mj=M[j];
aux=Mj[i]/Mi[i];
Mj[i]=0;
for (int k=i+1;k<N;k++) {
Mj[k]-=Mi[k]*aux;
};
V[j]-=V[i]*aux;
};
};
//...
};
class Time {
clock_t startC, endC;
time_t startT, endT;
public:
void start() {startC=clock(); time (&startT);};
void end() {endC=clock(); time (&endT);};
double timedifCPU() {return(double(endC-startC)/CLOCKS_PER_SEC);};
int timedif() {return(int(difftime (endT,startT)));};
};
int main (){
Time t;
double ** M, * V;
int N=5000;
cout<<"number of equations "<<N<<nl<<nl;
M= new double * [N];
V=new double [N];
for (int i=0;i<N;i++){
M[i]=new double [N];
};
for (int m=1;m<=16;m=2*m){
cout<<m<<" threads"<<nl;
for (int i=0;i<N;i++){
V[i]=i+1.5*i*i;
for (int j=0;j<N;j++){
M[i][j]=(j+2.3)/(i-0.2)+(i+2)/(j+3); //some function to get regular matrix
};
};
t.start();
LinearSolve(M,V,N,m!=1,m);
t.end();
cout<<"time "<<t.timedif()<<", CPU time "<<t.timedifCPU()<<nl<<nl;
};
}
Since the code is extremely simple I would expect that the time would be
inversely proportional to the number of threads. However the typical result I get is (the code is compiled with gcc on Linux)
number of equations 5000
1 threads
time 217, CPU time 215.89
2 threads
time 125, CPU time 245.18
4 threads
time 80, CPU time 302.72
8 threads
time 67, CPU time 458.55
16 threads
time 55, CPU time 634.41
There is a decrease in time, but much less that I would like to and the CPU time mysteriously grows.
I suspect the problem is in memory sharing, but I have been unable to identify it. Access to row M[j] should not be a problem, since each thread writes to the a different row of the matrix. There could be a problem in reading from row M[i], so I also tried to make a separate copy of this row for each thread by replacing the parallel loop by
#pragma omp parallel num_threads(threads) if(parallel)
{
double Mi[N];
for (int j=i;j<N;j++) Mi[j]=M[i][j];
#pragma omp for
for (int j=i+1;j<N;j++){
double aux, * Mj=M[j];
aux=Mj[i]/Mi[i];
Mj[i]=0;
for (int k=i+1;k<N;k++) {
Mj[k]-=Mi[k]*aux;
};
V[j]-=V[i]*aux;
};
};
Unfortunately it does not help at all.
I would very much appreciate any help.
Your problem is excessive OpenMP synchronization.
Having the #omp parallel inside the first loop means that each iteration of the outer loop comes with the whole synchronization overhead.
Take a look at the top-most chart on the image here (more detail can be found on the Allinea MAP OpenMP profiler introduction). The top line is application activity - and dark gray means "OpenMP synchronization" and green means "doing compute".
You can see a lot of dark gray in the right hand side of that top graph/chart - that is when the 16 threads are running. You're spending a lot of time synchronizing.
I also see a lot of time being spent in memory access (more than in compute) - so it's probably this that is making what should be balanced workload actually be highly unbalanced and giving the synchronization delay.
As the other respondent suggested - it's worth reading other literature for ideas here.
I think the underlying problem may be that traditional Gaussian elimination may not be suitable for parallelization.
Gaussian elimination is a process where by each subsequent step relies on the result of the previous step i.e. each iteration of your linear solve loop is dependent on the results of the previous iteration, i.e. it must be done serially. Try searching the literature for "parallel row reduction algorithms".
Also glancing at your code it looks like you will have race condition.

rand() function not generating enough random

I am working on an openGL project and my rand function is not giving me a big enough random range.
I am tasked with writing a diamond program to where one diamond is centered on the screen and 5 are randomly placed elsewhere on the screen. What is happening is my center diamond is where it is supposed to be and the other five are bunching with a small range of random left of center. I have included the function that draws the diamonds.
void myDisplay(void)
{
srand(time(0));
GLintPoint CenterPoint;
int const size = 20;
CenterPoint.x = screenWidth / 2;
CenterPoint.y = screenHeight / 2;
glClear(GL_COLOR_BUFFER_BIT);
drawDiamond(CenterPoint, size);
for (int i = 0; i < 5; i++)
{
GLfloat x = rand() % 50 + 10;
GLfloat y = rand() % 200 + 100;
GLfloat size = rand() % 100;
GLintPoint diam = {x,y};
drawDiamond(diam, size);
}
If more code is needed, please let me know and I will edit. Does anyone have any ideas on how I can correct this? I have "toyed" with the numbers of the rand() function and really doesn't seem to do much. They still seem to bunch just at different points on the screen. I appreciate the help. Just in case anyone needs to know, I am creating this in VS2012.
This has no business being in this function:
srand(time(0));
This should be called once at the beginning of your program (a good place is just inside main()); and most-certainly not in your display routine. Once the seed is set, you should never do it again for your process unless you want to repeat a prior sequence (which by the looks of it, you don't).
That said, I would strongly advise using the functionality in <random> that comes with your C++11 standard library. With it you can establish distributions (ex: uniform_int_distribution<>) that will do much of your modulo work for you, and correctly account for the problems such things can encounter (Andon pointed out one regarding likeliness of certain numbers based on the modulus).
Spend some time with <random>. Its worth it. An example that uses the three ranges you're using:
#include <iostream>
#include <random>
using namespace std;
int main()
{
std::random_device rd;
std::default_random_engine rng(rd());
// our distributions.
std::uniform_int_distribution<> dist1(50,60);
std::uniform_int_distribution<> dist2(200,300);
std::uniform_int_distribution<> dist3(0,100);
for (int i=0;i<10;++i)
std::cout << dist1(rng) << ',' << dist2(rng) << ',' << dist3(rng) << std::endl;
return EXIT_SUCCESS;
}
Output (obviously varies).
58,292,70
56,233,41
57,273,98
52,204,8
50,284,43
51,292,48
53,220,42
54,281,64
50,290,51
53,220,7
Yeah, it really is just that simple. Like I said, that library is this cat's pajamas. There are many more things it offers, including random normal distributions, different engine backends, etc. I highly encourage you to check into it.
As WhozCraig mentioned, seeding your random number generator with the time every time you call myDisplay (...) is a bad idea. This is because time (NULL) has a granularity of 1 second, and in real-time graphics you usually draw your scene more than one time per-second. Thus, you are repeating the same sequence of random numbers every time you call myDisplay (...) when less than 1 second has elapsed.
Also, using modulo arithmetic on a call to rand (...) adversely affects the quality of the returned values. This is because it changes the probability distribution for numbers occurring. The preferred technique should be to cast rand (...) to float and then divide by RAND_MAX, and then multiply this result by your desired range.
GLfloat x = rand() % 50 + 10; /* <-- Bad! */
/* Consider this instead */
GLfloat x = (GLfloat)rand () / RAND_MAX * 50.0f + 10.0f;
Although, come to think of it. Why are you using GLfloat for x and y if you are going to store them in an integer data structure 2 lines later?

FFTW vs Matlab FFT

I posted this on matlab central but didn't get any responses so I figured I'd repost here.
I recently wrote a simple routine in Matlab that uses an FFT in a for-loop; the FFT dominates the calculations. I wrote the same routine in mex just for experimentation purposes and it calls the FFTW 3.3 library. It turns out that the matlab routine runs faster than the mex routine for very large arrays (about twice as fast). The mex routine uses wisdom and and performs the same FFT calculations. I also know matlab uses FFTW, but is it possible their version is slightly more optimized? I even used the FFTW_EXHAUSTIVE flag and its still about twice as slow for large arrays than the MATLAB counterpart. Furthermore I ensured the matlab I used was single threaded with the "-singleCompThread" flag and the mex file I used was not in debug mode. Just curious if this was the case - or if there are some optimizations matlab is using under the hood that I dont know about. Thanks.
Here's the mex portion:
void class_cg_toeplitz::analysis() {
// This method computes CG iterations using FFTs
// Check for wisdom
if(fftw_import_wisdom_from_filename("cd.wis") == 0) {
mexPrintf("wisdom not loaded.\n");
} else {
mexPrintf("wisdom loaded.\n");
}
// Set FFTW Plan - use interleaved FFTW
fftw_plan plan_forward_d_buffer;
fftw_plan plan_forward_A_vec;
fftw_plan plan_backward_Ad_buffer;
fftw_complex *A_vec_fft;
fftw_complex *d_buffer_fft;
A_vec_fft = fftw_alloc_complex(n);
d_buffer_fft = fftw_alloc_complex(n);
// CREATE MASTER PLAN - Do this on an empty vector as creating a plane
// with FFTW_MEASURE will erase the contents;
// Use d_buffer
// This is somewhat dangerous because Ad_buffer is a vector; but it does not
// get resized so &Ad_buffer[0] should work
plan_forward_d_buffer = fftw_plan_dft_r2c_1d(d_buffer.size(),&d_buffer[0],d_buffer_fft,FFTW_EXHAUSTIVE);
plan_forward_A_vec = fftw_plan_dft_r2c_1d(A_vec.height,A_vec.value,A_vec_fft,FFTW_WISDOM_ONLY);
// A_vec_fft.*d_buffer_fft will overwrite d_buffer_fft
plan_backward_Ad_buffer = fftw_plan_dft_c2r_1d(Ad_buffer.size(),d_buffer_fft,&Ad_buffer[0],FFTW_EXHAUSTIVE);
// Get A_vec_fft
fftw_execute(plan_forward_A_vec);
// Find initial direction - this is the initial residual
for (int i=0;i<n;i++) {
d_buffer[i] = b.value[i];
r_buffer[i] = b.value[i];
}
// Start CG iterations
norm_ro = norm(r_buffer);
double fft_reduction = (double)Ad_buffer.size(); // Must divide by size of vector because inverse FFT does not do this
while (norm(r_buffer)/norm_ro > relativeresidual_cutoff) {
// Find Ad - use fft
fftw_execute(plan_forward_d_buffer);
// Get A_vec_fft.*fft(d) - A_vec_fft is only real, but d_buffer_fft
// has complex elements; Overwrite d_buffer_fft
for (int i=0;i<n;i++) {
d_buffer_fft[i][0] = d_buffer_fft[i][0]*A_vec_fft[i][0]/fft_reduction;
d_buffer_fft[i][1] = d_buffer_fft[i][1]*A_vec_fft[i][0]/fft_reduction;
}
fftw_execute(plan_backward_Ad_buffer);
// Calculate r'*r
rtr_buffer = 0;
for (int i=0;i<n;i++) {
rtr_buffer = rtr_buffer + r_buffer[i]*r_buffer[i];
}
// Calculate alpha
alpha = 0;
for (int i=0;i<n;i++) {
alpha = alpha + d_buffer[i]*Ad_buffer[i];
}
alpha = rtr_buffer/alpha;
// Calculate new x
for (int i=0;i<n;i++) {
x[i] = x[i] + alpha*d_buffer[i];
}
// Calculate new residual
for (int i=0;i<n;i++) {
r_buffer[i] = r_buffer[i] - alpha*Ad_buffer[i];
}
// Calculate beta
beta = 0;
for (int i=0;i<n;i++) {
beta = beta + r_buffer[i]*r_buffer[i];
}
beta = beta/rtr_buffer;
// Calculate new direction vector
for (int i=0;i<n;i++) {
d_buffer[i] = r_buffer[i] + beta*d_buffer[i];
}
*total_counter = *total_counter+1;
if(*total_counter >= iteration_cutoff) {
// Set total_counter to -1, this indicates failure
*total_counter = -1;
break;
}
}
// Store Wisdom
fftw_export_wisdom_to_filename("cd.wis");
// Free fft alloc'd memory and plans
fftw_destroy_plan(plan_forward_d_buffer);
fftw_destroy_plan(plan_forward_A_vec);
fftw_destroy_plan(plan_backward_Ad_buffer);
fftw_free(A_vec_fft);
fftw_free(d_buffer_fft);
};
Here's the matlab portion:
% Take FFT of A_vec.
A_vec_fft = fft(A_vec); % Take fft once
% Find initial direction - this is the initial residual
x = zeros(n,1); % search direction
r = zeros(n,1); % residual
d = zeros(n+(n-2),1); % search direction; pad to allow FFT
for i = 1:n
d(i) = b(i);
r(i) = b(i);
end
% Enter CG iterations
total_counter = 0;
rtr_buffer = 0;
alpha = 0;
beta = 0;
Ad_buffer = zeros(n+(n-2),1); % This holds the product of A*d - calculate this once per iteration and using FFT; only 1:n is used
norm_ro = norm(r);
while(norm(r)/norm_ro > 10^-6)
% Find Ad - use fft
Ad_buffer = ifft(A_vec_fft.*fft(d));
% Calculate rtr_buffer
rtr_buffer = r'*r;
% Calculate alpha
alpha = rtr_buffer/(d(1:n)'*Ad_buffer(1:n));
% Calculate new x
x = x + alpha*d(1:n);
% Calculate new residual
r = r - alpha*Ad_buffer(1:n);
% Calculate beta
beta = r'*r/(rtr_buffer);
% Calculate new direction vector
d(1:n) = r + beta*d(1:n);
% Update counter
total_counter = total_counter+1;
end
In terms of time, for N = 50000 and b = 1:n it takes about 10.5 seconds with mex and 4.4 seconds with matlab. I'm using R2011b. Thanks
A few observations rather than a definite answer since I do not know any of the specifics of the MATLAB FFT implementation:
Based on the code you have, I can see two explanations for the speed difference:
the speed difference is explained by differences in levels of optimization of the FFT
the while loop in MATLAB is executed a significantly smaller number of times
I will assume you already looked into the second issue and that the number of iterations are comparable. (If they aren't, this is most likely to some accuracy issues and worth further investigations.)
Now, regarding FFT speed comparison:
Yes, the theory is that FFTW is faster than other high-level FFT implementations but it is only relevant as long as you compare apples to apples: here you are comparing implementations at a level further down, at the assembly level, where not only the selection of the algorithm but its actual optimization for a specific processor and by software developers with varying skills comes at play
I have optimized or reviewed optimized FFTs in assembly on many processors over the year (I was in the benchmarking industry) and great algorithms are only part of the story. There are considerations that are very specific to the architecture you are coding for (accounting for latencies, scheduling of instructions, optimization of register usage, arrangement of data in memory, accounting for branch taken/not taken latencies, etc.) and that make differences as important as the selection of the algorithm.
With N=500000, we are also talking about large memory buffers: yet another door for more optimizations that can quickly get pretty specific to the platform you run your code on: how well you manage to avoid cache misses won't be dictated by the algorithm so much as by how the data flow and what optimizations a software developer may have used to bring data in and out of memory efficiently.
Though I do not know the details of the MATLAB FFT implementation, I am pretty sure that an army of DSP engineers has been (and is still) honing on its optimization as it is key to so many designs. This could very well mean that MATLAB had the right combination of developers to produce a much faster FFT.
This is classic performance gain thanks to low-level and architecture-specific optimization.
Matlab uses FFT from the Intel MKL (Math Kernel Library) binary (mkl.dll). These are routines optimized (at assembly level) by Intel for Intel processors. Even on AMD's it seems to give nice performance boosts.
FFTW seems like a normal c library that is not as optimized. Hence the performance gain to use the MKL.
I have found the following comment on the MathWorks website [1]:
Note on large powers of 2: For FFT dimensions that are powers of
2, between 2^14 and 2^22, MATLAB software uses special preloaded
information in its internal database to optimize the FFT computation.
No tuning is performed when the dimension of the FTT is a power of 2,
unless you clear the database using the command fftw('wisdom', []).
Although it relates to powers of 2, it may hint upon that MATLAB employs its own 'special wisdom' when using FFTW for certain (large) array sizes. Consider: 2^16 = 65536.
[1] R2013b Documentation available from http://www.mathworks.de/de/help/matlab/ref/fftw.html (accessed on 29 Oct 2013)
EDIT: #wakjah 's reply to this answer is accurate: FFTW does support split real and imaginary memory storage via its Guru interface. My claim about hacking is thus not accurate but can very well apply if FFTW's Guru interface is not used - which is the case by default, so beware still!
First, sorry for being a year late. I'm not convinced that the speed increase you see comes from MKL or other optimizations. There is something quite fundamentally different between FFTW and Matlab, and that is how complex data is stored in memory.
In Matlab, the real and imaginary parts of a complex vector X are separate arrays Xre[i] and Xim[i] (linear in memory, efficient when operating on either of them separately).
In FFTW, the real and imaginary parts are interlaced as double[2] by default, i.e. X[i][0] is the real part, and X[i][1] is the imaginary part.
Thus, to use the FFTW library in mex files one cannot use the Matlab array directly, but must allocate new memory first, then pack the input from Matlab into FFTW format, and then unpack the output from FFTW into Matlab format. i.e.
X = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
Y = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * N);
then
for (size_t i=0; i<N; ++i) {
X[i][0] = Xre[i];
X[i][1] = Xim[i];
}
then
for (size_t i=0; i<N; ++i) {
Yre[i] = Y[i][0];
Yim[i] = Y[i][1];
}
Hence, this requires 2x memory allocations + 4x reads + 4x writes -- all of size N. This does take a toll speed-wise on large problems.
I have a hunch that Mathworks may have hacked the FFTW3 code to allow it to read input vectors directly in the Matlab format, which avoids all of the above.
In this scenario, one can only allocate X and use X for Y to run FFTW in-place (as fftw_plan_*(N, X, X, ...) instead of fftw_plan_*(N, X, Y, ...)), since it'll be copied to the Yre and Yim Matlab vector, unless the application requires/benefits from keeping X and Y separate.
EDIT: Looking at the memory consumption in real-time when running Matlab's fft2() and my code based on the fftw3 library, it shows that Matlab only allocates only one additional complex array (the output), whereas my code needs two such arrays (the *fftw_complex buffer plus the Matlab output). An in-place conversion between the Matlab and fftw formats is not possible because the Matlab's real and imaginary arrays are not consecutive in memory. This suggests that Mathworks hacked the fftw3 library to read/write the data using the Matlab format.
One other optimization for multiple calls, is to allocate persistently (using mexMakeMemoryPersistent()). I'm not sure if the Matlab implementation does this as well.
Cheers.
p.s. As a side note, the Matlab complex data storage format is more efficient for operating on the real or imaginary vectors separately. On FFTW's format you'd have to do ++2 memory reads.

How to efficiently determine the minimum necessary size of a pre-rendered sine wave audio buffer for looping?

I've written a program that generates a sine-wave at a user-specified frequency, and plays it on a 96kHz audio channel. To save a few CPU cycles I employ the old trick of pre-rendering a short section of audio into a buffer, and then playing back the buffer in a loop, so that I can avoid calling the sin() function 96000 times per second for the duration of the program and just do simple memory-copying instead.
My problem is efficiently determining what the minimum usable size of this pre-rendered buffer would be. For some frequencies it is easy -- for example, an 8kHz sine wave can be perfectly represented by generating a 12-sample buffer and playing it in a looping, because (8000*12 == 96000). For other frequencies, however, a single cycle of the sine wave requires a non-integral number of samples to represent, and therefore looping a single cycle's worth of samples would cause unacceptable glitching.
For some of those frequencies, however, it's possible to get around that problem by pre-rendering more than one cycle of the sine wave and looping that -- if I can figure out how many cycles are required so that the number of cycles present in the buffer will be integral, while also guaranteeing that the number of samples in the buffer are integral. For example, a sine-wave frequency of 12.8kHz translates to a single-cycle buffer-size of 7.5 samples, which won't loop cleanly, but if I render two consecutive cycles of the sine wave into a 15-sample buffer, then I can cleanly loop the result.
My current approach to solving this issue is brute force: I try all possible cycle-counts and see if any of them result in a buffer size with an integral number of samples in it. I think that approach is unsatisfactory for the following reasons:
1) It's very inefficient. For example, the program shown below (which prints buffer-size results for 480,000 possible frequency values between 0Hz and 48kHz) takes 35 minutes to complete on my 2.7GHz machine. I think there must be a much faster way to do this.
2) I suspect that the results are not 100% accurate, due to floating-point errors.
3) The algorithm gives up if it can't find an acceptable buffer size less than 10 seconds long. (I could make the limit higher, but of course that would make the algorithm even slower).
So, is there any way to calculate the minimum-usable-buffer-size analytically, preferably in O(1) time? It seems like it should be easy, but I haven't been able to figure out what kind of math I should use.
Thanks in advance for any advice!
#include <stdio.h>
#include <math.h>
static const long long SAMPLES_PER_SECOND = 96000;
static const long long MAX_ALLOWED_BUFFER_SIZE_SAMPLES = (SAMPLES_PER_SECOND * 10);
// Returns the length of the pre-render buffer needed to properly
// loop a sine wave at the given frequence, or -1 on failure.
static int GetNumCyclesNeededForPreRenderedBuffer(float freqHz)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
for (int count=1; (count*oneCycleLengthSamples) < MAX_ALLOWED_BUFFER_SIZE_SAMPLES; count++)
{
double remainder = fmod(oneCycleLengthSamples*count, 1.0);
if (remainder > 0.5) remainder = 1.0-remainder;
if (remainder <= 0.0) return count;
}
return -1;
}
int main(int, char **)
{
for (int i=0; i<48000*10; i++)
{
double freqHz = ((double)i)/10.0f;
int numCyclesNeeded = GetNumCyclesNeededForPreRenderedBuffer(freqHz);
if (numCyclesNeeded >= 0)
{
double oneCycleLengthSamples = SAMPLES_PER_SECOND/freqHz;
printf("For %.1fHz, use a pre-render-buffer size of %f samples (%i cycles, %f samples/cycle)\n", freqHz, (numCyclesNeeded*oneCycleLengthSamples), numCyclesNeeded, oneCycleLengthSamples);
}
else printf("For %.1fHz, there was no suitable pre-render-buffer size under the allowed limit!\n", freqHz);
}
return 0;
}
number_of_cycles/size_of_buffer = frequency/samples_per_second
This implies that if you can simplify your frequency/samples_per_second fraction, you can find the size of your buffer and the number of cycles in the buffer. If frequency and samples_per_second are integers, you can simplify the fraction by finding the greatest common divisor, otherwise you can use the method of continued fractions.
Example:
Say your frequency is 1234.5, and your samples_per_second is 96000. We can make these into two integers by multiplying by 10, so we get the ratio:
frequency/samples_per_second = 12345/960000
The greatest common divisor is 15, so it can be reduced to 823/64000.
So you would need 823 cycles in a 64000 sample buffer to reproduce the frequency exactly.