Generating random numbers with CUDA via the rejection method: performance problems (C++)

I'm running a Monte Carlo code for particle simulation, written in CUDA. In each step I calculate the velocity of each particle and update its position. The velocity is directly proportional to the path length. For a given material, the path length follows a certain distribution, and I know its probability density function. I am now trying to sample random numbers from this distribution via the rejection method. I would describe my CUDA knowledge as limited. I understand that it is preferable to generate large chunks of random numbers at once instead of many small chunks. However, the rejection method only needs two random numbers at a time: it checks a condition and repeats the procedure on failure. Therefore I generate my random numbers inside the kernel.
Using the profiler (nvvp) I noticed that roughly 50% of my time is spent in the rejection method.
Here is my question: are there any ways to optimize the rejection method?
I appreciate every answer.
CODE
Here is the rejection method.
__global__ void rejectSamplePathlength(float* P, curandState* globalState,
                                       int numParticles, float sigma, int timestep, curandState state) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles) {
        bool success = false;
        float p;
        float rho1, rho2;
        float a, b;
        a = 0.0;
        b = 10.0;
        curand_init(i, 0, 0, &state);
        while (!success) {
            rho1 = curand_uniform(&globalState[i]);
            rho2 = curand_uniform(&globalState[i]);
            if (rho2 < pathlength(a, b, rho1, sigma)) {
                p = a + rho1 * (b - a);
                success = true;
            }
        }
        P[i] = abs(p);
    }
}
The pathlength function in the if statement computes a value y = f(x) on the device.
I'm pretty sure that curand_init is problematic in terms of time, but without this call, wouldn't each thread generate the same numbers?

Maybe you could create a pool of pre-generated uniform random numbers in a previous kernel, and then pick your uniforms from that pool, cycling over it. But the pool should be large enough to avoid an infinite loop.
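A minimal sketch of that pool idea, assuming the cuRAND host API is used to fill a device buffer up front; the names fillPool, rejectSampleFromPool, d_pool and poolSize are illustrative and not the asker's code, while pathlength is the same device function as in the question.

#include <curand.h>

// Fill a device buffer with uniform random numbers once, from the host.
// d_pool must point to poolSize floats of device memory.
void fillPool(float* d_pool, size_t poolSize, unsigned long long seed) {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateUniform(gen, d_pool, poolSize);   // numbers in (0, 1]
    curandDestroyGenerator(gen);
}

// Rejection sampling that consumes the pre-generated pool instead of calling
// curand_uniform per draw. Each thread starts at its own offset and wraps
// around; if the pool is too small relative to the rejection rate, threads
// will revisit the same numbers, so size it generously.
__global__ void rejectSampleFromPool(float* P, const float* pool, int poolSize,
                                     int numParticles, float sigma) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= numParticles) return;
    const float a = 0.0f, b = 10.0f;
    int idx = (2 * i) % poolSize;                   // per-thread starting offset
    float p = 0.0f;
    bool success = false;
    while (!success) {
        float rho1 = pool[idx];
        float rho2 = pool[(idx + 1) % poolSize];
        idx = (idx + 2) % poolSize;
        if (rho2 < pathlength(a, b, rho1, sigma)) { // same acceptance test as before
            p = a + rho1 * (b - a);
            success = true;
        }
    }
    P[i] = fabsf(p);
}

Whether this beats per-thread curand_uniform depends on the acceptance rate; the main win is that curand_init disappears from the hot path entirely.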

Related

CUDA: Runge-Kutta trajectory on each GPU thread

Summary: How do you avoid performance loss caused by different work loads for different threads? (Kernel with a while loop on each thread)
Problem:
I want to solve particle trajectories (described by a 2nd-order differential equation) using Runge-Kutta for many different initial conditions. The trajectories will generally have different lengths (each trajectory ends when a particle hits some target). Furthermore, to ensure numerical stability, the Runge-Kutta step size is set adaptively. This leads to two nested while loops with an unknown number of iterations (see the serial example below).
I want to implement the Runge-Kutta routine to run on a GPU with CUDA/C++. The trajectories have no dependency on each other, so as a first approach I will just parallelize over the different initial conditions, such that each thread corresponds to a unique trajectory. When a thread is done with a particle trajectory, I want it to start on a new one.
If I understand it correctly, however, the unknown length of each while loop (particle trajectory) means that different threads will get different amounts of work, which might lead to a severe performance loss on GPU.
Question: Is it possible to overcome (in a simple way) the performance losses caused by different workloads for different threads? For example, setting each warp size to be only 1, so that each thread (warp) can run independently? Or will this lead to other performance losses (e.g. no coalesced memory reads)?
Serial pseudo-code:
// Solve a particle trajectory for each initial condition
// N_trajectories: much larger than 1e6
for( int t_i = 0; t_i < N_trajectories; ++t_i )
{
    // Set start coordinates
    double x  = x_init[t_i];
    double y  = y_init[t_i];
    double vx = vx_init[t_i];
    double vy = vy_init[t_i];
    double stepsize  = ...;
    double tolerance = ...;
    ...
    // Solve Runge-Kutta trajectory until convergence
    int converged = 0;
    while ( !converged )
    {
        // Do a Runge-Kutta step; if the step size is too large then decrease it
        int goodStepSize = 0;
        while( !goodStepSize )
        {
            // Update x, y, vx, vy
            double error = doRungeKutta(x, y, vx, vy, stepsize);
            if( error < tolerance )
                goodStepSize = 1;
            else
                stepsize *= 0.5;
        }
        if( (abs(x-x_final) < epsilon) && (abs(y-y_final) < epsilon) )
            converged = 1;
    }
}
A short test of my code shows that the inner while loop runs 2-4 times in 99% of all cases and more than 10 times in 1% of all cases before a satisfactory Runge-Kutta step size is found.
Parallel pseudo-code:
int tpb = 64;
int bpg = (N_trajectories + tpb - 1) / tpb;
RungeKuttaKernel<<<bpg, tpb>>>( ... );

__global__ void RungeKuttaKernel( ... )
{
    int idx = ...;
    // Set start coordinates
    double x = x_init[idx];
    ...
    while ( !converged )
    {
        ...
        while( !goodStepSize )
        {
            double error = doRungeKutta( ... );
            ...
        }
        ...
    }
}
I will attempt to answer the question myself, until someone comes up with a better solution.
Pitfalls with directly porting the serial code:
The two while loops will lead to significant branch divergence and performance loss. The outer loop is the "full" trajectory, while the inner loop is one Runge-Kutta step with adaptive step-size correction. Inner loop: if we attempt a Runge-Kutta step with too large a step size, the approximation error will be too large and we need to redo the step with a smaller step size until the error is smaller than our tolerance. This means that threads that need very few iterations to find an appropriate step size will have to wait for threads that need more iterations. Outer loop: this reflects how many successful Runge-Kutta steps we need before the trajectory is completed. Different trajectories will reach their target in a different number of steps. We always have to wait for the trajectory with the most iterations before we are completely done.
Proposed parallel approach:
We notice that every iteration consists of doing one Runge-Kutta step. The branching comes from the fact that we either need to reduce the step size for the next iteration, or update the Runge-Kutta coefficients (e.g. position/velocity) if the step size was OK. I therefore propose that we replace the two while loops with one for loop. The first step of the for loop is to solve Runge-Kutta, followed by an if statement checking whether the step size was small enough; if it was, we update the positions (and check for total convergence). All threads now solve only one Runge-Kutta step at a time, and we trade the idle time of the original scheme (all threads waiting for the thread that needs the most attempts to find the correct step size) for the branch divergence of a single if statement. In my case, solving Runge-Kutta is expensive compared with evaluating this if statement, so this is an improvement. The issue now lies in setting an appropriate limit on the for loop and flagging the threads that need more work. This limit sets an upper bound on the longest time a finished thread has to wait for the others. Pseudo-code:
int N_trajectories = 1e6;
int trajectoryStepsPerKernel = 50;
thrust::device_vector<int> isConverged(N_trajectories, 0); // Set all trajectories to unconverged
int tpb = 64;
int bpg = (N_trajectories + tpb - 1) / tpb;

// Run until all trajectories are converged
while ( vectorSum(isConverged) != N_trajectories )
{
    RungeKuttaKernel<<<bpg, tpb>>>( trajectoryStepsPerKernel, isConverged, ... );
    cudaDeviceSynchronize();
}

__global__ void RungeKuttaKernel( ... )
{
    int idx = ...;
    // Set start coordinates
    int converged = 0;
    double x = x_init[idx];
    ...
    for ( int i = 0; i < trajectoryStepsPerKernel; ++i )
    {
        double error = doRungeKutta( x_new, y_new, ... );
        if( error > tolerance )
        {
            stepsize *= 0.5;
        } else {
            converged = checkConvergence( x, x_new, y, y_new, ... );
            x = x_new;
            y = y_new;
            ...
        }
    }
    // Update start positions in case we need to continue on this trajectory
    isConverged[idx] = converged;
    x_init[idx] = x;
    y_init[idx] = y;
    ...
}

CUDA parallelizing a dependent 2D array

I have a sample loop of the following form. Notice that my psi[i][j] depends on psi[i+1][j], psi[i-1][j], psi[i][j+1] and psi[i][j-1], and I have to calculate psi for the inner matrix only. I tried writing this in CUDA, but the results are not the same as the sequential version.
for(i = 1; i <= leni-2; i++)
    for(j = 1; j <= lenj-2; j++){
        psi[i][j] = ( omega[i][j]*(dx*dx)*(dy*dy)
                    + (psi[i+1][j]+psi[i-1][j])*(dy*dy)
                    + (psi[i][j+1]+psi[i][j-1])*(dx*dx) ) / (2.0*(dx*dx)+2.0*(dy*dy));
    }
Here's my CUDA version.
// KERNEL
__global__ void ComputePsi(double *psi, double *omega, int imax, int jmax)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int i = (jmax*x) + y;
    double beta = 1;
    double dx = (double)30/(imax-1);
    double dy = (double)1/(jmax-1);
    if( i%jmax != 0 && (i+1)%jmax != 0 && i >= jmax && i < imax*jmax-jmax ){
        psi[i] = ( omega[i]*(dx*dx)*(dy*dy)
                 + (psi[i+jmax]+psi[i-jmax])*(dy*dy)
                 + (psi[i+1]+psi[i-1])*(dx*dx) ) / (2.0*(dx*dx)+2.0*(dy*dy));
    }
}
//Code
cudaMalloc((void **) &dev_psi, leni*lenj*sizeof(double));
cudaMalloc((void **) &dev_omega, leni*lenj*sizeof(double));
cudaMemcpy(dev_psi, psi, leni*lenj*sizeof(double),cudaMemcpyHostToDevice);
cudaMemcpy(dev_omega, omega, leni*lenj*sizeof(double),cudaMemcpyHostToDevice);
dim3 grids(leni,lenj);
for(iterpsi=0;iterpsi<30;iterpsi++)
ComputePsi<<<grids,1>>>(dev_psi, dev_omega, leni, lenj);
where psi[leni][lenj] and omega[leni][lenj] are double arrays.
The problem is that the sequential and CUDA codes give different results. Is there any modification needed in the code?
You are working in global memory and you are changing psi entries while other threads might still need the old values. Just store the values of the new iteration in a separate array, but keep in mind that you have to swap the arrays after each iteration!
A more sophisticated approach would be a solution working with shared memory and assigning spatial subdomains to the separate threads. Just search for CUDA tutorials on solving the heat/diffusion equation and you will get the idea.
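A minimal sketch of that separate-buffer (Jacobi-style) idea; the kernel name ComputePsiJacobi, the psi_old/psi_new buffers and the 16x16 launch configuration are illustrative assumptions, not the asker's code. Note that this computes a Jacobi update, which (as the answer below points out) is not identical to the serial sweep, but it is at least deterministic and independent of thread ordering.

// Read only from psi_old, write only to psi_new, so no thread ever sees a
// half-updated value from the current iteration. The boundary entries must
// be present in both buffers before the first launch, since they are never
// rewritten here.
__global__ void ComputePsiJacobi(const double* psi_old, double* psi_new,
                                 const double* omega, int imax, int jmax,
                                 double dx, double dy)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // column
    if (i >= 1 && i <= imax - 2 && j >= 1 && j <= jmax - 2) {
        int idx = i * jmax + j;
        psi_new[idx] = ( omega[idx]*(dx*dx)*(dy*dy)
                       + (psi_old[idx + jmax] + psi_old[idx - jmax])*(dy*dy)
                       + (psi_old[idx + 1]    + psi_old[idx - 1])   *(dx*dx) )
                       / (2.0*(dx*dx) + 2.0*(dy*dy));
    }
}

// Host side: launch with many threads per block and swap the buffers each iteration.
// dim3 tpb(16, 16);
// dim3 grids((leni + tpb.x - 1)/tpb.x, (lenj + tpb.y - 1)/tpb.y);
// for (int iterpsi = 0; iterpsi < 30; iterpsi++) {
//     ComputePsiJacobi<<<grids, tpb>>>(dev_psi_old, dev_psi_new, dev_omega, leni, lenj, dx, dy);
//     std::swap(dev_psi_old, dev_psi_new);
// }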
for(i=1;i<=leni-2;i++)
for(j=1;j<=lenj-2;j++){
psi[i][j]= ( omega[i][j]*(dx*dx)*(dy*dy) +
(psi[i+1][j]+psi[i-1][j]) * (dy*dy) +
(psi[i][j+1]+psi[i][j-1]) * (dx*dx)
)/(2.0*(dx*dx)+2.0*(dy*dy));
}
I think this kernel is not correct even sequentially: the value of psi[i][j] depends on the order of the operations, so you are using not-yet-updated values of psi[i+1][j] and psi[i][j+1], while psi[i-1][j] and psi[i][j-1] have already been updated in this sweep.
Be sure that with CUDA the result will be different, since the order of the operations is different.
To enforce such an ordering, if it is possible at all, you would need to insert so many synchronizations that it is probably not worthwhile on CUDA. Is that really what you need to do?

Is it possible to run the sum computation in parallel in OpenCL?

I am a newbie in OpenCL. However, I understand the basics of C/C++ and OOP.
My question is as follows: is it somehow possible to run the sum computation task in parallel? Is it theoretically possible? Below I describe what I've tried to do:
The task is, for example:
double* values = new double[1000]; // let's pretend it has some random values inside
double sum = 0.0;
for(int i = 0; i < 1000; i++) {
    sum += values[i];
}
What I tried to do in the OpenCL kernel (and I feel it is wrong, because it probably accesses the same "sum" variable from different threads/tasks at the same time):
__kernel void calculate2dim(__global float* vectors1dim,
                            __global float output,
                            const unsigned int count) {
    int i = get_global_id(0);
    output += vectors1dim[i];
}
This code is wrong. I would highly appreciate it if anyone could tell me whether it is theoretically possible to run such tasks in parallel, and if it is, how.
If you want to sum the values of your array in a parallel fashion, you should make sure you reduce contention and that there are no data dependencies across threads.
Data dependencies will cause threads to have to wait for each other, creating contention, which is what you want to avoid to get true parallelization.
One way you could do that is to split your array into N arrays, each containing some subsection of your original array, and then calling your OpenCL kernel function with each different array.
At the end, when all kernels have done the hard work, you can just sum up the results of each array into one. This operation can easily be done by the CPU.
The key is to not have any dependencies between the calculations done in each kernel, so you have to split your data and processing accordingly.
I don't know if your data has any actual dependencies from your question, but that is for you to figure out.
The piece of code I've provided for reference should do the job.
E.g. you have N elements, and the size of your workgroup is WS = 64. I assume that N is a multiple of 2*WS (this is important: one workgroup calculates the sum of 2*WS elements). Then you need to run the kernel specifying:
globalSizeX = WS*(N/(2*WS)); // N/2 work items, since each work item loads two elements
As a result, the sum array will hold the partial sums of blocks of 2*WS elements (e.g. sum[1] will contain the sum of the elements whose indices run from 2*WS to 4*WS-1).
If your globalSizeX is 2*WS or less (which means that you have only one workgroup), then you are done: just use sum[0] as the result.
If not, you need to repeat the procedure, this time using the sum array as the input array and writing the output to another array (create 2 arrays and ping-pong between them), and so on until you have only one workgroup.
Search also for the Hillis-Steele and Blelloch parallel scan algorithms.
This article could be useful as well
Here is the actual example:
// WS must match the work-group size used on the host (e.g. 64).
#define WS 64

__kernel void par_sum(__global unsigned int* input, __global unsigned int* sum)
{
    int li      = get_local_id(0);
    int groupId = get_group_id(0);

    // Each work-group loads 2*WS elements into local memory.
    __local unsigned int our_h[2 * WS];
    our_h[2*li + 0] = input[2*WS*groupId + 2*li + 0];
    our_h[2*li + 1] = input[2*WS*groupId + 2*li + 1];

    // Up-sweep (reduction) over the 2*WS elements.
    int width  = 2;
    int num_el = 2*WS / width;
    int wby2   = width >> 1;
    for(int i = WS; i > 0; i >>= 1)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(li < num_el)
        {
            int idx = width*(li + 1) - 1;
            our_h[idx] = our_h[idx] + our_h[idx - wby2];
        }
        width  <<= 1;
        wby2    = width >> 1;
        num_el >>= 1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // The last element now holds the partial sum of this work-group.
    if(0 == li)
        sum[groupId] = our_h[2*WS - 1];
}

Mathematica/CUDA reduce execution time

I'm writing a simple Monte Carlo simulation for particle transport. My approach is to write a kernel for CUDA and execute it as a Mathematica function.
Kernel:
#include "curand_kernel.h"
#include "math.h"
extern "C" __global__ void monteCarlo(Real_t *transmission, mint seed, mint pathN) {
curandState rngState;
int index = threadIdx.x + blockIdx.x*blockDim.x;
curand_init(seed, index, 0, &rngState);
if (index < pathN) {
//-------------start one packet run----------------------
float packetWeight = 1.0;
int m = 0;
while(packetWeight > 0.0){
//MONTE CARLO CODE
// Test: still in the sample?
if(z_coordinate > sampleThickness){
packetWeight = 0;
z_coordinate = sampleThickness;
transmission[index]=1;
}
}
}
//-------------end one packet run------------------------
}
}
Mathematica code:
Needs["CUDALink`"];
cudaBM = CUDAFunctionLoad[code,
"monteCarlo", {{_Real, "Output"}, _Integer, _Integer}, 256,
"UnmangleCode" -> False];
pathN = 100000;
result = 0; (*count for transmitted particles*)
For[j = 0, j < 10, j++,
buffer = CUDAMemoryAllocate["Float", 100000];
cudaBM[buffer, 1490, pathN];
resultOneRun = Total[CUDAMemoryGet[buffer]];
result = result + resultOneRun;
];
Everything seems to work so far, but the speed improvement compared to pure C code without CUDA is negligible. I have two problems:
the curand_init() function is executed by all threads at the beginning of every simulation step -> can I call this function once for all threads?
the kernel returns a very large array of reals (100,000) to Mathematica. I know that the bottleneck of CUDA is the channel bandwidth between GPU and CPU. I need only the sum of all elements of the list, so it would be more efficient to calculate the sum of the list elements on the GPU and send only one real number to the CPU.
1) If you need to execute curand_init() only once for all threads, could you just do that on the CPU and pass the result as an argument to CUDA?
2) How about a "device float sumTotal" function which sums and returns your values? Have you copied as much of the *transmission data as possible into a shared memory buffer?
As per CURAND docs,
"Calls to curand_init() are slower than calls to curand() or curand_uniform(). Large offsets to curand_init() take more time than smaller offsets. It is much faster to save and restore random generator state than to recalculate the starting state repeatedly."
http://docs.nvidia.com/cuda/curand/index.html#topic_1_3_4
Also, please look into this thread for more details:
CUDA program causes nvidia driver to crash
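A minimal sketch of the "save and restore generator state" pattern those docs describe: call curand_init once per thread in a setup kernel, keep the states in global memory, and have the Monte Carlo kernel load and store its state instead of re-initializing it every launch. The kernel names and the rngStates buffer are illustrative assumptions:

// Run once: give every thread its own RNG state, seeded with the same seed
// but a different sequence number so the streams do not overlap.
__global__ void setupRng(curandState* rngStates, unsigned long long seed, int pathN) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < pathN)
        curand_init(seed, index, 0, &rngStates[index]);
}

// Every simulation step: load the state, use it, write it back.
__global__ void monteCarloStep(Real_t* transmission, curandState* rngStates, int pathN) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < pathN) {
        curandState localState = rngStates[index];   // restore
        // ... one packet run, drawing with curand_uniform(&localState) ...
        rngStates[index] = localState;               // save for the next launch
    }
}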

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 to 30 million floating point values in an array.
Several methods perform different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;
    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }
    noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
UPDATE GPU Version
__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
    float y;                                         // compute one (or more) entries
    int noOfOccurrences = 0;
    int a;

    while( x < size )                // While there is work to do, each thread will:
    {
        dictionary[x] = 0;           // Initialize the position it will work on
        noOfOccurrences = 0;
        for(int j = 0; j < largeFloatingPointArraySize; j++) // Search for floats
        {                                                    // equal to this entry
            y = largeFloatingPointArray[j];  // Take a candidate from the float array
            y *= 10000;                      // e.g. if y = 0.0001f
            a = y + 0.5;                     // then a = 1 + 0.5 = 1
            if (a == x) noOfOccurrences++;
        }
        dictionary[x] += noOfOccurrences;    // Update in the dictionary the number
                                             // of times that the float appears
        x += blockDim.x * gridDim.x;         // Move to the next position this thread works on
    }
}
I have only tested this for smaller inputs, because I am testing on my laptop. Nevertheless, it is working, but more tests are needed.
UPDATE Sequential Version
I just wrote this naive version that executes your algorithm for an array with 30,000,000 elements in less than 20 seconds (including the time taken by the function that generates the data).
This naive version first sorts your array of floats. Afterwards, it walks through the sorted array, counts how many times a given value appears, and then puts this value in a dictionary along with the number of times it appeared.
You can use a sorted map instead of the unordered_map that I used.
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>

typedef std::tr1::unordered_map<float, int> Mymap;

void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;
    for(long int i = 0; i < size; i++)
        data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}

void print_array(float *data, long int size)
{
    for(long int i = 0; i < size; i++)
        printf("%f\n", data[i]);
}

std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    std::tr1::unordered_map<float, int> dict;
    for(long int i = 1; i < size; i++)
    {
        if(previous == data[i])
            count++;
        else
        {
            dict.insert(Mymap::value_type(previous, count));
            previous = data[i];
            count = 1;
        }
    }
    dict.insert(Mymap::value_type(previous, count)); // add the last member
    return dict;
}

void printMAP(std::tr1::unordered_map<float, int> dict)
{
    for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
    {
        std::cout << "key(float): " << i->first << ", value(int): " << i->second << std::endl;
    }
}

int main(int argc, char** argv)
{
    using namespace std;
    int size = 1000000;
    if(argc > 1) size = atoi(argv[1]);
    printf("Size = %d\n", size);

    // Allocate on the heap; a stack array of 30 million floats would overflow the stack.
    float *data = new float[size];
    std::tr1::unordered_map<float, int> dict;

    generator(data, size);
    sort(data, data + size);
    dict = fill_dict(data, size);

    delete[] data;
    return 0;
}
If you have the thrust library installed on your machine, you should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
It will certainly be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran a molecular dynamics simulation entirely on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameters 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).
So if you are planning to put 30 million floating point values on the GPU, which is about 114 MB of global memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with molecular dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours; based on your code example it also looks lighter than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z, ...){
    int pos = (threadIdx.x + blockIdx.x * blockDim.x);
    ...
    while(pos < particles)
    {
        for (i = 0; i < particles; i++)
        {
            if( /* inside of the same radius */ )
            {
                // calculate force
            }
        }
        pos += blockDim.x * gridDim.x;
    }
}
A simple example of CUDA code could be the sum of two arrays:
In C:
for(int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
In CUDA:
__global__ void add(int *c, int *a, int *b, int N)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    for(; pos < N; pos += blockDim.x * gridDim.x)
        c[pos] = a[pos] + b[pos];
}
In CUDA you basically take each for-loop iteration and assign it to a thread, using:
1) threadIdx.x + blockIdx.x*blockDim.x;
Each block has an ID from 0 to N-1 (N being the maximum number of blocks) and each block has 'X' threads with IDs from 0 to X-1.
This gives each thread the for-loop iteration it will compute, based on its own ID and the ID of the block it is in; blockDim.x is the number of threads per block.
So if you have 2 blocks, each with 10 threads, and N = 40, then:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary)
{
    // You can turn the dictionary into one array of int:
    // each position will represent one float value.
    // Since x goes from 0f to 100f in steps of 0.0001f,
    // you can associate each x with a different position in the dictionary:
    //   pos 0 has the same meaning as 0f,
    //   pos 1 means the float 0.0001f,
    //   pos 2 means the float 0.0002f, etc.
    // Then you use the int at each position to count
    // how many times that "float" appeared.

    int x = blockIdx.x;    // Each block will take a different x to work on
    float y;

    while( x < 1000000 )   // x < 100f (for an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = converting_int_to_float(x); // This function converts x back to the
                                              // float you use (x * 0.0001f)

        // Each thread of the block takes different y values from largeFloatingPointArray
        for(int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }

        // Every thread adds its partial count; atomicAdd keeps the concurrent updates safe
        atomicAdd(&dictionary[x], noOfOccurrences);

        x += gridDim.x;    // Move this block on to the next x value
    }
}
You have to use atomicAdd because many threads (from the same block, and possibly from different blocks) may update the same dictionary entry concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr. Dobb's Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
And others:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look at the last item; you will find many links to learn CUDA.
OpenCL: OpenCL Tutorials | MacResearch
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to repeatedly add a small amount to a floating point number like that; the rounding error will add up and you won't get what you intended. I've added an if statement to my sample below to check whether inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
foreach (float x in largeFloatingPointArray)
{
    if (Math.Truncate(x/0.0001f)*0.0001f == x)
    {
        if (noOfNumbers.ContainsKey(x))
            noOfNumbers[x] = noOfNumbers[x] + 1;
        else
            noOfNumbers.Add(x, 1);
    }
}
Hope this helps.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely YES. This kind of algorithm is typically an ideal candidate for massive data-parallel processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives: CUDA or OpenCL.
CUDA is mature with a lot of tools, but is centric to NVidia GPUs.
OpenCL is a standard running on NVidia and AMD GPUs, and on CPUs too. So you should really favour it.
For a tutorial you have an excellent series on CodeProject by Rob Farber: http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use case there are a lot of samples for histogram building with OpenCL (note that many are image histograms, but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.
In addition to the suggestion by the above poster, use the TPL (Task Parallel Library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.ForEach and ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks, each generating a dictionary which would then be reduced to a single dictionary, would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
I am not sure whether using GPUs would be a good match, given that 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.
I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use the floating point value as the key and the number of occurrences in the array as the value. The worst-case scenario is that each value occurs only once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application runs, then an in-memory hash table makes sense. If it is static, the table could be saved in a key-value database such as Berkeley DB. Let's call this the 'lookup' system.
On another system, let's call it 'main', create chunks of work, 'scatter' the work items across the N systems, and 'gather' the results as they become available. E.g. a work item could be as simple as two numbers indicating the range a system should work on. When a system completes the work, it sends back an array of occurrences and is ready to work on another chunk.
Performance is improved because we do not keep iterating over largeFloatingPointArray. If the lookup system becomes a bottleneck, it can be replicated on as many systems as needed.
With large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.
I am working on a compiler for parallel programming in C targeted at many-core based systems, often referred to as microservers, that are (or will be) built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication (IPC) across systems. One of the IPC mechanisms available is socket/TCP/IP.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
#include <stdio.h>
#include <stdlib.h>

/*
 * Convert the X range from 0f to 100f in steps of 0.0001f
 * into a range of integers 0 to (100 * 10000) to use as an
 * index into an array.
 */
#define X_MAX (1 + (100 * 10000))

/*
 * Number of floats in largeFloatingPointArray needs to be defined
 * below to be whatever your value is.
 */
#define LARGE_ARRAY_MAX (1000)

int main(void)
{
    int j, y, *noOfOccurances;
    float *largeFloatingPointArray;

    /*
     * Allocate memory for largeFloatingPointArray and populate it.
     */
    largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
    if (largeFloatingPointArray == 0) {
        printf("out of memory\n");
        exit(1);
    }

    /*
     * Allocate memory to hold noOfOccurances. The index divided by 10000 is
     * the floating point number. The contents is the count.
     *
     * E.g. noOfOccurances[12345] = 20 means 1.2345f occurs 20 times
     * in largeFloatingPointArray.
     */
    noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
    if (noOfOccurances == 0) {
        printf("out of memory\n");
        exit(1);
    }

    for (j = 0; j < LARGE_ARRAY_MAX; j++) {
        y = (int)(largeFloatingPointArray[j] * 10000);
        if (y >= 0 && y < X_MAX) {
            noOfOccurances[y]++;
        }
    }

    return 0;
}