CUDA: Runge-Kutta trajectory on each GPU thread (C++)

Summary: How do you avoid the performance loss caused by different workloads on different threads (a kernel with a while loop on each thread)?
Problem:
I want to solve particle trajectories (described by a 2nd-order differential equation) using Runge-Kutta for many different initial conditions. The trajectories will generally have different lengths (each trajectory ends when a particle hits some target). Furthermore, to ensure numerical stability, the Runge-Kutta step size is set adaptively. This leads to two nested while loops, each with an unknown number of iterations (see the serial example below).
I want to implement the Runge-Kutta routine to run on a GPU with CUDA/C++. The trajectories have no dependency on each other, so as a first approach I will just parallelize over the different initial conditions, such that each thread corresponds to a unique trajectory. When a thread is done with a particle trajectory, I want it to start on a new one.
If I understand it correctly, however, the unknown length of each while loop (particle trajectory) means that different threads will get different amounts of work, which might lead to a severe performance loss on the GPU.
Question: Is it possible to overcome (in a simple way) the performance loss caused by different workloads for different threads? For example, by setting each warp size to 1, such that each thread (warp) can then run independently? Or will this lead to other performance losses (e.g. no coalesced memory reads)?
Serial pseudo-code:
// Solve a particle trajectory for each initial condition
// N_trajectories: much larger than 1e6
for( int t_i = 0; t_i < N_trajectories; ++t_i )
{
    // Set start coordinates
    double x  = x_init[t_i];
    double y  = y_init[t_i];
    double vx = vx_init[t_i];
    double vy = vy_init[t_i];
    double stepsize  = ...;
    double tolerance = ...;
    ...

    // Solve Runge-Kutta trajectory until convergence
    int converged = 0;
    while ( !converged )
    {
        // Do a Runge-Kutta step; if the step size is too large then decrease it
        int goodStepSize = 0;
        while( !goodStepSize )
        {
            // Update x, y, vx, vy
            double error = doRungeKutta(x, y, vx, vy, stepsize);
            if( error < tolerance )
                goodStepSize = 1;
            else
                stepsize *= 0.5;
        }
        if( (abs(x - x_final) < epsilon) && (abs(y - y_final) < epsilon) )
            converged = 1;
    }
}
A short test of my code shows that the inner while loop runs 2-4 times in 99% of all cases and >10 times in 1% of all cases before a satisfactory Runge-Kutta step size is found.
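The question never shows doRungeKutta or the force field. Purely as an illustration of where the error estimate could come from, here is a hypothetical sketch of one adaptive step: an explicit Heun (2nd-order) update with the embedded Euler step used as the local error estimate. The acceleration field ax()/ay() is an assumption, and because this version updates the state in place, a caller that rejects the step should restore the saved state before retrying:

#include <cmath>

// Hypothetical force field; the real one is not given in the question.
double ax( double x, double y );
double ay( double x, double y );

// One Heun step with an embedded Euler step as the error estimate
// (a sketch, not the actual doRungeKutta from the question).
double doRungeKutta( double &x, double &y, double &vx, double &vy, double stepsize )
{
    // Stage 1: slope at the current state
    const double k1x  = vx,       k1y  = vy;
    const double k1vx = ax(x, y), k1vy = ay(x, y);

    // Euler predictor (also the lower-order embedded solution)
    const double xe  = x  + stepsize * k1x,  ye  = y  + stepsize * k1y;
    const double vxe = vx + stepsize * k1vx, vye = vy + stepsize * k1vy;

    // Stage 2: slope at the predictor
    const double k2x  = vxe,        k2y  = vye;
    const double k2vx = ax(xe, ye), k2vy = ay(xe, ye);

    // Heun (2nd-order) update, written back in place
    x  += 0.5 * stepsize * (k1x  + k2x);
    y  += 0.5 * stepsize * (k1y  + k2y);
    vx += 0.5 * stepsize * (k1vx + k2vx);
    vy += 0.5 * stepsize * (k1vy + k2vy);

    // Distance between the Heun and Euler positions as the local error estimate
    const double ex = x - xe;
    const double ey = y - ye;
    return sqrt( ex * ex + ey * ey );
}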
Parallel pseudo-code:
int tpb = 64;
int bpg = (N_trajectories + tpb-1) / tpb;
RungeKuttaKernel<<<bpg, tpb>>>( ... );
__global__ void RungeKuttaKernel( ... )
{
    int idx = ...;

    // Set start coordinates
    double x = x_init[idx];
    ...

    while ( !converged )
    {
        ...
        while( !goodStepSize )
        {
            double error = doRungeKutta( ... );
            ...
        }
        ...
    }
}

I will attempt to answer the question myself, until someone comes up with a better solution.
Pitfalls with directly porting the serial code:
The two while loops lead to significant branch divergence and performance loss. The outer loop is the "full" trajectory, while the inner loop is one Runge-Kutta step with adaptive step-size correction.
Inner loop: If we attempt a Runge-Kutta step with too large a step size, the approximation error is too large and we need to redo the step with a smaller step size until the error is below our tolerance. This means that threads that need very few iterations to find an appropriate step size have to wait for threads that need more iterations.
Outer loop: This reflects how many successful Runge-Kutta steps we need before the trajectory is completed. Different trajectories reach their target in different numbers of steps. We always have to wait for the trajectory with the most iterations before we are completely done.
Proposed parallel approach:
We notice that every iteration consists of doing one Runge-Kutta step. The branching comes from the fact that we either need to reduce the step size for the next iteration, or update the Runge-Kutta coefficients (e.g. position/velocity) if the step size was OK. I therefore propose that we replace the two while loops with one for loop. The first step inside the for loop is to solve one Runge-Kutta step, followed by an if-statement that checks whether the step size was small enough: if it was, we update the positions (and check for overall convergence), otherwise we halve the step size.
All threads now solve only one Runge-Kutta step at a time, and we trade the idle waiting (all threads waiting for the thread that needs the most attempts to find a valid step size) for the branch divergence of a single if-statement. In my case, solving a Runge-Kutta step is expensive compared with evaluating this if-statement, so this is an improvement. The issue now lies in setting an appropriate limit on the for loop and flagging the threads that need more work. This limit sets an upper bound on the longest time a finished thread has to wait for the others. Pseudo-code:
int N_trajectories = 1000000;
int trajectoryStepsPerKernel = 50;
thrust::device_vector<int> isConverged(N_trajectories, 0); // Mark all trajectories as unconverged
int tpb = 64;
int bpg = (N_trajectories + tpb - 1) / tpb;

// Run until all trajectories are converged
while ( vectorSum(isConverged) != N_trajectories )
{
    // Pass a raw pointer; a thrust::device_vector cannot be passed to a kernel directly
    RungeKuttaKernel<<<bpg, tpb>>>( trajectoryStepsPerKernel,
                                    thrust::raw_pointer_cast(isConverged.data()), ... );
    cudaDeviceSynchronize();
}

__global__ void RungeKuttaKernel( int trajectoryStepsPerKernel, int* isConverged, ... )
{
    int idx = ...;
    if ( isConverged[idx] )        // This trajectory finished in an earlier launch
        return;

    // Set start coordinates
    int converged = 0;
    double x = x_init[idx];
    ...

    for ( int i = 0; i < trajectoryStepsPerKernel; ++i )
    {
        double error = doRungeKutta( x_new, y_new, ... );
        if ( error > tolerance )
        {
            stepsize *= 0.5;       // Step rejected: retry with a smaller step size
        }
        else
        {
            converged = checkConvergence( x, x_new, y, y_new, ... );
            x = x_new;
            y = y_new;
            ...
            if ( converged )       // Do not integrate past the target
                break;
        }
    }

    // Save the current state in case we need to continue this trajectory in a later launch
    isConverged[idx] = converged;
    x_init[idx] = x;
    y_init[idx] = y;
    ...
}
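The helper vectorSum used in the host loop above is not shown; a minimal sketch, assuming isConverged is the thrust::device_vector<int> from the pseudo-code, could simply be a thrust::reduce over the flags:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Hypothetical helper: sums the 0/1 convergence flags on the device,
// i.e. counts how many trajectories have already converged.
int vectorSum( const thrust::device_vector<int>& flags )
{
    return thrust::reduce( flags.begin(), flags.end(), 0 );
}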

Related

PID Control: Is adding a delay before the next loop a good idea?

I am implementing PID control in C++ to make a differential-drive robot turn an accurate number of degrees, but I am having many issues.
Exiting control loop early due to fast loop runtime
If the robot measures its error to be less than 0.5 degrees, it exits the control loop and considers the turn "finished" (the 0.5 is an arbitrary value that I might change at some point). It appears that the control loop is running so quickly that the robot can turn at a very high speed, turn past the setpoint, and exit the loop/cut motor powers, because it was at the setpoint for a short instant. I know that this is the entire purpose of PID control, to accurately reach the setpoint without overshooting, but this problem is making it very difficult to tune the PID constants. For example, I try to find a value of kp such that there is steady oscillation, but there is never any oscillation because the robot thinks it has "finished" once it passes the setpoint. To fix this, I have implemented a system where the robot has to be at the setpoint for a certain period of time before exiting, and this has been effective, allowing oscillation to occur, but the issue of exiting the loop early seems like an unusual problem and my solution may be incorrect.
D term has no effect due to fast runtime
Once I had the robot oscillating in a controlled manner using only P, I tried to add D to prevent overshoot. However, this had no effect most of the time, because the control loop runs so quickly that in 19 loops out of 20 the rate of change of error is 0: the robot did not move, or did not move enough for it to be measured, in that time. I printed the change in error and the derivative term each loop to confirm this, and I could see that both would be 0 for around 20 loop cycles before taking a reasonable value, then back to 0 for another 20 cycles. Like I said, I think this is because the loop cycles are so quick that the robot literally hasn't moved enough for any noticeable change in error. This was a big problem because it meant that the D term had essentially no effect on robot movement, since it was almost always 0. To fix this, I tried using the last non-zero value of the derivative in place of any 0 values, but this didn't work well, and the robot would oscillate randomly if the last derivative didn't represent the current rate of change of error.
Note: I am also using a small feedforward for the static coefficient of friction, and I call this feedforward "f"
Should I add a delay?
I think the source of both of these issues is the loop running very quickly, so something I thought of was adding a wait statement at the end of the loop. However, intentionally slowing down a loop seems like a bad solution overall. Is this a good idea?
void turnHeading(double finalAngle, double kp, double ki, double kd, double f)
{
    std::clock_t timer;
    timer = std::clock();
    double pastTime = 0;
    double currentTime = ((std::clock() - timer) / (double)CLOCKS_PER_SEC);
    const double initialHeading = getHeading();
    finalAngle = angleWrapDeg(finalAngle);
    const double initialAngleDiff = initialHeading - finalAngle;
    double error = angleDiff(getHeading(), finalAngle);
    double pastError = error;
    double firstTimeAtSetpoint = 0;
    double timeAtSetPoint = 0;
    bool atSetpoint = false;
    double integral = 0;
    double lastNonZeroD = 0;

    while (timeAtSetPoint < .05)
    {
        updatePos(encoderL.read(), encoderR.read());
        error = angleDiff(getHeading(), finalAngle);
        currentTime = ((std::clock() - timer) / (double)CLOCKS_PER_SEC);
        double dt = currentTime - pastTime;

        double proportional = error / fabs(initialAngleDiff);
        integral += dt * ((error + pastError) / 2.0);
        double derivative = (error - pastError) / dt;

        //FAILED METHOD OF USING LAST NON-0 VALUE OF DERIVATIVE
        // if(epsilonEquals(derivative, 0))
        // {
        //     derivative = lastNonZeroD;
        // }
        // else
        // {
        //     lastNonZeroD = derivative;
        // }

        double power = kp * proportional + ki * integral + kd * derivative;
        if (power > 0)
        {
            setMotorPowers(-power - f, power + f);
        }
        else
        {
            setMotorPowers(-power + f, power - f);
        }

        if (fabs(error) < 2)
        {
            if (!atSetpoint)
            {
                atSetpoint = true;
                firstTimeAtSetpoint = currentTime;
            }
            else //at setpoint
            {
                timeAtSetPoint = currentTime - firstTimeAtSetpoint;
            }
        }
        else //no longer at setpoint
        {
            atSetpoint = false;
            timeAtSetPoint = 0;
        }

        pastTime = currentTime;
        pastError = error;
    }
    setMotorPowers(0, 0);
}
turnHeading(90, .37, 0, .00004, .12);
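Not an answer to whether a delay is the right fix, but for reference, a minimal sketch of how a fixed loop period could be enforced with std::chrono/std::this_thread instead of a bare wait. The 20 ms period and the simplified P term (plain kp * error, not the normalized error from the question) are assumptions; getHeading(), angleDiff() and setMotorPowers() are the question's own functions:

#include <chrono>
#include <thread>
#include <cmath>

// Declarations for the question's own helpers (assumed to exist elsewhere).
double getHeading();
double angleDiff(double current, double target);
void setMotorPowers(double left, double right);

void turnHeadingFixedPeriod(double finalAngle, double kp, double ki, double kd, double f)
{
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::milliseconds(20); // arbitrary fixed period
    const double dt = 0.020;                           // seconds, matches 'period'

    double pastError = angleDiff(getHeading(), finalAngle);
    double integral = 0;
    int cyclesAtSetpoint = 0;

    while (cyclesAtSetpoint * dt < 0.05)               // same 50 ms dwell idea as the question
    {
        const auto loopStart = clock::now();

        double error = angleDiff(getHeading(), finalAngle);
        integral += dt * (error + pastError) / 2.0;
        double derivative = (error - pastError) / dt;  // dt is now constant and non-zero
        double power = kp * error + ki * integral + kd * derivative;

        if (power > 0)
            setMotorPowers(-power - f, power + f);
        else
            setMotorPowers(-power + f, power - f);

        cyclesAtSetpoint = (std::fabs(error) < 2) ? cyclesAtSetpoint + 1 : 0;
        pastError = error;

        std::this_thread::sleep_until(loopStart + period); // enforce the fixed period
    }
    setMotorPowers(0, 0);
}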

OpenMP: parallel for doesn't do anything

I'm trying to make a parallel version of the SIFT algorithm in OpenCV.
In particular in sift.cpp:
static void calcDescriptors(const std::vector<Mat>& gpyr, const std::vector<KeyPoint>& keypoints,
                            Mat& descriptors, int nOctaveLayers, int firstOctave )
{
    ...
    #pragma omp parallel for
    for( size_t i = 0; i < keypoints.size(); i++ )
    {
        ...
        calcSIFTDescriptor(img, ptf, angle, size*0.5f, d, n, descriptors.ptr<float>((int)i));
        ...
    }
This already gives a speed-up from 84 ms to 52 ms on a quad-core machine. It doesn't scale that well, but it's already a good result for adding one line of code.
Most of the computation inside the loop is performed by calcSIFTDescriptor(), which takes about 100 us on average. So most of the total time comes from the very large number of times calcSIFTDescriptor() is called (thousands of times); accumulating all these 100 us adds up to several ms.
Anyway, I'm trying to optimize the performance of calcSIFTDescriptor(). In particular, its code is divided between two for loops, and the following one takes about 60 us on average:
for( k = 0; k < len; k++ )
{
    float rbin = RBin[k], cbin = CBin[k];
    float obin = (Ori[k] - ori)*bins_per_rad;
    float mag = Mag[k]*W[k];
    int r0 = cvFloor( rbin );
    int c0 = cvFloor( cbin );
    int o0 = cvFloor( obin );
    rbin -= r0;
    cbin -= c0;
    obin -= o0;
    if( o0 < 0 )
        o0 += n;
    if( o0 >= n )
        o0 -= n;

    // histogram update using tri-linear interpolation
    float v_r1 = mag*rbin, v_r0 = mag - v_r1;
    float v_rc11 = v_r1*cbin, v_rc10 = v_r1 - v_rc11;
    float v_rc01 = v_r0*cbin, v_rc00 = v_r0 - v_rc01;
    float v_rco111 = v_rc11*obin, v_rco110 = v_rc11 - v_rco111;
    float v_rco101 = v_rc10*obin, v_rco100 = v_rc10 - v_rco101;
    float v_rco011 = v_rc01*obin, v_rco010 = v_rc01 - v_rco011;
    float v_rco001 = v_rc00*obin, v_rco000 = v_rc00 - v_rco001;
    int idx = ((r0+1)*(d+2) + c0+1)*(n+2) + o0;
    hist[idx] += v_rco000;
    hist[idx+1] += v_rco001;
    hist[idx+(n+2)] += v_rco010;
    hist[idx+(n+3)] += v_rco011;
    hist[idx+(d+2)*(n+2)] += v_rco100;
    hist[idx+(d+2)*(n+2)+1] += v_rco101;
    hist[idx+(d+3)*(n+2)] += v_rco110;
    hist[idx+(d+3)*(n+2)+1] += v_rco111;
}
So I tried to add #pragma omp parallel for private(k) before it, and something weird happens: nothing at all!
Introducing this parallel for makes the computation take 53 ms on average (against the previous 52 ms). I would have expected one or more of the following results:
Taking >52 ms because of the overhead of a new parallel for
Taking <52 ms because of the gain obtained from the parallel for
Some sort of inconsistency in the result, since, as you can see, the shared vector hist is updated concurrently. None of this happens: the result is still correct, and no atomic or critical constructs are used.
I'm an OpenMP newbie, but from what I see it is as if this inner parallel for were simply ignored. Why does this happen?
NOTE: all reported times are averages over 10,000 runs with the same input.
UPDATE:
I tried removing the first parallel for, leaving only the one in calcSIFTDescriptor, and what I expected happened: inconsistency appeared due to the lack of any thread-safety mechanism. Adding #pragma omp critical(dataupdate) before updating hist restored consistency, but now performance is horrible: 245 ms on average.
I think this is because of the overhead of the parallel for in calcSIFTDescriptor, which is not worth it for parallelizing 30 us of work.
BUT THE QUESTION STILL REMAINS: why didn't the first version (with the two parallel fors) produce any change (in either performance or consistency)?
I found the answer myself: the second (nested) parallel for has no effect, for the reason described here:
OpenMP parallel regions can be nested inside each other. If nested
parallelism is disabled, then the new team created by a thread
encountering a parallel construct inside a parallel region consists
only of the encountering thread. If nested parallelism is enabled,
then the new team may consist of more than one thread.
So since the first parallel for takes all the available threads, the second one gets a team consisting only of the encountering thread itself, and nothing happens.
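A minimal sketch that makes the quoted behaviour visible (my own example, not code from sift.cpp): with nested parallelism left disabled, each inner region reports a team of one thread; calling omp_set_nested(1) (or setting OMP_NESTED=true) changes that:

#include <omp.h>
#include <cstdio>

int main()
{
    // omp_set_nested(1);  // uncomment to give the inner region a full team
    #pragma omp parallel num_threads(4)
    {
        int outer_id = omp_get_thread_num();
        #pragma omp parallel num_threads(4)
        {
            // With nesting disabled this prints "inner team has 1 thread(s)"
            // once per outer thread.
            #pragma omp single
            printf("outer thread %d: inner team has %d thread(s)\n",
                   outer_id, omp_get_num_threads());
        }
    }
    return 0;
}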
Cheers to myself!

Generating Random Numbers with CUDA via rejection method. Performance problems

I'm running a Monte Carlo code for particle simulation, written in CUDA. Basically, in each step I calculate the velocity of each particle and update its position. The velocity is directly proportional to the path length. For a given material, the path length has a certain distribution, and I know its probability density function. I now try to sample random numbers according to this function via the rejection method. I would describe my CUDA knowledge as limited. I understood that it is preferable to create large chunks of random numbers at once instead of many small chunks. However, for the rejection method I generate only two random numbers, check a certain condition, and repeat the procedure if it fails. Therefore I generate my random numbers in the kernel.
Using the profiler (nvvp) I noticed that basically 50% of my time is spent in the rejection method.
Here is my question: are there any ways to optimize the rejection method?
I appreciate every answer.
CODE
Here is the rejection method.
__global__ void rejectSamplePathlength(float* P, curandState* globalState,
                                       int numParticles, float sigma, int timestep, curandState state)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles) {
        bool success = false;
        float p;
        float rho1, rho2;
        float a, b;
        a = 0.0;
        b = 10.0;
        curand_init(i, 0, 0, &state);
        while (!success) {
            rho1 = curand_uniform(&globalState[i]);
            rho2 = curand_uniform(&globalState[i]);
            if (rho2 < pathlength(a, b, rho1, sigma)) {
                p = a + rho1 * (b - a);
                success = true;
            }
        }
        P[i] = abs(p);
    }
}
The pathlength function in the if statement computes a value y = f(x) on the device.
I'm pretty sure that curand_init is problematic in terms of time, but without this statement, wouldn't each kernel generate the same numbers?
Maybe you could create a pool of randomly generated uniform variables in a previous kernel, then pick your uniforms from that pool, cycling over it. But it would have to be large enough to avoid an infinite loop.
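On the curand_init concern: a common cuRAND pattern is to initialize one state per thread once, in a separate setup kernel, and then let every later sampling kernel reuse and advance globalState instead of calling curand_init on each launch. A minimal sketch, assuming pathlength() is the question's own __device__ function:

#include <curand_kernel.h>

// From the question; assumed to be a __device__ function.
__device__ float pathlength(float a, float b, float x, float sigma);

// Initialize one curandState per thread, once, before the simulation starts.
__global__ void setupStates(curandState* globalState, unsigned long long seed, int numParticles)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numParticles)
        curand_init(seed, i, 0, &globalState[i]); // same seed, different sequence per thread
}

// Sampling kernel: no curand_init here, just reuse and advance the stored state.
__global__ void samplePathlength(float* P, curandState* globalState, int numParticles, float sigma)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= numParticles) return;

    curandState localState = globalState[i]; // work on a register copy
    const float a = 0.0f, b = 10.0f;
    float p = 0.0f;
    bool success = false;
    while (!success) {
        float rho1 = curand_uniform(&localState);
        float rho2 = curand_uniform(&localState);
        if (rho2 < pathlength(a, b, rho1, sigma)) {
            p = a + rho1 * (b - a);
            success = true;
        }
    }
    globalState[i] = localState;  // save the advanced state for the next launch
    P[i] = fabsf(p);
}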

GPU for loops: avoid warp divergence & implicit syncthreads

My situation: each thread in a warp operates on its own completely independent & distinct data array. All threads loop over their data array. The number of loop iterations is different for each thread. (This incurs a cost, I know).
Within the for loop, each thread needs to save the maximum value after calculating three floats. After the for loop, threads in the warp will "communicate" by checking the maximum value calculated by their "neighboring thread" in the warp (determined by parity).
Questions:
If I avoid the conditionals in a "max" operation by doing multiplication, this will avoid warp divergence, right? (see example code below)
The extra multiplication operations mentioned in (1.) are worth it, right? - i.e. far faster than any sort of warp divergence.
The same mechanism that causes warp divergence (one set of instructions for all threads) can be exploited as an implicit "thread barrier" (for the warp) at the end of the for loop (much the same way as with an "#pragma omp for" statement in non-GPU computing). Thus I don't need to make a "syncthreads" call for a warp after the for loop before one thread checks the value saved by another thread, right? (This would be because "syncthreads" is only for the "entire GPU", i.e. inter-warp and inter-MP, right?)
example code:
__shared__ int * N_per_data;   // loaded from host
__shared__ float ** data;      // loaded from host
data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    data[j] = new float[N_per_data[j]];
// the values of the jagged matrix "data" are loaded from host.

__shared__ float **max_data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    max_data[j] = new float[N_per_data[j]];

for (uint j = 0; j < N_per_data[threadIdx.x]; ++j)
{
    const float a = f(data[threadIdx.x][j]);
    const float b = g(data[threadIdx.x][j]);
    const float c = h(data[threadIdx.x][j]);

    const int cond_a = (a > b) && (a > c);
    const int cond_b = (b > a) && (b > c);
    const int cond_c = (c > a) && (c > b);

    // avoid if-statements. question (1) and (2)
    max_data[threadIdx.x][j] = cond_a * a + cond_b * b + cond_c * c;
}

// Question (3):
// No "syncthreads" necessary in next line:
// access data of your mate at some magic positions (assume it exists):
float my_neighbors_max_at_7 = max_data[threadIdx.x + pow(-1, (threadIdx.x % 2) == 1)][7];
Before implementing my algorithm on a GPU, I am investigating every aspect of the algorithm to ensure that it will be worth the implementation effort. So please bear with me.
Yes
My guess would be NO - depends on how you would write the other version with the ifs.
The compiler will probably use predicates to mask out the unwanted writes, in which case there would be no real thread divergence, just a few executed but masked out write instructions.
You should let the compiler do its magic and compare the decompiled code for both versions to determine which is the better solution.
In your particular case of calculating a maximum, d = a > b ? a : b translates to a single PTX ISA instruction (max.s32 for signed integers, max.f32 for your floats), so there is really no need to make it as complicated as you did: just compute the maximum into a temporary variable and do one unconditional write.
Yes, but the syncthreads barrier is an intra-block barrier, not inter-block, and certainly not inter-MP.
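Following the max suggestion above, a minimal sketch of the loop body for the float case (f, g, h, data and max_data are the names from the question): fmaxf computes the maximum without any if-statements or multiply-by-condition arithmetic.

for (uint j = 0; j < N_per_data[threadIdx.x]; ++j)
{
    const float a = f(data[threadIdx.x][j]);
    const float b = g(data[threadIdx.x][j]);
    const float c = h(data[threadIdx.x][j]);

    // One unconditional write of the maximum; no divergence, no cond_* tricks.
    max_data[threadIdx.x][j] = fmaxf(a, fmaxf(b, c));
}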

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods perform different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;
    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }
    noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: does anyone know of any tutorial or have any sample code (programming language doesn't matter)?
UPDATE GPU Version
__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x);  // Each thread of each block will
    float y;                                          // compute one (or more) floats
    int noOfOccurrences = 0;
    int a;

    while( x < size )              // While there is work to do, each thread will:
    {
        dictionary[x] = 0;         // Initialize the position it will work on
        noOfOccurrences = 0;

        for(int j = 0; j < largeFloatingPointArraySize; j++)  // Search for floats
        {                                                     // that are equal to it
            y = largeFloatingPointArray[j];  // Take a candidate from the floats array
            y *= 10000;                      // e.g. if y = 0.0001f;
            a = y + 0.5;                     // a = 1 + 0.5 = 1;
            if (a == x) noOfOccurrences++;
        }

        dictionary[x] += noOfOccurrences;    // Update in the dictionary the number
                                             // of times that the float appears

        x += blockDim.x * gridDim.x;         // Move to the next position this thread will work on
    }
}
I have only tested this for smaller inputs, because I am testing on my laptop. Nevertheless, it is working, but more tests are needed.
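For completeness, a minimal host-side sketch of how the kernel above could be launched (my assumption, not part of the original answer): the dictionary gets one int slot per 0.0001f step between 0f and 100f.

#include <cuda_runtime.h>

// Hypothetical host-side driver for the hash kernel above (not the poster's code).
void runHash(const float* h_data, int n, int* h_dict /* 100 * 10000 ints */)
{
    const int dictSize = 100 * 10000;          // one bin per 0.0001f step in [0f, 100f)
    float* d_data = 0;
    int*   d_dict = 0;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_dict, dictSize * sizeof(int));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    const int tpb = 256;
    const int bpg = (dictSize + tpb - 1) / tpb;
    // The kernel zeroes each dictionary slot itself, so no cudaMemset is needed here.
    hash<<<bpg, tpb>>>(d_data, n, d_dict, dictSize, bpg);
    cudaDeviceSynchronize();

    cudaMemcpy(h_dict, d_dict, dictSize * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_dict);
}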
UPDATE Sequential Version
I just did this naive version, which executes your algorithm for an array with 30,000,000 elements in less than 20 seconds (including the time taken by the function that generates the data).
This naive version first sorts your array of floats. Afterward, it goes through the sorted array, counts the number of times a given value appears, and then puts that value in a dictionary along with the number of times it appeared.
You can use a sorted map instead of the unordered_map that I used.
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>

typedef std::tr1::unordered_map<float, int> Mymap;

void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;
    for(long int i = 0; i < size; i++)
        data[i] = LO + (float)rand() / ((float)RAND_MAX / (HI - LO));
}

void print_array(float *data, long int size)
{
    for(long int i = 0; i < size; i++)
        printf("%f\n", data[i]);
}

std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    std::tr1::unordered_map<float, int> dict;
    for(long int i = 1; i < size; i++)
    {
        if(previous == data[i])
            count++;
        else
        {
            dict.insert(Mymap::value_type(previous, count));
            previous = data[i];
            count = 1;
        }
    }
    dict.insert(Mymap::value_type(previous, count)); // add the last member
    return dict;
}

void printMAP(std::tr1::unordered_map<float, int> dict)
{
    for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
    {
        std::cout << "key(float): " << i->first << ", value(int): " << i->second << std::endl;
    }
}

int main(int argc, char** argv)
{
    int size = 1000000;
    if(argc > 1) size = atoi(argv[1]);
    printf("Size = %d\n", size);

    using namespace std;

    // Allocate on the heap: 30,000,000 floats would overflow the stack
    float *data = new float[size];
    std::tr1::unordered_map<float, int> dict;

    generator(data, size);
    sort(data, data + size);
    dict = fill_dict(data, size);

    delete[] data;
    return 0;
}
If you have the Thrust library installed on your machine, you should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
For sure it will be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran an entire molecular dynamics simulation on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameters 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (~22 MB).
So if you are planning to put 30 million floating point values on the GPU, which is about 114 MB of global memory, that will not be a problem.
Could the number of calculations be an issue in your case? Based on my experience with molecular dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours; based on your code example, it also looks lighter than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
double *x, double *y, double *z,...){
int pos = (threadIdx.x + blockIdx.x * blockDim.x);
...
while(pos < particles)
{
for (i = 0; i < particles; i++)
{
if(//inside of the same radius)
{
// calculate force
}
}
pos += blockDim.x * gridDim.x;
}
}
A simple example of CUDA code could be the sum of two arrays.
In C:
for(int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
In CUDA:
__global__ void add(int *c, int *a, int *b, int N)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    for(; pos < N; pos += blockDim.x * gridDim.x)
        c[pos] = a[pos] + b[pos];
}
In CUDA you basically take each for-loop iteration and assign it to a thread. The expression
threadIdx.x + blockIdx.x * blockDim.x
gives the loop iteration that each thread will compute, based on its own ID and the ID of the block the thread is in. Each block has an ID from 0 to N-1 (N being the maximum number of blocks), each block has X threads with IDs from 0 to X-1, and blockDim.x is the number of threads per block.
So if you have 2 blocks, each with 10 threads, and N = 40, then:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary)
{
    // You can turn the dictionary into one array of int,
    // where each position represents a float value.
    // Since x = 0f; x < 100f; x += 0.0001f,
    // you can associate each x with a different position in the dictionary:
    //   pos 0 has the same meaning as 0f,
    //   pos 1 means float 0.0001f,
    //   pos 2 means float 0.0002f, etc.
    // Then you use the int at each position
    // to count how many times that "float" has appeared.

    int x = blockIdx.x;    // Each block takes a different x to work on
    float y;

    while( x < 1000000 )   // x < 100f (for an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = converting_int_to_float(x); // This function converts x to the
                                              // float it represents (x * 0.0001f)

        // Each thread of the block takes its own slice of largeFloatingPointArray
        for( int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x )
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }

        // Every thread adds its partial count for this x
        atomicAdd(&dictionary[x], noOfOccurrences);

        x += gridDim.x;    // Move on to the next x handled by this block
    }
}
You have to use atomicAdd because all the threads of a block add their partial counts to the same dictionary[x] entry concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks (as the kernel in the GPU-version update above does).
Tutorials
The Dr. Dobb's Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
And others:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look at the last item; you will find many links for learning CUDA.
OpenCL: OpenCL Tutorials | MacResearch
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating point number repetitively like that; the rounding error will add up and you won't get what you intended. I've added an if statement to my sample below to check whether the inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single-pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
foreach (float x in largeFloatingPointArray)
{
    if (Math.Truncate(x / 0.0001f) * 0.0001f == x)
    {
        if (noOfNumbers.ContainsKey(x))
            noOfNumbers[x] = noOfNumbers[x] + 1;
        else
            noOfNumbers.Add(x, 1);
    }
}
Hope this helps.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely YES: this kind of algorithm is typically an ideal candidate for massive data-parallel processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives: CUDA or OpenCL.
CUDA is mature, with a lot of tools, but is NVIDIA-GPU centric.
OpenCL is a standard that runs on NVIDIA and AMD GPUs, and on CPUs too, so you should really favour it.
For tutorials, there is an excellent series on CodeProject by Rob Farber: http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use case, there are a lot of samples for histogram building with OpenCL (note that many are image histograms, but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.
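The same block-partitioning idea, sketched in CUDA terms since that is what most of this thread uses (histKernel, the chunk size and the pre-zeroed d_dict are all assumptions of this sketch, not code from the answer above):

#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical kernel: one thread per element of the current chunk,
// accumulating into a shared histogram with one bin per 0.0001f step.
__global__ void histKernel(const float* chunk, int chunkLen, int* dict)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < chunkLen) {
        int bin = (int)(chunk[i] * 10000.0f + 0.5f);   // 0.0001f resolution
        if (bin >= 0 && bin <= 100 * 10000)
            atomicAdd(&dict[bin], 1);
    }
}

// Stream the large host array through a fixed-size device buffer.
// d_dict must hold 1 + 100*10000 ints and be zeroed (cudaMemset) beforehand.
void histogramInChunks(const float* h_data, long n, int* d_dict)
{
    const long chunkCap = 1 << 22;                     // ~16 MB of floats per transfer
    float* d_chunk = 0;
    cudaMalloc(&d_chunk, chunkCap * sizeof(float));

    for (long off = 0; off < n; off += chunkCap) {
        int len = (int)std::min(chunkCap, n - off);
        cudaMemcpy(d_chunk, h_data + off, len * sizeof(float), cudaMemcpyHostToDevice);
        histKernel<<<(len + 255) / 256, 256>>>(d_chunk, len, d_dict);
    }
    cudaDeviceSynchronize();
    cudaFree(d_chunk);
}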
In addition to the suggestion by the above poster, use the TPL (Task Parallel Library) where appropriate to run in parallel on multiple cores.
The example above could use Parallel.ForEach and ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks that each generate a dictionary and are then reduced to a single dictionary, would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
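A minimal C++ sketch of that chunk-then-merge idea (using std::thread and std::unordered_map rather than the C# TPL mentioned above; all names here are my own):

#include <thread>
#include <vector>
#include <unordered_map>
#include <algorithm>

std::unordered_map<float, int> countOccurrences(const std::vector<float>& data, int numThreads)
{
    std::vector<std::unordered_map<float, int>> partial(numThreads);
    std::vector<std::thread> workers;
    const size_t chunk = (data.size() + numThreads - 1) / numThreads;

    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t] {
            const size_t begin = t * chunk;
            const size_t end = std::min(data.size(), begin + chunk);
            for (size_t i = begin; i < end; ++i)
                ++partial[t][data[i]];             // map step: private histogram per thread
        });
    }
    for (auto& w : workers) w.join();

    std::unordered_map<float, int> result;         // reduce step: merge the partial histograms
    for (const auto& p : partial)
        for (const auto& kv : p)
            result[kv.first] += kv.second;
    return result;
}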
I am not sure whether using GPUs would be a good match, given that the 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.
I think turning this single-process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending on how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use the floating point value as the key and the number of occurrences in the array as the value. The worst-case scenario is that each value occurs only once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run, then an in-memory hash table makes sense. If it is static, the table could be saved in a key-value database such as Berkeley DB. Let's call this the 'lookup' system.
On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, then 'gather' the results as they become available. E.g. a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back an array of occurrences and is ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If the lookup system becomes a bottleneck, it can be replicated on as many systems as needed.
With a large enough number of systems working in parallel, it should be possible to reduce the processing time to minutes.
I am working on a compiler for parallel programming in C, targeted at many-core systems (often referred to as microservers) that are, or will be, built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication (IPC) across systems. One of the IPC mechanisms available is sockets/TCP/IP.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
#include <stdio.h>
#include <stdlib.h>

/*
 * Convert the X range from 0f to 100f in steps of 0.0001f
 * into a range of integers 0 to (100 * 10000) to use as an
 * index into an array.
 */
#define X_MAX (1 + (100 * 10000))

/*
 * Number of floats in largeFloatingPointArray needs to be defined
 * below to be whatever your value is.
 */
#define LARGE_ARRAY_MAX (1000)

int main(void)
{
    int j, y, *noOfOccurances;
    float *largeFloatingPointArray;

    /*
     * Allocate memory for largeFloatingPointArray and populate it.
     */
    largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
    if (largeFloatingPointArray == 0) {
        printf("out of memory\n");
        exit(1);
    }

    /*
     * Allocate memory to hold noOfOccurances. The index/10000 is the
     * floating point number. The contents is the count.
     *
     * E.g. noOfOccurances[12345] = 20 means 1.2345f occurs 20 times
     * in largeFloatingPointArray.
     */
    noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
    if (noOfOccurances == 0) {
        printf("out of memory\n");
        exit(1);
    }

    for (j = 0; j < LARGE_ARRAY_MAX; j++) {
        y = (int)(largeFloatingPointArray[j] * 10000);
        if (y >= 0 && y < X_MAX) {
            noOfOccurances[y]++;
        }
    }

    return 0;
}