Co-occurrence matrix creation with cuda - c++

//This is my kernel function
__global__ void createSCM(Pixel*pixelMat, //image
int imgRows, //image dimensions
int imgCols,
int*matrizSCM, //Coocurrence matrix
int numNiveles, //coocurrence matrix levels = 256
int delta_R, //value = {-1,0 or 1}
int delta_C) //value = {-1,0 or 1}
{
int i = blockIdx.y*blockDim.y+threadIdx.y;
int j = blockIdx.x*blockDim.x+threadIdx.x;
int cols = numNiveles;
int posx,posy;
if ( (j + delta_C) < imgCols && (i + delta_R) < imgRows &&
((j + delta_C) >= 0) && ((i + delta_R) >= 0) )
{
posx = pixelMat[i*imgCols+j].channel_0;
posy = pixelMat[(i + delta_R)*imgCols+(j + delta_C)].channel_0;
matrizSCM[posx*cols+posy]++;
matrizSCM[posy*cols+posx]++;
}
}
struct Pixel {
int channel_0;
};
I have counting errors in the coocurrence matrix, because
pixelMat[i*imgCols+j] and pixelMat[(i + delta_R)*imgCols+(j + delta_C)]
are accessing to different positions with the same thread.
This is my kernel call
int Grid_Dim_x=imagenTest.rows, Grid_Dim_y=imagenTest.cols;
int Block_Dim_x=1, Block_Dim_y=1;
dim3 Grid(Grid_Dim_x, Grid_Dim_y);
dim3 Block(Block_Dim_x,Block_Dim_x);
createSCM<<<Grid,Block>>>(...)
There is just one thread on each block, and each block represents a pixel
is there a nice solution to this problem?
Thanks :)

Reading from different memory cells of immutable input incurs no parallel hazard that you would have to deal with. The problem lies within the matrizSCM where the same memory cell can be incremented by multiple threads at once.
An atomicAdd(addr,1) is a quick fix --- it should make the algorithm correct, but it may be fairly slow. Making it correct should be the first step; then you can look on available examples on the web of histogram computation and parallel reduction algorithm and check if it can be applied to your problem.
Finally, as Robert pointed out in the comment, launching just one thread in a block is very inefficient. You need a multiple of 32 to utilize the hardware SIMD unit, and usually about 256 threads to hide various memory latencies.
Also, if your image is big and you still need thousands of 256-thread blocks, you may consider launching less blocks (around 60-120) but having each block process multiple pixels sequentially. If you do that, you might be able to put a copy of matrixSCM in shared memory. This will make a separate copy of matrixSCM for each block, resulting in less atomic conflicts between the blocks. Obviously, at the end of the kernel, your block will still need to "submit" the partial result into the global one, but that would be a single step operation.

Related

CUDA C++ shared memory and if-condition

i have a question i couldnt find an answer to myself, and i was hoping some of you could offer me some insight regarding a possible solution. Within a kernel call, i would like to insert an if-condition regarding access to shared memory.
__global__ void GridFillGPU (int * gridGLOB, int n) {
__shared__ int grid[SIZE] // ... initialized to zero
int tid = threadIdx.x
if (tid < n) {
for ( int k = 0; k < SIZE; k++) {
if (grid[k] == 0) {
grid[k] = tid+1;
break;
}
}
}
//... here write grid to global memory gridGLOB
}
The idea is that, if the element grid[k] has already been written by one thread (with the index tid), it should not be written by another one. My question is: can this even be done in parallel ? Since all parallel threads perform the same for-loop, how can i be sure that the if-condition is evaluated correctly? I am guessing this will lead to certain race-conditions. I am quite new to Cuda, so i hope this question is not stupid. I know that grid needs to be in shared memory, and that one should avoid if-statements, but i find no other way around at the moment.
I am thankful for any help
EDIT: here is the explicit version, which explains why the array is called grid
__global__ void GridFillGPU (int * pos, int * gridGLOB, int n) {
__shared__ int grid[SIZE*7] // ... initialized to zero
int tid = threadIdx.x
if (tid < n) {
int jmin = pos[tid] - 3;
int jmax = pos[tid] + 3;
for ( int j = jmin; j <= jmax; j++ {
for ( int k = 0; k < SIZE; k++) {
if (grid[(j-jmin)*SIZE + k] == 0) {
grid[(j-jmin)*SIZE + k] = tid+1;
break;
}
}
}
} //... here write grid to global memory gridGLOB
}
You should model you problem in a way you don't need to worry about "if has been written already", also because cuda offers no guarantee in the order in which thread will be executed, so the order might not be the way you excpect.
There are some minor things that cuda ensure you order wise within a warp but that is not the case.
There are sync barries and stuff you can use but I don't think is your case.
if you are processing a grid you should model that in a way that each thread has its own region of memory is going to work on. and that should not overlap with other thread region (at least in writing, in reading you can go outside boundaries). Also I would not worry about shared memory, make the algorithm works first, then think about optimization like load a tile in shared memory using the warp.
In that case if you want to split your domain in a grid you should setup the kernel, in order to have enough threads as your grid "cells" or pixels if is an image. Then you use the thread and block coordinates that cuda provides you to compute where you should read and write in memory.
There is a really good course on udacity.com about cuda, you might want to have a look at that.
https://www.udacity.com/courses/cs344
There is also another one on coursera.com but I don't know if it is open right now.
Anyway dividing the domain in a grid is a really common and solved problem ,you can find a lot of material on that.

It's slower to calculate integral image using CUDA than CPU code

I am implementing the integral image calculation module using CUDA to improve performance.
But its speed slower than the CPU module.
Please let me know what i did wrong.
cuda kernels and host code follow.
And also, another problem is...
In the kernel SumH, using texture memory is slower than global one, imageTexture was defined as below.
texture<unsigned char, 1> imageTexture;
cudaBindTexture(0, imageTexture, pbImage);
// kernels to scan the image horizontally and vertically.
__global__ void SumH(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rVSpan, int nWidth)
{
int nStartY, nEndY, nIdx;
if (!threadIdx.x)
{
nStartY = 1;
}
else
nStartY = (int)(threadIdx.x * rVSpan);
nEndY = (int)((threadIdx.x + 1) * rVSpan);
for (int i = nStartY; i < nEndY; i ++)
{
for (int j = 1; j < nWidth; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + pbImage[nIdx - nWidth - i] * pbImage[nIdx - nWidth - i];
//pnIntImage[nIdx] = pnIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i);
//pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - 1] + tex1Dfetch(imageTexture, nIdx - nWidth - i) * tex1Dfetch(imageTexture, nIdx - nWidth - i);
}
}
}
__global__ void SumV(unsigned char* pbImage, int* pnIntImage, __int64* pn64SqrIntImage, float rHSpan, int nHeight, int nWidth)
{
int nStartX, nEndX, nIdx;
if (!threadIdx.x)
{
nStartX = 1;
}
else
nStartX = (int)(threadIdx.x * rHSpan);
nEndX = (int)((threadIdx.x + 1) * rHSpan);
for (int i = 1; i < nHeight; i ++)
{
for (int j = nStartX; j < nEndX; j ++)
{
nIdx = i * nWidth + j;
pnIntImage[nIdx] = pnIntImage[nIdx - nWidth] + pnIntImage[nIdx];
pn64SqrIntImage[nIdx] = pn64SqrIntImage[nIdx - nWidth] + pn64SqrIntImage[nIdx];
}
}
}
// host code
int nW = image_width;
int nH = image_height;
unsigned char* pbImage;
int* pnIntImage;
__int64* pn64SqrIntImage;
cudaMallocManaged(&pbImage, nH * nW);
// assign image gray values to pbimage
cudaMallocManaged(&pnIntImage, sizeof(int) * (nH + 1) * (nW + 1));
cudaMallocManaged(&pn64SqrIntImage, sizeof(__int64) * (nH + 1) * (nW + 1));
float rHSpan, rVSpan;
int nHThreadNum, nVThreadNum;
if (nW + 1 <= 1024)
{
rHSpan = 1;
nVThreadNum = nW + 1;
}
else
{
rHSpan = (float)(nW + 1) / 1024;
nVThreadNum = 1024;
}
if (nH + 1 <= 1024)
{
rVSpan = 1;
nHThreadNum = nH + 1;
}
else
{
rVSpan = (float)(nH + 1) / 1024;
nHThreadNum = 1024;
}
SumH<<<1, nHThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rVSpan, nW + 1);
cudaDeviceSynchronize();
SumV<<<1, nVThreadNum>>>(pbImage, pnIntImage, pn64SqrIntImage, rHSpan, nH + 1, nW + 1);
cudaDeviceSynchronize();
Regarding the code that is currently in the question. There are two things I'd like to mention: launch parameters and timing methodology.
1) Launch parameters
When you launch a kernel there are two main arguments that specify the amount of threads you are launching. These are between the <<< and >>> sections, and are the number of blocks in the grid, and the number of threads per block as follows:
foo <<< numBlocks, numThreadsPerBlock >>> (args);
For a single kernel to be efficient on a current GPU you can use the rule of thumb that numBlocks * numThreadsPerBlock should be at least 10,000. Ie. 10,000 pieces of work. This is a rule of thumb, so you may get good results with only 5,000 threads (it varies with GPU: cheaper GPUs can get away with fewer threads), but this is the order of magnitude you need to be looking at as a minimum. You are running 1024 threads. This is almost certainly not enough (Hint: the loops inside your kernel look like scan primatives, these can be done in parallel).
Further to this there are a few other things to consider.
The number of blocks should be large in comparison to the number of SMs on your GPU. A Kepler K40 has 15 SMs, and to avoid a signficant tail effect you'd probably want at least ~100 blocks on this GPU. Other GPUs have fewer SMs, but you haven't specified which you have, so I can't be more specific.
The number of threads per block should not be too small. You can only have so many blocks on each SM, so if your blocks are too small you will use the GPU suboptimally. Furthermore, on newer GPUs up to four warps can receive instructions on a SM simultaneously, and as such is it often a good idea to have block sizes as multiples of 128.
2) Timing
I'm not going to go into so much depth here, but make sure your timing is sane. GPU code tends to have a one-time initialisation delay. If this is within your timing, you will see erroneously large runtimes for codes designed to represent a much larger code. Similarly, data transfer between the CPU and GPU takes time. In a real application you may only do this once for thousands of kernel calls, but in a test application you may do it once per kernel launch.
If you want to get accurate timings you must make your example more representitive of the final code, or you must be sure that you are only timing the regions that will be repeated.
The only way to be sure is to profile the code, but in this case we can probably make a reasonable guess.
You're basically just doing a single scan through some data, and doing extremely minimal processing on each item.
Given how little processing you're doing on each item, the bottleneck when you process the data with the CPU is probably just reading the data from memory.
When you do the processing on the GPU, the data still needs to be read from memory and copied into the GPU's memory. That means we still have to read all the data from main memory, just like if the CPU did the processing. Worse, it all has to be written to the GPU's memory, causing a further slowdown. By the time the GPU even gets to start doing real processing, you've already used up more time than it would have taken the CPU to finish the job.
For Cuda to make sense, you generally need to be doing a lot more processing on each individual data item. In this case, the CPU is probably already nearly idle most of the time, waiting for data from memory. In such a case, the GPU is unlikely to be of much help unless the input data was already in the GPU's memory so the GPU could do the processing without any extra copying.
When working with CUDA there are a few things you should keep in mind.
Copying from host memory to device memory is 'slow' - when you copy some data from the host to the device you should do as much calculations as possible (do all the work) before you copy it back to the host.
On the device there are 3 types of memory - global, shared, local. You can rank them in speed like global < shared < local (local = fastest).
Reading from consecutive memory blocks is faster than random access. When working with array of structures you would like to transpose it to a structure of arrays.
You can always consult the CUDA Visual Profiler to show you the bottleneck of your program.
the above mentioned GTX750 has 512 CUDA cores (these are the same as the shader units, just driven in a /different/ mode).
http://www.nvidia.de/object/geforce-gtx-750-de.html#pdpContent=2
the duty of creating integral images is only partially able to be parallel'ized as any result value in the results array depends on a bigger bunch of it's predecessors. further it is only a tiny math portion per memory transfer so the ALU powers and thus the unavoidable memory transfers might be the bottle neck. such an accelerator might provide some speed up, but not a thrilling speed up because of the duty itself does not allow it.
if you would compute multiple variations of integral images on the same input data you would be able to see the "thrill" much more likely due to much higher parallelism options and a higher amount of math ops. but that would be a different duty then.
as a wild guess from google search - others have already fiddled with those item: https://www.google.de/url?sa=t&rct=j&q=&esrc=s&source=web&cd=11&cad=rja&uact=8&ved=0CD8QFjAKahUKEwjjnoabw8bIAhXFvhQKHbUpA1Y&url=http%3A%2F%2Fdspace.mit.edu%2Fopenaccess-disseminate%2F1721.1%2F71883&usg=AFQjCNHBbOEB_OHAzLZI9__lXO_7FPqdqA

GPU for loops: avoid warp divergence & implicit syncthreads

My situation: each thread in a warp operates on its own completely independent & distinct data array. All threads loop over their data array. The number of loop iterations is different for each thread. (This incurs a cost, I know).
Within the for loop, each thread needs to save the maximum value after calculating three floats. After the for-loop, threads in warp will "communicate" by checking the maximum value calculated by only their "neighboring thread" in the warp (determined by parity).
Questions:
If I avoid the conditionals in a "max" operation by doing multiplication, this will avoid warp divergence, right? (see example code below)
The extra multiplication operations mentioned in (1.) are worth it, right? - i.e. far faster than any sort of warp divergence.
The same mechanism that causes warp divergence (one set of instructions for all threads) can be exploited as an implicit "thread barrier" (for the warp) at the end of the for-loop (much the same way as with an "#pragma omp for" statement in non-gpu computing). Thus I don't need to make a "syncthreads" call for a warp after the for loop before one thread checks the value saved by another thread, right? (This would be because "synthreads" is only for the "entire GPU", i.e. inter-warp and inter-MP, right?)
example code:
__shared__ int N_per_data; // loaded from host
__shared__ float ** data; //loaded from host
data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
data[j] = new float[N_per_data[j]];
// the values of jagged matrix "data" are loaded from host.
__shared__ float **max_data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
max_data[j] = new float[N_per_data[j]];
for (uint j = 0; j < N_per_data[threadIdx.x]; ++j)
{
const float a = f(data[threadIdx.x][j]);
const float b = g(data[threadIdx.x][j]);
const float c = h(data[threadIdx.x][j]);
const int cond_a = (a > b) && (a > c);
const int cond_b = (b > a) && (b > c);
const int cond_c = (c > a) && (c > b);
// avoid if-statements. question (1) and (2)
max_data[threadIdx.x][j] = conda_a * a + cond_b * b + cond_c * c;
}
// Question (3):
// No "syncthreads" necessary in next line:
// access data of your mate at some magic positions (assume it exists):
float my_neighbors_max_at_7 = max_data[threadIdx.x + pow(-1,(threadIdx.x % 2) == 1) ][7];
Before implementing my algorithm on a GPU, I am investigating every aspect of the algorithm to ensure that it will be worth the implementation effort. So please bear with me..
Yes
My guess would be NO - depends on how you would write the other version with the ifs.
The compiler will probably use predicates to mask out the unwanted writes, in which case there would be no real thread divergence, just a few executed but masked out write instructions.
You should let the compiler do it's magic and compare the decompiled code for both versions to determine what is the better solution.
In your particular case of calculating a maximum of signed integer d = a > b ? a : b translates to one PTX ISA instruction max.s32 so there is really no need to make it as complicated as you did... just compute the maximum into a temporary variable and do one unconditional write.
Yes, but the synthreads barrier is an intra-block barrier, not inter-block and certainly not inter-mp.

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods performing different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
for (float x = 0f; x < 100f; x += 0.0001f) {
int noOfOccurrences = 0;
foreach (float y in largeFloatingPointArray) {
if (x == y) {
noOfOccurrences++;
}
}
noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
UPDATE GPU Version
__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
float y; // compute one (or more) floats
int noOfOccurrences = 0;
int a;
while( x < size ) // While there is work to do each thread will:
{
dictionary[x] = 0; // Initialize the position in each it will work
noOfOccurrences = 0;
for(int j = 0 ;j < largeFloatingPointArraySize; j ++) // Search for floats
{ // that are equal
// to it assign float
y = largeFloatingPointArray[j]; // Take a candidate from the floats array
y *= 10000; // e.g if y = 0.0001f;
a = y + 0.5; // a = 1 + 0.5 = 1;
if (a == x) noOfOccurrences++;
}
dictionary[x] += noOfOccurrences; // Update in the dictionary
// the number of times that the float appears
x += blockDim.x * gridDim.x; // Update the position here the thread will work
}
}
This one I just tested for smaller inputs, because I am testing in my laptop. Nevertheless, it is working, but more tests are needed.
UPDATE Sequential Version
I just did this naive version that executes your algorithm for an array with 30,000,000 element in less than 20 seconds (including the time taken by function that generates the data).
This naive version first sorts your array of floats. Afterward, will go through the sorted array and check the number of times a given value appears in the array and then puts this value in a dictionary along with the number of times it has appeared.
You can use sorted map, instead of the unordered_map that I used.
Heres the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>
typedef std::tr1::unordered_map<float, int> Mymap;
void generator(float *data, long int size)
{
float LO = 0.0;
float HI = 100.0;
for(long int i = 0; i < size; i++)
data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}
void print_array(float *data, long int size)
{
for(long int i = 2; i < size; i++)
printf("%f\n",data[i]);
}
std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
float previous = data[0];
int count = 1;
std::tr1::unordered_map<float, int> dict;
for(long int i = 1; i < size; i++)
{
if(previous == data[i])
count++;
else
{
dict.insert(Mymap::value_type(previous,count));
previous = data[i];
count = 1;
}
}
dict.insert(Mymap::value_type(previous,count)); // add the last member
return dict;
}
void printMAP(std::tr1::unordered_map<float, int> dict)
{
for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
{
std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
}
}
int main(int argc, char** argv)
{
int size = 1000000;
if(argc > 1) size = atoi(argv[1]);
printf("Size = %d",size);
float data[size];
using namespace __gnu_cxx;
std::tr1::unordered_map<float, int> dict;
generator(data,size);
sort(data, data + size);
dict = fill_dict(data,size);
return 0;
}
If you have the library thrust installed in you machine your should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
For sure it will be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran an entirely Molecular Dynamic simulation on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameter 6 array each one with 500,000 doubles, for a total of 3 Millions doubles (22 MB).
So if you are planning to put 30 Million floating points, which is about 114 MB of global Memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with the Molecular Dynamic (MD), I would say no. The sequential MD version takes about 25 hours to complete while the GPU version took 45 Minutes. You said your application took a couple hours, also based in your code example it looks softer than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
double *x, double *y, double *z,...){
int pos = (threadIdx.x + blockIdx.x * blockDim.x);
...
while(pos < particles)
{
for (i = 0; i < particles; i++)
{
if(//inside of the same radius)
{
// calculate force
}
}
pos += blockDim.x * gridDim.x;
}
}
A simple example of a code in CUDA could be the sum of two 2D arrays:
In C:
for(int i = 0; i < N; i++)
c[i] = a[i] + b[i];
In CUDA:
__global__ add(int *c, int *a, int*b, int N)
{
int pos = (threadIdx.x + blockIdx.x)
for(; i < N; pos +=blockDim.x)
c[pos] = a[pos] + b[pos];
}
In CUDA you basically took each for iteration and assigned to each thread,
1) threadIdx.x + blockIdx.x*blockDim.x;
Each block has an ID from 0 to N-1 (N the number maximum of blocks) and each block has a 'X' number of threads with an ID from 0 to X-1.
Gives you the for loop iteration that each thread will compute based on its ID and the block ID which the thread is in; the blockDim.x is the number of threads that a block has.
So if you have 2 blocks each one with 10 threads and N=40, the:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ hash (float *largeFloatingPointArray, int *dictionary)
// You can turn the dictionary in one array of int
// here each position will represent the float
// Since x = 0f; x < 100f; x += 0.0001f
// you can associate each x to different position
// in the dictionary:
// pos 0 have the same meaning as 0f;
// pos 1 means float 0.0001f
// pos 2 means float 0.0002f ect.
// Then you use the int of each position
// to count how many times that "float" had appeared
int x = blockIdx.x; // Each block will take a different x to work
float y;
while( x < 1000000) // x < 100f (for incremental step of 0.0001f)
{
int noOfOccurrences = 0;
float z = converting_int_to_float(x); // This function will convert the x to the
// float like you use (x / 0.0001)
// each thread of each block
// will takes the y from the array of largeFloatingPointArray
for(j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
{
y = largeFloatingPointArray[j];
if (z == y)
{
noOfOccurrences++;
}
}
if(threadIdx.x == 0) // Thread master will update the values
atomicAdd(&dictionary[x], noOfOccurrences);
__syncthreads();
}
You have to use atomicAdd because different threads from different blocks may write/read noOfOccurrences concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr Dobbs Journal series CUDA: Supercomputing for the masses by Rob Farmer is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
and anothers:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look on the last item, you will find many link to learn CUDA.
OpenCL: OpenCL Tutorials | MacResearch
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating point number repetitively like that, the rounding error will add up and you won't get what you intended. I've added an if statement to my below sample to check if inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
foreach (float x in largeFloatingPointArray)
{
if (math.Truncate(x/0.0001f)*0.0001f == x)
{
if (noOfNumbers.ContainsKey(x))
noOfNumbers.Add(x, noOfNumbers[x]+1);
else
noOfNumbers.Add(x, 1);
}
}
Hope this helps.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely YES, this kind of algorithm is typically the ideal candidate for massive data-parallelism processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives : CUDA or OpenCL.
CUDA is mature with a lot of tools but is NVidia GPUs centric.
OpenCL is a standard running on NVidia and AMD GPUs, and CPUs too. So you should really favour it.
For tutorial you have an excellent series on CodeProject by Rob Farber : http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use-case there is a lot of samples for histograms buiding with OpenCL (note that many are image histograms but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.
In addition to the suggestion by the above poster use the TPL (task parallel library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.Foreach and ConcurrentDictionary, but a more complex map-reduce setup where the array is split into chunks each generating an dictionary which would then be reduced to a single dictionary would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
I am not sure whether using GPUs would be a good match given that
'largerFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self contained calculations.
I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use floating point value as the key, and the number of occurrences in the array as the value. Worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run then in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.
On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back array of occurrences and it's ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If lookup system becomes a bottleneck, then it could be replicated on as many systems as needed.
With large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.
I am working on a compiler for parallel programming in C targeted for many-core based systems, often referred to as microservers, that are/or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication code (IPC) across systems. One of the IPC mechanism available is socket/tcp/ip.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
/*
* Convert the X range from 0f to 100f in steps of 0.0001f
* into a range of integers 0 to 1 + (100 * 10000) to use as an
* index into an array.
*/
#define X_MAX (1 + (100 * 10000))
/*
* Number of floats in largeFloatingPointArray needs to be defined
* below to be whatever your value is.
*/
#define LARGE_ARRAY_MAX (1000)
main()
{
int j, y, *noOfOccurances;
float *largeFloatingPointArray;
/*
* Allocate memory for largeFloatingPointArray and populate it.
*/
largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
if (largeFloatingPointArray == 0) {
printf("out of memory\n");
exit(1);
}
/*
* Allocate memory to hold noOfOccurances. The index/10000 is the
* the floating point number. The contents is the count.
*
* E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
* in largeFloatingPointArray.
*/
noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
if (noOfOccurances == 0) {
printf("out of memory\n");
exit(1);
}
for (j = 0; j < LARGE_ARRAY_MAX; j++) {
y = (int)(largeFloatingPointArray[j] * 10000);
if (y >= 0 && y <= X_MAX) {
noOfOccurances[y]++;
}
}
}

CUDA how to get grid, block, thread size and parallalize non square matrix calculation

I am new to CUDA and need help understanding some things. I need help parallelizing these two for loops. Specifically how to setup the dimBlock and dimGrid to make this run faster. I know this looks like the vector add example in the sdk but that example is only for square matrices and when I try to modify that code for my 128 x 1024 matrix it doesn't work properly.
__global__ void mAdd(float* A, float* B, float* C)
{
for(int i = 0; i < 128; i++)
{
for(int j = 0; j < 1024; j++)
{
C[i * 1024 + j] = A[i * 1024 + j] + B[i * 1024 + j];
}
}
}
This code is part of a larger loop and is the simplest portion of the code, so I decided to try to paralleize thia and learn CUDA at same time. I have read the guides but still do not understand how to get the proper no. of grids/block/threads going and use them effectively.
As you have written it, that kernel is completely serial. Every thread launched to execute it is going to performing the same work.
The main idea behind CUDA (and OpenCL and other similar "single program, multiple data" type programming models) is that you take a "data parallel" operation - so one where the same, largely independent, operation must be performed many times - and write a kernel which performs that operation. A large number of (semi)autonomous threads are then launched to perform that operation across the input data set.
In your array addition example, the data parallel operation is
C[k] = A[k] + B[k];
for all k between 0 and 128 * 1024. Each addition operation is completely independent and has no ordering requirements, and therefore can be performed by a different thread. To express this in CUDA, one might write the kernel like this:
__global__ void mAdd(float* A, float* B, float* C, int n)
{
int k = threadIdx.x + blockIdx.x * blockDim.x;
if (k < n)
C[k] = A[k] + B[k];
}
[disclaimer: code written in browser, not tested, use at own risk]
Here, the inner and outer loop from the serial code are replaced by one CUDA thread per operation, and I have added a limit check in the code so that in cases where more threads are launched than required operations, no buffer overflow can occur. If the kernel is then launched like this:
const int n = 128 * 1024;
int blocksize = 512; // value usually chosen by tuning and hardware constraints
int nblocks = n / blocksize; // value determine by block size and total work
madd<<<nblocks,blocksize>>>mAdd(A,B,C,n);
Then 256 blocks, each containing 512 threads will be launched onto the GPU hardware to perform the array addition operation in parallel. Note that if the input data size was not expressible as a nice round multiple of the block size, the number of blocks would need to be rounded up to cover the full input data set.
All of the above is a hugely simplified overview of the CUDA paradigm for a very trivial operation, but perhaps it gives enough insight for you to continue yourself. CUDA is rather mature these days and there is a lot of good, free educational material floating around the web you can probably use to further illuminate many of the aspects of the programming model I have glossed over in this answer.