What is the optimum OpenCL 2 kernel to sum floats? - c++

C++17 introduced a number of new algorithms to support parallel execution. In particular, std::reduce is a parallel version of std::accumulate which permits non-deterministic results for non-associative operations, such as floating point addition. I want to implement a reduce algorithm using OpenCL 2.
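For reference, a minimal C++17 illustration of that behaviour (assuming a standard library with parallel algorithm support; this snippet is not part of the original question):
#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

// std::reduce may regroup the additions, so the float result can differ
// slightly from std::accumulate's strictly left-to-right sum.
int main()
{
    std::vector<float> values(1000000, 0.1f);
    float serial   = std::accumulate(values.begin(), values.end(), 0.0f);
    float parallel = std::reduce(std::execution::par, values.begin(), values.end(), 0.0f);
    std::cout << serial << " vs " << parallel << "\n";
}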
Intel has an example here which uses the OpenCL 2 work-group functions to implement a std::exclusive_scan-style OpenCL 2 kernel. Below is a kernel to sum floats, based on Intel's exclusive_scan example:
kernel void sum_float (global float* sum, global float* values)
{
    float sum_val = 0.0f;
    for (size_t i = 0u; i < get_num_groups(0); ++i)
    {
        size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
        float value = work_group_reduce_add(values[index]);
        sum_val += work_group_broadcast(value, 0u);
    }
    sum[0] = sum_val;
}
The kernel above works (or seems to!). However, exclusive_scan required the work_group_broadcast function to pass the last value of one work group to the next, whereas this kernel only requires the result of work_group_reduce_add to be added to sum_val, so an atomic add is more appropriate.
OpenCL 2 provides an atomic_int which supports atomic_fetch_add. An integer version of the kernel above using atomic_int is:
kernel void sum_int (global int* sum, global int* values)
{
    atomic_int sum_val;
    atomic_init(&sum_val, 0);
    for (size_t i = 0u; i < get_num_groups(0); ++i)
    {
        size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
        int value = work_group_reduce_add(values[index]);
        atomic_fetch_add(&sum_val, value);
    }
    sum[0] = atomic_load(&sum_val);
}
OpenCL 2 also provides an atomic_float but it doesn't support atomic_fetch_add.
What is the best way to implement an OpenCL2 kernel to sum floats?

kernel void sum_float (global float* sum, global float* values)
{
    float sum_val = 0.0f;
    for (size_t i = 0u; i < get_num_groups(0); ++i)
    {
        size_t index = get_local_id(0) + i * get_enqueued_local_size(0);
        float value = work_group_reduce_add(values[index]);
        sum_val += work_group_broadcast(value, 0u);
    }
    sum[0] = sum_val;
}
This has a race condition when writing to sum's zero-indexed element, and all work groups are doing the same computation, which makes this O(N*N) instead of O(N); it takes more than 1100 milliseconds to complete a 1M-element array sum.
For the same 1M-element array, this (global=1M, local=256)
kernel void sum_float2 (global float* sum, global float* values)
{
    float sum_partial = work_group_reduce_add(values[get_global_id(0)]);
    if (get_local_id(0) == 0)
        sum[get_group_id(0)] = sum_partial;
}
followed by this (global=4k, local=256)
kernel void sum_float3 (global float* sum, global float* values)
{
    float sum_partial = work_group_reduce_add(sum[get_global_id(0)]);
    if (get_local_id(0) == 0)
        values[get_group_id(0)] = sum_partial;
}
does the same thing in a few milliseconds, apart from a third step. The first kernel writes each group's sum into the element indexed by its group id, the second kernel sums those into 16 values, and those 16 values can easily be summed by the CPU (in microseconds or less) as the third step.
Program works like this:
values: 1.0 1.0 .... 1.0 1.0
sum_float2
sum: 256.0 256.0 256.0
sum_float3
values: 65536.0 65536.0 .... 16 items total to be summed by cpu
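For completeness, a minimal host-side sketch of that three-step flow (illustrative only; it assumes the cl2.hpp C++ wrapper, a single GPU device, and omits all error checking):
#define CL_HPP_MINIMUM_OPENCL_VERSION 200
#define CL_HPP_TARGET_OPENCL_VERSION 200
#include <CL/cl2.hpp>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

static const std::string kernelSource = R"CLC(
kernel void sum_float2(global float* sum, global float* values)
{
    float sum_partial = work_group_reduce_add(values[get_global_id(0)]);
    if (get_local_id(0) == 0)
        sum[get_group_id(0)] = sum_partial;
}
kernel void sum_float3(global float* sum, global float* values)
{
    float sum_partial = work_group_reduce_add(sum[get_global_id(0)]);
    if (get_local_id(0) == 0)
        values[get_group_id(0)] = sum_partial;
}
)CLC";

int main()
{
    const size_t local = 256;
    std::vector<float> values(1 << 20, 1.0f);       // "1M" elements, all 1.0
    const size_t groups1 = values.size() / local;   // 4096 partial sums
    const size_t groups2 = groups1 / local;         // 16 partial sums

    cl::Context context(CL_DEVICE_TYPE_GPU);
    cl::CommandQueue queue(context);
    cl::Program program(context, kernelSource);
    program.build("-cl-std=CL2.0");                 // work_group_* needs OpenCL 2.0

    cl::Buffer d_values(context, CL_MEM_READ_WRITE, values.size() * sizeof(float));
    cl::Buffer d_sum(context, CL_MEM_READ_WRITE, groups1 * sizeof(float));
    queue.enqueueWriteBuffer(d_values, CL_TRUE, 0, values.size() * sizeof(float), values.data());

    cl::Kernel k1(program, "sum_float2");           // step 1: 1M -> 4096
    k1.setArg(0, d_sum);
    k1.setArg(1, d_values);
    queue.enqueueNDRangeKernel(k1, cl::NullRange, cl::NDRange(values.size()), cl::NDRange(local));

    cl::Kernel k2(program, "sum_float3");           // step 2: 4096 -> 16
    k2.setArg(0, d_sum);
    k2.setArg(1, d_values);
    queue.enqueueNDRangeKernel(k2, cl::NullRange, cl::NDRange(groups1), cl::NDRange(local));

    // step 3: the 16 partial sums are finished on the CPU
    std::vector<float> partial(groups2);
    queue.enqueueReadBuffer(d_values, CL_TRUE, 0, groups2 * sizeof(float), partial.data());
    std::cout << std::accumulate(partial.begin(), partial.end(), 0.0f) << "\n";
}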
If you need to use atomics, you should do it as sparsely as possible. The easiest example would be using local atomics to sum many values within each group and then doing the last step with a single global atomic operation per group to add everything up. I don't have a C++ setup ready for OpenCL right now, but I guess OpenCL 2.0 atomics are better when you are using multiple devices with the same memory resource (probably in streaming mode or with SVM) and/or a CPU using C++17 functions. If you don't have multiple devices computing on the same area at the same time, then I suppose these new atomics can only be a micro-optimization on top of the already-working OpenCL 1.2 atomics. I haven't used these new atomics, so take all of this with a grain of salt.

Related

Fully Connected Layer (dot product) using AVX

I have the following C++ code to perform the multiply and accumulate steps of a fully connected layer (without the bias). Basically I just do a dot product using a vector (inputs) and a matrix (weights). I used AVX vectors to speed up the operation.
const float* src = inputs[0]->buffer();
const float* scl = weights->buffer();
float* dst = outputs[0]->buffer();
SizeVector in_dims = inputs[0]->getTensorDesc().getDims();
SizeVector out_dims = outputs[0]->getTensorDesc().getDims();
const int in_neurons = static_cast<int>(in_dims[1]);
const int out_neurons = static_cast<int>(out_dims[1]);

for (size_t n = 0; n < out_neurons; n++) {
    float accum = 0.0;
    float temp[4] = {0, 0, 0, 0};
    float *p = temp;
    __m128 in, ws, dp;

    for (size_t i = 0; i < in_neurons; i += 4) {
        // read and save the weights correctly by applying the mask
        temp[0] = scl[(i+0)*out_neurons + n];
        temp[1] = scl[(i+1)*out_neurons + n];
        temp[2] = scl[(i+2)*out_neurons + n];
        temp[3] = scl[(i+3)*out_neurons + n];
        // load input neurons sequentially
        in = _mm_load_ps(&src[i]);
        // load weights
        ws = _mm_load_ps(p);
        // dot product
        dp = _mm_dp_ps(in, ws, 0xff);
        // accumulator
        accum += dp.m128_f32[0];
    }
    // save the final result
    dst[n] = accum;
}
It works but the speedup is far from what I expected. As you can see below, a convolutional layer with 24x more operations than my custom dot product layer takes less time. This makes no sense and there should be much more room for improvement. What are my major mistakes when trying to use AVX? (I'm new to AVX programming, so I don't fully understand where I should start looking to fully optimize the code.)
**Convolutional Convolutional Layer Fully Optimized (AVX)**
Layer: CONV3-32
Input: 28x28x32 = 25K
Weights: (3*3*32)*32 = 9K
Number of MACs: 3*3*27*27*32*32 = 7M
Execution Time on OpenVINO framework: 0.049 ms
**My Custom Dot Product Layer Far From Optimized (AVX)**
Layer: FC
Inputs: 1x1x512
Outputs: 576
Weights: 3*3*64*512 = 295K
Number of MACs: 295K
Execution Time on OpenVINO framework: 0.197 ms
Thanks for all help in advance!
Addendum: What you are doing is actually a Matrix-Vector-product. It is well-understood how to implement this efficiently (although with caching and instruction-level parallelism it is not completely trivial). The rest of the answer just shows a very simple vectorized implementation.
You can drastically simplify your implementation by incrementing n+=8 and i+=1 (assuming out_neurons is a multiple of 8, otherwise, some special handling needs to be done for the last elements), i.e., you can accumulate 8 dst values at once.
A very simple implementation assuming FMA is available (otherwise use multiplication and addition):
void dot_product(const float* src, const float* scl, float* dst,
                 const int in_neurons, const int out_neurons)
{
    for (size_t n = 0; n < out_neurons; n += 8) {
        __m256 accum = _mm256_setzero_ps();
        for (size_t i = 0; i < in_neurons; i++) {
            accum = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons+n]), _mm256_set1_ps(src[i]), accum);
        }
        // save the result
        _mm256_storeu_ps(dst+n, accum);
    }
}
This could still be optimized e.g., by accumulating 2, 4, or 8 dst packets inside the inner loop, which would not only save some broadcast operations (the _mm256_set1_ps instruction), but also compensate latencies of the FMA instruction.
Godbolt-Link, if you want to play around with the code: https://godbolt.org/z/mm-YHi
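As a hedged sketch of that unrolling idea (assuming out_neurons is a multiple of 16; the function name is illustrative): two independent accumulators share one broadcast of src[i] and give the FMA units two chains whose latencies can overlap.
#include <immintrin.h>

// Same matrix-vector product as above, but unrolled by two __m256 packets per
// outer iteration; one broadcast of src[i] feeds both FMAs.
void dot_product_unrolled2(const float* src, const float* scl, float* dst,
                           const int in_neurons, const int out_neurons)
{
    for (int n = 0; n < out_neurons; n += 16) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (int i = 0; i < in_neurons; i++) {
            __m256 s = _mm256_set1_ps(src[i]);
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons + n]),     s, acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(&scl[i*out_neurons + n + 8]), s, acc1);
        }
        _mm256_storeu_ps(dst + n,     acc0);
        _mm256_storeu_ps(dst + n + 8, acc1);
    }
}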

Unable to find simple sum of 1 to 100 numbers in CUDA?

I am working on an image processing algorithm using CUDA. In my algorithm I want to find the sum of all pixels of an image using a CUDA kernel, so I made a kernel method in CUDA to compute the sum of all pixels of a 16-bit grayscale image, but I got the wrong answer.
So I made a simple program in CUDA to find the sum of the numbers 1 to 100, and my code is below.
With my code I do not get the exact sum of the numbers 1 to 100 on the GPU, but I do get the exact sum on the CPU. So what have I done wrong in that code?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <conio.h>
#include <malloc.h>
#include <limits>
#include <math.h>
using namespace std;
__global__ void computeMeanValue1(double *pixels,double *sum){
int x = threadIdx.x;
sum[0] = sum[0] + (pixels[(x)]);
__syncthreads();
}
int main(int argc, char **argv)
{
double *data;
double *dev_data;
double *dev_total;
double *total;
data=new double[(100) * sizeof(double)];
total=new double[(1) * sizeof(double)];
double cpuSum=0.0;
for(int i=0;i<100;i++){
data[i]=i+1;
cpuSum=cpuSum+data[i];
}
cout<<"CPU total = "<<cpuSum<<std::endl;
cudaMalloc( (void**)&dev_data, 100 * sizeof(double));
cudaMalloc( (void**)&dev_total, 1 * sizeof(double));
cudaMemcpy(dev_data, data, 100 * sizeof(double), cudaMemcpyHostToDevice);
computeMeanValue1<<<1,100>>>(dev_data,dev_total);
cudaDeviceSynchronize();
cudaMemcpy(total, dev_total, 1* sizeof(double), cudaMemcpyDeviceToHost);
cout<<"GPU total = "<<total[0]<<std::endl;
cudaFree(dev_data);
cudaFree(dev_total);
free(data);
free(total);
getch();
return 0;
}
All your threads are writing to the same memory location at the same time.
sum[0] = sum[0] + (pixels[(x)]);
You can't do this and expect to get the correct result. Your kernel needs to take a different approach to avoid writing to the same memory from different threads. The pattern usually employed for doing this is reduction. Simply put, with a reduction each thread is responsible for summing a block of elements within the array and then storing the result. By employing a series of these reduction operations it's possible to sum the entire contents of the array.
__global__ void block_sum(const float *input,
                          float *per_block_results,
                          const size_t n)
{
    extern __shared__ float sdata[];

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // load input into __shared__ memory
    float x = 0;
    if (i < n)
    {
        x = input[i];
    }
    sdata[threadIdx.x] = x;
    __syncthreads();

    // contiguous range pattern
    for (int offset = blockDim.x / 2;
         offset > 0;
         offset >>= 1)
    {
        if (threadIdx.x < offset)
        {
            // add a partial sum upstream to our own
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        }
        // wait until all threads in the block have
        // updated their partial sums
        __syncthreads();
    }

    // thread 0 writes the final result
    if (threadIdx.x == 0)
    {
        per_block_results[blockIdx.x] = sdata[0];
    }
}
Each thread writes to a different location, sdata[threadIdx.x], so there is no race condition. Threads are free to access other elements in sdata because they only read from them, so there are no race conditions. Note the use of __syncthreads() to ensure that the operations to load data into sdata are complete before the threads start to read the data, and the later calls to __syncthreads() to ensure that all the summation operations have completed before copying the final result from sdata[0]. Note that only thread 0 writes its result to per_block_results[blockIdx.x], so there is no race condition there either.
You can find the complete sample code for the above on Google Code (I did not write this). This slide deck has a reasonable summary of reductions in CUDA. It includes diagrams which really help in understanding how the interleaved memory reads and writes do not conflict with each other.
You can find lots of other material on efficient implementations of reduction on GPUs. Ensuring that your implementation makes most efficient use of memory is key to getting the best performance out of a memory bound operation like reduction.
In GPU code, we have multiple threads executing in parallel. If all of those threads attempt to update the same location in memory, we have undefined behavior, unless we use special operations, called atomics to do the update.
In your case, since sum is updated by all threads, and sum is a double quantity, we can use the special custom atomic function described in the programming guide to accomplish this.
If I replace your kernel code with the following:
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
        (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void computeMeanValue1(double *pixels, double *sum)
{
    int x = threadIdx.x;
    atomicAdd(sum, pixels[x]);
}
And initialize the sum value to zero before the kernel:
double gpuSum = 0.0;
cudaMemcpy(dev_total, &gpuSum, sizeof(double), cudaMemcpyHostToDevice);
Then I think you'll get matching results.
As #AdeMiller pointed out, the faster way to perform parallel sums like this is via classical parallel reduction.
There is a CUDA sample code that demonstrates this and an accompanying presentation that covers the methodology.

Is it possible to run the sum computation in parallel in OpenCL?

I am a newbie in OpenCL. However, I understand the basics of C/C++ and OOP.
My question is as follows: is it somehow possible to run the sum computation task in parallel? Is it theoretically possible? Below I will describe what I've tried to do:
The task is, for example:
double* values = new double[1000]; //let's pretend it has some random values inside
double sum = 0.0;
for(int i = 0; i < 1000; i++) {
sum += values[i];
}
What I tried to do in OpenCL kernel (and I feel it is wrong because perhaps it accesses the same "sum" variable from different threads/tasks at the same time):
__kernel void calculate2dim(__global float* vectors1dim,
                            __global float output,
                            const unsigned int count)
{
    int i = get_global_id(0);
    output += vectors1dim[i];
}
This code is wrong. I would highly appreciate it if anyone could tell me whether it is theoretically possible to run such tasks in parallel, and if it is, how!
If you want to sum the values of your array in a parallel fashion, you should make sure you reduce contention and make sure there are no data dependencies across threads.
Data dependencies will cause threads to have to wait for each other, creating contention, which is what you want to avoid to get true parallelization.
One way you could do that is to split your array into N arrays, each containing some subsection of your original array, and then calling your OpenCL kernel function with each different array.
At the end, when all kernels have done the hard work, you can just sum up the results of each array into one. This operation can easily be done by the CPU.
The key is to not have any dependencies between the calculations done in each kernel, so you have to split your data and processing accordingly.
I don't know if your data has any actual dependencies from your question, but that is for you to figure out.
The sketch below should illustrate the idea.
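Here is a minimal sketch of that split / partial-sum / combine pattern, written in plain C++ with std::thread standing in for the OpenCL work groups (the function name chunked_sum is just illustrative):
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Split the input into `chunks` pieces, sum each piece independently (no shared
// writes), then combine the partial results sequentially at the end.
double chunked_sum(const std::vector<double>& values, std::size_t chunks)
{
    std::vector<double> partial(chunks, 0.0);
    std::vector<std::thread> workers;
    const std::size_t step = (values.size() + chunks - 1) / chunks;
    for (std::size_t c = 0; c < chunks; ++c) {
        workers.emplace_back([&values, &partial, step, c] {
            const std::size_t begin = std::min(values.size(), c * step);
            const std::size_t end   = std::min(values.size(), begin + step);
            // each worker owns partial[c] exclusively, so there is no contention
            partial[c] = std::accumulate(values.begin() + begin,
                                         values.begin() + end, 0.0);
        });
    }
    for (auto& w : workers) w.join();
    // cheap final combine, done sequentially (the "CPU" step)
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}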
E.g. you have N elements, and the size of your workgroup is WS = 64. I assume that N is a multiple of 2*WS (this is important: one workgroup calculates the sum of 2*WS elements). Then you need to run the kernel specifying:
globalSizeX = 2*WS*(N/(2*WS));
As a result, the sum array will have partial sums of 2*WS elements each (e.g. sum[1] will contain the sum of the elements whose indices are from 2*WS to 4*WS-1).
If your globalSizeX is 2*WS or less (which means that you have only one workgroup), then you are done. Just use sum[0] as the result.
If not, you need to repeat the procedure, this time using the sum array as the input array and outputting to another array (create 2 arrays and ping-pong between them), and so on until you have only one workgroup.
Search also for the Hillis/Steele and Blelloch parallel scan algorithms.
This article could be useful as well
Here is the actual example:
#define WS 64   // workgroup size, as assumed in the explanation above

__kernel void par_sum(__global unsigned int* input, __global unsigned int* sum)
{
    int li = get_local_id(0);
    int groupId = get_group_id(0);

    __local unsigned int our_h[2 * WS];
    our_h[2*li + 0] = input[2*WS*groupId + 2*li + 0];
    our_h[2*li + 1] = input[2*WS*groupId + 2*li + 1];

    // sweep up
    int width = 2;
    int num_el = 2*WS/width;
    int wby2 = width>>1;
    for (int i = 2*WS>>1; i > 0; i >>= 1)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (li < num_el)
        {
            int idx = width*(li+1) - 1;
            our_h[idx] = our_h[idx] + our_h[idx - wby2];
        }
        width <<= 1;
        wby2 = width>>1;
        num_el >>= 1;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // thread 0 writes the group's sum
    if (0 == li)
        sum[groupId] = our_h[2*WS - 1];
}

GPU for loops: avoid warp divergence & implicit syncthreads

My situation: each thread in a warp operates on its own completely independent & distinct data array. All threads loop over their data array. The number of loop iterations is different for each thread. (This incurs a cost, I know).
Within the for loop, each thread needs to save the maximum value after calculating three floats. After the for-loop, threads in warp will "communicate" by checking the maximum value calculated by only their "neighboring thread" in the warp (determined by parity).
Questions:
If I avoid the conditionals in a "max" operation by doing multiplication, this will avoid warp divergence, right? (see example code below)
The extra multiplication operations mentioned in (1.) are worth it, right? - i.e. far faster than any sort of warp divergence.
The same mechanism that causes warp divergence (one set of instructions for all threads) can be exploited as an implicit "thread barrier" (for the warp) at the end of the for-loop (much the same way as with an "#pragma omp for" statement in non-GPU computing). Thus I don't need to make a "syncthreads" call for a warp after the for loop before one thread checks the value saved by another thread, right? (This would be because "syncthreads" is only for the "entire GPU", i.e. inter-warp and inter-MP, right?)
example code:
__shared__ int N_per_data;  // loaded from host
__shared__ float ** data;   // loaded from host
data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    data[j] = new float[N_per_data[j]];
// the values of jagged matrix "data" are loaded from host.

__shared__ float **max_data = new float*[num_threads_in_warp];
for (int j = 0; j < num_threads_in_warp; ++j)
    max_data[j] = new float[N_per_data[j]];

for (uint j = 0; j < N_per_data[threadIdx.x]; ++j)
{
    const float a = f(data[threadIdx.x][j]);
    const float b = g(data[threadIdx.x][j]);
    const float c = h(data[threadIdx.x][j]);

    const int cond_a = (a > b) && (a > c);
    const int cond_b = (b > a) && (b > c);
    const int cond_c = (c > a) && (c > b);
    // avoid if-statements. question (1) and (2)
    max_data[threadIdx.x][j] = cond_a * a + cond_b * b + cond_c * c;
}

// Question (3):
// No "syncthreads" necessary in next line:
// access data of your mate at some magic positions (assume it exists):
float my_neighbors_max_at_7 = max_data[threadIdx.x + pow(-1,(threadIdx.x % 2) == 1) ][7];
Before implementing my algorithm on a GPU, I am investigating every aspect of the algorithm to ensure that it will be worth the implementation effort. So please bear with me..
Yes
My guess would be NO - depends on how you would write the other version with the ifs.
The compiler will probably use predicates to mask out the unwanted writes, in which case there would be no real thread divergence, just a few executed but masked out write instructions.
You should let the compiler do its magic and compare the decompiled code for both versions to determine which is the better solution.
In your particular case of calculating a maximum of signed integers, d = a > b ? a : b translates to one PTX ISA instruction, max.s32, so there is really no need to make it as complicated as you did... just compute the maximum into a temporary variable and do one unconditional write.
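A minimal sketch of that suggestion (illustrative only, not from the original answer; fmaxf is the standard single-precision maximum, which the compiler can lower to a hardware max instruction):
#include <math.h>

// Branch-free maximum of three floats: one unconditional write at the call
// site replaces the three condition flags and multiplications above.
inline float max3(float a, float b, float c)
{
    return fmaxf(a, fmaxf(b, c));
}
// usage inside the loop:  max_data[threadIdx.x][j] = max3(a, b, c);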
Yes, but the syncthreads barrier is an intra-block barrier, not inter-block and certainly not inter-MP.

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods perform different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;
    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }
    noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
UPDATE GPU Version
__global__ void hash (float *largeFloatingPointArray, int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
    float y;                                         // compute one (or more) floats
    int noOfOccurrences = 0;
    int a;

    while (x < size)             // While there is work to do, each thread will:
    {
        dictionary[x] = 0;       // Initialize the position in which it will work
        noOfOccurrences = 0;

        for (int j = 0; j < largeFloatingPointArraySize; j++) // Search for floats
        {                                                     // that are equal to it
            y = largeFloatingPointArray[j];  // Take a candidate from the floats array
            y *= 10000;                      // e.g. if y = 0.0001f;
            a = y + 0.5;                     // a = 1 + 0.5 = 1;
            if (a == x) noOfOccurrences++;
        }

        dictionary[x] += noOfOccurrences;    // Update in the dictionary
                                             // the number of times that the float appears

        x += blockDim.x * gridDim.x;         // Update the position where the thread will work
    }
}
I have only tested this one with smaller inputs, because I am testing on my laptop. Nevertheless, it is working, but more tests are needed.
UPDATE Sequential Version
I just did this naive version that executes your algorithm for an array with 30,000,000 elements in less than 20 seconds (including the time taken by the function that generates the data).
This naive version first sorts your array of floats. Afterward, it goes through the sorted array and checks the number of times a given value appears in the array, then puts this value in a dictionary along with the number of times it has appeared.
You can use a sorted map instead of the unordered_map that I used.
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>

typedef std::tr1::unordered_map<float, int> Mymap;
using namespace std;

void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;
    for (long int i = 0; i < size; i++)
        data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}

void print_array(float *data, long int size)
{
    for (long int i = 2; i < size; i++)
        printf("%f\n", data[i]);
}

std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    std::tr1::unordered_map<float, int> dict;
    for (long int i = 1; i < size; i++)
    {
        if (previous == data[i])
            count++;
        else
        {
            dict.insert(Mymap::value_type(previous, count));
            previous = data[i];
            count = 1;
        }
    }
    dict.insert(Mymap::value_type(previous, count)); // add the last member
    return dict;
}

void printMAP(std::tr1::unordered_map<float, int> dict)
{
    for (std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
    {
        std::cout << "key(float): " << i->first << ", value(int): " << i->second << std::endl;
    }
}

int main(int argc, char** argv)
{
    int size = 1000000;
    if (argc > 1) size = atoi(argv[1]);
    printf("Size = %d", size);

    float *data = new float[size];   // heap allocation; a stack array of this size would overflow
    std::tr1::unordered_map<float, int> dict;

    generator(data, size);
    sort(data, data + size);
    dict = fill_dict(data, size);

    delete[] data;
    return 0;
}
If you have the Thrust library installed on your machine, you should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
For sure it will be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran an entire molecular dynamics simulation on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameters 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).
So if you are planning to put in 30 million floating-point values, which is about 114 MB of global memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with molecular dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours; based on your code example it also looks lighter than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z, ...)
{
    int pos = (threadIdx.x + blockIdx.x * blockDim.x);
    ...
    while (pos < particles)
    {
        for (i = 0; i < particles; i++)
        {
            if (/* inside of the same radius */)
            {
                // calculate force
            }
        }
        pos += blockDim.x * gridDim.x;
    }
}
A simple example of CUDA code could be the sum of two arrays:
In C:
for(int i = 0; i < N; i++)
c[i] = a[i] + b[i];
In CUDA:
__global__ void add(int *c, int *a, int *b, int N)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    for (; pos < N; pos += blockDim.x * gridDim.x)
        c[pos] = a[pos] + b[pos];
}
In CUDA you basically take each for-loop iteration and assign it to a thread,
1) threadIdx.x + blockIdx.x*blockDim.x;
Each block has an ID from 0 to N-1 (N being the maximum number of blocks) and each block has an 'X' number of threads with an ID from 0 to X-1.
This gives you the for-loop iteration that each thread will compute based on its ID and the ID of the block the thread is in; blockDim.x is the number of threads that a block has.
So if you have 2 blocks each one with 10 threads and N=40, the:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ void hash (float *largeFloatingPointArray, int largeFloatingPointArraySize, int *dictionary)
// You can turn the dictionary into one array of int;
// each position will represent one float.
// Since x = 0f; x < 100f; x += 0.0001f
// you can associate each x with a different position
// in the dictionary:
// pos 0 has the same meaning as 0f;
// pos 1 means float 0.0001f
// pos 2 means float 0.0002f etc.
// Then you use the int of each position
// to count how many times that "float" appeared
{
    int x = blockIdx.x;     // Each block will take a different x to work on
    float y;

    while (x < 1000000)     // x < 100f (for an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = converting_int_to_float(x); // hypothetical helper that converts the index x
                                              // back to the float it represents (x * 0.0001f)

        // each thread of the block takes its share of y values
        // from largeFloatingPointArray
        for (int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }

        // every thread adds its private count; atomicAdd is needed because all
        // threads of the block target the same dictionary entry
        atomicAdd(&dictionary[x], noOfOccurrences);
        __syncthreads();

        x += gridDim.x;     // move on to the next x handled by this block
    }
}
You have to use atomicAdd because several threads may update the same dictionary entry concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr. Dobb's Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
and others:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look at the last item; you will find many links for learning CUDA.
OpenCL: OpenCL Tutorials | MacResearch
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating point number repetitively like that; the rounding error will add up and you won't get what you intended. I've added an if statement to my sample below to check whether the inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

foreach (float x in largeFloatingPointArray)
{
    if (Math.Truncate(x/0.0001f)*0.0001f == x)
    {
        if (noOfNumbers.ContainsKey(x))
            noOfNumbers[x] += 1;
        else
            noOfNumbers.Add(x, 1);
    }
}
Hope this helps.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely yes: this kind of algorithm is typically an ideal candidate for massive data-parallel processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives: CUDA or OpenCL.
CUDA is mature, with a lot of tools, but is centred on NVIDIA GPUs.
OpenCL is a standard that runs on NVIDIA and AMD GPUs, and on CPUs too. So you should really favour it.
For tutorials you have an excellent series on CodeProject by Rob Farber: http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use case there are a lot of samples for histogram building with OpenCL (note that many are image histograms, but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.
In addition to the suggestion by the above poster, use the TPL (Task Parallel Library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.ForEach and ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks each generating a dictionary which would then be reduced to a single dictionary, would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
I am not sure whether using GPUs would be a good match, given that 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.
I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use floating point value as the key, and the number of occurrences in the array as the value. Worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run then in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.
On another system, let's call it 'main', create chunks of work, 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g. a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back an array of occurrences and is ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If the lookup system becomes a bottleneck, then it could be replicated on as many systems as needed.
With a large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.
I am working on a compiler for parallel programming in C targeted at many-core based systems, often referred to as microservers, that are or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication (IPC) across systems. One of the IPC mechanisms available is sockets/TCP/IP.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
#include <stdio.h>
#include <stdlib.h>

/*
 * Convert the X range from 0f to 100f in steps of 0.0001f
 * into a range of integers 0 to (100 * 10000) to use as an
 * index into an array.
 */
#define X_MAX (1 + (100 * 10000))

/*
 * Number of floats in largeFloatingPointArray needs to be defined
 * below to be whatever your value is.
 */
#define LARGE_ARRAY_MAX (1000)

int main()
{
    int j, y, *noOfOccurances;
    float *largeFloatingPointArray;

    /*
     * Allocate memory for largeFloatingPointArray and populate it.
     */
    largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
    if (largeFloatingPointArray == 0) {
        printf("out of memory\n");
        exit(1);
    }

    /*
     * Allocate memory to hold noOfOccurances. The index/10000 is the
     * floating point number. The contents is the count.
     *
     * E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
     * in largeFloatingPointArray.
     */
    noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
    if (noOfOccurances == 0) {
        printf("out of memory\n");
        exit(1);
    }

    for (j = 0; j < LARGE_ARRAY_MAX; j++) {
        y = (int)(largeFloatingPointArray[j] * 10000);
        if (y >= 0 && y < X_MAX) {
            noOfOccurances[y]++;
        }
    }

    return 0;
}