Is the following data processing task suitable for GPU computing? - c++

I'm looking to upgrade my graphics card to be able to process the following task in parallel.
As I have no experience in GPU computing, would this task be suitable, and is it possible to estimate the rate at which the processing could be done before I buy?
My project is publicly funded but has a limited budget, so I need to make the right choice.
I have an in-house-built camera chip that produces 4x 256x256 images at 100 fps. The data is accessed by calling a C function, passing a pointer to an array of type unsigned short. I can read the data out fast enough into a memory buffer.
Currently the raw data is saved to disk and then processed offline later, but for future lab experiments with this camera I wish to access data derived from the images as the experiment runs.
I have written methods in C++ using valarray to calculate the derived data, but it is too slow on my current hardware at about 40 ms per frame. (I have experimented with optimisation and have already cut the time considerably from over 100 ms.)
If a frame is denoted by S, the four subframes (in time) are S1, S2, S3, S4.
I must calculate the following images and their image averages:
(S1 + S2 + S3 + S4) / 4
sqrt((S3 - S1)^2 + (S4 - S2)^2)
arctan((S3 - S1) / (S2 - S4))
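For concreteness, the per-pixel computation is essentially this (a minimal scalar sketch, not my actual valarray code; function and buffer names are illustrative):

#include <cmath>
#include <cstddef>

// One frame = four 256x256 subframes of unsigned short pixels.
// Writes the three derived images into caller-provided float buffers.
void processFrame(const unsigned short* S1, const unsigned short* S2,
                  const unsigned short* S3, const unsigned short* S4,
                  float* average, float* magnitude, float* phase,
                  std::size_t n /* = 256 * 256 */)
{
    for (std::size_t i = 0; i < n; ++i) {
        const float d31 = float(S3[i]) - float(S1[i]);
        const float d42 = float(S4[i]) - float(S2[i]);
        const float d24 = float(S2[i]) - float(S4[i]);
        average[i]   = (float(S1[i]) + float(S2[i]) + float(S3[i]) + float(S4[i])) / 4.0f;
        magnitude[i] = std::sqrt(d31 * d31 + d42 * d42);
        // As written in the formula above; std::atan2(d31, d24) would avoid the
        // division by zero and recover the full quadrant if that is acceptable.
        phase[i]     = std::atan(d31 / d24);
    }
}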

It seems like a good fit for an operation to be carried out by a GPU. GPUs are better suited than CPUs to performing massive numbers of relatively simple calculations. They are not as efficient when there is branching logic or there are interdependencies between 'threads'. Although this kind of wanders into 'opinion' territory, I'll try to back up my answer with some numbers.
As a quick estimation of the performance you can expect, I made a quick HLSL pixel shader which does your proposed operations (untested - no guarantee of functionality!):
Texture2D S[4] : register(t0);
SamplerState mySampler : register(s0);

struct PS_OUT
{
    float4 average : SV_Target0;
    float4 sqrt : SV_Target1;
    float4 arctan : SV_Target2;
};

PS_OUT main(float2 UV: TEXCOORD0)
{
    PS_OUT output;
    float4 SSamples[4];
    int i;
    for (i = 0; i < 4; i++)
    {
        SSamples[i] = S[i].Sample(mySampler, UV);
    }
    float4 s3ms1 = SSamples[2] - SSamples[0];
    float4 s4ms2 = SSamples[3] - SSamples[1];
    output.average = (SSamples[0] + SSamples[1] + SSamples[2] + SSamples[3]) / 4.0;
    output.sqrt = sqrt(s3ms1*s3ms1 + s4ms2*s4ms2);
    output.arctan = atan(s3ms1 / s4ms2);
    return output;
}
When compiling this (fxc /T ps_4_0 example.ps), it gives the estimation of: Approximately 32 instruction slots used.
If you are processing 256x256 (64k) pixels per frame, that works out to about 2.1M instructions per frame, or 210M instructions per second at 100 fps. Looking at a chart of GPU performance (Nvidia for example: http://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units), all their GPUs past the GeForce 4 (circa 2005) have sufficient speed to achieve this.
Note that this shader performance figure is only an estimate, the listed rates are theoretical maximums, and I'm only accounting for the pixel-unit work (although it will be doing the majority of the work). However, with any sufficiently recent video card the FLOPS will far exceed your needs, so you should easily be able to do this on the GPU at 100 fps. Assuming you have a PC newer than 2005, you probably already have a video card powerful enough.

In addition to what @MuertoExcobito already wrote, you must also account for copying the data to and from the GPU; in your case, however, this is not much data (4 x 256 x 256 pixels x 2 bytes is about 512 KB per frame, roughly 50 MB/s at 100 fps).
I created a simple thrust-based implementation which can be compiled and run using CUDA 7 like this:
nvcc -std=c++11 main.cu && ./a.out
Averaged over 10000 runs, one iteration (copying to the GPU, calculating the three result images, and copying the results back from the GPU) takes 1.79 ms on my computer (Ubuntu 14.04 x64, Intel Xeon @ 3.6 GHz, GeForce GTX 680).
The file "helper_math.h" is adapted from the CUDA SDK and can be found here:
https://gist.github.com/dachziegel/70e008dee7e3f0c18656
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform.h>
#include <vector_types.h>

#include <iostream>
#include <chrono>

#include "helper_math.h"

template<typename T>
struct QuadVec
{
    T S1, S2, S3, S4;
    QuadVec(const int N) : S1(N), S2(N), S3(N), S4(N) {}
};

template<typename T>
struct Result
{
    T average, sqrt, arctan;
    Result(const int N) : average(N), sqrt(N), arctan(N) {}
};

typedef thrust::tuple<float4,float4,float4,float4> QuadInput;
typedef thrust::tuple<float4,float4,float4> TripleOutput;

struct CalcResult : public thrust::unary_function<QuadInput,TripleOutput>
{
    __host__ __device__
    TripleOutput operator()(const QuadInput& f) const
    {
        const float4 s3ms1 = thrust::get<2>(f) - thrust::get<0>(f);
        const float4 s4ms2 = thrust::get<3>(f) - thrust::get<1>(f);
        const float4 sqrtArg = s3ms1*s3ms1 + s4ms2*s4ms2;
        const float4 atanArg = s3ms1 / s4ms2;
        return thrust::make_tuple((thrust::get<0>(f) + thrust::get<1>(f) + thrust::get<2>(f) + thrust::get<3>(f)) / 4.0f,
                                  make_float4(sqrtf(sqrtArg.x), sqrtf(sqrtArg.y), sqrtf(sqrtArg.z), sqrtf(sqrtArg.w)),
                                  make_float4(atanf(atanArg.x), atanf(atanArg.y), atanf(atanArg.z), atanf(atanArg.w)));
    }
};

int main()
{
    typedef thrust::host_vector<float4> HostVec;
    typedef thrust::device_vector<float4> DevVec;

    const int N = 256;
    QuadVec<HostVec> hostFrame(N*N);
    QuadVec<DevVec> devFrame(N*N);
    Result<HostVec> hostResult(N*N);
    Result<DevVec> devResult(N*N);

    const int runs = 10000;
    int accumulatedDuration = 0;

    for (int i = 0; i < runs; ++i)
    {
        auto start = std::chrono::system_clock::now();

        thrust::copy(hostFrame.S1.begin(), hostFrame.S1.end(), devFrame.S1.begin());
        thrust::copy(hostFrame.S2.begin(), hostFrame.S2.end(), devFrame.S2.begin());
        thrust::copy(hostFrame.S3.begin(), hostFrame.S3.end(), devFrame.S3.begin());
        thrust::copy(hostFrame.S4.begin(), hostFrame.S4.end(), devFrame.S4.begin());

        thrust::transform(thrust::make_zip_iterator(make_tuple(devFrame.S1.begin(), devFrame.S2.begin(), devFrame.S3.begin(), devFrame.S4.begin())),
                          thrust::make_zip_iterator(make_tuple(devFrame.S1.end(), devFrame.S2.end(), devFrame.S3.end(), devFrame.S4.end())),
                          thrust::make_zip_iterator(make_tuple(devResult.average.begin(), devResult.sqrt.begin(), devResult.arctan.begin())),
                          CalcResult());

        thrust::copy(devResult.average.begin(), devResult.average.end(), hostResult.average.begin());
        thrust::copy(devResult.sqrt.begin(), devResult.sqrt.end(), hostResult.sqrt.begin());
        thrust::copy(devResult.arctan.begin(), devResult.arctan.end(), hostResult.arctan.begin());

        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::system_clock::now() - start);
        accumulatedDuration += duration.count();
    }

    std::cout << accumulatedDuration/runs << std::endl;
    return 0;
}
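One practical detail for the camera data (a sketch of mine, not part of the benchmark above): the frames arrive as unsigned short while the kernel works on float4, so before the host-to-device copies you would convert and pack four adjacent pixels into each float4 (the host and device vectors would then hold 256*256/4 elements per subframe rather than the N*N used above for benchmarking). Assuming each subframe is a contiguous block of pixels:

#include <thrust/host_vector.h>
#include <cuda_runtime.h>   // float4, make_float4

// Hypothetical packing step: four consecutive camera pixels (unsigned short)
// become one float4, so a 256x256 subframe fills 256*256/4 float4 elements.
void packSubframe(const unsigned short* src,
                  thrust::host_vector<float4>& dst,
                  int numPixels)
{
    for (int i = 0; i < numPixels / 4; ++i)
    {
        dst[i] = make_float4(float(src[4 * i + 0]), float(src[4 * i + 1]),
                             float(src[4 * i + 2]), float(src[4 * i + 3]));
    }
}

// Usage before the thrust::copy calls in the loop above, e.g.:
//   packSubframe(cameraPointerS1, hostFrame.S1, 256 * 256);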

Related

Count elements in texture

I have a 3D texture of 32-bit unsigned integers initialized with zeroes. It is defined as follows:
D3D11_TEXTURE3D_DESC description{};
description.Format = DXGI_FORMAT_R32_UINT;
description.Usage = D3D11_USAGE_DEFAULT;
description.BindFlags = D3D11_BIND_SHADER_RESOURCE | D3D11_BIND_UNORDERED_ACCESS;
description.CPUAccessFlags = 0;
description.MipLevels = 1;
description.Width = ...;
description.Height = ...;
description.Depth = ...;
I am writing to this texture in a compute shader to set a bit at a specified position if a certain condition is fulfilled:
RWTexture3D<uint> txOutput : register(u0);

cbuffer InputBuffer : register(b0)
{
    uint position;
    /** other elements **/
}

#define SET_BIT(value, position) value |= (1U << position)

[numthreads(8, 8, 8)]
void main(uint3 threadID : SV_DispatchThreadID)
{
    if (/** some condition **/)
    {
        uint value = txOutput[threadID];
        SET_BIT(value, position);
        txOutput[threadID] = value;
    }
}
I need to know, in the C++ code, how many elements of this texture have a certain bit position set. How could this be done?
You will have to read the texture back to the CPU. Because it was created with D3D11_USAGE_DEFAULT and no CPU access flags, you first copy it into a staging texture with ID3D11DeviceContext::CopyResource and then map that staging copy with the ID3D11DeviceContext::Map API:
https://learn.microsoft.com/en-us/windows/win32/api/d3d11/nf-d3d11-id3d11devicecontext-map
Map gives you a void* that you cast to uint32_t*, which points to the start of your data (respect the row and depth pitches when walking it).
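As a rough sketch of what that looks like (untested; 'device', 'context', 'gpuTexture' and 'position' are placeholders for your own objects, and 'description' is the D3D11_TEXTURE3D_DESC above):

// Create a CPU-readable staging copy of the texture.
D3D11_TEXTURE3D_DESC stagingDesc = description;
stagingDesc.Usage          = D3D11_USAGE_STAGING;
stagingDesc.BindFlags      = 0;
stagingDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;

ID3D11Texture3D* staging = nullptr;
device->CreateTexture3D(&stagingDesc, nullptr, &staging);
context->CopyResource(staging, gpuTexture);

D3D11_MAPPED_SUBRESOURCE mapped = {};
context->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);

// Count texels that have the requested bit set, honoring the pitches.
size_t count = 0;
for (UINT z = 0; z < stagingDesc.Depth; ++z)
    for (UINT y = 0; y < stagingDesc.Height; ++y)
    {
        const uint32_t* row = reinterpret_cast<const uint32_t*>(
            static_cast<const uint8_t*>(mapped.pData)
            + z * mapped.DepthPitch + y * mapped.RowPitch);
        for (UINT x = 0; x < stagingDesc.Width; ++x)
            if (row[x] & (1u << position))   // same bit index as in the cbuffer
                ++count;
    }

context->Unmap(staging, 0);
staging->Release();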
You need to get better at looking up the DirectX documentation; it's really quite good. There are a lot of harder things you will need to find in the documentation if you keep doing 3D graphics.
Edit: I use DirectML to do these tasks now, only using compute shaders for exotic work.
When I need to read back to the CPU I always accumulate on the CPU, because accumulating on the GPU is difficult and only partially parallel.
Summing a texture on the GPU is called a parallel reduction, and this type of programming is called general-purpose GPU (GPGPU) programming. An excellent resource on GPGPU for DirectCompute is this set of NVIDIA slides, which walks through optimizing parallel reductions: https://on-demand.gputechconf.com/gtc/2010/presentations/S12312-DirectCompute-Pre-Conference-Tutorial.pdf
From the slides:
for (unsigned int s = groupDim_x / 2; s > 0; s >>= 1)
{
    if (tid < s)
    {
        sdata[tid] += sdata[tid + s];
    }
    GroupMemoryBarrierWithGroupSync();
}

I followed a CUDA tutorial but my GPU computation time is much longer than my CPU time?

I followed the tutorial on this page but my results are terrible. The time taken is as follows:
CPU: 569
GPU: 11160
Here is my code. What is going wrong? I can't see why it is so slow.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <chrono>
#include <iostream>
#include <math.h>
#include <stdio.h>
__global__ void addCUDA(int n, float* x, float* y)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
void add(int n, float* x, float* y)
{
for (int i = 0; i < n; i++)
y[i] = x[i] + y[i];
}
int main()
{
int N = 1 << 20;
float* x = new float[N];
float* y = new float[N];
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
auto t1 = std::chrono::high_resolution_clock::now();
add(N, x, y);
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
std::cout << duration << std::endl;
delete[] x;
delete[] y;
float* u,
float* v;
cudaMallocManaged(&u, N * sizeof(float));
cudaMallocManaged(&v, N * sizeof(float));
for (int i = 0; i < N; i++) {
u[i] = 1.0f;
v[i] = 2.0f;
}
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
int device = -1;
cudaGetDevice(&device);
cudaMemPrefetchAsync(u, N * sizeof(float), device, NULL);
cudaMemPrefetchAsync(v, N * sizeof(float), device, NULL);
auto t3 = std::chrono::high_resolution_clock::now();
addCUDA<<<numBlocks, blockSize>>> (N, u, v);
cudaDeviceSynchronize();
auto t4 = std::chrono::high_resolution_clock::now();
duration = std::chrono::duration_cast<std::chrono::microseconds>(t4 - t3).count();
maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(v[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
std::cout << duration << std::endl;
cudaFree(u);
cudaFree(v);
return 0;
}
For such a trivial operation (+ on each element) it takes far more time to send the buffers from the host to the GPU and to retrieve the buffer from the GPU back to the host than to perform the actual computation.
Even if the API is very comfortable and makes buffer accesses look easy and almost magic, the data has to travel through the PCI-Express bus...
The transfer is asynchronous here, but the computation has to wait for it to complete before actually starting; asynchronous transfer is interesting only if you have something else to do in the meantime (organise various stages of a complex computation as a pipeline for example).
If you try with another problem that requires much more computation, the buffer transfers will be amortized.
Moreover, two arrays of 1<<20 floats require only 8 MB and can fit in the cache memory of a modern CPU.
Then, after the initialisation of these two arrays, they may be already hot in cache memory and easily accessible for CPU computation.
Because the computation is a perfectly regular loop, a decent optimizing compiler will use SIMD instructions, the CPU won't mispredict branches and will perfectly prefetch the data in the various cache levels; all of this greatly increases CPU efficiency for this kind of computation.
It's not so easy to outperform a modern CPU with a GPU.
It really depends on the size and the complexity of the problem (and on the specific properties of these two pieces of hardware, of course).
EDIT
As discussed in the comments, the timing method used in the cited article and the one shown in the question are very different.
In the article, nvprof uses internal counters in the GPU to measure the time spent actively computing the addCUDA() function (add() in the article), without considering the time it takes to obtain the two source buffers from the host or to send the resulting buffer back to the host.
Of course it's fast! On most modern hardware (CPU or GPU), most of the time is spent accessing/transferring data rather than computing. If we measured the time our CPU spends performing only the additions, ignoring the time spent fetching/writing data from/to cache/memory, it would not be very long either!
(Note that the CPU code in the article is not even compiled with optimisation turned on; do such timings have any meaning?)
In the code shown in the question, the timing method is quite different but much more relevant in my opinion.
The two calls to std::chrono::high_resolution_clock::now() actually consider the time spent doing all the work: sending the two source buffers, computing on them and fetching the resulting buffer.
It's the only duration that matters after all!
This way, it is fair to compare this duration to the one we obtain (with a similar method) when timing the CPU.
The fact that cudaMemPrefetchAsync() is used can be misleading because we could think that the transfer of the source buffers is excluded from the timings: it is not, and that's why we find the result disappointing compared to the article.
We launch the timer right after these two calls in order to measure the time spent in the computation, but the computation has to wait for these transfers to complete before actually starting (I would even have started the timer before these two calls).
Moreover, the call to cudaDeviceSynchronize() before stopping the timer waits for the transfer of the resulting buffer to complete in order to actually make the result available to the host.
If we used cudaDeviceSynchronize() before starting the timer, we could have excluded the two initial transfers from the timing, but what's the point of such a timing?
In conclusion, I think the timing method you used in your question is much better than the one promoted in the article since you can really compare the benefit you obtain (or not!) from one technology over the other.
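For reference, if you do want to reproduce the kernel-only number that nvprof reports, CUDA events measure exactly that span on the GPU; a minimal sketch (not from the article, and with unified memory it still includes any page faults the kernel triggers):

// Times only the kernel execution, excluding explicit host<->device transfers.
cudaEvent_t startEvent, stopEvent;
cudaEventCreate(&startEvent);
cudaEventCreate(&stopEvent);

cudaEventRecord(startEvent);
addCUDA<<<numBlocks, blockSize>>>(N, u, v);
cudaEventRecord(stopEvent);
cudaEventSynchronize(stopEvent);

float kernelMs = 0.0f;
cudaEventElapsedTime(&kernelMs, startEvent, stopEvent);
std::cout << "kernel only: " << kernelMs << " ms" << std::endl;

cudaEventDestroy(startEvent);
cudaEventDestroy(stopEvent);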
For information, on my computers, with full optimisation turned on, your code gives these results:
CPU:  809  Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
GPU: 1160  NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
CPU:  157  Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz
GPU: 1158  NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

OpenCL GPU Programming with Intel HD Graphics 4000

I have been trying to implement a simple parallel algorithm using OpenCL c++ bindings (version 1.2).
Roughly, here is the C code (no OpenCL):
typedef struct coord{
    double _x;
    double _y;
    double _z;
}__coord;

typedef struct node{
    __coord _coord;
    double _dist;
} __node;

double input[3] = {-1.0, -2, 3.5};
//nodeVector1D is a 1Dim random array of struct __node
//nodeVectorSize is the size of the above array (>1,000)
double d = 0.0;
for(int i=0; i < nodeVectorSize; i++){
    __node n = nodeVector1D[i];
    d += (input[0] - n._coord._x)*(input[0] - n._coord._x);
    d += (input[1] - n._coord._y)*(input[1] - n._coord._y);
    d += (input[2] - n._coord._z)*(input[2] - n._coord._z);
    n._dist = d;
}
I use a MacBook Pro 13" Late 2013, running Mac OS X Lion.
OpenCL only detects the CPU.
The CPU, an Intel Ivy Bridge i5 at 2.6 GHz, has an integrated GPU with 1 GB at 1.6 GHz (Intel HD Graphics 4000).
The maximum detected work group size is 1024.
When I run the flat code above (with 1024 nodes), it takes around 17 microseconds.
When I run its parallel version using the OpenCL C++ library, it takes several times as long, around 87 microseconds
(excluding the program creation, buffer allocation and writing).
What am I doing wrong here?
NB: the OpenCL kernel for this algorithm is obvious to guess, but I can post it if needed.
Thanks in advance.
EDIT #1: THE KERNEL CODE
__kernel void _computeDist(
    __global void* nodeVector1D,
    const unsigned int nodeVectorSize,
    const unsigned int itemsize,
    __global const double* input)
{
    double d = 0.;
    int i, c;
    __global double* n;
    i = get_global_id(0);
    if (i >= nodeVectorSize) return;
    // pointer arithmetic on a __global char* keeps the address space correct
    n = (__global double*)((__global char*)nodeVector1D + i*itemsize);
    for (c = 0; c < 3; c++) {
        d += (input[c] - n[c])*(input[c] - n[c]);
    }
    n[3] = d;
}
Sorry for the void-pointer arithmetic, but it works (no seg fault).
I can also post the OpenCL initialization routine, but I think it's all over the Internet. However, I will post it, if someone asks.
@pmdj: As I said above, OpenCL recognizes my CPU; otherwise I wouldn't have been able to run the tests and get the performance results presented above.
@pmdj: OpenCL kernel code, to my knowledge, is always written in C. However, I tagged C++ because (as I said above) I'm using the OpenCL C++ bindings.
I finally found the issue.
The problem was that OpenCL on Mac OS X returns a wrong maximum device work group size of 1024.
I tested various work group sizes and ended up with a 200% performance gain when using a work group size of 128 work items per group.
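For anyone hitting the same thing, the local size is just the fourth argument of enqueueNDRangeKernel in the C++ bindings; a small sketch (my addition; 'queue', 'kernel' and 'globalSize' are placeholders for your own objects):

#include <CL/cl.hpp>   // OpenCL 1.2 C++ bindings

// Enqueue the kernel with an explicit local size of 128 work items.
// globalSize should be rounded up to a multiple of 128.
void runWithLocalSize128(cl::CommandQueue& queue, cl::Kernel& kernel, size_t globalSize)
{
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(globalSize), cl::NDRange(128));
    queue.finish();
}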
Here is a clearer benchmark chart (not reproduced here): the X-axis is the array size, the Y-axis is the duration in microseconds, and IGPU stands for integrated GPU.

Faster algorithm to check the colors in a image

Suppose I am given a 2048x2048 image and I want to know the total number of colors present in it. What is the fastest possible algorithm? I came up with two algorithms, but they are slow.
Algorithm 1:
Compare the current pixel and the next pixel, and if they are different,
check a temporary variable, which contains all the detected colors, to see whether the color is present or not;
if not present, add it to the array (list) and increment noOfColors.
This algorithm works but is slow. For a 1600x1200 pixel image it takes around 3 seconds.
Algorithm 2:
The obvious method of checking each pixel against all other pixels, recording the number of occurrences of each color and incrementing the count. This is very, very slow, almost like a hung app. So is there any better approach? I need all the pixel info.
You could use std::set (or std::unordered_set) and simply do a single loop through the pixels, adding the colors to the set. Then the number of colors is the size of the set.
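A minimal sketch of that idea, assuming the pixels are packed 0x00RRGGBB values in a contiguous buffer (an assumption, not something from the question):

#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

std::size_t countColors(const std::vector<std::uint32_t>& pixels)
{
    std::unordered_set<std::uint32_t> colors;
    colors.reserve(pixels.size());          // avoid rehashing inside the loop
    for (std::uint32_t p : pixels)
        colors.insert(p & 0x00FFFFFFu);     // ignore any alpha byte
    return colors.size();
}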
Well, this is well suited for parallelization. Split the image into several parts and execute the algorithm for each part in a separate task. To avoid syncing, each task should have its own storage for the unique colors. When all tasks are done, you aggregate the results.
DRAM is dirt cheap. Use brute force: fill a table, count.
On a Core 2 Duo @ 3.0 GHz:
0.35 secs for 4096x4096 32-bit RGB
0.20 secs after some trivial parallelization (I know nothing of OMP)
However, if you are to use 64-bit RGB (one channel = 16 bits) it is another question (not enough memory); you would probably need a good hash table function.
Using random pixels, the same size takes 10 secs.
Remark: at 0.15 secs, the std::bitset<> solution is faster (it gets slower when trivially parallelized!).
Solution, C++11:
#include <vector>
#include <random>
#include <iostream>
#include <boost/chrono.hpp>

#define _16M (256*256*256)

typedef union {
    struct { unsigned char r, g, b, n; } r_g_b_n;
    unsigned char rgb[4];
    unsigned i_rgb;
} RGB;

RGB make_RGB(unsigned char r, unsigned char g, unsigned char b) {
    RGB res;
    res.r_g_b_n.r = r;
    res.r_g_b_n.g = g;
    res.r_g_b_n.b = b;
    res.r_g_b_n.n = 0;
    return res;
}

static_assert(sizeof(RGB) == 4, "bad RGB size not 4");
static_assert(sizeof(unsigned) == 4, "bad i_RGB size not 4");

struct Image
{
    Image(unsigned M, unsigned N) : M_(M), N_(N), v_(M*N) {}
    const RGB* tab() const { return &v_[0]; }
    RGB* tab() { return &v_[0]; }
    unsigned M_, N_;
    std::vector<RGB> v_;
};

void FillRandom(Image& im) {
    std::uniform_int_distribution<unsigned> rnd(0, _16M - 1);
    std::mt19937 rng;
    const int N = im.M_ * im.N_;
    RGB* tab = im.tab();
    for (int i = 0; i < N; i++) {
        unsigned r = rnd(rng);
        *tab++ = make_RGB((r & 0xFF), (r >> 8 & 0xFF), (r >> 16 & 0xFF));
    }
}

size_t Count(const Image& im) {
    const int N = im.M_ * im.N_;
    std::vector<char> count(_16M, 0);
    const RGB* tab = im.tab();
    // index with i so every iteration is independent (a shared tab++ would race)
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        count[tab[i].i_rgb] = 1;
    }
    size_t nColors = 0;
    // the reduction clause makes the parallel sum race-free
    #pragma omp parallel for reduction(+:nColors)
    for (int i = 0; i < _16M; i++) nColors += count[i];
    return nColors;
}

int main() {
    Image im(4096, 4096);
    FillRandom(im);
    typedef boost::chrono::high_resolution_clock hrc;
    auto start = hrc::now();
    std::cout << " # colors " << Count(im) << std::endl;
    boost::chrono::duration<double> sec = hrc::now() - start;
    std::cout << " took " << sec.count() << " seconds\n";
    return 0;
}
The only feasible algorithm here is building a sort of histogram of the image colors. The only difference in your case is that instead of calculating the population of each color, you only need to know whether it is zero or not.
Depending on which color space you work in, you may either use std::set to tag existing colors (as Joachim Pileborg suggested) or use something like std::bitset, which is obviously faster. This depends on how many distinct colors exist in your color space.
Also, as Marius Bancila noted, this procedure is a perfect match for parallelization. Calculate the histogram-like data for parts of the image, then merge it. Naturally the image division should be based on its memory partition, not on geometric properties. In simple words, split the image vertically (by batches of scan lines), not horizontally.
And, if possible, you should use some low-level library or code to run through the pixels, or write your own. At the very least you should obtain a pointer to each scan line and run over its pixels in a batch, rather than doing something like GetPixel for each pixel.
The point here is that the ideal representation of an image as a 2D array of colors is not necessarily the way the image is stored in memory (color components can be arranged in "planes", there can be "padding", etc.), so getting the pixels with a GetPixel-like function may take time.
The question may even be somewhat meaningless if the image is not the result of a "vectorial draw". Think of a photograph: between two nearby "greens" you find all the shades of green, so the colors are, in this case, no more and no less than the ones supported by the encoding of the image itself (2^24, or 256, or 16, or ...). So, unless you are interested in the color distribution (how differently used the colors are), just counting them makes very little sense.
A workaround can be:
1. Create an in-memory bitmap having pixels in a "single plane" format.
2. Blit your image into that bitmap using BitBlt or similar (this lets the OS do the pixel conversion from the GPU, if any).
3. Get the bitmap bits (this lets you access the stored values).
4. Run your "counting algorithm" (whatever it is) on those values.
Note that steps 1 and 2 can be avoided if you already know that the image is in planar format.
If you have a multicore system, step 4 can also be assigned to different threads, each working on part of the image; a sketch of steps 3 and 4 follows.
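A rough sketch of steps 3 and 4 (my addition; it assumes a hypothetical 32-bit-per-pixel buffer with a row stride in bytes, which is what a locked bitmap usually gives you):

#include <cstdint>

// 'data' points to the first scan line; 'strideBytes' is the distance between
// the starts of consecutive rows (it may exceed width * 4 because of padding).
void visitPixels(const std::uint8_t* data, int width, int height, int strideBytes)
{
    for (int y = 0; y < height; ++y)
    {
        const std::uint32_t* row =
            reinterpret_cast<const std::uint32_t*>(data + y * strideBytes);
        for (int x = 0; x < width; ++x)
        {
            std::uint32_t color = row[x] & 0x00FFFFFFu;  // drop the alpha/padding byte
            // ...feed 'color' into whatever counting structure you chose
            //    (set, bitset, histogram array), possibly one per thread.
            (void)color;
        }
    }
}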
You can use std::bitset, which allows you to set individual bits and has a count function.
You have a bit for each colour: there are 256 values for each of R, G and B, so that's 256*256*256 bits (16,777,216 colours). The bitset uses a byte for every 8 bits, so it will use 2 MB.
Use the pixel colour as an index into the bitset:
std::bitset<256*256*256> colours;
for (int pixel : pixels) {
    colours[pixel] = true;
}
colours.count();
This has linear complexity.
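One caveat worth adding (mine, not from the answer above): 2^24 bits is about 2 MB, which can overflow the default stack if the bitset is a local variable. Making it static (or heap-allocating it) avoids that, e.g.:

static std::bitset<256*256*256> colours;    // static: zero-initialized, not on the stack
for (int pixel : pixels)
    colours.set(pixel & 0x00FFFFFF);        // mask in case the pixels carry an alpha byte
std::size_t distinct = colours.count();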
Latecomer to this answer, but I could not help it, since this algorithm is brutally fast; it was developed about two or more decades ago, when it really mattered.
3-D Lookup Table Color Matching
http://www.ddj.com/cpp/184403257
Basically, it creates a 3D color lookup table and the search is very fast. I've made some modifications to suit my purpose of image binarization, reducing the color space from ff ff ff to f f f, which makes it another 10 times faster. Right out of the box, I haven't found anything even close, including hash tables.
char * creatematcharray(struct rgb_color *palette, int palettesize)
{
int rval=16, gval=16, bval=16, len, r, g, b;
char *taken, *match, *same;
int i, set, sqstep, tp, maxtp, *entryr, *entryg, *entryb;
char *table;
len=rval*gval*bval;
// Prepare table buffers:
size_t size_of_table = len*sizeof(char);
table=(char *)malloc(size_of_table);
if (table==nullptr) return nullptr;
// Select colors to use for fill:
set=0;
size_t size_of_taken = (palettesize * sizeof(int) * 3) +
(palettesize*sizeof(char)) + (len * sizeof(char));
taken=(char *)malloc(size_of_taken);
same=taken + (len * sizeof(char));
entryr=(int*)(same + (palettesize * sizeof(char)));
entryg=entryr + palettesize;
entryb=entryg + palettesize;
if (taken==nullptr)
{
free((void *)table);
return nullptr;
}
std::memset((void *)taken, 0, len * sizeof(char));
// std::cout << "sizes: " << size_of_table << " " << size_of_taken << std::endl;
match=table;
for (i=0; i<palettesize; i++)
{
same[i]=0;
// Compute 3d-table coordinates of palette rgb color:
r=palette[i].r&0x0f, g=palette[i].g&0x0f, b=palette[i].b&0x0f;
// Put color in position:
if (taken[b*rval*gval+g*rval+r]==0) set++;
else same[match[b*rval*gval+g*rval+r]]=1;
match[b*rval*gval+g*rval+r]=i;
taken[b*rval*gval+g*rval+r]=1;
entryr[i]=r; entryg[i]=g; entryb[i]=b;
}
// ### Fill match_array by steps: ###
for (set=len-set, sqstep=1; set>0; sqstep++)
{
for (i=0; i<palettesize && set>0; i++)
if (same[i]==0)
{
// Fill all six sides of incremented cube (by pairs, 3 loops):
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b+=sqstep*2)
if (b>=0 && b<bval)
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r++)
if (r>=0 && r<rval)
{ // Draw one 3d line:
tp=b*rval*gval+(entryg[i]-sqstep)*rval+r;
maxtp=b*rval*gval+(entryg[i]+sqstep)*rval+r;
if (tp<b*rval*gval+0*rval+r)
tp=b*rval*gval+0*rval+r;
if (maxtp>b*rval*gval+(gval-1)*rval+r)
maxtp=b*rval*gval+(gval-1)*rval+r;
for (; tp<=maxtp; tp+=rval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
}
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g+=sqstep*2)
if (g>=0 && g<gval)
for (b=entryb[i]-sqstep; b<=entryb[i]+sqstep; b++)
if (b>=0 && b<bval)
{ // Draw one 3d line:
tp=b*rval*gval+g*rval+(entryr[i]-sqstep);
maxtp=b*rval*gval+g*rval+(entryr[i]+sqstep);
if (tp<b*rval*gval+g*rval+0)
tp=b*rval*gval+g*rval+0;
if (maxtp>b*rval*gval+g*rval+(rval-1))
maxtp=b*rval*gval+g*rval+(rval-1);
for (; tp<=maxtp; tp++)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
}
for (r=entryr[i]-sqstep; r<=entryr[i]+sqstep; r+=sqstep*2)
if (r>=0 && r<rval)
for (g=entryg[i]-sqstep; g<=entryg[i]+sqstep; g++)
if (g>=0 && g<gval)
{ // Draw one 3d line:
tp=(entryb[i]-sqstep)*rval*gval+g*rval+r;
maxtp=(entryb[i]+sqstep)*rval*gval+g*rval+r;
if (tp<0*rval*gval+g*rval+r)
tp=0*rval*gval+g*rval+r;
if (maxtp>(bval-1)*rval*gval+g*rval+r)
maxtp=(bval-1)*rval*gval+g*rval+r;
for (; tp<=maxtp; tp+=rval*gval)
if (!taken[tp])
taken[tp]=1, match[tp]=i, set--;
}
}
}
free((void *)taken);
return table;
}
The answer: unordered_map
I use unordered_map, based on my testing.
You should test, because your compiler/library may exhibit different performance. Comment out #define USEHASH to use map instead.
On my machine, the vanilla unordered_map (a hash implementation) is about twice as fast as map. Since different compilers and libraries can vary enormously, you must test to see which is better. In production, I build a fake image on the first start of the app, run both algorithms on it and time them, save an indication of which one is faster, and then preferentially use that for all subsequent starts on that machine. It's nit-picky, but hey, the user's time is valuable to them.
For a DSLR image with 12,106,244 pixels (about 12 megapixels, not a typo) and 11,857,131 distinct colors (also not a typo), map takes about 14 seconds, while unordered map takes about 7 seconds:
Test Code:
#define USEHASH 1
#ifdef USEHASH
#include <unordered_map>
#endif
size = im->xw * im->yw;
#ifdef USEHASH
// unordered_map is about twice as fast as map on my mac with qt5
// --------------------------------------------------------------
#include <unordered_map>
std::unordered_map<qint64, unsigned char> colors;
colors.reserve(size); // pre-allocate the hash space
#else
std::map<qint64, unsigned char> colors;
#endif
...use of either is in a loop where I build a 48-bit value of 0RGB in a 64-bit variable corresponding to the 16-bit RGB values of the image pixels, like so:
for (i=0; i<size; i++)
{
pel = BUILDPEL(i); // macro just shovels 0RGB into 64 bit pel from im
// You'd do the same for your image structure
// in whatever way is fastest for you
colors[pel] = 1;
}
cc = colors.size();
// time here: 14 secs for map, 7 secs for unordered_map with
// 12,106,244 pixels containing 11,857,131 colors on 12/24 core,
// 3 GHz, 64GB machine.

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 to 30 million floating-point values in an array.
Several methods perform different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

for (float x = 0f; x < 100f; x += 0.0001f) {
    int noOfOccurrences = 0;
    foreach (float y in largeFloatingPointArray) {
        if (x == y) {
            noOfOccurrences++;
        }
    }
    noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
UPDATE GPU Version
__global__ void hash(float *largeFloatingPointArray, int largeFloatingPointArraySize,
                     int *dictionary, int size, int num_blocks)
{
    int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
    float y;                                         // compute one (or more) floats
    int noOfOccurrences = 0;
    int a;

    while (x < size)          // While there is work to do, each thread will:
    {
        dictionary[x] = 0;    // Initialize the position it will work on
        noOfOccurrences = 0;

        for (int j = 0; j < largeFloatingPointArraySize; j++) // Search for floats
        {                                                     // that are equal to it
            y = largeFloatingPointArray[j]; // Take a candidate from the floats array
            y *= 10000;                     // e.g. if y = 0.0001f;
            a = y + 0.5;                    // a = 1 + 0.5 = 1;
            if (a == x) noOfOccurrences++;
        }

        dictionary[x] += noOfOccurrences; // Update in the dictionary the number
                                          // of times that the float appears

        x += blockDim.x * gridDim.x;      // Move on to the next position this thread works on
    }
}
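For completeness, a sketch of how this kernel might be launched from the host (the host array names and the launch configuration below are placeholders, not part of my original code):

// Host-side launch sketch: one int bin per 0.0001 step in [0, 100).
const int dictionarySize = 100 * 10000;                  // 1,000,000 bins
int*   d_dictionary = NULL;
float* d_values     = NULL;
cudaMalloc(&d_dictionary, dictionarySize * sizeof(int));
cudaMalloc(&d_values, largeFloatingPointArraySize * sizeof(float));
cudaMemcpy(d_values, hostValues,
           largeFloatingPointArraySize * sizeof(float), cudaMemcpyHostToDevice);

const int threads = 256;
const int blocks  = 128;                                 // the grid-stride loop covers the rest
hash<<<blocks, threads>>>(d_values, largeFloatingPointArraySize,
                          d_dictionary, dictionarySize, blocks);
cudaDeviceSynchronize();

cudaMemcpy(hostDictionary, d_dictionary,
           dictionarySize * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_values);
cudaFree(d_dictionary);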
I have only tested this for smaller inputs, because I am testing on my laptop. Nevertheless, it is working, but more tests are needed.
UPDATE Sequential Version
I just wrote this naive version that executes your algorithm for an array with 30,000,000 elements in less than 20 seconds (including the time taken by the function that generates the data).
This naive version first sorts your array of floats. Afterwards, it goes through the sorted array, checks the number of times a given value appears in the array, and then puts this value in a dictionary along with the number of times it appears.
You can use a sorted map instead of the unordered_map that I used.
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>

typedef std::tr1::unordered_map<float, int> Mymap;

void generator(float *data, long int size)
{
    float LO = 0.0;
    float HI = 100.0;
    for (long int i = 0; i < size; i++)
        data[i] = LO + (float)rand() / ((float)RAND_MAX / (HI - LO));
}

void print_array(float *data, long int size)
{
    for (long int i = 2; i < size; i++)
        printf("%f\n", data[i]);
}

std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
    float previous = data[0];
    int count = 1;
    std::tr1::unordered_map<float, int> dict;
    for (long int i = 1; i < size; i++)
    {
        if (previous == data[i])
            count++;
        else
        {
            dict.insert(Mymap::value_type(previous, count));
            previous = data[i];
            count = 1;
        }
    }
    dict.insert(Mymap::value_type(previous, count)); // add the last member
    return dict;
}

void printMAP(std::tr1::unordered_map<float, int> dict)
{
    for (std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
    {
        std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
    }
}

int main(int argc, char** argv)
{
    int size = 1000000;
    if (argc > 1) size = atoi(argv[1]);
    printf("Size = %d\n", size);

    // Heap allocation: a stack array of 30,000,000 floats would overflow the stack.
    float *data = new float[size];
    std::tr1::unordered_map<float, int> dict;

    generator(data, size);
    std::sort(data, data + size);
    dict = fill_dict(data, size);

    delete[] data;
    return 0;
}
If you have the thrust library installed on your machine, you should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this:
std::sort(data, data + size);
It will surely be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran an entirely GPU-based Molecular Dynamics simulation. One of the kernels, which calculated the force between pairs of particles, received as parameters 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).
So if you are planning to put 30 million floating-point values on the GPU, which is about 114 MB of global memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with Molecular Dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version takes 45 minutes. You said your application takes a couple of hours; based on your code example, it also looks lighter than MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
                    double *x, double *y, double *z, ...)
{
    int pos = (threadIdx.x + blockIdx.x * blockDim.x);
    ...
    while (pos < particles)
    {
        for (i = 0; i < particles; i++)
        {
            if (/* inside of the same radius */)
            {
                // calculate force
            }
        }
        pos += blockDim.x * gridDim.x;
    }
}
A simple example of CUDA code is the sum of two arrays.
In C:
for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];
In CUDA:
__global__ void add(int *c, int *a, int *b, int N)
{
    int pos = threadIdx.x + blockIdx.x * blockDim.x;
    for (; pos < N; pos += blockDim.x * gridDim.x)
        c[pos] = a[pos] + b[pos];
}
In CUDA you basically take each iteration of the for loop and assign it to a thread:
1) threadIdx.x + blockIdx.x*blockDim.x;
Each block has an ID from 0 to N-1 (N being the maximum number of blocks) and each block has 'X' threads with an ID from 0 to X-1.
This gives you the for-loop iteration that each thread will compute, based on its own ID and the ID of the block the thread is in; blockDim.x is the number of threads a block has.
So if you have 2 blocks, each with 10 threads, and N=40, then:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ void hash(float *largeFloatingPointArray, int *dictionary)
{
    // You can turn the dictionary into one array of int
    // where each position represents a float:
    // since x goes from 0f to 100f in steps of 0.0001f,
    // pos 0 has the same meaning as 0f,
    // pos 1 means the float 0.0001f,
    // pos 2 means the float 0.0002f, etc.
    // Then you use the int at each position
    // to count how many times that "float" appears.

    int x = blockIdx.x;   // Each block takes a different x to work on
    float y;

    while (x < 1000000)   // x < 100f (for an incremental step of 0.0001f)
    {
        int noOfOccurrences = 0;
        float z = converting_int_to_float(x); // This (sketched) function converts x to the
                                              // float you use (x * 0.0001f)

        // Each thread of the block takes candidates y from largeFloatingPointArray
        for (int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
        {
            y = largeFloatingPointArray[j];
            if (z == y)
            {
                noOfOccurrences++;
            }
        }

        // Every thread adds its partial count; atomicAdd keeps the updates race-free
        atomicAdd(&dictionary[x], noOfOccurrences);
        __syncthreads();

        x += gridDim.x;   // Move this block on to the next x
    }
}
You have to use atomicAdd because several threads may update the same dictionary entry concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr. Dobb's Journal series "CUDA: Supercomputing for the masses" by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
And others:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look at the last item; you will find many links for learning CUDA.
OpenCL: OpenCL Tutorials | MacResearch
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating-point number repetitively like that; the rounding error will add up and you won't get what you intended. I've added an if statement to my sample below to check whether inputs match your pattern of iteration, but omit it if you don't actually need that.
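To make the rounding point concrete, here is a tiny check (mine, in C++ rather than C#) showing that repeatedly adding 0.0001f drifts away from the exact grid values:

#include <cstdio>

int main()
{
    float x = 0.0f;
    for (int i = 0; i < 1000000; ++i)   // nominally 0.0 .. 100.0 in 0.0001 steps
        x += 0.0001f;
    std::printf("accumulated: %.6f (exact would be 100.000000)\n", x);
    // Prints something close to, but not exactly, 100; the intermediate values
    // will also rarely compare equal to values computed directly as i * 0.0001f.
    return 0;
}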
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();

foreach (float x in largeFloatingPointArray)
{
    if (Math.Truncate(x/0.0001f)*0.0001f == x)
    {
        if (noOfNumbers.ContainsKey(x))
            noOfNumbers[x] = noOfNumbers[x] + 1;
        else
            noOfNumbers.Add(x, 1);
    }
}
Hope this helps.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely yes: this kind of algorithm is typically an ideal candidate for massive data-parallel processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives: CUDA or OpenCL.
CUDA is mature, with a lot of tools, but is centered on NVIDIA GPUs.
OpenCL is a standard that runs on NVIDIA and AMD GPUs, and on CPUs too, so you should really favour it.
For a tutorial you have an excellent series on CodeProject by Rob Farber: http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use case there are a lot of samples for histogram building with OpenCL (note that many are image histograms, but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.
In addition to the suggestion by the above poster, use the TPL (Task Parallel Library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.ForEach and ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks each generating a dictionary which would then be reduced to a single dictionary, would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
I am not sure whether using GPUs would be a good match, given that 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.
I think turning this single-process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending on how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use the floating-point value as the key and the number of occurrences in the array as the value. The worst-case scenario is that each value occurs only once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run, then an in-memory hash table makes sense. If it is static, the table could be saved in a key-value database such as Berkeley DB. Let's call this the 'lookup' system.
On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, then 'gather' the results as they become available. E.g. a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back an array of occurrences and is ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If the lookup system becomes a bottleneck, it can be replicated on as many systems as needed.
With a large enough number of systems working in parallel, it should be possible to reduce the processing time to minutes.
I am working on a compiler for parallel programming in C targeted at many-core-based systems, often referred to as microservers, that are, or will be, built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication (IPC) across systems. One of the IPC mechanisms available is sockets/TCP/IP.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
#include <stdio.h>
#include <stdlib.h>

/*
 * Convert the X range from 0f to 100f in steps of 0.0001f
 * into a range of integers 0 to 1 + (100 * 10000) to use as an
 * index into an array.
 */
#define X_MAX (1 + (100 * 10000))

/*
 * Number of floats in largeFloatingPointArray needs to be defined
 * below to be whatever your value is.
 */
#define LARGE_ARRAY_MAX (1000)

int main()
{
    int j, y, *noOfOccurances;
    float *largeFloatingPointArray;

    /*
     * Allocate memory for largeFloatingPointArray and populate it.
     */
    largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
    if (largeFloatingPointArray == 0) {
        printf("out of memory\n");
        exit(1);
    }

    /*
     * Allocate memory to hold noOfOccurances. The index/10000 is the
     * floating point number. The content is the count.
     *
     * E.g. noOfOccurances[12345] = 20 means 1.2345f occurs 20 times
     * in largeFloatingPointArray.
     */
    noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
    if (noOfOccurances == 0) {
        printf("out of memory\n");
        exit(1);
    }

    for (j = 0; j < LARGE_ARRAY_MAX; j++) {
        y = (int)(largeFloatingPointArray[j] * 10000);
        if (y >= 0 && y < X_MAX) {   /* < X_MAX: valid indices are 0 .. X_MAX-1 */
            noOfOccurances[y]++;
        }
    }

    return 0;
}
}