Poor performance of large number of viz::Widget3D in OpenCV

Using several thousands of viz::Widget3D in OpenCV is very slow. I tried v3.4.3 and v4.0.0 on Windows with Visual Studio 2017. This code snippet takes over 5s to execute the timed part (t0 to t1) and viewing is very choppy afterwards:
#include <chrono>
#include <string>
#include <fmt/format.h>
#include <opencv2/viz.hpp>
using namespace std;
using namespace cv;
int main()
{
constexpr double n = 100;
viz::Viz3d window("Viz3d");
window.setFullScreen();
window.showWidget("Coordinate Widget", viz::WCoordinateSystem());
window.spinOnce();
auto t0 = chrono::high_resolution_clock::now();
for (double x = 0; x < n; x += 1)
for (double y = 0; y < n; y += 1)
window.showWidget(to_string(x+y*n), viz::WArrow({x, y, 0}, {x+1, y+1, 0}, 0.02, viz::Color::bluberry()));
auto t1 = chrono::high_resolution_clock::now();
window.spin();
fmt::print("\nTime: {}ms", chrono::duration_cast<chrono::milliseconds>(t1-t0).count());
fmt::print("\nVersion {}.{}.{}{}\n", CV_VERSION_MAJOR, CV_VERSION_MINOR, CV_VERSION_REVISION, CV_VERSION_STATUS);
return 0;
}
It seems that widget management imposes a huge overhead. Is there any other way to display thousands of widgets (text, lines, arrow) with low latency? I tried viz::WWidgetMerger and it's even slower.
EDIT
BTW, I need only "immediate" mode rendering. I'm not modifying the widgets after they are shown.

If you have thousands of widgets, adding them all to a single WWidgetMerger will be ungodly slow (if you were to time each individual addition, you would see the exponential slowdown).
However, if you instead have multiple mergers it should be smooth with absolutely no choppiness when displaying, while only being a little bit slower on creation.
You can try doing something similar to this:
std::vector<viz::WWidgetMerger> mergers(n);
for (double x = 0; x < n; x += 1) {
viz::WWidgetMerger merger = mergers[x];
for (double y = 0; y < n; y += 1) {
merger.addWidget(viz::WArrow({x, y, 0}, {x+1, y+1, 0}, 0.02, viz::Color::bluberry()));
}
merger.finalize();
window.showWidget("merger_" + std::to_string(x), merger);
}
Performance differences:
Your code as is: Time: 2522ms, extremely choppy to the point of being almost unusable.
Using multiple mergers: Time: 6097ms, buttery smooth, no choppiness.
Using a single merger: Time > 5 minutes. I didn't even wait for it to finish because it was taking so long.
So while using multiple mergers is slower initially when creating the widgets, I find it to be worth it for the display performance.

Threads are slow c++

I'm trying to draw a Mandelbrot set and want to use 4 threads to do the calculation at the same time, each on a different part of the image. Here are the functions:
void Mandelbrot(int x_min,int x_max,int y_min,int y_max,Image &im)
{
for (int i = y_min; i < y_max; i++)
{
for (int j = x_min; j < x_max; j++)
{
//scaled x and y cordinate
double x0 = mape(j, 0, W, MinX, MaxX);
double y0 = mape(i, 0, H, MinY, MaxY);
double x = 0.0f;
double y = 0.0f;
int iteration = 0;
double z = 0;
while (abs(z)<2.0f && iteration < maxIteration)// && iteration < maxIteration)
{
double xtemp = x * x - y * y + x0;
y = 2 * x * y + y0;
x = xtemp;
iteration++;
z = x * x + y * y;
if (z > 10)//must be 10
break;
}
int b =mape(iteration, 0, maxIteration, 0, 255);
if (iteration == maxIteration)
b = 0;
im.setPixel(j, i, Color(b,b,0));
}
}
}
The mape function just converts a number from one range to another.
Here is the thread function
void th(Image& im)
{
float size = (float)im.getSize().x / num_th;
int x_min = 0, x_max = size, y_min = 0, y_max = im.getSize().y;
thread t[num_th];
for (size_t i = 0; i < num_th; i++)
{
t[i] = thread(Mandelbrot, x_min, x_max, y_min, y_max, ref(im));
x_min = x_max;
x_max += size;
}
for (size_t i = 0; i<num_th; i++)
{
t[i].join();
}
}
The main function looks like this
int main()
{
Image img;
while(1)//here is while window.open()
{
th(img);
//here im drawing
}
}
So I am not getting any performance boost; it even gets slower. Can anyone tell me where the problem is and what I'm doing wrong? It has happened to me before too.
I saw a comment asking what Image is: it's a class from the SFML library, though I don't know if this is of any help.

Your code is too incomplete to answer concretely, but there are a few suspects:
Spawning a thread has non-trivial overhead. If the amount of work performed by the thread is not large enough, the overhead of launching it may cost more than any gains you would get through parallelism.
Excessive locking and contention. This does not look like a problem in your code, as you don't seem to use any locks at all. Be careful, though: the code is only correct as long as the threads don't write to the same addresses.
False sharing: Possible problem in your code. Cache lines tend to be 64 bytes. Any write to any portion of a cache line causes the whole line to be committed to memory. If two threads are looking at the same cache line and one of them writes to it, even if all the other threads use a different part of that cache line, they all will have their copy invalidated and will have to re-fetch. This can cause significant problems if multiple threads work in non-overlapping data that share a cache line and cause these invalidations. If they iterate at the same rate through the same data, it can cause this problem to recur over and over. This problem can be significant, and always worth considering.
Memory layout causing your cache to be thrashed. While walking through an array, going "across" may align with the actual memory layout, reading one full cache line after another, but scanning "vertically" touches one portion of a cache line and then jumps to the corresponding portion of another cache line. If this happens in many threads and you have a lot of memory to churn through, it can mean that your cache is vastly underutilized. So be aware of whether your data is stored in row-major or column-major order, write code to match it, and avoid jumping around in memory.
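As a rough illustration of the last two points, here is a minimal sketch that gives each thread a band of whole rows and writes into a plain row-major buffer, so every thread walks memory sequentially and threads only share cache lines at band boundaries. It is not the asker's code: the names mandel_iterations, render_rows and render_parallel, the byte buffer standing in for the SFML Image, and the fixed mapping to the region [-2, 1) x [-1, 1) are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Escape-time iteration count for the point (x0, y0).
int mandel_iterations(double x0, double y0, int max_iter) {
    double x = 0.0, y = 0.0;
    int it = 0;
    while (x * x + y * y <= 4.0 && it < max_iter) {
        const double xt = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xt;
        ++it;
    }
    return it;
}

// Renders rows [row_begin, row_end) into a row-major buffer (one byte per pixel).
void render_rows(std::vector<std::uint8_t>& buf, int width, int height,
                 int row_begin, int row_end, int max_iter) {
    for (int i = row_begin; i < row_end; ++i) {
        for (int j = 0; j < width; ++j) {
            const double x0 = -2.0 + 3.0 * j / width;   // assumed view: x in [-2, 1)
            const double y0 = -1.0 + 2.0 * i / height;  // assumed view: y in [-1, 1)
            const int it = mandel_iterations(x0, y0, max_iter);
            buf[static_cast<std::size_t>(i) * width + j] =
                it == max_iter ? 0 : static_cast<std::uint8_t>(255 * it / max_iter);
        }
    }
}

// Splits the image into horizontal bands, one band per thread.
std::vector<std::uint8_t> render_parallel(int width, int height, int max_iter,
                                          int num_threads) {
    std::vector<std::uint8_t> buf(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> workers;
    const int band = (height + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
        const int begin = std::min(height, t * band);
        const int end = std::min(height, begin + band);
        workers.emplace_back(render_rows, std::ref(buf), width, height,
                             begin, end, max_iter);
    }
    for (auto& w : workers) w.join();
    return buf;
}
```

Calling render_parallel(800, 600, 200, 4) and copying the buffer into the image once per frame keeps the per-pixel work free of shared writes. Whether this actually beats the single-threaded version still depends on how much work each frame contains compared with the cost of starting the threads.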

Parallelism vs Threading - Performance

I have been reading on the subject, but I haven't been able to find a concrete answer to my question. I am interested in using parallelism/multithreading to improve the performance of my game, but I have heard some contradicting claims, for example that multithreading may not produce any improvement in a game's execution speed.
I have thought of two ways to do this:
Putting the rendering component into a thread. There are some things I would need to change, but I have a good idea of what needs to be done.
Using OpenMP to parallelize the rendering function. I already have code to do so, so this might be the easier option.
Since this is a Uni assessment, the target hardware is my Uni's computers, which are multi-core (4 cores), so I am hoping to achieve some additional efficiency using either one of those techniques.
My question is therefore the following: Which one should I prefer? Which normally produces the best results?
EDIT: The main function I mean to parallelize/multithread away:
void Visualization::ClipTransBlit ( int id, Vector2i spritePosition, FrameData frame, View *view )
{
const Rectangle viewRect = view->GetRect ();
BYTE *bufferPtr = view->GetBuffer ();
Texture *txt = txtMan_.GetTexture ( id );
Rectangle clippingRect = Rectangle ( 0, frame.frameSize.x, 0, frame.frameSize.y );
clippingRect.Translate ( spritePosition );
clippingRect.ClipTo ( viewRect );
Vector2i negPos ( -spritePosition.x, -spritePosition.y );
clippingRect.Translate ( negPos );
if ( spritePosition.x < viewRect.left_ ) { spritePosition.x = viewRect.left_; }
if ( spritePosition.y < viewRect.top_ ) { spritePosition.y = viewRect.top_; }
if (clippingRect.GetArea() == 0) { return; }
//clippingRect.Translate ( frameData );
BYTE *destPtr = bufferPtr + ((abs(spritePosition.x) - abs(viewRect.left_)) + (abs(spritePosition.y) - abs(viewRect.top_)) * viewRect.Width()) * 4; // corner position of the sprite (top left corner)
BYTE *tempSPtr = txt->GetData() + (clippingRect.left_ + clippingRect.top_ * txt->GetSize().x) * 4;
int w = clippingRect.Width();
int h = clippingRect.Height();
int endOfLine = (viewRect.Width() - w) * 4;
int endOfSourceLine = (txt->GetSize().x - w) * 4;
for (int i = 0; i < h; i++)
{
for (int j = 0; j < w; j++)
{
if (tempSPtr[3] != 0)
{
memcpy(destPtr, tempSPtr, 4);
}
destPtr += 4;
tempSPtr += 4;
}
destPtr += endOfLine;
tempSPtr += endOfSourceLine;
}
}

Instead of calling memcpy for each pixel, consider just assigning the value directly. The overhead of calling a function that many times could dominate the overall execution time of this loop. E.g.:
for (int i = 0; i < h; i++)
{
for (int j = 0; j < w; j++)
{
if (tempSPtr[3] != 0)
{
*((DWORD*)destPtr) = *((DWORD*)tempSPtr);
}
destPtr += 4;
tempSPtr += 4;
}
destPtr += endOfLine;
tempSPtr += endOfSourceLine;
}
You could also avoid the conditional by employing one of the tricks mentioned here (avoiding conditionals); in such a tight loop, conditionals can be very expensive.
Edit:
As to whether it's better to run several instances of ClipTransBlit concurrently or to parallelize ClipTransBlit internally: generally speaking, it's better to implement parallelization at as high a level as possible, to reduce the overhead you incur by setting it up (creating threads, synchronizing them, etc.).
In your case though because it looks like you're drawing sprites, if they were to overlap then without additional synchronization your high level threading might lead to nasty visual artifacts and even a race condition on checking the alpha bit. In that case the low level parallelism might be a better choice.
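One common way to drop the conditional is to turn the alpha test into a bit mask and select between source and destination with bitwise operations. The sketch below shows that idea for a single row of 4-byte pixels. The function name blit_row_branchless is made up for the example, and the fixed-size std::memcpy calls are used only as an aliasing-safe way to load and store 32 bits (compilers typically reduce them to a single move), so treat it as an illustration rather than a drop-in replacement.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies w RGBA pixels from src to dest, skipping pixels whose alpha byte is 0,
// without branching on the alpha value.
void blit_row_branchless(std::uint8_t* dest, const std::uint8_t* src, int w) {
    for (int j = 0; j < w; ++j) {
        std::uint32_t s, d;
        std::memcpy(&s, src + 4 * j, 4);   // load source pixel
        std::memcpy(&d, dest + 4 * j, 4);  // load destination pixel
        // mask is 0xFFFFFFFF when alpha != 0 and 0x00000000 otherwise
        const std::uint32_t mask =
            0u - static_cast<std::uint32_t>(src[4 * j + 3] != 0);
        d = (s & mask) | (d & ~mask);      // select source or keep destination
        std::memcpy(dest + 4 * j, &d, 4);  // store the result
    }
}
```

Note that this version always reads and writes the destination, so it only pays off when the branch is poorly predicted; measuring both variants on real sprites is the only reliable way to choose.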

Theoretically, they should produce the same effect. In practice, it might be quite different.
If you look at the assembly code of an OpenMP program, a construct like #pragma omp parallel ... is simply turned into a call to some function that contains the code in its scope. It is similar to fork.
OpenMP is oriented toward parallel computing; multithreading, on the other hand, is more general.
For example, if you want to write a GUI program, multithreading is necessary (some frameworks may hide it, but it still needs multiple threads). However, you would never want to implement it with OpenMP.

generating correct spectrogram using fftw and window function

For a project I need to be able to generate a spectrogram from a .WAV file. I've read the following should be done:
Get N (transform size) samples
Apply a window function
Do a Fast Fourier Transform using the samples
Normalise the output
Generate spectrogram
In the image below you see two spectrograms of a 10000 Hz sine wave, both using the Hanning window function. On the left is a spectrogram generated by Audacity and on the right is my version. As you can see, my version has a lot more lines/noise. Is this leakage into different bins? How would I get a clear image like the one Audacity generates? Should I do some post-processing? I have not yet done any normalisation because I do not fully understand how to do so.
update
I found this tutorial explaining how to generate a spectrogram in c++. I compiled the source to see what differences I could find.
My math is very rusty to be honest so I'm not sure what the normalisation does here:
for(i = 0; i < half; i++){
out[i][0] *= (2./transform_size);
out[i][6] *= (2./transform_size);
processed[i] = out[i][0]*out[i][0] + out[i][7]*out[i][8];
//sets values between 0 and 1?
processed[i] =10. * (log (processed[i] + 1e-6)/log(10)) /-60.;
}
after doing this I got this image (btw I've inverted the colors):
I then took a look at the difference between the input samples provided by my sound library and those of the tutorial. Mine were way higher, so I manually normalised them by dividing by the factor 32767.9. I then got this image, which looks pretty OK I think. But dividing by this number seems wrong, and I would like to see a different solution.
Here is the full relevant source code.
void Spectrogram::process(){
int i;
int transform_size = 1024;
int half = transform_size/2;
int step_size = transform_size/2;
double in[transform_size];
double processed[half];
fftw_complex *out;
fftw_plan p;
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * transform_size);
for(int x=0; x < wavFile->getSamples()/step_size; x++){
int j = 0;
for(i = step_size*x; i < (x * step_size) + transform_size - 1; i++, j++){
in[j] = wavFile->getSample(i)/32767.9;
}
//apply window function
for(i = 0; i < transform_size; i++){
in[i] *= windowHanning(i, transform_size);
// in[i] *= windowBlackmanHarris(i, transform_size);
}
p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE);
fftw_execute(p); /* repeat as needed */
for(i = 0; i < half; i++){
out[i][0] *= (2./transform_size);
out[i][11] *= (2./transform_size);
processed[i] = out[i][0]*out[i][0] + out[i][12]*out[i][13];
processed[i] =10. * (log (processed[i] + 1e-6)/log(10)) /-60.;
}
for (i = 0; i < half; i++){
if(processed[i] > 0.99)
processed[i] = 1;
In->setPixel(x,(half-1)-i,processed[i]*255);
}
}
fftw_destroy_plan(p);
fftw_free(out);
}

This is not exactly an answer as to what is wrong but rather a step by step procedure to debug this.
What do you think this line does? processed[i] = out[i][0]*out[i][0] + out[i][12]*out[i][13] Likely that is incorrect: fftw_complex is typedef double fftw_complex[2], so you only have out[i][0] and out[i][1], where the first is the real and the second the imaginary part of the result for that bin. If the array is contiguous in memory (which it is), then out[i][12] is likely the same as out[i+6][0] and so forth. Some of these will go past the end of the array, adding random values.
Is your window function correct? Print out windowHanning(i, transform_size) for every i and compare with a reference version (for example numpy.hanning or the matlab equivalent). This is the most likely cause, what you see looks like a bad window function, kind of.
Print out processed, and compare with a reference version (given the same input, of course you'd have to print the input and reformat it to feed into pylab/matlab etc). However, the -60 and 1e-6 are fudge factors which you don't want, the same effect is better done in a different way. Calculate like this:
power_in_db[i] = 10 * log(out[i][0]*out[i][0] + out[i][1]*out[i][1])/log(10)
Print out the values of power_in_db[i] for the same i but for all x (a horizontal line). Are they approximately the same?
If everything so far is good, the remaining suspect is setting the pixel values. Be very explicit about clipping to range, scaling and rounding.
int pixel_value = (int)round( 255 * (power_in_db[i] - min_db) / (max_db - min_db) );
if (pixel_value < 0) { pixel_value = 0; }
if (pixel_value > 255) { pixel_value = 255; }
Here, again, print out the values in a horizontal line, and compare with the grayscale values in your pgm (by hand, using the colorpicker in photoshop or gimp or similar).
At this point, you will have validated everything from end to end, and likely found the bug.
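For steps 2 and 5, a small reference implementation can serve as the thing to compare against. The sketch below assumes the symmetric Hann definition w[i] = 0.5 * (1 - cos(2*pi*i / (n - 1))), which is what numpy.hanning computes, and requires n >= 2. The names hann_window and db_to_pixel are made up for this example, and min_db and max_db are whatever display range you choose.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Symmetric Hann window of length n (n >= 2); endpoints are 0, the middle is 1.
std::vector<double> hann_window(int n) {
    const double pi = std::acos(-1.0);
    std::vector<double> w(n);
    for (int i = 0; i < n; ++i)
        w[i] = 0.5 * (1.0 - std::cos(2.0 * pi * i / (n - 1)));
    return w;
}

// Maps a power value in dB to a grayscale value, with explicit
// scaling, rounding and clipping to [0, 255].
int db_to_pixel(double power_db, double min_db, double max_db) {
    const double scaled = 255.0 * (power_db - min_db) / (max_db - min_db);
    const long rounded = std::lround(scaled);
    return static_cast<int>(std::min(255L, std::max(0L, rounded)));
}
```

Printing hann_window(transform_size) next to the output of windowHanning makes a wrong window obvious: the reference starts and ends at 0 and peaks at 1 in the middle.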

The code you produced was almost correct, so you didn't leave me much to correct:
void Spectrogram::process(){
int transform_size = 1024;
int half = transform_size/2;
int step_size = transform_size/2;
double in[transform_size];
double processed[half];
fftw_complex *out;
fftw_plan p;
out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * transform_size);
for (int x=0; x < wavFile->getSamples()/step_size; x++) {
// Fill the transformation array with a sample frame and apply the window function.
// Normalization is performed later
// (One error was here: you didn't set the last value of the array in)
for (int j = 0, i = x * step_size; i < x * step_size + transform_size; i++, j++)
in[j] = wavFile->getSample(i) * windowHanning(j, transform_size);
p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE);
fftw_execute(p); /* repeat as needed */
for (int i=0; i < half; i++) {
// (Here were some flaws concerning the access of the complex values)
out[i][0] *= (2./transform_size); // real values
out[i][1] *= (2./transform_size); // imaginary values
processed[i] = out[i][0]*out[i][0] + out[i][1]*out[i][1]; // power spectrum
processed[i] = 10./log(10.) * log(processed[i] + 1e-6); // dB
// The resulting spectral values in 'processed' are in dB and related to a maximum
// value of about 96dB. Normalization to a value range between 0 and 1 can be done
// in several ways. I would suggest to set values below 0dB to 0dB and divide by 96dB:
// Transform all dB values to a range between 0 and 1:
if (processed[i] <= 0) {
processed[i] = 0;
} else {
processed[i] /= 96.; // Reduce the divisor if you prefer darker peaks
if (processed[i] > 1)
processed[i] = 1;
}
In->setPixel(x,(half-1)-i,processed[i]*255);
}
// This should be called each time fftw_plan_dft_r2c_1d()
// was called to avoid a memory leak:
fftw_destroy_plan(p);
}
fftw_free(out);
}
The two corrected bugs were most probably responsible for the slight variation of successive transformation results. The Hanning window is very well suited to minimizing the "noise", so a different window would not have solved the problem. (Actually, @Alex I already pointed to the 2nd bug in his point 2. But in his point 3 he added a -Inf bug, as log(0) is not defined, which can happen if your wave file contains a stretch of exact 0 values. To avoid this, the constant 1e-6 is good enough.)
Not asked, but there are some optimizations:
put p = fftw_plan_dft_r2c_1d(transform_size, in, out, FFTW_ESTIMATE); outside the main loop,
precalculate the window function outside the main loop,
abandon the array processed and just use a temporary variable to hold one spectral line at a time,
the two multiplications of out[i][0] and out[i][1] can be abandoned in favour of one multiplication with a constant in the following line. I left this (and other things) for you to improve.
Thanks to @Maxime Coorevits, a memory leak could additionally be avoided: "Each time you call fftw_plan_dft_r2c_1d(), memory is allocated by FFTW3. In your code, you only call fftw_destroy_plan() outside the outer loop. But in fact, you need to call this each time you request a plan."

Audacity typically doesn't map one frequency bin to one horizontal line, nor one sample period to one vertical line. The visual effect in Audacity may be due to resampling of the spectrogram picture in order to fit the drawing area.

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods perform different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
for (float x = 0f; x < 100f; x += 0.0001f) {
int noOfOccurrences = 0;
foreach (float y in largeFloatingPointArray) {
if (x == y) {
noOfOccurrences++;
}
}
noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?

UPDATE GPU Version
__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
float y; // compute one (or more) floats
int noOfOccurrences = 0;
int a;
while( x < size ) // While there is work to do each thread will:
{
dictionary[x] = 0; // Initialize the position on which it will work
noOfOccurrences = 0;
for(int j = 0 ;j < largeFloatingPointArraySize; j ++) // Search for floats
{ // that are equal
// to the float assigned to it
y = largeFloatingPointArray[j]; // Take a candidate from the floats array
y *= 10000; // e.g. if y = 0.0001f, y becomes 1
a = y + 0.5; // a = (int)(1 + 0.5) = 1 (rounds to nearest)
if (a == x) noOfOccurrences++;
}
dictionary[x] += noOfOccurrences; // Update in the dictionary
// the number of times that the float appears
x += blockDim.x * gridDim.x; // Update the position where the thread will work
}
}
I have only tested this one on smaller inputs, because I am testing on my laptop. It works, but more tests are needed.
UPDATE Sequential Version
I just wrote this naive version, which executes your algorithm for an array with 30,000,000 elements in less than 20 seconds (including the time taken by the function that generates the data).
This naive version first sorts your array of floats. Afterward, it goes through the sorted array, counts the number of times each value appears, and puts the value in a dictionary along with that count.
You can use a sorted map (std::map) instead of the unordered_map that I used.
Here's the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>
typedef std::tr1::unordered_map<float, int> Mymap;
void generator(float *data, long int size)
{
float LO = 0.0;
float HI = 100.0;
for(long int i = 0; i < size; i++)
data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}
void print_array(float *data, long int size)
{
for(long int i = 2; i < size; i++)
printf("%f\n",data[i]);
}
std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
float previous = data[0];
int count = 1;
std::tr1::unordered_map<float, int> dict;
for(long int i = 1; i < size; i++)
{
if(previous == data[i])
count++;
else
{
dict.insert(Mymap::value_type(previous,count));
previous = data[i];
count = 1;
}
}
dict.insert(Mymap::value_type(previous,count)); // add the last member
return dict;
}
void printMAP(std::tr1::unordered_map<float, int> dict)
{
for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
{
std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
}
}
int main(int argc, char** argv)
{
int size = 1000000;
if(argc > 1) size = atoi(argv[1]);
printf("Size = %d",size);
float data[size];
using namespace __gnu_cxx;
std::tr1::unordered_map<float, int> dict;
generator(data,size);
sort(data, data + size);
dict = fill_dict(data,size);
return 0;
}
If you have the Thrust library installed on your machine, you should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
For sure it will be faster.
Original Post
I'm working on a statistical application which has a large array containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
Yes, it is. A month ago, I ran an entire Molecular Dynamics simulation on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameters 6 arrays, each with 500,000 doubles, for a total of 3 million doubles (22 MB).
So if you are planning to put 30 million floating point values on the GPU, which is about 114 MB of global memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with Molecular Dynamics (MD), I would say no. The sequential MD version takes about 25 hours to complete, while the GPU version took 45 minutes. You said your application takes a couple of hours, and based on your code example it looks lighter than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
double *x, double *y, double *z,...){
int pos = (threadIdx.x + blockIdx.x * blockDim.x);
...
while(pos < particles)
{
for (i = 0; i < particles; i++)
{
if(//inside of the same radius)
{
// calculate force
}
}
pos += blockDim.x * gridDim.x;
}
}
A simple example of CUDA code could be the sum of two arrays:
In C:
for(int i = 0; i < N; i++)
c[i] = a[i] + b[i];
In CUDA:
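__global__ void add(int *c, int *a, int *b, int N)
{
int pos = threadIdx.x + blockIdx.x * blockDim.x; // global index of this thread
for(; pos < N; pos += blockDim.x * gridDim.x) // stride = total number of threads
c[pos] = a[pos] + b[pos];
}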
In CUDA you basically take each iteration of the for loop and assign it to a thread.
Each block has an ID from 0 to B-1 (B being the number of blocks) and each block has X threads with IDs from 0 to X-1.
The expression threadIdx.x + blockIdx.x*blockDim.x gives you the for loop iteration that each thread will compute, based on its own ID and the ID of the block it is in; blockDim.x is the number of threads that a block has.
So if you have 2 blocks, each one with 10 threads, and N=40, then:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
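The mapping in the list above can be reproduced on the CPU with a few nested loops, which is a convenient way to check an indexing scheme before writing the kernel. This is only a host-side simulation of the arithmetic, not CUDA code; the struct Owner and the function grid_stride_owners are names invented for this sketch.

```cpp
#include <vector>

// Which (block, thread) pair handles a given position.
struct Owner {
    int block;
    int thread;
};

// Simulates pos = threadIdx.x + blockIdx.x * blockDim.x with a stride of
// blockDim.x * gridDim.x, and records the owner of every position in [0, n).
std::vector<Owner> grid_stride_owners(int num_blocks, int threads_per_block, int n) {
    std::vector<Owner> owner(n, Owner{-1, -1});
    for (int block = 0; block < num_blocks; ++block)
        for (int thread = 0; thread < threads_per_block; ++thread)
            for (int pos = thread + block * threads_per_block; pos < n;
                 pos += threads_per_block * num_blocks)
                owner[pos] = Owner{block, thread};
    return owner;
}
```

With grid_stride_owners(2, 10, 40), position 10 belongs to thread 0 of block 1 and position 20 wraps around to thread 0 of block 0, matching the list.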
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ void hash (float *largeFloatingPointArray, int largeFloatingPointArraySize, int *dictionary)
// You can turn the dictionary into one array of int
// where each position represents a float.
// Since x = 0f; x < 100f; x += 0.0001f
// you can associate each x with a different position
// in the dictionary:
// pos 0 has the same meaning as 0f;
// pos 1 means float 0.0001f
// pos 2 means float 0.0002f etc.
// Then you use the int at each position
// to count how many times that "float" has appeared.
// The dictionary must be zeroed before the kernel is launched.
{
int x = blockIdx.x; // Each block will take a different x to work on
float y;
while( x < 1000000) // x < 100f (for incremental step of 0.0001f)
{
int noOfOccurrences = 0;
float z = converting_int_to_float(x); // This function (to be written) converts x to the
// float it stands for (x * 0.0001f)
// each thread of each block
// takes its y values from largeFloatingPointArray
for(int j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
{
y = largeFloatingPointArray[j];
if (z == y)
{
noOfOccurrences++;
}
}
if(noOfOccurrences > 0) // Every thread adds its partial count
atomicAdd(&dictionary[x], noOfOccurrences);
x += gridDim.x; // Move on to the next x assigned to this block
}
}
You have to use atomicAdd because several threads of the same block may update dictionary[x] at the same time, so the update must be atomic.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr Dobbs Journal series CUDA: Supercomputing for the masses by Rob Farber is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
And others:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look at the last item; you will find many links for learning CUDA.
OpenCL: OpenCL Tutorials | MacResearch

I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating point number repetitively like that, the rounding error will add up and you won't get what you intended. I've added an if statement to my below sample to check if inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
foreach (float x in largeFloatingPointArray)
{
if (Math.Truncate(x/0.0001f)*0.0001f == x)
{
if (noOfNumbers.ContainsKey(x))
noOfNumbers[x] = noOfNumbers[x] + 1; // Add() would throw on an existing key
else
noOfNumbers.Add(x, 1);
}
}
Hope this helps.

Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
Definitely YES, this kind of algorithm is typically the ideal candidate for massive data-parallelism processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives : CUDA or OpenCL.
CUDA is mature with a lot of tools but is NVidia GPUs centric.
OpenCL is a standard running on NVidia and AMD GPUs, and CPUs too. So you should really favour it.
For tutorial you have an excellent series on CodeProject by Rob Farber : http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use-case there are a lot of samples for building histograms with OpenCL (note that many are image histograms, but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.

In addition to the suggestion by the above poster, use the TPL (Task Parallel Library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.ForEach and ConcurrentDictionary, but a more complex map-reduce setup, where the array is split into chunks that each generate a dictionary which is then reduced to a single dictionary, would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
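The chunked map-reduce idea can be sketched in a few lines, shown here in C++ with std::thread rather than the TPL. Note the assumptions: each value is rounded to the nearest 0.0001 step and counted in an integer bin (the same array-as-dictionary idea as in the CUDA draft above), which is not identical to the exact float comparison in the question, and the names kBins, count_chunk and histogram are invented for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

constexpr int kBins = 1000001;  // bins for 0.0000 .. 100.0000 in steps of 0.0001

// Map step: count the values of one chunk into a private histogram.
void count_chunk(const std::vector<float>& data, std::size_t begin,
                 std::size_t end, std::vector<int>& local) {
    for (std::size_t i = begin; i < end; ++i) {
        const long bin = std::lround(static_cast<double>(data[i]) * 10000.0);
        if (bin >= 0 && bin < kBins) ++local[bin];
    }
}

// Splits data into num_threads chunks (num_threads >= 1), counts each chunk
// in its own thread, then reduces the partial histograms into one.
std::vector<int> histogram(const std::vector<float>& data, unsigned num_threads) {
    std::vector<std::vector<int>> partial(num_threads, std::vector<int>(kBins, 0));
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back(count_chunk, std::cref(data), begin, end,
                             std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();

    std::vector<int> total(kBins, 0);  // reduce step
    for (const auto& p : partial)
        for (int b = 0; b < kBins; ++b) total[b] += p[b];
    return total;
}
```

Because every thread writes only to its own partial histogram, no locks or atomics are needed until the final reduction, which is a single pass over the bins per thread.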

I am not sure whether using GPUs would be a good match given that 'largeFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self-contained calculations.
I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use floating point value as the key, and the number of occurrences in the array as the value. Worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run then in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.
On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back array of occurrences and it's ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If lookup system becomes a bottleneck, then it could be replicated on as many systems as needed.
With large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.
I am working on a compiler for parallel programming in C targeted for many-core based systems, often referred to as microservers, that are/or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication code (IPC) across systems. One of the IPC mechanism available is socket/tcp/ip.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
#include <stdio.h>
#include <stdlib.h>
/*
* Convert the X range from 0f to 100f in steps of 0.0001f
* into a range of integers 0 to 1 + (100 * 10000) to use as an
* index into an array.
*/
#define X_MAX (1 + (100 * 10000))
/*
* Number of floats in largeFloatingPointArray needs to be defined
* below to be whatever your value is.
*/
#define LARGE_ARRAY_MAX (1000)
int main(void)
{
int j, y, *noOfOccurances;
float *largeFloatingPointArray;
/*
* Allocate memory for largeFloatingPointArray and populate it.
*/
largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
if (largeFloatingPointArray == 0) {
printf("out of memory\n");
exit(1);
}
/*
* Allocate memory to hold noOfOccurances. The index/10000 is the
* floating point number. The content is the count.
*
* E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
* in largeFloatingPointArray.
*/
noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
if (noOfOccurances == 0) {
printf("out of memory\n");
exit(1);
}
for (j = 0; j < LARGE_ARRAY_MAX; j++) {
y = (int)(largeFloatingPointArray[j] * 10000 + 0.5f); /* round to the nearest 0.0001 step instead of truncating */
if (y >= 0 && y < X_MAX) { /* valid indices are 0 .. X_MAX - 1 */
noOfOccurances[y]++;
}
}
}

How to make timer for a game loop?

I want to measure the FPS count and limit it to 60. However, although I've been looking through some code found via Google, I completely don't get it.

If you want 60 FPS, you need to figure out how much time you have on each frame. In this case, 16.67 milliseconds. So you want a loop that completes every 16.67 milliseconds.
Usually it goes (simply put): Get input, do physics stuff, render, pause until 16.67ms has passed.
It's usually done by saving the time at the top of the loop, then calculating the difference at the end and sleeping (or looping while doing nothing) for the remaining duration.
This article describes a few different ways of doing game loops, including the one you want, although I'd use one of the more advanced alternatives in this article.
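The "save the time, do the work, pause for the rest" loop described above could look roughly like this with std::chrono. This is only a sketch: the helper name runFixedRate and its parameters are hypothetical, and the callback stands in for input, physics and rendering.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: call `frame` `frameCount` times, with each frame's
// deadline exactly `frameTime` after the previous deadline.
void runFixedRate(int frameCount,
                  std::chrono::steady_clock::duration frameTime,
                  const std::function<void()>& frame)
{
    auto nextFrame = std::chrono::steady_clock::now();
    for (int i = 0; i < frameCount; ++i) {
        frame();                                   // input, physics, render
        nextFrame += frameTime;                    // deadline for this frame
        std::this_thread::sleep_until(nextFrame);  // pause for the remainder
    }
}
```

For 60 FPS it would be called as runFixedRate(600, std::chrono::microseconds(16667), [] { /* one frame */ });. Advancing the deadline by a fixed step, rather than sleeping for a fixed duration, keeps a late wake-up in one frame from pushing every later frame back. The operating system may still wake the thread later than requested.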

Delta time is the final time minus the original time.
dt = t - t0
This delta time, though, is simply the amount of time that passes while the velocity is changing.
The derivative of a function represents an infinitesimal change
in the function with respect to one of its variables.
The derivative of a function f with respect to the variable x is defined as
$$ f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} $$
http://mathworld.wolfram.com/Derivative.html
#include<time.h>
#include<stdlib.h>
#include<stdio.h>
#include<windows.h>
#pragma comment(lib,"winmm.lib")
void gotoxy(int x, int y);
void StepSimulation(float dt);
int main(){
int NewTime = 0;
int OldTime = 0;
float dt = 0;
float TotalTime = 0;
int FrameCounter = 0;
int RENDER_FRAME_COUNT = 60;
while(true){
NewTime = timeGetTime();
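dt = (float) (NewTime - OldTime)/1000; // delta time in seconds (timeGetTime() returns milliseconds)
OldTime = NewTime;
if (dt > (0.016f)) dt = (0.016f); // clamp delta time to at most 16 ms
if (dt < 0.001f) dt = 0.001f; // clamp delta time to at least 1 ms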
TotalTime += dt;
if(TotalTime > 1.1f){
TotalTime=0;
StepSimulation(dt);
}
if(FrameCounter >= RENDER_FRAME_COUNT){
// draw stuff
//Render();
gotoxy(1,2);
printf(" \n");
printf("OldTime = %d \n",OldTime);
printf("NewTime = %d \n",NewTime);
printf("dt = %f \n",dt);
printf("TotalTime = %f \n",TotalTime);
printf("FrameCounter = %d fps\n",FrameCounter);
printf(" \n");
FrameCounter = 0;
}
else{
gotoxy(22,7);
printf("%d ",FrameCounter);
FrameCounter++;
}
}
return 0;
}
void gotoxy(int x, int y){
COORD coord;
coord.X = x; coord.Y = y;
SetConsoleCursorPosition(GetStdHandle(STD_OUTPUT_HANDLE), coord);
return;
}
void StepSimulation(float dt){
// calculate stuff
//vVelocity += Ae * dt;
}

You shouldn't try to limit the fps. The only reason to do so is if you are not using delta time and you expect each frame to be the same length. Even the simplest game cannot guarantee that.
You can, however, take your delta time, slice it into fixed-size ticks, and hold on to the remainder.
Here's some code I wrote recently. It's not thoroughly tested.
void GameLoop::Run()
{
m_Timer.Reset();
while(!m_Finished())
{
Time delta = m_Timer.GetDelta();
Time frameTime(0);
unsigned int loopCount = 0;
while (delta > m_TickTime && loopCount < m_MaxLoops)
{
m_SingTick();
delta -= m_TickTime;
frameTime += m_TickTime;
++loopCount;
}
m_Independent(frameTime);
// add an exception flag later.
// This is if the game hangs
if(loopCount >= m_MaxLoops)
{
delta %= m_TickTime;
}
m_Render(delta);
m_Timer.Unused(delta);
}
}
The member objects are Boost slots, so different code can register with different timing methods. The Independent slot is for things like key mapping or changing music: things that don't need to be so precise. SingTick is good for physics, where it is easier if you know every tick will be the same length and you don't want to run through a wall. Render takes the delta so animations run smoothly, but you must remember to account for it on the next SingTick.
Hope that helps.
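The slicing step on its own, separated from the Boost slots and the timer class, might be sketched as follows. TickSlice and sliceDelta are hypothetical names; tickTime and maxTicks play the roles of m_TickTime and m_MaxLoops in the loop above.

```cpp
#include <chrono>

using Duration = std::chrono::microseconds;

// Hypothetical result type: how many fixed ticks to run now, and how much
// time is left over to carry into the next frame.
struct TickSlice {
    unsigned ticks;
    Duration remainder;
};

TickSlice sliceDelta(Duration delta, Duration tickTime, unsigned maxTicks)
{
    TickSlice slice{0, delta};
    if (tickTime <= Duration::zero())
        return slice;  // guard: a non-positive tick length cannot be sliced
    while (slice.remainder >= tickTime && slice.ticks < maxTicks) {
        slice.remainder -= tickTime;
        ++slice.ticks;
    }
    // If the cap was hit (for example after a hang), drop the whole ticks
    // that are still pending and keep only the fractional part.
    if (slice.ticks >= maxTicks)
        slice.remainder %= tickTime;
    return slice;
}
```

With a 16 ms tick, a 40 ms delta yields two ticks and 8 ms carried over. The cap keeps a long stall from triggering a burst of catch-up ticks.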

There are many good reasons why you should not limit your frame rate in such a way. One reason, as stijn pointed out, is that not every monitor runs at exactly 60fps. Another is that the resolution of timers is not sufficient. Yet another is that, even given sufficient resolution, two separate timers (monitor refresh and yours) running in parallel will always drift out of sync over time (they must!) due to random inaccuracies. The most important reason is that it is not necessary at all.
Note that the default timer resolution under Windows is 15ms, and the best possible resolution you can get (by using timeBeginPeriod) is 1ms. Thus, you can (at best) wait 16ms or 17ms. One frame at 60fps is 16.6666ms. How do you wait 16.6666ms?
If you want to limit your game's speed to the monitor's refresh rate, enable vertical sync. This will do what you want, precisely, and without sync issues. Vertical sync does have its peculiarities too (such as the funny surprise you get when a frame takes 16.67ms), but it is by far the best available solution.
If the reason why you wanted to do this was to fit your simulation into the render loop, this is a must read for you.

Check this one out:
//Creating Digital Watch in C++
#include<iostream>
#include<Windows.h>
using namespace std;
// Named Clock rather than time so it cannot clash with the C library's time().
struct Clock{
    int hr,min,sec;
};
int main()
{
    Clock a;
    // The loop counters are the displayed time, so each field wraps on its own:
    // sec and min run 0..59, hr runs 0..23, and the program ends after one day.
    for(a.hr = 0; a.hr<24; a.hr++)
    {
        for(a.min = 0; a.min<60; a.min++)
        {
            for(a.sec = 0; a.sec<60; a.sec++)
            {
                cout<<a.hr<<" : "<<a.min<<" : "<<a.sec<<endl;
                Sleep(1000); // wait about one second; small errors add up over time
                system("cls");
            }
        }
    }
    return 0;
}
