Arduino mega queue - c++

I wrote this simple code which reads a length from the Sharp infrared sensor, end presents the average meter in cm (unit) by serial.
When write this code for the Arduino Mega board, the Arduino starts a blinking LED (pin 13) and the program does nothing. Where is the bug in this code?
#include <QueueList.h>
const int ANALOG_SHARP = 0; //Set pin data from sharp.
QueueList <float> queuea;
float cm;
float qu1;
float qu2;
float qu3;
float qu4;
float qu5;
void setup() {
Serial.begin(9600);
}
void loop() {
cm = read_gp2d12_range(ANALOG_SHARP); //Convert to cm (unit).
queuea.push(cm); //Add item to queue, when I add only this line Arduino crash.
if ( 5 <= queuea.peek()) {
Serial.println(average());
}
}
float read_gp2d12_range(byte pin) { //Function converting to cm (unit).
int tmp;
tmp = analogRead(pin);
if (tmp < 3)
return -1; // Invalid value.
return (6787.0 /((float)tmp - 3.0)) - 4.0;
}
float average() { //Calculate average length
qu1 += queuea.pop();
qu2 += queuea.pop();
qu3 += queuea.pop();
qu4 += queuea.pop();
qu5 += queuea.pop();
float aver = ((qu1+qu2+qu3+qu4+qu5)/5);
return aver;
}

I agree with the peek() -> count() error listed by vhallac. But I'll also point out that you should consider averaging by powers of 2 unless there is a strong case to do otherwise.
The reason is that on microcontrollers, division is slow. By averaging over a power of 2 (2,4,8,16,etc.) you can simply calculate the sum and then bitshift it.
To calculate the average of 2: (v1 + v2) >> 1
To calculate the average of 4: (v1 + v2 + v3 + v4) >> 2
To calculate the average of n values (where n is a power of 2) just right bitshift the sum right by [log2(n)].
As long as the datatype for your sum variable is big enough and won't overflow, this is much easier and much faster.
Note: this won't work for floats in general. In fact, microcontrollers aren't optimized for floats. You should consider converting from int (what I'm assuming you're ADC is reading) to float at the end after the averaging rather than before.
By converting from int to float and then averaging floats you are losing more precision than averaging ints than converting the int to a float.
Other:
You're using the += operator without initializing the variables (qu1, qu2, etc.) -- it's good practice to initialize them if you're going to use += but it looks as if = would work fine.
For floats, I'd have written the average function as:
float average(QueueList<float> & q, int n)
{
float sum = 0;
for(int i=0; i<n; i++)
{
sum += q.pop();
}
return (sum / (float) n);
}
And called it: average(queuea, 5);
You could use this to average any number of sensor readings and later use the same code to later average floats in a completely different QueueList. Passing the number of readings to average as a parameter will really come in handy in the case that you need to tweak it.
TL;DR:
Here's how I would have done it:
#include <QueueList.h>
const int ANALOG_SHARP=0; // set pin data from sharp
const int AvgPower = 2; // 1 for 2 readings, 2 for 4 readings, 3 for 8, etc.
const int AvgCount = pow(2,AvgPow);
QueueList <int> SensorReadings;
void setup(){
Serial.begin(9600);
}
void loop()
{
int reading = analogRead(ANALOG_SHARP);
SensorReadings.push(reading);
if(SensorReadings.count() > AvgCount)
{
int avg = average2(SensorReadings, AvgPower);
Serial.println(gpd12_to_cm(avg));
}
}
float gp2d12_to_cm(int reading)
{
if(reading <= 3){ return -1; }
return((6787.0 /((float)reading - 3.0)) - 4.0);
}
int average2(QueueList<int> & q, int AvgPower)
{
int AvgCount = pow(2, AvgPower);
long sum = 0;
for(int i=0; i<AvgCount; i++)
{
sum += q.pop();
}
return (sum >> AvgPower);
}

You are using queuea.peek() to obtain the count. This will only return the last element in queue. You should use queuea.count() instead.
Also you might consider changing the condition tmp < 3 to tmp <= 3. If tmp is 3, you divide by zero.

Great improvement jedwards, however the first question I have is why use queuelist instead of an int array.
As an example I would do the following:
int average(int analog_reading)
{
#define NUM_OF_AVG 5
static int readings[NUM_OF_AVG];
static int next_position;
static int sum;
if (++next_position >= NUM_OF_AVG)
{
next_position=0;
}
reading[next_position]=analog_reading;
for(int i=0; i<NUM_OF_AVG; i++)
{
sum += reading[i];
}
average = sum/NUM_OF_AVG
}
Now I compute a new rolling average with every reading and it eliminates all the issues related to dynamic memory allocation (memory fragmentation, no available memory, memory leaks) in a embedded device.
I appreciate and understand the use of shifting for a division by 2,4 or 8, however I would stay away from that technique for two reasons.
I think readability and maintainability of the source code is more important then saving a little bit of time with a shift instead of a divide unless you can test and verify the divide is a bottleneck.
Second, I believe most current optimizing compilers will do a shift if possible, I know GCC does.
I will leave refactoring out the for loop for the next guy.

Related

is there a way to set an "unknown" variable like "x" inside a sine-equation, and change its value afterwards?

I want to write an audio code in c++ for my microcontroller-based synthesizer which should allow me to generate a sampled square wave signal using the Fourier Series equation.
My question in general is: is there a way to set an "unknown" variable like "x" inside a sine-equation, and change its value afterwards?
What do I mean by that:
If you take a look on my code i've written so far you see the following:
void SquareWave(int mHarmonics){
char x;
for(int k = 0; k <= mHarmonics; k++){
mFourier += 1/((2*k)+1)*sin(((2*k)+1)*2*M_PI*x/SAMPLES_TOTAL);
}
for(x = (int)0; x < SAMPLES_TOTAL; x++){
mWave[x] = mFourier;
}
}
Inside the first for loop mFourier is summing weighted sinus-signals dependent by the number of Harmonics "mHarmonics". So a note on my keyboard should be setting up the harmonic spectrum automatically.
In this equation I've set x as a character and now we get to the center of my problem because i want to set x as a "unknown" variable which has a value that i want to set afterwards and if x would be an integer it would have some standard value like 0, which would make the whole equation incorrect.
In the bottom loop i want to write this Fourier Series sum inside an array mWave, which will be the resulting output. Is there a possibility to give the sum to mWave[x], where x is a "unknown" multiplier inside the sine signal first, and then change its values afterwards inside the second loop?
Sorry if this is a stupid question, I have not much experience with c++ but I try to learn it by making these stupid mistakes!
Cheers
#Useless told you what to do, but I am going to try to spell it out for you.
This is how I would do it:
#include <vector>
/**
* Perform a rectangular window in the frequency domain of a time domain square
* wave. This should be a sync impulse response.
*
* #param x The time domain sample within the period of the signal.
* #param harmonic_count The number of harmonics to aggregate in the result.
* #param sample_count The number of samples across the square wave period.
*
* #return double The time domain result of the combined harmonics at point x.
*/
double box_car(unsigned int x,
unsigned int harmonic_count,
unsigned int sample_count)
{
double mFourier = 0.0;
for (int k = 0; k <= harmonic_count; k++)
{
mFourier += 1.0 / ((2 * k) + 1) * sin(((2 * k) + 1) * 2.0 * M_PI * x / sample_count);
}
return mFourier;
}
/**
* Calculate the suqare wave samples across the time domain where the samples
* are filtered to only include the harmonic_count.
*
* #param harmonic_count The number of harmonics to aggregate in the result.
* #param sample_count The number of samples across the square wave period.
*
* #return std::vector<double>
*/
std::vector<double> box_car_samples(unsigned int harmonic_count,
unsigned int sample_count)
{
std::vector<double> square_wave;
for (unsigned int x = 0; x < sample_count; x++)
{
double sample = box_car(x, harmonic_count, sample_count);
square_wave.push_back(sample);
}
return square_wave;
}
So mWave[x] is returned as a std::vector of doubles (floating point).
The function box_car_samples() is f(k, x) as stated before.
So since I can't use vectors inside Arduino IDE anyhow I've tried the following solution:
...
void ComputeBandlimitedSquareWave(int mHarmonics){
for(int i = 0; i < sample_count; i++){
mWavetable[i] = ComputeFourierSeriesSquare(x);
if (x < sample_count) x++;
}
}
float ComputeFourierSeriesSquare(int x){
for(int k = 0; k <= mHarmonics; k++){
mFourier += 1/((2*k)+1)*sin(((2*k)+1)*2*M_PI*x/sample_count);
return mFourier;
}
}
...
First I thought this must be right a minute ago, but my monitors prove me wrong...
It sounds like a completely messed up sum of signals first, but after about 2 seconds the true characterisic squarewave sound comes through. I try to figure out what I'm overseeing and keep You guys updated if I can isolate that last part coming through my speakers, because it actually has a really decent sound. Only the messy overlays in the beginning are making me desperate right now...

How to avoid using nested loops in cpp?

I am working on digital sampling for sensor. I have following code to compute the highest amplitude and the corresponding time.
struct LidarPoints{
float timeStamp;
float Power;
}
std::vector<LidarPoints> measurement; // To store Lidar points of current measurement
Currently power and energy are the same (because of delta function)and vector is arranged in ascending order of time. I would like to change this to step function. Pulse duration is a constant 10ns.
uint32_t pulseDuration = 5;
The problem is to find any overlap between the samples and if any to add up the amplitudes.
I currently use following code:
for(auto i= 0; i< measurement.size(); i++){
for(auto j=i+1; i< measurement.size(); j++){
if(measurement[j].timeStamp - measurement[i].timeStamp) < pulseDuration){
measurement[i].Power += measurement[j].Power;
measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f;
}
}
}
Is it possible to code this without two for loops since I cannot afford the amount of time being taken by nested loops.
You can take advantage that the vector is sorted by timeStamp and find the next pulse with binary search, thus reducing the complexity from O(n^2) to O(n log n):
#include <vector>
#include <algorithm>
#include <numeric>
#include <iterator
auto it = measurement.begin();
auto end = measurement.end();
while (it != end)
{
// next timestamp as in your code
auto timeStampLower = it->timeStamp + pulseDuration;
// next value in measurement with a timestamp >= timeStampLower
auto lower_bound = std::lower_bound(it, end, timeStampLower, [](float a, const LidarPoints& b) {
return a < b.timeStamp;
});
// sum over [timeStamp, timeStampLower)
float sum = std::accumulate(it, lower_bound, 0.0f, [] (float a, const LidarPoints& b) {
return a + b.timeStamp;
});
auto num = std::distance(it, lower_bound);
// num should be >= since the vector is sorted and pulseDuration is positive
// you should uncomment next line to catch unexpected error
// Expects(num >= 1); // needs GSL library
// assert(num >= 1); // or standard C if you don't want to use GSL
// average over [timeStamp, timeStampLower)
it->timeStamp = sum / num;
// advance it
it = lower_bound;
}
https://en.cppreference.com/w/cpp/algorithm/lower_bound
https://en.cppreference.com/w/cpp/algorithm/accumulate
Also please note that my algorithm will produce different result than yours because you don't really compute the average over multiple values with measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f
Also to consider: (I am by far not an expert in the field, so I am just throwing the ideea, it's up to you to know if its valid or not): with your code you just "squash" together close measurement, instead of having a vector of measurement with periodic time. It might be what you intend or not.
Disclaimer: not tested beyond "it compiles". Please don't just copy-paste it. It could be incomplet and incorrekt. But I hope I gave you a direction to investigate.
Due to jitter and other timing complexities, instead of simple summation, you need to switch to [Numerical Integration][۱] (eg. Trapezoidal Integration...).
If your values are in ascending order of timeStamp adding else break to the if statement shouldn't effect the result but should be a lot quicker.
for(auto i= 0; i< measurement.size(); i++){
for(auto j=i+1; i< measurement.size(); j++){
if(measurement[j].timeStamp - measurement[i].timeStamp) < pulseDuration){
measurement[i].Power += measurement[j].Power;
measurement[i].timeStamp = (measurement[i].timeStamp + measurement[j].timeStamp)/2.0f;
} else {
break;
}
}
}

Encoding numbers in BCD (Casio serial interface) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I am attempting to create a device that talks to a Casio fx-9750 calculator through its serial port with an Arduino. I have figured out how to receive values and decode the BCD, but I'm stuck on how to create the required values from a float (to transmit back).
The calculator sends a data packet, which has an exponent value, several data values, and a byte that contains information about negativity, imaginary parts, etc. Each data value is worth one hundredth of the previous one, so the first is the amount of 10s, the next the amount of 0.1s, the next the amount of 0.001s, etc. This continues on until the 0.0000000000001s, though this is out of the range of what I'll really need, so that level of accuracy is not really important to me. The output of my receiving program looks like this:
Exponent: 1
10s: 1
0.1s: 23
0.001s: 40
This represents 12.34.
The general equation I worked out was: (let a=10s, b=0.1s, e=exponent etc)
((a*10)+(b*0.1)+(c*0.001))*10^(E-1)
If the exponent were to change to two:
Exponent: 2
10s: 1
0.1s: 23
0.001s: 40
This would represent 123.4
This method of dropping by hundredths each time is presumably used because they can store two digits in each byte with BCD, so it is most efficient to let each row have two digits as each row is stored as one byte.
I have come up with an equation that can calculate the exponent by counting the amount of digits before the decimal point and subtracting two, however this seems messy as it involves strings. I think a purely mathematical solution would be more elegant, if it is possible.
What is the fastest and simplest way to go from a normal number (e.g. 123.4) into this arrangement?
A solution in Arduino language would be greatly appreciated, but any insight whatsoever into the mathematical process needed would be equally valued.
Edit regarding floats:
I should clarify - I will be dealing with floats in other parts of my program and would like my inputted values to be compatible with numbers of any size (within reason, as stated before). I have no problem with multiplying them to be ints or casting them as other datatypes.
Hah, that was fun!
#include <stdio.h>
#include <assert.h>
#include <math.h>
#include <float.h>
struct num_s {
// exponent
int e;
// v[0] is *10
// v[1] is *0.01
// v[2] is *0.0001
// and so on...
// to increase precision increase array count
int v[6];
};
#define NUM_VALSCNT (sizeof(((struct num_s*)0)->v)/sizeof(((struct num_s*)0)->v[0]))
// creates num_s object from a double
struct num_s num_create(double v) {
struct num_s t;
// find exponent so that v <= 10
t.e = 0;
while (fabs(v) >= 10.0) {
++t.e;
v /= 10.0;
}
// for each output number get the integral part of number
// then multiply the rest by 100 and continue
for (size_t i = 0; i < sizeof(t.v) / sizeof(t.v[0]); ++i) {
const double tmp = fmod(v, 1.0);
t.v[i] = v - tmp;
v = tmp * 100;
}
return t;
}
// converts back from num object to double
double num_get(struct num_s t) {
double denom = 10;
double ret = 0;
for (size_t i = 0; i < sizeof(t.v) / sizeof(t.v[0]); ++i) {
ret += denom * t.v[i];
denom /= 100;
}
return ret * pow(10, t.e - 1);
}
void num_println(struct num_s t) {
printf("%f =", num_get(t));
for (size_t i = 0; i < sizeof(t.v) / sizeof(t.v[0]); ++i) {
printf(" %d", t.v[i]);
}
printf(" %d\n", t.e);
}
// returns the precision of numbers
// the smallest number we can represent in num object
double num_precision(void) {
return pow(0.1, (NUM_VALSCNT - 1) * 2) * 10;
}
int num_unittests(void) {
const double tests[][3] = {
{ 123.49, 123.5, 123.51, }
};
for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); ++i) {
const double tmp = num_get(num_create(tests[i][1]));
if (!(tests[i][0] <= tmp && tmp <= tests[i][2])) {
return i + 1;
}
}
return 0;
}
int main() {
num_println(num_create(12.3456789));
num_println(num_create(123.5));
num_println(num_create(12.35));
printf("%d\n", num_unittests());
return 0;
}

Getting values for specific frequencies in a short time fourier transform

I'm trying to use C++ to recreate the spectrogram function used by Matlab. The function uses a Short Time Fourier Transform (STFT). I found some C++ code here that performs a STFT. The code seems to work perfectly for all frequencies but I only want a few. I found this post for a similar question with the following answer:
Just take the inner product of your data with a complex exponential at
the frequency of interest. If g is your data, then just substitute for
f the value of the frequency you want (e.g., 1, 3, 10, ...)
Having no background in mathematics, I can't figure out how to do this. The inner product part seems simple enough from the Wikipedia page but I have absolutely no idea what he means by (with regard to the formula for a DFT)
a complex exponential at frequency of interest
Could someone explain how I might be able to do this? My data structure after the STFT is a matrix filled with complex numbers. I just don't know how to extract my desired frequencies.
Relevant function, where window is Hamming, and vector of desired frequencies isn't yet an input because I don't know what to do with them:
Matrix<complex<double>> ShortTimeFourierTransform::Calculate(const vector<double> &signal,
const vector<double> &window, int windowSize, int hopSize)
{
int signalLength = signal.size();
int nOverlap = hopSize;
int cols = (signal.size() - nOverlap) / (windowSize - nOverlap);
Matrix<complex<double>> results(window.size(), cols);
int chunkPosition = 0;
int readIndex;
// Should we stop reading in chunks?
bool shouldStop = false;
int numChunksCompleted = 0;
int i;
// Process each chunk of the signal
while (chunkPosition < signalLength && !shouldStop)
{
// Copy the chunk into our buffer
for (i = 0; i < windowSize; i++)
{
readIndex = chunkPosition + i;
if (readIndex < signalLength)
{
// Note the windowing!
data[i][0] = signal[readIndex] * window[i];
data[i][1] = 0.0;
}
else
{
// we have read beyond the signal, so zero-pad it!
data[i][0] = 0.0;
data[i][1] = 0.0;
shouldStop = true;
}
}
// Perform the FFT on our chunk
fftw_execute(plan_forward);
// Copy the first (windowSize/2 + 1) data points into your spectrogram.
// We do this because the FFT output is mirrored about the nyquist
// frequency, so the second half of the data is redundant. This is how
// Matlab's spectrogram routine works.
for (i = 0; i < windowSize / 2 + 1; i++)
{
double real = fft_result[i][0];
double imaginary = fft_result[i][1];
results(i, numChunksCompleted) = complex<double>(real, imaginary);
}
chunkPosition += hopSize;
numChunksCompleted++;
} // Excuse the formatting, the while ends here.
return results;
}
Look up the Goertzel algorithm or filter for example code that uses the computational equivalent of an inner product against a complex exponential to measure the presence or magnitude of a specific stationary sinusoidal frequency in a signal. Performance or resolution will depend on the length of the filter and your signal.

Can/Should I run this code of a statistical application on a GPU?

I'm working on a statistical application containing approximately 10 - 30 million floating point values in an array.
Several methods performing different, but independent, calculations on the array in nested loops, for example:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
for (float x = 0f; x < 100f; x += 0.0001f) {
int noOfOccurrences = 0;
foreach (float y in largeFloatingPointArray) {
if (x == y) {
noOfOccurrences++;
}
}
noOfNumbers.Add(x, noOfOccurrences);
}
The current application is written in C#, runs on an Intel CPU and needs several hours to complete. I have no knowledge of GPU programming concepts and APIs, so my questions are:
Is it possible (and does it make sense) to utilize a GPU to speed up such calculations?
If yes: Does anyone know any tutorial or got any sample code (programming language doesn't matter)?
UPDATE GPU Version
__global__ void hash (float *largeFloatingPointArray,int largeFloatingPointArraySize, int *dictionary, int size, int num_blocks)
{
int x = (threadIdx.x + blockIdx.x * blockDim.x); // Each thread of each block will
float y; // compute one (or more) floats
int noOfOccurrences = 0;
int a;
while( x < size ) // While there is work to do each thread will:
{
dictionary[x] = 0; // Initialize the position in each it will work
noOfOccurrences = 0;
for(int j = 0 ;j < largeFloatingPointArraySize; j ++) // Search for floats
{ // that are equal
// to it assign float
y = largeFloatingPointArray[j]; // Take a candidate from the floats array
y *= 10000; // e.g if y = 0.0001f;
a = y + 0.5; // a = 1 + 0.5 = 1;
if (a == x) noOfOccurrences++;
}
dictionary[x] += noOfOccurrences; // Update in the dictionary
// the number of times that the float appears
x += blockDim.x * gridDim.x; // Update the position here the thread will work
}
}
This one I just tested for smaller inputs, because I am testing in my laptop. Nevertheless, it is working, but more tests are needed.
UPDATE Sequential Version
I just did this naive version that executes your algorithm for an array with 30,000,000 element in less than 20 seconds (including the time taken by function that generates the data).
This naive version first sorts your array of floats. Afterward, will go through the sorted array and check the number of times a given value appears in the array and then puts this value in a dictionary along with the number of times it has appeared.
You can use sorted map, instead of the unordered_map that I used.
Heres the code:
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include <algorithm>
#include <string>
#include <iostream>
#include <tr1/unordered_map>
typedef std::tr1::unordered_map<float, int> Mymap;
void generator(float *data, long int size)
{
float LO = 0.0;
float HI = 100.0;
for(long int i = 0; i < size; i++)
data[i] = LO + (float)rand()/((float)RAND_MAX/(HI-LO));
}
void print_array(float *data, long int size)
{
for(long int i = 2; i < size; i++)
printf("%f\n",data[i]);
}
std::tr1::unordered_map<float, int> fill_dict(float *data, int size)
{
float previous = data[0];
int count = 1;
std::tr1::unordered_map<float, int> dict;
for(long int i = 1; i < size; i++)
{
if(previous == data[i])
count++;
else
{
dict.insert(Mymap::value_type(previous,count));
previous = data[i];
count = 1;
}
}
dict.insert(Mymap::value_type(previous,count)); // add the last member
return dict;
}
void printMAP(std::tr1::unordered_map<float, int> dict)
{
for(std::tr1::unordered_map<float, int>::iterator i = dict.begin(); i != dict.end(); i++)
{
std::cout << "key(string): " << i->first << ", value(int): " << i->second << std::endl;
}
}
int main(int argc, char** argv)
{
int size = 1000000;
if(argc > 1) size = atoi(argv[1]);
printf("Size = %d",size);
float data[size];
using namespace __gnu_cxx;
std::tr1::unordered_map<float, int> dict;
generator(data,size);
sort(data, data + size);
dict = fill_dict(data,size);
return 0;
}
If you have the library thrust installed in you machine your should use this:
#include <thrust/sort.h>
thrust::sort(data, data + size);
instead of this
sort(data, data + size);
For sure it will be faster.
Original Post
I'm working on a statistical application which has a large array
containing 10 - 30 millions of floating point values.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Yes, it is. A month ago, I ran an entirely Molecular Dynamic simulation on a GPU. One of the kernels, which calculated the force between pairs of particles, received as parameter 6 array each one with 500,000 doubles, for a total of 3 Millions doubles (22 MB).
So if you are planning to put 30 Million floating points, which is about 114 MB of global Memory, it will not be a problem.
In your case, can the number of calculations be an issue? Based on my experience with the Molecular Dynamic (MD), I would say no. The sequential MD version takes about 25 hours to complete while the GPU version took 45 Minutes. You said your application took a couple hours, also based in your code example it looks softer than the MD.
Here's the force calculation example:
__global__ void add(double *fx, double *fy, double *fz,
double *x, double *y, double *z,...){
int pos = (threadIdx.x + blockIdx.x * blockDim.x);
...
while(pos < particles)
{
for (i = 0; i < particles; i++)
{
if(//inside of the same radius)
{
// calculate force
}
}
pos += blockDim.x * gridDim.x;
}
}
A simple example of a code in CUDA could be the sum of two 2D arrays:
In C:
for(int i = 0; i < N; i++)
c[i] = a[i] + b[i];
In CUDA:
__global__ add(int *c, int *a, int*b, int N)
{
int pos = (threadIdx.x + blockIdx.x)
for(; i < N; pos +=blockDim.x)
c[pos] = a[pos] + b[pos];
}
In CUDA you basically took each for iteration and assigned to each thread,
1) threadIdx.x + blockIdx.x*blockDim.x;
Each block has an ID from 0 to N-1 (N the number maximum of blocks) and each block has a 'X' number of threads with an ID from 0 to X-1.
Gives you the for loop iteration that each thread will compute based on its ID and the block ID which the thread is in; the blockDim.x is the number of threads that a block has.
So if you have 2 blocks each one with 10 threads and N=40, the:
Thread 0 Block 0 will execute pos 0
Thread 1 Block 0 will execute pos 1
...
Thread 9 Block 0 will execute pos 9
Thread 0 Block 1 will execute pos 10
....
Thread 9 Block 1 will execute pos 19
Thread 0 Block 0 will execute pos 20
...
Thread 0 Block 1 will execute pos 30
Thread 9 Block 1 will execute pos 39
Looking at your current code, I have made this draft of what your code could look like in CUDA:
__global__ hash (float *largeFloatingPointArray, int *dictionary)
// You can turn the dictionary in one array of int
// here each position will represent the float
// Since x = 0f; x < 100f; x += 0.0001f
// you can associate each x to different position
// in the dictionary:
// pos 0 have the same meaning as 0f;
// pos 1 means float 0.0001f
// pos 2 means float 0.0002f ect.
// Then you use the int of each position
// to count how many times that "float" had appeared
int x = blockIdx.x; // Each block will take a different x to work
float y;
while( x < 1000000) // x < 100f (for incremental step of 0.0001f)
{
int noOfOccurrences = 0;
float z = converting_int_to_float(x); // This function will convert the x to the
// float like you use (x / 0.0001)
// each thread of each block
// will takes the y from the array of largeFloatingPointArray
for(j = threadIdx.x; j < largeFloatingPointArraySize; j += blockDim.x)
{
y = largeFloatingPointArray[j];
if (z == y)
{
noOfOccurrences++;
}
}
if(threadIdx.x == 0) // Thread master will update the values
atomicAdd(&dictionary[x], noOfOccurrences);
__syncthreads();
}
You have to use atomicAdd because different threads from different blocks may write/read noOfOccurrences concurrently, so you have to ensure mutual exclusion.
This is just one approach; you can even assign the iterations of the outer loop to the threads instead of the blocks.
Tutorials
The Dr Dobbs Journal series CUDA: Supercomputing for the masses by Rob Farmer is excellent and covers just about everything in its fourteen installments. It also starts rather gently and is therefore fairly beginner-friendly.
and anothers:
Volume I: Introduction to CUDA Programming
Getting started with CUDA
CUDA Resources List
Take a look on the last item, you will find many link to learn CUDA.
OpenCL: OpenCL Tutorials | MacResearch
I don't know much of anything about parallel processing or GPGPU, but for this specific example, you could save a lot of time by making a single pass over the input array rather than looping over it a million times. With large data sets you will usually want to do things in a single pass if possible. Even if you're doing multiple independent computations, if it's over the same data set you might get better speed doing them all in the same pass, as you'll get better locality of reference that way. But it may not be worth it for the increased complexity in your code.
In addition, you really don't want to add a small amount to a floating point number repetitively like that, the rounding error will add up and you won't get what you intended. I've added an if statement to my below sample to check if inputs match your pattern of iteration, but omit it if you don't actually need that.
I don't know any C#, but a single pass implementation of your sample would look something like this:
Dictionary<float, int> noOfNumbers = new Dictionary<float, int>();
foreach (float x in largeFloatingPointArray)
{
if (math.Truncate(x/0.0001f)*0.0001f == x)
{
if (noOfNumbers.ContainsKey(x))
noOfNumbers.Add(x, noOfNumbers[x]+1);
else
noOfNumbers.Add(x, 1);
}
}
Hope this helps.
Is it possible (and does it make sense) to utilize a GPU to speed up
such calculations?
Definitely YES, this kind of algorithm is typically the ideal candidate for massive data-parallelism processing, the thing GPUs are so good at.
If yes: Does anyone know any tutorial or got any sample code
(programming language doesn't matter)?
When you want to go the GPGPU way you have two alternatives : CUDA or OpenCL.
CUDA is mature with a lot of tools but is NVidia GPUs centric.
OpenCL is a standard running on NVidia and AMD GPUs, and CPUs too. So you should really favour it.
For tutorial you have an excellent series on CodeProject by Rob Farber : http://www.codeproject.com/Articles/Rob-Farber#Articles
For your specific use-case there is a lot of samples for histograms buiding with OpenCL (note that many are image histograms but the principles are the same).
As you use C# you can use bindings like OpenCL.Net or Cloo.
If your array is too big to be stored in the GPU memory, you can block-partition it and rerun your OpenCL kernel for each part easily.
In addition to the suggestion by the above poster use the TPL (task parallel library) when appropriate to run in parallel on multiple cores.
The example above could use Parallel.Foreach and ConcurrentDictionary, but a more complex map-reduce setup where the array is split into chunks each generating an dictionary which would then be reduced to a single dictionary would give you better results.
I don't know whether all your computations map correctly to the GPU capabilities, but you'll have to use a map-reduce algorithm anyway to map the calculations to the GPU cores and then reduce the partial results to a single result, so you might as well do that on the CPU before moving on to a less familiar platform.
I am not sure whether using GPUs would be a good match given that
'largerFloatingPointArray' values need to be retrieved from memory. My understanding is that GPUs are better suited for self contained calculations.
I think turning this single process application into a distributed application running on many systems and tweaking the algorithm should speed things up considerably, depending how many systems are available.
You can use the classic 'divide and conquer' approach. The general approach I would take is as follows.
Use one system to preprocess 'largeFloatingPointArray' into a hash table or a database. This would be done in a single pass. It would use floating point value as the key, and the number of occurrences in the array as the value. Worst case scenario is that each value only occurs once, but that is unlikely. If largeFloatingPointArray keeps changing each time the application is run then in-memory hash table makes sense. If it is static, then the table could be saved in a key-value database such as Berkeley DB. Let's call this a 'lookup' system.
On another system, let's call it 'main', create chunks of work and 'scatter' the work items across N systems, and 'gather' the results as they become available. E.g a work item could be as simple as two numbers indicating the range that a system should work on. When a system completes the work, it sends back array of occurrences and it's ready to work on another chunk of work.
The performance is improved because we do not keep iterating over largeFloatingPointArray. If lookup system becomes a bottleneck, then it could be replicated on as many systems as needed.
With large enough number of systems working in parallel, it should be possible to reduce the processing time down to minutes.
I am working on a compiler for parallel programming in C targeted for many-core based systems, often referred to as microservers, that are/or will be built using multiple 'system-on-a-chip' modules within a system. ARM module vendors include Calxeda, AMD, AMCC, etc. Intel will probably also have a similar offering.
I have a version of the compiler working, which could be used for such an application. The compiler, based on C function prototypes, generates C networking code that implements inter-process communication code (IPC) across systems. One of the IPC mechanism available is socket/tcp/ip.
If you need help in implementing a distributed solution, I'd be happy to discuss it with you.
Added Nov 16, 2012.
I thought a little bit more about the algorithm and I think this should do it in a single pass. It's written in C and it should be very fast compared with what you have.
/*
* Convert the X range from 0f to 100f in steps of 0.0001f
* into a range of integers 0 to 1 + (100 * 10000) to use as an
* index into an array.
*/
#define X_MAX (1 + (100 * 10000))
/*
* Number of floats in largeFloatingPointArray needs to be defined
* below to be whatever your value is.
*/
#define LARGE_ARRAY_MAX (1000)
main()
{
int j, y, *noOfOccurances;
float *largeFloatingPointArray;
/*
* Allocate memory for largeFloatingPointArray and populate it.
*/
largeFloatingPointArray = (float *)malloc(LARGE_ARRAY_MAX * sizeof(float));
if (largeFloatingPointArray == 0) {
printf("out of memory\n");
exit(1);
}
/*
* Allocate memory to hold noOfOccurances. The index/10000 is the
* the floating point number. The contents is the count.
*
* E.g. noOfOccurances[12345] = 20, means 1.2345f occurs 20 times
* in largeFloatingPointArray.
*/
noOfOccurances = (int *)calloc(X_MAX, sizeof(int));
if (noOfOccurances == 0) {
printf("out of memory\n");
exit(1);
}
for (j = 0; j < LARGE_ARRAY_MAX; j++) {
y = (int)(largeFloatingPointArray[j] * 10000);
if (y >= 0 && y <= X_MAX) {
noOfOccurances[y]++;
}
}
}