Finding least number in a set with CUDA - c++

Suppose you had a function that would take in a vector, a set of vectors, and find which vector in the set of vectors was closest to the original vector. It may be useful if I included some code:
int findBMU(float * inputVector, float * weights){
    int count = 0;
    float currentDistance = 0;
    int winner = 0;
    float leastDistance = 99999;
    for(int i = 0; i<10; i++){
        for(int j = 0;j<10; j++){
            for(int k = 0; k<10; k++){
                int offset = (i*100+j*10+k)*644;
                for(int i = offset; i<offset+644; i++){
                    currentDistance += abs((inputVector[count]-weights[i]))*abs((inputVector[count]-weights[i]));
                    count++;
                }
                currentDistance = sqrt(currentDistance);
                count = 0;
                if(currentDistance<leastDistance){
                    winner = offset;
                    leastDistance = currentDistance;
                }
                currentDistance = 0;
            }
        }
    }
    return winner;
}
In this example, weights is a single dimensional array, with a block of 644 elements corresponding to one vector. inputVector is the vector that's being compared, and it also has 644 elements.
To speed up my program, I decided to take a look at the CUDA framework provided by NVIDIA. This is what my code looked like once I changed it to fit CUDA's specifications.
__global__ void findBMU(float * inputVector, float * weights, int * winner, float * leastDistance){
    int i = threadIdx.x+(blockIdx.x*blockDim.x);
    if(i<1000){
        int offset = i*644;
        int count = 0;
        float currentDistance = 0;
        for(int w = offset; w<offset+644; w++){
            currentDistance += abs((inputVector[count]-weights[w]))*abs((inputVector[count]-weights[w]));
            count++;
        }
        currentDistance = sqrt(currentDistance);
        count = 0;
        if(currentDistance<*leastDistance){
            *winner = offset;
            *leastDistance = currentDistance;
        }
        currentDistance = 0;
    }
}
To call the function, I used: findBMU<<<20, 50>>>(d_data, d_weights, d_winner, d_least);
But when I called the function, it would sometimes give me the right answer and sometimes not. After doing some research, I found that CUDA has some issues with reduction problems like these, but I couldn't find how to fix it. How can I modify my program to make it work with CUDA?

The issue is that threads running concurrently all read the same leastDistance and overwrite each other's results: the comparison and the two writes that follow it are not a single atomic step. Two values are shared between threads: leastDistance and winner. You have two basic options. You can write out the results from all the threads and then do a second pass over the data with a parallel reduction to determine which vector had the best match, or you can implement this with a custom atomic operation using atomicCAS().
The first method is the easiest. My guess is that it will also give you the best performance, though it does add a dependency on the free Thrust library. You would use thrust::min_element().
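As a rough illustration of the two-pass idea, here is a host-side C++ sketch rather than CUDA code. The helper names allDistances and indexOfSmallest are made up for this example. The first function plays the role of the kernel, where each candidate writes its distance to its own slot and nothing is shared; the second plays the role of the reduction, using std::min_element, which takes the same (first, last) iterator pair that thrust::min_element does.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Pass 1 (hypothetical helper): one independent distance per candidate vector.
// In the CUDA version each thread would compute one entry and store it at its
// own index, so no thread touches a value that another thread writes.
std::vector<float> allDistances(const std::vector<float>& input,
                                const std::vector<float>& weights) {
    const std::size_t dim = input.size();
    const std::size_t numVectors = weights.size() / dim;
    std::vector<float> distances(numVectors);
    for (std::size_t v = 0; v < numVectors; ++v) {
        float sum = 0.0f;
        for (std::size_t k = 0; k < dim; ++k) {
            const float diff = input[k] - weights[v * dim + k];
            sum += diff * diff;
        }
        distances[v] = std::sqrt(sum);
    }
    return distances;
}

// Pass 2 (hypothetical helper): reduce the distances to the index of the smallest.
std::size_t indexOfSmallest(const std::vector<float>& distances) {
    return static_cast<std::size_t>(
        std::distance(distances.begin(),
                      std::min_element(distances.begin(), distances.end())));
}
```

Multiplying the returned index by the vector length (644 in the question) gives the same offset that the original findBMU returned.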
The method using atomicCAS() uses the fact that atomicCAS() has a 64-bit mode, in which you can assign any semantics that you wish to a 64-bit value. In your case, you would use 32 bits to store leastDistance and 32 bits to store winner. To use this method, adapt this example in the CUDA C Programming Guide that implements a double precision floating point atomicAdd().
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
        (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
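To show the shape of the packed (leastDistance, winner) update without needing a GPU, here is a host-side C++ analogue that uses std::atomic and compare_exchange_weak in place of the 64-bit atomicCAS(). The names pack, unpackDistance, unpackIndex and atomicMinPair are invented for this sketch, and the layout (distance bits in the high half, index in the low half) is just one possible choice. A device version would follow the same compare-and-retry pattern as the atomicAdd() example above.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Pack a distance and an index into one 64-bit word (assumed layout:
// float bits in the high 32 bits, index in the low 32 bits).
std::uint64_t pack(float distance, std::uint32_t index) {
    std::uint32_t bits;
    std::memcpy(&bits, &distance, sizeof bits);
    return (static_cast<std::uint64_t>(bits) << 32) | index;
}

float unpackDistance(std::uint64_t packed) {
    const std::uint32_t bits = static_cast<std::uint32_t>(packed >> 32);
    float distance;
    std::memcpy(&distance, &bits, sizeof distance);
    return distance;
}

std::uint32_t unpackIndex(std::uint64_t packed) {
    return static_cast<std::uint32_t>(packed & 0xFFFFFFFFu);
}

// Store (distance, index) only if distance is strictly smaller than the stored
// one. Both halves change in a single compare-and-swap, so no other thread can
// ever observe a distance paired with the wrong index.
void atomicMinPair(std::atomic<std::uint64_t>& best, float distance,
                   std::uint32_t index) {
    const std::uint64_t candidate = pack(distance, index);
    std::uint64_t observed = best.load();
    while (distance < unpackDistance(observed) &&
           !best.compare_exchange_weak(observed, candidate)) {
        // A failed exchange refreshes `observed`; the loop condition then
        // re-checks whether our distance is still the smaller one.
    }
}
```

The shared word would be initialised with something like pack(99999.0f, 0), mirroring the starting leastDistance in the question.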

Why am I getting the error message, expected unqualified id before 'for'

I have to write a code that does this for a university assignment. I have written the average function of the code but I'm getting an error message for the other function.
The DEVBOARD_readAccelerometer function will read x,y,z components of acceleration
Firstly, you will need to write a function:
int average(int *array, int nLen);
which returns the average value of an array of integer values. (To prevent overflow, it is suggested to use a long internal variable).
The Spirit Level function will run a loop. For each iteration the Z-component of gravitational acceleration will be sampled four times at 50ms intervals and stored in a suitably sized array.
Using the average() function you have written, determine the average value and analyze
int average(int* array, int nlen) { // assuming array is int
long sum = 0L; // sum will be larger than an item, long for safety.
for (int i = 0; i < nlen; i++) {
sum += array[i];
}
return ((long)sum) / nlen;
}
float sum[3];
int j;
for (int j = 0; j < 3; j++) {
i = DEVBOARD_readAccelerometer(int* xAccel, int* yAccel, int* zAccel);
sum[j] = int * zAccel;
}
average(float* sum[j], float nlen);
printf("The average zcomponent is %f\n");
}
unqualified-id before for
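The for loop and the statements around it are not inside any function body, and only declarations are allowed at file scope, which is what produces "expected unqualified-id before 'for'". Put them inside a function. You also declared variable j 2 times, once before the for loop and once inside the for definition; inside a function the second declaration merely shadows the first, but one of them should go.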
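For reference, a minimal sketch of how the sampling code could look once it lives inside a function. The real DEVBOARD_readAccelerometer signature is not shown in the question, so the version below is a hypothetical stand-in that takes three int pointers and always reports the same reading; sampleAverageZ is likewise a made-up name. The 50 ms delay from the assignment is left as a comment because the board's delay call is unknown.

```cpp
// Hypothetical stand-in for the board library call; the real signature and
// return type may differ. It always reports the same fixed reading.
void DEVBOARD_readAccelerometer(int* xAccel, int* yAccel, int* zAccel) {
    *xAccel = 0;
    *yAccel = 0;
    *zAccel = 980;
}

// Average of an int array, using a long accumulator to avoid overflow.
int average(int* array, int nLen) {
    long sum = 0L;
    for (int i = 0; i < nLen; i++) {
        sum += array[i];
    }
    return static_cast<int>(sum / nLen);
}

// Sample the Z component four times and return the average (made-up name).
int sampleAverageZ() {
    const int kSamples = 4;
    int samples[kSamples];
    for (int j = 0; j < kSamples; j++) {  // j is declared once, in the loop header
        int x = 0, y = 0, z = 0;
        DEVBOARD_readAccelerometer(&x, &y, &z);  // pass addresses, not declarations
        samples[j] = z;
        // On the real board, wait 50 ms here before taking the next sample.
    }
    return average(samples, kSamples);
}
```

Since average() returns an int, the result would be printed with %d and passed to printf as an argument, for example printf("The average z component is %d\n", sampleAverageZ());.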

Why does std::chrono say my function took zero nanoseconds to execute?

I am working on a project where I implement two popular MST algorithms in C++ and then print how long each one takes to execute. Please ignore the actual algorithms, I have already tested them and am only interested in getting accurate measurements of how long they take.
void Graph::krushkalMST(bool e){
size_t s2 = size * size;
typedef struct{uint loc; uint val;} wcType; //struct used for storing a copy of the weights values to be sorted, with original locations
wcType* weightsCopy = new wcType[s2]; //copy of the weights which will be sorted.
for(int i = 0; i < s2; i++){
weightsCopy[i].loc = i;
weightsCopy[i].val = weights[i];
}
std::vector<uint> T(0); //List of edges in the MST
auto start = std::chrono::high_resolution_clock::now(); //time the program was started
typedef int (*cmpType)(const void*, const void*); //comparison function type
static cmpType cmp = [](const void* ua, const void* ub){ //Compare function used by the sort as a C++ lambda
uint a = ((wcType*)ua)->val, b = ((wcType*)ub)->val;
return (a == b) ? 0 : (a == NULLEDGE) ? 1 : (b == NULLEDGE) ? -1 : (a < b) ? -1 : 1;
};
std::qsort((void*)weightsCopy, s2, sizeof(wcType), cmp); //sort edges into ascending order using a quick sort (supposedly quick sort)
uint* componentRefs = new uint[size]; //maps nodes to what component they currently belong to
std::vector<std::vector<uint>> components(size); //vector of components, each component is a vector of nodes;
for(int i = 0; i < size; i++){
//unOptimize(components);
components[i] = std::vector<uint>({(uint)i});
componentRefs[i] = i;
}
for(int wcIndex = 0; components.size() >= 2 ; wcIndex++){
uint i = getI(weightsCopy[wcIndex].loc), j = getJ(weightsCopy[wcIndex].loc); //get pair of nodes with the smallest edge
uint ci = componentRefs[i], cj = componentRefs[j]; //locations of nodes i and j
if(ci != cj){
T.push_back(weightsCopy[wcIndex].loc); //push the edge into T
for(int k = 0; k < components[cj].size(); k++) //move each member in j's component to i's component
components[ci].push_back(components[cj][k]);
for(int k = 0; k < components[cj].size(); k++) //copy this change into the reference locations
componentRefs[components[cj][k]] = ci;
components.erase(components.begin() + cj); //delete j's component
for(int k = 0; k < size; k++)
if(componentRefs[k] >= cj)
componentRefs[k]--;
}
}
auto end = std::chrono::high_resolution_clock::now();
uint time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
std::cout<<"\nMST found my krushkal's Algorithm:\n";
printData(time, T, e);
delete[] weightsCopy;
delete[] componentRefs;
}
void Graph::primMST(bool e){
std::vector<uint> T(0); //List of edges in the MST
auto start = std::chrono::high_resolution_clock::now(); //Start calculating the time the algorithm takes
bool* visited = new bool[size]; //Maps each node to a visited value
visited[0] = true;
for(int i = 1; i < size; i++)
visited[i] = false;
for(uint numVisited = 1; numVisited < size; numVisited++){
uint index = 0; //index of the smallest cost edge to unvisited node
uint minCost = std::numeric_limits<uint>::max(); //cost of the smallest edge filling those conditions
for(int i = 0; i < size; i++){
if(visited[i]){
for(int j = 0; j < size; j++){
if(!visited[j]){
uint curIndex = i * size + j, weight = dweights[curIndex];
if(weight != NULLEDGE && weight < minCost){
index = curIndex;
minCost = weight;
}
}
}
}
}
T.push_back(index);
visited[getI(index)] = true;
}
auto end = std::chrono::high_resolution_clock::now();
uint time = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count();
std::cout<<"\nMST found my Prim's Algorithm:\n";
printData(time, T, e);
delete[] visited;
}
I initially used clock() from <ctime> to try to get an accurate measurement of how long this would take. My largest test file has a graph of 40 nodes with 780 edges (large enough to warrant some compute time), and even then, on a slow computer using g++ with -O0, I would get either 0 or 1 milliseconds. On my desktop I was only ever able to get 0 ms. Since I need a more accurate way to distinguish time between test cases, I decided to try the high_resolution_clock provided by the <chrono> library.
This is where the real trouble began: I would (and still do) consistently get that the program took 0 nanoseconds to execute.
In my search for a solution I came across multiple questions that deal with similar issues, most of which state that <chrono> is system dependent and you're unlikely to actually be able to get nanosecond or even microsecond values. Nevertheless, I tried using std::chrono::microseconds only to still consistently get 0. Eventually I found what I thought was someone who was having the same problem as me:
counting duration with std::chrono gives 0 nanosecond when it should take long
However, that is clearly a problem of an overactive optimizer which has deleted an unnecessary piece of code, whereas in my case the end result always depends on the results of a series of complex loops which must be executed in full. I am on Windows 10, compiling with GCC using -O0.
My best hypothesis is that I'm doing something wrong, or that Windows doesn't support anything smaller than milliseconds while using std::chrono and std::chrono::nanoseconds values are actually just milliseconds padded with 0s on the end (as I observe when I put a system("pause") in the algorithm and unpause at arbitrary times). Please let me know if you find any way around this or if there is any other way I can achieve higher resolution time.
At the request of @Ulrich Eckhardt, I am including a minimal reproducible example as well as the results of the test I performed using it, and I must say it is rather insightful.
#include<iostream>
#include<chrono>
#include<cmath>
int main()
{
    double c = 1;
    for(int itter = 1; itter < 10000000; itter *= 10){
        auto start = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < itter; i++)
            c += sqrt(c) + log(c);
        auto end = std::chrono::high_resolution_clock::now();
        int time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
        std::cout<<"calculated: "<<c<<". "<<itter<<" iterations took "<<time<<"ns\n";
    }
    system("pause");
}
For my loop I choose a random arbitrary mathematical formula and make sure to use the result of what the loop does so it's not optimized out of existence. Testing it with various iterations on my desktop yields:
This seems to imply that a certain threshold is required before it starts counting time: taking the first result that yields a non-zero time and dividing that time by 10 gives another non-zero time, which is not what the previous result reports, even though that is how it should work, assuming the whole loop takes O(n) time for n iterations. If anything, this small example baffles me even further.
Switch to steady_clock and you get the correct results for both MSVC and MinGW GCC.
You should avoid using the high_resolution_clock as it is just an alias to either steady_clock or system_clock. For measuring elapsed time in a stopwatch-like fashion, you always want steady_clock. high_resolution_clock is an unfortunate thing and should be avoided.
I just checked and MSVC has the following:
using high_resolution_clock = steady_clock;
while MinGW GCC has:
/**
 *  @brief Highest-resolution clock
 *
 *  This is the clock "with the shortest tick period." Alias to
 *  std::system_clock until higher-than-nanosecond definitions
 *  become feasible.
 */
using high_resolution_clock = system_clock;
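A small sketch of stopwatch-style timing with steady_clock. The helper names elapsedNanoseconds and busyWork are invented for this example, and busyWork simply repeats the arithmetic from the question's test loop. The count is returned as a 64-bit integer because a nanosecond count stored in a 32-bit int overflows after roughly 2.1 seconds.

```cpp
#include <chrono>
#include <cmath>
#include <cstdint>

// Hypothetical helper: run `work` once and return the elapsed time in
// nanoseconds, measured with the monotonic steady_clock.
template <typename F>
std::int64_t elapsedNanoseconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}

// The arithmetic from the question's test loop; returning the result keeps
// the compiler from discarding the work.
double busyWork(int iterations) {
    double c = 1.0;
    for (int i = 0; i < iterations; i++) {
        c += std::sqrt(c) + std::log(c);
    }
    return c;
}
```

It could be used as: double r = 0; auto ns = elapsedNanoseconds([&] { r = busyWork(100000); });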

Synchronize functions using pthread to do some simple operations on an array

I am studying pthread but am confused about how to use pthread to synchronize the functions.
For example, I have some simple code that does a few operations on an array, like the following:
float add(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + numbers[i] +5;
}
return sum/5;
}
float subtract(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + numbers[i] -10;
}
return sum/5;
}
float mul(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + (float)numbers[i] * 1.5 ;
}
return sum/5;
}
float div(int numbers[5]){
float sum;
for(int i = 0; i < 5; i++){
sum = sum + (float)numbers[i]/ 2;
}
return sum/5;
}
int main(){
int numbers [5] = { 34, 2, 77, 40, 12 };
float addition = add(numbers);
float subtraction = subtract(numbers);
float multiplication = mul(numbers);
float division = div(numbers);
cout << addition + subtraction + multiplication + division << endl;
return -1;
}
Since all four functions are independent of each other and use the same input, how can I put each operation into its own thread and let the functions (or threads) run at the same time?
I think that if one day I have a very large array and run the program as above, it will take a lot of time, but if I can make the functions run simultaneously, it will save a lot of time.
First of all, I suspect you are not clear on how arrays are passed into functions. float subtract(int numbers[5]) does not say anything about the size of the passed array. It is equivalent to float subtract(int numbers[]) (no size), which, in turn, is equivalent to float subtract(int* numbers) (pointer to int). You also have a bug in your function, since you do not initialize sum before its first use (the same applies to the other functions).
With this in mind, the whole subtract function is better written like this:
float subtract(int* numbers, const size_t size) {
    float sum = 0;
    for(size_t i = 0; i < size; i++) {
        sum = sum + numbers[i] - 10;
    }
    return sum / size; // divide by the actual element count, not a hard-coded 5
}
Now that we have cleaned up the function itself, we can tackle multithreading. I really suggest ditching pthreads and using the C++11 threading facilities instead. That is especially true when you need to get the result back as the return value of the function. Doing it with pthreads would require too much typing. The relevant code in C++ would look similar to this:
int numbers[] = {34, 2, 77, 40, 12}; // No need to provide array size when initialized
auto sub_result = std::async(std::launch::async, &subtract, numbers, sizeof(numbers) / sizeof(*numbers));
auto div_result = ....
// rest of functions
std::cout << "Result of subtraction: " << sub_result.get();
Now, this is a lot to grasp :) std::async is a way to run a function asynchronously without worrying about multithreading at all. The task of threading is delegated to the standard library. It is a much cleaner way than using pthreads: see, the invocation is not much different from a normal function call! The only thing to keep in mind is that it returns a so-called std::future object, a special object on which you can wait until the function that was launched completes execution. It also has a get function, which waits until the function is completed and returns its result. Nice, eh?
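Putting the pieces together, a compact sketch with two of the four operations might look like the following. The names addMean, subtractMean and runBothConcurrently are made up for this example, and the remaining operations would follow the same pattern. Depending on the toolchain, the program may need to be built with thread support enabled (for example -pthread with GCC).

```cpp
#include <cstddef>
#include <future>

// Mean of (numbers[i] + 5), as in the question's add().
float addMean(const int* numbers, std::size_t size) {
    float sum = 0;
    for (std::size_t i = 0; i < size; i++) {
        sum += numbers[i] + 5;
    }
    return sum / size;
}

// Mean of (numbers[i] - 10), as in the question's subtract().
float subtractMean(const int* numbers, std::size_t size) {
    float sum = 0;
    for (std::size_t i = 0; i < size; i++) {
        sum += numbers[i] - 10;
    }
    return sum / size;
}

// Launch both computations on their own threads, then combine the results.
// Each get() blocks until the corresponding task has finished.
float runBothConcurrently(const int* numbers, std::size_t size) {
    auto addResult = std::async(std::launch::async, addMean, numbers, size);
    auto subResult = std::async(std::launch::async, subtractMean, numbers, size);
    return addResult.get() + subResult.get();
}
```

Both tasks only read the shared array, so no locking is needed here. For an array of five ints the cost of starting threads will far outweigh the work; the benefit only shows up with much larger inputs.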

Calculating mean of an array

Hello, I'm having issues calculating the mean in my function. The program compiles, but instead of the intended answer of 64.2 it prints a random set of integers and characters.
This is not the entirety of the code but only the appropriate variables and functions.
// main function and prototyping would be here
int size=0;
float values[]={10.1, 9.2, 7.9, 9.2, 13.0, 12.7, 11.3};
float mean(float values[], int size)
{
float sum = 0;
float mean = 0;
for (size = 0; size > 7; size++)
{
sum += values[size];
mean = sum / 7;
}
return mean;
}
Change your loop like so:
for (size = 0; size < 7; size++)
{
sum += values[size];
}
mean = sum / 7;
The terminating condition of your for loop isn't right.
Move the mean calculation out of the for loop.
for (size = 0; size > 7; size++)
Since size is initialized to 0, the test size > 7 is already false before the first iteration, so the loop body never runs and the loop exits immediately.
Secondly, you calculate mean inside the loop when you should calculate it after the loop is complete. Theoretically, you would still get a correct value, since each pass recomputes it from the sum so far, but it is a waste of time. You also wipe out the size parameter by reusing it as the loop counter.
float mean(float values[], int size)
{
    float sum = 0;
    float mymean = 0;
    for (int i = 0; i < size; i++)
    {
        sum += values[i];
    }
    mymean = sum / size;
    return mymean;
}
Why is the test size > 7 there? Expecting your initial value to have an unusually large value of zero? It's likely that you mean size < 7, though using arbitrary magic numbers like that is trouble.
What you probably want is:
float mean(float* values, int size)
{
    float sum = 0;
    for (int i = 0; i < size; ++i)
        sum += values[i];
    return sum / size;
}
To be more C++ you'd want that signature to be:
float mean(const float* values, const size_t size)
That way you'd catch any mistakes with modifying those values.
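One further option, offered only as a sketch: when the data is a real array (not a pointer), a template can deduce the element count from the array type, so neither a magic 7 nor a separate size argument is needed. The name meanOfArray is made up for this example.

```cpp
#include <cstddef>

// N is deduced from the array type at the call site, so the element count
// can never drift out of sync with the data.
template <std::size_t N>
float meanOfArray(const float (&values)[N]) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < N; ++i) {
        sum += values[i];
    }
    return sum / N;
}
```

This works only for arrays whose size is known at compile time; for pointers or dynamically sized data the (pointer, size) signature above is still the one to use.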

I want to optimize this short loop

I would like to optimize this simple loop:
unsigned int i;
while(j-- != 0){ //j is an unsigned int with a start value of about N = 36.000.000
float sub = 0;
i=1;
unsigned int c = j+s[1];
while(c < N) {
sub += d[i][j]*x[c];//d[][] and x[] are arrays of float
i++;
c = j+s[i];// s[] is an array of unsigned int with 6 entries.
}
x[j] -= sub; // only one memory-write per j
}
The loop has an execution time of about one second with a 4000 MHz AMD Bulldozer. I thought about SIMD and OpenMP (which I normally use to get more speed), but this loop is recursive.
Any suggestions?
I think you may want to transpose the matrix d, that is, store it in such a way that the indices are exchanged, with j as the outer index and i as the inner one:
sub += d[j][i]*x[c];
instead of
sub += d[i][j]*x[c];
This should result in better cache performance.
I agree with transposing for better caching (but see my comments on that at the end), and there's more to do, so let's see what we can do with the full function...
Original function, for reference (with some tidying for my sanity):
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
//We want to solve L D Lt x = b where D is a diagonal matrix described by Diagonals[0] and L is a unit lower triangular matrix described by the rest of the diagonals.
//Let D Lt x = y. Then, first solve L y = b.
float *y = new float[n];
float **d = IncompleteCholeskyFactorization->Diagonals;
unsigned int *s = IncompleteCholeskyFactorization->StartRows;
unsigned int M = IncompleteCholeskyFactorization->m;
unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i, j;
for(j = 0; j != N; j++){
float sub = 0;
for(i = 1; i != M; i++){
int c = (int)j - (int)s[i];
if(c < 0) break;
if(c==j) {
sub += d[i][c]*b[c];
} else {
sub += d[i][c]*y[c];
}
}
y[j] = b[j] - sub;
}
//Now, solve x from D Lt x = y -> Lt x = D^-1 y
// Took this one out of the while, so it can be parallelized now, which speeds up, because division is expensive
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] = y[j]/d[0][j];
}
while(j-- != 0){
float sub = 0;
for(i = 1; i != M; i++){
if(j + s[i] >= N) break;
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
delete[] y;
}
Because of the comment about parallel divide giving a speed boost (despite being only O(N)), I'm assuming the function itself gets called a lot. So why allocate memory? Just mark x as __restrict__ and change y to x everywhere (__restrict__ is a GCC extension, taken from C99. You might want to use a define for it. Maybe the library already has one).
Similarly, though I guess you can't change the signature, you can make the function take only a single parameter and modify it. b is never used when x or y have been set. That would also mean you can get rid of the branch in the first loop which runs ~N*M times. Use memcpy at the start if you must have 2 parameters.
And why is d an array of pointers? Must it be? This seems too deep in the original code, so I won't touch it, but if there's any possibility of flattening the stored array, it will be a speed boost even if you can't transpose it (multiply, add, dereference is faster than dereference, add, dereference).
So, new code:
void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
// comments removed so that suggestions are more visible. Don't remove them in the real code!
// these definitions got long. Feel free to remove const; it does nothing for the optimiser
const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
const unsigned int M = IncompleteCholeskyFactorization->m;
const unsigned int N = IncompleteCholeskyFactorization->n;
unsigned int i;
unsigned int j;
for(j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
float sub = 0;
for(i = 1; i < M && j >= s[i]; i++){
const unsigned int c = j - s[i];
sub += d[i][c]*x[c];
}
x[j] -= sub;
}
// Consider using processor-specific optimisations for this
#pragma omp parallel for
for(j = 0; j < N; j++){
x[j] /= d[0][j];
}
for( j = N; (j --) > 0; ){ // changed for clarity
float sub = 0;
for(i = 1; i < M && j + s[i] < N; i++){
sub += d[i][j]*x[j + s[i]];
}
x[j] -= sub;
}
}
Well it's looking tidier, and the lack of memory allocation and reduced branching, if nothing else, is a boost. If you can change s to include an extra UINT_MAX value at the end, you can remove more branches (both the i<M checks, which again run ~N*M times).
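A sketch of the sentinel idea, using std::vector in place of the real class members and assuming the offsets in s are in ascending order with s[0] == 0 for the main diagonal. The function names are made up for this example. One detail to watch: in the backward loop the test j + s[i] < N would wrap around in unsigned arithmetic once s[i] is UINT_MAX, so the sketch writes it as s[i] < N - j instead, which cannot overflow because j < N.

```cpp
#include <climits>
#include <vector>

// Append the UINT_MAX sentinel so the inner loops need no separate i < M test.
std::vector<unsigned int> withSentinel(std::vector<unsigned int> s) {
    s.push_back(UINT_MAX);
    return s;
}

// Forward pass. j >= s[i] is false for the sentinel because j < N <= UINT_MAX.
void forwardPass(std::vector<float>& x,
                 const std::vector<std::vector<float>>& d,
                 const std::vector<unsigned int>& s) {
    const unsigned int N = static_cast<unsigned int>(x.size());
    for (unsigned int j = 0; j < N; j++) {
        float sub = 0;
        for (unsigned int i = 1; j >= s[i]; i++) {
            const unsigned int c = j - s[i];
            sub += d[i][c] * x[c];
        }
        x[j] -= sub;
    }
}

// Backward pass. s[i] < N - j is false for the sentinel and never wraps.
void backwardPass(std::vector<float>& x,
                  const std::vector<std::vector<float>>& d,
                  const std::vector<unsigned int>& s) {
    const unsigned int N = static_cast<unsigned int>(x.size());
    for (unsigned int j = N; j-- > 0; ) {
        float sub = 0;
        for (unsigned int i = 1; s[i] < N - j; i++) {
            sub += d[i][j] * x[j + s[i]];
        }
        x[j] -= sub;
    }
}
```

Whether dropping the i < M test gives a measurable gain would need to be checked with a profiler on the real data.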
Now we can't make any more loops parallel, and we can't combine loops. The boost now will be, as suggested in the other answer, to rearrange d. Except… the work required to rearrange d has exactly the same cache issues as the work to do the loop. And it would need memory allocated. Not good. The only options to optimise further are: change the structure of IncompleteCholeskyFactorization->Diagonals itself, which will probably mean a lot of changes, or find a different algorithm which works better with data in this order.
If you want to go further, your optimisations will need to impact quite a lot of the code (not a bad thing; unless there's a good reason for Diagonals being an array of pointers, it seems like it could do with a refactor).
I want to give an answer to my own question: the bad performance was caused by cache conflict misses, due to the fact that (at least) Win7 aligns big memory blocks to the same boundary. In my case, the addresses of all buffers had the same alignment (bufferaddress % 4096 was the same for all buffers), so they fell into the same cache set of the L1 cache. I changed the memory allocation to align the buffers to different boundaries to avoid cache conflict misses and got a speedup of a factor of 2. Thanks for all the answers, especially the answers from Dave!
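The post does not say how the allocation was changed, so the following is only one possible way to stagger buffers, with made-up names (StaggeredBuffer, allocateStaggered). Each buffer is over-allocated by one 4096-byte page, and the usable region is shifted so that its address modulo 4096 equals a chosen offset; giving each buffer a different offset keeps them from all starting on the same boundary.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

struct StaggeredBuffer {
    std::unique_ptr<float[]> storage;  // owns the over-allocated block
    float* data;                       // start of the usable, shifted region
};

// pageOffset is assumed to be a multiple of sizeof(float) and less than 4096.
StaggeredBuffer allocateStaggered(std::size_t count, std::size_t pageOffset) {
    const std::size_t kPage = 4096;
    StaggeredBuffer buf;
    // One extra page of floats leaves room for any shift below 4096 bytes.
    buf.storage.reset(new float[count + kPage / sizeof(float)]);
    const std::uintptr_t base =
        reinterpret_cast<std::uintptr_t>(buf.storage.get());
    // Bytes to skip so that (base + shiftBytes) % kPage == pageOffset.
    const std::size_t shiftBytes = (pageOffset + kPage - base % kPage) % kPage;
    buf.data = buf.storage.get() + shiftBytes / sizeof(float);
    return buf;
}
```

For example, three buffers could be requested with offsets 0, 256 and 512. Which offsets actually avoid conflicts depends on the cache geometry of the target CPU, so they would need to be tuned by measurement.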