I am working on a compiler that generates parallel C++ code. I am new to CUDA programming, but I am trying to parallelize the generated C++ code with CUDA.
Currently if I have the following sequential C++ code:
for(int i = 0; i < a; i++) {
    for(int j = 0; j < b; j++) {
        for(int k = 0; k < c; k++) {
            A[i*y*z + j*z + k] = 1;
        }
    }
}
and this results in the following CUDA code:
__global__ void kernelExample() {
    int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
    A[_cu_x*y*z + _cu_y*z + _cu_z] = 1;
}
so each loop is mapped to one dimension of the grid, but what would be the correct way to parallelize four or more nested loops:
for(int i = 0; i < a; i++) {
    for(int j = 0; j < b; j++) {
        for(int k = 0; k < c; k++) {
            for(int l = 0; l < d; l++) {
                A[i*x*y*z + j*y*z + k*z + l] = 1;
            }
        }
    }
}
Is there a similar way to do this? Note that all loop dimensions are parallel and there are no dependencies between iterations.
Thanks in advance!
EDIT: the goal is to map all iterations to CUDA threads, since all iterations are independent and could be executed concurrently.
You could keep the outermost loop unchanged. Also, it is better to map .x to the innermost loop so that the global memory accesses are coalesced and therefore efficient.
__global__ void kernelExample() {
    int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
    for(int i = 0; i < a; i++) {
        A[i*x*y*z + _cu_z*y*z + _cu_y*z + _cu_x] = 1;
    }
}
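For completeness, here is a sketch of a possible host-side launch for this kernel; the block size is an arbitrary choice of mine, and it assumes b, c and d are exact multiples of the block dimensions, since the kernel has no bounds checks:
dim3 block(32, 4, 4);
dim3 grid((d + block.x - 1) / block.x,  // .x covers the innermost loop (l), coalesced
          (c + block.y - 1) / block.y,  // .y covers k
          (b + block.z - 1) / block.z); // .z covers j
kernelExample<<<grid, block>>>();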
However, if a, b, c and d are all very small, you may not be able to get enough parallelism this way. In that case you could convert a linear thread index into n-D indices:
__global__ void kernelExample() {
    int tid = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int i = tid / (b*c*d);
    int j = tid / (c*d) % b;
    int k = tid / d % c;
    int l = tid % d;
    A[i*x*y*z + j*y*z + k*z + l] = 1;
}
But be careful: computing i, j, k and l this way may introduce a lot of overhead, as integer division and modulo are slow on the GPU. As an alternative, you could map i and j to .z and .y, and derive only k, l (and any further dimensions) from .x in a similar way.
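For example, a sketch of that hybrid mapping (the kernel name is mine; A, x, y, z and d are assumed to be accessible exactly as in the snippets above, and the grid is assumed to cover a in .z, b in .y and c*d in .x):
__global__ void kernelExample4D() {
    int _cu_i  = ((blockIdx.z*blockDim.z)+threadIdx.z);
    int _cu_j  = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_kl = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_k  = _cu_kl / d; // only one division/modulo pair per thread
    int _cu_l  = _cu_kl % d;
    A[_cu_i*x*y*z + _cu_j*y*z + _cu_k*z + _cu_l] = 1;
}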
Related
I want to find the Big-O notation for my code. It has three nested loops, and each loop has a parameter that may vary.
According to my understanding (I am not sure if that is correct),
the time complexity is O(NKC), where N is the size in the outer loop, K is a constant provided by the user, and C is also a constant that may change when using another dataset.
my code:
for (int m = 0; m < size; m++)
{
    int array_Y_class_target[2]{};
    float CT[2]{};
    float SumOf_Each_class_distances[2] = { 0.0 };
    int min_index = -1;
    for (int i = k; i > 0; --i) {
        for (int c = 0; c < 2; ++c) {
            for (int j = 0; j < i; ++j) {
                int index = index_arr[j];
                if (Y_train[index] == c)
                {
                    array_Y_class_target[c]++;
                    float dist = array_dist[index_arr[j]];
                    SumOf_Each_class_distances[c] += dist;
                }
            }
            if (array_Y_class_target[c] != 0)
            {
                CT[c] = (((float)k / (float)array_Y_class_target[c]) + (SumOf_Each_class_distances[c] / (float)array_Y_class_target[c]));
            }
            else
            {
                CT[c] = 1.5; // max CT value
            }
        }
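For reference, counting the innermost-loop iterations directly from the bounds shown above (and treating the body of the if as constant-time work) gives
$$N \cdot \sum_{i=1}^{k} \sum_{c=0}^{1} \sum_{j=0}^{i-1} 1 \;=\; N \sum_{i=1}^{k} 2i \;=\; N\,k(k+1) \;\in\; O(N k^2),$$
where N = size; the dependence on k is quadratic because the bound of the innermost loop (j < i) itself depends on i.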
I'm trying to learn C++ and wrote this algorithm, and I am wondering if there is a faster way to do the same thing (assuming the input is valid). I was trying to think of how to remove the nested for loop, but decided that it is fine since it is not exponential. Is this correct? Thanks
void DigitSort(int* arr, int size)
{
    int counts[10] = { 0,0,0,0,0,0,0,0,0,0 };
    int k = -1;
    while (++k < size)
        counts[arr[k]]++;
    k = -1;
    for (int j = 0; j < 10; ++j)
        for (int i = 0; i < counts[j]; ++i)
            arr[++k] = j;
}
There is no benchmark, but here is a (probably) faster solution using std::fill_n:
#include <algorithm> // std::fill_n

void DigitSort(int* arr, int size)
{
    int counts[10] = { 0,0,0,0,0,0,0,0,0,0 };
    int k = -1, sum_count = 0;
    while (++k < size)
        counts[arr[k]]++;
    for (k = 0; k < 10; ++k) {
        std::fill_n(arr + sum_count, counts[k], k);
        sum_count += counts[k];
    }
}
When I say "probably", it's because the compiler can optimize the std::fill_n to a memset-like instruction.
I'm trying to implement a quick program to solve a system of linear equations. The program reads the input from a file and then writes the upper-triangular system and the solutions to a file. It works without pivoting, but when I try to implement pivoting it produces incorrect results.
As example input, here is the following system of equations:
w+2x-3y+4z=12
2w+2x-2y+3z=10
x+y=-1
w-x+y-2z=-4
I expect the results to be w=1, x=0, y=-1 and z=2. When I don't pivot, I get this answer (with some rounding error on x). When I add in the pivoting, I get the same numbers but in the wrong order: w=2,x=1,y=-1 and z=0.
What do I need to do to get these in the correct order? Am I missing a step somewhere? I need to swap columns instead of rows because I need to adapt this to a parallel algorithm later that requires column swapping. Here is the code that does the elimination and back substitution:
void gaussian_elimination(double** A, double* b, double* x, int n)
{
    int maxIndex;
    double temp;
    int i;
    for (int k = 0; k < n; k++)
    {
        i = k;
        for (int j = k+1; j < n; j++)
        {
            if (abs(A[k][j]) > abs(A[k][i]))
            {
                i = j;
            }
        }
        if (i != k)
        {
            for (int j = 0; j < n; j++)
            {
                temp = A[j][k];
                A[j][k] = A[j][i];
                A[j][i] = temp;
            }
        }
        for (int j = k + 1; j < n; j++)
        {
            A[k][j] = A[k][j] / A[k][k];
        }
        b[k] = b[k] / A[k][k];
        A[k][k] = 1;
        for (i = k + 1; i < n; i++)
        {
            for (int j = k + 1; j < n; j++)
            {
                A[i][j] = A[i][j] - A[i][k] * A[k][j];
            }
            b[i] = b[i] - A[i][k] * b[k];
            A[i][k] = 0;
        }
    }
}

void back_substitution(double** U, double* x, double* y, int n)
{
    for (int k = n - 1; k >= 0; k--)
    {
        x[k] = y[k];
        for (int i = k - 1; i >= 0; i--)
        {
            y[i] = y[i] - x[k]*U[i][k];
        }
    }
}
I believe what you implemented is pivoting with column interchanges: you search row k for the largest entry and swap columns.
When you swap columns you permute the unknowns, so you must keep track of the permutation of the columns and undo it when reading off your answer.
You can do this with an array {0, 1, ..., n-1}, in which you swap the i'th and k'th values whenever you swap the corresponding columns (in your second loop). Then rearrange the solution using this array.
If what you were actually trying to do is the usual partial pivoting with row interchanges, you need to look for the maximum in the current column (at or below the pivot) instead, and swap the rows and the corresponding values of b accordingly.
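A minimal sketch of that bookkeeping (the helper names are mine, not from your code): perm[j] records which original unknown currently occupies column j; update it wherever the elimination swaps columns, and map the solution back afterwards.
#include <utility>
#include <vector>

std::vector<int> make_identity_perm(int n)
{
    std::vector<int> perm(n);
    for (int j = 0; j < n; j++) perm[j] = j;
    return perm;
}

// Call this right after the column swap inside gaussian_elimination:
void record_column_swap(std::vector<int>& perm, int i, int k)
{
    std::swap(perm[i], perm[k]);
}

// After back_substitution: x holds the solution of the column-permuted system,
// so put each value back with the unknown it really belongs to.
void unpermute_solution(const std::vector<int>& perm, const double* x, double* x_orig, int n)
{
    for (int j = 0; j < n; j++)
        x_orig[perm[j]] = x[j];
}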
I have to use a nested for-loop to compute the entries of an Eigen::MatrixXd matrix, output, column-wise. Here input[0], input[1] and input[2] are defined as Eigen::ArrayXXd in order to use element-wise operations. This part seems to be the bottleneck of my code. Can anyone help me accelerate this loop? Thanks!
for (int i = 0; i < r; i++) {
    for (int j = 0; j < r; j++) {
        for (int k = 0; k < r; k++) {
            output.col(i * (r * r) + j * r + k) =
                input[0].col(i) * input[1].col(j) * input[2].col(k);
        }
    }
}
When thinking about optimizing a for loop, it helps to ask, "Are there redundant calculations that I can eliminate?"
Notice how in the innermost loop only k is changing. You should move all calculations that don't involve k out of that loop:
for (int i = 0; i < r; i++) {
    int temp1 = i * (r * r);
    for (int j = 0; j < r; j++) {
        int temp2 = j * r;
        for (int k = 0; k < r; k++) {
            output.col(temp1 + temp2 + k) =
                input[0].col(i) * input[1].col(j) * input[2].col(k);
        }
    }
}
Notice how i * (r * r) is being calculated over and over, but the answer is always the same! You only need to recalculate this when i increments. The same goes for j * r.
Hopefully this helps!
To reduce the number of flops, you should cache the result of input[0].col(i) * input[1].col(j):
ArrayXd tmp(input[0].rows());
for (int i = 0; i < r; i++) {
    for (int j = 0; j < r; j++) {
        tmp = input[0].col(i) * input[1].col(j);
        for (int k = 0; k < r; k++) {
            output.col(i * (r * r) + j * r + k) = tmp * input[2].col(k);
        }
    }
}
Then, to fully use your CPU, enable AVX/FMA with -march=native and of course compiler optimizations (-O3).
Then, to get an idea of what you could gain further, measure accurately the time taken by this part, count the number of multiplications (r^2*(n+r*n)), and compute the number of floating-point operations per second you achieve. Then compare it to the capacity of your CPU. If you are already close to it, the only remaining option is to multithread one of the for loops using, e.g., OpenMP. Which loop to choose depends on the size of your inputs, but you can try the outermost one, making sure each thread has its own tmp array.
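A sketch of that multithreaded version (the function name and signature are my own framing; r, input and output are assumed to be as in the question, and it is compiled with -fopenmp in addition to -O3 -march=native):
#include <Eigen/Dense>

void fill_output(const Eigen::ArrayXXd input[3], Eigen::MatrixXd& output, int r)
{
    #pragma omp parallel for
    for (int i = 0; i < r; i++) {
        Eigen::ArrayXd tmp(input[0].rows()); // declared inside the loop: one per thread
        for (int j = 0; j < r; j++) {
            tmp = input[0].col(i) * input[1].col(j);
            for (int k = 0; k < r; k++) {
                output.col(i * (r * r) + j * r + k) = tmp * input[2].col(k);
            }
        }
    }
}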
I wrote code to test the performance of OpenMP on Windows (Win7 x64, Core i7 3.4 GHz) and on Mac (macOS 10.12.3, Core i7 2.7 GHz).
In Xcode I made a console application with the default compiler settings. I use LLVM 3.7 and OpenMP 5 (in omp.h I found #define KMP_VERSION_MAJOR 5, #define KMP_VERSION_MINOR 0 and KMP_VERSION_BUILD 20150701, libiomp5) on macOS 10.12.3 (CPU: Core i7, 2.7 GHz).
For Windows I use VS2010 SP1. Additionally, I set C/C++ -> Optimization -> Optimization = Maximize Speed (/O2) and C/C++ -> Optimization -> Favor Size Or Speed = Favor Fast Code (/Ot).
If I run the application in a single thread, the time difference roughly corresponds to the frequency ratio of the processors. But if I run 4 threads, the difference becomes striking: the Windows program is about 70 times faster than the Mac program.
#include <cmath>
#include <mutex>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <omp.h>
#include <boost/chrono/chrono.hpp>

static double ActionWithNumber(double number)
{
    double sum = 0.0f;
    for (std::uint32_t i = 0; i < 50; i++)
    {
        double coeff = sqrt(pow(std::abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    return sum;
}

static double TestOpenMP(void)
{
    const std::uint32_t len = 4000000;
    double *a;
    double *b;
    double *c;
    double sum = 0.0;
    std::mutex _mutex;

    a = new double[len];
    b = new double[len];
    c = new double[len];

    for (std::uint32_t i = 0; i < len; i++)
    {
        c[i] = 0.0;
        a[i] = sin((double)i);
        b[i] = cos((double)i);
    }

    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();

    double k = 2.0;
    omp_set_num_threads(4);
#pragma omp parallel for
    for (int i = 0; i < len; i++)
    {
        c[i] = k*a[i] + b[i] + k;
        if (c[i] > 0.0)
        {
            c[i] += ActionWithNumber(c[i]);
        }
        else
        {
            c[i] -= ActionWithNumber(c[i]);
        }
        std::lock_guard<std::mutex> scoped(_mutex);
        sum += c[i];
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;

    double sum2 = 0.0;
    for (std::uint32_t i = 0; i < len; i++)
    {
        sum2 += c[i];
        c[i] /= sum2;
    }
    if (std::abs(sum - sum2) > 0.01) printf("Incorrect result.\n");

    delete[] a;
    delete[] b;
    delete[] c;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const std::uint32_t steps = 5;
    for (std::uint32_t i = 0; i < steps; i++)
    {
        sum += TestOpenMP();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
I specifically use a mutex here to compare the performance of OpenMP on the Mac and on Windows. On Windows the function returns a time of 0.39 seconds. On the Mac the function returns a time of 25 seconds, i.e. about 70 times slower.
What is the cause of this difference?
First of all, thanks for editing my post (I use a translator to write the text).
In the real app, I update the values in a huge matrix (20000x20000) in random order. Each thread determines a new value and writes it into a particular cell. I create a mutex for each row, since in most cases different threads write to different rows, but apparently when two threads write to the same row there is a long lock. At the moment I can't assign whole rows to different threads, since the order of the writes is determined by the FEM elements.
So simply putting a critical section in there is not an option, as it would block writes to the entire matrix.
I wrote code similar to the real application.
static double ActionWithNumber(double number)
{
    const unsigned int steps = 5000;
    double sum = 0.0f;
    for (unsigned int i = 0; i < steps; i++)
    {
        double coeff = sqrt(pow(abs(number), 0.1));
        double res = number*(1.0-coeff)*number*(1.0-coeff) * 3.0;
        sum += sqrt(res);
    }
    sum /= (double)steps;
    return sum;
}

static double RealAppTest(void)
{
    const unsigned int elementsNum = 10000;
    double* matrix;
    unsigned int* elements;
    boost::mutex* mutexes;

    elements = new unsigned int[elementsNum*3];
    matrix = new double[elementsNum*elementsNum];
    mutexes = new boost::mutex[elementsNum];

    for (unsigned int i = 0; i < elementsNum; i++)
        for (unsigned int j = 0; j < elementsNum; j++)
            matrix[i*elementsNum + j] = (double)(rand() % 100);

    for (unsigned int i = 0; i < elementsNum; i++) // build FEM element like Triangle
    {
        elements[3*i] = rand()%(elementsNum-1);
        elements[3*i+1] = rand()%(elementsNum-1);
        elements[3*i+2] = rand()%(elementsNum-1);
    }

    boost::chrono::time_point<boost::chrono::system_clock> start, end;
    start = boost::chrono::system_clock::now();

    omp_set_num_threads(4);
#pragma omp parallel for
    for (int i = 0; i < elementsNum; i++)
    {
        unsigned int* elems = &elements[3*i];
        for (unsigned int j = 0; j < 3; j++)
        {
            // in here set mutex for row with index = elems[j];
            boost::lock_guard<boost::mutex> lockup(mutexes[i]);
            double res = 0.0;
            for (unsigned int k = 0; k < 3; k++)
            {
                res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
            }
            for (unsigned int k = 0; k < 3; k++)
            {
                matrix[elems[j]*elementsNum + elems[k]] = res;
            }
        }
    }
    end = boost::chrono::system_clock::now();
    boost::chrono::duration<double> elapsed_time = end - start;

    delete[] elements;
    delete[] matrix;
    delete[] mutexes;
    return elapsed_time.count();
}

int main()
{
    double sum = 0.0;
    const unsigned int steps = 5;
    for (unsigned int i = 0; i < steps; i++)
    {
        sum += RealAppTest();
    }
    sum /= (double)steps;
    std::cout << "Elapsed time = " << sum;
    return 0;
}
You're combining two different sets of threading/synchronization primitives: OpenMP, which is built into the compiler and has its own runtime system, and a manually created POSIX mutex via std::mutex. It's probably not surprising that there are some interoperability hiccups with some compiler/OS combinations.
My guess here is that in the slow case, the OpenMP runtime is going overboard to make sure that there's no interactions between higher-level ongoing OpenMP threading tasks and the manual mutex, and that doing so inside a tight loop causes the dramatic slowdown.
For mutex-like behaviour in the OpenMP framework, we can use critical sections:
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    #pragma omp critical
    sum += c[i];
}
or explicit locks:
omp_lock_t sumlock;
omp_init_lock(&sumlock);
#pragma omp parallel for
for (int i = 0; i < len; i++)
{
    //...
    // replacing this: std::lock_guard<std::mutex> scoped(_mutex);
    omp_set_lock(&sumlock);
    sum += c[i];
    omp_unset_lock(&sumlock);
}
omp_destroy_lock(&sumlock);
We get much more reasonable timings:
$ time ./openmp-original
real 1m41.119s
user 1m15.961s
sys 1m53.919s
$ time ./openmp-critical
real 0m16.470s
user 1m2.313s
sys 0m0.599s
$ time ./openmp-locks
real 0m15.819s
user 1m0.820s
sys 0m0.276s
Updated: There's no problem with using an array of OpenMP locks in exactly the same way as the mutexes:
omp_lock_t sumlocks[elementsNum];
for (unsigned idx=0; idx<elementsNum; idx++)
    omp_init_lock(&(sumlocks[idx]));

//...

#pragma omp parallel for
for (int i = 0; i < elementsNum; i++)
{
    unsigned int* elems = &elements[3*i];
    for (unsigned int j = 0; j < 3; j++)
    {
        // in here set mutex for row with index = elems[j];
        double res = 0.0;
        for (unsigned int k = 0; k < 3; k++)
        {
            res += ActionWithNumber(matrix[elems[j]*elementsNum + elems[k]]);
        }
        omp_set_lock(&(sumlocks[i]));
        for (unsigned int k = 0; k < 3; k++)
        {
            matrix[elems[j]*elementsNum + elems[k]] = res;
        }
        omp_unset_lock(&(sumlocks[i]));
    }
}
for (unsigned idx=0; idx<elementsNum; idx++)
    omp_destroy_lock(&(sumlocks[idx]));