I was looking to use system functions (such as rand()) within the CUDA kernel; ideally, though, that part would just run on the CPU. Can I split the code into separate files (.cu and .cpp) while still making use of GPU matrix addition? For example, something along these lines:
in main.cpp:
#include <cstdlib>
#include <ctime>
#include <vector>

void selfSquare(std::vector<int> &arr, int n); // defined in cudaFuncs.cu

int main(){
    std::vector<int> myVec;
    srand(time(NULL));
    for (int i = 0; i < 1024; i++){
        myVec.push_back(rand() % 26);
    }
    selfSquare(myVec, 1024);
}
and in cudaFuncs.cu:
#include <vector>

__global__ void selfSquare_cu(int *arr, int n){
    int i = threadIdx.x;
    if (i < n){
        arr[i] = arr[i] * arr[i];
    }
}
void selfSquare(std::vector<int> &arr, int n){
    int *cuArr;
    cudaMallocManaged(&cuArr, n * sizeof(int));
    for (int i = 0; i < n; i++){
        cuArr[i] = arr[i];
    }
    selfSquare_cu<<<1, n>>>(cuArr, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++){
        arr[i] = cuArr[i];
    }
    cudaFree(cuArr);
}
What are the best practices in situations like this? Would it be a better idea to use cuRAND and do everything in the kernel? It looks to me like in the above example there is an extra step: the vector has to be copied into the managed CUDA memory.
In this case the only thing that you need is to have the array initialised with random values. Each value of the array can be initialised independently.
The CPU is involved in your code during the initialization and during the transfer of the data to the device and back to the host.
In your case, do you really need the CPU to initialize the data, only to then move all of those values to the GPU?
The best approach is to allocate some device memory and then initialize the values using a kernel.
This will save time because:
The elements are initialized in parallel
There is no memory transfer required from the host to the device
As a rule of thumb, always avoid communication between host and device if possible.
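For example, a rough sketch of device-side initialization with the cuRAND device API could look like this (the kernel names, block size and the % 26 range simply mirror your question and are otherwise illustrative; the second launch reuses your selfSquare_cu kernel):

#include <curand_kernel.h>

__global__ void initRandom(int *arr, int n, unsigned long long seed){
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n){
        curandState state;
        curand_init(seed, i, 0, &state);   // one independent sequence per element
        arr[i] = curand(&state) % 26;      // same range as rand() % 26 in the question
    }
}

void initOnDevice(int n){
    int *cuArr;
    cudaMalloc(&cuArr, n * sizeof(int));   // plain device memory, no host copy needed
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    initRandom<<<blocks, threads>>>(cuArr, n, 1234ULL);
    selfSquare_cu<<<blocks, threads>>>(cuArr, n);   // then operate on the data in place
    cudaDeviceSynchronize();
    cudaFree(cuArr);
}

The host only launches the kernels; the data never has to travel between host and device unless you eventually need the results back.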
I have never really understood how the alloc map-type of the map clause of the target (or target data) construct works.
Here is my use case: I would like to have a temporary array on a device which is used only on the device: it is initialized on the device, read on the device, everything happens on the device. The host does not touch its contents at all. For the sake of simplicity, I have the following code, which copies an array to another array via a temporary array (using just a single team and thread, but that does not matter):
#include <cstdio>
int main()
{
const int count = 10;
int * src = new int[count];
int * tmp = new int[count];
int * dst = new int[count];
for(int i = 0; i < count; i++) src[i] = i;
for(int i = 0; i < count; i++) printf(" %3d", src[i]); printf("\n");
#pragma omp target map(to:src[0:count]) map(from:dst[0:count]) map(alloc:tmp[0:count])
{
for(int i = 0; i < count; i++) tmp[i] = src[i];
for(int i = 0; i < count; i++) dst[i] = tmp[i];
}
for(int i = 0; i < count; i++) printf(" %3d", dst[i]); printf("\n");
delete[] src;
delete[] tmp;
delete[] dst;
return 0;
}
This code works when compiled with pgc++ -mp=gpu for an Nvidia GPU, and with icpx -fiopenmp -fopenmp-targets=spir64 for an Intel GPU.
But the thing is, I don't want to allocate the tmp array on the host. If I just use int * tmp = nullptr, the code fails on Nvidia (on Intel it still works). If I leave tmp uninitialized (using just int * tmp;, and removing the delete), the execution fails on Intel too. If I do not even declare the tmp variable, compilation fails (which kind of makes sense). I made sure it really runs on the device (it actually offloads the code and does not fall back to the CPU) by using OMP_TARGET_OFFLOAD=MANDATORY.
This was weird to me, since I don't use the tmp array on the host at all. As I understand it, the tmp array is allocated on the device, and then the device array is used in the kernel. Is that right? Why do I have to allocate and/or initialize the pointer on the host if I don't use it on the host?
So my question is: what are the exact requirements to use map(alloc) in OpenMP offloading? How does it work? How should I use it? I would appreciate an example and references from tutorials/documentation.
I wasn't able to find any useful information regarding this. The standard was not helpful at all, and the tutorials I attended and watched did not go into such depth.
I understand that the code should work even without OpenMP enabled (as if the pragmas were just ignored), so let's assume there is an #ifdef to actually allocate the tmp array if OpenMP is disabled.
I am also aware of manual memory management via omp_target_alloc(), omp_target_memcpy() and omp_target_free(), but I wanted to use the target map(alloc).
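For completeness, this is roughly what I mean by that manual route (a sketch only; is_device_ptr is used so the target region sees the device pointer directly):

#include <omp.h>
#include <cstdio>

int main()
{
    const int count = 10;
    int * src = new int[count];
    int * dst = new int[count];
    for(int i = 0; i < count; i++) src[i] = i;

    // Device-only scratch buffer: never allocated, touched, or freed on the host heap.
    int dev = omp_get_default_device();
    int * tmp = static_cast<int*>(omp_target_alloc(count * sizeof(int), dev));

    #pragma omp target map(to:src[0:count]) map(from:dst[0:count]) is_device_ptr(tmp)
    {
        for(int i = 0; i < count; i++) tmp[i] = src[i];
        for(int i = 0; i < count; i++) dst[i] = tmp[i];
    }

    for(int i = 0; i < count; i++) printf(" %3d", dst[i]); printf("\n");

    omp_target_free(tmp, dev);
    delete[] src;
    delete[] dst;
    return 0;
}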
I am reading the OpenMP 5.2 standard, and I am using pgc++ 22.2-0 and icpx 2022.0.0.20211123.
I am currently writing some code to create a neural network, and I am trying to make it as optimised as possible. I want to be able to get the amount of memory consumed by an object of type Network, since memory usage is very important in order to avoid cache misses. I tried using sizeof(), but this does not work since, I assume, vectors store their values on the heap, so sizeof() will just tell me the size of the vector objects themselves. Here is my code so far.
#include <iostream>
#include <vector>
#include <random>
#include <chrono>
class Timer
{
private:
std::chrono::time_point<std::chrono::high_resolution_clock> start_time;
public:
Timer(bool auto_start=true)
{
if (auto_start)
{
start();
}
}
void start()
{
start_time = std::chrono::high_resolution_clock::now();
}
float get_duration()
{
std::chrono::duration<float> duration = std::chrono::high_resolution_clock::now() - start_time;
return duration.count();
}
};
class Network
{
public:
std::vector<std::vector<std::vector<float>>> weights;
std::vector<std::vector<std::vector<float>>> deriv_weights;
std::vector<std::vector<float>> biases;
std::vector<std::vector<float>> deriv_biases;
std::vector<std::vector<float>> activations;
std::vector<std::vector<float>> deriv_activations;
};
Network create_network(std::vector<int> layers)
{
Network network;
network.weights.reserve(layers.size() - 1);
int nodes_in_prev_layer = layers[0];
for (unsigned int i = 0; i < layers.size() - 1; ++i)
{
int nodes_in_layer = layers[i + 1];
network.weights.push_back(std::vector<std::vector<float>>());
network.weights[i].reserve(nodes_in_layer);
for (int j = 0; j < nodes_in_layer; ++j)
{
network.weights[i].push_back(std::vector<float>());
network.weights[i][j].reserve(nodes_in_prev_layer);
for (int k = 0; k < nodes_in_prev_layer; ++k)
{
float input_weight = float(std::rand()) / RAND_MAX;
network.weights[i][j].push_back(input_weight);
}
}
nodes_in_prev_layer = nodes_in_layer;
}
return network;
}
int main()
{
Timer timer;
Network network = create_network({784, 800, 16, 10});
std::cout << timer.get_duration() << std::endl;
std::cout << sizeof(network) << std::endl;
std::cin.get();
}
I've recently updated our production neural network code to AVX-512, so this is real-world production code. A key part of our optimisations is that each matrix is not a std::vector, but a 1D AVX-aligned array. Even without AVX alignment, we see a huge benefit from backing each matrix with a one-dimensional array. This means the memory access will be fully sequential, which is much faster. The size is then (rows*cols)*sizeof(float).
We store the bias as the first full row. Commonly that's implemented by prefixing the input with a 1.0 element, but in our AVX code we use the bias as the starting values for the FMA (Fused Multiply-Add) operations, i.e. in pseudo-code: result = bias; for (input : inputs) result += input * weight. This also keeps the input AVX-aligned.
Since each matrix is used in turn, you can safely have a std::vector<Matrix> layers.
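A minimal sketch of such a 1D-backed matrix (the class and member names here are illustrative, not our actual code, and AVX alignment is omitted; a real implementation would use an aligned allocator):

#include <cstddef>
#include <vector>

// A matrix stored as a single contiguous row-major buffer, so walking a row
// (or the whole matrix) is fully sequential in memory.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;   // rows * cols floats, contiguous

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    float &operator()(std::size_t r, std::size_t c)       { return data[r * cols + c]; }
    float  operator()(std::size_t r, std::size_t c) const { return data[r * cols + c]; }

    // Approximate memory footprint of one matrix: the element buffer dominates.
    std::size_t bytes() const { return sizeof(*this) + data.capacity() * sizeof(float); }
};

// One matrix per layer, as suggested above:
// std::vector<Matrix> layers;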
To quote from https://stackoverflow.com/a/17254518/7588455:
Vector stores its elements in an internally-allocated memory array. You can do this:
sizeof(std::vector<int>) + (sizeof(int) * MyVector.size())
This will give you the size of the vector structure itself plus the size of all the ints in it, but it may not include whatever small overhead your memory allocator may impose. I'm not sure there's a platform-independent way to include that.
In your case only the internally-allocated memory arrays matter, since those are what you actually access. Also be aware of how you access that memory.
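Applied to the nested vectors in your Network, a rough estimate (ignoring allocator overhead) could be computed like this; bytes_of is a made-up helper name for illustration:

#include <cstddef>
#include <vector>

// Rough footprint: the vector object itself plus its heap buffer.
std::size_t bytes_of(const std::vector<float> &v)
{
    return sizeof(v) + v.capacity() * sizeof(float);
}

std::size_t bytes_of(const std::vector<std::vector<float>> &v)
{
    std::size_t total = sizeof(v);
    for (const auto &inner : v) total += bytes_of(inner);
    return total;
}

std::size_t bytes_of(const std::vector<std::vector<std::vector<float>>> &v)
{
    std::size_t total = sizeof(v);
    for (const auto &inner : v) total += bytes_of(inner);
    return total;
}

// Usage, for the Network from the question:
// std::size_t total = bytes_of(network.weights) + bytes_of(network.deriv_weights)
//                   + bytes_of(network.biases) + bytes_of(network.deriv_biases)
//                   + bytes_of(network.activations) + bytes_of(network.deriv_activations);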
In order to write cache-friendly code I highly recommend reading through this SO post: https://stackoverflow.com/a/16699282/7588455
I have a distance matrix D of size n by n and a constant L as input. I need to create a vector v containing all entries of D whose value is at most L. Here v must be in a specific order, v = [v1 v2 .. vn], where vi contains the entries of the ith row of D with value at most L. The order of the entries within each vi is not important.
I wonder whether there is a fast way to create v using a vector, an array, or any other data structure, plus parallelization. What I did is use for loops, and it is very slow for large n.
vector<int> v;
for (int i=0; i < n; ++i){
for (int j=0; j < n; ++j){
if (D(i,j) <= L) v.push_back(j);
}
}
The best way mostly depends on the context. If you are looking for GPU parallelization you should take a look at OpenCL.
For CPU-based parallelization the C++ standard #include <thread> library is probably your best bet, but you need to be careful:
Threads take time to create, so if n is relatively small (<1000 or so) spawning them will slow you down
D(i,j) has to be readable by multiple threads at the same time
v has to be writable by multiple threads; a standard vector won't cut it
v may be a 2D vector with the vi as its subvectors, but these have to be initialized before the parallelization:
std::vector<std::vector<int>> v;
v.reserve(n);
for(size_t i = 0; i < n; i++)
{
v.push_back(std::vector<int>());
}
You need to decide how many threads you want to use. If this is for one machine only, hardcoding is a valid option. There is a function in the thread library that reports the number of supported threads, but it is more of a hint than something to rely on.
size_t threadAmount = std::thread::hardware_concurrency(); // how many threads should run; hardware_concurrency() gives you a hint, but it's not necessarily optimal
std::vector<std::thread> t; //to store the threads in
t.reserve(threadAmount-1); //you need threadAmount-1 extra threads (we already have the main-thread)
To start a thread you need a function for it to execute. In this case the function reads through part of your matrix.
// D and n are assumed to be accessible here (e.g. globals), as in the question
void CheckPart(size_t start, size_t amount, int L, std::vector<std::vector<int>>& vec)
{
for(size_t i = start; i < amount+start; i++)
{
for(size_t j = 0; j < n; j++)
{
if(D(i,j) <= L)
{
vec[i].push_back(j);
}
}
}
}
Now you need to split your matrix into parts of roughly n/threadAmount rows each and start the threads. The thread constructor needs a function and its parameters, but it will always try to copy the parameters, even if the function expects a reference. To prevent this, you need to force passing a reference with std::ref().
int i = 0;
int rows;
for(size_t a = 0; a < threadAmount-1; a++)
{
rows = n/threadAmount + ((n%threadAmount>a)?1:0);
t.push_back(std::thread(CheckPart, i, rows, L, std::ref(v)));
i += rows;
}
The threads are now running, and all that is left to do is run the last block on the main thread:
CheckPart(i, n/threadAmount, L, v);
After that you need to wait for the threads to finish and clean them up:
for(unsigned int a = 0; a < threadAmount-1; a++)
{
if(t[a].joinable())
{
t[a].join();
}
}
Please note that this is just a quick and dirty example. Different problems might need different implementations, and since I can't guess the context, the help I can give is rather limited. A self-contained sketch of the whole approach follows below.
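Put together, a self-contained sketch of the same idea could look like this (here D is assumed to be a flat row-major float array; adapt it to however your matrix is actually stored):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Scan rows [start, start+count) of the n x n matrix D (row-major) and record,
// per row, the column indices whose value is at most L.
static void checkRows(const float *D, std::size_t n, float L,
                      std::size_t start, std::size_t count,
                      std::vector<std::vector<int>> &v)
{
    for (std::size_t i = start; i < start + count; ++i)
        for (std::size_t j = 0; j < n; ++j)
            if (D[i * n + j] <= L)
                v[i].push_back(static_cast<int>(j));
}

std::vector<std::vector<int>> buildV(const float *D, std::size_t n, float L)
{
    std::vector<std::vector<int>> v(n);   // one subvector per row, sized up front
    std::size_t threads = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    std::size_t start = 0;
    for (std::size_t t = 0; t + 1 < threads; ++t) {
        std::size_t rows = n / threads + (n % threads > t ? 1 : 0);
        pool.emplace_back(checkRows, D, n, L, start, rows, std::ref(v));
        start += rows;
    }
    checkRows(D, n, L, start, n - start, v);   // the main thread handles the last chunk
    for (auto &th : pool) th.join();
    return v;
}

Each thread only writes to its own rows of v, so no locking is needed.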
In consideration of the comments, I have made the appropriate corrections (marked with emphasis).
Have you searched for tips on writing high-performance code, on threading, on asm instructions (if the generated assembly is not exactly what you want), and on OpenCL for parallel processing? If not, I strongly recommend it!
In some cases, declaring all the for-loop variables outside of the loops (to avoid declaring them many times) will make things faster, but not in this case (see the comment from our friend Paddy).
Also, using new instead of vector can be faster, as we see here: Using arrays or std::vectors in C++, what's the performance gap? - I tested it, and with vector it's 6 seconds slower than with new, which only takes 1 second. I guess the safety and ease of management that come with std::vector are not what you want when you are chasing performance, especially since using new is not that difficult; just avoid heap overflows in your index calculations and remember to use delete[].
user4581301 is correct here, and the following statement of mine is untrue: "Finally, if you build D in an array instead of a matrix (or maybe if you copy D into a constant array), it will be much more cache-friendly and will save one for-loop statement."
I have an array of float values, named life, and I want to count the number of entries with a value greater than 0 in CUDA.
On the CPU, the code would look like this:
int numParticles = 0;
for(int i = 0; i < MAX_PARTICLES; i++){
if(life[i]>0){
numParticles++;
}
}
Now in CUDA, I've tried something like this:
__global__ void update(float* life, int* numParticles){
int idx = threadIdx.x + blockIdx.x * blockDim.x;
if (life[idx]>0){
(*numParticles)++;
}
}
//life is a filled device pointer
int launchCount(float* life)
{
int numParticles = 0;
int* numParticles_d = 0;
cudaMalloc((void**)&numParticles_d, sizeof(int));
update<<<MAX_PARTICLES/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(life, numParticles_d);
cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "numParticles: " << numParticles << std::endl;
}
But for some reason the CUDA attempt always returns 0 for numParticles. How come?
This:
if (life[idx]>0){
(*numParticles)++;
}
is a read-after-write hazard. Multiple threads will simultaneously attempt to read from and write to numParticles. The CUDA execution model does not guarantee anything about the order of simultaneous transactions.
You could make this work by using atomic memory transactions, for example:
if (life[idx]>0){
atomicAdd(numParticles, 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction-type calculation, and then summing the block-local sums atomically, on the host, or in a second kernel. For example:
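Here is a minimal sketch of that idea (a shared-memory reduction per block followed by one atomicAdd per block; the kernel name and the fixed block size are illustrative):

__global__ void countAlive(const float* life, int n, int* numParticles)
{
    __shared__ int sdata[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (idx < n && life[idx] > 0.0f) ? 1 : 0;
    __syncthreads();

    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(numParticles, sdata[0]);   // one atomic per block
}

Note that the counter still has to be zeroed on the device (for example with cudaMemset(numParticles_d, 0, sizeof(int))) before the launch, which your code currently does not do.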
Your code actually launches MAX_PARTICLES threads, and multiple thread blocks execute (*numParticles)++; concurrently. This is a race condition, so you get the result 0, or, if you are lucky, sometimes a value a little bigger than 0.
Since your goal is to sum up life[i] > 0 ? 1 : 0 over all i, you could follow the CUDA parallel reduction sample to implement your kernel, or use a Thrust reduction to simplify your life. For example:
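A minimal sketch with Thrust (the wrapper function name is made up; life is assumed to be a device pointer, as in your question):

#include <thrust/count.h>
#include <thrust/execution_policy.h>

// Predicate usable on the device: a particle is "alive" if its life value is positive.
struct is_alive
{
    __host__ __device__ bool operator()(float x) const { return x > 0.0f; }
};

// life points to n floats in device memory.
int countParticles(const float* life, int n)
{
    return static_cast<int>(thrust::count_if(thrust::device, life, life + n, is_alive()));
}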
I've been learning C++ from the internet for the past 2 years, and the need has finally arisen for me to delve into MPI. I've been scouring Stack Overflow and the rest of the internet (including http://people.sc.fsu.edu/~jburkardt/cpp_src/mpi/mpi.html and https://computing.llnl.gov/tutorials/mpi/#LLNL). I think I've got some of the logic down, but I'm having a hard time wrapping my head around the following:
#include (stuff)
using namespace std;
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows);
int main(int argc, char** argv)
{
vector<double> result;//represents a regular 1D vector
int id_proc, tot_proc, root_proc = 0;
int dim;//set to number of "columns" in A and B below
int rows;//set to number of "rows" of A and B below
vector<double> A(dim*rows), B(dim*rows);//represent matrices as 1D vectors
MPI::Init(argc,argv);
id_proc = MPI::COMM_WORLD.Get_rank();
tot_proc = MPI::COMM_WORLD.Get_size();
/*
initialize A and B here on root_proc with RNG and Bcast to everyone else
*/
//allow all processors to call function() so they can each work on a portion of A
result = function(A,B,dim,rows);
//all processors do stuff with A
//root_proc does stuff with result (doesn't matter if other processors have updated result)
MPI::Finalize();
return 0;
}
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows)
{
/*
purpose of function() is two-fold:
1. update foo because all processors need the updated "matrix"
2. get the average of the "rows" of foo and return that to main (only root processor needs this)
*/
vector<double> output(dim,0);
//add matrices the way I would normally do it in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
foo[i*dim + j] += bar[i*dim + j];//perform "matrix" addition (+= ON PURPOSE)
}
}
//obtain average of rows in foo in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
output[j] += foo[i*dim + j];//sum rows of A
}
}
for (int j = 0; j < dim; j++)
{
output[j] /= rows;//divide to obtain average
}
return output;
}
The code above is to illustrate the concept only. My main concern is parallelizing the matrix addition, but what boggles my mind is this:
1) If each processor works on only a portion of that loop (naturally I'd have to modify the loop bounds per processor), what command do I use to merge all the portions of A back into a single, updated A that every processor has in its memory? My guess is that I have to do some kind of Alltoall where each processor sends its portion of A to all other processors, but how do I guarantee that (for example) row 3, worked on by processor 3, overwrites row 3 on the other processors, and not row 1 by accident?
2) If I use an Alltoall inside function(), do all processors have to be allowed to step into function(), or can I isolate function() using...
if (id_proc == root_proc)
{
result = function(A,B,dim,rows);
}
… and then handle all the parallelization inside function(). As silly as it sounds, I'm trying to do a lot of the work on one processor (with broadcasts) and only parallelize the big, time-consuming for loops. I'm just trying to keep the code conceptually simple so I can get my results and move on.
3) For the averaging part, I'm sure I can just use a reduction command if I wanted to parallelize it, correct?
Also, as an aside: is there a way to call Bcast() such that it is blocking? I'd like to use it to synchronize all my processors (the Boost libraries are not an option). If not, then I'll just go with Barrier(). Thank you for your answer to this question, and to the Stack Overflow community for teaching me how to program over the past two years! :)
1) The function you are looking for is MPI_Allgather. MPI_Allgather will let each processor send its rows and receive the assembled result on all processors. For example:
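A minimal sketch using the C bindings (which the note at the end recommends over the deprecated C++ ones), assuming id_proc and tot_proc are available inside function() and that rows is divisible by tot_proc:

// Each rank updates only its own block of rows of foo, then MPI_Allgather
// distributes every block so that all ranks end up with the full, updated matrix.
int rows_per_proc = rows / tot_proc;
int first_row = id_proc * rows_per_proc;

for (int i = first_row; i < first_row + rows_per_proc; i++)
    for (int j = 0; j < dim; j++)
        foo[i*dim + j] += bar[i*dim + j];

// MPI_IN_PLACE: each rank's contribution is read from its own slot of foo.
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              foo.data(), rows_per_proc * dim, MPI_DOUBLE,
              MPI_COMM_WORLD);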
2) Yes, you can use only some of the processors in your function. Since MPI functions work with communicators, you have to create a separate communicator for this purpose. I don't know how this is done in the C++ bindings, but the C bindings use the MPI_Comm_create function.
3) Yes, see MPI_Allreduce. For example:
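A sketch of the averaging, reusing first_row and rows_per_proc from the sketch above (MPI_Reduce would also do if only root_proc needs the result):

// Each rank sums only its own rows into a local partial sum; the partial sums
// are then added across all ranks and divided by the total number of rows.
std::vector<double> partial(dim, 0.0);
for (int i = first_row; i < first_row + rows_per_proc; i++)
    for (int j = 0; j < dim; j++)
        partial[j] += foo[i*dim + j];

std::vector<double> output(dim, 0.0);
MPI_Allreduce(partial.data(), output.data(), dim, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
for (int j = 0; j < dim; j++)
    output[j] /= rows;   // same averaging as the serial version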
aside: Bcast blocks a process until the send/receive operation assigned to that process has finished. If you want to wait for all processors to finish their work (I don't see any reason why you would want to do this), you should use Barrier().
extra note: I wouldn't recommend using the C++ bindings, as they are deprecated and you won't find specific examples of how to use them. Boost MPI is the library to use if you want C++ bindings; however, it does not cover all MPI functions.