Count values from an array in CUDA - C++

I have an array of float values, named life, and I want to count the number of entries with a value greater than 0 in CUDA.
On the CPU, the code would look like this:
int numParticles = 0;
for (int i = 0; i < MAX_PARTICLES; i++) {
    if (life[i] > 0) {
        numParticles++;
    }
}
Now in CUDA, I've tried something like this:
__global__ void update(float* life, int* numParticles) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (life[idx] > 0) {
        (*numParticles)++;
    }
}
// life is a filled device pointer
int launchCount(float* life)
{
    int numParticles = 0;
    int* numParticles_d = 0;
    cudaMalloc((void**)&numParticles_d, sizeof(int));
    update<<<MAX_PARTICLES / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(life, numParticles_d);
    cudaMemcpy(&numParticles, numParticles_d, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "numParticles: " << numParticles << std::endl;
}
But for some reason the CUDA attempt always returns 0 for numParticles. How come?

This:
if (life[idx] > 0) {
    (*numParticles)++;
}
is a read-after-write hazard. Multiple threads will be simultaneously attempting to read and write numParticles. The CUDA execution model does not guarantee anything about the order of simultaneous transactions: the increment is really a load, an add, and a store, so concurrent threads can all load the same old value and overwrite each other's updates.
You could make this work by using atomic memory transactions, for example:
if (life[idx] > 0) {
    atomicAdd(numParticles, 1);
}
This will serialize the memory transactions and make the calculation correct. It will also have a big negative effect on performance.
You might want to investigate having each block calculate a local sum using a reduction-type calculation, and then sum the block-local sums atomically, on the host, or in a second kernel.
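For illustration, here is a minimal sketch of that approach (not from the original post; it assumes THREADS_PER_BLOCK is a compile-time constant and a power of two, and adds an n parameter for the bounds check): each block reduces its own flags in shared memory, and only thread 0 of each block touches the global counter.

__global__ void countPositive(const float* life, int* numParticles, int n)
{
    __shared__ int sdata[THREADS_PER_BLOCK];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // each thread contributes 1 if its entry is alive, 0 otherwise
    sdata[threadIdx.x] = (idx < n && life[idx] > 0.0f) ? 1 : 0;
    __syncthreads();
    // tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        }
        __syncthreads();
    }
    // one atomic per block instead of one per thread
    if (threadIdx.x == 0) {
        atomicAdd(numParticles, sdata[0]);
    }
}

Note that numParticles_d must be zeroed (e.g., with cudaMemset) before the launch, since cudaMalloc does not initialize memory.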

Your code is actually launching MAX_PARTICLES threads, and multiple thread blocks are executing (*numParticles)++; concurrently. This is a race condition, so you get the result 0 or, if you are lucky, sometimes a value slightly bigger than 0.
Since what you are attempting is to sum up life[i] > 0 ? 1 : 0 for all i, you could follow the CUDA parallel reduction sample to implement your kernel, or use a Thrust reduction to simplify your life.
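For example, a minimal sketch of the Thrust route (the wrapper name countAlive and the n parameter are mine, not from the original post):

#include <thrust/count.h>
#include <thrust/device_ptr.h>

struct is_positive {
    __host__ __device__ bool operator()(float v) const { return v > 0.0f; }
};

// 'life' is a device pointer with n entries, as in the question
int countAlive(float* life, int n)
{
    thrust::device_ptr<float> p(life);
    // count_if runs the reduction on the device and returns the count to the host
    return (int) thrust::count_if(p, p + n, is_positive());
}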


Windows threading synchronization performance issue

I have a threading issue under Windows.
I am developing a program that runs complex physical simulations for different conditions. Say, one condition per hour of the year: that would be 8760 simulations. I am grouping those simulations per thread such that each thread runs a for loop of 273 simulations (on average).
I bought an AMD Ryzen 9 5950X with 16 cores (32 threads) for this task. On Linux, all the threads seem to be at between 98% and 100% usage, while under Windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's Concurrency Visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and Windows executions.
I did my best to make the objects immutable to avoid issues, and that provided a big gain on my old 8-thread Intel i7. However, with many more threads, this issue arises.
For threading, I have tried a custom parallel for, and the taskflow library. Both perform identically for what I want to do.
Is there something fundamental about Windows threads that produces this behaviour?
The custom parallel for code:
/**
 * parallel for
 * @tparam Index integer type
 * @tparam Callable function type
 * @param start start index of the loop
 * @param end final +1 index of the loop
 * @param func function to evaluate
 * @param nb_threads number of threads, if zero, it is determined automatically
 */
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads = 0) {
    // Estimate number of threads in the pool
    if (nb_threads == 0) nb_threads = getThreadNumber();
    // Size of a slice for the range functions
    Index n = end - start + 1;
    Index slice = (Index) std::round(n / static_cast<double>(nb_threads));
    slice = std::max(slice, Index(1));
    // [Helper] Inner loop
    auto launchRange = [&func](int k1, int k2) {
        for (Index k = k1; k < k2; k++) {
            func(k);
        }
    };
    // Create pool and launch jobs
    std::vector<std::thread> pool;
    pool.reserve(nb_threads);
    Index i1 = start;
    Index i2 = std::min(start + slice, end);
    for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
        pool.emplace_back(launchRange, i1, i2);
        i1 = i2;
        i2 = std::min(i2 + slice, end);
    }
    if (i1 < end) {
        pool.emplace_back(launchRange, i1, end);
    }
    // Wait for jobs to finish
    for (std::thread &t : pool) {
        if (t.joinable()) {
            t.join();
        }
    }
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"

void fillA(Matrix& A) {
    Random rnd;
    rnd.setTimeBasedSeed();
    for (int i = 0; i < A.getRows(); ++i)
        for (int j = 0; j < A.getRows(); ++j)
            A(i, j) = (int) rnd.randInt(0, 1000);
}

void worker(const InputData& input_data,
            OutputData& output_data,
            const std::vector<int>& time_indices,
            int thread_index) {
    std::cout << "Thread " << thread_index << " [" << time_indices[0] << ", " << time_indices[time_indices.size() - 1] << "]\n";
    for (const int& t : time_indices) {
        Matrix b = input_data.getAt(t);
        Matrix A(input_data.getDim(), input_data.getDim());
        fillA(A);
        Matrix x = A * b;
        output_data.setAt(t, x);
    }
}

void process(int time_steps, int dim, int n_threads) {
    InputData input_data(time_steps, dim);
    OutputData output_data(time_steps, dim);
    // correct the number of threads
    if (n_threads < 1) { n_threads = (int) getThreadNumber(); }
    // generate indices
    std::vector<int> time_indices = arrange<int>(time_steps);
    // compute the split of indices per core
    std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads);
    // run in parallel
    ParallelFor(0, (int) chunks.size(), [&](int k) {
        // run chunk
        worker(input_data, output_data, chunks[k].indices, k);
    });
}

int main() {
    process(8760, 5000, 0);
    return 0;
}
The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory accesses a shared resource, causing the threads to wait for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt this to your example code, you could replace b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const& A, Matrix const& b, Matrix& x)
{
    for (int i = 0; i < x.getRows(); ++i) {
        for (int j = 0; j < x.getCols(); ++j) {
            x(i, j) = 0.0;
            for (int k = 0; k < A.getCols(); ++k) {
                x(i, j) += (A(i, k) * b(k, j));
            }
        }
    }
}

void getAt(int t, Matrix const& input_data, Matrix& b) {
    for (int i = 0; i < input_data.getRows(); ++i)
        b(i, 0) = input_data(i, t);
}

void worker(const InputData& input_data,
            OutputData& output_data,
            const std::vector<int>& time_indices,
            int thread_index) {
    std::cout << "Thread " << thread_index << " [" << time_indices[0] << ", " << time_indices[time_indices.size() - 1] << "]\n";
    Matrix A(input_data.getDim(), input_data.getDim());
    Matrix b(input_data.getDim(), 1);
    Matrix x(input_data.getDim(), 1);
    for (const int& t : time_indices) {
        getAt(t, input_data.getMat(), b);
        fillA(A);
        MatrixVectorProduct(A, b, x);
        output_data.setAt(t, x);
    }
    std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I have had very good experience with is mimalloc (there are others, such as Hoard or the one from TBB). You do not need to modify your source code; you just need to link with a specific library, as described in the documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements are located behind each other in memory, allowing more efficient access; a sketch of this layout follows these notes.
But in general, I suggest using a dedicated linear algebra library such as Eigen instead of the hand-rolled matrix implementation, to exploit vectorization (SSE2, AVX, ...) and to get the benefits of a highly optimized library.
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.
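As an illustration of the contiguous layout mentioned above, here is a minimal sketch (with assumed names, not the original Matrix class): one allocation backs the whole matrix, and element (i, j) lives at row-major offset i * cols + j.

#include <vector>

// One allocation for the whole matrix; rows sit back to back in memory.
class DenseMatrix {
public:
    DenseMatrix(int rows, int cols)
        : rows_(rows), cols_(cols), data_((size_t)rows * cols, 0.0) {}
    double& operator()(int i, int j) { return data_[(size_t)i * cols_ + j]; }
    double operator()(int i, int j) const { return data_[(size_t)i * cols_ + j]; }
    int getRows() const { return rows_; }
    int getCols() const { return cols_; }
private:
    int rows_, cols_;
    std::vector<double> data_; // single contiguous buffer
};
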
You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These allocations are in a tight loop. You should fix this loop to access b by reference, and reuse the other two matrices on every iteration.

Not seeing any significant improvement when using a parallel block in OpenMP C++

I am receiving an array of Eigen::MatrixXf and Eigen::Matrix4f in real time. Both of these arrays have an equal number of elements. All I am trying to do is multiply the elements of both arrays together and store the result in another array at the same index.
Please see the code snippet below-
#define COUNT 4
while (all_ok())
{
    Eigen::Matrix4f trans[COUNT];
    Eigen::MatrixXf in_data[COUNT];
    Eigen::MatrixXf out_data[COUNT];
    // at each iteration, new data is filled
    // in 'trans' and 'in_data' variables
    #pragma omp parallel num_threads(COUNT)
    {
        #pragma omp for
        for (int i = 0; i < COUNT; i++)
            out_data[i] = trans[i] * in_data[i];
    }
}
Please note that COUNT is a constant. The sizes of trans and in_data are (4 x 4) and (4 x n) respectively, where n is approximately 500,000. To parallelize the for loop, I gave OpenMP a try as shown above. However, I don't see any significant improvement in the elapsed time of the for loop.
Any suggestions? Any alternatives to perform the same operation, please?
Edit: My idea is to define 4 (= COUNT) threads, each taking care of one multiplication. That way, we don't need to create threads every time, I guess!
This works for me using the following self-contained example; that is, I get a 4x speed-up when enabling OpenMP:
#include <iostream>
#include <bench/BenchTimer.h>
using namespace Eigen;

const int COUNT = 4;

EIGEN_DONT_INLINE
void foo(const Matrix4f *trans, const MatrixXf *in_data, MatrixXf *out_data)
{
    #pragma omp parallel for num_threads(COUNT)
    for (int i = 0; i < COUNT; i++)
        out_data[i] = trans[i] * in_data[i];
}

int main()
{
    Eigen::Matrix4f trans[COUNT];
    Eigen::MatrixXf in_data[COUNT];
    Eigen::MatrixXf out_data[COUNT];
    int n = 500000;
    for (int i = 0; i < COUNT; i++)
    {
        trans[i].setRandom();
        in_data[i].setRandom(4, n);
        out_data[i].setRandom(4, n);
    }
    int tries = 3;
    int rep = 1;
    BenchTimer t;
    BENCH(t, tries, rep, foo(trans, in_data, out_data));
    std::cout << " " << t.best(Eigen::REAL_TIMER) << " (" << double(n)*4.*4.*4.*2.e-9/t.best() << " GFlops)\n";
    return 0;
}
So: 1) make sure you measure wall-clock time and not CPU time, and 2) make sure that the products are the bottleneck and not filling in_data.
Finally, for maximal performance, don't forget to enable AVX/FMA (e.g., with -march=native), and of course make sure to benchmark with the compiler's optimizations on.
For the record, on my computer the above example takes 0.25 s without OpenMP and 0.065 s with it.
You need to specify -fopenmp during compilation and linking. But you will quickly hit the limit where RAM access stops further speed-up. You really should have a look at vector intrinsics. Depending on your CPU, you can process as many elements at a time as fit into a register: the register size divided by the size of your variable (4 bytes for float). So if your processor supports, say, AVX, you'd be dealing with 8 floats at a time. If you need some inspiration, you're welcome to steal code from my medical image reconstruction library here:
https://github.com/kvahed/codeare/blob/master/src/matrix/SIMDTraits.hpp
The code does the whole shebang for float/double, real and complex.
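To make the idea concrete, here is a minimal sketch (mine, not from the linked library) of the 4x4-matrix-times-many-vectors product from the question, written with SSE intrinsics; it assumes column-major storage and 16-byte-aligned buffers:

#include <immintrin.h>

// out_i = A * b_i for n column vectors of 4 floats each.
void mat4xn_sse(const float* A /* 16 floats, column-major */,
                const float* in /* 4*n floats */,
                float* out /* 4*n floats */, int n)
{
    const __m128 c0 = _mm_load_ps(A + 0);   // column 0 of A
    const __m128 c1 = _mm_load_ps(A + 4);   // column 1
    const __m128 c2 = _mm_load_ps(A + 8);   // column 2
    const __m128 c3 = _mm_load_ps(A + 12);  // column 3
    for (int i = 0; i < n; ++i) {
        const float* b = in + 4 * i;
        // out_i = b0*c0 + b1*c1 + b2*c2 + b3*c3
        __m128 r = _mm_mul_ps(c0, _mm_set1_ps(b[0]));
        r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(b[1])));
        r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(b[2])));
        r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(b[3])));
        _mm_store_ps(out + 4 * i, r);
    }
}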

Cuda triple nested for loop assignement

I'm trying to convert C++ code into CUDA code, and I've got the following triple nested for loop that fills an array for further OpenGL rendering (I'm simply creating a coordinate vertex array):
for (int z = 0; z < 263; z++) {
    for (int y = 0; y < 170; y++) {
        for (int x = 0; x < 170; x++) {
            g_vertex_buffer_data_3[i]   = (float)x + 0.5f;
            g_vertex_buffer_data_3[i+1] = (float)y + 0.5f;
            g_vertex_buffer_data_3[i+2] = -(float)z + 0.5f;
            i += 3;
        }
    }
}
I would like faster operations, so I'll use CUDA for some operations like the one listed above. I want to create one block for each iteration of the outermost loop and, since the inner loops have 170 * 170 = 28900 total iterations, assign one thread to each innermost-loop iteration. I converted the C++ code into this (it's just a small program that I made to understand how to use CUDA):
__global__ void mykernel(int k, float *buffer) {
    int idz = blockIdx.x;
    int idx = threadIdx.x;
    int idy = threadIdx.y;
    buffer[k]   = idx + 0.5;
    buffer[k+1] = idy + 0.5;
    buffer[k+2] = idz + 0.5;
    k += 3;
}

int main(void) {
    int dim = 3 * 170 * 170 * 263;
    float* g_vertex_buffer_data_2 = new float[dim];
    float* g_vertex_buffer_data_3;
    int i = 0;
    HANDLE_ERROR(cudaMalloc((void**)&g_vertex_buffer_data_3, sizeof(float)*dim));
    dim3 dimBlock(170, 170);
    dim3 dimGrid(263);
    mykernel<<<dimGrid, dimBlock>>>(i, g_vertex_buffer_data_3);
    HANDLE_ERROR(cudaMemcpy(&g_vertex_buffer_data_2, g_vertex_buffer_data_3, sizeof(float)*dim, cudaMemcpyDeviceToHost));
    for (int j = 0; j < 100; j++) {
        printf("g_vertex_buffer_data_2[%d]=%f\n", j, g_vertex_buffer_data_2[j]);
    }
    cudaFree(g_vertex_buffer_data_3);
    return 0;
}
Trying to launch it, I get a segmentation fault. Do you know what I am doing wrong?
I think the problem is that threadIdx.x and threadIdx.y grow at the same time, while I would like threadIdx.x to be the inner index and threadIdx.y the outer one.
There is a lot wrong here, but the source of the segfault is this:
cudaMemcpy(&g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
You either want
cudaMemcpy(&g_vertex_buffer_data_2[0],g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
or
cudaMemcpy(g_vertex_buffer_data_2,g_vertex_buffer_data_3,
sizeof(float)*dim,cudaMemcpyDeviceToHost);
Once you fix that, you will notice that the kernel actually never launches, failing with an invalid launch error. This is because a block size of (170,170) is illegal: CUDA has a limit of 1024 threads per block on all current hardware.
There might well be other problems in your code. I stopped looking after I found these two.
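For what it's worth, here is a minimal corrected sketch (my version, not from the original answer): each thread computes its own output offset instead of sharing k, and the block size stays legal by moving the middle loop into the grid.

__global__ void mykernel(float *buffer)
{
    int x = threadIdx.x;  // inner loop, 0..169
    int y = blockIdx.x;   // middle loop, 0..169
    int z = blockIdx.y;   // outer loop, 0..262
    // same ordering as the CPU loop: x fastest, then y, then z
    int k = 3 * (x + 170 * (y + 170 * z));
    buffer[k]   = (float)x + 0.5f;
    buffer[k+1] = (float)y + 0.5f;
    buffer[k+2] = -(float)z + 0.5f;
}

// launched as: mykernel<<<dim3(170, 263), 170>>>(g_vertex_buffer_data_3);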

Possibly negative indices in a CUDA thread block?

I have a quite simple 1D CUDA kernel doing an inclusive sum; that is, if we have an input 1D array
[ x_0, x_1, x_2, ..., x_{n-1} ]
the output would be
[ x_0, x_0+x_1, x_0+x_1+x_2, ..., x_0+x_1+...+x_{n-1} ].
The kernel shown below does not actually finish the whole job; rather, it finishes the job within each block. Anyway, my question is not about how to completely implement the inclusive sum, but I think there is a possible negative-indexing error during the thread calculation.
__global__ void parallel_scan_inefficient(float* input, float* output) {
    // num_threads and max_i are globally defined
    __shared__ float temp[num_threads];
    int i = blockIdx.x*blockDim.x + threadIdx.x; // global index
    if (i < max_i)
    {
        temp[threadIdx.x] = input[i];
    }
    for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2)
    {
        __syncthreads();
        temp[threadIdx.x] += temp[threadIdx.x - stride];
    }
    output[i] = temp[threadIdx.x];
}
This piece of code is in fact from Hwu & Kirk's textbook "Programming Massively Parallel Processors", Chapter 9, p. 203.
So as you can see in the for-loop
for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2)
{
    __syncthreads();
    temp[threadIdx.x] += temp[threadIdx.x - stride];
}
since "threadIdx.x" starts from 0 for each block, but "stride" starts from 1. Wouldn't we see for example temp[-1] for the first element in a block ? Also after one iteration, "stride" then becomes 2 and we will see temp[-2] for threadIdx.x=0 ?
This doesn't quite make sense to me, though the CUDA compiler doesn't report any errors. I ran cuda-memcheck on this kernel and it is still fine. Also, the results are right (of course only right within each block; as I said, this kernel only partially finishes the inclusive sum).
I reckon I might be making a very stupid mistake, but I just couldn't spot it. Any light would be much appreciated. Many thanks.
If you have a code like this:
for (unsigned int stride = 1; stride <= threadIdx.x; stride *= 2)
{
    __syncthreads();
    temp[threadIdx.x] += temp[threadIdx.x - stride];
}
Then for the thread where threadIdx.x == 0, the for loop will be skipped entirely, because the condition stride <= threadIdx.x is false from the start. Try running the following code in main:
for (unsigned int stride = 1; stride <= 0; stride *= 2)
{
    cout << "I am running" << endl;
}
And you'll see there is nothing in the console. More generally, the loop condition stride <= threadIdx.x guarantees that threadIdx.x - stride >= 0 on every iteration, so temp is never indexed with a negative value.

Code runs 6 times slower with 2 threads than with 1

Original Problem:
So I have written some code to experiment with threads and do some testing.
The code should create some numbers and then find the mean of those numbers.
I think it is just easier to show you what I have so far. I was expecting that with two threads the code would run about 2 times as fast. Measuring it with a stopwatch, I think it runs about 6 times slower! EDIT: Now using the clock() function on the computer to tell the time.
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);

int main(int argn, char** argv)
{
    // Program entry point
    std::cout << "Generating data..." << std::endl;
    // Create a vector containing many variables
    std::vector<double> data;
    for (uint32_t i = 1; i <= 1024 * 1024 * 128; i++) data.push_back(i);
    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();
    // Print result
    std::cout << " Mean=" << mean << std::endl;
    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;
    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;
    // Split the data into two blocks
    if (data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else if (data.size() % 2 == 1)
    {
        B = C = halfsize;
        D = halfsize + 1;
    }
    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D, &(result[1])));
    // Join threads
    thread[0].join();
    thread[1].join();
    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();
    // Print result
    std::cout << " Mean=" << mean << std::endl;
    // Return
    return EXIT_SUCCESS;
}

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for (uint32_t i = 0; i < length; i++) {
        *result += (*datavec).at(start + i);
    }
}
I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that also.
Register Variable:
Several people have suggested making a local variable for the function 'findmean'. This is what I have done:
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    register double holding = *result;
    for (uint32_t i = 0; i < length; i++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;
}
I can now report: the code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?
Register Variable and O2 Optimization:
I have set the optimization to 'O2' - I will create a table with the results.
Results so far:
Original code with no optimization or register variable:
1 thread: 4.98 seconds, 2 threads: 29.59 seconds
Code with added register variable:
1 thread: 4.76 seconds, 2 threads: 4.76 seconds
With register variable and -O2 optimization:
1 thread: 0.43 seconds, 2 threads: 0.60 seconds. 2 threads is now slower?
With Dameon's suggestion, which was to put a large block of memory in between the two result variables:
1 thread: 0.42 seconds, 2 threads: 0.64 seconds
With TAS's suggestion of using iterators to access the contents of the vector:
1 thread: 0.38 seconds, 2 threads: 0.56 seconds
Same as above on a Core i7 920 (single-channel memory, 4GB):
1 thread: 0.31 seconds, 2 threads: 0.56 seconds
Same as above on a Core i7 920 (dual-channel memory, 2x2GB):
1 thread: 0.31 seconds, 2 threads: 0.35 seconds
Why are 2 threads 6x slower than 1 thread?
You are getting hit by a bad case of false sharing.
After getting rid of the false sharing, why are 2 threads not faster than 1 thread?
You are bottlenecked by your memory bandwidth.
False Sharing:
The problem here is that each thread accesses the result variable at adjacent memory locations. It's likely that they fall on the same cache line, so each time a thread accesses it, it will bounce the cache line between the cores.
Each thread is running this loop:
for (uint32_t i = 0; i < length; i++) {
    *result += (*datavec).at(start + i);
}
And you can see that the result variable is accessed very often (each iteration). So on each iteration, the threads are fighting for the same cache line that holds both values of result.
Normally, the compiler should put *result into a register, thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties on every iteration of the loop.
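For illustration, a minimal sketch (not from the original answer) of padding the accumulators so each lands on its own cache line:

// alignas(64) gives each accumulator its own 64-byte cache line
// (64 bytes is the typical line size on x86), so writes by one thread
// no longer invalidate the line holding the other thread's result.
struct alignas(64) PaddedDouble {
    double value = 0.0;
};

PaddedDouble results[2]; // pass &results[0].value and &results[1].value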
Memory Bandwidth:
Once you have eliminated the false sharing and gotten rid of the 6x slowdown, the reason why you're not seeing further improvement is that you've maxed out your memory bandwidth.
Sure, your processor may have 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work per memory access. A single thread is already enough to max out your memory bandwidth, so going to more threads is not likely to get you much improvement.
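As a rough sanity check (illustrative numbers, not from the original answer): the test data is 1024 * 1024 * 128 doubles at 8 bytes each, about 1 GiB in total. At a sustained DRAM bandwidth on the order of 10 GB/s, merely streaming that data once takes roughly 0.1 s, which is already the same order of magnitude as the measured 0.31-0.43 s single-thread runs; a second thread cannot buy much when the memory bus is the bottleneck.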
In short, no, you won't be able to make summing an array significantly faster by throwing more threads at it.
As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this happens. The std::vector<T>::at() function (as well as std::vector<T>::operator[]()) accesses the length of the vector on each element access. To avoid this, you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.
Here are the relevant parts of the code:
thread.push_back(std::thread(findmean, std::begin(data) + A, std::begin(data) + B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data) + B, std::end(data), &(result[1])));
and
void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
    *result = std::accumulate(start, end, 0.0);
}
This consistently gives me better performance for two threads on my 32-bit netbook.
More threads don't automatically mean faster! There is overhead in creating and context-switching threads, and even the hardware on which this code runs influences the results. For such trivial work as this, a single thread is probably better.
This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data size is 128M doubles (about 1 GiB), which is not a lot for modern processors to process in a single loop.