Memory Race in Cuda - c++

I have a __global__ function that gets an array and an index into the array.
The function needs to find a word in a dictionary and where it starts in a given sequence.
But I see that the threads are overwriting each other's results, so I guess it's because of a memory race.
What can I do?
__global__ void find_words(int* dictionary, int dictionary_size, int* indeces,
int indeces_size, int *sequence, int sequence_size,
int longest_word, int* devWords, int *counter)
{
int id = blockIdx.x * blockDim.x + threadIdx.x;
int start = id * (CHUNK_SIZE - longest_word);
int finish = start + CHUNK_SIZE;
int word_index = -1;
if (finish > sequence_size)
{
finish = sequence_size;
}
// search in a closed area
while(start < finish)
{
find_word_in_phoneme_dictionary_kernel(dictionary, dictionary_size,
indeces, indeces_size, sequence, &word_index, start, finish);
if(word_index >= 0 && word_index <= indeces[indeces_size-1])
{
devWords[*counter] = word_index;
devWords[*counter+1] = start; // index in sequence
*counter+=2;
start += dictionary[word_index];
}
else
{
start++;
}
}
__syncthreads();
}
I also tried to give each thread its own array and counter to store its results
and then collect the results from all threads, but I don't understand how to implement the gather in CUDA. Any help?

I guess the problem is that your counter is read and incremented by multiple threads. As a result, multiple threads will use the same counter value as index in the array. You should instead use int atomicAdd(int* address, int val); to increment the counter. The code would look like this:
int oldCounter = atomicAdd(counter, 2);
devWords[oldCounter] = word_index;
devWords[oldCounter+1] = start;
Note that I incremented counter before accessing the array. atomicAdd(...) returns the old value of the counter, which I then used to access the array.
Atomic operations are serialized, however, which means that increments of the counter cannot run in parallel. The rest of the code still runs in parallel, though.
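The same reserve-then-write pattern can be sketched on the host with std::atomic, whose fetch_add also returns the value held before the addition. This is a plain C++ analogue for illustration, not CUDA code; record_match, collect_pairs, and their parameters are hypothetical names that do not appear in the question.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: reserve two consecutive slots, then fill them.
// fetch_add returns the value before the addition, so no two callers
// can receive the same base index.
void record_match(std::vector<int>& out, std::atomic<int>& counter,
                  int word_index, int start)
{
    int base = counter.fetch_add(2);
    out[base] = word_index;
    out[base + 1] = start;
}

// Launch `workers` threads that each record `per_worker` (word, start) pairs.
std::vector<int> collect_pairs(int workers, int per_worker)
{
    std::vector<int> out(2 * workers * per_worker, -1);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&out, &counter, w, per_worker] {
            for (int k = 0; k < per_worker; ++k)
                record_match(out, counter, w, k);
        });
    }
    for (auto& t : pool) t.join();
    assert(counter.load() == 2 * workers * per_worker);
    return out;
}
```

The pairs land in an arbitrary order, but each pair stays intact because both of its slots were reserved in a single atomic step.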

Related

C++: Multithreaded merge sort with N threads

I'm trying to write a merge sort with 2 threads.
I divide the array into 2 pieces and sort each half with the usual merge sort. After that I just merge the two sorted parts.
The usual merge sort works correctly, and if I apply it to each part without threads, it works correctly too.
I ran a lot of tests on randomly generated short arrays. There can be 2k correct tests in a row, but sometimes my multithreaded sort doesn't work properly.
After sorting each half but before merging them, I check them. Sometimes the set of numbers in the current part of the array turns out to be different from the original set of numbers in that part before sorting; numbers just appear from nowhere.
There must be some problem with the threads, because there is no such problem without them.
As you can see, I made a buffer with length = array.size() and I pass a reference to it to the functions. This buffer is used when merging two sorted arrays.
Each buffer element is initialized with 0.
I'm sure that there is no shared data, because every function uses a separate part of the buffer. The correct behavior of the usual merge sort supports that.
Please help me understand what is wrong with this way of using threads. I'm absolutely confused.
P.S. My code is supposed to execute the sort in N threads, not in 2; that's why I create an array of threads. But even with 2 it doesn't work.
Multithread function:
void merge_sort_multithread(std::vector<int>& arr, std::vector<int>& buffer, unsigned int threads_count)
{
int length = arr.size();
std::vector<std::thread> threads;
// dividing array into nearly equal parts
std::vector<int> thread_from; // array with the start index of each part
std::vector<int> thread_length; // array with the length of each part
make_parts(thread_from, thread_length, threads_count, length);
// start threads
for (int i = 0; i < threads_count; ++i)
{
threads.push_back(std::thread(merge_sort, std::ref(arr), std::ref(buffer),
thread_length[i], thread_from[i]));
}
// waiting for end of sorting
for (int i = 0; i < threads_count; ++i)
threads[i].join();
// ------- here I check each part and find mistakes, so next function is not important ----
merge_sorted_after_multithreading(arr, buffer, thread_from, thread_length, threads_count, 0);
}
Usual merge sort:
void merge_sort(std::vector<int>& arr, std::vector<int>& buffer, size_t length, int from)
{
if (length == 1)
{
return;
}
int length_left = length / 2;
int length_right = length - length_left;
// sorting each part
merge_sort(arr, buffer, length_left, from);
merge_sort(arr, buffer, length_right, from + length_left);
// merging sorted parts
merge_arrays(arr, buffer, length_left, length - length_left, from, from + length_left);
}
Merging two sorted arrays with buffer:
void merge_arrays(std::vector<int>& arr, std::vector<int>& buffer, size_t length_left, size_t length_right, int start_left, int start_right)
{
int idx_left, idx_right, idx_buffer;
idx_left = idx_right = idx_buffer = 0;
while ((idx_left < length_left) && (idx_right < length_right))
{
if (arr[start_left + idx_left] < arr[start_right + idx_right])
{
do {
buffer[idx_buffer] = arr[start_left + idx_left];
++idx_buffer;
++idx_left;
} while ((idx_left < length_left) && (arr[start_left + idx_left] < arr[start_right + idx_right]));
}
else
{
do {
buffer[idx_buffer] = arr[start_right + idx_right];
++idx_buffer;
++idx_right;
} while ((idx_right < length_right) && (arr[start_right + idx_right] < arr[start_left + idx_left]));
}
}
if (idx_left == length_left)
{
for (; idx_right < length_right; ++idx_right)
{
buffer[idx_buffer] = arr[start_right + idx_right];
++idx_buffer;
}
}
else
{
for (; idx_left < length_left; ++idx_left)
{
buffer[idx_buffer] = arr[start_left + idx_left];
++idx_buffer;
}
}
// copying result to original array
for (int i = 0; i < idx_buffer; ++i)
{
arr[start_left + i] = buffer[i];
}
}
Dividing array into separated parts:
void make_parts(std::vector<int>& thread_from, std::vector<int>& thread_length, unsigned int threads_count, size_t length)
{
int dlength = (length / threads_count);
int odd_length = length % threads_count;
int offset = 0;
for (int i = 0; i < threads_count; ++i)
{
if (odd_length > 0)
{
thread_length.push_back(dlength + 1);
--odd_length;
}
else
thread_length.push_back(dlength);
thread_from.push_back(offset);
offset += thread_length[i];
}
}
P.P.S. Each function except multithread sort was tested and works correctly
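One thing worth checking in the code above: merge_arrays writes to buffer[idx_buffer] with idx_buffer starting at 0, so every call merges through the same leading elements of buffer, even when the threads sort different parts of arr. A minimal sketch of a merge that keeps each call inside its own slice, by offsetting the buffer writes by start_left, might look like this. merge_arrays_offset is a hypothetical name, and the sketch assumes the two runs are adjacent, as they are in merge_sort.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merge arr[start_left, start_left + length_left) with the adjacent run of
// length_right elements. Scratch space is buffer[start_left ...], so calls
// on disjoint ranges of arr never touch the same buffer elements.
void merge_arrays_offset(std::vector<int>& arr, std::vector<int>& buffer,
                         std::size_t length_left, std::size_t length_right,
                         std::size_t start_left)
{
    assert(buffer.size() >= arr.size());
    const std::size_t start_right = start_left + length_left;
    const std::size_t end = start_right + length_right;
    std::size_t l = start_left, r = start_right, out = start_left;

    // Take the smaller head element; ties go to the left run (stable).
    while (l < start_right && r < end)
        buffer[out++] = (arr[r] < arr[l]) ? arr[r++] : arr[l++];
    while (l < start_right) buffer[out++] = arr[l++];
    while (r < end)         buffer[out++] = arr[r++];

    // Copy the merged slice back to the same positions in arr.
    for (std::size_t i = start_left; i < end; ++i)
        arr[i] = buffer[i];
}
```

With this layout, two threads merging [0, 6) and [6, 10) use buffer[0..5] and buffer[6..9] respectively, so neither overwrites the other's scratch data.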

Fill an array from different threads concurrently c++

First of all, I think it is important to say that I am new to multithreading and know very little about it. I was trying to write some programs in C++ using threads and ran into a problem (question) that I will try to explain to you now:
I wanted to use several threads to fill an array, here is my code:
static const int num_threads = 5;
int A[50], n;
//------------------------------------------------------------
void ThreadFunc(int tid)
{
for (int q = 0; q < 5; q++)
{
A[n] = tid;
n++;
}
}
//------------------------------------------------------------
int main()
{
thread t[num_threads];
n = 0;
for (int i = 0; i < num_threads; i++)
{
t[i] = thread(ThreadFunc, i);
}
for (int i = 0; i < num_threads; i++)
{
t[i].join();
}
for (int i = 0; i < n; i++)
cout << A[i] << endl;
return 0;
}
As a result of this program I get:
0
0
0
0
0
1
1
1
1
1
2
2
2
2
2
and so on.
As I understand, the second thread starts writing elements to an array only when the first thread finishes writing all elements to an array.
The question is why the threads don't work concurrently. I mean, why don't I get something like this:
0
1
2
0
3
1
4
and so on.
Is there any way to solve this problem?
Thank you in advance.
Since n is accessed from more than one thread, those accesses need to be synchronized so that changes made in one thread don't conflict with changes made in another. There are (at least) two ways to do this.
First, you can make n an atomic variable. Just change its definition, and do the increment where the value is used:
std::atomic<int> n;
...
A[n++] = tid;
Or you can wrap all the accesses inside a critical section:
std::mutex mtx;
int next_n() {
std::unique_lock<std::mutex> lock(mtx);
return n++;
}
And in each thread, instead of directly incrementing n, call that function:
A[next_n()] = tid;
This is much slower than the atomic access, so not appropriate here. In more complex situations it will be the right solution.
The worker function is so short, i.e., finishes executing so quickly, that it's possible that each thread is completing before the next one even starts. Also, you may need to link with a thread library to get real threads, e.g., -lpthread. Even with that, the results you're getting are purely by chance and could appear in any order.
There are two corrections you need to make for your program to be properly synchronized. Change:
int n;
// ...
A[n] = tid; n++;
to
std::atomic_int n;
// ...
A[n++] = tid;
Often it's preferable to avoid synchronization issues altogether and split the workload across threads. Since the work done per iteration is the same here, it's as easy as dividing the work evenly:
void ThreadFunc(int tid, int first, int last)
{
for (int i = first; i < last; i++)
A[i] = tid;
}
Inside main, modify the thread create loop:
for (int first = 0, i = 0; i < num_threads; i++) {
// possible num_threads does not evenly divide ASIZE.
int last = (i != num_threads-1) ? std::size(A)/num_threads*(i+1) : std::size(A);
t[i] = thread(ThreadFunc, i, first, last);
first = last;
}
Of course by doing this, even though the array may be written out of order, the values will be stored to the same locations every time.
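A self-contained sketch of this range-splitting idea, assuming a std::vector in place of the global array (fill_by_ranges is a hypothetical name used only for illustration):

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread writes its id into its own half-open range [first, last).
// No index is shared between threads, so no atomics or locks are needed.
void fill_by_ranges(std::vector<int>& a, int num_threads)
{
    assert(num_threads > 0);
    std::vector<std::thread> workers;
    const std::size_t n = a.size();
    std::size_t first = 0;
    for (int tid = 0; tid < num_threads; ++tid) {
        // The last thread takes whatever remains when n is not divisible.
        std::size_t last = (tid == num_threads - 1)
                               ? n
                               : n / num_threads * (tid + 1);
        workers.emplace_back([&a, tid, first, last] {
            for (std::size_t i = first; i < last; ++i)
                a[i] = tid;
        });
        first = last;
    }
    for (auto& w : workers) w.join();
}
```

For example, with 52 elements and 5 threads, threads 0 to 3 each fill 10 elements and thread 4 fills the remaining 12.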

Getting a C++ segmentation fault

I'm on a Linux server, and when I try to execute the program it returns a segmentation fault. When I use gdb to try to find out why, it returns:
Starting program: /home/cups/k
Program received signal SIGSEGV, Segmentation fault.
0x0000000000401128 in search(int) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-17.el6.x86_64 libstdc++-4.4.7-17.el6.x86_64
I couldn't quite interpret this. In my program I have a function called "search()", but I don't see anything in it that would cause a segfault. Here's the function definition:
int search (int bit_type) { // SEARCH FOR A CONSEC NUMBER (of type BIT_TYPE) TO SEE IF ALREADY ENCOUNTERED
for (int i = 1; i <= MAX[bit_type]; i++) { //GO THRU ALL ENCOUNTERED CONSEC NUMBERS SO FAR (for type BIT_TYPE)
if (consec == r[bit_type][i]) // IF: FOUND
return i; // -----> RETURN INDEX OF RECORDED CONSEC_NUM
}
// IF: NOT FOUND
r[bit_type][++MAX[bit_type]] = consec; // -----> INCREMENT MAX[bit_type] & RECORD NEW CONSEC_NUM -------> ARRAY[MAX]
n[bit_type][MAX[bit_type]] = 1;
return (MAX[bit_prev]); // -----> RETURN THE NEWLY FILLED INDEX
}
Global variables:
int MAX[2];
int r[2][200];
int n[2][200];
The comments are pretty useless to you guys since you don't have the rest of the program, but you can just ignore them.
But do you guys see anything I missed?
From the link to your code here, here is just one error:
int *tmp = new int[MAX[0]];
for (int y = 0; y <= MAX[0]; y++) {
tmp[y] = 1;
}
You are going out-of-bounds on the last iteration. You allocated an array with MAX[0] items, and on the last iteration you're accessing tmp[MAX[0]].
That loop should be:
int *tmp = new int[MAX[0]];
for (int y = 0; y < MAX[0]; y++) {
tmp[y] = 1;
}
or better yet:
#include <algorithm>
//...
std::fill(tmp, tmp + MAX[0], 1); // no loop needed
or skip the dynamic allocation using new[] and use std::vector:
#include <vector>
//...
std::vector<int> tmp(MAX[0], 1);
In general, you have multiple loops that do this:
for (int i = 1; i <= number_of_items_in_array; ++i )
and then you access your arrays with array[i]. It is the <= in that for loop condition that is suspicious since it will try to access the array with an out-of-bounds index on the last iteration.
Another example is this:
long sum(int arr_r[], int arr_n[], int limit)
{
long tot = 0;
for (int i = 1; i <= limit; i++)
{
tot += (arr_r[i])*(arr_n[i]);
}
return tot;
}
Here, limit is the number of elements in the array, and you access arr_r[limit] on the last iteration, which is one past the last valid element and causes undefined behavior.
Arrays are indexed starting from 0 and up to n - 1, where n is the total number of elements. Trying to fake 1-based arrays as you're attempting to do almost always results in these types of errors somewhere inside of the code base.
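As a minimal sketch, a zero-based version of the sum could look like this. It assumes std::vector arguments instead of raw arrays so the size travels with the data; sum_products is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Zero-based version of the sum: valid indexes are 0 .. size() - 1,
// so the loop condition uses < rather than <=.
long sum_products(const std::vector<int>& r, const std::vector<int>& n)
{
    assert(r.size() == n.size());
    long tot = 0;
    for (std::size_t i = 0; i < r.size(); ++i)
        tot += static_cast<long>(r[i]) * n[i];
    return tot;
}
```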

C++ - Efficiently computing a vector-matrix product

I need to compute a vector-matrix product as efficiently as possible. Specifically, given a vector s and a matrix A, I need to compute s * A. I have a class Vector which wraps a std::vector and a class Matrix which also wraps a std::vector (for efficiency).
The naive approach (the one that I am using at the moment) is to have something like
Vector<T> timesMatrix(Matrix<T>& matrix)
{
Vector<unsigned int> result(matrix.columns());
// constructor that does a resize on the underlying std::vector
for(unsigned int i = 0 ; i < vector.size() ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
result[j] += (vector[i] * matrix.getElementAt(i, j));
// getElementAt accesses the appropriate entry
// of the underlying std::vector
}
}
return result;
}
It works fine and takes nearly 12000 microseconds. Note that the vector s has 499 elements, while A is 499 x 15500.
The next step was trying to parallelize the computation: if I have N threads, then I can give each thread a part of the vector s and the "corresponding" rows of the matrix A. Each thread will compute a partial result Vector with one entry per column of A (15500 entries), and the final result will be their entry-wise sum.
First of all, in the class Matrix I added a method to extract some rows from a Matrix and build a smaller one:
Matrix<T> extractSomeRows(unsigned int start, unsigned int end)
{
unsigned int rowsToExtract = end - start + 1;
std::vector<T> tmp;
tmp.reserve(rowsToExtract * numColumns);
for(unsigned int i = start * numColumns ; i < (end+1) * numColumns ; ++i)
{
tmp.push_back(matrix[i]);
}
return Matrix<T>(rowsToExtract, numColumns, tmp);
}
Then I defined a thread routine
void timesMatrixThreadRoutine
(Matrix<T>& matrix, unsigned int start, unsigned int end, Vector<T>& newRow)
{
// newRow is supposed to contain the partial result
// computed by a thread
newRow.resize(matrix.columns());
for(unsigned int i = start ; i < end + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matrix.columns() ; ++j)
{
newRow[j] += vector[i] * matrix.getElementAt(i - start, j);
}
}
}
And finally I modified the code of the timesMatrix method that I showed above:
Vector<T> timesMatrix(Matrix<T>& matrix)
{
static const unsigned int NUM_THREADS = 4;
unsigned int matRows = matrix.rows();
unsigned int matColumns = matrix.columns();
unsigned int rowsEachThread = vector.size()/NUM_THREADS;
std::thread threads[NUM_THREADS];
Vector<T> tmp[NUM_THREADS];
unsigned int start, end;
// all but the last thread
for(unsigned int i = 0 ; i < NUM_THREADS - 1 ; ++i)
{
start = i*rowsEachThread;
end = (i+1)*rowsEachThread - 1;
threads[i] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[i]));
}
// last thread
start = (NUM_THREADS-1)*rowsEachThread;
end = matRows - 1;
threads[NUM_THREADS - 1] = std::thread(&Vector<T>::timesMatrixThreadRoutine, this,
matrix.extractSomeRows(start, end), start, end, std::ref(tmp[NUM_THREADS-1]));
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
threads[i].join();
}
Vector<unsigned int> result(matColumns);
for(unsigned int i = 0 ; i < NUM_THREADS ; ++i)
{
result = result + tmp[i]; // the operator+ is overloaded
}
return result;
}
It still works, but now it takes nearly 30000 microseconds, which is about two and a half times as much as before.
Am I doing something wrong? Do you think there is a better approach?
EDIT - using a "lightweight" VirtualMatrix
Following Ilya Ovodov's suggestion, I defined a class VirtualMatrix that wraps a T* matrixData, which is initialized in the constructor as
VirtualMatrix(Matrix<T>& m)
{
numRows = m.rows();
numColumns = m.columns();
matrixData = m.pointerToData();
// pointerToData() returns underlyingVector.data();
}
Then there is a method to retrieve a specific entry of the matrix:
inline T getElementAt(unsigned int row, unsigned int column)
{
return *(matrixData + row*numColumns + column);
}
Now the execution time is better (approximately 8000 microseconds) but maybe there are some improvements to be made. In particular the thread routine is now
void timesMatrixThreadRoutine
(VirtualMatrix<T>& matrix, unsigned int startRow, unsigned int endRow, Vector<T>& newRow)
{
unsigned int matColumns = matrix.columns();
newRow.resize(matColumns);
for(unsigned int i = startRow ; i < endRow + 1 ; ++i)
{
for(unsigned int j = 0 ; j < matColumns ; ++j)
{
newRow[j] += (vector[i] * matrix.getElementAt(i, j));
}
}
}
and the really slow part is the one with the nested for loops. If I remove it, the result is obviously wrong but is "computed" in less than 500 microseconds. This is to say that passing the arguments now takes almost no time and the heavy part is really the computation.
According to you, is there any way to make it even faster?
Actually you make a partial copy of matrix for each thread in extractSomeRows. It takes a lot of time.
Redesign it so that "some rows" become virtual matrix pointing at data located in original matrix.
Use vectorized instructions for your architecture by making it more explicit that you want to multiply four elements at a time, e.g. SSE2+ on x86-64 and possibly NEON on ARM.
C++ compilers can often unroll the loop into vectorized code if you explicitly make the operation happen on contiguous elements:
Simple and fast matrix-vector multiplication in C / C++
There is also the option of using libraries specifically made for matrix multiplication. For larger matrices, it may be more efficient to use special implementations based on the Fast Fourier Transform, alternate algorithms like Strassen's Algorithm, etc. In fact, your best bet would be to use a C library like this, and then wrap it in an interface that looks similar to a C++ vector.
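To illustrate the point about contiguous elements, here is a minimal sketch of s * A for a row-major matrix stored in a flat std::vector. It uses double and the hypothetical name vec_times_matrix rather than the question's Vector and Matrix classes. Whether the inner loop is actually vectorized depends on the compiler and optimization level.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes s * A for a row-major matrix A with `rows` x `cols` entries.
// The inner loop walks one row of A and the result with unit stride,
// which is the access pattern compilers can typically auto-vectorize.
std::vector<double> vec_times_matrix(const std::vector<double>& s,
                                     const std::vector<double>& a,
                                     std::size_t rows, std::size_t cols)
{
    assert(s.size() == rows && a.size() == rows * cols);
    std::vector<double> result(cols, 0.0);
    double* out = result.data();
    for (std::size_t i = 0; i < rows; ++i) {
        const double si = s[i];                   // hoisted out of the inner loop
        const double* row = a.data() + i * cols;  // start of row i
        for (std::size_t j = 0; j < cols; ++j)
            out[j] += si * row[j];
    }
    return result;
}
```

Hoisting s[i] and the row pointer out of the inner loop leaves a simple multiply-add over two contiguous arrays, with no per-element accessor call.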

Multithreading taking equal time as single thread quick sorting

I'm working on Linux, but the multithreaded and single-threaded versions both take 340 ms. Can someone tell me what's wrong with what I'm doing?
Here is my code
#include<time.h>
#include<fstream>
#define SIZE_OF_ARRAY 1000000
using namespace std;
struct parameter
{
int *data;
int left;
int right;
};
void readData(int *data)
{
fstream iFile("Data.txt");
for(int i = 0; i < SIZE_OF_ARRAY; i++)
iFile>>data[i];
}
int threadCount = 4;
int Partition(int *data, int left, int right)
{
int i = left, j = right, temp;
int pivot = data[(left + right) / 2];
while(i <= j)
{
while(data[i] < pivot)
i++;
while(data[j] > pivot)
j--;
if(i <= j)
{
temp = data[i];
data[i] = data[j];
data[j] = temp;
i++;
j--;
}
}
return i;
}
void QuickSort(int *data, int left, int right)
{
int index = Partition(data, left, right);
if(left < index - 1)
QuickSort(data, left, index - 1);
if(index < right)
QuickSort(data, index + 1, right);
}
//Multi threading code starts from here
void *Sort(void *param)
{
parameter *param1 = (parameter *)param;
QuickSort(param1->data, param1->left, param1->right);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
clock_t start, diff;
int *data = new int[SIZE_OF_ARRAY];
pthread_t threadID, threadID1;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
parameter param, param1;
readData(data);
start = clock();
int index = Partition(data, 0, SIZE_OF_ARRAY - 1);
if(0 < index - 1)
{
param.data = data;
param.left = 0;
param.right = index - 1;
pthread_create(&threadID, NULL, Sort, (void *)&param);
}
if(index < SIZE_OF_ARRAY - 1)
{
param1.data = data;
param1.left = index + 1;
param1.right = SIZE_OF_ARRAY;
pthread_create(&threadID1, NULL, Sort, (void *)&param1);
}
pthread_attr_destroy(&attr);
pthread_join(threadID, NULL);
pthread_join(threadID1, NULL);
diff = clock() - start;
cout<<"Sorting Time = "<<diff * 1000 / CLOCKS_PER_SEC<<"\n";
delete []data;
return 0;
}
//Multithreading Ends here
Single thread main function
int main(int argc, char *argv[])
{
clock_t start, diff;
int *data = new int[SIZE_OF_ARRAY];
readData(data);
start = clock();
QuickSort(data, 0, SIZE_OF_ARRAY - 1);
diff = clock() - start;
cout<<"Sorting Time = "<<diff * 1000 / CLOCKS_PER_SEC<<"\n";
delete []data;
return 0;
}
//Single thread code ends here
Some of the functions are shared by the single-threaded and multithreaded versions.
clock returns total CPU time, not wall time.
If you have 2 CPUs and 2 threads, then after a second of running both threads simultaneously, clock will return a CPU time of 2 seconds (the sum of the CPU times of each thread).
So the result is totally expected. It does not matter how many CPUs you have, the total running time summed over all CPUs will be the same.
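To see the difference, wall time can be measured next to CPU time. The following is a minimal sketch using std::chrono::steady_clock for elapsed time and std::clock for CPU time; Timing and measure are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Elapsed (wall) time and processor (CPU) time for one piece of work.
struct Timing {
    double wall_ms;
    double cpu_ms;
};

// Runs `work` once and reports both clocks in milliseconds.
template <typename F>
Timing measure(F&& work)
{
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_ms = std::chrono::duration<double, std::milli>(
                    wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start)
               / CLOCKS_PER_SEC;
    return t;
}
```

Wrapping the thread creation and joins from the question in measure would be expected to show the wall time dropping with two threads while the CPU time stays roughly the same.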
Note that you call Partition once from the main thread...
The code works on the same memory block which prevents a CPU from working when the other accesses that same memory block. Unless your data is really large you're likely to have many such hits.
Finally, if your algorithm works at memory speed when you run it with one thread, adding more threads doesn't help. I did such tests a while back with image data, and having multiple threads decreased the total speed because the process was so memory intensive that both threads were fighting to access memory... and the result was worse than not having threads at all.
What really makes today's fastest computers fast is running one very intensive process per computer, not a large number of threads (or processes) on a single computer.
Build a thread pool with a producer-consumer queue with 24 threads hanging off it. Partition your data into two and issue a mergesort task object to the pool. The mergesort object should issue further pairs of mergesorts to the queue and wait on a signal for them to finish, and so on until a mergesort object finds that it has [L1 cache-size data]. The object then quicksorts its data and signals completion to its parent task.
If that doesn't turn out to be blindingly quick on 24 cores, I'll stop posting about threads..
..and it will handle multiple sorts in parallel.
..and the pool can be used for other tasks.
.. and there is no performance-destroying, deadlock-generating join() or synchronize() (if you except the P-C queue, which only locks for long enough to push an object ref on), no thread-creation overhead and no dodgy thread-stopping/terminating/destroying code. Like the barbers, there is no waiting: as soon as a thread is finished with a task it can get another.
No thread micro-management, no tuning, (you could create 64 threads now, ready for the next generation of boxes). You could make the thread count tuneable - just add more threads at runtime, or delete some by queueing up poison-pills.
You don't need a reference to the threads at all - just set 'em off, (pass queue as parameter).
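A minimal sketch of such a pool, assuming std::function tasks and an empty function as the poison pill. TaskPool is a hypothetical name. Unlike the description above, this sketch does join its workers once, in the destructor, so the queue can be torn down safely. The mergesort task objects themselves are not shown.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal producer-consumer pool. An empty std::function acts as the
// "poison pill" that tells one worker to exit.
class TaskPool {
public:
    explicit TaskPool(unsigned thread_count)
    {
        for (unsigned i = 0; i < thread_count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~TaskPool()
    {
        for (std::size_t i = 0; i < workers_.size(); ++i)
            submit(nullptr);              // one pill per worker
        for (auto& w : workers_) w.join();
    }

    // The lock is held only long enough to push the task.
    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return !tasks_.empty(); });
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            if (!task) return;            // poison pill: stop this worker
            task();
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
};
```

Because the queue is FIFO, every task submitted before the destructor runs is picked up before the pills. Tasks submitted by other tasks after that point may never run, so a real implementation would need a stricter shutdown protocol.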