Segmentation Fault Counting Sort C++11 OpenMP - c++

I need help with this parallel counting sort. It crashes with a segmentation fault, and gdb reports that the fault occurs at this line: c[i] = 0;
What could have gone wrong, and how can I fix it? Thanks.
void radix_sort::sort_array(int array[], int n)
{
std::hash<int> hash;
std::size_t m = n / nthreads;
std::vector <int> a(n);
a.insert(a.end(), &array[0], &array[n]);
std::vector<int>::iterator begin = a.begin();
std::vector<int>::iterator end = a.end();
int max = *std::max_element(a.begin(), a.end());
//int min = *std::min_element(a.begin(), a.end());
//int x = max - min + 1;
int *split_positions = new int [nthreads+1];
for(std::size_t i=0; i<a.size(); i=i+m){
if(a.begin()+i+m <= a.end()){
split_positions[i] = *a.begin()+i;
split_positions[i+1] = *a.begin()+i+m;
}
else {
split_positions[i] = *a.begin()+i;
split_positions[i+1] = *a.end();
}
}
// create one counter array for each thread
int **thread_counters = new int* [nthreads];
for (int i = 0; i < nthreads; i++)
thread_counters[i] = new int[m];
// count occurences
#pragma omp parallel num_threads(_nthreads)
{
int thread_id = omp_get_thread_num();
int *&c = thread_counters[thread_id];
// reset counters
for (int i = 0; i <= max; i++)
c[i] = 0;
// count occurences
for (int i = split_positions[thread_id]; i < split_positions[thread_id + 1]; i++)
{
c[hash(begin[i])]++;
}
}
// Compute global prefix sums / ranks from local ones. We *could*
// make this parallel, too, but there are only num_threads * (max_key + 1)
// entries in total.
for (int i = 0, sum = 0; i <= max; i++)
{
for (int j = 0; j < nthreads; j++)
{
int t = thread_counters[j][i];
thread_counters[j][i] = sum;
sum += t;
}
}
int *buffer = new int[n]; // backbuffer, copied back to input later
// write sorted result to backbuffer
#pragma omp parallel num_threads(_nthreads)
{
int thread_id = omp_get_thread_num();
int *&c = thread_counters[thread_id];
for (int i = split_positions[thread_id]; i < split_positions[thread_id + 1]; i++)
{
buffer[c[hash(begin[i])]++] = begin[i];
}
}
// write result from buffer back into input
std::copy(buffer, buffer + n, array);
// cleanup
delete [] buffer;
for (int i = 0; i < nthreads; i++)
delete [] thread_counters[i];
delete [] thread_counters;
delete [] split_positions;
}
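One likely cause, judging only from the code above: each per-thread counter array is allocated with m = n / nthreads elements, but the reset loop writes c[0] through c[max], so it runs past the end of the array whenever max >= m. Two further things look suspicious: `std::vector<int> a(n)` followed by `a.insert(a.end(), ...)` leaves 2n elements (n zeros plus the input), and split_positions is indexed by the element offset i rather than by the thread index. Below is a minimal sketch of the same idea with the counters sized by the key range. It assumes non-negative keys; the function name `counting_sort_parallel` and the chunking scheme are illustrative, not part of the original class.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: chunked counting sort for non-negative ints.
// `nthreads` is the number of chunks; with -fopenmp the chunk loops run in parallel.
void counting_sort_parallel(std::vector<int>& data, int nthreads) {
    assert(nthreads >= 1);
    const std::size_t n = data.size();
    if (n == 0) return;
    assert(*std::min_element(data.begin(), data.end()) >= 0);
    const int max_key = *std::max_element(data.begin(), data.end());

    // Chunk t covers [split[t], split[t + 1]); indexed by chunk, not by element.
    std::vector<std::size_t> split(nthreads + 1);
    for (int t = 0; t <= nthreads; ++t)
        split[t] = n * static_cast<std::size_t>(t) / nthreads;

    // One zero-initialized counter row per chunk, sized by the key range.
    std::vector<std::vector<std::size_t>> counters(
        nthreads, std::vector<std::size_t>(static_cast<std::size_t>(max_key) + 1, 0));

    // Count occurrences; each chunk touches only its own row.
    #pragma omp parallel for num_threads(nthreads)
    for (int t = 0; t < nthreads; ++t)
        for (std::size_t i = split[t]; i < split[t + 1]; ++i)
            ++counters[t][data[i]];

    // Exclusive prefix sum in (key, chunk) order gives each chunk its write offsets.
    std::size_t sum = 0;
    for (int key = 0; key <= max_key; ++key)
        for (int t = 0; t < nthreads; ++t) {
            const std::size_t count = counters[t][key];
            counters[t][key] = sum;
            sum += count;
        }

    // Scatter into a back buffer, then swap it into place.
    std::vector<int> buffer(n);
    #pragma omp parallel for num_threads(nthreads)
    for (int t = 0; t < nthreads; ++t)
        for (std::size_t i = split[t]; i < split[t + 1]; ++i)
            buffer[counters[t][data[i]]++] = data[i];

    data.swap(buffer);
}
```

Without -fopenmp the pragmas are ignored and the same code runs sequentially, which makes it easy to check the logic before measuring any speedup.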

Assign pointer to 2d array to an array

I have a function that creates a 2D array and fills it with test data.
Now I need to assign the returned pointer to an array:
//Fill matrix with test data
int *testArrData(int m, int n){
int arr[n][m];
int* ptr;
ptr = &arr[0][0];
for(int i = 0; i < m; i++){
for(int j = 0; j < n; j++){
*((ptr+i*n)+j) = rand()%10;
}
}
return (int *) arr;
}
int arr[m][n];
//Algorithm - transpose
for (int i = 0; i < m; i++){
for (int j = 0; j < n; j++){
arrT[j][i] = arr[i][j];
}
}
Is there any way of doing this?
There are at least four problems with the function.
//Fill matrix with test data
int *testArrData(int m, int n){
int arr[n][m];
int* ptr;
ptr = &arr[0][0];
for(int i = 0; i < m; i++){
for(int j = 0; j < n; j++){
*((ptr+i*n)+j) = rand()%10;
}
}
return (int *) arr;
}
First of all you declared a variable length array
int arr[n][m];
Variable length arrays are not a standard C++ feature.
The second problem is that these for loops
for(int i = 0; i < m; i++){
for(int j = 0; j < n; j++){
*((ptr+i*n)+j) = rand()%10;
}
}
are incorrect. It seems you mean
for(int i = 0; i < n; i++){
for(int j = 0; j < m; j++){
*((ptr+i*m)+j) = rand()%10;
}
}
You are returning a pointer to a local array with automatic storage duration that will not be alive after exiting the function. So the returned pointer will be invalid.
And arrays do not have the assignment operator.
Instead use the vector std::vector<std::vector<int>>. For example
std::vector<std::vector<int>> testArrData(int m, int n){
std::vector<std::vector<int>> v( n, std::vector<int>( m ) );
for ( auto &row : v )
{
for ( auto &item : row )
{
item = rand() % 10;
}
}
return v;
}
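With the vector-based form, the transpose loop from the question can be written against the returned value directly. The following is a minimal sketch; the `Matrix` alias and the `transpose` helper are names introduced here for illustration, and the input is assumed to be rectangular.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical helper: returns the transpose of a rectangular matrix.
Matrix transpose(const Matrix& src) {
    if (src.empty()) return {};
    const std::size_t rows = src.size();
    const std::size_t cols = src[0].size();
    Matrix dst(cols, std::vector<int>(rows));
    for (std::size_t i = 0; i < rows; ++i) {
        assert(src[i].size() == cols);  // every row must have the same length
        for (std::size_t j = 0; j < cols; ++j)
            dst[j][i] = src[i][j];
    }
    return dst;
}
```

Usage would then be along the lines of `auto arr = testArrData(m, n); auto arrT = transpose(arr);`, with no raw pointers or manual cleanup involved.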
This is how I would accomplish this. I agree with using int ** because it is easy to understand if you don't know how to use vectors. Also, be careful when you use the result of rand() to index an array. Note that rand() itself never returns a negative value, so rand() % number already lies between 0 and number - 1 for a positive number; the abs() in the code below is only a safeguard.
I've updated the answer due to some vital missing code.
// This method creates the overhead / an array of pointers for each matrix
typedef int* matrix_cells;
int **create_row_col_matrix(int num_rows, int num_cols, bool init_rnd)
{
num_rows = min(max(num_rows, 1), 1000); // ensure num_rows = 1 - 1000
num_cols = min(max(num_cols, 1), 1000); // ensure num_cols = 1 - 1000
int *matrix_total = new int[num_rows*num_cols];
// overhead: create an array that points to each row
int **matrix_row_col = new matrix_cells[num_rows];
// initialize the row pointers
for (int a = 0; a < num_rows; ++a)
{
// initialize the array of row pointers
matrix_row_col[a] = &matrix_total[num_cols*a];
}
// assign the test data
if (init_rnd)
{
for (int run_y = 0; run_y < num_rows; ++run_y)
{
for (int run_x = 0; run_x < num_cols; ++run_x)
{
matrix_row_col[run_y][run_x] = abs(rand() % 10);
}
}
}
return matrix_row_col;
}
int src_x = 7, dst_x = 11;
int src_y = 11, dst_y = 7;
int **arr_src = create_row_col_matrix(src_y, src_x, true);
int **arr_dst = create_row_col_matrix(dst_y, dst_x, false);
for (int a = 0; a < dst_y; ++a)
{
for (int b = 0; b < dst_x; ++b)
{
arr_dst[a][b] = arr_src[b][a];
}
}
delete[] arr_src[0]; // int *matrix_total = new int[src_y*src_x]
delete[] arr_src; // int **matrix_row_col = new matrix_cells[src_y]
delete[] arr_dst[0]; // int *matrix_total = new int[dst_y*dst_x]
delete[] arr_dst; // int **matrix_row_col = new matrix_cells[dst_y]
// the overhead is arr_src and arr_dst, which are arrays of row pointers
// the row pointers make it convenient to address the cells as [row][col]

Speedup when avoiding false sharing problem?

I am trying to add cache-line padding to avoid false sharing, but I can't see a big difference in speedup: with padding it is only 1.2x faster. I run both versions (without and with padding) on n = 700 million elements for testing. Should I get more than a 1.2x speedup? Maybe I have missed something in my padding implementation? I am adding 15 ints of padding because I am assuming that counters doesn't have to be allocated at the start of a cache line. Any tips appreciated.
Here is my code:
template <const int k> void par_countingsort2(int *out, int const *in, const int n) {
const int paddingAmount = cachelinesize / sizeof(int);
const int kPadded = k + (paddingAmount - 1);
printf("\n%d", kPadded);
int counters[nproc][kPadded] = {}; // all zeros
#pragma omp parallel
{
int *thcounters = counters[omp_get_thread_num()];
#pragma omp for
for (int i = 0; i < n; ++i)
++thcounters[in[i]];
#pragma omp single
{
int tmp, sum = 0;
for (int j = 0; j < k; ++j)
for (int i = 0; i < nproc; ++i) {
tmp = counters[i][j];
counters[i][j] = sum;
sum += tmp;
}
}
#pragma omp for
for (int i = 0; i < n; ++i)
out[thcounters[in[i]]++] = in[i];
}
}
#define k 1000
int main(int argc, char *argv[]) {
//init input
int n = argc>1 && atoi(argv[1])>0 ? atoi(argv[1]) : 0;
int* in = (int*)malloc(sizeof(int)*n);
int* out = (int*)malloc(sizeof(int)*n);;
for (int i = 0; i < n; ++i)
in[i] = rand()%k;
printf("n = %d\n", n);
//print some parameters
printf("nproc = %d\n", nproc);
printf("cachelinesize = %d byte\n", cachelinesize);
printf("k = %d\n", k);
double tp2 = omp_get_wtime();
par_countingsort2<k>(out, in, n);
tp2 = omp_get_wtime() - tp2;
printf("par2, elapsed time = %.3f seconds (%.1fx speedup from par1), check passed = %c\n", tp2, tp/tp2, checkreset(out,in,n)?'y':'n');
//free mem
free(in);
free(out);
return EXIT_SUCCESS;
}
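One way to make the rows independent of where the array happens to start is to align the array itself to a cache line and round each row length up to a whole number of lines, instead of adding a fixed 15 ints. The sketch below assumes a 64-byte cache line; `kCacheLineBytes`, `padded_row_length` and `PaddedCounters` are hypothetical names, not taken from the question. Note also that with k = 1000 each row already spans roughly 4000 bytes, so even without padding only the cache lines at the row boundaries can be shared between threads, which may be part of why the measured gain is modest.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64-byte cache lines; adjust for the target CPU.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kIntsPerLine = kCacheLineBytes / sizeof(int);

// Smallest multiple of kIntsPerLine that is >= k.
constexpr std::size_t padded_row_length(std::size_t k) {
    return (k + kIntsPerLine - 1) / kIntsPerLine * kIntsPerLine;
}

// Per-thread counters: `Threads` rows, each starting on its own cache line.
// The alignment is honored for static and stack objects; objects created with
// plain `new` may need an aligned allocation on compilers older than C++17.
template <std::size_t K, std::size_t Threads>
struct PaddedCounters {
    static constexpr std::size_t stride = padded_row_length(K);
    alignas(kCacheLineBytes) int data[Threads * stride] = {};  // all zeros

    int* row(std::size_t thread) {
        assert(thread < Threads);
        return data + thread * stride;
    }
};
```

In the question's function this would replace the `counters` array, for example `PaddedCounters<k, nproc> counters;` followed by `int *thcounters = counters.row(omp_get_thread_num());`, provided nproc is a compile-time constant.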

My code is really slow and I need to optimize it

#include <iostream>
#include <chrono>
using namespace std;
int main()
{
const unsigned int m = 200;
const unsigned int n = 200;
srand(static_cast<unsigned int>(static_cast<std::chrono::duration<double>
>(std::chrono::high_resolution_clock::now().time_since_epoch()).count()));
double** matrixa;
double** matrixb;
double** matrixc;
matrixa = new double* [m];
matrixb = new double* [m];
matrixc = new double* [m];
unsigned int max = static_cast<unsigned int>(1u << 31);
for (unsigned int i = 0; i < m; i++)
matrixa[i] = new double[n];
for (unsigned int i = 0; i < m; i++)
matrixb[i] = new double[n];
for (unsigned int i = 0; i < m; i++)
matrixc[i] = new double[n];
for (unsigned int i = 0; i < m; i++)
for (unsigned int j = 0; j < n; j++)
matrixa[i]
[j] = static_cast<double>(static_cast<double>(rand()) / max * 10);
for (unsigned int i = 0; i < m; i++)
for (unsigned int j = 0; j < n; j++)
matrixb[i]
[j] = static_cast<double>(static_cast<double>(rand()) / max * 10);
auto start = std::chrono::high_resolution_clock::now();
for (unsigned int i = 0; i < m; i++)
for (unsigned int j = 0; j < n; j++)
for (unsigned int k = 0; k < m; k++)
for (unsigned int l = 0; l < m; l++)
matrixc[i][j] += matrixa[k][l] * matrixb[l][k];
auto stop = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> time_diff = stop - start;
cout << "Czas wykonania programu " << time_diff.count() << " sekund." <<
endl;
for (unsigned int i = 0; i < m; i++)
delete[] matrixa[i];
for (unsigned int i = 0; i < m; i++)
delete[] matrixb[i];
for (unsigned int i = 0; i < m; i++)
delete[] matrixc[i];
delete[] matrixa;
delete[] matrixb;
delete[] matrixc;
return 0;
}
I have this code and I would like to optimize it, but unfortunately I have absolutely no idea how to go about it. Maybe someone has an idea and would like to help me? I got to the point where the program takes 105 seconds for 400 arrays, but that is still too much, and I would like this code to run faster. I found the OpenMP library and the thread class, but I don't know how to use them in my program.
First, your matrix multiplication is more complex than the usual algorithm (or it is simply wrong). You can refer to Wikipedia for a typical algorithm:
Input: matrices A and B
Let C be a new matrix of the appropriate size
For i from 1 to n:
    For j from 1 to p:
        Let sum = 0
        For k from 1 to m:
            Set sum ← sum + A_ik × B_kj
        Set C_ij ← sum
Return C
There is also a critical bug in your code: you haven't initialized the result matrix.
So the fixed code may look like this:
#include <algorithm>  // std::fill
#include <chrono>
#include <cstdlib>    // rand, srand
#include <iostream>
using namespace std;
int main() {
const unsigned int m = 200;
const unsigned int n = 201;
const unsigned int p = 202;
srand(static_cast<unsigned int>(
static_cast<std::chrono::duration<double> >(
std::chrono::high_resolution_clock::now().time_since_epoch())
.count()));
double** matrixa;
double** matrixb;
double** matrixc;
matrixa = new double*[m];
matrixb = new double*[n];
matrixc = new double*[m];
unsigned int max = static_cast<unsigned int>(1u << 31);
for (unsigned int i = 0; i < m; i++) matrixa[i] = new double[n];
for (unsigned int i = 0; i < n; i++) matrixb[i] = new double[p];
for (unsigned int i = 0; i < m; i++) {
matrixc[i] = new double[p];
std::fill(matrixc[i], matrixc[i] + p, 0.0);
}
for (unsigned int i = 0; i < m; i++)
for (unsigned int j = 0; j < n; j++)
matrixa[i][j] =
static_cast<double>(static_cast<double>(rand()) / max * 10);
for (unsigned int i = 0; i < n; i++)
for (unsigned int j = 0; j < p; j++)
matrixb[i][j] =
static_cast<double>(static_cast<double>(rand()) / max * 10);
auto start = std::chrono::high_resolution_clock::now();
for (unsigned int i = 0; i < m; i++)
for (unsigned int j = 0; j < p; j++)
for (unsigned int k = 0; k < n; k++)
matrixc[i][j] += matrixa[i][k] * matrixb[k][j];
auto stop = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> time_diff = stop - start;
cout << "Czas wykonania programu " << time_diff.count() << " sekund." << endl;
for (unsigned int i = 0; i < m; i++) delete[] matrixa[i];
for (unsigned int i = 0; i < n; i++) delete[] matrixb[i];
for (unsigned int i = 0; i < m; i++) delete[] matrixc[i];
delete[] matrixa;
delete[] matrixb;
delete[] matrixc;
return 0;
}
Now it is much faster than the version in the question.
It can be made faster still with a slight modification (swapping the two inner loops):
#include <algorithm>  // std::fill
#include <chrono>
#include <cstdlib>    // rand, srand
#include <iostream>
using namespace std;
int main() {
const unsigned int m = 200;
const unsigned int n = 201;
const unsigned int p = 202;
srand(static_cast<unsigned int>(
static_cast<std::chrono::duration<double> >(
std::chrono::high_resolution_clock::now().time_since_epoch())
.count()));
double** matrixa;
double** matrixb;
double** matrixc;
matrixa = new double*[m];
matrixb = new double*[n];
matrixc = new double*[m];
unsigned int max = static_cast<unsigned int>(1u << 31);
for (unsigned int i = 0; i < m; i++) matrixa[i] = new double[n];
for (unsigned int i = 0; i < n; i++) matrixb[i] = new double[p];
for (unsigned int i = 0; i < m; i++) {
matrixc[i] = new double[p];
std::fill(matrixc[i], matrixc[i] + p, 0.0);
}
for (unsigned int i = 0; i < m; i++)
for (unsigned int j = 0; j < n; j++)
matrixa[i][j] =
static_cast<double>(static_cast<double>(rand()) / max * 10);
for (unsigned int i = 0; i < n; i++)
for (unsigned int j = 0; j < p; j++)
matrixb[i][j] =
static_cast<double>(static_cast<double>(rand()) / max * 10);
auto start = std::chrono::high_resolution_clock::now();
for (unsigned int i = 0; i < m; i++)
for (unsigned int k = 0; k < n; k++)
for (unsigned int j = 0; j < p; j++)
matrixc[i][j] += matrixa[i][k] * matrixb[k][j];
auto stop = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> time_diff = stop - start;
cout << "Czas wykonania programu " << time_diff.count() << " sekund." << endl;
for (unsigned int i = 0; i < m; i++) delete[] matrixa[i];
for (unsigned int i = 0; i < n; i++) delete[] matrixb[i];
for (unsigned int i = 0; i < m; i++) delete[] matrixc[i];
delete[] matrixa;
delete[] matrixb;
delete[] matrixc;
return 0;
}
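This version is more cache-friendly: the innermost loop now runs over j, so matrixc[i][j] and matrixb[k][j] are both accessed along a contiguous row instead of striding down a column of matrixb. The explanation can be found here.
The code can be improved further by parallelizing it. With OpenMP, the previous version needs only one added line (the #pragma omp parallel for before the outer loop).
We also need to add the build option -fopenmp to compile it.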
#include <algorithm>  // std::fill
#include <chrono>
#include <cstdlib>    // rand, srand
#include <iostream>
using namespace std;
int main() {
const unsigned int m = 200;
const unsigned int n = 201;
const unsigned int p = 202;
srand(static_cast<unsigned int>(
static_cast<std::chrono::duration<double> >(
std::chrono::high_resolution_clock::now().time_since_epoch())
.count()));
double** matrixa;
double** matrixb;
double** matrixc;
matrixa = new double*[m];
matrixb = new double*[n];
matrixc = new double*[m];
unsigned int max = static_cast<unsigned int>(1u << 31);
for (unsigned int i = 0; i < m; i++) matrixa[i] = new double[n];
for (unsigned int i = 0; i < n; i++) matrixb[i] = new double[p];
for (unsigned int i = 0; i < m; i++) {
matrixc[i] = new double[p];
std::fill(matrixc[i], matrixc[i] + p, 0.0);
}
for (unsigned int i = 0; i < m; i++)
for (unsigned int j = 0; j < n; j++)
matrixa[i][j] =
static_cast<double>(static_cast<double>(rand()) / max * 10);
for (unsigned int i = 0; i < n; i++)
for (unsigned int j = 0; j < p; j++)
matrixb[i][j] =
static_cast<double>(static_cast<double>(rand()) / max * 10);
auto start = std::chrono::high_resolution_clock::now();
#pragma omp parallel for
for (unsigned int i = 0; i < m; i++)
for (unsigned int k = 0; k < n; k++)
for (unsigned int j = 0; j < p; j++)
matrixc[i][j] += matrixa[i][k] * matrixb[k][j];
auto stop = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> time_diff = stop - start;
cout << "Czas wykonania programu " << time_diff.count() << " sekund." << endl;
for (unsigned int i = 0; i < m; i++) delete[] matrixa[i];
for (unsigned int i = 0; i < n; i++) delete[] matrixb[i];
for (unsigned int i = 0; i < m; i++) delete[] matrixc[i];
delete[] matrixa;
delete[] matrixb;
delete[] matrixc;
return 0;
}
It would be better to use std::vector rather than dynamically allocated arrays; that work is left to you.
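As a starting point for that exercise, here is a minimal sketch of the same i-k-j loop order on top of std::vector. It stores each matrix row-major in a single flat vector, which is one possible layout and not the only one; the function name `multiply` and its parameter list are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns C = A * B, where A is m x n and B is n x p,
// all stored row-major in flat vectors.
std::vector<double> multiply(const std::vector<double>& a,
                             const std::vector<double>& b,
                             int m, int n, int p) {
    assert(a.size() == static_cast<std::size_t>(m) * n);
    assert(b.size() == static_cast<std::size_t>(n) * p);
    // The vector constructor zero-initializes the result, so no std::fill is needed.
    std::vector<double> c(static_cast<std::size_t>(m) * p, 0.0);

    // Each iteration of i writes only row i of c, so the rows can run in parallel.
    #pragma omp parallel for
    for (int i = 0; i < m; ++i)
        for (int k = 0; k < n; ++k) {
            const double aik = a[static_cast<std::size_t>(i) * n + k];
            for (int j = 0; j < p; ++j)
                c[static_cast<std::size_t>(i) * p + j] +=
                    aik * b[static_cast<std::size_t>(k) * p + j];
        }
    return c;
}
```

The vectors release their memory automatically, so the delete[] loops at the end of main disappear as well.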

std::bad_alloc during dijkstra calculation for big dataset

I am trying to solve the shortest-path problem for a big graph using Dijkstra's algorithm.
The problem is that when I run the program in CLion I get std::bad_alloc, always at node 491, whereas on my Ubuntu VM it core-dumps right at the beginning.
I am new to C++, so it is hard for me to understand why this happens.
Here is my code:
Utils:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <sstream>
#include <ctime>
#define INFINITY 9999999
int maxNode = 0;
using namespace std;
vector<int> loadFile(const string &path) {
vector<int> graph;
ifstream file;
file.open(path);
if (!file.fail()) {
string line;
while (getline(file, line)) {
stringstream ss(line);
for (int i; ss >> i;) {
if (i + 1 > maxNode)
maxNode = i + 1;
graph.push_back(i);
if (ss.peek() == ';')
ss.ignore();
}
}
file.close();
}
return graph;
}
int **formatGraph(vector<int> inData) {
int **graph = 0;
int currentIndex = 0;
int srcNode = inData[0];
int dstNode = inData[1];
int cost = inData[2];
graph = new int *[maxNode];
for (int i = 0; i < maxNode; i++) {
graph[i] = new int[maxNode];
for (int j = 0; j < maxNode; j++) {
if (srcNode == i && dstNode == j) {
graph[i][j] = cost;
currentIndex++;
srcNode = inData[currentIndex * 3];
dstNode = inData[currentIndex * 3 + 1];
cost = inData[currentIndex * 3 + 2];
//printf("%d %d\n", i, j);
} else
graph[i][j] = 0;
}
}
for (int i = 0; i < maxNode; i++) {
for (int j = 0; j < maxNode; j++) {
graph[j][i] = graph[i][j];
}
}
return graph;
}
Algorithm:
void dijkstra(int **G, int n, int startnode) {
printf("%d\n", startnode);
int **cost = new int *[maxNode];
int distance[maxNode], pred[maxNode];
int visited[maxNode], count, mindistance, nextnode, i, j;
for (i = 0; i < n; i++) {
cost[i] = new int[maxNode];
for (j = 0; j < n; j++)
cost[i][j] = 0;
}
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
if (G[i][j] == 0)
cost[i][j] = INFINITY;
else
cost[i][j] = G[i][j];
for (i = 0; i < n; i++) {
distance[i] = cost[startnode][i];
pred[i] = startnode;
visited[i] = 0;
}
distance[startnode] = 0;
visited[startnode] = 1;
count = 1;
while (count < n - 1) {
mindistance = INFINITY;
for (i = 0; i < n; i++) {
if (distance[i] < mindistance && !visited[i]) {
mindistance = distance[i];
nextnode = i;
}
}
visited[nextnode] = 1;
for (i = 0; i < n; i++) {
if (!visited[i]) {
if (mindistance + cost[nextnode][i] < distance[i]) {
distance[i] = mindistance + cost[nextnode][i];
pred[i] = nextnode;
}
}
}
count++;
}
delete[] cost;
for (i = 0; i < n; i++)
if (i != startnode) {
j = i;
do {
j = pred[j];
} while (j != startnode);
}
}
And here is my main function:
int main() {
vector<int> graph = loadFile("..\\data\\newFile2.csv");
int **graphConverted = formatGraph(graph);
//printMatrix(graphConverted);
clock_t begin = clock();
for (int i = 0; i < maxNode; i++)
dijkstra(graphConverted, maxNode, i);
clock_t end = clock();
double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
printf("\nTime: %f", elapsed_secs);
return 0;
}
First the data is loaded into a vector, and then it is converted to an adjacency matrix.
The data is stored in the form:
src_node;dst_node;cost
1;2;3
1;3;30
1;66;20
etc.
The dataset consists of 1004 nodes and 25571 edges.
Could you please suggest how to fix this?
In dijkstra you have dynamic memory allocations here:
int **cost = new int *[maxNode];
and here in a loop over i:
cost[i] = new int[maxNode];
You have only one call to delete[] in this function:
delete[] cost;
So all the allocations from the second new line are guaranteed to be leaked, and they are leaked again on every call to dijkstra. After a while you will run out of memory, resulting in the std::bad_alloc.
You need to match each new[] call with exactly one delete[] call.
Better still, don't use new/delete at all. Instead declare all your arrays as std::vector, which will take care of this automatically.
Also don't use variable-length arrays such as
int distance[maxNode], pred[maxNode];
They are a non-standard compiler extension. Make these std::vector as well.
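A minimal sketch of what that could look like, assuming the same conventions as the question (an adjacency matrix where 0 means "no edge", and positive edge costs). The name `dijkstra_distances` and the `kInf` sentinel are introduced here for illustration, and the predecessor tracking from the original is left out to keep it short.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Sentinel for "not reachable" (hypothetical name).
const int kInf = std::numeric_limits<int>::max();

// Hypothetical helper: O(n^2) Dijkstra on an adjacency matrix where 0 means "no edge".
// Returns the distance from `start` to every node; unreachable nodes keep kInf.
std::vector<int> dijkstra_distances(const std::vector<std::vector<int>>& graph, int start) {
    const int n = static_cast<int>(graph.size());
    assert(start >= 0 && start < n);
    std::vector<int> distance(n, kInf);   // replaces the variable-length arrays
    std::vector<bool> visited(n, false);
    distance[start] = 0;

    for (int step = 0; step < n; ++step) {
        // Pick the closest node that is reachable and not yet finalized.
        int next = -1;
        for (int i = 0; i < n; ++i)
            if (!visited[i] && distance[i] != kInf &&
                (next == -1 || distance[i] < distance[next]))
                next = i;
        if (next == -1) break;  // everything still unvisited is unreachable
        visited[next] = true;

        // Relax the edges leaving `next`; the sum is done in long long to avoid overflow.
        for (int i = 0; i < n; ++i) {
            const int w = graph[next][i];
            if (w > 0 && !visited[i]) {
                const long long candidate = static_cast<long long>(distance[next]) + w;
                if (candidate < distance[i])
                    distance[i] = static_cast<int>(candidate);
            }
        }
    }
    return distance;
}
```

Because every buffer is a std::vector, nothing is left to delete when the function returns, no matter how many times it is called from main.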

c++ threading in openmp

Hello, I'm having a hard time with this program. I'm supposed to go through the whole data vector sequentially and sum up each of the vectors in it in parallel using OpenMP (storing the sum in solution[i]). But the program gets stuck for some reason. The input vectors I'm given aren't many, but they are very large (about 2.5 million ints each). Any idea what I am doing wrong?
Here is the code. PS: ignore the unused minVectorSize parameter:
void sumsOfVectors_omp_per_vector(const vector<vector<int8_t>> &data, vector<long> &solution, unsigned long minVectorSize) {
unsigned long vectorNum = data.size();
for (int i = 0; i < vectorNum; i++) {
#pragma omp parallel
{
unsigned long sum = 0;
int thread = omp_get_thread_num();
int threadnum = omp_get_num_threads();
int begin = thread * data[i].size() / threadnum;
int end = ((thread + 1) * data[i].size() / threadnum) - 1;
for (int j = begin; j <= end; j++) {
sum += data[i][j];
}
#pragma omp critical
{
solution[i] += sum;
}
}
}
}
void sumsOfVectors_omp_per_vector(const vector<vector<int8_t>> &data, vector<long> &solution, unsigned long minVectorSize) {
unsigned long vectorNum = data.size();
for (int i = 0; i < vectorNum; i++) {
unsigned long sum = 0;
int begin = 0;
int end = data[i].size();
#pragma omp parallel for reduction(+:sum)
for (int j = begin; j < end; j++) {
sum += data[i][j];
}
solution[i] += sum;
}
}
Something like this should be more elegant and work better. Could you compile it and comment on whether it works for you?
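For anyone who wants to try the reduction approach in isolation, here is a self-contained sketch. It differs from the code above in a few deliberate ways that are assumptions on my part: the helper is called `sums_of_vectors`, it drops the unused parameter, it accumulates in a signed long (the elements are signed int8_t values), and it assigns the result rather than adding it to whatever solution[i] held before.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical standalone variant: the vectors are processed one after another,
// and the elements of each vector are summed in parallel.
void sums_of_vectors(const std::vector<std::vector<std::int8_t>>& data,
                     std::vector<long>& solution) {
    assert(solution.size() >= data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        const std::vector<std::int8_t>& row = data[i];
        const long count = static_cast<long>(row.size());
        long sum = 0;
        // Each thread keeps a private partial sum; OpenMP combines them at the end.
        #pragma omp parallel for reduction(+:sum)
        for (long j = 0; j < count; ++j)
            sum += row[j];
        solution[i] = sum;
    }
}
```

Compiled without -fopenmp the pragma is ignored and the loop runs sequentially, which gives a simple reference result to compare the parallel run against.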