OpenMP parallel sections benchmark - C++

I'm trying to benchmark my implementation of merge sort using OpenMP. I have written the following code:
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
#include <omp.h>
using namespace std;

class Sorter {
private:
    int* data;
    int size;
    bool isSorted;
public:
    Sorter(int* data, int size) {
        this->data = data;
        this->size = size;
        this->isSorted = false;
    }
    void sort() {
        vector<int> v(data, data + size);
        vector<int> ans = merge_sort(v);
        copy(ans.begin(), ans.end(), data);
        isSorted = true;
    }
    vector<int> merge_sort(vector<int>& vec) {
        if (vec.size() <= 1) {   // <= 1 also guards against an empty input
            return vec;
        }
        vector<int>::iterator middle = vec.begin() + (vec.size() / 2);
        vector<int> left(vec.begin(), middle);
        vector<int> right(middle, vec.end());
        // Sort the two halves in parallel; each section is handled by one thread.
        #pragma omp parallel sections
        {
            #pragma omp section
            { left = merge_sort(left); }
            #pragma omp section
            { right = merge_sort(right); }
        }
        return merge(vec, left, right);
    }
    vector<int> merge(vector<int>& vec, const vector<int>& left, const vector<int>& right) {
        vector<int> result;
        unsigned left_it = 0, right_it = 0;
        while (left_it < left.size() && right_it < right.size()) {
            if (left[left_it] < right[right_it]) {
                result.push_back(left[left_it]);
                left_it++;
            } else {
                result.push_back(right[right_it]);
                right_it++;
            }
        }
        // Copy whatever remains of either half.
        while (left_it < left.size()) {
            result.push_back(left[left_it]);
            left_it++;
        }
        while (right_it < right.size()) {
            result.push_back(right[right_it]);
            right_it++;
        }
        return result;
    }
    int* getSortedData() {
        if (!isSorted) {
            sort();
        }
        return data;
    }
};

void printArray(int* array, int size) {
    for (int i = 0; i < size; i++) {
        cout << array[i] << ", ";
    }
    cout << endl;
}

bool isSorted(int* array, int size) {
    for (int i = 0; i < size - 1; i++) {
        if (array[i] > array[i + 1]) {
            cout << array[i] << " > " << array[i + 1] << endl;
            return false;
        }
    }
    return true;
}

int main(int argc, char** argv) {
    if (argc < 3) {
        cout << "Specify size and threads" << endl;
        return -1;
    }
    int size = atoi(argv[1]);
    int threads = atoi(argv[2]);
    //omp_set_nested(1);
    omp_set_num_threads(threads);
    cout << "Merge Sort of " << size << " with " << omp_get_max_threads() << endl;
    int* array = new int[size];
    srand(time(NULL));
    for (int i = 0; i < size; i++) {
        array[i] = rand() % 100;
    }
    //printArray(array,size);
    Sorter* s = new Sorter(array, size);
    cout << "Starting sort" << endl;
    double start = omp_get_wtime();
    s->sort();
    double stop = omp_get_wtime();
    cout << "Time: " << stop - start << endl;
    int* array2 = s->getSortedData();
    if (size <= 10)
        printArray(array2, size);
    cout << "Array sorted: " << (isSorted(array2, size) ? "yes" : "no") << endl;
    delete s;
    delete[] array;
    return 0;
}
The program runs correctly, but when I specify the number of threads to be, say, 4, the program still creates only 2 threads. I tried calling omp_set_nested(1) before omp_set_num_threads(threads), but that hangs the whole terminal until the program crashes with "libgomp: Thread creation failed: Resource temporarily unavailable", I think because too many threads are created? I haven't found a workaround yet.
Edit:
After the program crashes, I check the system load and it shows the load to be over 1000!
I have a 4-core AMD A8 CPU and 10GB RAM
If I uncomment omp_set_nested(1) and run the program
$ ./mergeSort 10000000 4
Merge Sort of 10000000 with 4
Starting sort
libgomp: Thread creation failed: Resource temporarily unavailable
libgomp: Thread creation failed: Resource temporarily unavailable
$ uptime
02:14:12 up 1 day, 11:13, 4 users, load average: 482.21, 522.87, 338.75
Watching the processes, I can spot 4 threads being launched. If I comment out omp_set_nested(1), the program runs normally but only uses 2 threads.
Edit:
If I use tasks and remove omp_set_nested(1), it launches the threads correctly, but it doesn't speed up; execution with 1 thread becomes faster than with 4. With sections it does speed up, but only by a factor of less than two (as it launches only 2 threads at a time).

I tested your code and it did create 4 or more threads, so I didn't get what you meant exactly. I also suggest you change omp section to omp task: by definition, only one thread handles a given section, so with only two sections your recursive call would never utilize the idle threads.
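For illustration, here is a minimal sketch of what the task-based recursion could look like. The cutoff constant, its value, and the parallel/single driver are my additions, not code from the question; spawning a task for every tiny subarray costs more than it saves, which would also explain your tasks being slower than serial, so recursing sequentially below some threshold is the usual remedy:

const size_t CUTOFF = 10000;  // tuning parameter, not from the original code

vector<int> merge_sort(vector<int>& vec) {
    if (vec.size() <= 1) return vec;
    vector<int>::iterator middle = vec.begin() + vec.size() / 2;
    vector<int> left(vec.begin(), middle);
    vector<int> right(middle, vec.end());
    if (vec.size() < CUTOFF) {
        // Small subarray: task overhead would dominate, recurse sequentially.
        left = merge_sort(left);
        right = merge_sort(right);
    } else {
        #pragma omp task shared(left)      // locals are firstprivate by default
        left = merge_sort(left);
        #pragma omp task shared(right)
        right = merge_sort(right);
        #pragma omp taskwait               // both halves must finish before merging
    }
    return merge(vec, left, right);        // merge() as defined in the question
}

void sort() {
    vector<int> v(data, data + size);
    vector<int> ans;
    #pragma omp parallel        // create the thread team once, at the top level
    #pragma omp single          // one thread starts the recursion; the rest run tasks
    ans = merge_sort(v);
    copy(ans.begin(), ans.end(), data);
    isSorted = true;
}

With tasks there is no nested parallelism involved, so omp_set_nested(1) is not needed and all the requested threads can participate.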

Related

Why is my C++14 Kosaraju algorithm getting TLE when similarly written code runs much faster?

The TLE code completes in 2.1 seconds. I'm also passing many things by reference, but it's still getting TLE. Why does this code take so much time?
Here is the problem on HackerEarth:
https://www.hackerearth.com/problem/algorithm/falling-dominos-49b1ed46/
Dominos are lots of fun. Children like to stand the tiles on their side in long lines. When one domino falls, it knocks down the next one, which knocks down the one after that, all the way down the line. However, sometimes a domino fails to knock the next one down. In that case, we have to knock it down by hand to get the dominos falling again. Your task is to determine, given the layout of some domino tiles, the minimum number of dominos that must be knocked down by hand in order for all of the dominos to fall.
Input
The first line of input contains one integer specifying the number of test cases to follow. Each test case begins with a line containing two integers, each no larger than 100 000. The first integer n is the number of domino tiles and the second integer m is the number of lines to follow in the test case. The domino tiles are numbered from 1 to n. Each of the following lines contains two integers x and y indicating that if domino number x falls, it will cause domino number y to fall as well.
Output
For each test case, output a line containing one integer, the minimum number of dominos that must be knocked over by hand in order for all the dominos to fall.
SAMPLE INPUT
1
3 2
1 2
2 3
SAMPLE OUTPUT
1
The TLE code completes in 2.1 seconds:
#include <iostream>
#include <vector>
#include <unordered_set>
#include <stack>
using namespace std;

// First pass: DFS that pushes vertices onto a stack in order of finish time.
void dfs(const vector<vector<int>>& edges, unordered_set<int>& visited, int sv, stack<int>& stk) {
    visited.insert(sv);
    for (size_t i = 0; i < edges[sv].size(); i++) {
        int current = edges[sv][i];
        if (visited.find(current) == visited.end())
            dfs(edges, visited, current, stk);
    }
    stk.push(sv);
}

// Second pass: plain DFS used to mark everything reachable from a vertex.
void dfs(const vector<vector<int>>& edges, unordered_set<int>& visited, int sv) {
    visited.insert(sv);
    for (size_t i = 0; i < edges[sv].size(); i++) {
        int current = edges[sv][i];
        if (visited.find(current) == visited.end())
            dfs(edges, visited, current);
    }
}

int main()
{
    int t;
    cin >> t;
    while (t--) {
        int V, E;
        cin >> V >> E;
        vector<vector<int>> edges(V + 1);
        unordered_set<int> visited;
        stack<int> stk;
        while (E--) {
            int f, s;
            cin >> f >> s;
            edges[f].push_back(s);
        }
        for (int i = 1; i <= V; i++)
            if (visited.find(i) == visited.end())
                dfs(edges, visited, i, stk);
        visited.clear();
        int count{0};
        // Process vertices in decreasing finish time; each still-unvisited one
        // is a domino that must be knocked over by hand.
        while (!stk.empty()) {
            int current = stk.top();
            stk.pop();
            if (visited.find(current) == visited.end()) {
                dfs(edges, visited, current);
                count++;
            }
        }
        cout << count << endl;
    }
    return 0;
}
The efficient code completes in 0.7 seconds:
#include <iostream>
#include <bits/stdc++.h>
using namespace std;

void dfs(vector<int>* edges, int start, int n, bool* visit, stack<int>* nodex)
{
    visit[start] = true;
    for (int i = 0; i < edges[start].size(); ++i)
    {
        int next = edges[start][i];
        if (visit[next] == false)
            dfs(edges, next, n, visit, nodex);
    }
    nodex->push(start);
}

void dfs(vector<int>* edges, int start, bool* visit, int n)
{
    visit[start] = true;
    for (int i = 0; i < edges[start].size(); i++)
    {
        int next = edges[start][i];
        if (visit[next] == false)
            dfs(edges, next, visit, n);
    }
}

int main()
{
    int t;
    cin >> t;
    while (t--)
    {
        int n, m;
        cin >> n >> m;
        vector<int>* edges = new vector<int>[n + 1];
        for (int i = 0; i < m; ++i)
        {
            int start, end;
            cin >> start >> end;
            edges[start].push_back(end);
        }
        bool* visit = new bool[n + 1];
        for (int i = 0; i <= n; ++i)
        {
            visit[i] = false;
        }
        stack<int>* nodex = new stack<int>();
        for (int i = 1; i <= n; ++i)
        {
            if (visit[i] == false)
                dfs(edges, i, n, visit, nodex);
        }
        for (int i = 0; i <= n; i++)
            visit[i] = false;
        int count = 0;
        while (!nodex->empty())
        {
            int up = nodex->top();
            nodex->pop();
            if (visit[up] == false)
            {
                dfs(edges, up, visit, n);
                count++;
            }
        }
        cout << count << endl;
    }
    return 0;
}
Use fast I/O and "\n" in place of endl; this helps a lot in getting rid of TLE. The rest of the code seems fine to me.
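For reference, this is what the usual fast-I/O setup looks like; it is a generic competitive-programming idiom, not code taken from either submission:

#include <iostream>
using namespace std;

int main() {
    // Detach the C++ streams from C stdio and untie cin from cout;
    // together these remove most of the iostream overhead.
    ios_base::sync_with_stdio(false);
    cin.tie(nullptr);

    int t;
    cin >> t;
    // ... read and solve the test cases as before ...
    // Write "\n" instead of endl so the stream is not flushed on every line.
    cout << t << "\n";
    return 0;
}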

What is the fastest way to read from a small (on the order of 10 elements) vector of class pointers in parallel?

I am looking for the fastest way to have multiple threads read from the same small vector of pointers (one which is not static, but which will only ever be changed by the main thread, and only when the child threads are not reading from it).
I've tried a shared std::vector of pointers, which is somewhat faster than a shared array of pointers but still slower per thread than serial. I suspect the reason is that the threads read so close together in memory that false sharing occurs, but I am unsure.
I'm hoping there is either a way around that, since the data is read-only while the threads are accessing it, or an entirely different approach that is faster. Below is a minimal example:
#include <thread>
#include <iostream>
#include <iomanip>
#include <vector>
#include <atomic>
#include <chrono>
namespace chrono = std::chrono;

class A {
public:
    A(int n = 1) {
        a = n;
    }
    int a;
};

void tfunc();

int nelements = 10;
int nthreads = 1;
std::vector<A*> elements;
std::atomic<int> complete;
std::atomic<int> remaining;
std::atomic<int> next;
std::atomic<int> tnow;
int tend = 1000000;

int main() {
    complete = false;
    remaining = 0;
    next = 0;
    tnow = 0;
    for (int i = 0; i < nelements; i++) {
        A* a = new A();
        elements.push_back(a);
    }
    // vector instead of a variable-length array; nthreads is not a constant
    std::vector<std::thread> threads(nthreads);
    for (int i = 0; i < nthreads; i++) {
        threads[i] = std::thread(tfunc);
    }
    auto begin = chrono::high_resolution_clock::now();
    while (tnow < tend) {
        remaining = nthreads;
        next = 0;
        tnow += 1;
        // Spin until every worker has claimed and finished its share.
        while (remaining > 0) {}
        // if {elements} is changed it is changed here
    }
    complete = true;
    for (int i = 0; i < nthreads; i++) {
        threads[i].join();
    }
    auto finish = chrono::high_resolution_clock::now();
    auto elapsed = chrono::duration_cast<chrono::microseconds>(finish - begin).count();
    std::cout << std::setw(2) << nthreads << " Time - " << elapsed << std::endl;
}

void tfunc() {
    int sum = 0;
    int tpre = 0;
    int curr = 0;
    while (tnow == 0) {}
    while (!complete) {
        if (tnow - tpre > 0) {  // a new timestep has been published
            tpre = tnow;
            while (remaining > 0) {
                curr = next++;  // claim the next work item
                if (curr > nelements) break;
                // Sum every element except the one just claimed.
                for (int i = 0; i < nelements; i++) {
                    if (i != curr) {
                        sum += elements[i]->a;
                    }
                }
                remaining--;
            }
        }
    }
}
For nthreads between 1 and 10, this outputs on my system (times in microseconds):
1 Time - 281548
2 Time - 404926
3 Time - 546826
4 Time - 641898
5 Time - 714259
6 Time - 812776
7 Time - 922391
8 Time - 994909
9 Time - 1147579
10 Time - 1199838
I am wondering if there is a faster way to do this or if such a parallel operation will always be slower than serial due to the smallness of the vector.
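One thing worth ruling out (a guess on my part, not something established above) is false sharing between the atomics themselves: remaining, next and tnow are declared side by side and very likely share a cache line, so every next++ also invalidates the line the other counters live on. A minimal sketch of cache-line padding, assuming the common 64-byte line size:

#include <atomic>

// Wrap each counter so it occupies its own cache line; updates to one
// then no longer invalidate the line holding the others.
struct alignas(64) PaddedAtomic {
    std::atomic<int> value{0};
};

PaddedAtomic complete, remaining, next, tnow;  // used as complete.value, etc.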

Suspected memory leak with _beginthread

So, I suspect I have a memory leak; to test, I wrote this small program. With the following line commented out:
printf("Calc index: %d\n", ArrLength);
the code runs well. But when I uncomment it, the program crashes after a couple thousand threads. When I use try/catch, the program still crashes inside the try block. Can anyone help me out here?
#include "stdafx.h"
#include <process.h>
#include <iostream>
#include <mutex>
#include <windows.h>
using namespace std;
typedef struct {
int StartNode;
int EndNode;
int GangID;
int MemberID;
int ArrLength;
int arr[10000];
}t;
t *arg;
mutex m;
void myFunc(void *param) {
m.lock();
printf("Calculate thread started\n");
t *args = (t*)param;
int StartNode = args->StartNode;
int EndNode = args->EndNode;
int GangID = args->GangID;
int MemberID = args->MemberID;
int ArrLength = args->ArrLength;
printf("Calc index: %d\n", ArrLength);
free(args);
m.unlock();
}
int main()
{
for (int i = 0; i < 1000000; i++)
{
HANDLE handle;
arg = (t *)malloc(sizeof(t));
arg->StartNode = 2;
arg->EndNode = 1;
arg->GangID = 1;
arg->MemberID = 1;
arg->ArrLength = 5;
for (int j = 0; j < 10000; j++)
{
arg->arr[j] = j;
}
handle = (HANDLE)_beginthread(myFunc, 0, (void*)arg);
}
cin.get();
return 0;
}
Well, let's do some math. Your t struct is 40,020 bytes per instance. You allocate it 1M times, which comes to about 40 GB in total. And that is not all the memory required, because threads are not free either: by default, Windows allocates a 1 MB stack per thread, which gives you 1 TB (one terabyte) of memory just to let the threads live.
So the total memory required is something like 1040 GB. Do you really intend that?
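If the goal is just many short-lived jobs, one way to keep the footprint bounded is to launch the threads in batches and wait for each batch before starting the next. A sketch under those assumptions (using _beginthreadex so the returned handle can be waited on; the batch size of 64 is the WaitForMultipleObjects limit):

#include <process.h>
#include <windows.h>
#include <cstdlib>

// Minimal stand-in for the question's struct (fields trimmed for brevity).
typedef struct { int ArrLength; int arr[10000]; } t;

unsigned __stdcall myFunc(void* param) {
    t* args = (t*)param;
    // ... work with args as in the question ...
    free(args);
    return 0;
}

int main() {
    const int BATCH = 64;  // WaitForMultipleObjects waits on at most 64 handles
    HANDLE handles[BATCH];
    for (int b = 0; b < 1000000 / BATCH; b++) {
        for (int i = 0; i < BATCH; i++) {
            t* arg = (t*)malloc(sizeof(t));
            // ... fill in arg's fields as in the question ...
            handles[i] = (HANDLE)_beginthreadex(NULL, 0, myFunc, arg, 0, NULL);
        }
        // Block until the whole batch has finished, then release the handles,
        // so at most BATCH threads and BATCH structs are alive at any moment.
        WaitForMultipleObjects(BATCH, handles, TRUE, INFINITE);
        for (int i = 0; i < BATCH; i++) CloseHandle(handles[i]);
    }
    return 0;
}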

Threads failing to affect performance

Below is a small program meant to parallelize the approximation of the sum of the 1/n^2 series. Note the global parameter NUM_THREADS.
My issue is that increasing the number of threads from 1 to 4 (my computer has 4 processors) does not significantly affect the timing results. Do you see a logical flaw in ThreadFunction? Is there false sharing or misplaced blocking that ends up serializing the execution?
#include <iostream>
#include <thread>
#include <vector>
#include <mutex>
#include <string>
#include <future>
#include <chrono>

std::mutex sum_mutex;          // This mutex guards the sum vector
std::vector<double> sum_vec;   // This is the sum vector
int NUM_THREADS = 1;
int UPPER_BD = 1000000;

/* Thread function: sums 1/(l[i]*l[i]) over its slice of the list. */
void ThreadFunction(std::vector<double>& l, int beg, int end, int thread_num)
{
    double sum = 0;
    for (int i = beg; i < end; i++) sum += (1 / (l[i] * l[i]));
    std::unique_lock<std::mutex> lock1(sum_mutex, std::defer_lock);
    lock1.lock();
    sum_vec.push_back(sum);
    lock1.unlock();
}

void ListFill(std::vector<double>& l, int z)
{
    for (int i = 0; i < z; ++i) l.push_back(i);
}

int main()
{
    std::vector<double> l;
    std::vector<std::thread> thread_vec;
    ListFill(l, UPPER_BD);
    int len = l.size();
    int lower_bd = 1;   // start at 1 to avoid dividing by l[0] == 0
    int increment = (UPPER_BD - lower_bd) / NUM_THREADS;
    for (int j = 0; j < NUM_THREADS; ++j)
    {
        thread_vec.push_back(std::thread(ThreadFunction, std::ref(l), lower_bd, lower_bd + increment, j));
        lower_bd += increment;
    }
    for (auto& t : thread_vec) t.join();
    double big_sum = 0;   // must be initialized before accumulating
    for (double z : sum_vec) big_sum += z;
    std::cout << big_sum << std::endl;
    return 0;
}
From looking at your code, I suspect that ListFill is taking longer than ThreadFunction. Why pass a list of values to the thread instead of the bounds each thread should loop over? Something like:
void ThreadFunction(int beg, int end) {
    double sum = 0.0;
    for (double i = beg; i < end; i++)
        sum += (1.0 / (i * i));
    std::unique_lock<std::mutex> lock1(sum_mutex);
    sum_vec.push_back(sum);
}
To maximize parallelism, you need to push as much work as possible onto the threads; see Amdahl's law.
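As a rough worked example of Amdahl's law: if a fraction p of the run is parallelizable over N threads, the best possible speedup is 1 / ((1 - p) + p/N). So if ListFill keeps, say, half of the runtime serial (the 0.5 here is an illustrative guess, not a measurement), four threads can give at most 1 / (0.5 + 0.5/4) = 1.6x, which would match timings that barely move.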
In addition to dohashi's nice improvement, you can remove the need for the mutex by pre-sizing sum_vec in the main thread:
sum_vec.resize(NUM_THREADS);
then writing directly to it in ThreadFunction:
sum_vec[thread_num] = sum;
Since each thread writes to a distinct element and doesn't modify the vector itself, there is no need to lock anything.
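Putting both suggestions together, the revised pieces might look like this sketch (the bodies are adapted from the answers above; loop bounds are passed directly and each thread owns one pre-sized slot):

std::vector<double> sum_vec;   // resized to NUM_THREADS in main before launch

void ThreadFunction(int beg, int end, int thread_num) {
    double sum = 0.0;
    for (double i = beg; i < end; i++)
        sum += 1.0 / (i * i);
    sum_vec[thread_num] = sum;  // distinct element per thread, so no mutex
}

// In main:
// sum_vec.resize(NUM_THREADS);
// thread_vec.push_back(std::thread(ThreadFunction, lower_bd, lower_bd + increment, j));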

OpenMP vs C++11 threads

In the following example, the C++11 threads take about 50 seconds to execute, but the OpenMP threads take only 5 seconds. Any ideas why? (I can assure you it still holds true if you are doing real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16-core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>
using namespace std;

void doNothing() {}

int run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();
    for (int j = 1; j < 100000; ++j)
    {
        if (algorithmToRun == 1)
        {
            vector<thread> threads;
            for (int i = 0; i < 16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for (auto& thread : threads) thread.join();
        }
        else if (algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for (unsigned i = 0; i < 16; i++)
            {
                doNothing();
            }
        }
    }
    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;
    return elapsed_seconds.count();
}

int main()
{
    int cppt = run(1);
    int ompt = run(2);
    cout << cppt << endl;
    cout << ompt << endl;
    return 0;
}
OpenMP uses thread pools for its pragmas (also here and here). Spinning up and tearing down threads is expensive; OpenMP avoids this overhead, so all it's doing is the actual work plus the minimal shared-memory shuttling of the execution state. In your C++11 threads code, you are spinning up and tearing down a new set of 16 threads every iteration.
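To see the effect, you can make the C++11 side reuse its threads too: create the 16 threads once and let each one run the whole loop internally, which mimics what the OpenMP runtime's pool does. A sketch (my restructuring of the benchmark, not code from the question):

#include <thread>
#include <vector>

void doNothing() {}

void runWithPersistentThreads() {
    std::vector<std::thread> threads;
    // Spin up the 16 threads exactly once...
    for (int i = 0; i < 16; i++) {
        threads.push_back(std::thread([] {
            // ...and keep each one busy for all 100000 iterations.
            for (int j = 1; j < 100000; ++j)
                doNothing();
        }));
    }
    for (auto& t : threads) t.join();
}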
I tried the code from Choosing the right threading framework with 100 loop iterations, and it took 0.0727 ms with OpenMP, 0.6759 ms with Intel TBB, and 0.5962 ms with the C++ thread library.
I also applied what AruisDante suggested:
void nested_loop(int max_i, int band)
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}
This code seems to take less time than your original C++11 thread section.