OpenMP parallelise std::next_permutation - c++

I have written the function scores_single, which generates all the permutations of a board and scores each one.
I tried to follow this SO answer to parallelise the function using OpenMP, and came up with scores_parallel. The problem is that the parallel version visits the same permutations in multiple tasks.
The code follows:
#include <algorithm>
#include <iostream>
#include <omp.h>
#include <vector>
// Single threaded version
int scores_single(const int tokens, const int board_length) {
std::vector<int> scores;
// Generate boards
std::vector<char> board(board_length);
std::fill(board.begin(), board.end() - tokens, '-');
std::fill(board.end() - tokens, board.end(), 'T');
do {
// printf("Board: %s\n", std::string(board.data(), board.size()).c_str());
// Score board
auto value = 3;
scores.push_back(value);
} while (std::next_permutation(board.begin(), board.end()));
return scores.size();
}
// OpenMP parallel version
int scores_parallel(const int tokens, const int board_length) {
std::vector<std::vector<int>*> score_lists(board_length);
// Generate boards
std::vector<char> board(board_length);
std::fill(board.begin(), board.end() - tokens, '-');
std::fill(board.end() - tokens, board.end(), 'T');
printf("Starting\n");
#pragma omp parallel default(none) shared(board, board_length, score_lists)
{
#pragma omp single nowait
for (int i = 0; i < board_length; ++i) {
#pragma omp task untied
{
auto *scores = new std::vector<int>;
// Make a copy of the board for this thread
auto board_thread = board;
// Subset for this thread, see: https://stackoverflow.com/questions/30865231/parallel-code-for-next-permutation
std::rotate(board_thread.begin(), board_thread.begin() + i, board_thread.begin() + i + 1);
do {
printf("[%02d] board: %s\n", i, std::string(board_thread.data(), board_thread.size()).c_str());
// Score board
auto value = 3;
scores->push_back(value);
} while (std::next_permutation(board_thread.begin() + 1, board_thread.end()));
score_lists[i] = scores;
printf("[%02d] Finished on thread %d with %lu values\n", i, omp_get_thread_num(), scores->size());
}
}
}
std::vector<int> scores;
for (auto & list : score_lists) {
for (int j : *list) {
scores.push_back(j);
}
delete list;
}
printf("Finished, size: %lu\n", scores.size());
return scores.size();
}
int main() {
int p = scores_parallel(2, 4);
int s = scores_single(2, 4);
std::cout << p << " != " << s << std::endl;
return 0;
}
Output:
Starting
[01] board: --TT
[03] board: T--T
[03] board: T-T-
[02] board: T--T
[03] board: TT--
[01] board: -T-T
[02] board: T-T-
[00] board: --TT
[03] Finished on thread 10 with 3 values
[00] board: -T-T
[00] board: -TT-
[02] board: TT--
[00] Finished on thread 11 with 3 values
[01] board: -TT-
[02] Finished on thread 12 with 3 values
[01] Finished on thread 4 with 3 values
Finished, size: 12
12 != 6
I think I understand the SO answer I am copying, but I am not sure what I have done wrong.
6 is the expected answer, as 4C2 = 6.
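For reference, the rotate-based partitioning from the linked answer can be reproduced single-threaded. The sketch below (my helper, not the original code) fixes each distinct first element, permutes only the tail, and skips any branch whose first element repeats an earlier one, which is the same deduplication the parallel version turns out to need:

```cpp
#include <algorithm>
#include <string>

// Count the multiset permutations of a board by fixing each distinct
// first character and permuting only the tail (hypothetical helper that
// mirrors the rotate-based partitioning from the linked answer).
int count_by_partition(std::string board) {
    std::sort(board.begin(), board.end());
    int total = 0;
    for (std::size_t i = 0; i < board.size(); ++i) {
        // Only the first of a run of equal characters starts a distinct branch.
        if (i > 0 && board[i] == board[i - 1]) continue;
        std::string b = board;
        std::rotate(b.begin(), b.begin() + i, b.begin() + i + 1);
        do {
            ++total;
        } while (std::next_permutation(b.begin() + 1, b.end()));
    }
    return total;
}
```

For "--TT" this counts 6 boards, matching 4C2 = 6.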

Figured it out, I was calculating the same permutation multiple times. I fixed it with the if statement below.
#include <algorithm>
#include <iostream>
#include <omp.h>
#include <vector>
// OpenMP parallel version
int scores_parallel(const int tokens, const int board_length) {
std::vector<std::vector<int>*> score_lists(board_length);
// Generate boards
std::vector<char> board(board_length);
std::fill(board.begin(), board.end() - tokens, '-');
std::fill(board.end() - tokens, board.end(), 'T');
printf("Starting\n");
#pragma omp parallel default(none) shared(board, board_length, score_lists)
{
#pragma omp single nowait
for (int i = 0; i < board_length; ++i) {
#pragma omp task untied
{
auto *scores = new std::vector<int>;
// No need to process this branch if it will be identical to a prior branch.
// The i + 1 == board_length check keeps the comparison in bounds for the last branch.
if (i + 1 == board_length || board[i] != board[i + 1]) {
// Make a copy of the board for this thread
auto board_thread = board;
printf("[%02d] evaluating: %s\n", i, std::string(board_thread.data(), board_thread.size()).c_str());
// Subset for this thread, see: https://stackoverflow.com/questions/30865231/parallel-code-for-next-permutation
std::rotate(board_thread.begin(), board_thread.begin() + i, board_thread.begin() + i + 1);
do {
printf("[%02d] board: %s\n", i, std::string(board_thread.data(), board_thread.size()).c_str());
// Score board
auto value = 3;
scores->push_back(value);
} while (std::next_permutation(board_thread.begin() + 1, board_thread.end()));
}
score_lists[i] = scores;
printf("[%02d] Finished on thread %d with %lu values\n", i, omp_get_thread_num(), scores->size());
}
}
}
std::vector<int> scores;
for (auto & list : score_lists) {
for (int j : *list) {
scores.push_back(j);
}
delete list;
}
printf("Finished, size: %lu\n", scores.size());
return scores.size();
}
int main() {
int p = scores_parallel(2, 4);
int s = scores_single(2, 4);
std::cout << p << " == " << s << std::endl;
return p != s;
}
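As a side note, the new/delete bookkeeping can be dropped entirely by giving each task its own slot in a vector of vectors. This is my restructuring of the fixed algorithm, not the asker's code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Same fixed algorithm, but score_lists owns its vectors by value,
// so nothing leaks and no pointers are shared between tasks.
int scores_parallel_value(const int tokens, const int board_length) {
    std::vector<std::vector<int>> score_lists(board_length);
    std::vector<char> board(board_length, '-');
    std::fill(board.end() - tokens, board.end(), 'T');
    #pragma omp parallel shared(board, score_lists)
    #pragma omp single nowait
    for (int i = 0; i < board_length; ++i) {
        #pragma omp task firstprivate(i)
        {
            // Process only the last branch of a run of equal characters.
            if (i + 1 == board_length || board[i] != board[i + 1]) {
                auto b = board;
                std::rotate(b.begin(), b.begin() + i, b.begin() + i + 1);
                do {
                    score_lists[i].push_back(3); // placeholder score
                } while (std::next_permutation(b.begin() + 1, b.end()));
            }
        }
    }
    std::size_t total = 0;
    for (const auto& list : score_lists) total += list.size();
    return (int)total;
}
```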

Related

Multiple threads returning data to an array in C++

I'm using a for loop to create a given number of threads. Each of them approximates part of my integral, and I want each thread to hand its result back to an array so I can sum the parts later (if I understand correctly, I can't just do sum += in each thread because they would collide). Everything worked until I tried to collect the data from each thread, at which point I get this error:
calka.cpp:49:33: error: request for member 'get_future' in 'X', which is of non-class type 'std::promise<float>[(N + -1)]'
code:
#include <iostream> //cout
#include <thread> //thread
#include <future> //future , promise
#include <stdlib.h> //atof
#include <string> //string
#include <sstream> //stringstream
using namespace std;
// funkcja 4x^3 + (x^2)/3 - x + 3
// całka x^4 + (x^3)/9 - (x^2)/2 + 3x
void thd(float begin, float width, promise<float> & giveback)
{
float x = begin + 0.5f * width; // note: 1/2 is integer division and evaluates to 0
float height = x*x*x*x + (x*x*x)/9 - (x*x)/2 + 3*x ;
float outcome = height * width;
giveback.set_value(outcome);
stringstream ss;
ss << this_thread::get_id();
string output = "thread #id: " + ss.str() + " outcome" + to_string(outcome);
cout << output << endl;
}
int main(int argc, char* argv[])
{
int sum = 0;
float begin = atof(argv[1]);
float size = atof(argv[2]);
int N = atoi(argv[3]);
float end = begin + N*size;
promise<float> X[N-1];
thread t[N];
for(int i=0; i<N; i++){
t[i] = thread(&thd, begin, size, ref(X[i]));
begin += size;
}
future<float> wynik_ftr = X.get_future();
float wyniki[N-1];
for(int i=0; i<N; i++){
t[i].join();
wyniki[i] = wynik_ftr.get();
}
//place for loop adding outcome from threads to sum
cout << N;
return 0;
}
Don't use a VLA - promise<float> X[N-1]. It is an extension of some compilers, so your code is not portable. Use std::vector instead.
It seems you want to split the calculation of the integral across N threads. You create N-1 background threads and run one invocation of thd from the main thread. In main you join all the results serially, inside a for loop, so you don't need wyniki to be an array with one slot per thread.
Therefore a single float wyniki variable is sufficient.
Steps you have to do are:
prepare N promises
starts N-1 threads
call thd from main
join and add results from N-1 threads in for loop
join and add main thread result
Code:
std::vector<promise<float>> partialResults(N);
std::vector<thread> t(N-1);
for (int i = 0; i<N-1; i++) {
t[i] = thread(&thd, begin, size, ref(partialResults[i]));
begin += size;
}
thd(begin,size,ref(partialResults[N-1]));
float wyniki = 0.0f;
for (int i = 0; i<N-1; i++) {
t[i].join();
std::future<float> res = partialResults[i].get_future();
wyniki += res.get();
}
std::future<float> res = partialResults[N-1].get_future(); // get res from main
wyniki += res.get();
cout << wyniki << endl;
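Putting the answer's steps together, a complete sketch might look like this. The integrand f(x) = x is a stand-in of my own (the question's polynomial works the same way), and the midpoint is computed with 0.5f rather than the original 1/2, which is integer division:

```cpp
#include <future>
#include <thread>
#include <vector>

// Midpoint-rule strip: evaluate f at the strip's midpoint and hand the
// area back through a promise, as in the answer's use of thd().
void thd(float begin, float width, std::promise<float>& giveback) {
    float x = begin + 0.5f * width;
    float height = x; // stand-in integrand f(x) = x
    giveback.set_value(height * width);
}

float integrate(float begin, float width, int N) {
    std::vector<std::promise<float>> partialResults(N);
    std::vector<std::thread> t(N - 1);
    for (int i = 0; i < N - 1; ++i) {
        t[i] = std::thread(&thd, begin, width, std::ref(partialResults[i]));
        begin += width;
    }
    thd(begin, width, partialResults[N - 1]); // last strip on the main thread
    float sum = 0.0f;
    for (int i = 0; i < N - 1; ++i) {
        t[i].join();
        sum += partialResults[i].get_future().get();
    }
    sum += partialResults[N - 1].get_future().get(); // main thread's strip
    return sum;
}
```

integrate(0.0f, 0.25f, 4) approximates the integral of x over [0, 1], which is 0.5.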

What is the fastest way to read from a small (on the order of 10 elements) vector of class pointers in parallel?

I am looking for the fastest way to have multiple threads read from the same small vector of pointers. The vector is not static, but it will only ever be changed by the main thread, and only while the child threads are not reading from it.
I've tried a shared std::vector of pointers, which is somewhat faster than a shared array of pointers but still slower per thread. I suspect the threads reading so close together in memory cause false sharing, but I am unsure.
I'm hoping there is either a way around that, since the data is read-only while the threads are accessing it, or an entirely different approach that is faster. Below is a minimal example.
#include <thread>
#include <iostream>
#include <iomanip>
#include <vector>
#include <atomic>
#include <chrono>
namespace chrono=std::chrono;
class A {
public:
A(int n=1) {
a=n;
}
int a;
};
void tfunc();
int nelements=10;
int nthreads=1;
std::vector<A*> elements;
std::atomic<int> complete;
std::atomic<int> remaining;
std::atomic<int> next;
std::atomic<int> tnow;
int tend=1000000;
int main() {
complete=false;
remaining=0;
next=0;
tnow=0;
for (int i=0; i < nelements; i++) {
A* a=new A();
elements.push_back(a);
}
std::thread threads[nthreads];
for (int i=0; i < nthreads; i++) {
threads[i]=std::thread(tfunc);
}
auto begin=chrono::high_resolution_clock::now();
while (tnow < tend) {
remaining=nthreads;
next=0;
tnow += 1;
while (remaining > 0) {}
// if {elements} is changed it is changed here
}
complete=true;
for (int i=0; i < nthreads; i++) {
threads[i].join();
}
auto finish=chrono::high_resolution_clock::now();
auto elapsed=chrono::duration_cast<chrono::microseconds>(finish-begin).count();
std::cout << std::setw(2) << nthreads << " Time - " << elapsed << std::endl;
}
void tfunc() {
int sum=0;
int tpre=0;
int curr=0;
while (tnow == 0) {}
while (!complete) {
if (tnow-tpre > 0) {
tpre=tnow;
while (remaining > 0) {
curr=next++;
if (curr >= nelements) break;
for (int i=0; i < nelements; i++) {
if (i != curr) {
sum += elements[i] -> a;
}
}
remaining--;
}
}
}
}
Which for nthreads between 1 and 10 on my system outputs (the times are in microseconds)
1 Time - 281548
2 Time - 404926
3 Time - 546826
4 Time - 641898
5 Time - 714259
6 Time - 812776
7 Time - 922391
8 Time - 994909
9 Time - 1147579
10 Time - 1199838
I am wondering if there is a faster way to do this or if such a parallel operation will always be slower than serial due to the smallness of the vector.
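Regarding the false-sharing suspicion: the four atomics in the example are adjacent globals and are likely to land on the same cache line, so every tnow or remaining update invalidates the line the other threads are spinning on. One common mitigation (my sketch, assuming 64-byte cache lines) is to pad each frequently written variable onto its own line:

```cpp
// Hypothetical padded slot: alignas(64) forces each instance onto its own
// 64-byte cache line, so writes to one slot cannot invalidate a neighbour.
struct alignas(64) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");
static_assert(alignof(PaddedCounter) == 64, "counters start on line boundaries");
```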

Using OpenMP to get the index of the minimum element in parallel

I tried to write this code
float* theArray; // the array to find the minimum value
int index, i;
float thisValue, min;
index = 0;
min = theArray[0];
#pragma omp parallel for reduction(min:min_dist)
for (i=1; i<size; i++) {
thisValue = theArray[i];
if (thisValue < min)
{ /* find the min and its array index */
min = thisValue;
index = i;
}
}
return(index);
However this is not outputting correct answers. The min seems OK, but the correct index gets clobbered by other threads.
I also tried some approaches suggested on the Internet and here (a parallel for on the outer loop with a critical section for the final comparison), but these caused a slowdown rather than a speedup.
What should I do to make both the min value and its index correct? Thanks!
I don't know of an elegant way to do a minimum reduction and save an index at the same time. I do this by finding the local minimum and index for each thread and then the global minimum and index in a critical section.
index = 0;
min = theArray[0];
#pragma omp parallel
{
int index_local = index;
float min_local = min;
#pragma omp for nowait
for (i = 1; i < size; i++) {
if (theArray[i] < min_local) {
min_local = theArray[i];
index_local = i;
}
}
#pragma omp critical
{
if (min_local < min) {
min = min_local;
index = index_local;
}
}
}
With OpenMP 4.0 it's possible to use user-defined reductions. A user-defined minimum reduction can be defined like this
struct Compare { float val; size_t index; };
#pragma omp declare reduction(minimum : struct Compare : omp_out = omp_in.val < omp_out.val ? omp_in : omp_out)
Then the reduction can be done like this
struct Compare min;
min.val = theArray[0];
min.index = 0;
#pragma omp parallel for reduction(minimum:min)
for(int i = 1; i<size; i++) {
if(theArray[i]<min.val) {
min.val = theArray[i];
min.index = i;
}
}
That works for C and C++. User-defined reductions have other advantages besides simplified code. There are multiple algorithms for doing reductions; for example, the merging can be done in O(number of threads) or O(log(number of threads)) steps. The first solution I gave does this in O(number of threads), but using a user-defined reduction lets OpenMP choose the algorithm.
Basic Idea
This can be accomplished without any parallelization-breaking critical or atomic sections by creating a custom reduction. Basically, define an object that stores both the index and value, and then create a function that compares two of these objects by only the value, not the index.
Details
An object to store an index and value together:
typedef std::pair<unsigned int, float> IndexValuePair;
You can access the index by accessing the first property and the value by accessing the second property, i.e.,
IndexValuePair obj(0, 2.345);
unsigned int ix = obj.first; // 0
float val = obj.second; // 2.345
Define a function to sort two IndexValuePair objects:
IndexValuePair myMin(IndexValuePair a, IndexValuePair b){
return a.second < b.second ? a : b;
}
Then, construct a custom reduction following the guidelines in the OpenMP documentation:
#pragma omp declare reduction \
(minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
initializer(omp_priv = IndexValuePair(0, 1000))
In this case, I've chosen to initialize the index to 0 and the value to 1000. The value should be initialized to some number larger than the largest value you expect to sort.
Functional Example
Finally, combine all these pieces with the parallel for loop!
// Compile with g++ -std=c++11 -fopenmp demo.cpp
#include <iostream>
#include <utility>
#include <vector>
typedef std::pair<unsigned int, float> IndexValuePair;
IndexValuePair myMin(IndexValuePair a, IndexValuePair b){
return a.second < b.second ? a : b;
}
int main(){
std::vector<float> vals {10, 4, 6, 2, 8, 0, -1, 2, 3, 4, 4, 8};
unsigned int i;
IndexValuePair minValueIndex(0, 1000);
#pragma omp declare reduction \
(minPair:IndexValuePair:omp_out=myMin(omp_out, omp_in)) \
initializer(omp_priv = IndexValuePair(0, 1000))
#pragma omp parallel for reduction(minPair:minValueIndex)
for(i = 0; i < vals.size(); i++){
if(vals[i] < minValueIndex.second){
minValueIndex.first = i;
minValueIndex.second = vals[i];
}
}
std::cout << "minimum value = " << minValueIndex.second << std::endl; // Should be -1
std::cout << "index = " << minValueIndex.first << std::endl; // Should be 6
return EXIT_SUCCESS;
}
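One caveat about the IndexValuePair(0, 1000) initializer in the example above: if every array element is larger than 1000, the reduction returns the identity pair instead of a real element. A safer identity, my variation rather than the answer's, uses std::numeric_limits:

```cpp
#include <limits>
#include <utility>

typedef std::pair<unsigned int, float> IndexValuePair;

// Same comparator as the answer's myMin.
IndexValuePair myMin(IndexValuePair a, IndexValuePair b) {
    return a.second < b.second ? a : b;
}

// Identity element that no real array element can lose to.
#pragma omp declare reduction \
    (minPair : IndexValuePair : omp_out = myMin(omp_out, omp_in)) \
    initializer(omp_priv = IndexValuePair(0, std::numeric_limits<float>::max()))
```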
Because you're not only trying to find the minimal value (reduction(min:___)) but also to retain the index, you need to make the check critical. This can significantly slow down the loop (as reported). In general, make sure there is enough work so you don't run into overhead, as in this question. An alternative is to have each thread find its own minimum and index, save them to per-thread variables, and have the master thread do a final pass over those, as in the following program.
#include <iostream>
#include <vector>
#include <ctime>
#include <random>
#include <limits>
#include <omp.h>
using std::cout;
using std::vector;
void initializeVector(vector<double>& v)
{
std::mt19937 generator(time(NULL));
std::uniform_real_distribution<double> dis(0.0, 1.0);
v.resize(100000000);
for(int i = 0; i < v.size(); i++)
{
v[i] = dis(generator);
}
}
int main()
{
vector<double> vec;
initializeVector(vec);
double minVal = vec[0];
int minInd = 0;
int startTime = clock();
for(int i = 1; i < vec.size(); i++)
{
if(vec[i] < minVal)
{
minVal = vec[i];
minInd = i;
}
}
int elapsedTime1 = clock() - startTime;
// Change the number of threads accordingly
vector<double> threadRes(4, std::numeric_limits<double>::max());
vector<int> threadInd(4);
startTime = clock();
#pragma omp parallel for
for(int i = 0; i < vec.size(); i++)
{
{
if(vec[i] < threadRes[omp_get_thread_num()])
{
threadRes[omp_get_thread_num()] = vec[i];
threadInd[omp_get_thread_num()] = i;
}
}
}
double minVal2 = threadRes[0];
int minInd2 = threadInd[0];
for(int i = 1; i < threadRes.size(); i++)
{
if(threadRes[i] < minVal2)
{
minVal2 = threadRes[i];
minInd2 = threadInd[i];
}
}
int elapsedTime2 = clock() - startTime;
cout << "Min " << minVal << " at " << minInd << " took " << elapsedTime1 << std::endl;
cout << "Min " << minVal2 << " at " << minInd2 << " took " << elapsedTime2 << std::endl;
}
Please note that with optimizations on and nothing else to be done in the loop, the serial version seems to remain king. With optimizations turned off, OMP gains the upper hand.
P.S. you wrote reduction(min:min_dist) and then proceeded to use min instead of min_dist.
Actually, we can use the omp critical directive so that only one thread at a time runs the code inside the critical region. That way the index value won't be destroyed by other threads.
About the omp critical directive:
The omp critical directive identifies a section of code that must be executed by a single thread at a time.
This code solves your issue:
#include <stdio.h>
#include <omp.h>
int main() {
int i;
int arr[10] = {11,42,53,64,55,46,47, 68, 59, 510};
int index;
float min;
int size = 10;
index = 0;
min = arr[0];
#pragma omp parallel for
for (i=1; i<size; i++) {
float thisValue = arr[i]; // declared inside the loop so each thread has its own copy
#pragma omp critical
if (thisValue < min)
{ /* find the min and its array index */
min = thisValue;
index = i;
}
}
printf("min:%f index:%d", min, index);
return 0;
}

openmp parallel sections benchmark

I'm trying to benchmark my implementation of merge sort using openmp. I have written the following code.
#include <algorithm>
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
#include <omp.h>
using namespace std;
class Sorter {
private:
int* data;
int size;
bool isSorted;
public:
Sorter(int* data, int size){
this->data = data;
this->size = size;
this->isSorted = false;
}
void sort(){
vector<int> v(data,data+size);
vector<int> ans = merge_sort(v);
copy(ans.begin(),ans.end(),data);
isSorted = true;
}
vector<int> merge_sort(vector<int>& vec){
if(vec.size() == 1){
return vec;
}
std::vector<int>::iterator middle = vec.begin() + (vec.size() / 2);
vector<int> left(vec.begin(), middle);
vector<int> right(middle, vec.end());
#pragma omp parallel sections
{
#pragma omp section
{left = merge_sort(left);}
#pragma omp section
{right = merge_sort(right);}
}
return merge(vec,left, right);
}
vector<int> merge(vector<int> &vec,const vector<int>& left, const vector<int>& right){
vector<int> result;
unsigned left_it = 0, right_it = 0;
while(left_it < left.size() && right_it < right.size()) {
if(left[left_it] < right[right_it]){
result.push_back(left[left_it]);
left_it++;
}else{
result.push_back(right[right_it]);
right_it++;
}
}
while(left_it < left.size()){
result.push_back(left[left_it]);
left_it++;
}
while(right_it < right.size()){
result.push_back(right[right_it]);
right_it++;
}
return result;
}
int* getSortedData(){
if(!isSorted){
sort();
}
return data;
}
};
void printArray(int* array, int size){
for(int i=0;i<size;i++){
cout<<array[i]<<", ";
}
cout<<endl;
}
bool isSorted(int* array, int size){
for(int i=0;i<size-1;i++){
if(array[i] > array[i+1]) {
cout<<array[i]<<" > "<<array[i+1]<<endl;
return false;
}
}
return true;
}
int main(int argc, char** argv){
if(argc<3){
cout<<"Specify size and threads"<<endl;
return -1;
}
int size = atoi(argv[1]);
int threads = atoi(argv[2]);
//omp_set_nested(1);
omp_set_num_threads(threads);
cout<<"Merge Sort of "<<size<<" with "<<omp_get_max_threads()<<endl;
int *array = new int[size];
srand(time(NULL));
for(int i=0;i<size;i++){
array[i] = rand() % 100;
}
//printArray(array,size);
Sorter* s = new Sorter(array, size);
cout<<"Starting sort"<<endl;
double start = omp_get_wtime();
s->sort();
double stop = omp_get_wtime();
cout<<"Time: "<<stop-start<<endl;
int* array2 = s->getSortedData();
if(size<=10)
printArray(array2,size);
cout<<"Array sorted: "<<(isSorted(array2,size)?"yes":"no")<<endl;
return 0;
}
The program runs correctly, but when I specify the number of threads to be, say, 4, the program still creates only 2 threads. I tried using omp_set_nested(1) before omp_set_num_threads(threads), but that hangs the whole terminal until the program crashes with "libgomp: Thread creation failed: Resource temporarily unavailable" - I think because too many threads are created? I haven't found a workaround yet.
Edit:
After the program crashes, I check the system load and it shows the load to be over 1000!
I have a 4-core AMD A8 CPU and 10GB RAM
If I uncomment omp_set_nested(1) and run the program
$ ./mergeSort 10000000 4
Merge Sort of 10000000 with 4
Starting sort
libgomp: Thread creation failed: Resource temporarily unavailable
libgomp: Thread creation failed: Resource temporarily unavailable
$ uptime
02:14:12 up 1 day, 11:13, 4 users, load average: 482.21, 522.87, 338.75
Watching the processes, I can spot 4 threads being launched. If I comment out the omp_set_nested(1) the program runs normally but only uses 2 threads
Edit:
If I use tasks and remove omp_set_nested, then the threads are launched correctly, but there is no speedup: execution with 1 thread is faster than with 4. With sections it does speed up, but only by a factor of less than two (as it launches only 2 threads at a time).
I tested your code and it did create 4 or more threads, so I am not sure what you meant exactly. Also, I suggest you change omp section to omp task: by definition only 1 thread handles a given section, and in your recursive call you would never utilize your idle threads.
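A task-based version of the recursion could look like the sketch below. The 10000-element cutoff and the fallback to std::sort are my choices, not the asker's; without a cutoff, every tiny sub-range spawns tasks and the overhead dominates:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Recurse with OpenMP tasks; sort small ranges serially so task-creation
// overhead does not swamp the actual work. Sorts v[lo, hi) in place.
void merge_sort_tasks(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < 10000) { // cutoff: below this, tasks cost more than they save
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    #pragma omp task shared(v)
    merge_sort_tasks(v, lo, mid);
    #pragma omp task shared(v)
    merge_sort_tasks(v, mid, hi);
    #pragma omp taskwait
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}

// Typical invocation: one thread starts the recursion, the team runs the tasks.
// #pragma omp parallel
// #pragma omp single
// merge_sort_tasks(data, 0, data.size());
```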

Reversing a string using threads

Recently, I was asked in an interview to implement a string reverse function using threads. I came up with most of the solution below (whether I got selected or not is a different story :-)). I tried to run the solution on my home PC running Windows 8 Consumer Preview. The compiler is VC11 Beta.
The question is: the multi-threaded code is always either as fast as or 1 millisecond slower than the sequential code. The input I gave is a text file of size 32.4 MB. Is there a way to make the multi-threaded code faster? Or is the input too small to make any difference?
EDIT
I only wrote void Reverse(char* str, int beg, int end, int rbegin, int rend); and
void CustomReverse(char* str); methods in the interview. All the other code is written at home.
template<typename Function>
void TimeIt(Function&& fun, const char* caption)
{
clock_t start = clock();
fun();
clock_t ticks = clock()-start;
std::cout << std::setw(30) << caption << ": " << (double)ticks/CLOCKS_PER_SEC << "\n";
}
void Reverse(char* str)
{
assert(str != NULL);
for ( int i = 0, j = strlen(str) - 1; i < j; ++i, --j)
{
if ( str[i] != str[j])
{
std::swap(str[i], str[j]);
}
}
}
void Reverse(char* str, int beg, int end, int rbegin, int rend)
{
for ( ; beg <= end && rbegin >= rend; ++beg, --rbegin)
{
if ( str[beg] != str[rbegin])
{
char temp = str[beg];
str[beg] = str[rbegin];
str[rbegin] = temp;
}
}
}
void CustomReverse(char* str)
{
int len = strlen(str);
const int MAX_THREADS = std::thread::hardware_concurrency();
std::vector<std::thread> threads;
threads.reserve(MAX_THREADS);
const int CHUNK = len / MAX_THREADS > (4096) ? (4096) : len / MAX_THREADS;
/*std::cout << "len:" << len << "\n";
std::cout << "MAX_THREADS:" << MAX_THREADS << "\n";
std::cout << "CHUNK:" << CHUNK << "\n";*/
for ( int i = 0, j = len - 1; i < j; )
{
if (i + CHUNK < j && j - CHUNK > i )
{
for ( int k = 0; k < MAX_THREADS && (i + CHUNK < j && j - CHUNK > i ); ++k)
{
threads.push_back( std::thread([=, &str]() { Reverse(str, i,
i + CHUNK, j, j - CHUNK); }));
i += CHUNK + 1;
j -= CHUNK + 1;
}
for ( auto& th : threads)
{
th.join();
}
threads.clear();
}
else
{
char temp = str[i];
str[i] = str[j];
str[j] = temp;
i++;
j--;
}
}
}
void Write(std::ostream&& os, const std::string& str)
{
os << str << "\n";
}
void CustomReverseDemo(int argc, char** argv)
{
std::ifstream inpfile;
for ( int i = 0; i < argc; ++i)
std::cout << argv[i] << "\n";
inpfile.open(argv[1], std::ios::in);
std::ostringstream oss;
std::string line;
if (! inpfile.is_open())
{
return;
}
while (std::getline(inpfile, line))
{
oss << line;
}
std::string seq(oss.str());
std::string par(oss.str());
std::cout << "Reversing now\n";
TimeIt( [&] { CustomReverse(&par[0]); }, "Using parallel code\n");
TimeIt( [&] { Reverse(&seq[0]) ;}, "Using Sequential Code\n");
TimeIt( [&] { Reverse(&seq[0]) ;}, "Using Sequential Code\n");
TimeIt( [&] { CustomReverse(&par[0]); }, "Using parallel code\n");
Write(std::ofstream("sequential.txt"), seq);
Write(std::ofstream("Parallel.txt"), par);
}
int main(int argc, char* argv[])
{
CustomReverseDemo(argc, argv);
}
I found the code hard to comprehend, but I have found the following problems:
Your block size of 4096 is far too small to be worth a thread. Starting a thread might be about as costly as the actual operation.
You are fork-joining a lot (once every CHUNK * MAX_THREADS characters). This introduces a lot of unneeded join points (sequential parts) and overhead.
Partition the string statically into MAX_THREADS chunks and start MAX_THREADS threads. There are more efficient ways to do it but at least this will give you some speedup.
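The static partition suggested above can be sketched roughly as follows; the chunk sizing and the lambda are mine, not the asker's:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Reverse s in place: each thread swaps one contiguous slice of the front
// half with its mirror slice in the back half. Threads are started once
// and joined once, with no repeated fork-join.
void parallel_reverse(std::string& s, unsigned nthreads) {
    const std::size_t half = s.size() / 2;
    nthreads = std::max(1u, nthreads);
    const std::size_t chunk = (half + nthreads - 1) / nthreads;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t lo = t * chunk;
        const std::size_t hi = std::min(half, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([&s, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                std::swap(s[i], s[s.size() - 1 - i]);
        });
    }
    for (auto& th : pool) th.join();
}
```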
I tried to write the program with same functionality:
My effort of "Reversing a string using threads"
I have tested it on a 2-core processor with VC11 Beta and MinGW (gcc 4.8) on Windows 7
Testing results:
VC11 Beta:
7 Mb file:
Debug
Simple reverse: 0.468
Async reverse : 0.275
Release
Simple reverse: 0.006
Async reverse : 0.014
98 Mb file:
Debug
Simple reverse: 5.982
Async reverse : 3.091
Release
Simple reverse: 0.063
Async reverse : 0.079
782 Mb file
Release
Simple reverse: 0.567
Async reverse : 0.689
Mingw:
782 Mb file
Release
Simple reverse: 0.583
Async reverse : 0.566
As you can see, the multi-threaded code wins only in debug builds. But in release builds the compiler optimizes and uses all cores even in the single-threaded case.
So trust your compiler =)
While you are using all the new threading features, you aren't using the good old parts of the standard library, like std::string and iterators.
You shouldn't write the threading stuff yourself but instead use a parallel algorithms library which offers something like a parallel_for construct.
Your task can be simplified to this:
std::string str;
// fill string
auto worker = [&] (iter begin, iter end) {
for(auto it = begin; it != end; ++it) {
std::swap(*it, *(std::end(str) - std::distance(std::begin(str), it) - 1));
}
};
parallel_for(std::begin(str),
std::begin(str) + std::distance(std::begin(str), std::end(str)) / 2, worker);
Note that you need quite a big text file to gain a speedup from this parallel approach. 34 MB might not be enough.
On small strings, effects like false sharing can have a negative impact on your performance.
Limiting the chunk size to 4096 does not make any sense.
Init once and then synchronize at the end should always be the pattern for parallel operations (think map/reduce)
Smaller things:
Checking whether the chars are identical before swapping is very bad for any kind of pipeline optimization. Just do the swap().
In the parallel and sequential versions you use different code for the swap. Why?
Starting with 300 MB string size I'm seeing that multi-threaded version (TBB-based, see below) performs on average 3 times better than the serial version. Have to admit that for this 3x speedup it uses 12 real-hw cores :). I experimented a little with grain sizes (you can specify those in TBB for the blocked_range class object), but this did not make any significant impact, default auto_partitioner seems to be able to partition the data almost optimally. The code I used:
tbb::parallel_for(tbb::blocked_range<size_t>(0, (int)str.length()/2), [&] (const tbb::blocked_range<size_t>& r) {
const size_t r_end = r.end();
for(size_t i = r.begin(); i < r_end; ++i) {
std::swap(*(std::begin(str) + i), *(std::end(str) - 1 - i));
}
});
Tested code
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
#include <string.h>
#include <stdio.h>
#include <memory.h>
#include <stdlib.h>
void strrev(char *p, char *q, int num)
{
for(int i=0;i < num ; ++i,--q, ++p)
*q = *p;
}
int main(int argc, char **argv)
{
char *str;
if(argc>1)
{
str = argv[1];
printf("String to be reversed %s\n", str);
}
else
{
return 0;
}
int length = strlen(str);
int N = 5;
char *rev_str = (char *)malloc(length+1);
rev_str[length] = '\0';
if (N>length)
{
N = length;
}
std::vector<std::thread> threads;
int begin=0, end=length-1, k = length/N;
for(int i=1; i <= N; ++i)
{
threads.emplace_back(strrev, &str[begin], &rev_str[end], k);
//strrev(&str[begin], &rev_str[end], k);
begin += k;
end -= k;
}
while (true)
{
if (end < 0 && begin > length-1)
{
break;
}
rev_str[end] = str[begin];
--end; ++begin;
}
for (auto& i: threads)
{
i.join();
}
printf("String after reversal %s\n", rev_str);
return 0;
}