Negligible Performance Boost from pthread C++ - c++

I've been using Mac OS gcc 4.2.1 and Eclipse to write a program that sorts numbers using a simple merge sort. I've tested the sort extensively and I know it works. I thought, maybe somewhat naively, that because of the way the algorithm divides up the list, I could simply have a thread sort one half and the main thread sort the other half, and it would take half the time. Unfortunately, it doesn't seem to be working.
Here's the main code:
float x = clock(); //timing
int half = (int)size/2; // size is the length of the list
status = pthread_create(thready,NULL,voidSort,(void *)datay); //start the thread sorting
sortStep(testArray,tempList,half,0,half); //sort using the main thread
int join = pthread_join(*thready,&someptr); //wait for the thread to finish
mergeStep(testArray,tempList,0,half,half-1); //merge the two sublists
if (status != 0) { std::cout << "Could not create thread.\nError: " << status << "\n"; }
if (join != 0) { std::cout << "Could not join thread.\nError: " << join << "\n"; }
float y = clock() - x; //timing
sortStep is the main sorting function, mergeStep is used to merge two sublists within one array (it uses a placeholder array to switch the numbers around), and voidSort is a function I use to pass a struct containing all the arguments for sortStep to the thread. I feel like maybe the main thread is waiting until the new thread is done, but I'm not sure how to overcome that. I'm extremely, unimaginably grateful for any and all help. Thank you in advance!
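For reference, here is a hypothetical reconstruction of the voidSort wrapper described above; the struct and field names are assumptions guessed from the sortStep call, not the asker's actual code:
struct SortArgs {
    int* array;    // the array to sort
    int* tempList; // scratch space for merging
    int length;    // number of elements in this half
    int start;     // first index of this half
    int end;       // one past the last index
};
void* voidSort(void* args) // pthread start routine: unpack the struct and forward
{
    SortArgs* a = static_cast<SortArgs*>(args);
    sortStep(a->array, a->tempList, a->length, a->start, a->end);
    return NULL;
}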
EDIT:
Here's the merge step
void mergeStep (int *array, int *tempList, int start, int lengthOne, int lengthTwo) //the merge step of a merge sort
{
    int i = start;
    int j = i + lengthOne;
    int k = 0; // index for the entire templist
    while (k < lengthOne + lengthTwo)
    {
        if (i - start == lengthOne)
        { //list one exhausted
            for (int n = 0; n + j < lengthTwo + lengthOne + start; n++) //add the rest
            {
                tempList[k++] = array[j + n];
            }
            break;
        }
        if (j - (lengthOne + lengthTwo) - start == 0)
        { //list two exhausted
            for (int n = i; n < start + lengthOne; n++) //add the rest
            {
                tempList[k++] = array[n];
            }
            break;
        }
        if (array[i] > array[j]) // figure out which variable should go first
        {
            tempList[k] = array[j++];
        }
        else
        {
            tempList[k] = array[i++];
        }
        k++;
    }
    for (int s = 0; s < lengthOne + lengthTwo; s++) // add the templist into the original
    {
        array[start + s] = tempList[s];
    }
}
-Will

The overhead of creating threads is quite large, so unless you have a large amount (to be determined) of data to sort, you're better off sorting it in the main thread.
The mergeStep also counts against the part of the code that can't be parallelized; remember Amdahl's law: with a parallelizable fraction p on n cores, the best possible speedup is 1 / ((1 - p) + p/n), so the serial merge alone caps your gain below 2x.
If you don't have a coarsening step as the last part of your sortStep, then once you get below 8-16 elements much of your time will be eaten by function-call overhead. The coarsening step should be done with a simpler sort: insertion sort or a sorting network.
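A minimal sketch of such a coarsening step; the parameter names and the sortStep integration are assumptions, since that code isn't shown:
const int CUTOFF = 16;
void insertionSort(int* array, int start, int end) // sorts array[start, end)
{
    for (int i = start + 1; i < end; ++i) {
        int key = array[i];
        int j = i - 1;
        while (j >= start && array[j] > key) { // shift larger elements right
            array[j + 1] = array[j];
            --j;
        }
        array[j + 1] = key;
    }
}
// inside sortStep, before recursing further:
// if (length <= CUTOFF) { insertionSort(array, start, start + length); return; }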
Unless you have a large enough data set, the actual timing could drown in measurement uncertainty.
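One more thing worth double-checking, since only the clock() calls are shown: clock() measures CPU time, which on POSIX systems accumulates across all threads of the process, so a perfectly parallelized sort can appear to take just as long as the serial one. Wall-clock time is the fairer yardstick; a small helper using gettimeofday (available on Mac OS):
#include <sys/time.h>

double wall_ms() // wall-clock time in milliseconds
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// double t0 = wall_ms();
// ... run the threaded sort ...
// double elapsed = wall_ms() - t0;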

Related

Implementation of a lock free vector

After several searches, I cannot find a lock-free vector implementation.
There is a paper that discusses it, but nothing concrete (at least I have not found anything): http://pirkelbauer.com/papers/opodis06.pdf
There are currently 2 threads dealing with the arrays; there may be more later.
One thread updates the different vectors, and another thread accesses the vectors to do calculations, etc. Each thread accesses the different arrays a large number of times per second.
I implemented a lock with a mutex on the different vectors, but when the reading or writing thread takes too long to unlock, all further updates are delayed.
I then thought of copying the array each time to go faster, but copying an array of thousands of elements thousands of times per second doesn't seem great to me.
So I thought of using 1 mutex per value in each array, to lock only the value I am working on.
A lock-free structure might be better, but I cannot find a solution, and I wonder whether the performance would really be better.
EDIT:
I have a thread that receives data and stores it in vectors.
When I instantiate the structure, I use a fixed size.
I have to do 2 different things for the updates:
-Update vector elements. (1d vector which simulates a 2d vector)
-Add a line at the end of the vector and remove the first line. The array always remains sorted. Adding elements is much much rarer than updating
The thread that is read-only walks the array and performs calculations.
To limit the time spent on the array and do as little calculation as possible, I use arrays that store the results of my calculations. Despite this, I often have to scan the array to do new calculations or just to update them. (The application is real-time, so the calculations to be made vary according to the requests.)
When a new element is added to the vector, the reading thread must directly use it to update the calculations.
When I say calculation, it is not necessarily only arithmetic; it is more general processing.
There is no perfect implementation for concurrency; each task has its own "good enough". My go-to method for finding a decent implementation is to allow only what is needed, and then check whether I would need something more in the future.
You described a quite simple scenario: one thread, one action on a shared vector, and the vector needs to tell whether the action is allowed, so std::atomic_flag is good enough.
This example should give you an idea of how it works and what to expect. Mainly, I just attached a flag to each array and check it first to see if it is safe to do something; some people like to add a guard around the flag, just in case.
#include <iostream>
#include <thread>
#include <atomic>
#include <chrono>
#include <cstdint>

const int vector_size = 1024;

struct Element {
    void some_yield(){
        std::this_thread::yield();
    }
    void some_wait(){
        std::this_thread::sleep_for(
            std::chrono::microseconds(1)
        );
    }
};

Element ** data;
std::atomic_flag * vector_safe;
std::atomic<bool> alive {true}; // shared stop flag, so it must be atomic
uint64_t c_down_time = 0;
uint64_t p_down_time = 0;
uint64_t c_iterations = 0;
uint64_t p_iterations = 0;
std::chrono::high_resolution_clock::time_point c_time_point;
std::chrono::high_resolution_clock::time_point p_time_point;

int simple_consumer_work(){
    Element a_read;
    uint16_t i, e;
    while (alive){
        // loop through the vectors
        for (i = 0; i < vector_size; i++){
            // spin until the vector at index i is free to read;
            // test_and_set returns the previous value, so 'true'
            // means somebody else still holds it
            while (vector_safe[i].test_and_set()){}
            // read every element
            for (e = 0; e < vector_size; e++){
                a_read = data[i][e];
            }
            // and signal that this vector is done
            vector_safe[i].clear();
        }
    }
    return 0;
}

int simple_producer_work(){
    uint16_t i;
    while (alive){
        for (i = 0; i < vector_size; i++){
            while (vector_safe[i].test_and_set()){}
            data[i][i].some_wait();
            vector_safe[i].clear();
        }
        p_iterations++;
    }
    return 0;
}

int consumer_work(){
    Element a_read;
    uint16_t i, e;
    bool waiting;
    while (alive){
        for (i = 0; i < vector_size; i++){
            waiting = false;
            c_time_point = std::chrono::high_resolution_clock::now();
            while (vector_safe[i].test_and_set(std::memory_order_acquire)){
                waiting = true;
            }
            if (waiting){
                c_down_time += (uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>
                    (std::chrono::high_resolution_clock::now() - c_time_point).count();
            }
            for (e = 0; e < vector_size; e++){
                a_read = data[i][e];
            }
            vector_safe[i].clear(std::memory_order_release);
        }
        c_iterations++;
    }
    return 0;
}

int producer_work(){
    bool waiting;
    uint16_t i;
    while (alive){
        for (i = 0; i < vector_size; i++){
            waiting = false;
            p_time_point = std::chrono::high_resolution_clock::now();
            while (vector_safe[i].test_and_set(std::memory_order_acquire)){
                waiting = true;
            }
            if (waiting){
                p_down_time += (uint64_t)std::chrono::duration_cast<std::chrono::nanoseconds>
                    (std::chrono::high_resolution_clock::now() - p_time_point).count();
            }
            data[i][i].some_wait();
            vector_safe[i].clear(std::memory_order_release);
        }
        p_iterations++;
    }
    return 0;
}

void print_time(uint64_t down_time){
    if (down_time <= 1000) {
        std::cout << down_time << " [nanoseconds] \n";
    } else if (down_time <= 1000000) {
        std::cout << down_time / 1000 << " [microseconds] \n";
    } else if (down_time <= 1000000000) {
        std::cout << down_time / 1000000 << " [milliseconds] \n";
    } else {
        std::cout << down_time / 1000000000 << " [seconds] \n";
    }
}

int main(){
    std::uint16_t i;
    std::thread consumer;
    std::thread producer;
    vector_safe = new std::atomic_flag [vector_size] {ATOMIC_FLAG_INIT};
    data = new Element * [vector_size];
    for(i = 0; i < vector_size; i++){
        data[i] = new Element[vector_size]; // each row holds vector_size elements
    }
    consumer = std::thread(consumer_work);
    producer = std::thread(producer_work);
    std::this_thread::sleep_for(
        std::chrono::seconds(10)
    );
    alive = false;
    producer.join();
    consumer.join();
    std::cout << " Consumer loops > " << c_iterations << std::endl;
    std::cout << " Consumer time lost > "; print_time(c_down_time);
    std::cout << " Producer loops > " << p_iterations << std::endl;
    std::cout << " Producer time lost > "; print_time(p_down_time);
    for(i = 0; i < vector_size; i++){
        delete [] data[i];
    }
    delete [] vector_safe;
    delete [] data;
    return 0;
}
And don't forget that the compiler can and will reorder portions of the code; spaghetti code is really, really buggy in multithreading.

Multi-source BFS multithreading

I have a graph represented with an adjacency matrix arr, and a vector source for the multiple starting vertices.
My idea is to split the source vector into "equal" pieces depending on the number of threads (if it doesn't split equally, I add the remainder to the last piece) and create threads that run this function. bool used[] is a global array.
I am trying to get (I think it's called) "linear" scaling. I assume the number of starting vertices is at least equal to the number of threads.
If I use a mutex to synchronise the threads, it is very inefficient.
And if I don't, some vertices get traversed more than once.
Question: is there a data structure that would let me remove the mutex, or another way to implement this algorithm?
mutex m;
void msBFS(bool** arr, int n, vector<int> s, atomic<bool>* used) //s is a different
                                                                 //piece of the original source
{
    queue<int> que;
    for(auto i = 0; i < s.size(); ++i)
    {
        que.push(s[i]);
        used[s[i]] = true;
    }
    while (!que.empty())
    {
        int curr = que.front();
        que.pop();
        cout << curr << " ";
        for (auto i = 0; i < n; ++i)
        {
            lock_guard<mutex> guard(m);
            if (arr[curr][i] == 1 && !used[i] && curr != i)
            {
                que.push(i);
                used[i] = true;
            }
        }
    }
}
With the atomic<bool> I think you are almost there. The only piece missing is the atomic exchange operation, which lets you read-modify-write as a single atomic step. For an atomic bool there is usually hardware that supports it directly.
void msBFS(bool** arr, int n, vector<int> s, atomic<bool>* used) //s is a different
                                                                 //piece of the original source
{
    //used[i] initialized to 'false' for all i
    queue<int> que;
    for(auto i = 0; i < s.size(); ++i)
    {
        que.push(s[i]);
        //we don't change used just yet!
    }
    while (!que.empty())
    {
        int curr = que.front();
        que.pop();
        bool alreadyUsed = used[curr].exchange(true);
        if(alreadyUsed) {
            continue; //some other thread already processing it
        }
        cout << curr << " ";
        for (auto i = 0; i < n; ++i) {
            if (arr[curr][i] == 1 && !used[i] && curr != i) {
                que.push(i);
            }
        }
    }
}
Note, there is one logical change: the used[i] is set to true when a thread starts processing the node, and not when it is added to the queue.
At the first attempt to process a node, when used[i] is set to true, alreadyUsed will hold the previous value (false), indicating that no one else started processing the node earlier. At subsequent attempts, alreadyUsed will already be true and the processing will be skipped.
The above approach is not ideal: it is possible for a node to be added many times to a queue before it is processed. Depending on the shape of your graph it may or may not be a problem.
If this is a problem - I would suggest using a three-value used state: not_visited, queued and processed.
static constexpr int not_visited = 0;
static constexpr int queued = 1;
static constexpr int processed = 2;
Now, every time we try to push onto the que, and every time we try to process a node, we advance its state accordingly. These advancements need to be performed atomically, via compare_exchange_strong, so that each transition can happen exactly once. The call to compare_exchange_strong returns true if it succeeded (i.e. the previously contained value actually matched the expected one).
void msBFS(bool** arr, int n, vector<int> s, atomic<int>* used) //s is a different
                                                                //piece of the original source
{
    //used[i] initialized to '0' for all i
    queue<int> que;
    for(auto i = 0; i < s.size(); ++i) {
        int expected = not_visited;
        //we check it even here, because one thread may be seriously lagging
        //behind the others, which have already entered the while loop
        if(used[s[i]].compare_exchange_strong(expected, queued)) {
            que.push(s[i]);
        }
    }
    while (!que.empty())
    {
        int curr = que.front();
        que.pop();
        int expected = queued;
        if(!used[curr].compare_exchange_strong(expected, processed)) {
            continue;
        }
        cout << curr << " ";
        for (auto i = 0; i < n; ++i) {
            if (arr[curr][i] == 1 && curr != i) {
                int expected = not_visited;
                if(used[i].compare_exchange_strong(expected, queued)) {
                    que.push(i);
                }
            }
        }
    }
}
Check the performance. There are many atomic operations, but those are generally cheaper than a mutex. Internally, a mutex also performs atomic operations similar to these, but in addition it may completely block a thread. The code I have shown never blocks (a thread is never put on halt); all synchronization is done through the atomic variables only.
Edit: some possible optimisations for the second approach:
I realized that if the transition not_visited -> queued is guaranteed to occur exactly once, the other transition does not even have to be performed, because the node is then present exactly once in exactly one queue anyway. So you may save a few atomic operations and use bool again. Since that transition is rare, though, I don't think it will have much of an impact.
When iterating over the neighbors there is an if statement, if (arr[curr][i] == 1 && curr != i), which does not check whether the neighbor was already visited; that is checked only later, through the atomic exchange. However, you may want to see if checking used[i] within that if helps anyway: if you find early that i is already queued or processed, you skip the branch and the no-longer-needed atomic compare-and-swap.
If you want to squeeze every tick from your processor, consider using bitfields instead of bools for your adjacency matrix and the used array. The iteration over neighbors and the conditions can then be evaluated with bitwise operations, 32 bits/neighbors at once. There is even std::atomic<uint32_t>::fetch_or to aid you with updating 32 used flags at once; see the sketch below.
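A rough sketch of that bitmask idea, for the two-state (bool-like) variant; it assumes both the adjacency rows and the used flags are packed into uint32_t words, and __builtin_ctz is GCC/Clang-specific:
#include <atomic>
#include <cstdint>
#include <queue>

void push_neighbours(const uint32_t* adj_row, std::atomic<uint32_t>* used_bits,
                     int n_words, std::queue<int>& que)
{
    for (int w = 0; w < n_words; ++w) {
        // neighbours in this word that are not marked used yet (approximate view)
        uint32_t candidates = adj_row[w] & ~used_bits[w].load(std::memory_order_relaxed);
        if (candidates == 0) continue;
        // atomically mark them; 'prev' tells us which bits this thread actually won
        uint32_t prev = used_bits[w].fetch_or(candidates, std::memory_order_acq_rel);
        uint32_t won = candidates & ~prev;
        while (won) {
            int bit = __builtin_ctz(won); // index of the lowest set bit
            que.push(w * 32 + bit);
            won &= won - 1;               // clear the lowest set bit
        }
    }
}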
Edit2: a possible optimisation for the first approach:
Each thread can hold its own localUsed array, which is checked and set to true when pushing to the queue (similar to what you did in your original code). This array is local to the thread, so no atomics or mutexes are involved. With this simple check you have a guarantee that a given node appears in the queue of each thread at most once; so, at most, a node will appear N times overall, where N is the number of threads.
I think this is a compromise worth considering between scalability and memory footprint, and it may perform better than the second approach.
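A sketch of this per-thread filter combined with the first approach (untested; the processing step is just the cout from the question):
#include <atomic>
#include <iostream>
#include <queue>
#include <vector>

void msBFS_local(bool** arr, int n, const std::vector<int>& s,
                 std::atomic<bool>* used)
{
    std::vector<char> localUsed(n, 0); // thread-private, no synchronization needed
    std::queue<int> que;
    for (int v : s) {
        if (!localUsed[v]) { localUsed[v] = 1; que.push(v); }
    }
    while (!que.empty()) {
        int curr = que.front();
        que.pop();
        if (used[curr].exchange(true)) continue; // another thread claimed it
        std::cout << curr << " ";
        for (int i = 0; i < n; ++i) {
            if (arr[curr][i] == 1 && curr != i && !localUsed[i]) {
                localUsed[i] = 1; // at most one enqueue per thread per node
                que.push(i);
            }
        }
    }
}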

Parallel implementation of Lisp-style mapping of a function to a list in C++ fails without cout after use of thread

This code works only when any of the lines under /* debug messages */ are uncommented, or if the list being mapped over has fewer than 30 elements.
func_map is a linear implementation of Lisp-style mapping and can be assumed to work.
It would be used as follows: func_map(FUNC_PTR foo, std::vector<int>* list, locs* start_and_end)
FUNC_PTR is a pointer to a function that returns void and takes an int pointer
For example: &foo, in which foo is defined as:
void foo (int* num){ (*num) = (*num) * (*num);}
locs is a struct with two members, int_start and int_end; I use it to tell func_map which elements it should iterate over.
void par_map(FUNC_PTR func_transform, std::vector<int>* vector_array) //function for mapping a function to a list a la lisp
{
    int array_size = (*vector_array).size(); //retain the number of elements in our vector
    int num_threads = std::thread::hardware_concurrency(); //figure out number of cores
    int array_sub = array_size/num_threads; //number that we use to figure out how many elements should be assigned per thread
    std::vector<std::thread> threads; //the vector that we will initialize threads in
    std::vector<locs> vector_locs; // the vector that we will store the start and end position for each thread
    for(int i = 0; i < num_threads && i < array_size; i++)
    {
        locs cur_loc; //the locs struct that we will create using the power of LOGIC
        if(array_sub == 0) //the LOGIC
        {
            cur_loc.int_start = i; //if the number of elements in the array is less than the number of cores just assign one core to each element
        }
        else
        {
            cur_loc.int_start = (i * array_sub); //otherwise figure out the starting point given the number of cores
        }
        if(i == (num_threads - 1))
        {
            cur_loc.int_end = array_size; //make sure all elements will be iterated over
        }
        else if(array_sub == 0)
        {
            cur_loc.int_end = (i + 1); //ditto
        }
        else
        {
            cur_loc.int_end = ((i+1) * array_sub); //otherwise use the number of threads to determine our ending point
        }
        vector_locs.push_back(cur_loc); //store the created locs struct so it doesnt get changed during reference
        threads.push_back(std::thread(func_map,
                                      func_transform,
                                      vector_array,
                                      (&vector_locs[i]))); //create a thread
        /*debug messages*/ // <--- whenever any of these are uncommented the code works
        //cout << "i = " << i << endl;
        //cout << "int_start == " << cur_loc.int_start << endl;
        //cout << "int_end == " << cur_loc.int_end << endl << endl;
        //cout << "Thread " << i << " initialized" << endl;
    }
    for(int i = 0; i < num_threads && i < array_size; i++)
    {
        (threads[i]).join(); //make sure all the threads are done
    }
}
I think that the issue might be in how vector_locs[i] is used and how threads are resolved. But the use of a vector to maintain the state of the locs instance referenced by thread should prevent that from being an issue; I'm really stumped.
You're giving the thread function a pointer, &vector_locs[i], that may become invalidated as you push_back more items into the vector.
Since you know beforehand how many items vector_locs will contain - min(num_threads, array_size) - you can reserve that space in advance to prevent reallocation.
As to why it doesn't crash if you uncomment the output, I would guess that the output is so slow that the thread you just started will finish before the output is done, so the next iteration can't affect it.
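A minimal sketch of that fix, one added line before the loop (std::min needs <algorithm>):
// reserve the final size up front so push_back never reallocates and the
// &vector_locs[i] pointers handed to the threads stay valid
vector_locs.reserve(std::min(num_threads, array_size));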
I think you should move this loop inside the main one:
...
for(int i = 0; i < num_threads && i < array_size; i++)
{
    (threads[i]).join(); //make sure all the threads are done
}
}

Need advice on improving my code: Search Algorithm

I'm pretty new at C++ and need some advice on this.
Here I have some code that I wrote to count the number of times an arbitrary integer x occurs in an array and to output the comparisons made.
However, I've read that by using multi-way branching ("Divide and conquer!") techniques, I could make the algorithm run faster.
Could anyone point me in the right direction on how I should go about doing it?
Here is my working code for the method I used:
#include <iostream>
#include <cstdlib>
#include <vector>
using namespace std;

vector <int> integers;
int function(int vectorsize, int count);
int x;
double input;

int main()
{
    cout<<"Enter 20 integers"<<endl;
    cout<<"Type 0.5 to end"<<endl;
    while(true)
    {
        cin>>input;
        if (input == 0.5)
            break;
        integers.push_back(input); // the double is truncated to int here
    }
    cout<<"Enter the integer x"<<endl;
    cin>>x;
    function((integers.size()-1),0);
    system("pause");
}

int function(int vectorsize, int count)
{
    if(vectorsize<0) //termination condition
    {
        cout<<"The number of times "<< x <<" appears is "<<count<<endl;
        return 0;
    }
    if (integers[vectorsize] > x)
    {
        cout<< integers[vectorsize] << " > " << x <<endl;
    }
    if (integers[vectorsize] < x)
    {
        cout<< integers[vectorsize] << " < " << x <<endl;
    }
    if (integers[vectorsize] == x)
    {
        cout<< integers[vectorsize] << " = " << x <<endl;
        count = count+1;
    }
    return (function(vectorsize-1,count));
}
Thanks!
If the array is unsorted, just use a single loop to compare each element to x. Unless there's something you're forgetting to tell us, I don't see any need for anything more complicated.
If the array is sorted, there are algorithms (e.g. binary search) that would have better asymptotic complexity. However, for a 20-element array a simple linear search should still be the preferred strategy.
If your array is sorted, you can use a divide-and-conquer strategy:
Efficient way to count occurrences of a key in a sorted array
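For a sorted vector, a short sketch of that strategy using std::equal_range (two binary searches, O(log n) total):
#include <algorithm>
#include <vector>

int count_occurrences(const std::vector<int>& sorted, int x)
{
    // equal_range returns iterators to the first and one-past-the-last
    // occurrence of x; their distance is the count
    auto range = std::equal_range(sorted.begin(), sorted.end(), x);
    return static_cast<int>(range.second - range.first);
}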
A divide-and-conquer algorithm is only beneficial if you can either eliminate some work with it, or parallelize the divided work parts across several computation units. In your case, the first option is possible with an already-sorted dataset; other answers have addressed it.
For the second option, the algorithm's name is map-reduce: it splits the dataset into several subsets, distributes the subsets to as many threads or processes, and gathers the results to "compile" them (the term is actually "reduce") into a meaningful result. In your setting, each thread scans its own slice of the array to count the items, and returns its result to the "reduce" thread, which adds them up to produce the final result. This solution is only interesting for large datasets, though.
There are questions dealing with mapreduce and C++ on SO, but I'll try to give you a sample implementation here:
#include <utility>
#include <vector>
#include <thread>
#include <boost/thread/barrier.hpp>

constexpr int MAP_COUNT = 4;

int mresults[MAP_COUNT];
boost::barrier endmap(MAP_COUNT + 1);

void mfunction(int start, int end, int rank){
    int count = 0;
    for (int i = start; i < end; i++)
        if (integers[i] == x) count++;
    mresults[rank] = count;
    endmap.wait();
}

int rfunction(){
    int count = 0;
    for (int i : mresults) {
        count += i;
    }
    return count;
}

int mapreduce(){
    std::vector<std::thread> mthreads; // threads are movable, not copyable
    int range = integers.size() / MAP_COUNT;
    for (int i = 0; i < MAP_COUNT; i++)
        // the last slice runs to the end, in case the size is not a multiple of MAP_COUNT
        mthreads.push_back(std::thread(mfunction, i * range,
            (i == MAP_COUNT - 1) ? (int)integers.size() : (i + 1) * range, i));
    endmap.wait();
    for (auto & t : mthreads) t.join(); // threads must be joined before destruction
    return rfunction();
}
Once the integers vector has been populated, you call the mapreduce function defined above, which should return the expected result. As you can see, the implementation is very specialized:
the map and reduce functions are specific to your problem,
the number of threads used for map is static,
I followed your style and used global variables,
for convenience, I used a boost::barrier for synchronization
However this should give you an idea of the algorithm, and how you could apply it to similar problems.
caveat: code untested.

How can I find bicomponents in a graph? (called blocks)

Here are my attempts and copy-pastings. But what must I write to find the biconnected components (called blocks)?
#include <fstream>
#include <vector>
#include <algorithm>
using namespace std;

ifstream cin ("test3.txt");
ofstream cout ("output.txt");

const int l = 6;
int G[l][l];
int MAXN;
int used[l];
int number[l], low[l], counter = 1, kids = 0;
vector <int> block[l];

void BiComp(int curr, int prev) {
    int kids = 0;
    low[curr] = number[curr] = counter++;
    used[curr] = 1;
    for(int i = 0; i < MAXN; i++) {
        if(G[curr][i] == 1) {
            if (i != prev) {
                if (used[i] == 0) {
                    kids++;
                    block[0].push_back(curr);
                    block[0].push_back(i);
                    BiComp(i, curr);
                    low[curr] = min(low[curr], low[i]);
                    if(low[i] >= number[curr] && (prev != -1 || kids >= 2)) {
                        cout << "tochka " << curr + 1 << endl; // "tochka" is Russian for "cut point"
                    }
                } else {
                    block[0].push_back(i);
                    block[0].push_back(prev);
                    cout << block << endl; // note: this prints a pointer, not the vector contents
                    low[curr] = min(low[curr], number[i]);
                }
            }
        }
    }
}

int main()
{
    MAXN = 6;
    for (int i = 0; i < MAXN; i++)
    {
        for (int j = 0; j < MAXN; j++)
        {
            cin >> G[i][j];
            cout << G[i][j] << " ";
        }
        cout << endl;
    }
    //for (int i = 0; i < MAXN; i++) {
    //    if (number[i] == 0) {
    BiComp(0, -1);
    //    }
    //}
}
How can I, with this code, find the blocks at the same time as the cut points?
In graph theory, a biconnected component (or 2-connected component) is a maximal biconnected subgraph.
OK, what comes to my mind is a very brute-force approach that isn't going to scale well, but I also remember reading that finding biconnected components is a computationally hard problem, so let's just start with it and then see if there are optimizations to be done.
Given a set of N nodes, check for each possible subset of nodes whether they form a biconnected component. Typically, you'll want the biggest component available, so just start with the whole graph, then with all subgraphs of N-1 nodes, N-2, and so on. As soon as you find one solution, you'll know you have found one of the biggest possible size and you can quit. Still, you'll end up checking 2^N subgraphs in the worst case. So start with a loop constructing the graphs to be tested.
To find out if a given graph with K nodes is a biconnected component, loop over all K*(K-1)/2 pairs of nodes and find out if there are two independent paths between them.
In order to find out if two nodes i and j are biconnected, first find all paths between them. For each path, find out if there is an alternative connection to that path. If you find one, you're done for that pair. If not, you've found proof that the graph you're looking at is not biconnected and you can break from all loops but the outer one and test the next graph.
In order to see if there is an alternative connection between i and j, take out all edges you used in the first path, and see if you can find another one. If you can, you're fine with i and j. If you can't, continue with the next path in the initial list of paths you found. If you reach the end of your list of paths without finding one for which an alternative exists when taking out the involved edges, the two nodes are not biconnected and hence the whole graph isn't.
There is a linear-time algorithm for finding all cut points (also called cut vertices or articulation points) in a given graph, using depth-first search.
Once you have found all the cut points, it's easy to find all of the bicomponents.
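A compact sketch of that DFS (Hopcroft-Tarjan style), written against the question's globals G, number, low, counter (untested; number[i] == 0 means unvisited):
#include <algorithm>

bool isCut[l];

void findCutPoints(int curr, int prev)
{
    low[curr] = number[curr] = counter++;
    int kids = 0;
    for (int i = 0; i < MAXN; ++i) {
        if (G[curr][i] != 1 || i == curr) continue;
        if (number[i] == 0) {                 // tree edge
            ++kids;
            findCutPoints(i, curr);
            low[curr] = std::min(low[curr], low[i]);
            // non-root cut point: no back edge from the subtree climbs above curr
            if (prev != -1 && low[i] >= number[curr]) isCut[curr] = true;
        } else if (i != prev) {               // back edge
            low[curr] = std::min(low[curr], number[i]);
        }
    }
    if (prev == -1 && kids >= 2) isCut[curr] = true; // root case
}
To get the blocks themselves, push each edge onto a stack when you first traverse it, and whenever low[i] >= number[curr] signals a cut point, pop edges up to and including (curr, i): those popped edges form one biconnected component.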