use std::thread and join for parallelism - c++

I'm writing a program that iterates through all chromosomes of a FASTA file and splits each one into pieces of 10 bp. The function is called chrdata, and I am saving these fragments into a single file. The fragmentation of each chromosome is completely independent of the other chromosomes, so I'm trying to use threads.
chrdata(faidx_t *seq_ref ,int chr_no,FILE *fp)
My goal is to make this process faster. To achieve this I have tried multithreading with std::thread.
I have tried different things.
(1) First I tried creating a thread for the first chromosome, calling thread.join(), then creating the next thread for the next chromosome, and so on.
(2) Then I tried creating multiple threads at once, as explained in Simultaneous Threads in C++ using <thread>
This is the example below.
However, as far as I understand and from what I have read, I always need to call join(), otherwise I end up with "terminate called without an active exception". The issue is that there is no difference in execution time between example (1) and (2).
My understanding is that this is because, despite creating a vector of thread objects, they still have to be joined, and so I have to wait for all the threads to finish. This would mean the execution is concurrent and not parallel.
So my question is: can anyone suggest how I might change the function below to make it faster through parallel execution?
Or is my understanding of join and concurrency wrong here? I'm not completely sure why we cannot just skip the whole join part; if all the threads are done, why can't we just use detach()?
void function(const char* fastafile, FILE *fp, int thread_no) {
    std::vector<std::thread> threads;
    // extracting the chromosome file
    faidx_t *seq_ref = NULL;
    seq_ref = fai_load(fastafile);
    assert(seq_ref != NULL);
    int chr_total = 10; // just the first 10 chromosomes
    int chr_idx = 0;
    int chr_no = 0;
    while (chr_idx < chr_total) {
        for (; chr_no < std::min(chr_idx + thread_no, chr_total); chr_no++) {
            threads.push_back(std::thread(chrdata, seq_ref, chr_no, fp));
        }
        for (auto &th : threads) { th.join(); }
        threads.clear();
        chr_idx = chr_idx + thread_no;
    }
}
I haven't attached main() or chrdata(), to keep the code and question clearer.
pastebin.com/iY6u9CbH
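One point in the question above can be illustrated in isolation: join() only blocks the thread that calls it, while threads that have already been started keep running side by side. The following is a minimal sketch, not the asker's code. `fragment` is a hypothetical stand-in for chrdata(), `fragment_all` and `piece_len` are names invented here, and htslib is not used at all. Each thread writes only to its own result slot, so nothing is shared while the threads run.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for chrdata(): split one sequence into pieces of piece_len characters.
std::vector<std::string> fragment(const std::string& seq, std::size_t piece_len) {
    assert(piece_len > 0);
    std::vector<std::string> pieces;
    for (std::size_t pos = 0; pos < seq.size(); pos += piece_len) {
        pieces.push_back(seq.substr(pos, piece_len));
    }
    return pieces;
}

// Start one thread per sequence, then join them all.
// Each thread writes only to results[i], so no locking is needed.
std::vector<std::vector<std::string>> fragment_all(const std::vector<std::string>& seqs,
                                                   std::size_t piece_len) {
    std::vector<std::vector<std::string>> results(seqs.size());
    std::vector<std::thread> threads;
    threads.reserve(seqs.size());
    for (std::size_t i = 0; i < seqs.size(); ++i) {
        threads.emplace_back([&seqs, &results, i, piece_len] {
            results[i] = fragment(seqs[i], piece_len);
        });
    }
    // Every thread has been started by now; join() only waits for each to finish.
    for (auto& th : threads) {
        th.join();
    }
    return results;
}
```

After the joins, a single thread can write the collected fragments to the output file in a fixed order. Whether the real chrdata() scales the same way depends on what it shares between threads (for example the FILE* and the faidx_t handle), which the question does not show.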

Related

Multithreaded concurrent file reading/writing, managing container of processes

Wholly new to multithreading.
I am writing a program which takes as input a vector of objects and an integer for the number of threads to dedicate. The nature of the objects isn't important, only that each has several members that are file paths to large text files. Here's a simplified version:
// Not very important. Reads file, writes new version omitting
// some lines
void proc_file(OBJ obj) {
std::string inFileStr(obj.get_path().c_str());
std::string outFileStr(std::string(obj.get_path().replace_extension("new.txt").c_str()));
std::ifstream inFile(inFileStr);
std::ofstream outFile(outFileStr);
std::string currLine;
while (getline(inFile, currLine)) {
if (currLine.size() == 1 ||
currLine.compare(currLine.length()-5, 5, "thing") != 0) {
outFile << currLine << '\n';
}
else {
for (int i = 0; i < 3; i++) {
getline(inFile, currLine);
}
}
}
inFile.close();
outFile.close();
}
// Processes n file concurrently, working way through
// all OBJ in objs
void multi_file_proc(std::vector<OBJ> objs, int n) {
std::vector<std::thread> procVec;
for (int i = 0; i < objs.size(); i++) {
/*
Ensure that n files are always being processed.
Upon completion of one, initiate another, until
all OBJ in objs have had their text files changed.
*/
}
}
I want to loop through each OBJ and write altered versions of their text files in concurrence, the limitation on simultaneous file read/writes being the thread value (n). Ultimately, all the objects' text files must be changed, but in such a way that there are always n files being processed, to maximize efficiency in concurrence.
Note the vector of threads, procVec. I originally approached this by managing a vector of threads, with a file being processed for each thread in procVec. From my reading, it seems a vector for managing these tasks is logical. But how do I always ensure there are n files open until all have been processed, without exiting with an open thread?
Edit: Apologies, my intention was not to ask others to write code for me. I just didn't want my approach to bias anyone's answer if the approach was bad to begin with.
These are some things I've tried (this code would go into the block comment in my function):
1. First approach. Idea is to add to procVec up until the thread limit n was reached, then join, remove a process from the front of the vector upon its completion. This is a summary of several similar iterations, none of which worked:
if (i >= n) {
procVec.front().join();
procVec.erase(procVec.begin());
}
procVec.push_back(std::thread(proc_file, sra[i]));
Problems with this:
Incorrectly assumes front of vector will always finish first
(Possibly?) Invalidates all iterators in procVec after first is erased
2. Using mutexes, I attempt writing a lambda function where the thread would be removed upon its completion. This is my current approach. Unsure why it isn't working, or if it even suits my needs:
// remThread() and lamb() defined above main function, **procVec** and **threadMutex**
//are global variables
void remThread(std::thread::id id) {
    std::lock_guard<std::mutex> lock(threadMutex);
    auto iter = std::find_if(procVec.begin(), procVec.end(), [=](std::thread &t)
        { return (t.get_id() == id); });
    if (iter != procVec.end()) {
        iter->join();
        procVec.erase(iter);
    }
}
void lamb(SRA sra, std::thread::id id) {
proc_file(sra);
remThread(id);
}
// This is the code contained in the main for loop. called lambda to process file
// and then remove thread
std::lock_guard<std::mutex> lock(threadMutex);
procVec.push_back(std::thread([sras, i]() {
std::thread(lamb, sras[i], std::this_thread::get_id()).detach();
}));
Problems with this:
Program terminates, likely a joinable thread is active, leaves scope
Given that the example you show is fairly simple (a for loop of fixed size with no strange dependencies), a very simple solution could be to use OpenMP, which would allow you to do what you describe (provided I understood correctly) by adding a single line in front of the for loop:
void multi_file_proc(std::vector<OBJ> objs, int n) {
std::vector<std::thread> procVec;
#pragma omp parallel for num_threads(n) schedule(dynamic, 1)
for (int i = 0; i < objs.size(); i++) {
/*
...
*/
}
}
Of course you then have to modify your compile command to add OpenMP support. The precise flag differs from compiler to compiler, e.g. -fopenmp for g++, -qopenmp for icpc, etc.
The line above basically instructs the compiler to create code to execute the for loop below in parallel. The important bit here is the last one where we set the schedule. Dynamic simply means that the order is not predetermined, instead threads will get their next iteration when they finish with the last. The integer 1 there defines the number of steps they take at a time, given that each file is large we want something fine grained since we don't expect too much overhead from the scheduling.
A word of caution, OpenMP, like most of C++, will not even try to stop you from shooting yourself in the foot. And with concurrency there are whole new ways to do just that.
Finally, this is by no means guaranteed to be the absolute best solution outright. For instance if your files are of varying lengths then you would probably want to sort the objects from longest to shortest before the loop. This way once the last object is being processed (at some point only a single thread will be working on the final object) that won't take too long.
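For comparison, the same "grab the next item when you finish the last one" behaviour can be sketched with plain std::thread and an atomic index, without OpenMP. This is only an illustration under assumptions: `run_dynamic` and its parameters are names invented here, and the task is passed in as a callable so that proc_file() and OBJ from the question do not need to be defined.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run task(i) for every i in [0, count) on n worker threads.
// A worker claims the next unprocessed index as soon as it finishes its current one,
// similar in spirit to schedule(dynamic, 1).
void run_dynamic(std::size_t count, unsigned n,
                 const std::function<void(std::size_t)>& task) {
    assert(n > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    workers.reserve(n);
    for (unsigned w = 0; w < n; ++w) {
        workers.emplace_back([&next, &task, count] {
            for (;;) {
                const std::size_t i = next.fetch_add(1);  // claim one index
                if (i >= count) break;                    // nothing left to do
                task(i);
            }
        });
    }
    // Exactly n threads exist for the whole run, and all are joined before returning.
    for (auto& t : workers) {
        t.join();
    }
}
```

In the question's setting, a call might look like `run_dynamic(objs.size(), n, [&](std::size_t i) { proc_file(objs[i]); });`. No thread is ever erased from a container while it runs, which sidesteps the problems listed for both of the asker's approaches.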

How to avoid destroying and recreating threads inside loop?

I have a loop that creates and uses two threads. The threads always do the same thing, and I'm wondering how they can be reused instead of being created and destroyed each iteration. Some other operations done inside the loop affect the data the threads process. Here is a simplified example:
const int args1 = foo1();
const int args2 = foo2();
vector<string> myVec = populateVector();
int a = 1;
for (int i = 0; i < 100; i++)
{
    auto func = [&](const int arg) {
        // do stuff involving variable a
        foo3(myVec[a]);
    };
    thread t1(func, args1);
    thread t2(func, args2);
    t1.join();
    t2.join();
    a = 2 * a;
}
Is there a way to have t1 and t2 restart? Is there a design pattern I should look into? I ask because adding threads made the program slightly slower when I thought it would be faster.
You can use std::async as suggested in the comments.
What you're also trying to do is a very common use case for a thread pool. A simple header-only implementation that I commonly use is here
To use this library, create the pool outside of the loop with a number of threads set during construction. Then enqueue a function in which a thread will go off and execute. With this library, you'll be getting a std::future (much like the std::async steps) and this is what you'd wait on in your loop.
Generally, you'd want to make access to any data thread-safe with mutexes (or other means; there are a lot of ways to do this), but under very specific conditions you won't need to.
In this case, as long as
the vector isn't being increased in size (doesn't need to reallocate)
you are only reading items, or each item is modified only by its own thread
then you wouldn't need to worry about synchronization.
Though it's just good habit to do the sync anyway... When other people eventually modify the code, they're not going to know your rules and will cause issues.
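As a rough illustration of the std::async suggestion, the loop can hand each piece of work to std::async and wait on the returned futures instead of constructing and joining std::thread objects by hand. This is a sketch under assumptions: `foo3` here is a hypothetical stand-in that returns a string length, `run_loop` is a name invented for the example, and the second task reads `myVec[a - 1]` purely to give the two tasks different inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for foo3(): returns the length of one entry.
std::size_t foo3(const std::string& s) { return s.size(); }

// Each iteration launches two tasks and waits on their futures.
// The vector is only read and never resized, so no mutex is used.
std::size_t run_loop(const std::vector<std::string>& myVec, int iterations) {
    assert(iterations >= 0);
    std::size_t total = 0;
    std::size_t a = 1;
    for (int i = 0; i < iterations && a < myVec.size(); ++i) {
        auto f1 = std::async(std::launch::async, [&myVec, a] { return foo3(myVec[a]); });
        auto f2 = std::async(std::launch::async, [&myVec, a] { return foo3(myVec[a - 1]); });
        total += f1.get() + f2.get();  // get() blocks until the task's result is ready
        a = 2 * a;                     // modified only between batches, never during one
    }
    return total;
}
```

Note that std::async does not promise to reuse threads; an implementation may still start a fresh thread for each call. A thread pool created once outside the loop is what actually guarantees reuse, and waiting on its futures would look much the same as the `get()` calls here.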

Synchronize n Threads with only using Semaphore and/or mutex in C++

We're studying for our test next week, and have been given an exercise from our teacher, and we just don't see the solution:
How to synchronize n threads, so that all n threads wait at a specific location and only continue with their "work" together when all n threads have reached that location?
We're allowed to use Mutex and Semaphore constructs. The solution should be easy, but we just cant find the answer.
Here's a big hint. You need 2 semaphores, both with N flags. You can solve this with an extra thread. The key is that you can call down() on a semaphore multiple times. e.g. If you call down() on a semaphore 8 times, you need all 8 up()'s before you can continue.
// an additional thread (not one of the N)
void trigger(Semaphore* workersCollect, Semaphore* workersRelease, int n)
{
while(true)
{
for (int i = 0; i < n; ++i)
workersCollect->down();
for (int i = 0; i < n; ++i)
workersRelease->up();
}
}
// Prototype for the "checkpoint" function (exercise for the reader)
void await(Semaphore* workersCollect, Semaphore* workersRelease);
You can also solve it without the extra thread, by using more complicated state checking.
This design has a drawback. If a worker finishes its work extremely quickly, it can grab more than one task (while another thread ends up not running at all). This is fine if you have a thread-pool kind of design, but bad if, say, each thread is supposed to work on its own distinct section of a dataset.
To fix that, you need a semaphore per thread. Something akin to
Semaphore workerRelease[N];
but being careful to avoid false sharing. (You don't want more than 1 semaphore on a cache line.)
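For readers who want to check their answer to the exercise, here is one possible way the two-semaphore scheme could fit together. It is a sketch under assumptions, not the teacher's intended solution: the `Semaphore` class below is a small stand-in built from std::mutex and std::condition_variable with the up()/down() names used above, `trigger_once` runs a single round of the trigger loop so the example terminates, and `count_workers_released_together` is a hypothetical demo helper.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore (an assumed stand-in for the course's Semaphore type).
class Semaphore {
public:
    explicit Semaphore(int initial = 0) : count_(initial) { assert(initial >= 0); }
    void up() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void down() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_;
};

// One round of the trigger loop: collect n arrivals, then release n workers.
void trigger_once(Semaphore* workersCollect, Semaphore* workersRelease, int n) {
    for (int i = 0; i < n; ++i) workersCollect->down();
    for (int i = 0; i < n; ++i) workersRelease->up();
}

// Checkpoint: announce arrival, then block until the trigger releases the workers.
void await(Semaphore* workersCollect, Semaphore* workersRelease) {
    workersCollect->up();
    workersRelease->down();
}

// Demo: returns how many of the n workers observed all n arrivals after the checkpoint.
int count_workers_released_together(int n) {
    Semaphore collect, release;
    std::atomic<int> arrived{0};
    std::atomic<int> ok{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back([&] {
            arrived.fetch_add(1);                      // "work" before the checkpoint
            await(&collect, &release);
            if (arrived.load() == n) ok.fetch_add(1);  // everyone had arrived by now
        });
    }
    std::thread trig(trigger_once, &collect, &release, n);
    for (auto& w : workers) w.join();
    trig.join();
    return ok.load();
}
```

Because the trigger releases nobody until it has collected all n arrivals, every worker that passes the checkpoint sees the full arrival count. With a looping trigger and repeated rounds, the drawback described above (a fast worker taking two releases) still applies.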

Multithreading recursive program c++

I am working on a recursive algorithm which we want to parallelize to improve the performance.
I implemented multithreading using Visual C++ 12.0 and the <thread> library. However, I don't see any performance improvement. The time taken is either a few milliseconds less or more than the time with a single thread.
Please let me know if I am doing something wrong and what corrections I should make to the code.
Here is my code:
void nonRecursiveFoo(<className> &data, int first, int last)
{
//process the data between first and last index and set its value to true based on some condition
//no threads are created here
}
void recursiveFoo(<className> &data, int first, int last)
{
int partitionIndex = -1;
data[first]=true;
data[last]=true;
for (int i = first + 1; i < last; i++)
{
//some logic setting the index
if (/* some condition is true */)
partitionIndex = i;
}
//no dependency of partitions on one another and so can be parallelized
if( partitionIndex != -1)
{
data[partitionIndex]=true;
//assume some threadlimit
if (Commons::GetCurrentThreadCount() < Commons::GetThreadLimit())
{
std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
Commons::IncrementCurrentThreadCount();
recursiveFoo(data, partitionIndex , last);
t1.join();
}
else
{
nonRecursiveFoo(data, first, partitionIndex );
nonRecursiveFoo(data, partitionIndex , last);
}
}
}
//main
int main()
{
recursiveFoo(data, 0, data.size() - 1);
}
//commons
std::mutex threadCountMutex;
static void Commons::IncrementCurrentThreadCount()
{
threadCountMutex.lock();
CurrentThreadCount++;
threadCountMutex.unlock();
}
static int GetCurrentThreadCount()
{
return CurrentThreadCount;
}
static void SetThreadLimit(int count)
{
ThreadLimit = count;
}
static int GetThreadLimit()
{
return ThreadLimit;
}
static int GetMinPointsPerThread()
{
return MinimumPointsPerThread;
}
Without further information (see comments) this is mostly guesswork, but there are a few things you should watch out for:
First of all, make sure that your partitioning logic is very short and fast compared to the processing. Otherwise, you are just creating more work than you gain processing power.
Make sure, there is enough work to begin with or the speedup might be not enough to pay for the additional overhead of thread creation.
Check that your work gets evenly distributed among the different threads and don't spawn more threads than you have cores on your computer (print the number of total threads at the end - don't rely on your ThreadLimit).
Don't let your partitions get too small, (especially no less than 64 Bytes) or you end up with false sharing.
It would be MUCH more efficient, to implement CurrentThreadCount as a std::atomic<int> in which case you don't need a mutex.
Put the increment of the counter before the creation of the thread. Otherwise, the newly created thread might read the counter before it is incremented and spawn a new thread again, even if the max number of threads is already reached (This is still not a perfect solution, but I would only invest more time on this if you have verified, that overcommitting is your actual problem)
If you really must use a mutex (for reasons outside of the example code) you have to use it for every access to CurrentThreadCount (read and write access). Otherwise this is - strictly speaking - a race condition and thus UB.
By using t1.join you're basically waiting for the other thread to finish - i.e. not doing anything in parallel.
By looking at your algorithm I don't see how it can be parallelized(thus improved) by using threads - you have to wait for a single recursive call to end.
First of all, you are not doing anything in parallel, as every thread creation blocks, until the created thread has finished. Hence, your multithreaded code will always be slower than the non multithreaded version.
In order to parallelize, you could spawn threads for the part where the non-recursive function is called, put the thread objects into a vector, and join them at the highest level of the recursion by walking through the vector. (There are more elegant ways to do that, but for a first shot this would be OK, I think.)
Thus, all non-recursive calls will run in parallel. However, instead of the maximum number of threads, you should use the size of the problem as the condition, e.g. last - first < threshold.
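Two of the suggestions above, keeping the thread count in a std::atomic<int> that is incremented before the thread is created and falling back to serial work below a size threshold, can be sketched together. Everything here is an assumption made for illustration: the workload is a plain recursive sum rather than the asker's partitioning logic, and `g_threads_in_use`, `try_reserve_thread`, `recursive_sum`, `limit` and `threshold` are invented names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

std::atomic<int> g_threads_in_use{1};  // the calling thread counts as one

// Reserve a thread slot before spawning; give it back if the limit is exceeded.
bool try_reserve_thread(int limit) {
    if (g_threads_in_use.fetch_add(1) < limit) return true;
    g_threads_in_use.fetch_sub(1);
    return false;
}

// Hypothetical workload: sum data[first, last) by recursive halving.
long long recursive_sum(const std::vector<int>& data, std::size_t first, std::size_t last,
                        int limit, std::size_t threshold) {
    assert(first <= last && last <= data.size());
    assert(threshold >= 1);
    if (last - first <= threshold) {  // small problem: do it serially
        long long s = 0;
        for (std::size_t i = first; i < last; ++i) s += data[i];
        return s;
    }
    const std::size_t mid = first + (last - first) / 2;
    if (try_reserve_thread(limit)) {
        long long left = 0;
        std::thread t([&] { left = recursive_sum(data, first, mid, limit, threshold); });
        const long long right = recursive_sum(data, mid, last, limit, threshold);
        t.join();                     // joined after this thread has done its own half
        g_threads_in_use.fetch_sub(1);
        return left + right;
    }
    // No free slot: process both halves on the current thread.
    return recursive_sum(data, first, mid, limit, threshold)
         + recursive_sum(data, mid, last, limit, threshold);
}
```

The counter may briefly overshoot the limit between fetch_add and fetch_sub, but no thread is created in that window, so the number of live threads stays within `limit`. Whether this gives a speedup still depends on there being enough work per partition, as the first answer points out.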

when to use mutex

Here is the thing: there is a float array float bucket[5] and 2 threads, say thread1 and thread2.
Thread1 is in charge of tanking up the bucket, assigning each element in bucket a random number. When the bucket is tanked up, thread2 will access bucket and read its elements.
Here is how I do the job:
float bucket[5];
pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
pthread_t thread1, thread2;
void* thread_1_proc(void*); //thread1's startup routine, tank up the bucket
void* thread_2_proc(void*); //thread2's startup routine, read the bucket
int main()
{
pthread_create(&thread1, NULL, thread_1_proc, NULL);
pthread_create(&thread2, NULL, thread_2_proc, NULL);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
}
Below is my implementation for thread_x_proc:
void* thread_1_proc(void*)
{
while(1) { //make it work forever
pthread_mutex_lock(&mu); //lock the mutex, right?
cout << "tanking\n";
for(int i=0; i<5; i++)
bucket[i] = rand(); //actually, rand() returns int, doesn't matter
pthread_mutex_unlock(&mu); //bucket tanked, unlock the mutex, right?
//sleep(1); /* this line is commented */
}
}
void* thread_2_proc(void*)
{
while(1) {
pthread_mutex_lock(&mu);
cout << "reading\n";
for(int i=0; i<5; i++)
cout << bucket[i] << " "; //read each element in the bucket
pthread_mutex_unlock(&mu); //reading done, unlock the mutex, right?
//sleep(1); /* this line is commented */
}
}
Question
Is my implementation right? Cuz the output is not as what I expected.
...
reading
5.09434e+08 6.58441e+08 1.2288e+08 8.16198e+07 4.66482e+07 7.08736e+08 1.33455e+09
reading
5.09434e+08 6.58441e+08 1.2288e+08 8.16198e+07 4.66482e+07 7.08736e+08 1.33455e+09
reading
5.09434e+08 6.58441e+08 1.2288e+08 8.16198e+07 4.66482e+07 7.08736e+08 1.33455e+09
reading
tanking
tanking
tanking
tanking
...
But if I uncomment the sleep(1); in each thread_x_proc function, the output is right, tanking and reading follow each other, like this:
...
tanking
reading
1.80429e+09 8.46931e+08 1.68169e+09 1.71464e+09 1.95775e+09 4.24238e+08 7.19885e+08
tanking
reading
1.64976e+09 5.96517e+08 1.18964e+09 1.0252e+09 1.35049e+09 7.83369e+08 1.10252e+09
tanking
reading
2.0449e+09 1.96751e+09 1.36518e+09 1.54038e+09 3.04089e+08 1.30346e+09 3.50052e+07
...
Why? Should I use sleep() when using mutex?
Your code is technically correct, but it does not make a lot of sense, and it does not do what you assume.
What your code does is, it updates a section of data atomically, and reads from that section, atomically. However, you don't know in which order this happens, nor how often the data is written to before being read (or if at all!).
What you probably wanted is to generate exactly one sequence of numbers in one thread each time and to read exactly one new sequence each time in the other thread. For this, you would have to use either an additional semaphore or, better, a single-producer-single-consumer queue.
In general the answer to "when should I use a mutex" is "never, if you can help it". Threads should send messages, not share state. This makes a mutex most of the time unnecessary, and offers parallelism (which is the main incentive for using threads in the first place).
The mutex makes your threads run lockstep, so you could as well just run in a single thread.
There is no implied order in which threads get to run, so you should not expect any particular order. What's more, it is possible for one thread to run over and over without letting the other run. This is implementation-specific and should be treated as random.
The case you presented is better suited to a semaphore which is "posted" with each element added.
However if it has always to be like:
write 5 elements
read 5 elements
you should have two mutexes:
one that blocks producer until the consumer finished
one that blocks consumer until the producer finished
So the code should look something like that:
Producer:
while(true){
lock( &write_mutex )
[insert data]
unlock( &read_mutex )
}
Consumer:
while(true){
lock( &read_mutex )
[read data]
unlock( &write_mutex )
}
Initially write_mutex should be unlocked and read_mutex locked.
As I said your code seems to be a better case for semaphores or maybe condition variables.
Mutexes are not meant for cases such as this (which doesn't mean you can't use them, it just means there are more handy tools to solve that problem).
You have no right to assume that just because you want your threads to run in a particular order, the implementation will figure out what you want and actually run them in that order.
Why shouldn't thread2 run before thread1? And why shouldn't each thread complete its loop several times before the other thread gets a chance to run up to the line where it acquires the mutex?
If you want execution to switch between two threads in a predictable way, then you need to use a semaphore, condition variable, or other mechanism for messaging between the two threads. sleep appears to result in the order you want on this occasion, but even with the sleep you haven't done enough to guarantee that they will alternate. And I have no idea why the sleep makes a difference to which thread gets to run first -- is that consistent across several runs?
If you have two functions that should execute sequentially, i.e. F1 should finish before F2 starts, then you shouldn't be using two threads. Run F2 on the same thread as F1, after F1 returns.
Without threads, you won't need the mutex either.
It isn't really the issue here.
The sleep only gives the 'other' thread a chance to acquire the mutex (it is already waiting for the lock, so it will probably get it). Even so, you cannot be sure that the first thread won't re-lock the mutex before the other thread gets access.
A mutex is for protecting data, so that two threads don't:
a) write simultaneously
b) have one writing while another is reading
It is not for making threads work in a certain order (if you want that functionality, ditch the threaded approach or use a flag to tell that the 'tank' is full, for example).
By now, it should be clear, from the other answers, what are the mistakes in the original code. So, let's try to improve it:
/* A flag that indicates whose turn it is. */
char tanked = 0;
void* thread_1_proc(void*)
{
while(1) { //make it work forever
pthread_mutex_lock(&mu); //lock the mutex
if(!tanked) { // is it my turn?
cout << "tanking\n";
for(int i=0; i<5; i++)
bucket[i] = rand(); //actually, rand() returns int, doesn't matter
tanked = 1;
}
pthread_mutex_unlock(&mu); // unlock the mutex
}
}
void* thread_2_proc(void*)
{
while(1) {
pthread_mutex_lock(&mu);
if(tanked) { // is it my turn?
cout << "reading\n";
for(int i=0; i<5; i++)
cout << bucket[i] << " "; //read each element in the bucket
tanked = 0;
}
pthread_mutex_unlock(&mu); // unlock the mutex
}
}
The code above should work as expected. However, as others have pointed out, the result would be better accomplished with one of these two other options:
Sequentially. Since the producer and the consumer must alternate, you don't need two threads. One loop that tanks and then reads would be enough. This solution would also avoid the busy waiting that happens in the code above.
Using semaphores. This would be the solution if the producer was able to run several times in a row, accumulating elements in a bucket (not the case in the original code, though).
http://en.wikipedia.org/wiki/Producer-consumer_problem#Using_semaphores
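Several answers above mention condition variables without showing one. The sketch below keeps the `tanked` flag from the improved code but lets each thread sleep on a condition variable instead of spinning on the mutex. It swaps pthreads for std::mutex and std::condition_variable, runs a fixed number of rounds so that it terminates, and fills the bucket with predictable values instead of rand(); `run_rounds` is a name invented for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Alternate "tank" and "read" for a fixed number of rounds without busy waiting.
// Returns the values the reader saw, in the order it saw them.
std::vector<int> run_rounds(int rounds) {
    assert(rounds >= 0);
    std::mutex mu;
    std::condition_variable cv;
    bool tanked = false;  // same meaning as the flag above, protected by mu
    int bucket[5] = {0};
    std::vector<int> seen;

    std::thread producer([&] {
        for (int r = 0; r < rounds; ++r) {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, [&] { return !tanked; });  // sleep until the bucket is empty
            for (int i = 0; i < 5; ++i) bucket[i] = r * 5 + i;
            tanked = true;
            cv.notify_one();
        }
    });
    std::thread consumer([&] {
        for (int r = 0; r < rounds; ++r) {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, [&] { return tanked; });   // sleep until the bucket is full
            for (int i = 0; i < 5; ++i) seen.push_back(bucket[i]);
            tanked = false;
            cv.notify_one();
        }
    });
    producer.join();
    consumer.join();
    return seen;
}
```

The mutex still only protects the data; the ordering comes from the flag and the condition variable. As noted above, if tanking and reading must strictly alternate, a single loop on one thread does the same job with less machinery.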