I have created an application which uses the pipeline pattern to do some processing.
However, I noticed that when the pipeline is run multiple times in a row it tends to get slower and slower.
This is also the case when no actual processing is done in the pipeline stages - so I am curious if maybe my pipeline implementation has a problem.
This is a simple test program which reproduces the effect:
#include <iostream>
#include <boost/thread.hpp>
class Pipeline {
void processStage(int i) {
return;
}
public:
void run() {
boost::thread_group threads;
for (int i=0; i< 8; ++i) {
threads.add_thread(new boost::thread(&Pipeline::processStage, this, i));
}
threads.join_all();
}
};
int main() {
Pipeline pipeline;
int n=2000;
for (int i=0;i<n; ++i) {
pipeline.run();
if (((i+1)*100)/n > (i*100)/n)
std::cout << "\r" << ((i+1)*100)/n << " %";
}
}
In my understanding the threads are created in run() and at the end of run() they are terminated. So the state of the program at the beginning of the outer loop in the main program should always be the same...
But what I observe is an increasing slowdown when processing this loop.
I know that it would be more efficient to keep the pipeline threads alive throughout the whole program, but I need to know if there is a problem with my pipeline implementation.
Thanks!
Constantin
I do not know the exact reason for the slowdown in run(), but when I use the code above and insert a short sleep (500 ms) at the end of the loop in main(), the slowdown of run() is gone. So the system seems to need some "recovery time" before it is able to create new threads.
Since you call new boost::thread(), did you try to clean the threads up? If you run on Windows, check in the Task Manager how many threads the process has open and, if required, close the thread handles. I suspect the number of threads created by the system keeps increasing.
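One way to check whether run() really gets slower is to time fixed-size blocks of iterations and compare them. The following is only a minimal sketch: it uses std::thread instead of boost::thread, and the names runOnce and timeBlocks are hypothetical helpers, not part of the original program.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Stand-in for an empty pipeline stage (assumption: no real work is done).
void processStage(int) {}

// Create and join `stages` threads once, like Pipeline::run() in the question.
void runOnce(int stages) {
    std::vector<std::thread> threads;
    for (int i = 0; i < stages; ++i)
        threads.emplace_back(processStage, i);
    for (auto& t : threads)
        t.join();  // joining lets the system reclaim the thread's resources
}

// Time `blocks` consecutive blocks of `runsPerBlock` runs each.
// Returns one duration in milliseconds per block.
std::vector<double> timeBlocks(int blocks, int runsPerBlock, int stages) {
    std::vector<double> ms;
    for (int b = 0; b < blocks; ++b) {
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < runsPerBlock; ++r)
            runOnce(stages);
        auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

Calling timeBlocks(20, 100, 8) and printing the values would show whether later blocks take longer than earlier ones, and watching the process's thread count at the same time would confirm or rule out leaked threads.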
Related
I have a task: to write a multithreaded web crawler (actually, I have a local set of .html files that I need to parse and move to another directory). The main requirement is that the number of threads can be chosen freely, so that I can determine at what thread count the program stops gaining performance.
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
#include <queue>
#include <ctime>
#include <set>
#include <chrono>
#include <atomic>
using namespace std;
class WebCrawler{
private:
const string start_path = "/";
const string end_path = "/";
int thread_counts;
string home_page;
queue<string> to_visit;
set<string> visited;
vector<thread> threads;
mutex mt1;
int count;
public:
WebCrawler(int thread_counts_, string root_)
:thread_counts(thread_counts_), home_page(root_) {
to_visit.push(root_);
visited.insert(root_);
count = 0;
}
void crawler(){
for(int i = 0; i<thread_counts; i++)
threads.push_back(thread(&WebCrawler::start_crawl, this));
for(auto &th: threads)
th.join();
cout<<"Count: "<<count<<endl;
}
void parse_html(string page_){
cout<<"Thread: "<<this_thread::get_id()<<" page: "<<page_<< endl;
ifstream page;
page.open(start_path+page_, ios::in);
string tmp;
getline(page, tmp);
page.close();
for(int i = 0; i<tmp.size(); i++){
if( tmp[i] == '<'){
string tmp_num ="";
while(tmp[i]!= '>'){
if(isdigit(tmp[i]))
tmp_num+=tmp[i];
i++;
}
tmp_num+= ".html";
if((visited.find(tmp_num) == visited.end())){
mt1.lock();
to_visit.push(tmp_num);
visited.insert(tmp_num);
mt1.unlock();
}
}
}
}
void move(string page_){
mt1.lock();
count++;
ofstream page;
page.open(end_path+page_, ios::out);
page.close();
mt1.unlock();
}
void start_crawl(){
cout<<"Thread started: "<<this_thread::get_id()<< endl;
string page;
while(!to_visit.empty()){
mt1.lock();
page = to_visit.front();
to_visit.pop();
mt1.unlock();
parse_html(page);
move(page);
}
}
};
int main(int argc, char const *argv[])
{
int start_time = clock();
WebCrawler crawler(7, "0.html");
crawler.crawler();
int end_time = clock();
cout<<"Time: "<<(float)(end_time -start_time)/CLOCKS_PER_SEC<<endl;
cout<<thread::hardware_concurrency()<<endl;
return 0;
}
1 thread = Time: 0.709504
2 thread = Time: 0.668037
4 thread = Time: 0.762967
7 thread = Time: 0.781821
I've been trying to figure out for a week why my program is running slower even on two threads. I probably don't fully understand how mutex works, or perhaps the speed is lost during the joining of threads. Do you have any ideas how to fix it?
There are many ways to protect things in multithreading, implicit or explicit.
In addition to the code being totally untested, there are some implicit assumptions that must be considered, for example that int is large enough for your task.
Let's make a short analysis of what needs protection.
- Variables that are accessed from multiple threads
  - things that are const can be excluded
    - unless you const cast them
    - or parts of them are mutable
- Global objects like files or cout
  - could be overwritten
  - could be written from multiple threads
    - streams have their own internal locks
    - so you can write to cout from multiple threads
    - but you don't want that for the files in this case
  - if multiple threads want to open the same file, you will get an error
- std::endl forces a flush, so change it to "\n" like a commenter noted.
So this boils down to:
queue<string> to_visit;
set<string> visited; // should be renamed visiting
int count;
<streams and files>
count is easy
std::atomic<int> count;
The files are implicitly protected by your visited/visiting check, so they are good too. So the mutex in move can be removed.
The remaining two containers each need a mutex, as they could be updated independently.
mutex mutTovisit, // formerly known as mt1
mutVisiting;
Now we have the problem that we could deadlock with two mutexes, if we try to lock in different order in two places. You need to read up on all the lock stuff if you add more locks, scoped_lock and lock are good places to start.
Changing the code to
{
scoped_lock visitLock(mutVisiting); // unlocks at end of } even if stuff throws
if((visited.find(tmp_num) == visited.end())){
scoped_lock toLock(mutTovisit);
to_visit.push(tmp_num);
visited.insert(tmp_num);
}
}
And in this code there are multiple errors, that are hidden by the not thread safe access to to_visit and the randomness of the thread starts.
while(!to_visit.empty()){ // 2. now the next thread starts, sees it's empty and stops
// 3. or worse, it starts and then hangs at the lock
mt1.lock();
page = to_visit.front(); // 4. and does things that are not good here with an empty to_visit
to_visit.pop(); // 1. is now empty after reading root
mt1.unlock();
parse_html(page);
move(page);
}
To solve this you need an (atomic?) counter of the currently known unvisited pages (call it found), so we know when we are done. Then, to wake threads when there is new work to be done, we can use std::condition_variable (or std::condition_variable_any).
The general idea of the plan is to have the threads wait until work is available, and to call notify_one each time a new page is discovered so that one waiting thread starts working.
To start up, set found to 1 and call notify_one once the threads have started; when a thread is done with a page, it decrements found. To stop, the thread that decrements found to zero calls notify_all so that all threads can exit.
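The plan above can be sketched roughly as follows. This is a minimal sketch under several assumptions: the counter is called pending here (it plays the role of found), it is protected by the same mutex as the queue instead of being atomic, and findLinks is a hypothetical stand-in for parsing a page and returning the pages it links to. CrawlQueue and crawl are illustrative names, not part of the original program.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for "parse a page and return the pages it links to".
using LinkFinder = std::function<std::vector<std::string>(const std::string&)>;

class CrawlQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> toVisit;
    std::set<std::string> seen;
    int pending = 0;  // pages that are queued or currently being processed

public:
    // Queue a page unless it was already seen, then wake one waiting worker.
    void add(const std::string& page) {
        {
            std::lock_guard<std::mutex> lock(m);
            if (!seen.insert(page).second) return;
            toVisit.push(page);
            ++pending;
        }
        cv.notify_one();
    }

    // Block until a page is available or all work is finished.
    // Returns false when nothing is left to do.
    bool next(std::string& page) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !toVisit.empty() || pending == 0; });
        if (toVisit.empty()) return false;
        page = toVisit.front();
        toVisit.pop();
        return true;
    }

    // Mark one page as fully processed; the last one wakes every worker.
    void done() {
        bool finished;
        {
            std::lock_guard<std::mutex> lock(m);
            finished = (--pending == 0);
        }
        if (finished) cv.notify_all();
    }
};

// Crawl from `root` with `threadCount` workers; returns the number of pages handled.
int crawl(const std::string& root, int threadCount, const LinkFinder& findLinks) {
    CrawlQueue queue;
    std::atomic<int> count{0};
    queue.add(root);
    std::vector<std::thread> workers;
    for (int i = 0; i < threadCount; ++i) {
        workers.emplace_back([&] {
            std::string page;
            while (queue.next(page)) {
                for (const auto& link : findLinks(page)) queue.add(link);
                ++count;
                queue.done();  // only after the page's links have been queued
            }
        });
    }
    for (auto& w : workers) w.join();
    return count.load();
}
```

Because a page only stops counting as pending after its links have been queued, a worker that finds the queue empty can tell "no work yet" apart from "no work ever again", which is the distinction the original while(!to_visit.empty()) loop cannot make.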
What you will find is that if the data is on a single slow disk, you are unlikely to see much effect from more than 2 threads. If all files are currently cached in RAM, you might see more effect.
I think there's a bottleneck in your move function: it holds the mutex for the whole file operation, so the threads go through it one at a time. You could start with that.
Question
I want to know if it is possible to wait in the main-Thread without any while(1)-loop.
I launch a few threads via std::async() and do some number calculations on each thread. After I start the threads, I want to receive the results back. I do that with std::future<>::get().
My problem
To receive a result I call std::future::get(), which blocks the main thread until the calculation on that thread is done. This slows execution down if one thread needs considerably more time than the following ones: in the meantime I could already do some calculation with the finished results, and then do any further calculation once the slowest thread is done.
Is there a way to idle the main thread until ANY of the threads has finished running? I have thought of a callback function which wakes the main thread up, but I still don't know how to idle the main function without either making it unresponsive for, say, a second or running a while(true) loop.
Current code
#include <cstdint>
#include <iostream>
#include <future>
uint64_t calc_factorial(int start, int number);
int main()
{
uint64_t n = 1;
//The user entered number
uint64_t number = 0;
// get the user input
std::cout << "Enter number (uint64_t): ";
std::cin >> number;
std::future<uint64_t> results[4];
for (int i = 0; i < 4; i++)
{
// push to different cores
results[i] = std::async(std::launch::async, calc_factorial, i + 2, number);
}
for (int i = 0; i < 4; i++)
{
//retrieve result...I don't want to wait here if one threads needs more time than usual
n *= results[i].get();
}
// print n or the time needed
return 0;
}
uint64_t calc_factorial(int start, int number)
{
uint64_t n = 1;
for (int i = start; i <= number; i+=4) n *= i;
return n;
}
I prepared a code snippet which runs fine, I am using the GMP Lib for the big results, but the code runs with uint64_t instead if you enter small numbers.
Note
If you have compiled the GMP library for whatever reason on your PC already you could replace every uint64_t with mpz_class
I'd approach this somewhat differently.
Unless I have a fairly specific reason to do otherwise, I tend to approach most multithreaded code the same general way: use a (thread-safe) queue to transmit results. So create an instance of a thread-safe queue, and pass a reference to it to each of the threads that's going to generate the data. Then have whatever thread is going to collect the results grab them from the queue.
This makes it automatic (and trivial) that you process each result as it's produced, rather than getting stuck waiting for one result while others are already available.
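A minimal sketch of that idea, applied to the factorial example from the question, might look like this. ResultQueue, partialProduct and factorialViaQueue are hypothetical names chosen for illustration, and plain std::thread is used instead of std::async.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal thread-safe queue: producers push, the consumer blocks in pop().
template <typename T>
class ResultQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<T> items;

public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m);
            items.push(std::move(value));
        }
        cv.notify_one();
    }

    // Sleeps (no busy loop) until at least one result is available.
    T pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !items.empty(); });
        T value = std::move(items.front());
        items.pop();
        return value;
    }
};

// Same partial product as calc_factorial in the question.
std::uint64_t partialProduct(int start, int number) {
    std::uint64_t n = 1;
    for (int i = start; i <= number; i += 4) n *= i;
    return n;
}

std::uint64_t factorialViaQueue(int number) {
    ResultQueue<std::uint64_t> results;
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&results, i, number] {
            results.push(partialProduct(i + 2, number));
        });

    std::uint64_t n = 1;
    for (int i = 0; i < 4; ++i)
        n *= results.pop();  // results arrive in completion order, not start order

    for (auto& w : workers) w.join();
    return n;
}
```

The main thread sleeps inside pop() until any worker has delivered a result, so there is no polling loop and a slow worker does not hold up results that are already finished.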
So I'm trying to create a program that implements a function that generates a random number (n) and based on n, creates n threads. The main thread is responsible to print the minimum and maximum of the leafs. The depth of hierarchy with the Main thread is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>
using namespace std;
// a structure to keep the needed information of each thread
struct ThreadInfo
{
long randomN;
int level;
bool run;
int maxOfVals;
double minOfVals;
};
// The start address (function) of the threads
void ChildWork(void* a) {
ThreadInfo* info = (ThreadInfo*)a;
// Generate random value n
srand(time(NULL));
double n=rand()%6+1;
// initialize the thread info with n value
info->randomN=n;
info->maxOfVals=n;
info->minOfVals=n;
// the depth of recursion should not be more than 3
if(info->level > 3)
{
info->run = false;
}
// Create n threads and run them
ThreadInfo* childInfo = new ThreadInfo[(int)n];
for(int i = 0; i < n; i++)
{
childInfo[i].level = info->level + 1;
childInfo[i].run = true;
std::thread tt(ChildWork, &childInfo[i]) ;
tt.detach();
}
// checks if any child threads are working
bool anyRun = true;
while(anyRun)
{
anyRun = false;
for(int i = 0; i < n; i++)
{
anyRun = anyRun || childInfo[i].run;
}
}
// once all child threads are done, we find their max and min value
double maximum=1, minimum=6;
for( int i=0;i<n;i++)
{
// cout<<childInfo[i].maxOfVals<<endl;
if(childInfo[i].maxOfVals>=maximum)
maximum=childInfo[i].maxOfVals;
if(childInfo[i].minOfVals< minimum)
minimum=childInfo[i].minOfVals;
}
info->maxOfVals=maximum;
info->minOfVals=minimum;
// we set the info->run value to false, so that the parent thread of this thread will know that it is done
info->run = false;
}
int main()
{
ThreadInfo info;
srand(time(NULL));
double n=rand()%6+1;
cout<<"n is: "<<n<<endl;
// initializing thread info
info.randomN=n;
info.maxOfVals=n;
info.minOfVals=n;
info.level = 1;
info.run = true;
std::thread t(ChildWork, &info) ;
t.join();
while(info.run);
info.maxOfVals= max<unsigned long>(info.randomN,info.maxOfVals);
info.minOfVals= min<unsigned long>(info.randomN,info.minOfVals);
cout << "Max is: " << info.maxOfVals <<" and Min is: "<<info.minOfVals;
}
The code compiles with no error, but when I execute it, it gives me this :
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: thread constructor failed: Resource temporarily unavailable
Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function void ChildWork I see two mistakes:
As someone already pointed out in the comments, you check the level of a thread, but the check only sets run to false and does not return, so you go on to create more threads regardless of the result of the check.
Within the for loop that spawns your new threads, you increment the info level right before you spawn the actual thread. However you increment a freshly created instance of ThreadInfo here ThreadInfo* childInfo = new ThreadInfo[(int)n]. All instances within childInfo hold a level of 0. Basically the level of each thread you spawn is 1.
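A minimal sketch of a bounded version is shown below, assuming the intent is that nodes at the depth limit are leaves and every other node merges the minimum and maximum reported by its children. It returns as soon as the depth limit is reached, joins its children instead of detaching and polling a flag, and uses a per-thread generator rather than rand(). The names MinMax, childWork and leafMinMax, and the seed parameter, are hypothetical.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <thread>
#include <vector>

struct MinMax {
    int minVal;
    int maxVal;
};

// Each node draws n in [1, 6]. Nodes above the depth limit spawn n children
// and merge their results; nodes at the limit are leaves and report n itself.
void childWork(int level, int maxLevel, unsigned seed, MinMax& out) {
    std::mt19937 rng(seed);  // per-thread generator, so threads do not share state
    std::uniform_int_distribution<int> dist(1, 6);
    const int n = dist(rng);
    out = MinMax{n, n};

    if (level >= maxLevel)
        return;  // depth limit reached: stop before creating more threads

    std::vector<MinMax> results(n);
    std::vector<std::thread> children;
    for (int i = 0; i < n; ++i)
        children.emplace_back(childWork, level + 1, maxLevel,
                              static_cast<unsigned>(rng()), std::ref(results[i]));
    for (auto& c : children)
        c.join();  // wait for each child instead of polling a run flag

    out = results[0];
    for (const MinMax& r : results) {
        out.minVal = std::min(out.minVal, r.minVal);
        out.maxVal = std::max(out.maxVal, r.maxVal);
    }
}

// Runs the whole hierarchy from one root thread and returns the leaf min/max.
MinMax leafMinMax(unsigned seed, int maxLevel = 3) {
    MinMax result{0, 0};
    std::thread root(childWork, 1, maxLevel, seed, std::ref(result));
    root.join();
    return result;
}
```

With a depth limit of 3 and at most 6 children per node, this creates at most 1 + 6 + 36 = 43 threads, so the thread constructor no longer runs out of resources.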
In general avoid using threads to achieve concurrency for I/O bound operations (*). Just use threads to achieve concurrency for independent CPU bound operations. As a rule of thumb you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event based system to run pseudo concurrent I/O operations. You do not need any threading to do so. For example a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is ok to have some threads which could be theoretically avoided.
Multithreading is still rocket science in 2019. Especially in C++. Do not do it unless you know exactly what you are doing. Here is a good series of blog posts that handle threads.
I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops through the lines of pixel data, then each pixel, and then the multiple images. It's a weird project, but anyway: I bind each thread to a method of the same class instance this code is in, so "this" is used. This runs through a population of about 20 images, creating a thread for each as it goes, and when the inner loop is done, join_all waits until the threads have finished. Then it goes to the next pixel and starts over again.
I've tested running 50 threads at the same time with this simple program:
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)
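As a rough illustration of "one task per scan-line on a fixed number of threads", the sketch below starts about as many workers as there are logical cores and lets each of them repeatedly claim the next unprocessed line. It uses std::thread rather than boost::thread, and forEachLine and its work callback are hypothetical names; the callback stands in for whatever is done with all pixels and images of one line.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run work(line) for every line in [0, lineCount) on a fixed set of threads.
void forEachLine(std::size_t lineCount,
                 const std::function<void(std::size_t)>& work) {
    // hardware_concurrency() may return 0 when the value is unknown.
    std::size_t threadCount =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    threadCount = std::min(threadCount, std::max<std::size_t>(1, lineCount));

    std::atomic<std::size_t> nextLine{0};
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([&] {
            // Each worker keeps claiming the next unprocessed line until none are left.
            for (std::size_t line = nextLine.fetch_add(1); line < lineCount;
                 line = nextLine.fetch_add(1))
                work(line);
        });
    }
    for (auto& w : workers) w.join();  // a single join at the very end
}
```

The threads are created once and joined once, so the cost of starting a thread is paid a handful of times instead of once per pixel and image. The callback must only touch data belonging to its own line, or protect anything it shares.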
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs, because joining pauses the creation of any new threads until ALL threads being synchronized (in this case, all active threads) have finished running.
If the iterations of the innermost loop (the one with im) do not depend on each other, I would suggest you join the threads after the entire outermost loop is done.
I tried to use pthreads to do some tasks faster. I have thousands of files (in args) to process, and I want to create just a small number of threads at a time, many times over.
Here's my code :
void callThread(){
int nbt = 0;
pthread_t *vp = (pthread_t*)malloc(sizeof(pthread_t)*NBTHREAD);
for(int i=0;i<args.size();i+=NBTHREAD){
for(int j=0;j<NBTHREAD;j++){
if(i+j<args.size()){
pthread_create(&vp[j],NULL,calcul,&args[i+j]);
nbt++;
}
}
for(int k=0;k<nbt;k++){
if(pthread_join(vp[k], NULL)){
cout<<"ERROR pthread_join()"<<endl;
}
}
}
}
It returns an error, and I don't know if this is a good way to solve my problem. All the resources are in args (a vector of structs) and are independent.
Thanks for help.
You're better off making a thread pool with as many threads as the CPU has cores. Then feed the tasks to this pool and let it do its job. You should take a look at this blog post right here for a great example of how to go about creating such a thread pool.
A couple of tips that are not mentioned in that post:
Use std::thread::hardware_concurrency() to get the number of cores.
Figure out a way to store the tasks. Hint: std::packaged_task or something along those lines, wrapped in a class so you can track things such as when a task is done, or implement task.join().
Also, github with the code of his implementation plus some extra stuff such as std::future support can be found here.
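To make the tips concrete, here is a minimal thread pool sketch built on std::packaged_task and std::future. It is not the implementation from the linked post; the ThreadPool class and its submit member are hypothetical names, and error handling is left out.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;

public:
    explicit ThreadPool(unsigned count = std::thread::hardware_concurrency()) {
        if (count == 0) count = 1;  // hardware_concurrency() may return 0
        for (unsigned i = 0; i < count; ++i)
            workers.emplace_back([this] { workerLoop(); });
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m);
            stopping = true;
        }
        cv.notify_all();
        for (auto& w : workers) w.join();  // queued tasks are finished first
    }

    // Queue a callable and return a future for its result.
    template <typename F>
    auto submit(F f) -> std::future<decltype(f())> {
        using R = decltype(f());
        // packaged_task is move-only, so it is shared to fit into std::function.
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push([task] { (*task)(); });
        }
        cv.notify_one();
        return result;
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                if (tasks.empty()) return;  // only reached while stopping
                job = std::move(tasks.front());
                tasks.pop();
            }
            job();  // run the task outside the lock
        }
    }
};
```

For the file-processing case, one task per entry of args could be submitted, for example pool.submit([&arg] { return process(arg); }) with a hypothetical process function, and the returned futures collected afterwards; calling get() on a future plays the role of task.join().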
You can use a semaphore to limit the number of parallel threads. Here is some pseudocode:
Semaphore S = MAX_THREADS_AT_A_TIME // Initial semaphore value
declare handle_array[NUM_ITERS];
for(i=0 to NUM_ITERS)
{
wait-while(S<=0);
Acquire-Semaphore; // S--
handle_array[i] = Run-Thread(MyThread);
}
for(i=0 to NUM_ITERS)
{
Join_thread(handle_array[i])
Close_handle(handle_array[i])
}
MyThread()
{
mutex.lock
critical-section
mutex.unlock
release-semaphore // S++
}
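In C++ the same idea could be sketched as follows. This is only an illustration: the Semaphore class is a small hand-written counting semaphore built from a mutex and a condition variable (so it does not depend on C++20), std::thread replaces the handle-based API of the pseudocode, and runLimited is a hypothetical helper whose "work" is just a placeholder.

```cpp
#include <algorithm>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Small counting semaphore built from a mutex and a condition variable.
class Semaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;

public:
    explicit Semaphore(int initial) : count(initial) {}

    void acquire() {  // S--, blocking while S == 0
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return count > 0; });
        --count;
    }

    void release() {  // S++
        {
            std::lock_guard<std::mutex> lock(m);
            ++count;
        }
        cv.notify_one();
    }
};

// Start `iterations` threads, with at most `maxAtATime` doing work at once.
// Returns the highest number of threads observed working together.
int runLimited(int iterations, int maxAtATime) {
    Semaphore slots(maxAtATime);
    std::mutex statsMutex;
    int running = 0, peak = 0;
    std::vector<std::thread> handles;

    for (int i = 0; i < iterations; ++i) {
        slots.acquire();  // wait for a free slot before starting a thread
        handles.emplace_back([&] {
            {
                std::lock_guard<std::mutex> lock(statsMutex);
                peak = std::max(peak, ++running);
            }
            std::this_thread::yield();  // placeholder for the real work
            {
                std::lock_guard<std::mutex> lock(statsMutex);
                --running;
            }
            slots.release();  // free the slot for the next thread
        });
    }
    for (auto& h : handles) h.join();
    return peak;
}
```

Waiting and decrementing happen in one step inside acquire(), so the separate wait-while(S<=0) check from the pseudocode is not needed. Note that this still creates one thread per iteration; it only bounds how many run at the same time.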