QThread always one thread left behind - c++

This is my thread:
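class calThread : public QThread
{
    Q_OBJECT
public:
    calThread(QList<int> number);
    ~calThread();
    void run();
    QList<int> cal(QList<int> input);
signals:
    void calFinished(QList<int> result);
private:
    QList<int> number; // input slice for this thread (declaration assumed; used in run())
    QList<int> output; // result of cal() (declaration assumed; used in run())
};
void calThread::run()
{
    output = cal(number);
    emit calFinished(output);
    sleep(1);
}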
This is how I call the thread:
calThread* worker3 = new calThread(numberList);
connect(worker3, SIGNAL(calFinished(QList<int>)), this, SLOT(handleResult(QList<int>)));
connect(worker3, SIGNAL(finished()), worker3, SLOT(deleteLater()));
worker3->start();
I have a large input list. I divide it into four equal-sized lists and give each one to its own thread to process. The threads are named worker0 to worker3.
Every time the program runs, the four threads start at about the same time, but one thread always finishes much later than the others. For example, the first three threads take about 2 minutes to finish, while the fourth takes maybe 5 minutes.
But all threads should have the same number of items, and each item has the same complexity to calculate.
Why is there always a thread left behind?
Debug output:
inputThread0 item numbers: 1736
inputThread1 item numbers: 1736
inputThread2 item numbers: 1736
inputThread3 item numbers: 1737
"20:29:58" Thread 0 Thread ID 0x7f07119df700
"20:29:58" Thread 3 Thread ID 0x7f06fc1d3700
"20:29:58" Thread 1 Thread ID 0x7f06fd1d5700
"20:29:58" Thread 2 Thread ID 0x7f06fc9d4700
"20:29:58" Thread 0 Thread ID 0x7f07119df700
"20:29:58" Thread 1 Thread ID 0x7f06fd1d5700
….............................
//Most of them are Thread 0,1,2 afterward
….............................
"20:29:58" Thread 1 Thread ID 0x7f06fd1d5700
// This is last Thread from thread 0,1,or2
// It takes less than one second to finish
"20:29:59" Thread 3 Thread ID 0x7f06fc1d3700
"20:29:59" Thread 3 Thread ID 0x7f06fc1d3700
"20:29:59" Thread 3 Thread ID 0x7f06fc1d3700
"20:29:59" Thread 3 Thread ID 0x7f06fc1d3700
….................................
// Only Thread 3 left
"20:30:17" Thread 3 Thread ID 0x7f06fc1d3700
// This last thread takes 19 second to finish

"Why is there always a thread left behind?"
Why not? Thread scheduling is completely at the whim of the OS. There is no guarantee at all that any thread will get any sort of "fair share" of CPU resources. You need to assign small chunks of work and have them automatically distributed across the worker threads. QtConcurrent::run and the QtConcurrent framework in general offer a trivial way of getting that done. As long as the chunks of work passed to run, mappedReduced, etc. are reasonably sized (say, each takes between 0.1 and 1 s), all of the threads in the pool will finish within a couple tenths of a second of each other.
A partial explanation for your observed behavior is that a thread that already runs on a given core is more likely to be rescheduled on the same core to utilize the warm caches there. If three out of four cores run your threads almost continuously, the fourth thread often ends up sharing a core with your GUI thread, and will necessarily run slower if the GUI is not idle. If the GUI thread is busy processing the results from the other threads, it is not unexpected that the computation thread would be starved on that core. This is actually the most power- and time-efficient way to schedule the threads, with the least amount of overhead.
As long as you give the threads small chunks of work to do, and distribute them on an as-ready basis, as QtConcurrent does, it will also lead to the smallest wall-clock runtimes. If the scheduler were forcing "fair" reschedules, your long-running threads would all finish at roughly the same time, but would take more time and power to finish the job. Modern schedulers enable you to run your jobs most efficiently, but you must set the jobs up to take advantage of that.
In a way, your scheduler is helping you improve your code to be more resource-efficient. That's a good thing.
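To make the "small chunks, distributed on an as-ready basis" idea concrete, here is a minimal sketch. It deliberately uses only the C++ standard library rather than QtConcurrent, so it illustrates the scheduling pattern and not Qt's API. The helper name processInChunks, its parameters, and the chunk size are illustrative assumptions, not anything taken from the question.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Applies `work` to every element of `input`, handing out small chunks to
// `threadCount` workers on an as-ready basis.
// Hypothetical helper for illustration; this is not a Qt API.
std::vector<int> processInChunks(const std::vector<int>& input,
                                 const std::function<int(int)>& work,
                                 unsigned threadCount,
                                 std::size_t chunkSize)
{
    assert(threadCount > 0 && chunkSize > 0);
    std::vector<int> output(input.size());
    std::atomic<std::size_t> nextChunk{0};  // index of the next unclaimed chunk

    auto worker = [&]() {
        for (;;) {
            // Claim the next chunk; a thread that finishes early simply claims more.
            const std::size_t begin = nextChunk.fetch_add(1) * chunkSize;
            if (begin >= input.size())
                return;
            const std::size_t end = std::min(begin + chunkSize, input.size());
            for (std::size_t i = begin; i < end; ++i)
                output[i] = work(input[i]);  // each index is written by exactly one thread
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t)
        threads.emplace_back(worker);
    for (auto& th : threads)
        th.join();
    return output;
}
```

With this layout, a worker that is starved on a busy core falls behind by at most one chunk, instead of by a quarter of the whole list as in the fixed four-way split.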

Related

How to always have as many threads running simultaneously as the number of cores?

I have a vector of points. Each point is n-dimensional vector.
I have k=100 points.
For each point, I have to calculate the values of several functions of many variables.
After that I write the results to the txt file.
I have 4 cores. I use multithreading for this task. I want to have only
4 threads running simultaneously.
My threads are almost independent. They only share one resource. The txt file.
Each thread write the results at the end.
I use mutex to write to this file.
My current solution is:
for (int i = 0; i < k; i += number_of_threads) {
    // create 4 threads
    // start function for each 4 threads
    // join thread 1
    // join thread 2
    // join thread 3
    // join thread 4
}
The problem is that no new threads start until all four threads in the batch have finished.
This works, but I think there is a better solution.
How can I start a new thread as soon as one of the running threads finishes?
I also want only 4 threads running at the same time (4 is the number of cores).
I am using pthreads, by the way.

Thread pools and context switching slowdowns

I have a thread pool with idling threads that wait for jobs to be pushed to a queue, in a windows application.
I have a loop in my main application thread that adds 1000 jobs to the pool's queue sequentially (it adds a job, then waits for the job to finish, then adds another job, x1000). So no actual parallel processing is happening...here's some pseudocode:
////threadpool:
class ThreadPool
{
    ....
    std::condition_variable job_cv;
    std::condition_variable finished_cv;
    std::mutex job_mutex;
    std::queue<std::function <void(void)>> job_queue;
    void addJob(std::function <void(void)> jobfn)
    {
        std::unique_lock <std::mutex> lock(job_mutex);
        job_queue.emplace(std::move(jobfn));
        job_cv.notify_one();
    }
    void waitForJobToFinish()
    {
        std::unique_lock<std::mutex> lock(job_mutex);
        finished_cv.wait(lock, [this]() {return job_queue.empty(); });
    }
    ....
    void threadFunction() //called by each thread when it's first started
    {
        std::function <void(void)> job;
        while (true)
        {
            std::unique_lock <std::mutex> latch(job_mutex);
            job_cv.wait(latch, [this](){return !job_queue.empty();});
            {
                job = std::move(job_queue.front());
                job_queue.pop();
                latch.unlock();
                job();
                latch.lock();
                finished_cv.notify_one();
            }
        }
    }
};
...
////main application:
void jobfn()
{
    //do some lightweight calculation
}
int main()
{
    //test 1000 calls to the lightweight jobfn from the thread pool
    for (int q = 0; q < 1000; q++)
    {
        threadPool->addJob(&jobfn);
        threadPool->waitForJobToFinish();
    }
}
So basically what's happening is a job is added to the queue and the main loop begins to wait, a waiting thread then picks it up, and when the thread finishes, it notifies the application that the main loop can continue and another job can be added to the queue, etc. So that way 1000 jobs are processed sequentially.
It's worth noting that the jobs themselves are tiny and can complete in a few milliseconds.
However, I've noticed something strange....
The time it takes for the loop to complete is essentially O(n) where n is the number of threads in the thread pool. So even though jobs are processed one-at-a-time in all scenarios, a 10-thread pool takes 10x longer to complete the full 1000-job task than a 1-thread pool.
I'm trying to figure out why, and my only guess so far is that context switching is the bottleneck...maybe less (or zero?) context switching overhead is required when only 1 thread is grabbing jobs...but when 10 threads are continually taking their turn to process a single job at a time, there's some extra processing required? But that doesn't make sense to me...wouldn't it be the same operation required to unlock thread A for a job, as it would be thread B,C,D...? Is there some OS-level caching going on, where a thread doesn't lose context until a different thread is given it? So calling on the same thread over and over is faster than calling threads A,B,C sequentially?
But that's a complete guess at this point...maybe someone else could shed some insight on why I'm getting these results...Intuitively I assumed that so long as only 1 thread is executing at a time, I could have a thread pool with an arbitrarily large number of threads and the total task completion time for [x] jobs would be the same (so long as each job is identical and the total number of jobs is the same)...why is that wrong?
Your "guess" is correct; it's simply a resource contention issue.
Your 10 threads are not idle, they're waiting. This means that the OS has to iterate over the currently active threads for your application, which means a context switch likely occurs.
The active thread is pushed back and a "waiting" thread is pulled to the front, where its code checks whether the signal has been notified and the lock can be acquired. It likely can't within that thread's time slice, so the OS continues to iterate over the remaining threads, each of which tries and fails to acquire the lock because your "active" thread hasn't yet been allotted a time slice to complete.
A single-thread pool doesn't have this issue because no additional threads need to be iterated over at the OS level; granted, a single-thread pool is still slower than just calling job 1000 times.
Hope that can help.

Guaranteeing that all threads are woken up, each exactly once

I found a bug in my program: the same thread is sometimes woken twice, taking the opportunity away from another thread and causing unintended behaviour. My program requires that every waiting thread run exactly once per turn. The bug happens because I use a semaphore to make the threads wait. The semaphore is initialized with count 0; every thread calls down on it at the start of its infinite loop, and the main thread calls up NThreads (the number of threads) times in a for loop. Occasionally the same thread takes the up call twice and the problem arises.
What is the proper way to deal with this problem? Is using condition variables and broadcasting a way to do this? Will it guarantee that every thread is woken once and only once? What other good approaches are there?
On Windows, you could use WaitForMultipleObjects to select a ready thread from the threads that have not yet run in the current round of NThreads iterations.
Each thread should have a "ready" event to signal when it is ready, and a "wake" event to wait on after it has signaled its "ready" event.
At the start of your main thread loop (1st of NThreads iteration), call WaitForMultipleObjects with an array of your NThreads "ready" events.
Then set the "wake" event of the thread corresponding to the "ready" event returned by WaitForMultipleObjects, and remove that event from the array of "ready" handles. That guarantees that a thread that has already run won't be returned by WaitForMultipleObjects on the next iteration.
Repeat until the last iteration, where you will call WaitForMultipleObjects with an array of only 1 thread handle (I think this will work as if you called WaitForSingleObject).
Then repopulate the array of NThreads "ready" events for the next new Nthreads iterations.
Well, use an array of semaphores, one for each thread. If you want the array of threads to run once only, send one unit to each semaphore. If you want the threads to all run exactly N times, send N units to each semaphore.
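A minimal sketch of the one-semaphore-per-thread idea follows. It assumes standard C++ threads, and because the question does not say which semaphore API is in use, it builds a small Semaphore class from a mutex and a condition variable. The names Semaphore, runTurns, wake and done are all hypothetical. Since each worker waits only on its own semaphore, a unit sent to worker i cannot be taken by any other worker.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Minimal counting semaphore built from a mutex and a condition variable.
class Semaphore {
public:
    void up(int units = 1) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            count_ += units;
        }
        cv_.notify_all();
    }
    void down() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int count_ = 0;
};

// Runs `threadCount` workers for `turns` turns and returns how many times
// each worker ran. Hypothetical driver for illustration.
std::vector<int> runTurns(int threadCount, int turns)
{
    assert(threadCount > 0 && turns >= 0);
    std::vector<Semaphore> wake(threadCount);  // one semaphore per worker
    Semaphore done;                            // workers report back here
    std::vector<int> runs(threadCount, 0);

    std::vector<std::thread> workers;
    for (int id = 0; id < threadCount; ++id) {
        workers.emplace_back([&, id] {
            for (int t = 0; t < turns; ++t) {
                wake[id].down();  // only this worker can consume this unit
                ++runs[id];       // the per-turn work would go here
                done.up();
            }
        });
    }

    for (int t = 0; t < turns; ++t) {
        for (auto& s : wake) s.up();                        // one unit per worker
        for (int i = 0; i < threadCount; ++i) done.down();  // wait for the whole turn
    }
    for (auto& w : workers) w.join();
    return runs;
}
```

The extra done semaphore is a design choice of this sketch: it keeps the main thread from starting the next turn before every worker has finished the current one.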

Thread synchronization in qt

I have a program that has 3 threads. All of them receive data over Ethernet on different ports. The data may arrive at a different frequency for each of the 3 threads, but all of the incoming data must be processed at the same time.
So if data arrives for one thread, that thread must wait for the others' data to arrive. How can I achieve this?
Boost.Thread has a barrier class, whose purpose is to block multiple threads until a specified number have reached the barrier.
You would just create a boost::barrier initialized with 3, meaning that it blocks until three threads are waiting on the barrier. When each of your threads is done waiting for data, you have them call wait() on the barrier. When the third thread calls wait(), all three threads will continue execution.
boost::barrier barrier(3);
void thread_function()
{
    read_data();
    barrier.wait(); // Threads will block here until all three are ready.
    process_data();
}
If you only want one thread to process the data, you can check the return value of wait(); the function will only return true for one of the threads at the barrier.
You need a barrier. A barrier has a preset capacity N and blocks the first N-1 threads until the N-th arrives. After the N-th arrives, all N threads are released simultaneously.
Unfortunately Qt has no direct support for barriers, but there is a simple implementation using Qt primitives here: https://stackoverflow.com/a/9639624/1854587
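To give a rough idea of what such a barrier looks like inside, here is a sketch built from a mutex and a condition variable. It uses the C++ standard library rather than Qt primitives (QMutex and QWaitCondition would play the same roles), and it is not the code from the linked answer. The names SimpleBarrier and receiveThenProcess are hypothetical, and the "receive" and "process" steps are stand-ins.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: wait() blocks until `count` threads have called it.
class SimpleBarrier {
public:
    explicit SimpleBarrier(std::size_t count)
        : threshold_(count), remaining_(count) {
        assert(count > 0);
    }

    // Returns true for exactly one thread per cycle (the last one to arrive).
    bool wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t generation = generation_;
        if (--remaining_ == 0) {
            ++generation_;            // start a new cycle
            remaining_ = threshold_;
            cv_.notify_all();
            return true;
        }
        // Waiting on the generation counter guards against spurious wakeups.
        cv_.wait(lock, [&] { return generation != generation_; });
        return false;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const std::size_t threshold_;
    std::size_t remaining_;
    std::size_t generation_ = 0;
};

// Three threads "receive" data, meet at the barrier, then "process".
// Returns, per thread, how many receives had completed when it started processing.
std::vector<int> receiveThenProcess() {
    const int threadCount = 3;
    SimpleBarrier barrier(threadCount);
    std::atomic<int> received{0};
    std::vector<int> seen(threadCount, 0);
    std::vector<std::thread> threads;
    for (int id = 0; id < threadCount; ++id) {
        threads.emplace_back([&, id] {
            ++received;                  // stand-in for reading data from the network
            barrier.wait();              // block until all three threads have data
            seen[id] = received.load();  // stand-in for processing
        });
    }
    for (auto& t : threads) t.join();
    return seen;
}
```

Every entry of the returned vector should be 3, since no thread gets past the barrier until all three have received their data.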
Not as simple as boost's barrier as answered by @dauphic, but this can be done with Qt alone, using slots, signals and another class on a 4th thread.
Create a class on a separate thread that coordinates the other 3, the network threads can send a signal to the 'coordinator' class when they receive data. Once this coordinator class has received messages from all 3 network threads, it can then signal the threads to process the data.

Multi threading independent tasks

I have N tasks, which are independent (ie., write at different memory addresses) but don't take exactly the same time to complete (from 2 to 10 seconds, say). I have P threads.
I can divide my N tasks into P threads, and launch my threads. Ultimately, at the end, there will be a single thread remaining to complete the last few tasks, which is not optimal.
I can also launch P threads with 1 task each, WaitForMultipleObjects, and relaunch P threads etc. (that's what I currently do, as the overhead of creating threads is small compared to the task). However, this does not solve the problem either, there will still be P-1 threads waiting for the last one at some point.
Is there a way to launch threads, and as soon as the thread has finished its task, go on to the next available task until all tasks are completed ?
Thanks !
Yes, it's called thread pooling. It's a very common practice.
http://en.wikipedia.org/wiki/Thread_pool_pattern
Basically, you create a queue of tasks (function pointers with their arguments), and push the tasks there. You have N threads running which do the following loop (schematic code):
while (bRunning) {
    task = m_pQueue.pop();
    if (task) {
        executeTask(task);
    }
    else {
        //you can sleep a bit here if you want
    }
}
There are more elegant ways to implement it (avoiding sleeps, etc.) but this is the gist of it.
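One way to avoid the sleep is to have idle workers block on a condition variable until a task arrives. The sketch below shows that variant under stated assumptions: it uses standard C++ threads (the question itself uses WaitForMultipleObjects on Windows), and the class name TaskPool and its members are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class TaskPool {
public:
    explicit TaskPool(unsigned threadCount) {
        assert(threadCount > 0);
        for (unsigned i = 0; i < threadCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    // Lets the workers drain the queue, then joins them.
    ~TaskPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                // Block (no polling) until there is work or the pool is shutting down.
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty())
                    return;  // shutting down and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can fetch tasks
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;  // declared last: threads start after the rest is ready
};
```

With P workers and N pushed tasks, each worker takes the next available task as soon as it finishes its current one, so no thread sits idle while tasks remain in the queue.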