I have a Linux application on an i7 using boost::thread. Currently I have about 10 threads (spread between the 8 cores) running simultaneously, doing image processing on images ranging from roughly 240 x 180 to 960 x 720 pixels, so naturally the smaller images finish faster than the larger ones. At some point in the future I may need to bump up the number of threads, so there will definitely be many more threads than cores.
So, is there a rule of thumb for deciding in which order to start and wait for threads: fastest first, to get the small tasks out of the way, or slowest first, so the longest tasks start sooner and therefore finish sooner? Or is my synchronisation code wonky?
For reference, my code is approximately this, where the lower-numbered threads are slower:
Globals
static boost::mutex mutexes[MAX_THREADS];
static boost::condition_variable_any conditions[MAX_THREADS];
Main thread
// Wake threads
for (int thread = 0; thread < MAX_THREADS; thread++)
{
    mutexes[thread].unlock();
    conditions[thread].notify_all();
}

// Wait for them to sleep again
for (int thread = 0; thread < MAX_THREADS; thread++)
{
    boost::unique_lock<boost::mutex> lock(mutexes[thread]);
}
Processing thread
static void threadFunction(int threadIdx)
{
    while (true)
    {
        boost::unique_lock<boost::mutex> lock(mutexes[threadIdx]);
        // DO PROCESSING
        conditions[threadIdx].notify_all();
        conditions[threadIdx].wait(lock);
    }
}
Thanks to the hints from the commenters and much Googling, I've completely reworked my code, which now seems to be slightly faster without the mutexes.
Globals
// None now
Main thread
boost::asio::io_service ioService;
boost::thread_group threadpool;
{
    boost::asio::io_service::work work(ioService);
    for (size_t i = 0; i < boost::thread::hardware_concurrency(); i++)
        threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));
    for (int thread = 0; thread < MAX_THREADS; thread++)
        ioService.post(boost::bind(threadFunction, thread));
}
threadpool.join_all();
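(The scoped work object is what keeps io_service::run() from returning while the queue is still being filled; destroying it at the closing brace lets the pool drain the queue so that join_all() can complete.)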
Processing thread
static void threadFunction(int threadIdx)
{
    // DO PROCESSING
}
(I've made this a Community Wiki as it's not really my answer.)
Related
#include <iostream>
#include <cstdio>
#include <thread>
#include <unistd.h>

using namespace std;

void taskA() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskA: %d\n", i*i);
        fflush(stdout);
    }
}

void taskB() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskB: %d\n", i*i);
        fflush(stdout);
    }
}

void taskC() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskC: %d\n", i*i);
        fflush(stdout);
    }
}

void taskD() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskD: %d\n", i*i);
        fflush(stdout);
    }
}

void taskE() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskE: %d\n", i*i);
        fflush(stdout);
    }
}

void taskF() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskF: %d\n", i*i);
        fflush(stdout);
    }
}

void taskG() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskG: %d\n", i*i);
        fflush(stdout);
    }
}

void taskH() {
    for (int i = 0; i < 10; ++i) {
        sleep(1);
        printf("TaskH: %d\n", i*i);
        fflush(stdout);
    }
}

int main(void) {
    thread t1(taskA);
    thread t2(taskB);
    thread t3(taskC);
    thread t4(taskD);
    thread t5(taskE);
    thread t6(taskF);
    thread t7(taskG);
    thread t8(taskH);

    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    t7.join();
    t8.join();

    return 0;
}
In this C++ program, I created 8 similar functions (taskA to taskH) and 8 threads, one for each. When I executed it, I got the output of all 8 functions apparently in parallel. But my laptop has only 4 cores.
So how is this happening? 4 cores running 8 threads in parallel? I don't understand it! Please explain what's happening inside.
Thanks for your explanation!
Each core can run 1 thread at a time, or 2 threads in parallel if hyperthreading is enabled. So, on a system with 4 cores installed, there can be 4 or 8 threads running in parallel, max.
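As an aside, you can query this limit portably from C++; here is a minimal check using std::thread::hardware_concurrency(), which counts logical cores, so hyperthreaded machines report double the physical count:

#include <iostream>
#include <thread>

int main() {
    // Number of concurrent threads the hardware supports;
    // may return 0 if the value cannot be determined.
    std::cout << std::thread::hardware_concurrency() << " logical cores\n";
    return 0;
}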
However, even so, your app is not the only one running threads. Every running process has at least 1 thread, maybe more, and the OS itself has dozens, maybe hundreds, of threads running. So in total there are far more threads than installed cores.
So, the OS has a built-in scheduler that is actively scheduling all of these running threads in such a way that cores will switch between threads at regular intervals, known as "time slices". This scheduling process is commonly known as "task switching".
This means that when a time slice on a core elapses, the core temporarily pauses the thread currently running on it, saves that thread's state, and resumes an earlier paused thread for the next time slice; then it pauses and saves that one, switches to another, and so on, dozens or hundreds of times a second, spread across however many cores are installed.
Most systems are not real-time, so true parallel processing beyond the number of cores is just an illusion: a lot of switching between threads as core time slices become available.
That is it, in a nutshell. Obviously, things are more complex in practical use. There are lengthy articles, research papers, even books, on this topic, if you really want to know the gritty implementation details.
A bit about human physiology. Every person in the world has this thing called "reaction time": the time between an event actually occurring and your brain registering and reacting to it. It varies between people, but it is extremely rare for a human to have a reaction time below 100 ms, and below 10 ms is unheard of. This means that if two events happen one after another, but the gap between them is less than 10 ms, a human will think these events happened simultaneously.
What does this have to do with threading and cores? A CPU can only run N threads in parallel, where N is the number of cores (well, technically it is more complicated due to features like hyperthreading, but let's forget about that for now). So when you fire up, say, 10*N threads, they cannot all run in parallel. It is technically impossible. What actually happens is that the operating system has an internal piece of code called the scheduler, which controls which thread runs on which core at any given moment. It jumps from one thread to another, so that every thread gets some CPU time and can actually make progress.
But you say "outputs of all the functions are coming at the same time". No, they don't. The CPU processes billions of instructions per second, i.e. one or more instructions every billionth of a second. The exact rate depends on what the CPU is doing; printing to the screen, for example, takes much longer, but it can still print thousands or tens of thousands of characters in well under 10 ms. Since that is below your reaction time, you only think it happened in parallel, while in reality it did not.
and I am also using a sleep of 1sec
Sleep is not an action; it is a lack of action, and as such requires no CPU. The operations of "falling asleep" and "waking up" require some CPU (though very little), but not the waiting itself. And indeed, the waiting does happen truly in parallel, regardless of how many threads you have.
I am trying to synchronize a function I am parallelizing with pthreads.
The issue is, I am having a deadlock because a thread will exit the function while other threads are still waiting for the thread that exited to reach the barrier. I am unsure whether the pthread_barrier structure takes care of this. Here is an example:
static pthread_barrier_t barrier;
static void* foo(void* arg) {
    for (int i = beg; i < end; i++) {
        if (i > 0) {
            pthread_barrier_wait(&barrier);
        }
    }
    return NULL;
}
int main() {
    // create pthread barrier
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);

    // create thread handles
    //...

    // create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&thread_handles[i], NULL, &foo, (void*) i);
    }

    // join the threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(thread_handles[i], NULL);
    }
}
Here is a solution I tried for foo, but it didn't work (note NUM_THREADS_COPY is a copy of the NUM_THREADS constant, and is decremented whenever a thread reaches the end of the function):
static void* foo(void* arg) {
    for (int i = beg; i < end; i++) {
        if (i > 0) {
            pthread_barrier_wait(&barrier);
        }
    }
    pthread_barrier_init(&barrier, NULL, --NUM_THREADS_COPY);
    return NULL;
}
Is there a solution to updating the number of threads to wait in a barrier for when a thread exits a function?
You need to decide how many threads it will take to pass the barrier before any threads arrive at it. Undefined behavior results from re-initializing the barrier while there are threads waiting at it. Among the plausible manifestations are that some of the waiting threads are prematurely released or that some of the waiting threads never get released, but those are by no means the only unwanted things that could happen. In any case ...
Is there a solution to updating the number of threads to wait in a barrier for when a thread exits a function?
... no, pthreads barriers do not support that.
Since a barrier seems not to be flexible enough for your needs, you probably want to fall back to the general-purpose thread synchronization object: a condition variable (used together with a mutex and some kind of shared variable).
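For illustration, here is a minimal sketch of such a "flexible barrier" built from a mutex, a condition variable, and two shared counters. The names (flex_barrier, fb_wait, fb_leave) are hypothetical, not part of pthreads: a thread that is finished calls fb_leave(), which lowers the number of participants the remaining threads wait for.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int participants;  // threads still taking part in the barrier
    int arrived;       // threads waiting at the current cycle
    unsigned cycle;    // generation counter, guards against spurious wakeups
} flex_barrier;
// initialise once, before the threads start, e.g.:
// flex_barrier b = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
//                    NUM_THREADS, 0, 0 };

void fb_wait(flex_barrier* b) {
    pthread_mutex_lock(&b->lock);
    unsigned my_cycle = b->cycle;
    if (++b->arrived == b->participants) {  // last one in releases everybody
        b->arrived = 0;
        b->cycle++;
        pthread_cond_broadcast(&b->cv);
    } else {
        while (my_cycle == b->cycle)
            pthread_cond_wait(&b->cv, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

void fb_leave(flex_barrier* b) {
    pthread_mutex_lock(&b->lock);
    b->participants--;
    if (b->arrived == b->participants && b->participants > 0) {
        // the leaver was the only thread the others were waiting for
        b->arrived = 0;
        b->cycle++;
        pthread_cond_broadcast(&b->cv);
    }
    pthread_mutex_unlock(&b->lock);
}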
I have created 4 threads with pthread_create. I want them to start running at the very same time, so I added sem_wait(&sem) at the very beginning of the thread procedure. In the main thread I might use something like this, but I don't think it is a good solution:
for (int i = 0; i < 4; i++)
{
    sem_post(&sem);
}
I googled and found pthread_cond_t. However, pthread_cond_broadcast can only wake up threads that are currently waiting. Even if I put pthread_cond_wait at the very beginning of the procedure, it is still not guaranteed that pthread_cond_wait is called before pthread_cond_broadcast (in the main thread).
To avoid this, I would have to add lots of additional code to ensure the calling sequence of wait and broadcast, which is also not elegant.
So, is there a simple way to "line up" all the threads (make them start running at the same time)?
There seems to be sem_post_multiple, but it is a Win32 extension in pthreads-win32. I am using Linux (Android), however.
You are searching for a barrier:
pthread_barrier_t
You initialize it with the number of threads (n), and then every thread calls pthread_barrier_wait(). That call blocks execution until n threads have reached the barrier.
example:
#include <pthread.h>

int num_threads = 4;
pthread_barrier_t bar;

void* thread_start(void* arg) {
    pthread_barrier_wait(&bar);  // block until all num_threads threads arrive
    //...
    return NULL;
}

int main() {
    pthread_barrier_init(&bar, NULL, num_threads);
    pthread_t thread[num_threads];
    for (int i = 0; i < num_threads; i++) {
        pthread_create(thread + i, NULL, &thread_start, NULL);
    }
    for (int i = 0; i < num_threads; i++) {
        pthread_join(thread[i], NULL);
    }
    pthread_barrier_destroy(&bar);
    return 0;
}
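As an aside, pthread_barrier_wait() returns PTHREAD_BARRIER_SERIAL_THREAD to exactly one (unspecified) thread and 0 to the others, which is handy if exactly one thread should perform some setup once the whole group has arrived.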
I am trying to limit the number of threads running at any time to at most the number of cores available. Is the following a reasonable method? Is there a better alternative? Thanks!
boost::thread_group threads;
iThreads = 0;
for (int i = 0; i < Utility::nIterations; i++)
{
    threads.create_thread(
        boost::bind(&ScenarioInventory::BuildInventoryWorker, this, i));
    thread_limiter.lock();
    iThreads++;
    thread_limiter.unlock();
    while (iThreads > nCores)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
threads.join_all();

void ScenarioInventory::BuildInventoryWorker(int i)
{
    //code code code....
    thread_limiter.lock();
    iThreads--;
    thread_limiter.unlock();
}
What you are likely looking for is a thread pool with a task queue.
Have a fixed number of threads blocking on a queue. Whenever a task is pushed onto the queue, a worker thread gets signalled (condition variable) and processes the task.
That way you
don't have the (inefficient) waiting lock
don't have any more threads than the "maximum"
don't have to block in the code that pushes a task
don't have redundant creation of threads each time around
See this answer for two different demos of such a thread pool w/ task queue: Calculating the sum of a large vector in parallel
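For a sense of the overall shape, here is a minimal sketch of such a pool in C++11; the class and member names are illustrative, not a library API:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool stopping = false;

public:
    explicit ThreadPool(size_t n) {
        for (size_t i = 0; i < n; ++i)
            workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(m);
                        // sleep until there is work or we are shutting down
                        cv.wait(lock, [this] { return stopping || !tasks.empty(); });
                        if (stopping && tasks.empty()) return;
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    task();  // run outside the lock
                }
            });
    }

    void post(std::function<void()> f) {
        { std::lock_guard<std::mutex> lock(m); tasks.push(std::move(f)); }
        cv.notify_one();  // wake exactly one blocked worker
    }

    ~ThreadPool() {
        { std::lock_guard<std::mutex> lock(m); stopping = true; }
        cv.notify_all();
        for (auto& t : workers) t.join();
    }
};

Producing code then just calls post() with a callable; the destructor drains the remaining tasks and joins the workers.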
I was successfully testing an example using boost's io_service:
for (x = 0; x < loops; x++)
{
    // Add work to ioService.
    for (i = 0; i < number_of_threads; i++)
    {
        ioService.post(boost::bind(worker_task, data, pre_data[i]));
    }
    // Now that the ioService has work, use a pool of threads to service it.
    for (i = 0; i < number_of_threads; i++)
    {
        threadpool.create_thread(boost::bind(
            &boost::asio::io_service::run, &ioService));
    }
    // threads in the threadpool will be completed and can be joined.
    threadpool.join_all();
}
This loops several times, and it takes rather long because the threads are created anew on each iteration.
Is there a way to create all the needed threads once, then post the work for each thread inside the loop, and after each batch of work wait until all threads have finished?
Something like this:
// start/create threads
for (i = 0; i < number_of_threads; i++)
{
    threadpool.create_thread(boost::bind(
        &boost::asio::io_service::run, &ioService));
}

for (x = 0; x < loops; x++)
{
    // Add work to ioService.
    for (i = 0; i < number_of_threads; i++)
    {
        ioService.post(boost::bind(worker_task, data, pre_data[i]));
    }
    // threads in the threadpool will be completed and can be joined.
    threadpool.join_all();
}
The problem here is that your worker threads will finish immediately after creation, since there is no work to be done. io_service::run() will just return right away, so unless you manage to sneak in one of the post-calls before all worker threads have had an opportunity to call run(), they will all finish right away.
Two ways to fix this:
Use a barrier to stop the workers from calling run() right away. Only unblock them once the work has been posted.
Use an io_service::work object to prevent run from returning. You can destroy the work object once you posted everything (and must do so before attempting to join the workers again).
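A minimal sketch of the second approach, reusing the names from the question (ioService, threadpool, worker_task, data, pre_data):

{
    // Keeps run() from returning while the queue is still being filled.
    boost::asio::io_service::work work(ioService);

    // Start the pool once, up front.
    for (i = 0; i < number_of_threads; i++)
        threadpool.create_thread(boost::bind(
            &boost::asio::io_service::run, &ioService));

    // Post all of the work, for all loops.
    for (x = 0; x < loops; x++)
        for (i = 0; i < number_of_threads; i++)
            ioService.post(boost::bind(worker_task, data, pre_data[i]));

}   // 'work' destroyed here: run() may return once the queue drains

threadpool.join_all();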
The loop wasn't really useful.
Here is a better example showing how it works.
I get the data in a callback:
void worker_task(uint8_t* data, uint32_t len)
{
    uint32_t pos = 0;
    while (pos < len)
    {
        pos += process_data(data + pos);
    }
}

void callback_f(uint8_t* data, uint32_t len)
{
    // split data into parts
    uint32_t number_of_data_per_thread = len / number_of_threads;

    // Add work to ioService.
    uint32_t x = 0;
    for (i = 0; i < number_of_threads; i++)
    {
        ioService.post(boost::bind(worker_task, data + x, number_of_data_per_thread));
        x += number_of_data_per_thread;
    }

    // Now that the ioService has work, use a pool of threads to service it.
    for (i = 0; i < number_of_threads; i++)
    {
        threadpool.create_thread(boost::bind(
            &boost::asio::io_service::run, &ioService));
    }

    // threads in the threadpool will be completed and can be joined.
    threadpool.join_all();
}
This callback gets called very rapidly by the host application (a media stream). If the incoming len is big enough, the thread pool makes sense, because the processing time is longer than the time it takes to create and start the threads.
If the len of the data is small, the advantage of the thread pool is lost, because creating and starting the threads takes more time than processing the data itself.
My question is now whether it is possible to have the threads already running and waiting for data. When the callback gets called, push the data to the threads and wait for them to finish. The number of threads is constant (the CPU count).
And because it is a callback from a host application, the pointer to data is only valid while inside the callback function. This is why I have to wait until all threads have finished their work.
A thread can start working immediately after getting its data, even before the other threads have started. There is no sync problem, because every thread has its own memory area of the data.
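One way to get that behaviour, sketched with the same Boost pieces as above: worker_task, callback_f, ioService, threadpool, and number_of_threads come from the code above, while worker_wrapper, pending, and batch_done are illustrative additions. The pool and a long-lived work object are created once at start-up; the callback then only posts the chunks and blocks on a counter until the whole batch has been processed.

// Set up once at start-up, not per callback:
//   static boost::asio::io_service::work keep_alive(ioService);
//   for (unsigned i = 0; i < number_of_threads; i++)
//       threadpool.create_thread(boost::bind(&boost::asio::io_service::run, &ioService));

static boost::mutex mtx;
static boost::condition_variable batch_done;
static unsigned pending = 0;  // chunks of the current batch still in flight

static void worker_wrapper(uint8_t* data, uint32_t len)
{
    worker_task(data, len);  // the existing processing function
    boost::lock_guard<boost::mutex> lock(mtx);
    if (--pending == 0)
        batch_done.notify_one();  // last finished chunk wakes the callback
}

void callback_f(uint8_t* data, uint32_t len)
{
    uint32_t number_of_data_per_thread = len / number_of_threads;
    {
        boost::lock_guard<boost::mutex> lock(mtx);
        pending = number_of_threads;
    }
    uint32_t x = 0;
    for (unsigned i = 0; i < number_of_threads; i++)
    {
        ioService.post(boost::bind(worker_wrapper, data + x, number_of_data_per_thread));
        x += number_of_data_per_thread;
    }
    // 'data' is only valid inside this callback, so block here until
    // every chunk has been processed before returning to the host.
    boost::unique_lock<boost::mutex> lock(mtx);
    while (pending != 0)
        batch_done.wait(lock);
}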