I am trying to build a generic task system where I can post tasks that get executed on whatever thread is free. With my previous attempt I often ran out of threads because they would block at some point. So I am trying Boost.Fiber: when one fiber blocks, the thread is free to work on some other fiber, which sounds perfect.
The work-stealing algorithm seems ideal for my purpose, but I am having a very hard time using it. In the example code the fibers get created first and only then the threads and schedulers, so all the fibers actually get executed on all the threads. But I want to start fibers later, and by then all the other threads are suspended indefinitely because they didn't have any work. I have not found any way to wake them up again; all my fibers get executed only on the main thread. "notify" seems to be the method to call, but I don't see any way to actually get at an instance of the algorithm.
I tried keeping pointers to all instances of the algorithm so I could call notify(), but that doesn't really help; most of the time the algorithms in the worker threads cannot steal anything from the main one because the next one is the dispatcher_context.
I could disable "suspend", but then the threads busy-wait, which is not an option.
I also tried the shared_work algorithm. Same problem: once a thread cannot find a fiber, it never wakes up again. I tried the same hack of manually calling notify(); same result, very unreliable.
I tried using the channels, but AFAICT, if a fiber is waiting for it, the current context just "hops" over and runs the waiting fiber, suspending the current one.
In short: I find it very hard to reliably run a fiber on another thread. When profiling, most threads are just waiting on a condition_variable, even though I created tons of fibers.
As a small testing case I am trying:
std::vector<boost::fibers::future<int>> v;
for (auto i = 0; i < 16; ++i)
    v.emplace_back(boost::fibers::async([i] {
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        return i;
    }));
int s = 0;
for (auto &f : v)
    s += f.get();
I am intentionally using this_thread::sleep_for to simulate the CPU being busy.
With 16 threads I would expect this code to run in 1s, but mostly it ends up taking 16s. I was able to get this specific example to run in 1s just by hacking around, but no approach felt "right" and none worked for other scenarios; it always had to be hand-crafted for one specific scenario.
I think this example should just work as expected with a work_stealing algorithm; what am I missing? Is it just a misuse of fibers? How could I implement this reliably?
Thanks,
Dix
Boost.Fiber contains an example using the work_stealing algorithm (examples/work_stealing.cpp).
You have to install the algorithm on each worker-thread that should handle/steal fibers.
boost::fibers::use_scheduling_algorithm< boost::fibers::algo::work_stealing >( 4); // 4 worker-threads
Before you process tasks/fibers, you have to wait till all worker threads have been registered with the algorithm. The example uses a barrier for this purpose.
You need an indication that all work/tasks have been processed, for instance using a condition variable.
Take a look at Running with worker threads (boost documentation).
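A minimal sketch of that setup, modeled on examples/work_stealing.cpp. The thread_barrier class, the done-flag bookkeeping, and the thread count are illustrative, not part of Boost.Fiber; use_scheduling_algorithm, algo::work_stealing, and the fiber mutex/condition_variable/async are real API:

#include <boost/fiber/all.hpp>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

static const std::uint32_t kThreads = 4; // main thread plus 3 workers (illustrative)

// Plain std::thread barrier: every thread must have installed the scheduler
// before any fiber runs, otherwise latecomers have nothing to steal.
class thread_barrier {
    std::mutex mtx_;
    std::condition_variable cv_;
    std::size_t count_;
public:
    explicit thread_barrier(std::size_t n) : count_(n) {}
    void wait() {
        std::unique_lock<std::mutex> lk(mtx_);
        if (--count_ == 0) cv_.notify_all();
        else cv_.wait(lk, [this] { return count_ == 0; });
    }
};

// Fiber-level "all done" signal so worker threads can shut down cleanly.
static boost::fibers::mutex done_mtx;
static boost::fibers::condition_variable done_cv;
static bool done = false;

void worker(thread_barrier &b) {
    // Register this thread with the shared work-stealing scheduler.
    boost::fibers::use_scheduling_algorithm<boost::fibers::algo::work_stealing>(kThreads);
    b.wait(); // rendezvous: all schedulers registered
    // Block the *fiber*, not the thread: while the main fiber waits here,
    // the scheduler keeps stealing and running other fibers.
    std::unique_lock<boost::fibers::mutex> lk(done_mtx);
    done_cv.wait(lk, [] { return done; });
}

int main() {
    boost::fibers::use_scheduling_algorithm<boost::fibers::algo::work_stealing>(kThreads);
    thread_barrier b(kThreads);
    std::vector<std::thread> workers;
    for (std::uint32_t i = 1; i < kThreads; ++i)
        workers.emplace_back(worker, std::ref(b));
    b.wait(); // only now is it safe to launch fibers

    std::vector<boost::fibers::future<int>> v;
    for (int i = 0; i < 16; ++i)
        v.emplace_back(boost::fibers::async([i] { return i; }));
    int s = 0;
    for (auto &f : v)
        s += f.get();

    {
        std::unique_lock<boost::fibers::mutex> lk(done_mtx);
        done = true;
    }
    done_cv.notify_all();
    for (auto &t : workers)
        t.join();
    return s == 120 ? 0 : 1;
}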
Related
I have an app which has a main thread that talks to a remote server over tcp and obtains some data. Before establishing a tcp connection, the main thread spawns a new thread that among other things prints the data obtained by the main thread.
Here's some pseudo-code on how the parallel thread does that -
while (1)
{
    if (dataObtained)
    {
        printDataFromMainThread(remoteData);
        exit;
    }
}
In main thread:
recvOverTCPSocket();
saveData(remoteData);
dataObtained = true;
I was initially concerned that this is not thread-safe, yet when we run this app there are seemingly no issues. I still have my concerns, though.
It is clear that the parallel thread will read data only when the dataObtained flag is true...which happens only after all the data is saved. This ensures there is no chance of the parallel thread reading data while it is still being updated by the main thread.
However, I still am not sure if this is the right approach and whether I am missing any scenario where this is likely to fail.
Any inputs/comments? Does the above code snippet let us avoid using a semaphore while updating the data?
From your code, it looks like your parallel thread never blocks; that means that while your program is running, your parallel thread will probably be using up all of the available CPU cycles on one of the computer's cores. That's generally something you want to avoid, as it's very inefficient (e.g. it will tend to slow down other programs that are running on the computer, and on battery-powered devices it will unnecessarily deplete the battery).
You can avoid that e.g. by having the parallel thread block on a condition variable which the main thread would signal when the data is ready... but then the question becomes, why are you using multiple threads at all? It doesn't appear that your program is getting any benefit from having two threads, since only one of them is doing useful work at any given time anyway. Why not just have the same thread do both the TCP data download and then the printing of the downloaded data? That would give you the sequential behavior you are looking for, and be much simpler, more efficient, and less error-prone.
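For illustration, a minimal sketch of that condition-variable handoff; the names mirror the question's pseudocode, and the TCP/print calls are stubbed out as comments:

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool dataObtained = false;

void parallelThread()
{
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return dataObtained; }); // sleeps in the kernel, burns no CPU
    // printDataFromMainThread(remoteData);
}

int main()
{
    std::thread t(parallelThread);
    // recvOverTCPSocket(); saveData(remoteData);
    {
        std::lock_guard<std::mutex> lk(m);
        dataObtained = true;
    }
    cv.notify_one(); // wake the waiting thread now that the data is complete
    t.join();
}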
I would make the boolean flag atomic, but otherwise it should work:
std::atomic<bool> dataObtained(false);
...
// in working thread
while (!dataObtained) { std::this_thread::yield(); }
// receive and print data
I want to create a thread or task (more than one, to be exact) that does some non-CPU-intensive work that will take a long time because of external causes, such as an HTTP request or a file I/O operation from a slow disk. I could do this with async/await in C#, and that would be exactly what I am trying to do here: spawn a thread or task, let it do its own thing while I continue with execution of the program, and simply let it return the result whenever it's ready. The problem I have with TBB is that all the tasks I can create assume they are meant for CPU-intensive work.
Is what TBB calls a GUI Thread what I want in this case? I would need more than one; is that possible? Can you point me in the right direction? Should I look for another library that provides threading and is available for multiple OSes?
Any I/O blocking activity is poorly modeled by a task -- since tasks are meant to run to completion, it's just not what tasks are for. You will not find any TBB task-based approach that circumvents this. Since what you want is a thread, and you want it to work more-or-less nicely with other TBB code you already have, just use TBB's native thread class to solve the problem as you would with any other threading API. You won't need to set priority or anything else on this TBB-managed thread, because it'll get to its blocking call and then not take up any further time until the resource is available.
About the only thing I can think of specifically in TBB is that a task can be assigned a priority. But this isn't the same thing as a thread priority. TBB task priorities only dictate when a task will be selected from the ready pool, but like you said - once the task is running, it's expected to be working hard. The way to use this to solve the problem you mentioned is to break your IO work into segments, then submit them into the work pool as a series of (dependent) low-priority tasks. But I don't think this gets to your real problem ...
The GUI Thread you mentioned is a pattern in the TBB patterns document that says how to offload a task and then wait for a callback to signal that it's complete. It's not altogether different from an async. I don't think this solves your problem either.
I think the best way for you here is to make an OS-level thread: pthreads on Linux, or Windows threads on Windows. Then you'll want to call this on it: http://msdn.microsoft.com/en-us/library/windows/desktop/ms686277(v=vs.85).aspx ... if you happen to be in C++11, you could use a std::thread to create the thread and then call thread::native_handle to get a handle you can pass to the Windows API to set the priority.
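A sketch of that last suggestion, assuming Windows and MSVC, where native_handle() yields the HANDLE that SetThreadPriority (the API behind the link above) expects; the thread body is illustrative:

#include <windows.h>
#include <thread>

int main()
{
    std::thread io_thread([] {
        // blocking HTTP/file I/O here; the thread sleeps in the kernel while waiting
    });
    // Lower the priority so this thread never competes with TBB's workers.
    SetThreadPriority(io_thread.native_handle(), THREAD_PRIORITY_BELOW_NORMAL);
    io_thread.join();
}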
I am having trouble understanding some concepts of multithreading. I know the basic principles but am having trouble with the understanding of when individual threads are sent and used by cores.
I know that having multiple threads allow code to run in parallel. I think this would be a good addition to my archive extraction program which could decompress blocks using multiple cores. It decompresses all of the files in a for loop and I am hoping that each available core will work on a file.
Here are my questions:
Do I need to query or even consider the number of cores on a machine, or are running threads automatically sent to free cores?
Can anyone show me an example of a for loop using threads? Say in each loop iteration it would call a function using a different thread. I read that the ideal number of active threads is the number of cores. How do I know when a core is free, or should I check whether a thread has joined the main thread, and create a new one when it has, to keep a certain number of threads running?
Am I overcomplicating things or are my questions indicative that I am not grasping the concepts?
If you're decompressing files then you'll probably want a bounded number of threads rather than one thread per file. Otherwise, if you're processing 1000 files, you're going to create 1000 threads, which won't make efficient use of the CPU.
As you've mentioned, one approach is to create as many threads as there are cores, and this is a reasonable approach in your case as decompression is reasonably CPU bound, and therefore any threads you create will be active for most of their time slice. If your problem were IO bound, then your threads would spend a lot of time waiting for IO to complete, and therefore you could spin up more threads than you have cores, within bounds.
For your application I'd probably look at spinning up one thread per core, and have each thread process one file at a time. This will help keep your algorithm simple. If you had multiple threads working on one file then you'd have to synchronize between them to ensure that the blocks they processed were written out to the correct location in the uncompressed file, which would cause needless headaches.
C++11 includes a thread library which you can use to simplify working with threads.
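For instance, a sketch of that one-thread-per-core layout with std::thread; decompressFile and the file names are hypothetical stand-ins for your real per-file routine and inputs:

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

void decompressFile(const std::string &path)
{
    // placeholder for the real decompression work on `path`
}

void processAll(const std::vector<std::string> &files)
{
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::atomic<std::size_t> next(0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t)
        pool.emplace_back([&] {
            // each thread pulls the next unclaimed file until none remain
            for (std::size_t i = next++; i < files.size(); i = next++)
                decompressFile(files[i]);
        });
    for (auto &th : pool)
        th.join();
}

int main()
{
    processAll({"a.gz", "b.gz", "c.gz"});
}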
No, you can use an API that keeps that transparent, for example POSIX threads on Linux (pthread library).
This answer probably depends on what API you use, though many APIs share threading basics like mutexes. Here, however, is a pthreads example (since that's the only C/C++ threading API I know).
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
// Whatever other headers you need for your code.

#define MAX_NUM_THREADS 12

// Each thread will run this function.
void *worker( void *arg )
{
    // Do stuff here and it will be 'in parallel'.
    // Note: Threads can read from the same location concurrently
    // without issue, but writing to any shared resource that has not been
    // locked with, for example, a mutex, can cause pernicious bugs.

    // Call this when you're done.
    pthread_exit( NULL );
}

int main()
{
    // Each is a handle for one thread, with 12 in total.
    pthread_t myThreads[MAX_NUM_THREADS];

    // Create the worker threads.
    for(unsigned long i = 0; i < MAX_NUM_THREADS; i++)
    {
        // NULL thread attributes struct.
        // This initializes the threads with the default PTHREAD_CREATE_JOINABLE
        // attribute; we know a thread is finished when it joins, see below.
        pthread_create(&myThreads[i], NULL, worker, (void *)i);
    }

    void *status;

    // Wait for the threads to finish.
    for(unsigned long i = 0; i < MAX_NUM_THREADS; i++)
    {
        pthread_join(myThreads[i], &status);
    }

    // That's all, folks.
    pthread_exit(NULL);
}
Without too much detail, that's a pretty basic skeleton for a simple threaded application using pthreads.
Regarding your questions on the best way to go about applying this to your program:
I suggest one thread per file, using a Threadpool Pattern, and here's why:
Single thread per file is much simpler because there's no sharing, hence no synchronization. You can change the worker function to a decompressFile function, passing a filename each time you call pthread_create. That's basically it. Your threadpool pattern sort of falls into place here.
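Concretely, a sketch of the skeleton above reworked that way, passing each filename through pthread_create's void* argument; decompressFile and the filename list are hypothetical:

#include <stdio.h>
#include <pthread.h>

#define NUM_FILES 3

const char *filenames[NUM_FILES] = { "a.gz", "b.gz", "c.gz" };

void *decompressFile( void *arg )
{
    const char *filename = (const char *)arg;
    printf("decompressing %s\n", filename); // placeholder for the real work
    pthread_exit(NULL);
}

int main()
{
    pthread_t threads[NUM_FILES];
    for (int i = 0; i < NUM_FILES; i++)
        pthread_create(&threads[i], NULL, decompressFile, (void *)filenames[i]);
    for (int i = 0; i < NUM_FILES; i++)
        pthread_join(threads[i], NULL);
    return 0;
}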
Multiple threads per file means synchronization, which means complexity because you have to manage access to shared resources. In order to speed up your algorithm, you'd have to isolate portions of it that can run in parallel. However, I would actually expect this method to run slower:
Imagine Thread A has File A open, and Thread B has File B open, but File A and File B are in completely different sectors of your disk. As your OS's scheduling algorithm switches between Thread A and Thread B, your hard drive has to spin like mad to keep up, making the CPU (hence your program) wait.
Since you are seemingly new to threading/parallelism, and you just want to get more performance out of multiple processors/cores, I suggest you look for libraries that deal with threading and allow you to enable parallelism without getting into thread management, work distribution etc.
It sounds like all you need now is parallel loop execution. Nowadays there are plenty of C++ libraries that can ease this task for you, e.g. Intel's TBB, Microsoft's PPL, AMD's Bolt, and Qualcomm's MARE, to name a few. You may compare licensing terms, supported platforms, and functionality, and make the choice that best fits your needs.
To be more specific and answer your questions:
1) Generally, you should have no need to know/consider the number of processors or cores. Choose a library that abstracts this detail away from you and your program. On the other hand, if you see that with default settings CPU is not fully utilized (e.g. due to a significant number of I/O operations), you may find it useful to ask for more threads, e.g. by multiplying the default by a certain factor.
2) A sketch of a for loop made parallel with tbb::parallel_for and C++11 lambda functions:
#include <tbb/tbb.h>
void ParallelFoo( std::vector<MyDataType>& v ) {
    tbb::parallel_for( size_t(0), v.size(), [&](size_t i){
Foo( v[i] );
} );
}
Note that it is not guaranteed that each iteration is executed by a separate thread; but you should not actually worry about such details; all you need is for the available cores to be busy with useful work.
Disclaimer: I'm a developer of Intel's TBB library.
If you're on Windows, you could take a look at Thread Pools, a good description can be found here: http://msdn.microsoft.com/en-us/magazine/cc163327.aspx. An interesting feature of this facility is that it promises to manage the threads for you. It also selects the optimal number of threads depending on demand as well as on the available cores.
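For reference, a minimal sketch of the Win32 (Vista+) thread-pool work-object API; whether or not that is the exact facility the article covers, these are the standard calls, and the callback body is illustrative:

#include <windows.h>
#include <stdio.h>

// Runs on a pool thread; context is whatever was passed at creation.
VOID CALLBACK workCallback(PTP_CALLBACK_INSTANCE, PVOID context, PTP_WORK)
{
    printf("work item running, context=%p\n", context);
}

int main()
{
    PTP_WORK work = CreateThreadpoolWork(workCallback, NULL, NULL);
    for (int i = 0; i < 8; ++i)
        SubmitThreadpoolWork(work);              // the pool picks the thread count itself
    WaitForThreadpoolWorkCallbacks(work, FALSE); // wait; don't cancel pending callbacks
    CloseThreadpoolWork(work);
    return 0;
}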
I want to split some CPU-intensive jobs across multiple threads. I want to make a thread pool with, let's say, 4 threads.
I want to know very fast ways to do the following:
Check if one thread is free to receive a processing job
Signal one thread to start a specific function
Wait for all the threads to finish their jobs
This should be as fast as possible. I use C++ in Visual Studio 2010 on Windows 7. Any Win7/VS2010-specific solution would be preferred if it's faster than a portable approach.
EDIT:
I found on MSDN this sample:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686946(v=vs.85).aspx
Is there any faster way to do this?
The stuff from the Boost thread library is pretty fast. You can start 4 threads that end up waiting for a boost::condition_variable. In the main thread you can add stuff to a task-queue and then call boost::condition_variable::notify_one in order to start one free thread, if any. As soon as one of the working threads is notified, it takes stuff out of the task queue and continues to do so until the queue is empty. In order to wait for the task queue to finish, let the thread that makes the task queue empty call boost::condition_variable::notify_all and wait in the main thread for that signal. Obviously you need to protect the shared data for this stuff with a mutex.
This technique works fine if you have medium to large tasks, with several thousand or fewer executing per second. I don't have experience using this technique with smaller tasks.
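One way to arrange that queue and the wake-ups, as a hedged sketch: the Task alias is illustrative, and shutdown is handled here with a flag plus notify_all rather than the notify-on-empty variant described above:

#include <boost/thread.hpp>
#include <boost/function.hpp>
#include <queue>

typedef boost::function<void()> Task;

std::queue<Task> tasks;
boost::mutex mtx;
boost::condition_variable cv; // idle workers sleep here
bool shuttingDown = false;

void worker()
{
    for (;;) {
        Task t;
        {
            boost::unique_lock<boost::mutex> lk(mtx);
            while (tasks.empty() && !shuttingDown)
                cv.wait(lk);
            if (tasks.empty())
                return;          // shutdown requested and queue drained
            t = tasks.front();
            tasks.pop();
        }
        t();                     // run the task outside the lock
    }
}

void post(const Task &t)
{
    boost::unique_lock<boost::mutex> lk(mtx);
    tasks.push(t);
    cv.notify_one();             // wake one free thread, if any
}

int main()
{
    boost::thread_group pool;
    for (int i = 0; i < 4; ++i)
        pool.create_thread(worker);
    for (int i = 0; i < 100; ++i)
        post([] { /* task body */ });
    {
        boost::unique_lock<boost::mutex> lk(mtx);
        shuttingDown = true;
    }
    cv.notify_all();
    pool.join_all();
    return 0;
}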
The parallel patterns library (PPL) is really good at that stuff too, it does a lot of stuff for you, but you don't have as much control. It's Windows only, but that seems to be fine with you. ;)
EDIT: Your link seems to be a good solution. Using the WinAPI is often the fastest thing you can do, since other APIs are usually built upon it. The WinAPI does not, however, provide very good abstraction, so I would prefer PPL, futures, etc. for tasks like that. How big are your tasks? If they take more than a few milliseconds, then you shouldn't worry about the API you're using, since that's not the bottleneck.
First way: Asynchronous Procedure Calls.
Another way: I/O Completion Ports, which can be used for your task.
I don't know about Visual C++ specific thread pools, though I've heard of ppl.h. There is also an unofficial Boost threadpool that I've used; like the rest of Boost, it compiles well in Visual Studio.
Try TBB:
class SimpleTask : public tbb::task {
public:
    explicit SimpleTask(const char *name) : name_(name) {}
    tbb::task *execute() {
        // do the task's work here
        return NULL;
    }
private:
    const char *name_;
};

// execute tasks and wait
tbb::task_scheduler_init init(50);  // initialize the pool
tbb::task_list list;
for (int i = 0; i < 30; i++) {      // create 30 tasks
    list.push_back(*new (tbb::task::allocate_root()) SimpleTask("job"));
}
tbb::task::spawn_root_and_wait(list);  // execute and wait for all tasks, or call spawn() to skip the wait
I have a program with a main thread and a diagnostics thread. The main thread is basically a while(1) loop that performs various tasks. One of these tasks is to provide a diagnostics engine with information about the system and then check back later (i.e. in the next loop) to see if there are any problems that should be dealt with. An iteration of the main loop should take no longer than 0.1 seconds. If all is well, then the diagnostic engine takes almost no time to come back with an answer. However, if there is a problem, the diagnostic engine can take seconds to isolate the problem. For this reason each time the diagnostic engine receives new information it spins up a new diagnostics thread.
The problem we're having is that the diagnostics thread is stealing time away from the main thread. Effectively, even though we have two threads, the main thread is not able to run as often as I would like because the diagnostic thread is still spinning.
Using Boost threads, is it possible to limit the amount of time that a thread can run before moving on to another thread? Also of importance here is that the diagnostic algorithm we are using is blackbox, so we can't put any threading code inside of it. Thanks!
If you run multiple threads they will indeed consume CPU time. If you only have a single processor, and one thread is doing processor intensive work then that thread will slow down the work done on other threads. If you use OS-specific facilities to change the thread priority then you can make the diagnostic thread have a lower priority than the main thread. Also, you mention that the diagnostic thread is "spinning". Do you mean it literally has the equivalent of a spin-wait like this:
while(!check_done()) ; // loop until done
If so, I would strongly suggest that you try and avoid such a busy-wait, as it will consume CPU time without achieving anything.
However, though multiple threads can cause each other to slow-down, if you are seeing an actual delay of several seconds this would suggest there is another problem, and that the main thread is actually waiting for the diagnostic thread to complete. Check that the call to join() for the diagnostic thread is outside the main loop.
Another possibility is that the diagnostic thread is locking a mutex needed by the main thread loop. Check which mutexes are locked and where.
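To illustrate that failure mode, a sketch of the anti-pattern; runBlackboxDiagnosis is a hypothetical stand-in for the black-box engine:

#include <boost/thread.hpp>

boost::mutex m;

void runBlackboxDiagnosis()
{
    boost::this_thread::sleep(boost::posix_time::seconds(5)); // stands in for slow fault isolation
}

void diagnosticThread()
{
    boost::lock_guard<boost::mutex> lk(m); // held for the entire run
    runBlackboxDiagnosis();
}

void mainLoopIteration()
{
    boost::lock_guard<boost::mutex> lk(m); // stalls here for seconds, not 0.1 s
    // ... the loop's normal 0.1 s of work ...
}

int main()
{
    boost::thread diag(diagnosticThread);
    mainLoopIteration(); // competes for m with the diagnostic thread
    diag.join();
}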
To really help, I'd need to see some code.
It looks like your threads are interlocked, so your main thread waits until the background thread has finished its work. Check for any multithreading synchronization that could cause this.
To check that it's nothing related to OS scheduling, run your program on a dual-core system, so both threads can execute truly in parallel.
From the way you've worded your question, it appears that you're not quite sure how threads work. I assume by "the amount of time that a thread can run before moving on to another thread" you mean the number of CPU cycles spent per thread. This happens hundreds of thousands of times per second.
Boost.Thread does not have support for thread priorities, although your OS-specific thread API will. However, your problem seems to indicate the necessity for a fundamental redesign -- or at least heavy profiling to find bottlenecks.
You can't do this generally at the OS level, so I doubt Boost has anything specific for limiting execution time. You can kind of fake it with small-block operations and waits, but it's not clean.
I would suggest looking into processor affinity, either at a thread or process level (this will be OS-specific). If you can isolate your diagnostic processing to a limited subset of [logical] processors on a multi-core machine, it will give you a very coarse mechanism to control the maximum execution share relative to the main process. That's the best solution I have found when trying to do a similar type of thing.
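For example, a Windows sketch; the mask value 1 pins the thread to logical processor 0 and is purely illustrative:

#include <windows.h>
#include <thread>

int main()
{
    std::thread diag([] {
        // black-box diagnostic work here
    });
    // Confine the diagnostic thread to logical processor 0,
    // leaving the remaining cores to the main thread.
    SetThreadAffinityMask(diag.native_handle(), 1);
    diag.join();
}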
Hope that helps.