Spawn multiple std::thread and reuse them - C++

I am a noob when it comes to threading and need some help / advice.
First of all, can you check if my understanding is correct in the following code:
std::vector<std::unique_ptr<Object>> totalObjects(512);
std::vector<Object*> objectsToUpdate(32);
std::vector<std::thread> threadsPool(32);
int nrObjectsToUpdate; // Varies between 1 and 32 for each update.

findObjectsToUpdate(totalObjects, objectsToUpdate, nrObjectsToUpdate);

for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i] = std::thread(objectsToUpdate[i]->updateTask1());

// All tasks in this step must be completed before
// we can move on to the next, i.e. updateTask2().
for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i].join();

for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i] = std::thread(objectsToUpdate[i]->updateTask2());

for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i].join();
Should I spawn one thread for each updateTask1() and updateTask2()?
For each update, do I need to create the std::thread objects all over again, or can I simply reuse them via some member function?
If I create threads for updateTask1(), is it possible to reuse all thread objects for updateTask2()?, i.e. switching function pointer with some std::thread member function?
Let us say that we create 100 threads on a quad-core CPU (4 cores):
will all the CPU cores be busy until all the threads are completed?
I know at least that 4 cores means 4 threads can run at once.
Grateful for all the help and explanations that can be given.

The optimal number of threads is both application and hardware dependent, so there is no single answer to how many threads you should spawn.
For example, some applications run well with multiple threads per core because the threads do not interfere with each other (thread X and thread Y on core 1, say, don't fight for compute resources, so there is a gain from running multiple threads per core). Other applications perform worse with multiple threads per core because a single thread already needs most of the core's resources, so additional threads on that core just interfere. You should do some testing to find out which thread configuration works best for your application. Multithreading is often not straightforward, and the performance results may be surprising.
There are a few things you can use to help determine the number of threads and the thread scheduling (you should still run the performance tests, though).
You can use unsigned num_cpus = std::thread::hardware_concurrency(); to get the number of available CPUs. While you may know the number of cores for the CPU you're using, maybe you want to run it on another machine for which you don't know the number of cores.
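For illustration, here is a minimal sketch (assuming the Object type and the no-argument updateTask1() from your code; the function name is illustrative) that sizes the number of workers from hardware_concurrency() and strides the objects across them, instead of spawning one thread per object:

#include <algorithm>
#include <thread>
#include <vector>

void updateAllTask1(std::vector<Object*>& objectsToUpdate, int nrObjectsToUpdate)
{
    unsigned num_cpus = std::thread::hardware_concurrency();
    if (num_cpus == 0) num_cpus = 4;               // may return 0 if unknown
    unsigned num_workers = std::min<unsigned>(num_cpus, nrObjectsToUpdate);

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            // Each worker strides through the objects so the work is shared.
            for (int i = static_cast<int>(w); i < nrObjectsToUpdate; i += num_workers)
                objectsToUpdate[i]->updateTask1();
        });
    }
    for (auto& t : workers)
        t.join();                                  // all updateTask1 calls are done
}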
Additionally there is processor affinity, which is essentially pinning certain threads to specific CPUs. By default the OS is allowed to schedule any of the spawned threads to any of the CPUs. Sometimes this results in multiple threads per CPU, and some CPUs not being utilised for some portion of the multi-threaded component. You can explicitly set specific threads to use specific CPUs using pthread_setaffinity_np as follows (do this for each thread you want to pin to a core):
cpu_set_t cpu_set;
CPU_ZERO(&cpu_set);
CPU_SET(i, &cpu_set);
int rc = pthread_setaffinity_np(threadsPool[i].native_handle(),
                                sizeof(cpu_set_t), &cpu_set);
// Check for error
if (rc != 0)
    std::cerr << "pthread_setaffinity_np error: " << rc << "\n";
If I create threads for updateTask1(), is it possible to reuse all thread objects for updateTask2()?, i.e. switching function pointer with some std::thread member function?
Yes, you can do this. The logic in your program regarding the use of threads for updateTask1() and updateTask2() is correct; however, you have made a syntactic error when assigning the threads.
threadsPool[i] = std::thread(objectsToUpdate[i]->updateTask1());
is incorrect. You want each thread to run a member function, so you need to pass a pointer to that member function, the object to call it on, and then any additional arguments (for the sake of example, I'll assume updateTask1 takes the object index i). The thread assignment should then look like this:
threadsPool[i] = std::thread(&Object::updateTask1, // Pointer to member function
                             objectsToUpdate[i],   // Object to call it on
                             i);                   // Additional argument -- object index
You can then use the same syntax for updateTask2. Here is a live demo for demonstration, which includes processor affinity.
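Putting it together, a minimal sketch of the corrected two-phase update could look like this (still assuming, purely for the example, that updateTask1 and updateTask2 take the object index). Reusing a std::thread object is simply a matter of move-assigning a new thread into it after it has been joined:

for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i] = std::thread(&Object::updateTask1, objectsToUpdate[i], i);
for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i].join();            // all updateTask1 work finished

for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i] = std::thread(&Object::updateTask2, objectsToUpdate[i], i);  // reuse the same std::thread objects
for (int i = 0; i < nrObjectsToUpdate; i++)
    threadsPool[i].join();            // all updateTask2 work finished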

Related

What is the correct way to use queue.flush() and queue.finish() after calling a Kernel?

I am using the OpenCL 1.2 C++ wrapper for my project. I want to know the correct way to call my kernels. In my case, I have 2 devices and the data should be sent to both of them simultaneously.
I am dividing my data into two chunks, and both devices should be able to perform computations on them separately. They have no interconnection and they don't need to know what is happening on the other device.
When the data has been sent to both devices, I want to wait for the kernels to finish before my program goes further, because I will be using the results returned from both kernels. So I don't want to start reading the data back before the kernels have returned.
I have 2 methods. Which one is correct in my case?
Method 1:

for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i], /* ... */),
             /* arguments etc... */);
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Method 2:

for (int i = 0; i < numberOfDevices; i++) {
    // Enqueue the kernel.
    kernelGA(cl::EnqueueArgs(queue[i], /* ... */),
             /* arguments etc... */);
}
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].flush();
}
// Wait for the kernels to return.
for (int i = 0; i < numberOfDevices; i++) {
    queue[i].finish();
}
Or is neither of them correct? Is there a better way to wait for my kernels to return?
Assuming each device computes in its own memory:
I would go for a multi-threaded (for) loop version of your Method 1, because OpenCL doesn't force vendors to do asynchronous enqueueing. Nvidia, for example, does synchronous enqueueing for some drivers and hardware, while AMD does asynchronous enqueueing.
When each device is driven by a separate thread, the threads should enqueue write + compute together before synchronizing to read the partial results (the second threaded loop).
Having multiple threads is also advantageous for spin-wait type synchronization (clFinish), because multiple spin-wait loops run in parallel. This can save time on the order of a millisecond.
Flushing helps some vendors, such as AMD, start working on the queue early.
To have correct input and correct output for all devices, only two finish commands are enough: one after write + compute, then one after read (of the results). That way each device gets the same time-step data and produces results at the same time step. Write and compute don't need a finish between them if the queue type is in-order, because commands execute one by one. This also means the read operations don't need to be blocking.
Superfluous finish commands always kill performance.
Note: I already wrote a load balancer using all of this, and it performs better when using event-based synchronization instead of finish. Finish is easier, but it has bigger synchronization overhead than an event-based approach.
Also, a single queue doesn't always push a GPU to its limits. Using at least 4 queues per device ensures latency hiding of write and compute on my AMD system; sometimes even 16 queues help a bit more, and I/O-bottlenecked situations may need even more.
Example:

Thread 1:
    write
    compute
    synchronize with the other thread
Thread 2:
    write
    compute
    synchronize with the other thread
Thread 1:
    read
    synchronize with the other thread
Thread 2:
    read
    synchronize with the other thread
Gratuitous synchronization kills performance, because drivers don't know your intention and just leave it as it is, so you should eliminate unnecessary finish commands and convert blocking writes to non-blocking ones where you can.
Zero synchronization is also wrong, because OpenCL doesn't force vendors to start computing after several enqueues; the queued work may grow indefinitely, to gigabytes of memory, within minutes or even seconds.
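To make the per-device-thread idea concrete, here is a rough sketch of the threaded version of Method 1 (kernelGA, queue and the kernel arguments are placeholders taken from the question; std::thread is used for the host-side threads):

#include <thread>
#include <vector>

std::vector<std::thread> workers;
for (int i = 0; i < numberOfDevices; i++) {
    workers.emplace_back([&, i] {
        // Each host thread drives exactly one device/queue.
        kernelGA(cl::EnqueueArgs(queue[i], /* global range */),
                 /* arguments etc... */);
        queue[i].flush();   // make sure the commands are issued to the device early
        queue[i].finish();  // sync point 1: this device's write + compute are done
        // ... enqueue a non-blocking read of this device's partial result,
        //     then one more finish() before the host uses it (sync point 2) ...
    });
}
for (auto& t : workers)
    t.join();               // both devices have produced their partial results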
You should use Method 1. clFlush is the only way of guaranteeing that commands are issued to the device (and not just buffered somewhere before sending).

How to get multithreading working properly using pthreads (not Boost) in a C++ class

I have a project that I have summarized here with some pseudo code to illustrate the problem.
I do not have a compiler issue; my actual code compiles fine with either Boost or pthreads. Remember this is pseudo code designed to illustrate the problem, not directly compilable code.
The problem I am having is that for a multithreaded function the memory usage and processing time are always greater than if the same work is done using serial programming, e.g. a for/while loop.
Here is a simplified version of the problem I am facing:
class aproject {
public:
    typedef struct
    {
        char** somedata;
        double output, fitness;
    } entity;

    entity **entity_array;
    int whichthread, numthreads;
    pthread_mutex_t mutexdata;

    aproject() {
        numthreads = 100;
        entity_array = new entity*[numthreads];
        for (int i = 0; i < numthreads; i++) {
            entity_array[i] = new entity;
            entity_array[i]->somedata = new char*[2];
            entity_array[i]->somedata[0] = new char[100];
            entity_array[i]->somedata[1] = new char[100];
        }
        /*.....more memory allocations for entity_array.......*/
        this->initdata();
        this->eval_thread();
    }

    void initdata() {
        /**put zeros and ones in entity_array**/
    }

    float somefunc(char *somedata) {
        float output = countzero(); // some other function not listed
        return output;
    }

    void* thread_function()
    {
        // Grab the next free index under the mutex; each thread then works
        // only on its own entity_array element.
        pthread_mutex_lock(&mutexdata);
        int currentthread = this->whichthread;
        this->whichthread += 1;
        pthread_mutex_unlock(&mutexdata);

        entity *ent = this->entity_array[currentthread];
        double A = somefunc(ent->somedata[0]);
        double B = somefunc(ent->somedata[1]);
        double t4 = anotherfunc(A, B);      // another function not listed
        ent->output = t4;
        ent->fitness = sqrt(pow(t4, 2));
        return NULL;
    }

    static void* staticthreadproc(void* p) {
        return reinterpret_cast<aproject*>(p)->thread_function();
    }

    void eval_thread() {
        // use multithreading to evaluate individuals in parallel
        int nthreads = this->numthreads;
        pthread_t threads[nthreads];
        // create threads
        pthread_mutex_init(&this->mutexdata, NULL);
        this->whichthread = 0;
        for (int i = 0; i < nthreads; i++) {
            pthread_create(&threads[i], NULL, &aproject::staticthreadproc, this);
            //printf("creating thread, %d\n", i);
        }
        // join threads
        for (int i = 0; i < nthreads; i++) {
            pthread_join(threads[i], NULL);
        }
    }
};
I am using pthreads here because they work better than Boost on machines with little memory.
Each thread is started in eval_thread and terminated there as well. I am using a mutex to ensure every thread starts with the correct index into entity_array, as each thread only applies its work to the entity_array element indexed by this->whichthread. This variable is the only thing that needs to be protected by the mutex, since it is updated for every thread and must not be changed concurrently by other threads. You can happily ignore everything apart from thread_function, eval_thread, and staticthreadproc, as they are the only relevant functions; assume that all the other functions apart from initdata are both processor and memory intensive.
So my question is: why is using multithreading in this way more costly in memory and speed than the traditional method of not using threads at all?
I MUST REITERATE: THE CODE IS PSEUDO CODE AND THE PROBLEM ISN'T WHETHER IT WILL COMPILE.
Thanks, I would appreciate any suggestions you might have for pthreads and/or boost solutions.
Each thread requires its own call stack, which consumes memory. Every local variable of your function (and of all other functions on the call stack) counts toward that memory.
When creating a new thread, space for its call stack is reserved. I don't know what the default value is for pthreads, but you might want to look into that. If you know you require less stack space than is reserved by default, you may be able to reduce memory consumption significantly by explicitly specifying the desired stack size when spawning the thread.
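As a hedged sketch (the threads array and staticthreadproc names are taken from the question; the 64 KiB value is purely illustrative), spawning a thread with an explicit, smaller stack would look roughly like this:

#include <pthread.h>
#include <limits.h>   // PTHREAD_STACK_MIN

pthread_attr_t attr;
pthread_attr_init(&attr);
size_t stack_size = 64 * 1024;                 // illustrative; must be >= PTHREAD_STACK_MIN
pthread_attr_setstacksize(&attr, stack_size);
pthread_create(&threads[i], &attr, &aproject::staticthreadproc, this);
pthread_attr_destroy(&attr);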
As for the performance part, it could depend on several issues. Generally, you should not expect a performance boost from parallelizing number-crunching operations across more threads than you have cores (I don't know whether that is the case here). It might end up being slower, due to the additional overhead of context switches, an increased number of cache misses, and so on. There are ways to profile this, depending on your platform (for instance, the Visual Studio profiler can count cache misses, and there are tools for Linux as well).
Creating a thread is quite an expensive operation. If each thread only does a very small amount of work, then your program may be dominated by the time taken to create them all. Also, a large number of active threads can increase the work needed to schedule them, degrading system performance. And, as another answer mentions, each thread requires its own stack memory, so memory usage will grow with the number of threads.
Another issue can be cache invalidation: when one thread writes its results into its entity structure, it may invalidate nearby cached memory and force other threads to fetch that data from higher-level caches or from main memory.
You may get better results if you use a smaller number of threads, each processing a larger subset of the data. For a CPU-bound task like this, one thread per CPU is probably best: that way you can keep all CPUs busy, and there's no need to waste time scheduling multiple threads on each. Also make sure that the entities each thread works on are located together in the array, so that it does not invalidate cache lines used by other threads.
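As a rough illustration of that last point (the entity/aproject names follow the question; process_entity is a hypothetical stand-in for the per-entity work), each thread could be handed one contiguous chunk of the array:

#include <pthread.h>
#include <algorithm>
#include <vector>

struct Range { int begin, end; aproject* self; };

static void* chunk_proc(void* p) {
    Range* r = static_cast<Range*>(p);
    // Walk a contiguous slice of entity_array, so the entities this thread
    // writes to sit next to each other in memory.
    for (int i = r->begin; i < r->end; ++i)
        r->self->process_entity(i);            // hypothetical per-entity work
    return NULL;
}

void eval_chunked(aproject* self, int num_entities, int nthreads /* e.g. one per core */) {
    std::vector<pthread_t> threads(nthreads);
    std::vector<Range> ranges(nthreads);
    int chunk = (num_entities + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        ranges[t].begin = t * chunk;
        ranges[t].end   = std::min(num_entities, (t + 1) * chunk);
        ranges[t].self  = self;
        pthread_create(&threads[t], NULL, chunk_proc, &ranges[t]);
    }
    for (int t = 0; t < nthreads; ++t)
        pthread_join(threads[t], NULL);
}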

Multithreading using threadpool

I'm currently using the boost threadpool with the number of threads equal to the number of cores. I have scheduled, say 10 tasks using the pool's schedule function. For example,
suppose I have the function
void my_fun(std::vector<double>* my_vec){
// Do something here
}
The argument 'my_vec' here is just used to do some temporary calculations. The main reason I pass it to the function is that I would like to reuse the vector when I call the function again.
Currently, I have the following
// Create a vector of 10 vectors called my_vecs
// Create threadpool
boost::threadpool::pool tp(num_threads);
// Schedule tasks
for (int m = 0; m < 10; m++){
    tp.schedule(boost::bind(my_fun, my_vecs.at(m)));
}
This is my problem: I would like to replace the vector of 10 vectors with only 2 vectors. If I schedule 10 tasks and I have 2 cores, a maximum of 2 threads (tasks) will be running at any time. So I only want to use two vectors (one assigned to each thread) and use them to carry out my 10 tasks. How can I do this?
I hope this is clear. Thank You!
Probably boost::thread_specific_ptr is what you need. Below is how you may use it in your function:
#include <boost/thread/tss.hpp>

boost::thread_specific_ptr<std::vector<double> > tls_vec;

void my_fun()
{
    std::vector<double>* my_vec = tls_vec.get();
    if (!my_vec) {
        my_vec = new std::vector<double>();
        tls_vec.reset(my_vec);
    }
    // Do something here with my_vec
}
It will reuse vector instances between tasks scheduled to the same thread. There might be more than 2 instances if there are more threads in the pool, but due to preemption mentioned in other answers you really need an instance per running thread, not per core.
You should not need to delete vector instances stored in thread_specific_ptr; those will be automatically destroyed when corresponding threads finish.
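With the thread-local vector in place, the scheduling code from the question no longer needs to hand out vectors at all; a minimal usage sketch (using the same unofficial Boost.Threadpool API as in the question, and assuming its pool::wait() blocks until all scheduled tasks have run):

boost::threadpool::pool tp(num_threads);
for (int m = 0; m < 10; m++) {
    tp.schedule(&my_fun);   // each pool thread lazily creates its own vector
}
tp.wait();                  // assumption: blocks until all 10 tasks have finished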
I wouldn't limit the number of threads to the number of cores. Remember that multi-threaded programming has been going on long before we had multi-core processors. This is because the threads will likely block for some resource and the next thread can jump in and use the CPU.
Java has a FixedThreadPool.
It looks like Boost might have something similar:
http://deltavsoft.com/w/RcfUserGuide/1.2/rcf_user_guide/Multithreading.html
Basically, a fixed thread pool spawns a fixed number of threads, and you can then queue tasks in the manager's queue.
While it's true that only two threads can be running at the same time, on many threading systems threads get time-sliced, so a thread can be pre-empted during the execution of its task. Hence a third (fourth, ...) thread gets a chance to work while the processing of the first and second is still incomplete.
I don't know about this particular threading implementation, but my guess is that it allows (or runs in environments supporting) pre-emptive scheduling. My way of thinking about threads is to keep it really simple and let each thread have its own resources.

Boost.Thread no speedup?

I have a small program that implements a Monte Carlo simulation of blackjack using various card counting strategies. My main function basically does this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
for(int i = 0; i < simulations; ++i)
    runSimulation(bankroll, hands, tests, strategy);
The entire program run in a single thread on my machine takes about 10 seconds.
I wanted to take advantage of the 3 cores my processor has so I decided to rewrite the program to simply execute the various strategies in separate threads like this:
int bankroll = 50000;
int hands = 100;
int tests = 10000;
Simulation::strategy = hi_lo;
boost::thread threads[simulations];
for(int i = 0; i < simulations; ++i)
    threads[i] = boost::thread(boost::bind(runSimulation, bankroll, hands, tests, strategy));
for(int i = 0; i < simulations; ++i)
    threads[i].join();
However, when I ran this program, even though I got the same results it took around 24 seconds to complete. Did I miss something here?
If the value of simulations is high, then you end up creating a lot of threads, and the overhead of doing so can end up destroying any possible performance gains.
EDIT: One approach to this might be to just start three threads and let them each run 1/3 of the desired simulations. Alternatively, using a thread pool of some kind could also help.
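A minimal sketch of the "three threads, each running its share of the simulations" idea (same variables as in the question; the lambda assumes C++11, otherwise boost::bind to a small helper function would do):

#include <vector>
#include <boost/thread.hpp>

const int num_workers = 3;                        // e.g. one per core
std::vector<boost::thread> workers;
for (int w = 0; w < num_workers; ++w) {
    int begin = w * simulations / num_workers;
    int end   = (w + 1) * simulations / num_workers;
    // Each worker runs its contiguous share of the simulations serially.
    workers.push_back(boost::thread([=] {
        for (int i = begin; i < end; ++i)
            runSimulation(bankroll, hands, tests, strategy);
    }));
}
for (std::size_t w = 0; w < workers.size(); ++w)
    workers[w].join();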
This is a good candidate for a work queue with a thread pool. I have used Intel Threading Building Blocks (TBB) for such requirements; hand-crafted thread pools also work for quick hacks. On Windows, the OS provides a nice thread-pool-backed work queue,
QueueUserWorkItem().
Read these articles from Herb Sutter. You are probably a victim of "false sharing".
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=214100002
http://drdobbs.com/go-parallel/article/showArticle.jhtml?articleID=217500206
I agree with dlev. If your function runSimulation is not changing anything that is required for the next call to runSimulation to work properly, then you can do something like this:

1. Divide simulations by 3.
2. You then have three ranges of counters: 0 to simulations/3, simulations/3 + 1 to 2*simulations/3, and 2*simulations/3 + 1 to simulations.
3. These three ranges can be used in three different threads simultaneously.

NOTE: Your requirement might not be suitable for this kind of split at all if you have to lock shared data.
I'm late to this party, but wanted to note two things for others who come across this post:
1) Definitely see the second Herb Sutter link that David points out (http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206). It solved the problem that brought me to this question, outlining a struct data object wrapper that ensures separate parallel threads aren't competing for resources headquartered on the same memory cache-line (hardware controls will prevent multiple threads from accessing the same memory cache-line simultaneously).
2) Re the original question, dlev points out a large part of the problem, but since it's a simulation I bet there's a deeper issue slowing things down. While none of your program's high-level variables are shared you probably have one critical system variable that's shared: the system-level "last random number" that's stored under-the-hood and used to create the next random number. You might even be initializing dedicated generator objects for each simulation, but if they're making calls to a function like rand() then they, and by extension their threads, are making repeated calls to the same shared system resource and subsequently blocking one another.
Solutions to issue #2 would depend on the structure of the simulation program itself. For instance if calls to a random generator are fragmented then I'd probably batch into one upfront call which retrieves and stores what the simulation will need. And this has me wondering now about more sophisticated approaches that'd deal with the underlying random generation shared-resource issue...
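As one hedged illustration of point 2 (the names here are hypothetical, not from the original program): giving each simulation its own generator, for example a C++11 <random> engine seeded per thread, keeps the threads from contending for rand()'s shared state:

#include <random>

void runSimulationThread(int bankroll, int hands, int tests, unsigned seed)
{
    std::mt19937 rng(seed);                          // one engine per thread, no shared state
    std::uniform_int_distribution<int> card(1, 52);  // e.g. draw a card index
    for (int t = 0; t < tests; ++t) {
        int c = card(rng);
        (void)c;  // ... feed the simulation with 'c' instead of calling rand() ...
    }
}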

boost::thread: How to start all threads, but have only up to n running at a time?

In the boost::thread library, is there any mechanism to control how many threads (at most) are running at a time?
In my case, it would be most convenient to start N threads all at the same time (N may be hundreds or a few thousand):
std::vector<boost::thread*> vec;
for (int i = 0; i < N; ++i) {
    vec.push_back(new boost::thread(my_fct));
}
// all are running, now wait for them to finish:
for (int i = 0; i < N; ++i) {
    vec[i]->join();
    delete vec[i];
}
But I want Boost to transparently set a maximum of, say, 4 threads running at a time. (I'm sharing an 8-core machine, so I'm not supposed to run more than 4 at a time.)
Of course, I could take care of starting only 4 at a time myself, but the solution I'm asking about would be more transparent and most convenient.
I don't think Boost.Thread has this built in, but you can layer Boost.Threadpool (not an official library) on top of Boost.Thread, and that does allow you to control the thread count via its SizePolicy.
The default is a fixed-size pool, which is what you want: specify the initial (and ongoing) thread count in the threadpool constructor.
What you really want, it seems, would be to only ever have 4 threads, each of which would process many jobs.
One way to implement this would be to spawn as many threads as you like, and then have the run loop of each thread take tasks (typically function objects or pointers) from a thread-safe queue structure holding everything that needs to be done.
This way you avoid the overhead from creating lots of threads, and still maintain the same amount of concurrency.
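A hedged sketch of that idea (a deliberately minimal queue: fill it with all N tasks first, then let a fixed number of worker threads drain it; the class and its names are illustrative, not part of Boost):

#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <boost/function.hpp>
#include <queue>

class SimpleTaskQueue {
public:
    void push(boost::function<void()> task) {
        boost::lock_guard<boost::mutex> lock(m_);
        tasks_.push(task);
    }
    // Run tasks until the queue is empty (enough for a "fill, then drain" usage).
    void work() {
        for (;;) {
            boost::function<void()> task;
            {
                boost::lock_guard<boost::mutex> lock(m_);
                if (tasks_.empty()) return;
                task = tasks_.front();
                tasks_.pop();
            }
            task();
        }
    }
private:
    boost::mutex m_;
    std::queue<boost::function<void()> > tasks_;
};

// Usage: queue all N jobs, then run exactly 4 worker threads.
// SimpleTaskQueue q;
// for (int i = 0; i < N; ++i) q.push(&my_fct);
// boost::thread_group workers;
// for (int i = 0; i < 4; ++i) workers.create_thread(boost::bind(&SimpleTaskQueue::work, &q));
// workers.join_all();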
You could create a lock that can be acquired at most n times concurrently. Each thread would then have to acquire the lock (blocking) before processing, and release it when done.
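A minimal sketch of such a lock (essentially a counting semaphore built from a Boost mutex and condition variable; the class name is illustrative):

#include <boost/thread.hpp>

class CountingLock {
public:
    explicit CountingLock(int n) : slots_(n) {}
    void acquire() {
        boost::unique_lock<boost::mutex> lock(m_);
        while (slots_ == 0)
            cv_.wait(lock);          // block until another thread releases a slot
        --slots_;
    }
    void release() {
        boost::lock_guard<boost::mutex> lock(m_);
        ++slots_;
        cv_.notify_one();
    }
private:
    boost::mutex m_;
    boost::condition_variable cv_;
    int slots_;
};

// Each of the N thread functions would call acquire() before doing its work
// and release() afterwards, so at most n of them run at any one time.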