Windows API Thread Pool simple example - c++

[EDIT: thanks to MSalters answer and Raymond Chen's answer to InterlockedIncrement vs EnterCriticalSection/counter++/LeaveCriticalSection, the problem is solved and the code below is working properly. This should provide an interesting simple example of Thread Pool use in Windows]
I can't find a simple example of the following task. My program, for example, needs to increment the values in a huge std::vector by one, so I want to do that in parallel. It needs to do that many times over the lifetime of the program. I know how to do that using CreateThread at each call of the routine, but I can't work out how to replace CreateThread with the thread pool.
Here is what I do:
class Thread {
public:
Thread(){}
virtual void run() = 0 ; // I can inherit an "IncrementVectorThread"
};
class IncrementVectorThread: public Thread {
public:
IncrementVectorThread(int threadID, int nbThreads, std::vector<int> &vec) : id(threadID), nb(nbThreads), myvec(vec) { };
virtual void run() {
for (int i=(myvec.size()*id)/nb; i<(myvec.size()*(id+1))/nb; i++)
myvec[i]++; //and let's assume myvec is properly sized
}
int id, nb;
std::vector<int> &myvec;
};
class ThreadGroup : public std::vector<Thread*> {
public:
ThreadGroup() {
pool = CreateThreadpool(NULL);
InitializeThreadpoolEnvironment(&cbe);
cleanupGroup = CreateThreadpoolCleanupGroup();
SetThreadpoolCallbackPool(&cbe, pool);
SetThreadpoolCallbackCleanupGroup(&cbe, cleanupGroup, NULL);
threadCount = 0;
}
~ThreadGroup() {
CloseThreadpool(pool);
}
PTP_POOL pool;
TP_CALLBACK_ENVIRON cbe;
PTP_CLEANUP_GROUP cleanupGroup;
volatile long threadCount;
} ;
static VOID CALLBACK runFunc(
PTP_CALLBACK_INSTANCE Instance,
PVOID Context,
PTP_WORK Work) {
ThreadGroup &thread = *((ThreadGroup*) Context);
long id = InterlockedIncrement(&(thread.threadCount));
DWORD tid = (id-1)%thread.size();
thread[tid]->run();
}
void run_threads(ThreadGroup* thread_group) {
SetThreadpoolThreadMaximum(thread_group->pool, thread_group->size());
SetThreadpoolThreadMinimum(thread_group->pool, thread_group->size());
TP_WORK *worker = CreateThreadpoolWork(runFunc, (void*) thread_group, &thread_group->cbe);
thread_group->threadCount = 0;
for (int i=0; i<thread_group->size(); i++) {
SubmitThreadpoolWork(worker);
}
WaitForThreadpoolWorkCallbacks(worker,FALSE);
CloseThreadpoolWork(worker);
}
int main() {
ThreadGroup group;
std::vector<int> vec(10000, 0);
for (int i=0; i<10; i++)
group.push_back(new IncrementVectorThread(i, 10, vec));
run_threads(&group);
run_threads(&group);
run_threads(&group);
// now, vec should be == std::vector<int>(10000, 3);
}
So, if I understood correctly:
- the command CreateThreadpool creates a bunch of Threads (hence, the call to CreateThreadpoolWork is cheap as it doesn't call CreateThread)
- I can have as many thread pools as I want (if I want to do a thread pool for "IncrementVector" and one for my "DecrementVector" threads, I can).
- if I need to divide my "increment vector" task into 10 threads, instead of calling CreateThread 10 times, I create a single "worker" and submit it 10 times to the thread pool with the same parameter (hence, I need the thread ID in the callback to know which part of my std::vector to increment). Here I couldn't find the thread ID, since the function GetCurrentThreadId() returns the real ID of the thread (i.e., something like 1528, not something between 0 and nb_launched_threads).
Finally, I am not sure I understood the concept correctly: do I really need a single worker rather than 10 if I split my std::vector across 10 threads?
Thanks!

You're roughly right up to the last point.
The whole idea about a thread pool is that you don't care how many threads it has. You just throw a lot of work into the thread pool, and let the OS determine how to execute each chunk.
So, if you create and submit 10 chunks, the OS may use between 1 and 10 threads from the pool.
You should not care about those thread identities. Don't bother with thread IDs, minimum or maximum number of threads, or stuff like that.
If you don't care about thread identities, then how do you manage what part of the vector to change? Simple. Before creating the threadpool, initialize a counter to zero. In the callback function, call InterlockedIncrement to retrieve and increment the counter. For each submitted work item, you'll get a consecutive integer.
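The counter idea can be sketched in portable C++11. This is only an illustration under assumptions: std::atomic's fetch_add stands in for InterlockedIncrement, plain std::thread objects stand in for submitted work items, and increment_in_chunks is a hypothetical name. Note that fetch_add returns the value before the increment, whereas InterlockedIncrement returns the value after it (hence the id-1 in the question's callback).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: adds 1 to every element, splitting the vector into
// nb_chunks pieces. Each work item claims the next free chunk index from a
// shared counter, so no item needs to know which thread runs it.
void increment_in_chunks(std::vector<int>& vec, std::size_t nb_chunks) {
    assert(nb_chunks > 0);
    std::atomic<std::size_t> next_chunk(0);

    auto work_item = [&]() {
        // fetch_add returns the previous value: 0, 1, 2, ... exactly once each.
        const std::size_t id = next_chunk.fetch_add(1);
        const std::size_t begin = vec.size() * id / nb_chunks;
        const std::size_t end = vec.size() * (id + 1) / nb_chunks;
        for (std::size_t i = begin; i < end; ++i) ++vec[i];
    };

    // Stand-in for submitting nb_chunks work items to a pool.
    std::vector<std::thread> items;
    for (std::size_t i = 0; i < nb_chunks; ++i) items.emplace_back(work_item);
    for (std::thread& t : items) t.join();
}
```

Calling increment_in_chunks(vec, 10) three times on a zero-filled vector should leave every element at 3, whichever thread happened to pick up which chunk.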

Related

Synchronization technique to wait till all objects have been processed

In this code, I am first creating a thread that keeps running always. Then I am creating objects and adding them one by one to a queue. The thread picks up object from queue one by one processes them and deletes them.
class MyClass
{
public:
MyClass();
~MyClass();
void Process();
};
std::queue<class MyClass*> MyClassObjQueue;
void ThreadFunctionToProcessAndDeleteObjectsFromQueue()
{
while(1)
{
// Get and Process and then Delete Objects one by one from MyClassObjQueue.
}
}
int main()
{
CreateThread (ThreadFunctionToProcessAndDeleteObjectsFromQueue);
int N = GetNumberOfObjects(); // Call some function that gets value of number of objects
// Create objects and queue them
for (int i=0; i<N; i++)
{
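MyClass* obj = NULL; // declared outside the try block so the handler can see it
try
{
obj = new MyClass;
MyClassObjQueue.push(obj);
}
catch(std::bad_alloc&)
{
// obj is non-NULL only if push() threw after new succeeded
delete obj;
}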
}
// Wait till all objects have been processed and destroyed (HOW ???)
}
PROBLEM:
I am not sure how to wait until all objects have been processed before I quit. One way is to keep checking the size of the queue periodically with a while(1) loop and Sleep, but I think that is a novice way to do it. I really want to do it in an elegant way using thread synchronization objects (e.g. a semaphore) so that a synchronization function waits for all objects to finish, but I am not sure how to do that. Any input will be appreciated.
(Note: I've not used synchronization objects to add/delete from queue in the code above. This is only to keep the code simple & readable. I know STL containers are not thread safe)
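One possible shape for this, sketched with C++11 primitives rather than Win32 objects: the producer closes the queue when it has queued everything, the consumer exits once the queue is closed and empty, and the main thread simply joins the consumer. The names Job, JobQueue and process_all are hypothetical, and Job is a stand-in for MyClass.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Job { int value; };  // hypothetical stand-in for MyClass

class JobQueue {
public:
    void push(Job* job) {
        { std::lock_guard<std::mutex> lock(m_); jobs_.push(job); }
        cv_.notify_one();
    }
    // Called by the producer once no more jobs will be queued.
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until a job is available; returns nullptr once the queue is
    // closed and empty.
    Job* pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return closed_ || !jobs_.empty(); });
        if (jobs_.empty()) return nullptr;
        Job* job = jobs_.front();
        jobs_.pop();
        return job;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Job*> jobs_;
    bool closed_ = false;
};

// Queues n jobs, then waits until the consumer has processed and deleted
// every one of them. Returns how many jobs were processed.
int process_all(int n) {
    assert(n >= 0);
    JobQueue queue;
    int processed = 0;  // written only by the consumer, read after join()
    std::thread consumer([&] {
        while (Job* job = queue.pop()) {
            ++processed;  // "process" the object
            delete job;
        }
    });
    for (int i = 0; i < n; ++i) queue.push(new Job{i});
    queue.close();
    consumer.join();  // returns only after the queue has been drained
    return processed;
}
```

The join is the wait: no polling and no Sleep are needed, because the consumer cannot finish before it has seen both the close flag and an empty queue.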

Static Class variable for Thread Count in C++

I am writing a thread based application in C++. The following is sample code showing how I am checking the thread count. I need to ensure that at any point in time, there are only 20 worker threads spawned from my application:
#include<stdio.h>
using namespace std;
class ThreadWorkerClass
{
private:
static int threadCount;
public:
ThreadWorkerClass() // constructors have no return type
{
threadCount ++;
}
static int getThreadCount()
{
return threadCount;
}
void run()
{
/* The worker thread execution
* logic is to be written here */
//Reduce count by 1 as worker thread would finish here
threadCount --;
}
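};
// A static data member needs exactly one out-of-class definition.
int ThreadWorkerClass::threadCount = 0;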
int main()
{
while(1)
{
ThreadWorkerClass twObj;
//Use Boost to start Worker Thread
//Assume max 20 worker threads need to be spawned
if(ThreadWorkerClass::getThreadCount() <= 20)
boost::thread *wrkrThread = new boost::thread(
&ThreadWorkerClass::run,&twObj);
else
break;
}
//Wait for the threads to join
//Something like (*wrkrThread).join();
return 0;
}
Will this design require me to take a lock on the variable threadCount? Assume that I will be running this code in a multi-processor environment.
The design is not good enough. The problem is that you exposed the constructor, so whether you like it or not, people will be able to create as many instances of your object as they want. You should do some sort of thread pooling, i.e. have a class that maintains a pool of threads and hands one out when it is available. Something like:
class MyThreadClass {
public:
void release() {
//the method obtaining that thread is responsible for returning it
}
};
class ThreadPool {
//create 20 instances of your Threadclass
public:
//This is a blocking function
MyThreadClass getInstance() {
//if a thread from the pool is free give it, else wait
}
};
So everything is maintained internally by the pooling class. Never give control over that class to others. You can also add query functions to the pooling class, like hasFreeThreads(), numFreeThreads(), etc.
You can also enhance this design by handing out a smart pointer, so you can track how many callers still own the thread.
Making the caller that obtains the thread responsible for releasing it is sometimes dangerous: a caller can crash and never give the thread back. There are many solutions to that; the simplest is to keep a timer on each thread and take the thread back by force when the time runs out.
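A minimal sketch of the blocking hand-out described above, assuming C++11. It does not pool thread objects; it only caps how many workers may be between acquire() and release() at once, which is the part of the design that replaces the unsynchronized threadCount. SlotLimiter and max_in_flight are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical limiter: at most `limit` slots can be held at any time.
class SlotLimiter {
public:
    explicit SlotLimiter(int limit) : free_(limit) { assert(limit > 0); }
    // Blocks until a slot is free, then takes it.
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return free_ > 0; });
        --free_;
    }
    // Gives a slot back and wakes one waiting caller.
    void release() {
        { std::lock_guard<std::mutex> lock(m_); ++free_; }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int free_;
};

// Launches `tasks` short workers, never letting more than `limit` of them
// hold a slot at once. Returns the highest number of workers that were
// observed running at the same time.
int max_in_flight(int tasks, int limit) {
    SlotLimiter limiter(limit);
    std::mutex stats_mutex;
    int running = 0;
    int peak = 0;
    std::vector<std::thread> threads;
    for (int i = 0; i < tasks; ++i) {
        limiter.acquire();  // the spawning loop blocks here when all slots are taken
        threads.emplace_back([&limiter, &stats_mutex, &running, &peak] {
            {
                std::lock_guard<std::mutex> lock(stats_mutex);
                ++running;
                peak = std::max(peak, running);
            }
            std::this_thread::yield();  // placeholder for the real work
            {
                std::lock_guard<std::mutex> lock(stats_mutex);
                --running;
            }
            limiter.release();  // the worker returns its slot when done
        });
    }
    for (std::thread& t : threads) t.join();
    return peak;
}
```

With a limit of 20 this gives the "no more than 20 at a time" guarantee from the question, and the count is only ever touched under the mutex.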

Safe multi-thread counter increment

For example, I've got some work that is computed simultaneously by multiple threads.
For demonstration purposes the work is performed inside a while loop. In a single iteration each thread performs its own portion of the work, before the next iteration begins a counter should be incremented once.
My problem is that the counter is updated by each thread.
As this seems like a relatively simple thing to want to do, I presume there is a 'best practice' or common way to go about it?
Here is some sample code to illustrate the issue and help the discussion along.
(I'm using Boost threads.)
class someTask {
public:
int mCounter; //initialized to 0
int mTotal; //initialized to i.e. 100000
boost::mutex cntmutex;
int getCount()
{
boost::mutex::scoped_lock lock( cntmutex );
return mCounter;
}
void process( int thread_id, int numThreads )
{
while ( getCount() < mTotal )
{
// The main task is performed here and is divided
// into sub-tasks based on the thread_id and numThreads
// Wait for all thread to get to this point
cntmutex.lock();
mCounter++; // < ---- how to ensure this is only updated once?
cntmutex.unlock();
}
}
};
The main problem I see here is that you reason at a too-low level. Therefore, I am going to present an alternative solution based on the new C++11 thread API.
The main idea is that you essentially have a schedule -> dispatch -> do -> collect -> loop routine. In your example you try to reason about all this within the do phase which is quite hard. Your pattern can be much more easily expressed using the opposite approach.
First we isolate the work to be done in its own routine:
void process_thread(size_t id, size_t numThreads) {
// do something
}
Now, we can easily invoke this routine:
#include <future>
#include <thread>
#include <vector>
void process(size_t const total, size_t const numThreads) {
for (size_t count = 0; count != total; ++count) {
std::vector< std::future<void> > results;
// Create all threads, launch the work!
for (size_t id = 0; id != numThreads; ++id) {
results.push_back(std::async(process_thread, id, numThreads));
}
// The destruction of `std::future`
// requires waiting for the task to complete (*)
}
}
(*) See this question.
You can read more about std::async here, and a short introduction is offered here (they appear to be somewhat contradictory on the effect of the launch policy, oh well). It is simpler here to let the implementation decide whether or not to create OS threads: it can adapt depending on the number of available cores.
Note how the code is simplified by removing shared state. Because the threads share nothing, we no longer have to worry about synchronization explicitly!
You protected the counter with a mutex, ensuring that no two threads can access the counter at the same time. Your other option would be to use Boost.Atomic, C++11 atomic operations, or platform-specific atomic operations.
However, your code seems to access mCounter without holding the mutex:
while ( mCounter < mTotal )
That's a problem. You need to hold the mutex to access the shared state.
You may prefer to use this idiom:
1. Acquire lock.
2. Do tests and other things to decide whether we need to do work or not.
3. Adjust accounting to reflect the work we've decided to do.
4. Release lock. Do work. Acquire lock.
5. Adjust accounting to reflect the work we've done.
6. Loop back to step 2 unless we're totally done.
7. Release lock.
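The idiom above can be sketched with C++11's std::unique_lock, which allows unlocking and relocking inside the loop. This is an illustration under assumptions, not the asker's code: Accounting, worker and run_workers are hypothetical names, and the "work" is a placeholder that squares an index.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared bookkeeping, touched only while holding `m`.
struct Accounting {
    std::mutex m;
    int next = 0;       // next unit of work to hand out
    int completed = 0;  // units finished so far
};

void worker(Accounting& acc, int total, std::vector<int>& out) {
    std::unique_lock<std::mutex> lock(acc.m);  // acquire lock
    while (acc.next < total) {                 // decide, under the lock
        const int unit = acc.next++;           // account for the claimed work
        lock.unlock();                         // release lock
        out[unit] = unit * unit;               // do the work with no lock held
        lock.lock();                           // acquire lock again
        ++acc.completed;                       // account for the finished work
    }
}  // lock released when `lock` goes out of scope

// Runs `total` units of work on `num_threads` threads and returns how many
// units were recorded as completed.
int run_workers(int total, int num_threads) {
    assert(total >= 0 && num_threads > 0);
    Accounting acc;
    std::vector<int> out(total, -1);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back(worker, std::ref(acc), total, std::ref(out));
    for (std::thread& t : threads) t.join();
    return acc.completed;
}
```

Each unit is claimed by exactly one thread because the test and the increment of next happen under the same lock acquisition, so the completed count ends up equal to total.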
You need to use a message-passing solution. This is more easily enabled by libraries like TBB or PPL. PPL is included for free in Visual Studio 2010 and above, and TBB can be downloaded for free under a FOSS licence from Intel.
concurrent_queue<unsigned int> done;
std::vector<Work> work;
// fill work here
parallel_for(0, work.size(), [&](unsigned int i) {
processWorkItem(work[i]);
done.push(i);
});
It's lockless and you can have an external thread monitor the done variable to see how much, and what, has been completed.
I would like to disagree with David on doing multiple lock acquisitions to do the work.
Mutexes are expensive. When many threads contend for a mutex, acquiring it falls back to a system call, which costs a user-space to kernel-space context switch and puts the calling thread(s) to sleep, so the overhead is considerable.
So if you are using a multiprocessor system, I would strongly recommend using spin locks instead [1].
So what I would do is:
=> Get rid of the scoped lock acquisition to check the condition.
=> Make your counter volatile to support the above.
=> In the while loop, do the condition check again after acquiring the lock.
class someTask {
public:
volatile int mCounter; //initialized to 0 : Make your counter Volatile
int mTotal; //initialized to i.e. 100000
boost::mutex cntmutex;
void process( int thread_id, int numThreads )
{
while ( mCounter < mTotal ) //compare without acquiring lock
{
// The main task is performed here and is divided
// into sub-tasks based on the thread_id and numThreads
cntmutex.lock();
//Now compare again to make sure that the condition still holds
//This would save all those acquisitions and lock release we did just to
//check whether the condition was true.
if(mCounter < mTotal)
{
mCounter++;
}
cntmutex.unlock();
}
}
};
[1]http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock
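One caveat for C++11 and later: volatile does not make an unlocked read of a shared variable well-defined, because reading mCounter while another thread writes it is a data race. If the goal is to check the counter without taking a lock, std::atomic is one way to get that. The sketch below is a substitution for the mutex-protected recheck, not this answer's method; count_to is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Counts from 0 to `total` using `num_threads` threads. Each successful
// compare-exchange claims one iteration, so the counter never passes `total`.
// Returns the final counter value.
int count_to(int total, int num_threads) {
    assert(num_threads > 0);
    std::atomic<int> counter(0);

    auto worker = [&counter, total] {
        int seen = counter.load();  // lock-free read, no data race
        while (seen < total) {
            // Succeeds only if the counter still equals `seen`; on failure
            // `seen` is refreshed with the current value and we retry.
            if (counter.compare_exchange_weak(seen, seen + 1)) {
                // This thread owns iteration `seen`: the per-iteration work
                // would go here.
                seen = counter.load();
            }
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
    return counter.load();
}
```

The check and the increment are a single atomic step here, so there is no window between "compare" and "increment" for another thread to slip into.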

Boost threads running serially, not in parallel

I'm a complete newbie to multi-threading in C++, and decided to start with the Boost Libraries. Also, I'm using Intel's C++ Compiler (from Parallel Studio 2011) with VS2010 on Vista.
I'm coding a genetic algorithm, and want to exploit the benefits of multi-threading: I want to create a thread for each individual (object) in the population, in order for them to calculate their fitness (heavy operations) in parallel, to reduce total execution time.
As I understand it, whenever I launch a child thread it starts working "in the background", and the parent thread continues to execute the next instruction, right? So, I thought of creating and launching all the child threads I need (in a for loop), and then waiting for them to finish (calling each thread's join() in another for loop) before continuing.
The problem I'm facing is that the first loop won't continue to the next iteration until the newly created thread is done working. Then, the second loop is as good as gone, since all the threads are already joined by the time that loop is hit.
Here are (what I consider to be) the relevant code snippets. Tell me if there is anything else you need to know.
class Poblacion {
// Constructors, destructor and other members
// ...
list<Individuo> _individuos;
void generaInicial() { // This method sets up the initial population.
int i;
// First loop
for(i = 0; i < _tamano_total; i++) {
Individuo nuevo(true);
nuevo.Start(); // Create and launch new thread
_individuos.push_back(nuevo);
}
// Second loop
list<Individuo>::iterator it;
for(it = _individuos.begin(); it != _individuos.end(); it++) {
it->Join();
}
_individuos.sort();
}
};
And, the threaded object Individuo:
class Individuo {
private:
// Other private members
// ...
boost::thread _hilo;
public:
// Other public members
// ...
void Start() {
_hilo = boost::thread(&Individuo::Run, this);
}
void Run() {
// These methods operate with/on each instance's own attributes,
// so they *can't* be static
generaHoc();
calculaAptitud();
borraArchivos();
}
void Join() {
if(_hilo.joinable()) _hilo.join();
}
};
Thank you! :D
If that's your real code then you have a problem.
for(i = 0; i < _tamano_total; i++) {
Individuo nuevo(true);
nuevo.Start(); // Create and launch new thread
_individuos.push_back(nuevo);
}
void Start() {
_hilo = boost::thread(&Individuo::Run, this);
}
This code creates a new Individuo object on the stack, then starts a thread, passing the this pointer of that stack object to the new thread. It then copies that object into the list and promptly destroys the stack object, leaving a dangling pointer in the new thread. This gives you undefined behaviour.
Since list never moves an object in memory once it has been inserted, you could start the thread after inserting into the list:
for(i = 0; i < _tamano_total; i++) {
_individuos.push_back(Individuo(true)); // add new entry to list
_individuos.back().Start(); // start a thread for that entry
}
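The same ordering can be sketched with std::thread and std::list::emplace_back, which constructs the element in place so no copy is made at all. This is a minimal illustration under assumptions: Worker is a hypothetical stand-in for Individuo, and its Run() just stores a constant.

```cpp
#include <cassert>
#include <list>
#include <thread>

// Hypothetical stand-in for Individuo: owns the thread that works on it.
class Worker {
public:
    Worker() : result_(0) {}
    void Start() { thread_ = std::thread(&Worker::Run, this); }
    void Join() { if (thread_.joinable()) thread_.join(); }
    int result() const { return result_; }
private:
    void Run() { result_ = 42; }  // placeholder for the heavy computation
    int result_;
    std::thread thread_;
};

// Starts n workers that run concurrently, waits for all of them, and
// returns the sum of their results.
int run_all(int n) {
    assert(n >= 0);
    std::list<Worker> workers;
    for (int i = 0; i < n; ++i) {
        workers.emplace_back();   // construct directly inside the list
        workers.back().Start();   // `this` now points at the list element,
                                  // whose address stays fixed while it is in the list
    }
    int sum = 0;
    for (Worker& w : workers) {
        w.Join();
        sum += w.result();
    }
    return sum;
}
```

Because Worker holds a std::thread it is not copyable, so the compiler would reject the original push_back-of-a-started-object pattern instead of silently producing a dangling pointer.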

Avoiding multiple thread spawns in pthreads

I have an application that is parallelized using pthreads. The application calls a routine iteratively, and the routine spawns threads (pthread_create and pthread_join) to parallelize its computation-intensive section. When I use an instrumentation tool like PIN to collect statistics, the tool reports statistics for several threads (number of threads x number of iterations). I believe this is because a new set of threads is spawned each time the routine is called.
How can I ensure that I create the threads only once and that all successive calls use the threads that were created first?
When I do the same with OpenMP and then try to collect the statistics, I see that the threads are created only once. Is it because of the OpenMP runtime?
EDIT:
I'm just giving a simplified version of the code.
int main()
{
//some code
do {
compute_distance(objects,clusters, &delta); //routine with pthread
} while (delta > threshold);
}
void compute_distance(double **objects,double *clusters, double *delta)
{
//some code again
//computation moved to a separate parallel routine..
for (i=0; i<nthreads; i++)
pthread_create(&thread[i],&attr,parallel_compute_phase,(void*)&ip);
for (i=0; i<nthreads; i++)
rc = pthread_join(thread[i], &status);
}
I hope this clearly explains the problem.
How do we save the thread id and test if was already created?
You can make a simple thread pool implementation which creates the threads once and puts them to sleep. When a thread is required, instead of calling pthread_create, you ask the thread pool subsystem to pick up a thread and do the required work. This also gives you control over the number of threads.
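A minimal sketch of such a pool, written with C++11 threads rather than raw pthreads and under stated assumptions: WorkerSet is a hypothetical class, every call to run() executes the given function once on each worker, and the caller blocks until all workers have finished that round. The threads are created in the constructor and reused by every run() call.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical fixed-size pool: threads are created once and sleep on a
// condition variable between rounds.
class WorkerSet {
public:
    explicit WorkerSet(int n) {
        assert(n >= 0);
        for (int id = 0; id < n; ++id)
            threads_.emplace_back(&WorkerSet::loop, this, id);
    }
    ~WorkerSet() {
        { std::lock_guard<std::mutex> lock(m_); stop_ = true; }
        wake_.notify_all();
        for (std::thread& t : threads_) t.join();
    }
    // Runs fn(id) once on every worker and returns when all have finished.
    void run(const std::function<void(int)>& fn) {
        std::unique_lock<std::mutex> lock(m_);
        fn_ = &fn;
        pending_ = static_cast<int>(threads_.size());
        ++round_;                // a new round number wakes the workers
        wake_.notify_all();
        done_.wait(lock, [this] { return pending_ == 0; });
    }
private:
    void loop(int id) {
        unsigned seen = 0;       // last round this worker took part in
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            wake_.wait(lock, [&] { return stop_ || round_ != seen; });
            if (stop_) return;
            seen = round_;
            const std::function<void(int)>* fn = fn_;
            lock.unlock();
            (*fn)(id);           // the parallel phase, outside the lock
            lock.lock();
            if (--pending_ == 0) done_.notify_one();
        }
    }
    std::mutex m_;
    std::condition_variable wake_, done_;
    std::vector<std::thread> threads_;
    const std::function<void(int)>* fn_ = nullptr;
    unsigned round_ = 0;
    int pending_ = 0;
    bool stop_ = false;
};
```

In the question's loop, a WorkerSet would be constructed once before the do/while, and compute_distance would call run() in place of the pthread_create/pthread_join pair, so an instrumentation tool should see only the original set of threads.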
An easy thing you can do with minimal code changes is to write some wrappers for pthread_create and _join. Basically you can do something like:
typedef struct {
volatile int go;
volatile int done;
pthread_t h;
void* (*fn)(void*);
void* args;
} pthread_w_t;
void* pthread_w_fn(void* args) {
pthread_w_t* p = (pthread_w_t*)args;
// just let the thread be killed at the end
for(;;) {
while (!p->go) { pthread_yield(); }; // yields are good
p->go = 0; // don't want to go again until told to
p->fn(p->args);
p->done = 1;
}
}
int pthread_create_w(pthread_w_t* th, pthread_attr_t* a,
void* (*fn)(void*), void* args) {
if (!th->h) {
th->done = 0;
th->go = 0;
th->fn = fn;
th->args = args;
pthread_create(&th->h,a,pthread_w_fn,th);
}
th->done = 0; //make sure join won't return too soon
th->go = 1; //and let the wrapper function start the real thread code
return 0; //non-void function must return a value
}
int pthread_join_w(pthread_w_t*th) {
while (!th->done) { pthread_yield(); };
return 0;
}
and then you'll have to change your calls and pthread_ts, or create some #define macros to change pthread_create to pthread_create_w etc....and you'll have to init your pthread_w_ts to zero.
Messing with those volatiles can be troublesome, though. You'll probably need to spend some time getting my rough outline to actually work properly.
To ensure something that several threads might try to do only happens once, use pthread_once(). To ensure something only happens once that might be done by a single thread, just use a bool (likely one in static storage).
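A small sketch of the pthread_once() case, assuming a POSIX system. The once-routine here only counts its own invocations; in the question it would be the place to create the worker threads. The names create_workers_once, ensure_workers_created and init_calls_after are hypothetical.

```cpp
#include <cassert>
#include <pthread.h>
#include <thread>
#include <vector>

static pthread_once_t g_once = PTHREAD_ONCE_INIT;
static int g_init_calls = 0;  // written only inside the once-routine

// Hypothetical one-time setup, e.g. creating the worker threads.
static void create_workers_once() { ++g_init_calls; }

// Safe to call from any thread, any number of times: the setup routine
// runs exactly once, and every caller returns only after it has completed.
void ensure_workers_created() { pthread_once(&g_once, create_workers_once); }

// Calls ensure_workers_created() from several threads at once and returns
// how many times the setup routine has run in total.
int init_calls_after(int num_threads) {
    assert(num_threads > 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back(ensure_workers_created);
    for (std::thread& t : threads) t.join();
    return g_init_calls;
}
```

However many threads race into ensure_workers_created(), and however many times the routine is called afterwards, the setup count stays at one.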
Honestly, it would be far easier to answer your question for everyone if you would edit your question – not comment, since that destroys formatting – to contain the real code in question, including the OpenMP pragmas.