Creating parallel threads using the Win32 API - C++

Here is the problem:
I have two sparse matrices described as vector of triplets.
The task is to write a multiplication function for them using parallel processing with the Win32 API. So I need to know how to:
1) Create a thread with the Win32 API
2) Pass input parameters to it
3) Get the return value.
Thanks in advance!
Edit: "Process" changed for "Thread"

Well, the answer to your question as originally asked is CreateProcess and GetExitCodeProcess.
But the solution to your problem isn't another process at all; it's more threads. And OpenMP is probably a much more suitable mechanism than creating your own threads.
If you have to use the Win32 API directly for threads, the process is something like:
Build a work item descriptor by allocating some memory, storing pointers to the real data, indexes for what this thread is going to work on, etc. Use a structure to keep this organized.
Call CreateThread and pass the address of the work item descriptor.
In your thread procedure, cast the pointer back to a structure pointer, access your work item descriptor, and process the data.
In your main thread, call WaitForMultipleObjects to join with the worker threads.
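Putting those steps together, a minimal sketch (the WorkItem layout, Worker, and the doubling loop are illustrative stand-ins, not your matrix code):

#include <windows.h>
#include <vector>

// Illustrative work-item descriptor: pointers to the shared data plus the
// index range this particular thread is responsible for.
struct WorkItem {
    const int* input;
    int*       output;
    size_t     begin;
    size_t     end;
};

// Thread procedure: cast the LPVOID back to the descriptor and do the work.
DWORD WINAPI Worker(LPVOID param)
{
    WorkItem* item = static_cast<WorkItem*>(param);
    for (size_t i = item->begin; i < item->end; ++i)
        item->output[i] = item->input[i] * 2;   // stand-in for the real work
    return 0;   // retrievable via GetExitCodeThread if you need it
}

int main()
{
    const int    kThreads = 4;
    const size_t kSize    = 1000;
    std::vector<int> in(kSize, 1), out(kSize);
    WorkItem items[kThreads];
    HANDLE   threads[kThreads];

    for (int t = 0; t < kThreads; ++t) {
        items[t].input  = in.data();
        items[t].output = out.data();
        items[t].begin  = t * kSize / kThreads;
        items[t].end    = (t + 1) * kSize / kThreads;
        threads[t] = CreateThread(NULL, 0, Worker, &items[t], 0, NULL);
    }
    // Join with all workers, then clean up the handles.
    WaitForMultipleObjects(kThreads, threads, TRUE, INFINITE);
    for (int t = 0; t < kThreads; ++t)
        CloseHandle(threads[t]);
    return 0;
}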
For even greater efficiency, you can use the Windows thread pool and call QueueUserWorkItem. But while you won't have to create threads yourself, you'd then need event handles to join tasks back to the main thread. It's about the same amount of code I suspect.
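For comparison, a sketch of the thread-pool route, using one event handle per task to join back to the main thread (Task and TaskProc are made-up names):

#include <windows.h>

struct Task {
    HANDLE done;   // event signaled when this task completes
    int    id;
};

DWORD WINAPI TaskProc(LPVOID param)
{
    Task* t = static_cast<Task*>(param);
    // ... do the work for task t->id ...
    SetEvent(t->done);   // let the main thread know this task is finished
    return 0;
}

int main()
{
    const int kTasks = 4;
    Task   tasks[kTasks];
    HANDLE events[kTasks];
    for (int i = 0; i < kTasks; ++i) {
        events[i] = CreateEvent(NULL, TRUE, FALSE, NULL);   // manual-reset
        tasks[i].done = events[i];
        tasks[i].id   = i;
        QueueUserWorkItem(TaskProc, &tasks[i], WT_EXECUTEDEFAULT);
    }
    // Join the pooled tasks back to the main thread via their events.
    WaitForMultipleObjects(kTasks, events, TRUE, INFINITE);
    for (int i = 0; i < kTasks; ++i)
        CloseHandle(events[i]);
    return 0;
}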

Related

C++: how to make an efficient shared array of resources for multiple threads

Description:
I have multiple threads (4-32). These threads can all access an array: int resources[1024].
The resources array contains distinct values (0-1023), and each resource (an int) exists exactly once. Each thread requires a different number of resources, which are at some point returned to the array. Threads can ask for resources more than once and may return only a portion of the acquired resources at a time. Each thread accesses this array via 2 methods: GetElement(), ReturnElement(int element)
GetElement(): this method locks a critical section, removes the last element from the resources array and returns it to the calling thread. Each thread calls the method in a loop to get n resources.
ReturnElement(int element): this method locks a critical section and appends the resource passed as a parameter after the last one in the array. Each thread calls the method in a loop to return n resources.
The current implementation is flawed in that when multiple threads are acquiring resources at once, none of them might get the required amount. I was thinking about locking access to the array for a single thread while it gets or returns the resources and then unlocking it. This approach blocks all the other threads, which might be an issue.
Is there a more efficient approach?
This problem is a variant of the dining philosophers problem, which is an area of active research. There are some solutions, but none are perfect and generally in such situations avoiding a deadlock in a generic and constant-time way is problematic.
Some general solution directions:
Use an arbitrator. This can be as simple as locking the entire resource allocation process globally or per-node (in case of NUMA), but can also be done in a lock-free way with prioritizing workers with a lower ID, for example.
Implement work stealing: reclaim partially allocated resources from another waiting worker, but with some unfairness built-in (e.g. never take from worker 0). This requires keeping track of partially allocated resources and some inter-thread (inter-process) communication.
Try-and-release: try to acquire resources, and if that fails, release the partially acquired ones and yield execution (wait a short period of time). Howard Hinnant did some research on this which can be found here.
Given your problem description, I would recommend a simple solution with central locking, or, better yet, try to avoid the problem in the first place. Given that you're developing a virtual memory allocator, there's some research specific to that, e.g. this one from jemalloc.
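To illustrate the central-locking direction: a sketch (ResourcePool, GetElements, and ReturnElements are made-up names) of an all-or-nothing batch acquire, which removes the failure mode where two threads each grab half of what the other needs:

#include <mutex>
#include <vector>

class ResourcePool {
    std::mutex       m_;
    std::vector<int> resources_;   // the 0..1023 pool from the question
public:
    ResourcePool() {
        for (int i = 0; i < 1024; ++i)
            resources_.push_back(i);
    }

    // Take n resources in one critical section, or none at all.
    // Returns an empty vector if fewer than n are free; the caller can
    // retry or yield instead of sitting on a partial set.
    std::vector<int> GetElements(size_t n) {
        std::lock_guard<std::mutex> lock(m_);
        if (resources_.size() < n)
            return std::vector<int>();
        std::vector<int> out(resources_.end() - n, resources_.end());
        resources_.resize(resources_.size() - n);
        return out;
    }

    void ReturnElements(const std::vector<int>& elems) {
        std::lock_guard<std::mutex> lock(m_);
        resources_.insert(resources_.end(), elems.begin(), elems.end());
    }
};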

Thread Safe Integer Array?

I have a situation where I have a legacy multi-threaded application I'm trying to move to a linux platform and convert into C++.
I have a fixed size array of integers:
int R[5000];
And I perform a lot of operations like:
R[5] = (R[10] + R[20]) / 50;
R[5]++;
I have one foreground task that mostly reads the values, but on occasion can update one. And then I have a background worker that is updating the values constantly.
I need to make this structure thread safe.
I would rather only update the value if the value has actually changed. The worker is constantly collecting data and doing calculation and storing the data whether it changes or not.
So should I create a custom class MyInt that wraps the structure, include an array of mutexes to lock for updating/reading each value, and then overload [], =, ++, +=, -=, etc.? Or should I try to implement an atomic integer array?
Any suggestions as to what that would look like? I'd like to try and keep the above notation for doing the updates...but I get that it might not be possible.
Thanks,
WB
The first thing to do is make the program work reliably, and the easiest way to do that is to have a Mutex that is used to control access to the entire array. That is, whenever either thread needs to read or write to anything in the array, it should do:
the_mutex.lock();
// do all the array-reads, calculations, and array-writes it needs to do
the_mutex.unlock();
... then test your program and see if it still runs fast enough for your needs. If so, you're done; that's all you need to do.
If you find that the program isn't fast enough due to contention on the mutex, you can start trying optimizations to make things faster. For example, if you know that your threads' operations will only need to work on local segments of the array at one time, you could create multiple mutexes, and assign different subsets of the array to each mutex (e.g. mutex #1 is used to serialize access to the first 100 array items, mutex #2 for the second 100 array items, etc). That will greatly decrease the chances of one thread having to wait for the other thread to release a mutex before it can continue.
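A sketch of that striping idea (StripedArray and the stripe size are arbitrary); note that a compound expression such as R[5] = (R[10] + R[20]) / 50 becomes three separately locked accesses, so the expression as a whole is still not atomic:

#include <mutex>

class StripedArray {
    static const int kSize   = 5000;
    static const int kStripe = 100;   // 100 array slots per mutex
    int        data_[kSize] = {};
    std::mutex locks_[kSize / kStripe];

    std::mutex& lock_for(int i) { return locks_[i / kStripe]; }
public:
    int get(int i) {
        std::lock_guard<std::mutex> g(lock_for(i));
        return data_[i];
    }
    void set(int i, int v) {
        std::lock_guard<std::mutex> g(lock_for(i));
        data_[i] = v;
    }
};

// Usage: R[5] = (R[10] + R[20]) / 50 becomes
//   a.set(5, (a.get(10) + a.get(20)) / 50);
// where each get/set takes only its own stripe's mutex.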
If things still aren't fast enough for you, you could then look into having two different arrays, one for each thread, and occasionally copying from one array to the other. That way each thread could safely access its own private array without any serialization needed. The copying operation would need to be handled carefully, probably using some sort of inter-thread message-passing protocol.
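One possible shape for the two-array idea, using a mutex-guarded snapshot as a simple stand-in for a full message-passing protocol (all names made up; the foreground's occasional writes would need a similar reverse path):

#include <algorithm>
#include <mutex>

// The worker mutates its own private copy freely; only the short
// publish/snapshot steps take the lock.
class DoubleBuffered {
    static const int kSize = 5000;
    int        shared_[kSize];   // last published snapshot
    std::mutex m_;
public:
    int worker_copy[kSize];      // touched only by the background worker

    DoubleBuffered() : shared_(), worker_copy() {}

    void publish() {             // worker: push its private copy out
        std::lock_guard<std::mutex> g(m_);
        std::copy(worker_copy, worker_copy + kSize, shared_);
    }
    void snapshot(int* dst) {    // foreground: pull the latest snapshot
        std::lock_guard<std::mutex> g(m_);
        std::copy(shared_, shared_ + kSize, dst);
    }
};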

Producer and Consumer Optimization

I am writing a C++ program in Qt that has an OnReceive(int value) event. It captures integer values and push_backs them into a std::vector. On another worker thread I have access to this vector, and I can set a semaphore to wait for 20 values before processing them.
I want to do some optimization.
My question is: how can I segment my buffer or vector into 3 parts (0-4, 5-10, 11-19) so that, for example, as soon as 5 values are available in the vector (e.g. 0 to 4), the second worker starts to process them while the first thread continues to collect the rest of the values?
This way I want to have an overlap between my threads, so they don't need to run serially.
Thank you.
Use a wait-free ring buffer.
Boost claims to have one
Note it is in the lockfree folder, but all methods claim to be thread safe and wait-free.
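For one producer (the OnReceive thread) and one consumer (the worker), boost::lockfree::spsc_queue is the natural fit. A minimal sketch, where on_receive and worker stand in for your Qt slot and worker loop:

#include <boost/lockfree/spsc_queue.hpp>

// Fixed-capacity single-producer/single-consumer ring buffer.
boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024> > ring;

// Producer side, e.g. called from the OnReceive(int value) event:
void on_receive(int value)
{
    while (!ring.push(value))
        ;   // ring full: in practice, drop the value or back off briefly
}

// Consumer side: the worker can start on the first few values while the
// producer is still collecting the rest, which gives you the overlap.
void worker()
{
    int v;
    for (;;) {
        while (ring.pop(v)) {
            // process v
        }
        // optionally yield or sleep briefly when the ring is empty
    }
}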

thread local and context switch

I've got some C++ code making use of thread-local storage; each thread has a vector it can push data into.
I use TLS to store an index ID per thread, which can be used to look up which vector to push data into. The code then executes a fair amount of work that pushes data into the vector.
What I'm wondering is whether it is possible that the OS might reschedule my code to execute on a different thread after it has acquired the pointer to the thread-local object. (So far the code executes fine and I have not seen this happen.) But if it were possible, it would seem certain to break my program, since two threads could then end up with the same object.
Assuming this is true, this seems like it would be a problem even for any code that uses TLS of any complexity, is TLS only intended for simple objects where you don't take the address?
Thanks!
Thread Local Storage is just that - storage per thread. Each thread has its own private data structure. That thread, whichever processor it runs on, is the same thread. The OS doesn't schedule work WITHIN threads; it schedules which of the threads runs.
Thread-local storage is accomplished by having some sort of indirection which is changed along with the thread itself. There are several ways to do this; for example, the OS may keep a particular page at a particular offset from the start of the process's virtual memory, and when a thread is scheduled, the page table is updated to match the thread.
In x86 processors, FS or GS is typically used for per-thread data, so the OS will switch the FS register [or the content of the register's base address in the case of 64-bit processors]. When reading TLS, the compiler will use the FS or GS segment register to prefix the memory read/write operations, and thus you always get "your private data", not some other thread's.
Of course, OS's may have bugs, but this is something quite a few things will rely on, so if it's broken, it would show up pretty soon (unless it's very subtle, and you have to be standing just in the right place, with the moon in the right phase, wearing the right colour clothes, and the wind in the right direction, the date divisibly by both 3 and 7, etc,etc).
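A quick way to convince yourself of the guarantee, as a standalone C++11 sketch rather than anything from the question's code:

#include <chrono>
#include <iostream>
#include <thread>

thread_local int slot = 0;   // one independent copy per thread

void run(int id)
{
    slot = id;   // write this thread's copy
    // Even if the OS preempts this thread here and resumes it on a
    // different core, 'slot' still names the same per-thread storage.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::cout << "thread " << id << " sees " << slot << "\n";   // always id
}

int main()
{
    std::thread a(run, 1), b(run, 2);
    a.join();
    b.join();
    return 0;
}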
TLS means thread-local. From your description, it sounds as if each thread accesses a shared vector of vectors through TLS (I'm not sure); if so, you should use some kind of lock. Any sample code?

thread building block combined with pthreads

I have a queue with elements which need to be processed. I want to process these elements in parallel. There will be some sections in the processing of each element which need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q

process_element(e)
{
    lock()
    some synchronized area
    // a matrix access is performed here, so a spin lock would do
    unlock()
    ...
    unsynchronized area
    ...
    if( condition )
    {
        new_element = generate_new_element()
        q.push(new_element)   // synchronized access to queue
    }
}

process_queue()
{
    while( elements in q )   // algorithm-is-finished condition
    {
        e = get_elem_from_queue(q)   // synchronized access to queue
        process_element(e)
    }
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to use the Intel TBB concurrent_queue as the queue container. But then, will I be able to use pthreads functions (mutexes, condition variables)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads threads at any point in time? I was thinking of creating the threads once and then, after one element is processed, accessing the queue to get the next element. However, it is more complicated than that, because an empty queue does not guarantee that the algorithm is finished.
My question
Before I start implementing, I'd like to know if there is an easy way to use Intel TBB or pthreads to obtain the behaviour I want; more precisely, processing elements from a queue in parallel.
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthread_mutex_lock and pthread_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthread mutexes for dequeueing elements as well
Termination is tricky - one way to do this is to add a TERMINATE element to the queue which, upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a designated thread dequeue it after all the threads are done.
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
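A compressed sketch of that recipe, with TERMINATE encoded as -1 and error handling, core binding, and the real process_element() omitted:

#include <pthread.h>
#include <queue>

static const int TERMINATE = -1;   // sentinel element (made-up encoding)

std::queue<int> q;
pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  q_cond  = PTHREAD_COND_INITIALIZER;

void enqueue(int e)
{
    pthread_mutex_lock(&q_mutex);
    q.push(e);
    pthread_cond_signal(&q_cond);
    pthread_mutex_unlock(&q_mutex);
}

void* worker(void*)
{
    for (;;) {
        pthread_mutex_lock(&q_mutex);
        while (q.empty())
            pthread_cond_wait(&q_cond, &q_mutex);
        int e = q.front();
        q.pop();
        pthread_mutex_unlock(&q_mutex);

        if (e == TERMINATE) {
            enqueue(TERMINATE);   // pass the sentinel on to the next worker
            return NULL;
        }
        // process_element(e) goes here; it may call enqueue() to add new
        // work, which still gets processed because the sentinel keeps
        // cycling to the back of the queue.
    }
}

int main()
{
    const int num_threads = 4;
    pthread_t threads[num_threads];
    for (int i = 0; i < num_threads; ++i)
        pthread_create(&threads[i], NULL, worker, NULL);

    enqueue(1);
    enqueue(2);
    enqueue(TERMINATE);   // only once no more external work will arrive
    for (int i = 0; i < num_threads; ++i)
        pthread_join(threads[i], NULL);
    // one extra TERMINATE is left in q, as described above
    return 0;
}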
TBB is compatible with other threading packages.
TBB also emphasizes scalability. When you port your program from a dual-core to a quad-core machine, you do not have to adjust it. With data-parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library, you have to decide how much control you need in your application, because it does offer flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance costs.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with a std::queue correctly without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, as process_element() creates and adds new elements to the work queue (the only caution is that newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing till after all "older" items). Also, you don't have to worry about termination: parallel_do completes automatically as soon as all initial queue items and all new items created on the fly are processed.
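A sketch of the parallel_do shape (classic TBB API; in current oneTBB the same idea is spelled tbb::parallel_for_each, which also accepts a feeder). Element, Body, and the toy condition are made up:

#include <tbb/parallel_do.h>
#include <deque>

struct Element { int value; };

struct Body {
    void operator()(Element e, tbb::parallel_do_feeder<Element>& feeder) const
    {
        // locked matrix access + unsynchronized work on e would go here
        if (e.value > 0)                        // the 'condition' branch
            feeder.add(Element{e.value - 1});   // add new work on the fly
    }
};

int main()
{
    std::deque<Element> initial;
    initial.push_back(Element{3});
    initial.push_back(Element{5});
    // Returns only when the initial items and everything added through
    // the feeder have all been processed; no manual termination logic.
    tbb::parallel_do(initial.begin(), initial.end(), Body());
    return 0;
}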
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.