Producer and Consumer Optimization - C++

I am writing a C++ program in Qt that has an OnReceive(int value) event. It captures integer values and push_backs them into a std::vector. On another worker thread I have access to this vector, and I can set a semaphore to wait for 20 values and then process them.
I want to do some optimization.
My question is: how can I segment my buffer or vector into 3 parts (e.g. 0-4, 5-10, 11-19) so that, for example, as soon as 5 values are available in the vector (say 0 to 4), the second worker starts to process them while the first thread continues to receive the rest of the values?
This way I want to have an overlap between my threads, so they don't need to run serially.
Thank you.

Use a wait-free ring buffer.
Boost claims to have one: boost::lockfree::spsc_queue.
Note that it lives in the lockfree folder, but all methods claim to be thread-safe and wait-free.
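Here is a minimal sketch of how this could look with boost::lockfree::spsc_queue, Boost's wait-free single-producer/single-consumer ring buffer; the handler name and batch sizes are taken from the question, the capacity is arbitrary:

    #include <boost/lockfree/spsc_queue.hpp>
    #include <vector>

    // Wait-free SPSC ring buffer holding up to 1024 ints.
    boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024>> queue;

    // Producer side: called from the Qt event.
    void OnReceive(int value)
    {
        while (!queue.push(value))  // push() fails only when the buffer is full
            ;                       // spin, or drop/log, depending on your needs
    }

    // Consumer side: the worker can start processing as soon as a few values
    // are available, instead of waiting for a full batch of 20.
    void worker()
    {
        std::vector<int> batch;
        int value;
        for (;;)
        {
            while (queue.pop(value))
                batch.push_back(value);
            if (batch.size() >= 5)  // process partial batches to get the overlap
            {
                // process(batch);
                batch.clear();
            }
        }
    }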

Related

Mutex granularity

I have a question regarding threads. It is known that basically when we call mutex.lock(), the thread keeps executing that part of the code uninterrupted by other threads until it reaches mutex.unlock(). (At least that's what they say in the book.) So my question is whether it is actually possible to have several scoped WriteLocks which do not interfere with each other.
If I have a buffer with N elements, without any new elements coming but with high-frequency updates (like changing the value of the Kth element), is it possible to set a different lock on each element, so that the only time threads would stall and wait is if 2 or more threads are actually trying to update the same element?
To answer your question about N mutexes: yes, that is indeed possible. What resources are protected by a mutex depends entirely on you as the user of that mutex.
This leads to the first (statement) part of your question. A mutex by itself does not guarantee that a thread will work uninterrupted. All it guarantees is MUTual EXclusion - if thread B attempts to lock a mutex which thread A has locked, thread B will block (execute no code) until thread A unlocks the mutex.
This means mutexes can be used to guarantee that a thread executes a block of code uninterrupted; but this works only if all threads follow the same mutex-locking protocol around that block of code. Which means it is your responsibility to assign semantics (or meaning) to each individual mutex, and correctly adhere to those semantics in your code.
If you decide that the semantics are "I have an array a of N data elements and an array m of N mutexes, and accessing a[i] can only be done when m[i] is locked," then that's how it will work.
The need to consistently stick to the same protocol is why you should generally encapsulate the mutex and the code/data protected by it in a class in some way or another, so that outside code doesn't need to know the details of the protocol. It just knows "call this member function, and the synchronisation will happen automagically." This "automagic" is the class correctly implementing the protocol.
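A minimal sketch of those semantics, with the protocol encapsulated in a class as suggested (the names and fixed size are illustrative):

    #include <array>
    #include <mutex>

    template <typename T, std::size_t N>
    class LockedArray
    {
        std::array<T, N> a;                   // the protected data
        mutable std::array<std::mutex, N> m;  // one mutex per element
    public:
        // The rule "a[i] may only be touched while m[i] is held" lives
        // entirely inside these member functions.
        void update(std::size_t i, const T& value)
        {
            std::lock_guard<std::mutex> guard(m[i]);
            a[i] = value;
        }
        T read(std::size_t i) const
        {
            std::lock_guard<std::mutex> guard(m[i]);
            return a[i];
        }
    };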
A crucial consideration when deciding between a mutex per array and a mutex per element is whether there are operations - like tracking the number of "in-use" array elements, the "active" element, or moving a pointer-to-array to a larger buffer - that can only be done safely by one thread while all the others are blocked.
A lesser but sometimes important consideration is the amount of extra memory more mutexes use.
If you genuinely need to do this kind of update as quickly as possible in a highly contested multi-threaded program, you may also want to learn about lock-free atomic types and their compare-and-swap / exchange operations, but I'd recommend against considering that unless profiling shows that the existing locking is significant in your overall program performance.
A mutex does not stop other threads from running completely, it only stops other threads from locking the same mutex. I.e. while one thread is keeping the mutex locked, the operating system continues to do context switches letting other threads run also, but if any other thread is trying to lock the same mutex its execution will be halted until the mutex is unlocked.
So yes, you can indeed have several different mutexes and lock/unlock them independently. Just beware of deadlocks: if one thread can lock more than one mutex at a time, you can run into a situation where thread 1 has locked mutex A and is trying to lock mutex B, but blocks because thread 2 already has mutex B locked and is trying to lock mutex A.
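When a thread really does need two mutexes at once, the standard way to avoid this deadlock is to acquire them in one atomic step. A minimal sketch using std::lock from C++11:

    #include <mutex>

    std::mutex a, b;

    void needs_both()
    {
        // std::lock acquires both mutexes without deadlocking, no matter
        // in which order other threads try to take them.
        std::lock(a, b);
        std::lock_guard<std::mutex> ga(a, std::adopt_lock);
        std::lock_guard<std::mutex> gb(b, std::adopt_lock);
        // ... touch the data protected by both mutexes ...
    }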
It's not completely clear what your use case is:
the threads get a buffer assigned on which they have to work
the threads have some results and request a specific buffer to update.
In the first variant you need some assignment logic that assigns a buffer to a thread.
This logic has to be executed in an atomic way, so the best option is to use a mutex to protect the assignment logic.
In the other variant it may be best to have a vector of mutexes, one for each buffer element.
In both cases the buffer itself does not need protection, because it (or better, each field of it) is only accessed by one thread at a time.
You may also want to read up on 'semaphores'. These contain a counter that allows managing resources that are available in a limited amount, but more than one. Mutexes are a special case of semaphores with n = 1.
You can have a mutex per entry; a C++11 mutex can easily be converted into an adaptive spinlock, so you can achieve a good CPU/latency tradeoff.
Or, if you need very low latency yet have enough CPUs you can use an atomic "busy" flag per entry and spin in a tight compare-exchange loop on contention.
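A minimal sketch of that per-entry busy flag, spinning in a tight compare-exchange loop (the entry type and count are illustrative):

    #include <array>
    #include <atomic>

    constexpr std::size_t N = 1024;
    std::array<int, N> entries;
    std::array<std::atomic<bool>, N> busy{};  // one "busy" flag per entry

    void update(std::size_t i, int value)
    {
        bool expected = false;
        // Spin until we flip the flag from false to true.
        while (!busy[i].compare_exchange_weak(expected, true,
                                              std::memory_order_acquire))
            expected = false;  // compare_exchange wrote the current value back
        entries[i] = value;    // we exclusively own the entry here
        busy[i].store(false, std::memory_order_release);
    }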
From experience, though, the best performance and scalability are achieved when concurrent writes are serialized via a command queue (or a queue of smaller immutable buffers to be concatenated at destination) and a single thread processing the queue.

Thread synchronization in Qt

I have a program that has 3 threads. All of them take data from Ethernet on different ports. The frequencies of the data coming in may be different for the 3 threads, but all of the incoming data must be processed at the same time.
So if data comes in for one thread, it must wait for the other threads' data to arrive. How can I achieve this?
Boost.Thread has a barrier class, whose purpose is to block multiple threads until a specified number have reached the barrier.
You would just create a boost::barrier initialized with 3, meaning that it blocks until three threads are waiting on the barrier. When each of your threads is done waiting for data, you have them call wait() on the barrier. When the third thread calls wait(), all three threads will continue execution.
#include <boost/thread/barrier.hpp>

boost::barrier barrier(3);

void thread_function()
{
    read_data();
    barrier.wait(); // Threads block here until all three are ready.
    process_data();
}
If you only want one thread to process the data, you can check the return value of wait(); the function will only return true for one of the threads at the barrier.
You need a barrier. A barrier has a preset capacity N and blocks the first N-1 threads that arrive until the N-th arrives; at that point all N threads are released simultaneously.
Unfortunately Qt has no direct support for barriers, but there is simple implementation using Qt primitives here: https://stackoverflow.com/a/9639624/1854587
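For reference, a minimal single-use barrier along those lines can be built from QMutex and QWaitCondition; this is a sketch of the idea, not the exact code behind the link:

    #include <QMutex>
    #include <QWaitCondition>

    class QtBarrier
    {
        QMutex mutex;
        QWaitCondition allArrived;
        int remaining;
    public:
        explicit QtBarrier(int count) : remaining(count) {}

        void wait()
        {
            QMutexLocker locker(&mutex);
            if (--remaining == 0)
                allArrived.wakeAll();         // last thread releases the others
            else
                while (remaining > 0)
                    allArrived.wait(&mutex);  // atomically unlocks while waiting
        }
    };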
Not as simple as Boost's barrier, as answered by @dauphic, but this can be done with Qt alone, using slots, signals and another class on a 4th thread.
Create a class on a separate thread that coordinates the other 3; the network threads can send a signal to the 'coordinator' class when they receive data. Once the coordinator class has received messages from all 3 network threads, it can signal the threads to process the data.
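A rough sketch of such a coordinator (the class and signal names are made up for illustration); each network thread's "data received" signal would be connected to dataReady() with a queued connection, so the counter is only ever touched on the coordinator's own thread and needs no lock:

    #include <QObject>

    class Coordinator : public QObject
    {
        Q_OBJECT
        int arrived = 0;
    public slots:
        void dataReady()             // one call per network thread
        {
            if (++arrived == 3)      // all three ports have delivered
            {
                arrived = 0;
                emit processData();  // tell the threads to go ahead
            }
        }
    signals:
        void processData();
    };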

Thread Building Blocks combined with pthreads

I have a queue with elements which need to be processed. I want to process these elements in parallel. There will be some sections on each element which need to be synchronized. At any point in time there can be at most num_threads running threads.
I'll provide a template to give you an idea of what I want to achieve.
queue q

process_element(e)
{
    lock()
    some synchronized area
    // a matrix access is performed here, so a spin lock would do
    unlock()
    ...
    unsynchronized area
    ...
    if( condition )
    {
        new_element = generate_new_element()
        q.push(new_element) // synchronized access to queue
    }
}

process_queue()
{
    while( elements in q ) // algorithm-is-finished condition
    {
        e = get_elem_from_queue(q) // synchronized access to queue
        process_element(e)
    }
}
I can use
pthreads
openmp
intel thread building blocks
Top problems I have
Make sure that at any point in time I have max num_threads running threads
Lightweight synchronization methods to use on queue
My plan is to use the Intel TBB concurrent_queue for the queue container. But then, will I be able to use pthreads functions (mutexes, condition variables)? Let's assume this works (it should). Then, how can I use pthreads to have at most num_threads threads at one point in time? I was thinking of creating the threads once and then, after one element is processed, accessing the queue to get the next element. However, it is more complicated, because I have no guarantee that the algorithm is finished just because there is no element in the queue.
My question
Before I start implementing, I'd like to know if there is an easy way to use Intel TBB or pthreads to obtain the behaviour I want, more precisely, processing elements from a queue in parallel.
Note: I have tried to use tasks but with no success.
First off, pthreads gives you portability which is hard to walk away from. The following appear to be true from your question - let us know if these aren't true because the answer will then change:
1) You have a multi-core processor(s) on which you're running the code
2) You want to have no more than num_threads threads because of (1)
Assuming the above to be true, the following approach might work well for you:
Create num_threads pthreads using pthread_create
Optionally, bind each thread to a different core
q.push(new_element) atomically adds new_element to a queue. pthread_mutex_lock and pthread_mutex_unlock can help you here. Examples here: http://pages.cs.wisc.edu/~travitch/pthreads_primer.html
Use pthread mutexes for dequeueing elements
Termination is tricky - one way to do this is to add a TERMINATE element to the queue, which, upon dequeueing, causes the dequeuer to queue up another TERMINATE element (for the next dequeuer) and then terminate. You will end up with one extra TERMINATE element in the queue, which you can remove by having a designated thread dequeue it after all the threads are done.
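A minimal sketch of that termination scheme around a mutex-protected queue (the sentinel value and element type are illustrative):

    #include <pthread.h>
    #include <queue>

    const int TERMINATE = -1;  // sentinel, assuming real elements are >= 0

    std::queue<int> q;
    pthread_mutex_t q_mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t q_nonempty = PTHREAD_COND_INITIALIZER;

    void* worker(void*)
    {
        for (;;)
        {
            pthread_mutex_lock(&q_mutex);
            while (q.empty())
                pthread_cond_wait(&q_nonempty, &q_mutex);
            int e = q.front();
            q.pop();
            if (e == TERMINATE)
            {
                q.push(TERMINATE);  // pass the sentinel to the next dequeuer
                pthread_cond_signal(&q_nonempty);
                pthread_mutex_unlock(&q_mutex);
                return nullptr;
            }
            pthread_mutex_unlock(&q_mutex);
            // process_element(e);  // the unsynchronized work happens here
        }
    }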
Depending on how often you add/remove elements from the queue, you may want to use something lighter weight than pthread_mutex_... to enqueue/dequeue elements. This is where you might want to use a more machine-specific construct.
TBB is compatible with other threading packages.
TBB also emphasizes scalability, so when you port your program from a dual-core to a quad-core machine you do not have to adjust it. With data-parallel programming, program performance increases (scales) as you add processors.
Cilk Plus is also another runtime that provides good results.
www.cilkplus.org
Since pthreads is a low-level threading library, you have to decide how much control you need in your application. It offers flexibility, but at a high cost in terms of programmer effort, debugging time, and maintenance.
My recommendation is to look at tbb::parallel_do. It was designed to process elements from a container in parallel, even if the container itself is not concurrent; i.e. parallel_do works with a std::queue correctly without any user synchronization (of course you would still need to protect your matrix access inside process_element()). Moreover, with parallel_do you can add more work on the fly, which looks like what you need, since process_element() creates and adds new elements to the work queue. The only caution is that newly added work will be processed immediately, unlike putting it in a queue, which would postpone processing until after all "older" items. Also, you don't have to worry about termination: parallel_do completes automatically as soon as all initial queue items and all new items created on the fly are processed.
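A minimal sketch of that shape (the element type and condition are placeholders; note that recent oneTBB versions replace parallel_do with an equivalent parallel_for_each overload):

    #include <tbb/parallel_do.h>
    #include <deque>

    struct Element { /* ... */ };

    void run(std::deque<Element>& initial)
    {
        tbb::parallel_do(initial.begin(), initial.end(),
            [](Element& e, tbb::parallel_do_feeder<Element>& feeder)
            {
                // lock(); ... matrix access ...; unlock();  // still your job
                // ... unsynchronized work on e ...
                bool condition = false;  // placeholder for your real test
                if (condition)
                    feeder.add(Element{});  // new work, processed on the fly
            });
        // Returns only once all initial and fed elements are processed.
    }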
However, if, besides the computation itself, the work queue can be concurrently fed from another source (e.g. from an I/O processing thread), then parallel_do is not suitable. In this case, it might make sense to look at parallel_pipeline or, better, the TBB flow graph.
Lastly, an application can control the number of active threads with TBB, though it's not a recommended approach.

Lists and multithreaded environments

I'm fairly new to the C++ standard library and have been using standard library lists for a specific multithreaded implementation. I noticed that there might be a trick with using lists that I have not seen in any tutorial/blog/forum posts, and though it seems obvious to me, it does not seem to be considered by anyone. So maybe I'm too new and possibly missing something, so hopefully someone smarter than me can validate what I am trying to achieve or explain what I am doing wrong.
So we know that in general standard library containers are not thread safe - but this seems like a guiding statement more than a rule. With lists it seems that there is a level of tolerance for thread safety. Let me explain: we know that list iterators do not get invalidated if we add to or delete from the list. The only iterator that gets invalidated is the one to the deleted item, which you can fix with the following line of code:
it = myList.erase(it)
So now let's say we have two threads; call them Thread 1 and Thread 2.
Thread 1's responsibility is to add to the list. It treats it as a queue, so it uses the std::list::push_back() function call.
Thread 2's responsibility is to process the data stored in the list as a queue and then after processing it will remove elements from the list.
It's guaranteed that Thread 2 will not remove elements that were just added during its processing, and Thread 1 guarantees that it will queue up the necessary data well ahead of Thread 2's processing. However, keep in mind that elements can be added during Thread 2's processing.
So it seems that this is a reasonable use of lists in this multithreaded environment without the use of locks for data protection. The reason I say it's reasonable is that, essentially, Thread 2 will only process data up to now, such that it can retrieve the current end iterator, shown by the following pseudocode:
Thread 2 {
    iter = myList.begin();

    lock();
    iterEnd = myList.end(); // lock data temporarily in order to get the
                            // current last element in the list
    unlock();

    // perform necessary processing
    while (iter != iterEnd) {
        // process data
        // ...
        // remove element
        iter = myList.erase(iter);
    }
}
Thread 2 uses a lock for a very short amount of time, just to know where to stop processing; for the most part, Thread 1 and Thread 2 don't require any other locking. In addition, Thread 2 could possibly avoid even that lock if its requirement to know the current last element is flexible.
Does anyone see anything wrong with my suggestion?
Thanks!
Your program is racy. As an example of one obvious data race: std::list is more than just a collection of doubly-linked nodes. It also has, for example, a data member that stores the number of nodes in the list (it need not be a single data member, but it has to store the count somewhere).
Both of your threads will modify this data member concurrently. Because there is no synchronization of those modifications, your program is racy.
Instances of the Standard Library containers cannot be mutated from multiple threads concurrently without external synchronization.
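For comparison, a minimal sketch of a safe version of the pattern from the question: the consumer holds the lock only long enough to splice off everything queued so far, then processes without the lock (names are illustrative):

    #include <list>
    #include <mutex>

    std::list<int> myList;  // the shared queue
    std::mutex listMutex;   // guards every access to myList

    // Thread 1
    void enqueue(int value)
    {
        std::lock_guard<std::mutex> guard(listMutex);
        myList.push_back(value);
    }

    // Thread 2
    void drainAndProcess()
    {
        std::list<int> local;
        {
            std::lock_guard<std::mutex> guard(listMutex);
            local.splice(local.end(), myList);  // O(1): steals all nodes
        }
        for (int value : local)
        {
            // process value with no lock held
        }
    }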

Multi-thread brute-force password breaking algorithm

I made a multi-threaded program in C++ to break passwords of 7 characters long (lower case characters only) using a brute-force algorithm.
My algorithm is mostly 7 nested for-loops going from a to z and testing every possible combination.
Right now, I'm dividing my work this way :
If I have 3 working threads,
Thread 1 : axxxxxx to ixxxxxx
Thread 2 : jxxxxxx to rxxxxxx
Thread 3 : sxxxxxx to zxxxxxx
So the 3 threads will go on and loop until they find a match.
The main thread will wait for the first thread to return.
My question is : Is this the best way to divide the work between my threads? Do you have any idea on how I could be more efficient?
Also, even if it's not the main part of my question, can you think of a better way than the 7-for-loop iteration?
(Please note that this program is for a school project and not for really cracking passwords)
If all keys are equally likely, and if the cost to evaluate a key is the same for every key, and if each thread can expect to be assigned to one CPU without very many interruptions (e.g. your process is the only CPU intensive one running), evenly partitioning the keyspace as you have done will be very efficient.
If some of those assumptions are invalid, a more flexible way to structure the program would be to have one thread (producer thread) hand out key ranges to 1 or more consumer threads for processing. Once a given thread completes its chunk of work, it would go back to the producer and request a new key range to analyze.
There's some overhead in the producer/consumer pattern, but it is more flexible.
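A minimal sketch of that flexible scheme, using a shared atomic counter instead of a dedicated producer thread to hand out key ranges (the chunk size is illustrative):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    const std::uint64_t KEYSPACE = 8031810176ULL;  // 26^7 candidates
    const std::uint64_t CHUNK = 1u << 20;          // keys per work unit
    std::atomic<std::uint64_t> next_key(0);

    void consumer()
    {
        for (;;)
        {
            std::uint64_t begin = next_key.fetch_add(CHUNK);
            if (begin >= KEYSPACE)
                return;  // keyspace exhausted
            std::uint64_t end = std::min(begin + CHUNK, KEYSPACE);
            for (std::uint64_t k = begin; k < end; ++k)
            {
                // test candidate number k here
            }
        }
    }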
I would take a look at Intel TBB.
I would use a parallel_for construct on the outer loop and have an atomic variable to signal that a match has been found.
This is pretty trivial using lambdas.
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

tbb::blocked_range<char> rng('a', 'z' + 1); // blocked_range is half-open,
                                            // so 'z' + 1 keeps 'z' included
tbb::parallel_for(rng, [&](const tbb::blocked_range<char>& r) {
    for (char a = r.begin(); a != r.end(); ++a)
    {
        // a is your top-level character
    }
});
The advantage of using TBB, as mentioned in another answer, is that if one thread finishes before another, TBB has a work-stealing mechanism built in that allows a fast thread to take work off a slower thread.
You should use a producer-consumer pattern:
have a (thread-safe) queue to produce the password candidates, and consumer threads.
That should be more flexible.
For producing the passwords there is nothing wrong with your method, but it can become tedious with longer passwords.
You can use a recursive scheme to produce them, or an iterative scheme with a single loop: the characters a-z are sequential in the ASCII table, so you can use a base-26 conversion to produce your candidates.
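A minimal sketch of that base-26 conversion, mapping an integer index to a 7-character lowercase candidate:

    #include <cstdint>
    #include <string>

    // Maps 0 -> "aaaaaaa", 1 -> "aaaaaab", ..., 26^7 - 1 -> "zzzzzzz".
    std::string candidate(std::uint64_t index)
    {
        std::string s(7, 'a');
        for (int pos = 6; pos >= 0; --pos)
        {
            s[pos] = static_cast<char>('a' + index % 26);
            index /= 26;
        }
        return s;
    }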