C++ synced stages multi thread pipeline - c++

Sorry if this has been asked, I did my best to search for a dup before asking...
I am implementing a video processing application that is supposed to run real time. The processing it does can be easily divided into 4 stages, each one operating on the intermediate values generated from the previous stage. The processing has become heavier than what can be processed in 1/30th of a second, but if I can split this application in 4 threads and turn it into a pipeline, each stage takes less than that and the whole thing would run realtime (with a 4 frame lag, which is completely acceptable).
I'm fairly new to multithreaded programming, and the problem I'm having is that I can't find a mechanism to start/stop each thread at the beginning of each frame, so that they all march together, delivering one finished frame every "cycle" at the end. All the frameworks/libraries I found seem to worry about load balancing using queues and worker threads, but this is not what I need here. Four threads will do, assuming I can keep them synced.
Can anybody point me to a starting point, using C++?
Thanks.

Assuming a 4 frame lag is acceptable, you could use a pool of list nodes, each with a pointer to a frame buffer and a pointer to the intermediate values (a NULL pointer could be used to indicate the end of a stream). Each thread would have its own list as part of a multithreaded messaging system. The first thread would get a frame node from a free pool, do its processing and send the node to the next thread's list, and so on, with the last thread returning nodes back to the free pool.
Here is a link to an example file copy program that spawns a thread to do the writes. It uses Windows threading, mutexes, and semaphores in the messaging functions, but the messaging functions are simple and could be changed internally to use generic equivalents without changing their interface. The main() function could be changed to use generic threading and setup of the mutexes and semaphores or something equivalent.
mtcopy.zip
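The node-passing scheme described in this answer can be sketched with standard C++ threads instead of Windows primitives. This is only an illustration under assumptions: FrameNode, NodeQueue and run_pipeline are made-up names, each stage's "processing" is a placeholder that adds 1 to a value, and the calling thread plays the role of the frame source. The stages are not started by a per-frame clock; each one simply blocks until the previous stage hands it a node, which keeps them in step.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical frame node: one buffer slot that travels through all stages.
struct FrameNode {
    int frame_id = 0;
    int value = 0;  // stands in for the frame buffer and intermediate values
};

// Minimal blocking message list; one per stage, plus one for the free pool.
class NodeQueue {
public:
    void push(FrameNode* node) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            nodes_.push(node);
        }
        ready_.notify_one();
    }
    FrameNode* pop() {  // blocks until a node (or the nullptr end marker) arrives
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !nodes_.empty(); });
        FrameNode* node = nodes_.front();
        nodes_.pop();
        return node;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<FrameNode*> nodes_;
};

// Runs frame_count frames through 4 stages and returns the finished values.
std::vector<int> run_pipeline(int frame_count) {
    constexpr int kStages = 4;
    std::vector<FrameNode> storage(kStages);  // 4 nodes in flight = 4 frame lag
    NodeQueue free_pool;
    NodeQueue inbox[kStages];
    std::vector<int> results;  // written only by the last stage
    for (FrameNode& node : storage) free_pool.push(&node);

    std::vector<std::thread> threads;
    for (int s = 0; s < kStages; ++s) {
        threads.emplace_back([&, s] {
            const bool last = (s == kStages - 1);
            while (FrameNode* node = inbox[s].pop()) {
                node->value += 1;  // placeholder for this stage's processing
                if (last) {
                    results.push_back(node->value);
                    free_pool.push(node);  // recycle the node
                } else {
                    inbox[s + 1].push(node);
                }
            }
            if (!last) inbox[s + 1].push(nullptr);  // forward end of stream
        });
    }

    for (int f = 0; f < frame_count; ++f) {
        FrameNode* node = free_pool.pop();  // blocks while all nodes are in flight
        node->frame_id = f;
        node->value = f * 10;  // pretend this is the captured frame
        inbox[0].push(node);
    }
    inbox[0].push(nullptr);  // signal end of stream
    for (std::thread& t : threads) t.join();
    return results;
}
```

Because the free pool holds only four nodes, the source blocks as soon as four frames are in flight, so no stage can run ahead of the others by more than the pipeline depth.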

Related

multithreaded one read one write time_t

I'm writing a multithreaded application that has 2 threads.
One of the threads receives data from a queue and aggregates it, and the other one sends the aggregated data to a server.
I want to be able to know the last time that data was received, so I use:
time_t last_data = time(NULL);
to get the correct time on each event (I don't need it to be super accurate but I need it to be fast), and then the other thread sends this value with the aggregated data.
My questions are:
Do I have to synchronize this even if it is not very important that I get the most recent update?
I tested it with std::atomic<time_t> and it seems to have some performance issues; is there a faster way?
What is the worst case that can happen if I do not synchronize the read/write?
Is there a faster way to get the current time than time(NULL) (it doesn't have to be super accurate)?
UPDATE
Here is an explanation of my application workflow.
Application needs:
1. Consume data from external sources using IPC (currently nanomsg).
2. Aggregate the data to bulks.
3. Send the aggregated data to remote server every given interval (1 second).
Current implementation:
Create 2 buffers to hold the aggregated data (one for receiving and one for sending).
Create a consumer thread to consume data from IPC and fill the receiving buffer.
Create a sending thread that will send the data to the server.
Every iteration of the interval the sending thread will swap the buffers (swap pointers and locking using mutex) and send the data to the server.
I don't want the consumer to wait on network I/O, so I created this flow.
Can I use an event-driven approach here instead of this complex mechanism, without all the locking (currently it is working fine but I'm sure it can be better)?
Don't do it that way. You only need one thread. You can use select/poll/epoll. These can wait on your inputs and, at the same time, for your output to finish. You will be doing event-driven programming with non-blocking output. It is something worth learning. It is a bit harder at first, but soon makes life easier, i.e. you won't have the problem that you have now. Also the program will be faster.
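As a rough illustration of the single-threaded approach, here is one step of a poll()-based loop on a POSIX system. The function name poll_input and the idea of appending to a std::string are assumptions made for this sketch; a real program would also watch the output socket with POLLOUT and use the timeout to trigger the once-per-second send.

```cpp
#include <poll.h>
#include <unistd.h>
#include <string>

// Waits up to timeout_ms for input_fd to become readable, then appends
// whatever is available to `aggregated`. Returns the number of bytes read
// (0 on timeout, error, or end of file).
std::size_t poll_input(int input_fd, int timeout_ms, std::string& aggregated) {
    pollfd watched{};
    watched.fd = input_fd;
    watched.events = POLLIN;  // wake up when there is data to read

    int ready = poll(&watched, 1, timeout_ms);
    if (ready <= 0 || !(watched.revents & POLLIN)) return 0;

    char chunk[4096];
    ssize_t n = read(input_fd, chunk, sizeof chunk);
    if (n <= 0) return 0;
    aggregated.append(chunk, static_cast<std::size_t>(n));
    return static_cast<std::size_t>(n);
}
```

Calling this in a loop from one thread means nothing else touches the aggregation buffer, so no mutex is needed.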
Supposing one thread executes:
last_data = time(NULL);
and the other uses last_data, but there is no synchronization event between the two, then there are no guarantees about when, or if, the revised value of last_data will become visible to the reading thread.
However, the most serious possibility is that the write of time_t (maybe long) isn't atomic and another thread could read a corrupt 'part written' value.
That could cause glitches in delay and time calculations that might foul downstream processing.
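One way to rule out the 'part written' value is an atomic with relaxed memory ordering. This is a sketch, not a measured recommendation: the function names are invented here, and whether it is faster than the default (sequentially consistent) std::atomic operations the question tried depends on the platform. On common 64-bit targets a relaxed load or store of a word-sized integer usually compiles to an ordinary move.

```cpp
#include <atomic>
#include <ctime>

// Hypothetical shared timestamp: written by the consumer, read by the sender.
std::atomic<std::time_t> last_data{0};

// Consumer thread: record the time of the latest event.
void mark_data_received(std::time_t now = std::time(nullptr)) {
    // Relaxed: the value itself is never torn, but no ordering is promised
    // relative to other memory operations.
    last_data.store(now, std::memory_order_relaxed);
}

// Sender thread: read the most recently published timestamp.
std::time_t last_data_time() {
    return last_data.load(std::memory_order_relaxed);
}
```

Relaxed ordering is enough here only because the timestamp is an independent value; it does not publish any other data along with it.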
You might analyse your program and find that because the two interact there is a sufficient memory fence at some point that guarantees eventual update.
NB: This is an odd situation where I'm suspecting you think something isn't synchronized but it is! The usual experience is the other way around...
Basically there's not really enough information to understand what problem you're having.
For example, if the reader thread is the only process to read the time I'd expect to see code like:
Thread 1:
If data received, lock L, update time, add to queue, unlock L.
Thread 2:
If items in queue: lock L, read queue and update time, unlock L, then process item.
In which case the time will be synchronized already.
Please provide a minimum, complete, verifiable example...

Multirate threads

I recently ran into a requirement for a multithreaded application whose threads run at different rates.
Since I am still learning multithreading, I have the following questions.
A scenario is given to put things into perspective:
Say 1st thread runs at 100 Hz "real time"
2nd runs at 10 Hz
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread? Is the common practice to just read whatever is available from the first thread, or does there need to be some kind of decimation to reduce the rate?
Does myData need to be some kind of Singleton with a locking mechanism, even though myData isn't shared, but rather updated by the first thread and used in the second thread?
How about the opposite case, when the data used in one thread needs to be used at a higher rate in a different thread?
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::deque or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops the data items off of the other end of the queue. Be sure to serialize all accesses to the FIFO queue (using a mutex or similar locking mechanism), to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
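The "queue of length one" could look roughly like this. LatestValue is a name made up for this sketch; std::optional (C++17) is used so the consumer can tell that nothing has been published yet.

```cpp
#include <mutex>
#include <optional>

// Holds only the most recent value; older values are overwritten.
template <typename T>
class LatestValue {
public:
    // Producer (e.g. the 100 Hz thread): overwrite the stored value.
    void publish(const T& value) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = value;
    }
    // Consumer (e.g. the 10 Hz thread): copy out the latest value, if any.
    std::optional<T> read() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }
private:
    mutable std::mutex mutex_;
    std::optional<T> value_;
};
```

The lock is held only while copying, so neither thread is delayed by the other for longer than one copy of T.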
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
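"Read in as much data as there is available" can be done with one lock per wake-up by swapping the whole queue out. A sketch, with DrainableQueue as an invented name:

```cpp
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

template <typename T>
class DrainableQueue {
public:
    // Fast thread: append one item.
    void push(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(std::move(item));
    }
    // Slow thread: remove and return everything queued so far, oldest first.
    std::vector<T> drain() {
        std::deque<T> taken;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            taken.swap(items_);  // the lock is held only for the swap
        }
        return std::vector<T>(taken.begin(), taken.end());
    }
private:
    std::mutex mutex_;
    std::deque<T> items_;
};
```

With a 100 Hz producer and a 10 Hz consumer, each drain() would typically return about ten items.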
Does the myData need to be some kind of Singleton with locking mechanism.
Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!)
How about the opposite case, when the data used in one thread need to be used at higher rate in a different thread.
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up don't matter, because the communications mechanism will do the right thing regardless of when or how often the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
a producer-consumer container that lets the first thread enqueue data, and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on the size after which either data would be lost or the 1st thread would be forced to slow down and wait to enqueue further values
there are libraries available (e.g. boost), or if you want to implement it yourself google some tutorials/docs on mutex and condition variables
do something conceptually similar to the above but where the size limit is 1 so there's just the single myData variable rather than a "container" - but all the synchronisation and delay choices remain the same
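A size-limited producer-consumer container built from a mutex and condition variables might look like the sketch below. BoundedQueue is a made-up class; try_push shows the "data would be lost" policy and push shows the "1st thread waits" policy. A capacity of 1 gives the single-variable case from the last choice.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Policy 1: drop the value when the queue is full. Returns false if dropped.
    bool try_push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (items_.size() >= capacity_) return false;
            items_.push(std::move(item));
        }
        not_empty_.notify_one();
        return true;
    }

    // Policy 2: make the producer wait until there is room.
    void push(T item) {
        {
            std::unique_lock<std::mutex> lock(mutex_);
            not_full_.wait(lock, [this] { return items_.size() < capacity_; });
            items_.push(std::move(item));
        }
        not_empty_.notify_one();
    }

    // Blocks until an item is available, then removes and returns it.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        lock.unlock();
        not_full_.notify_one();  // a waiting producer may continue
        return item;
    }

private:
    const std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::condition_variable not_full_;
    std::queue<T> items_;
};
```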
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer argument to the function(s) run in the threads. Singleton's easily overused and best avoided unless reasons stack up high....

Thread or timer to read sensor data out?

My Linux C++ application is periodically reading sensor data. Readout is done by simple file I/O operation (OS is writing to file, application is reading from this file).
Some information about my platform:
I have single core processor with hyper-threading
sensor data update frequency is 1 second
application GUI runs in main thread and shouldn't be blocked
I considered two approaches for sensor data read out:
timer running in main application thread
separate thread with infinite loop which does sensor data readout and then sleeps
Which approach makes more sense? Are there any other alternatives? What are the costs of both solutions (e.g. blocking of the main thread in the first, or context switching in the second approach)?
I don't know anything about your application or the hardware, but here are a few things to consider:
If you use a thread, you will have to create a communication channel of some sort to tell the main thread that data has been updated. Usually this would be a pipe(), as signals are inherently unreliable and condition locks don't work with I/O multiplexing (i.e. select()/poll()).
Can you get the entire set of data without blocking? If so, then just reading it in the main thread is probably easier. However, if your read can block you'll probably need some more "keep track of my read state to incorporate it into my central select()", whereas a thread can just block until more data is available.
Thus, neither solution is automatically "easier" to do.
I wouldn't worry about "context switching" for a read that only occurs once per second; that's irrelevant.
What else does the main thread have to do? Is it OK if it blocks? If so, then you don't need to do the timer, etc. in a separate thread.
If the main thread can't block waiting for the periodic timer, then a separate thread must be created. The communication of data between the threads can be via an object that is accessible to both threads and protected via a mutex (look up pthread_mutex_t), which is quite simple to do.
As for which solution would be better and what the costs are, it depends on what else the main thread is doing. But for something this simple, either way should be about the same, and the context switching shouldn't affect anything. What should affect performance the most is how expensive the reads are.
I believe that the cost of a context switch once a second is not an issue even for a single-core CPU without hyper-threading, especially taking into account that the application is running in user space and thus is not really time-critical. Polling your sensor in the main thread complicates the logic of the application. So I would recommend that you start a thread for that purpose.
A sleep-loop will skew the timing because each iteration is going to take longer than 1sec. Timers don't have that problem, and they are made for this scenario. So choose a timer.
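If a thread with a loop is used anyway, the skew described here can be avoided by sleeping until an absolute deadline instead of sleeping for a fixed duration after each read. A sketch with an invented helper, run_periodic:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Calls `task` `iterations` times, once per `period`, and returns the number
// of calls made. The time spent inside `task` does not accumulate as drift.
int run_periodic(std::chrono::milliseconds period, int iterations,
                 const std::function<void()>& task) {
    auto next_deadline = std::chrono::steady_clock::now();
    int calls = 0;
    for (int i = 0; i < iterations; ++i) {
        task();  // e.g. read the sensor file
        ++calls;
        next_deadline += period;                       // absolute schedule
        std::this_thread::sleep_until(next_deadline);  // not sleep_for(period)
    }
    return calls;
}
```

With sleep_for(period) each cycle would last the period plus the read time; with sleep_until the read time is absorbed into the period as long as the read is shorter than the period.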
Performance-wise there is no difference because you are only triggering once a second.
If the Linux driver is reading sensor data and writing it to a device file every second, you shouldn't duplicate the timer logic in your application. It may happen that after a 1 second sleep your application will still read the same data as 1 second ago. A better approach would be to have a thread that calls a blocking read on the device file. When new sensor data is available, the blocking read returns, and the thread can process the data and call read again.

TBB for a workload that keeps changing?

I'm sorry, I don't seem to get Intel's TBB. It seems great and well supported, but I can't wrap my head around how to use it, I guess because I'm not used to thinking of parallelism in terms of tasks; I have always seen it as threads.
My current workload has a job that sends work to a queue to keep processing (think of recursion, but instead of calling itself it sends work to a queue). The way I got this working in Java was to create a concurrent (non-blocking) queue and a ThreadPoolExecutor that worked the queue and sent work back to it. Now I'm trying to do something similar in C++. I found that TBB can create pools, but its approach is very different (Java threads seem to just keep working as long as there is work in the queue, but TBB seems to break the task down at the beginning).
Here's a simple Java example of what I do(before this I set how many threads I want,etc..):
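import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Callable;

static class DoWork implements Callable<Void> {
    // queue with contexts to process
    private final Queue<Context> contexts;

    DoWork(Context request) {
        contexts = new ArrayDeque<>();
        contexts.add(request);
    }

    public Void call() {
        while (!contexts.isEmpty()) {
            Context current = contexts.poll(); // take the next context
            // do work on current
            // contexts.add(new Context(data)); // if it needs to be sent back to the queue to do more work
        }
        return null; // Callable<Void> must return a value
    }
}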
I'm sure it's possible to do this in TBB, but I'm just not sure how, because it seems to break up my work at the time I send it. So if there are 2 items in the queue it may only launch 2 threads but won't grow as more work comes in (even if I have 8 cores).
Can someone help me understand how to achieve my tasks and also maybe suggest a better way to think about TBB coming from using Java's threading environment(also I have no allegiance to TBB, so if there's something easier/better then I'm happy to learn it. I just don't like c++ threadpool because it doesn't seem actively developed)?
The approach based on having a queue of items for parallel processing, where each thread just pops one item from the queue and proceeds (and possibly adds a new item to the end of the queue at some point), is fundamentally wrong since it limits the parallelism of the application. The queue becomes a single point of synchronization, and threads need to wait in order to get access to the next item to process. In practice this approach works when tasks (each item's processing job) are quite large and take different times to complete, which makes the queue less contended than when most of the threads finish at the same time and come to the queue for their next items.
If you're writing a somewhat reusable piece of code you can not guarantee that tasks are either large enough or that they vary in size (time to execute).
I assume that your application scales, which means that you start with some significant number of items (much larger than the number of threads) in your queue and, while threads do the processing, they add enough tasks to the end so that there's enough work for everyone until the application finishes.
If that's the case, I would rather suggest that you keep two thread-safe vectors of your items (TBB's concurrent_vectors, for instance) and alternate between them. You start with one vector (your initial set of items) and you enqueue() a task (I think it's described somewhere in chapter 12 of the TBB reference manual), which executes a parallel_for over the initial vector of items. While the first batch is being processed you push_back the new items onto the second concurrent_vector, and when you're done with the first one you enqueue() a task with a parallel_for over the second vector and start pushing new items back into the first one. You can try to overlap parallel processing of items better by having three vectors instead of two and gradually moving between them while there's still enough work to keep all threads busy.
What you're trying to do is exactly the sort of thing TBB's parallel_do is designed for. The "body" invoked by parallel_do is passed a "feeder" argument which you can do feeder.add(...some new task...) on while processing tasks to create new tasks for execution before the current parallel_do completes.
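To make the "feeder" idea concrete without depending on TBB, here is a standard-library-only sketch of workers that can add new items while processing. It is not TBB's API or implementation (TBB schedules such work far more cleverly than one shared queue); process_with_feeder and its parameters are names invented for this illustration.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Processes `initial` items on `num_threads` workers. `body(item, add)` may
// call add(new_item) to feed more work. Returns the number of items processed.
int process_with_feeder(
        std::vector<int> initial, unsigned num_threads,
        const std::function<void(int, const std::function<void(int)>&)>& body) {
    std::mutex mutex;
    std::condition_variable wake;
    std::deque<int> pending(initial.begin(), initial.end());
    int in_flight = 0;  // items currently being processed
    int processed = 0;

    std::function<void(int)> feeder = [&](int item) {
        {
            std::lock_guard<std::mutex> lock(mutex);
            pending.push_back(item);
        }
        wake.notify_one();
    };

    auto worker = [&] {
        std::unique_lock<std::mutex> lock(mutex);
        for (;;) {
            // Sleep until there is work, or until no running item can add any.
            wake.wait(lock, [&] { return !pending.empty() || in_flight == 0; });
            if (pending.empty()) return;  // all work is finished
            int item = pending.front();
            pending.pop_front();
            ++in_flight;
            lock.unlock();
            body(item, feeder);  // runs without the lock; may add work
            lock.lock();
            --in_flight;
            ++processed;
            if (pending.empty() && in_flight == 0) wake.notify_all();
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (std::thread& t : threads) t.join();
    return processed;
}
```

The in_flight counter is what lets idle workers distinguish "the queue is empty for now" from "the queue is empty and nobody can add to it", which is the termination problem the Java version in the question also has to solve.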

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the db and for each one call a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but rather sets the next calls going, so they could be processed in parallel? A naive approach to demonstrate the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process and so far I am targeting C-something (for the speed critical bits) and Python (for the productivity critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am contented with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.
Well, if .net is an option, they have put a lot of effort into Parallel Computing.
If you still plan on using Python, you might want to have a look at Processing. It uses processes rather than threads for parallel computing (due to the Python GIL) and provides classes for distributing "work items" onto several processes. Using the pool class, you can write code like the following:
import processing

def worker(i):
    return i * i

num_workers = 2
pool = processing.Pool(num_workers)
result = pool.imap(worker, range(100000))
This is a parallel version of itertools.imap, which distributes calls over several processes. You can also use the apply_async method of the pool and store lazy result objects in a list:
results = []
for i in range(10000):
    # the arguments must be passed as a tuple
    results.append(pool.apply_async(worker, (i,)))
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
I might be missing something here, but this seems fairly straightforward using pthreads.
Set up a small threadpool with N threads in it and have one thread to control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find the next free thread; if no thread is free, then wait
Hand over chunk to worker thread
Go back and get next chunk from DB
In the meantime, the worker threads sit and do:
Mark myself as free
Wait for the master thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex-controlled arrays. One has the worker threads in it (the threadpool) and the other indicates whether each corresponding thread is free or busy.
Tweak N to your liking ...
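A naive version of this pool can be sketched with std::thread in place of raw pthreads. It simplifies the scheme above in one respect, which is an assumption of this sketch rather than part of the answer: instead of a master handing chunks to free workers, each worker claims the next unprocessed chunk from a shared atomic counter. The chunks are assumed to be already loaded, and reduce_chunk is a placeholder for the real calculation.

```cpp
#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical side-effect-free function applied to each chunk.
long reduce_chunk(const std::vector<int>& chunk) {
    return std::accumulate(chunk.begin(), chunk.end(), 0L);
}

// Processes all chunks on `num_threads` workers; results[i] belongs to chunks[i].
std::vector<long> process_chunks(const std::vector<std::vector<int>>& chunks,
                                 unsigned num_threads) {
    std::vector<long> results(chunks.size());
    std::atomic<std::size_t> next{0};  // index of the next unclaimed chunk

    auto worker = [&] {
        for (;;) {
            std::size_t i = next.fetch_add(1);     // claim one chunk
            if (i >= chunks.size()) return;        // nothing left to do
            results[i] = reduce_chunk(chunks[i]);  // each slot has one writer
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (std::thread& t : pool) t.join();
    return results;
}
```

Since every result slot is written by exactly one thread and read only after join(), no mutex is needed for the results.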
If you're working with a compiler that will support it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code in such a way that certain loops will be parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc4.2 will support openmp, for example.
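For a loop that calls a side-effect-free function on each item, the annotation is a single pragma. In this sketch transform and transform_all are placeholder names; the pragma itself is standard OpenMP.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical deterministic, side-effect-free per-item function.
double transform(double x) { return x * x + 1.0; }

std::vector<double> transform_all(const std::vector<double>& input) {
    std::vector<double> output(input.size());
    const long n = static_cast<long>(input.size());
    // With OpenMP enabled (e.g. -fopenmp on GCC) the iterations are split
    // across threads; without it the pragma is ignored and the loop runs
    // serially with the same result.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        output[i] = transform(input[i]);
    }
    return output;
}
```

Each iteration writes only its own output element, which is what makes the loop safe to parallelize this way.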
The same thread pool is used in java. But the threads in threadpools are serialisable and sent to other computers and deserialised to run.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library, and the user just has to implement Map and Reduce. It is positioned as a Boost library, but not yet accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might be of interest to you also.