I'm working on a real-time application divided into two parts (threads):
processing
graphics
The output of the processing thread is a fixed-size array of floats, and to maintain real-time performance I want to send that data to the other thread, which will draw, for example, graphs at its own pace.
I've looked into std::atomic and locks, but I can't figure out how to make the application thread-safe given that the two threads run completely independently.
Sample code:
class A {
    float data[n];
    void processData() {
        // fills data[] with new results
    }
};
class B {
    void draw() {
        // requires data[] from class A
    }
};
Both classes are initialized in the main thread. I've tried defining a float* pointer there and passing it to the two threads: processing writes data[] through it and graphics is able to read it, but there are obviously errors when one thread is reading while the other is modifying the data at the same time.
You could create a queue of your float values for the graphics output and a mutex.
Whenever the processing has generated some output, lock the common mutex, append the data to the queue, unlock the mutex.
On the other side, periodically lock the mutex from the graphics thread and check whether there is new data to be displayed. If so, remove that data from the queue, temporarily copying it to a thread-local buffer so that the mutex is not held during graphics output, and unlock the mutex right after the copy. Then display the graphics in that thread using the local copy.
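A minimal sketch of that idea, assuming the producer's output is a fixed-size array of n floats (FrameQueue and the size constant are placeholder names, not from your code):

#include <array>
#include <deque>
#include <mutex>

constexpr std::size_t n = 256;            // size of one processing result
using Frame = std::array<float, n>;

class FrameQueue {
public:
    // Called from the processing thread: append one finished frame.
    void push(const Frame& frame) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_frames.push_back(frame);
    }

    // Called from the graphics thread: move out all pending frames so the
    // mutex is not held while drawing.
    std::deque<Frame> takeAll() {
        std::lock_guard<std::mutex> lock(m_mutex);
        std::deque<Frame> out;
        out.swap(m_frames);
        return out;
    }

private:
    std::mutex m_mutex;
    std::deque<Frame> m_frames;
};

The graphics thread calls takeAll() at its own pace and draws from the returned local copy, so the lock is only held for the swap.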
The simplest solution is to use an std::mutex to prevent both threads from accessing the data at the same time.
Of course, that means only 1 thread at a time can do something with the data. If that is a bottleneck (i.e. you want to generate new data while drawing previous data), consider double buffering. That way, both reading and writing can happen simultaneously. Note that you'll still need some sort of synchronization using e.g. a mutex to make sure the writer doesn't start writing into the buffer that's currently being used by the reader (or vice versa). You can improve on that by using triple buffering.
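A rough sketch of the double-buffering variant (the Buffer alias and the "fresh" flag are assumptions, just one way to signal that a new frame is available):

#include <array>
#include <mutex>
#include <utility>

using Buffer = std::array<float, 256>;   // placeholder size

class DoubleBuffer {
public:
    // Writer: fill the back buffer at leisure, then publish it with a short lock.
    void publish(const Buffer& produced) {
        m_back = produced;                          // no lock needed; only the writer touches m_back
        std::lock_guard<std::mutex> lock(m_mutex);
        std::swap(m_back, m_front);
        m_fresh = true;
    }

    // Reader: copy the front buffer out under the same short lock, then draw from the copy.
    bool readLatest(Buffer& out) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (!m_fresh)
            return false;
        out = m_front;
        m_fresh = false;
        return true;
    }

private:
    std::mutex m_mutex;
    Buffer m_front{};
    Buffer m_back{};
    bool m_fresh = false;
};

Neither side holds the lock while doing its expensive work (producing or drawing); the lock only covers the swap and the copy.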
Related
So I am using a QQuickFramebufferObject and QQuickFramebufferObject::Renderer in my Qt application. As mentioned here:
To avoid race conditions and read/write issues from two threads it is important that the renderer and the item never read or write shared variables. Communication between the item and the renderer should primarily happen via the QQuickFramebufferObject::Renderer::synchronize() function.
So I have to synchronize whatever data I render when QQuickFramebufferObject::Renderer::synchronize() is called. However, because the data sent to the render thread can often be quite large, I would like to avoid copying it (it is stored in a DataObject), so for now I pass a std::shared_ptr<DataObject> in that function and assign it to a private member of my QQuickFramebufferObject::Renderer class. This approach works fine, but I am not sure it is the "correct" way of doing things. What approach can I take to share/transfer the data between the GUI thread and the rendering thread?
For data that is too big to copy in the synchronize() method, use a synchronization object to manage access to the data: lock it when writing, release it when finished, and lock it again when rendering while accessing the data directly. You are safe as long as only one thread is accessing the data at a time.
The risk of skipped frames increases the longer the synchronization object is held. Locking for writes longer than about half the render quantum (~8.3 ms, i.e. half of the ~16.7 ms frame time at 60 fps) will cause dropped frames, and since your app is probably doing plenty of other work per frame, the real budget is lower.
Alternatively, you could use a circular buffer for your large data structures with a protected index variable so you can simultaneously write to one structure while reading from another. Increment the index variable when all data is ready to display and call QQuickItem::update().
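A rough sketch of that buffered approach with just two slots (LargeData and the member names are placeholders; it assumes the index flip happens at a point where the render thread is not mid-read, e.g. inside synchronize(), which blocks the GUI thread):

#include <array>
#include <atomic>

struct LargeData { /* the big data set to render */ };

class BufferPair {
public:
    LargeData&       writeSlot()      { return m_slots[1 - m_readable.load()]; }
    const LargeData& readSlot() const { return m_slots[m_readable.load()]; }

    // Call once the write slot is completely filled, then call
    // QQuickItem::update() so the next synchronize() picks up the new index.
    void publish() { m_readable.store(1 - m_readable.load()); }

private:
    std::array<LargeData, 2> m_slots;
    std::atomic<int> m_readable{0};
};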
I have a case where there is an unordered_map of structs. The struct contains int(s), bool(s) and a vector. My program fetches data for each item in the map either through an HTTPS call to a server or over a websocket (separate HTTPS calls are required for each item in the map). When using the websocket, data for all items in the map is returned together. The fetched data is processed and stored in the respective vectors.
The websocket runs in a separate thread and should run throughout the lifetime of the program.
My program has a delete function which can "empty" the entire map. There is also an addItem() function, which adds a new struct to the map.
Whenever the "updatesOn" member of a struct is false, no data is pushed into its vector.
My current implementation has 3 threads:
The main thread adds new items to the map. It also fetches data from the vector inside each struct, has a function to empty the map and start again, and another function which only empties a vector.
The second thread runs the websocket client and fills the vector in each struct as new data arrives. A while loop checks an exit flag; once the exit flag is set by the main thread, this thread terminates.
The third thread is the manager thread. It looks for new entries in the map, does the HTTP download for them, and then adds each item to the websocket for subsequent data updates. It also runs HTTP downloads at regular intervals, emptying and refilling the vectors.
Right now I have two mutexes.
One is locked before data is written to or read from a vector.
The second is used when new data is added to or removed from the map, and when the map is emptied.
I sense this is the wrong usage of mutexes, since I may empty the map while the vector of one of its structs is being read or written. That pushes me towards using a single mutex for everything.
The problem is that this is a real-time stock data program, i.e. new data comes in every second, sometimes faster. I am afraid one mutex lock for everything could slow down my entire app.
As described above, all 3 threads have write access to the map, with the main thread capable of emptying it completely.
Keeping speed and thread safety in mind, what would be a good way to implement this?
My data members:
unordered_map<string, tickerDiary> tDiaries;

struct tickerDiary {
    tickerDiary() : name(""), ohlcPeriodicity("minute"), ohlcStatus(false), updatesOn(true), ohlcDayBarIndex(0), rtStatus(false) {}
    string name;
    string ohlcPeriodicity;
    bool ohlcStatus;
    bool rtStatus;
    bool updatesOn;
    int32 ohlcDayBarIndex;
    vector<Quotation> data;
};

struct Quotation {
    union AmiDate DateTime;
    float Price;
    float Open;
    float High;
    float Low;
    float Volume;
    float OpenInterest;
    float AuxData1;
    float AuxData2;
};
Note: I am using C++11.
If I understand your question correctly, your map itself is primarily written in the main thread, and the other threads are only used to operate on the data contained within entries in the map.
Given that, for the non-main threads there are two concerns:
The item that they work on should not randomly disappear
They should be the only one working on their item.
The first of these is most efficiently solved by decoupling the storage from the map: storage for each item is allocated separately (either through the default allocator, or via some pooling scheme if you add/remove items a lot), and the map only stores a shared_ptr. Each thread working on an item then just keeps its own shared_ptr to make sure the storage will not disappear out from under it, and acquiring the map's associated mutex (or shared_mutex) is only necessary for the duration of fetching, storing, or removing the pointers. This works fine so long as it is acceptable that some threads occasionally waste time operating on items already removed from the map. shared_ptr's reference counting ensures you won't leak memory, and the refcount updates themselves are thread-safe (implemented with atomic operations or other efficient platform primitives). If you want to know more about shared_ptr, and smart pointers in general, any introduction to the C++ smart-pointer system will cover this.
That leaves the second problem, which is probably most easily resolved by keeping a mutex in the data struct (tickerDiary) itself, that threads acquire when starting to do operations that require predictable behavior from the struct, and can be released after they have done what they should do.
Separating the locking this way should reduce the amount of contention on the global lock for the map. However, you should probably benchmark your code to see whether that reduction is worth it given the extra costs of the allocations and refcounts for the individual items.
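A rough sketch of that layout, assuming a per-item mutex is added to the tickerDiary from the question (DiaryStore and the method names are placeholders):

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct tickerDiary {
    std::mutex itemMutex;   // guards the members below
    // ... name, ohlcStatus, updatesOn, data, etc. as in the question
};

class DiaryStore {
public:
    std::shared_ptr<tickerDiary> get(const std::string& key) {
        std::lock_guard<std::mutex> lock(m_mapMutex);   // held only for the lookup
        auto it = m_diaries.find(key);
        return it == m_diaries.end() ? nullptr : it->second;
    }

    void add(const std::string& key) {
        std::lock_guard<std::mutex> lock(m_mapMutex);
        m_diaries.emplace(key, std::make_shared<tickerDiary>());
    }

    void clear() {
        std::lock_guard<std::mutex> lock(m_mapMutex);
        m_diaries.clear();   // threads still holding a shared_ptr keep their item alive
    }

private:
    std::mutex m_mapMutex;   // guards the map structure only
    std::unordered_map<std::string, std::shared_ptr<tickerDiary>> m_diaries;
};

// Example use from a worker thread: keep the shared_ptr so the item cannot
// vanish mid-update, and take the per-item mutex while touching its members.
void updateItem(DiaryStore& store, const std::string& key) {
    if (auto diary = store.get(key)) {
        std::lock_guard<std::mutex> lock(diary->itemMutex);
        // read/modify diary->data, diary->updatesOn, ...
    }
}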
I don't think using std::vector is the right collection here. But if you insist on using it you should just have one mutex for each collection.
I would recommend concurrent_vector from INTEL TBB or a synchronized data structure from boost.
A third solution could be implementing your own concurrent vector.
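For example, with TBB's concurrent_vector (a rough sketch; Quotation is the struct from the question):

#include <tbb/concurrent_vector.h>

tbb::concurrent_vector<Quotation> data;

void onNewQuote(const Quotation& q) {
    data.push_back(q);   // safe to call from several threads at once, no mutex needed
}

Note that concurrent growth and element access are thread-safe, but clear() is not, so emptying the container would still need external coordination.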
I have an application which has a couple of processing levels like:
InputStream->Pre-Processing->Computation->OutputStream
Each of these entities run in separate thread.
So in my code I have the general thread, which owns the
std::vector<ImageRead> m_readImages;
and then it passes this member variable to each thread:
InputStream input{&m_readImages};
std::thread threadStream{&InputStream::start, &input};
PreProcess pre{&m_readImages};
std::thread preStream{&PreProcess::start, &pre};
...
And each of these classes owns a pointer member to this data:
std::vector<ImageRead>* m_ptrReadImages;
I also have a global mutex defined, which I lock and unlock on each read/write operation to that shared container.
What bothers me is that this mechanism is pretty obscure and sometimes I get confused whether the data is used by another thread or not.
So what is the more straightforward way to share this container between those threads?
The process you described as "Input-->preprocessing-->computation-->Output" is sequential by design: each step depends on the previous one so parallelization in this particular manner is not beneficial as each thread just has to wait for another to complete. Try to find out which step takes most time and parallelize that. Or try to set up multiple parallel processing pipelines that operate sequentially on independent, individual data sets. A usual approach for that would employ a processing queue which distributes the tasks among a set of threads.
It would seem to me that your reading and preprocessing could be done independently of the container.
Naively, I would structure this as a fan-out and then fan-in network of tasks.
First, make a dispatch task (a task being a unit of work handed to a thread to execute) that will create the input-and-preprocess tasks.
Use futures as the means for the sub-tasks to communicate back a pointer to the completely loaded image.
Make a second task, the std::vector builder task, that simply waits on those futures (std::future::get) to collect the results as they complete and adds them to the std::vector.
I suggest you structure things this way because I suspect that any IO and preprocessing you are doing will take longer than setting a value in the vector. Using tasks instead of threads directly lets you tune the parallel portion of your work.
I hope that's not too abstracted away from the concrete elements. This is a pattern I find to be well balanced between saturating available hardware, reducing thrash / lock contention, and is understandable by future-you debugging it later.
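A minimal sketch of that fan-out/fan-in shape using std::async (loadAndPreprocess is a hypothetical stand-in for the real IO and preprocessing code; ImageRead is the type from the question):

#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for the real IO + preprocessing work on one image.
ImageRead loadAndPreprocess(const std::string& path);

std::vector<ImageRead> loadAll(const std::vector<std::string>& paths) {
    // Fan out: one task per image.
    std::vector<std::future<ImageRead>> futures;
    futures.reserve(paths.size());
    for (const auto& p : paths)
        futures.push_back(std::async(std::launch::async, loadAndPreprocess, p));

    // Fan in: the builder collects the results; the worker tasks never touch
    // the shared vector, so no mutex is needed here.
    std::vector<ImageRead> images;
    images.reserve(futures.size());
    for (auto& f : futures)
        images.push_back(f.get());
    return images;
}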
I would use 3 separate queues, ready_for_preprocessing which is fed by InputStream and consumed by Pre-processing, ready_for_computation which is fed by Pre-Processing and consumed by Computation, and ready_for_output which is fed by Computation and consumed by OutputStream.
You'll want each queue to be in a class, which has an access mutex (to control actually adding and removing items from the queue) and an "image available" semaphore (to signal that items are available) as well as the actual queue. This would allow multiple instances of each thread. Something like this:
class imageQueue
{
    std::deque<ImageRead> m_readImages;
    std::mutex m_changeQueue;
    Semaphore m_imagesAvailable;   // not a standard C++11 type; see the sketch below
public:
    bool addImage( ImageRead );
    ImageRead getNextImage();
};
addImage() takes the m_changeQueue mutex, adds the image to m_readImages, then signals m_imagesAvailable.
getNextImage() waits on m_imagesAvailable. When it becomes signaled, it takes m_changeQueue, removes the next image from the list, and returns it.
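Since C++11 has no standard semaphore type (std::counting_semaphore only arrived in C++20), one way to realize the class is to let a condition variable play the "images available" role; a sketch under that assumption (addImage is shown returning void for simplicity):

#include <condition_variable>
#include <deque>
#include <mutex>

class imageQueue
{
    std::deque<ImageRead> m_readImages;
    std::mutex m_changeQueue;
    std::condition_variable m_imagesAvailable;
public:
    void addImage(ImageRead img)
    {
        {
            std::lock_guard<std::mutex> lock(m_changeQueue);
            m_readImages.push_back(std::move(img));
        }
        m_imagesAvailable.notify_one();
    }

    ImageRead getNextImage()
    {
        std::unique_lock<std::mutex> lock(m_changeQueue);
        m_imagesAvailable.wait(lock, [this] { return !m_readImages.empty(); });
        ImageRead img = std::move(m_readImages.front());
        m_readImages.pop_front();
        return img;
    }
};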
cf. http://en.cppreference.com/w/cpp/thread
Ignoring the question of "Should each operation run in an individual thread?", it appears that the objects you want to process move from thread to thread. In effect, they are uniquely owned by only one thread at a time (no thread ever needs to access data currently owned by another thread). There is a way to express just that in C++: std::unique_ptr.
Each step then only works on its owned image. All you have to do is find a thread-safe way to move the ownership of your images through the process steps one by one, which means the critical sections are only at the boundaries between tasks. Since you have multiple of these, abstracting it away would be reasonable:
class ProcessBoundary
{
public:
    void setImage(std::unique_ptr<ImageRead> newImage)
    {
        while (running)
        {
            {
                std::lock_guard<std::mutex> guard(m_mutex);
                if (m_imageToTransfer == nullptr)
                {
                    // Previous image has been taken by the next step, so we can place this one here.
                    m_imageToTransfer = std::move(newImage);
                    return;
                }
            }
            std::this_thread::yield();
        }
    }

    std::unique_ptr<ImageRead> getImage()
    {
        while (running)
        {
            {
                std::lock_guard<std::mutex> guard(m_mutex);
                if (m_imageToTransfer != nullptr)
                {
                    // An image is waiting for us; take ownership and hand it to the caller.
                    return std::move(m_imageToTransfer);
                }
            }
            std::this_thread::yield();
        }
        return nullptr;   // boundary was stopped
    }

    void stop()
    {
        running = false;
    }

private:
    std::mutex m_mutex;
    std::unique_ptr<ImageRead> m_imageToTransfer;
    std::atomic<bool> running; // Set to true in constructor
};
The process steps would then ask for an image with getImage(), which they uniquely own once that function returns. They process it and pass it to the setImage of the next ProcessBoundary.
You could probably improve on this with condition variables, or by adding a queue to this class so that threads can get back to processing the next image sooner. However, if some steps are faster than others, they will eventually be stalled by the slower ones anyway.
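A hypothetical wiring sketch for one of the middle stages (preprocess() stands in for that stage's real work):

void preprocess(ImageRead& image);   // hypothetical stand-in for this stage's work

void preprocessingLoop(ProcessBoundary& in, ProcessBoundary& out)
{
    // Keep pulling images until the input boundary is stopped (getImage() returns nullptr).
    while (auto image = in.getImage())
    {
        preprocess(*image);              // this stage uniquely owns the image here
        out.setImage(std::move(image));  // hand ownership on to the next stage
    }
}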
This is a design pattern problem. I suggest to read about concurrency design pattern and see if there is anything that would help you out.
If you want to add concurrency to the following sequential process:
InputStream->Pre-Processing->Computation->OutputStream
then I suggest using the active object design pattern. This way each stage is not blocked by the previous step and can run concurrently. It is also very simple to implement (here is an implementation:
http://www.drdobbs.com/parallel/prefer-using-active-objects-instead-of-n/225700095)
As for your question about each thread sharing a DTO: this is easily solved with a wrapper around the DTO. The wrapper has write and read functions; the write function blocks with a mutex and the read returns const data.
However, I think your problem lies in design. If the process is sequential as you described, then why are each process sharing the data? The data should be passed into the next process once the current one completes. In other words, each process should be decoupled.
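A minimal sketch of such a wrapper (Dto is a placeholder type; read() returns a copy so callers never observe a half-written object):

#include <mutex>

struct Dto { /* fields produced by one stage, consumed by the next */ };

class DtoWrapper {
public:
    void write(const Dto& d) {                       // called by the producing stage
        std::lock_guard<std::mutex> lock(m_mutex);
        m_value = d;
    }

    Dto read() const {                               // called by the consuming stage
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_value;
    }

private:
    mutable std::mutex m_mutex;                      // mutable so read() can stay const
    Dto m_value;
};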
You are correct in using mutexes and locks. For C++11, this is really the most elegant way of accessing complex data between threads.
Assume that I have code like:
void InitializeComplexClass(ComplexClass* c);
class Foo {
public:
Foo() {
i = 0;
InitializeComplexClass(&c);
}
private:
ComplexClass c;
int i;
};
If I now do something like Foo f; and hand a pointer to f over to another thread, what guarantees do I have that any stores done by InitializeComplexClass() will be visible to the CPU executing the other thread that accesses f? What about the store writing zero into i? Would I have to add a mutex to the class, take a writer lock on it in the constructor and take corresponding reader locks in any methods that accesses the member?
Update: Assume I hand a pointer over to a bunch of other threads once the constructor has returned. I'm not assuming that the code is running on x86, but could be instead running on something like PowerPC, which has a lot of freedom to do memory reordering. I'm essentially interested in what sorts of memory barriers the compiler has to inject into the code when the constructor returns.
In order for the other thread to be able to know about your new object, you have to hand the object over / signal the other thread somehow, and to signal a thread you write to memory. Both x86 and x64 perform all memory writes in order; the CPU does not reorder these operations with respect to each other. This is called "Total Store Ordering", so the CPU write queue works like "first in, first out".
Given that you create the object first and then pass it on to another thread, these changes to memory occur in that order and the other thread will always see them in the same order. By the time the other thread learns about the new object, the contents of the object are guaranteed to have been available to that thread even earlier (if the thread had somehow known where to look).
In conclusion, you do not have to synchronise anything this time. Handing over the object after it has been initialised is all the synchronisation you need.
Update: On non-TSO architectures you do not have this guarantee, so you need to synchronise. Use the MemoryBarrier() macro (or any interlocked operation), or some synchronisation API. Signalling the other thread through such an API also provides the synchronisation, otherwise it would not be a synchronisation API.
x86 and x64 CPUs may reorder writes past reads, but that is not relevant here. Just for better understanding: writes can be ordered after reads because writes to memory go through a write queue and flushing that queue may take some time. On the other hand, the read cache is always consistent with the latest updates from other processors (which have gone through their own write queues).
This topic has been made unbelievably confusing for many people, but in the end there are only a couple of things an x86/x64 programmer has to worry about:
- First, the existence of the write queue (and one should not be worried about the read cache at all).
- Second, concurrent writing and reading of the same variable from different threads when the variable is not of atomic length, which may cause data tearing, in which case you need synchronisation mechanisms.
- And finally, concurrent updates to the same variable from multiple threads, for which we have interlocked operations, or again synchronisation mechanisms.
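Expressed in portable C++ terms, the hand-over mechanism itself is what provides the ordering. For example, creating the consumer thread after the object is constructed is enough, because the std::thread constructor synchronizes-with the start of the new thread (workOn is a hypothetical function):

#include <thread>

void workOn(Foo* f) { /* reads f's members on the other thread */ (void)f; }

int main() {
    Foo f;                           // constructor stores into c and i
    std::thread worker(workOn, &f);  // everything written before this line is
                                     // visible inside workOn() on the new thread
    worker.join();
}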
If you do :
Foo f;
// HERE: InitializeComplexClass() and "i" member init are guaranteed to be completed
passToOtherThread(&f);
/* From this point, you cannot guarantee the state/members
of 'f' since another thread can modify it */
If you're passing an instance pointer to another thread, you need to implement guards in order for both threads to interact with the same instance. If you ONLY plan to use the instance on the other thread, you do not need guards. However, do not pass a pointer to a stack object as in your example; pass a newly allocated instance like this:
passToOtherThread(new Foo());
And make sure to delete it when you are done with it.
I have inherited an application which I'm trying to improve the performance of and it currently uses mutexes (std::lock_guard<std::mutex>) to transfer data from one thread to another. One thread is a low-frequency (slow) one which simply modifies the data to be used by the other.
The other thread (which we'll call fast) has rather stringent performance requirements (it needs to do maximum number of cycles per second possible) and we believe this is being impacted by the use of the mutexes.
Basically, the current logic is:
slow thread:                  fast thread:
  occasionally:                 very often:
    claim mutex                   claim mutex
    change data                   use data
    release mutex                 release mutex
In order to get the fast thread running at maximum throughput, I'd like to experiment with reducing the number of mutex locks it has to take.
I suspect a variation of the double-checked locking pattern may be of use here. I know it has serious issues with bi-directional data flow (or singleton creation), but the areas of responsibility in my case are a little more limited in terms of which thread performs which operations (and when) on the shared data.
Basically, the slow thread sets up the data and never reads or writes to it again unless a new change comes in. The fast thread uses and changes the data but never expects to pass any information back to the other thread. In other words, ownership mostly flows strictly one way.
I wanted to see if anyone could pick any holes in the strategy I'm thinking of.
The new idea is to have two sets of data, one current and one pending. There is no need for a queue in my case as incoming data overwrites previous data.
The pending data will only ever be written to by the slow thread under the control of the mutex and it will have an atomic flag to indicate that it has written and relinquished control (for now).
The fast thread will continue to use current data (without the mutex) until such time as the atomic flag is set. Since it is responsible for transferring pending to current, it can ensure the current data is always consistent.
At the point where the flag is set, it will lock the mutex, transfer pending to current, clear the flag, unlock the mutex and carry on.
So, basically, the fast thread runs at full speed and only does mutex locks when it knows the pending data needs to be transferred.
Getting into more concrete details, the class will have the following data members:
std::atomic_bool m_newDataReady;
std::mutex m_protectData;
MyStruct m_pendingData;
MyStruct m_currentData;
The method for receiving new data in the slow thread would be:
void NewData(const MyStruct &newData) {
    std::lock_guard<std::mutex> guard(m_protectData);
    m_newDataReady = false;
    m_pendingData = newData;    // transfer newData into the pending slot
    m_newDataReady = true;
}
Clearing the flag prevents the fast thread from even trying to check for new data until the immediate transfer operation is complete.
The fast thread is a little trickier, using the flag to keep mutex locks to a minimum:
while (true) {
    if (m_newDataReady) {
        std::lock_guard<std::mutex> guard(m_protectData);
        if (m_newDataReady) {
            m_currentData = m_pendingData;    // transfer pending into current
            m_newDataReady = false;
        }
    }
    Use(m_currentData);
}
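Pulled together, a self-contained version of what I have in mind looks like this (MyStruct and Use are stand-ins, and the "transfer" is plain assignment):

#include <atomic>
#include <mutex>

struct MyStruct { /* the shared data */ };

class SharedState {
public:
    // Slow thread: publish a new pending value.
    void NewData(const MyStruct& newData) {
        std::lock_guard<std::mutex> guard(m_protectData);
        m_newDataReady = false;
        m_pendingData = newData;
        m_newDataReady = true;
    }

    // Fast thread: one iteration of the hot loop.
    template <typename UseFn>
    void Cycle(UseFn use) {
        if (m_newDataReady) {                           // cheap atomic check, no lock
            std::lock_guard<std::mutex> guard(m_protectData);
            if (m_newDataReady) {                       // re-check under the lock
                m_currentData = m_pendingData;
                m_newDataReady = false;
            }
        }
        use(m_currentData);                             // no lock while using the data
    }

private:
    std::atomic_bool m_newDataReady{false};
    std::mutex m_protectData;
    MyStruct m_pendingData;
    MyStruct m_currentData;
};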
Now it appears to me that the use of this method in the fast thread could improve performance quite a bit:
There is only one place where the atomic flag is used outside the control of the mutex and the fact that it's an atomic means its state should be consistent there.
Even if it's not consistent, the second check inside the mutex-locked area should provide a safety valve (it's rechecked when we know it's consistent).
The transfer of data is only ever performed under the control of the mutex so that should always be consistent.
The outer check in the fast thread means unnecessary mutex locks are avoided; the mutex is only taken if the flag is true (or "half-true", a possibly inconsistent state).
The inner if takes care of that "half-true" possibility, where the flag has been cleared between checking it and locking the mutex.
I can't see any holes in this strategy but, given I'm only just getting into atomics/threading in the standard-C++ world, it may be I'm missing something.
Are there any clear problems in using this method?