Multithreading with a third-party library in C++

I'm currently using two functions from a third-party library in my application. The first function, .SourceMeasure, collects data from some hardware, while the second, .ComputeErrors, is purely a calculation on the data collected by the first. The measure-then-compute sequence is executed in a loop five times.
I'm thinking of moving .ComputeErrors to a worker thread to save some time.
Will there be an issue if .SourceMeasure runs in the main thread and .ComputeErrors runs in the worker thread, when both come from the same library?
//The execution is something like this..
for (int i = 0; i < 5; i++)
{
Lib.SourceMeasure(data);
Lib.ComputeErrors(data); // Want to put this in a separate thread
}

I don't know which library you're using, but it's almost certain that you cannot run Lib.ComputeErrors() while Lib.SourceMeasure() is still running on the same data set.
What you can do is to set up a queue and two threads:
The "measure thread":
create a data item
call Lib.SourceMeasure() on it
push data to a FIFO queue
The "compute thread":
wait while the queue is empty
pop a data item from the queue
call Lib.ComputeErrors() on it
As a result, the measurement and the computation run in parallel (not on the same item, of course; the measurement will always be a few items ahead). All you need is a thread-safe queue; a minimal sketch follows below.
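Here is a minimal sketch of that producer/consumer arrangement using only the standard library. Data, sourceMeasure() and computeErrors() are stand-ins for whatever your third-party Lib actually provides:
```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Data { /* whatever the measurement fills in */ };

// Placeholders standing in for the third-party calls (Lib.SourceMeasure / Lib.ComputeErrors).
void sourceMeasure(Data& d)       { /* collect from hardware */ }
void computeErrors(const Data& d) { /* pure computation */ }

std::queue<Data> work;
std::mutex m;
std::condition_variable cv;
bool done = false;

void measureThread() {
    for (int i = 0; i < 5; ++i) {
        Data d;
        sourceMeasure(d);                                   // runs in the "measure" thread
        { std::lock_guard<std::mutex> lk(m); work.push(std::move(d)); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }     // signal "no more items"
    cv.notify_one();
}

void computeThread() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !work.empty() || done; });  // sleep while the queue is empty
        if (work.empty()) return;                           // done and nothing left
        Data d = std::move(work.front());
        work.pop();
        lk.unlock();                                        // compute outside the lock
        computeErrors(d);
    }
}

int main() {
    std::thread producer(measureThread);
    std::thread consumer(computeThread);
    producer.join();
    consumer.join();
}
```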

Related

Hard Realtime C++ for Robot Control

I am trying to control a robot using a template-based controller class written in C++. Essentially I have a UDP connection set up with the robot to receive its state and to send new torque commands to it. I receive new observations at a higher frequency (say 2000 Hz), and my controller takes about 1 ms (1000 Hz) to calculate new torque commands. The problem I am facing is that I don't want my main code to wait to send the old torque commands while my controller is still calculating the new ones. From what I understand, I can use Ubuntu with an RT-Linux kernel, multi-thread the code so that my getTorques() method runs in a different thread, set priorities for the process, and use mutexes and locks to avoid data races between the two threads, but I was hoping to learn what the best strategies are for writing hard-realtime code for such a problem.
// main.cpp
#include "CONTROLLER.h"
#include "llapi.h"
int main() {
...
CONTROLLERclass obj;
...
double new_observation;
double u;
...
while(communicating){
get_newObs(new_observation); // Get new state of the robot (2000Hz)
obj.getTorques(new_observation, u); // Takes about 1ms to calculate new torques
send_newCommands(u); // Send the new torque commands to the robot
}
...
}
Thanks in advance!
Okay, so first of all, it sounds to me like you need to deal with the fact that you receive input at 2 kHz, but can only compute results at about 1 kHz.
Based on that, you're apparently going to have to discard roughly half the inputs, or else somehow (in a way that makes sense for your application) quickly combine the inputs that have arrived since the last time you processed the inputs.
But as the code is structured right now, you're going to fetch and process older and older inputs, so even though you're producing outputs at ~1 KHz, those outputs are constantly being based on older and older data.
For the moment, let's assume you want to receive inputs as fast as you can, and when you're ready to do so, you process the most recent input you've received, produce an output based on that input, and repeat.
In that case, you'd probably end up with something on this general order (using C++ threads and atomics for the moment):
std::atomic<double> new_observation;

std::thread receiver([&] {
    while (communicating) {           // `communicating` is the flag from your loop
        double d;                     // local copy we can pass by reference
        get_newObs(d);
        new_observation = d;          // publish the latest sample
    }
});

std::thread sender([&] {
    while (communicating) {
        double input = new_observation;   // grab the most recent sample
        auto u = get_torques(input);
        send_newCommands(u);
    }
});
I've assumed that you'll always receive input faster than you can consume it, so the processing thread can always process whatever input is waiting, without needing any indication that the input has been updated since it was last processed. If that's wrong, things get a little more complex, but I'm not going to try to deal with that right now, since it sounds like it's unnecessary.
As far as the code itself goes, the only thing that may not be obvious is that instead of passing a reference to new_observation (the atomic) to either of the existing functions, I've read it into a variable local to the thread, then passed a reference to that.
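If you do need to detect whether a fresh sample has actually arrived (the more complex case mentioned above), one possible sketch is to publish an atomic sequence counter alongside the value; the names here are purely illustrative:
```cpp
#include <atomic>
#include <cstdint>

std::atomic<double>        latest_value{0.0};
std::atomic<std::uint64_t> latest_seq{0};

// Producer side: store the value first, then bump the counter.
void publish(double v) {
    latest_value = v;
    ++latest_seq;
}

// Consumer side: returns true only if a sample newer than `last_seen` was read.
bool try_consume(std::uint64_t& last_seen, double& out) {
    std::uint64_t s = latest_seq;
    if (s == last_seen) return false;   // nothing new since the last call
    out = latest_value;
    last_seen = s;
    return true;
}
```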

Do I need dedicated fences/semaphores per swap chain image, per frame or per command pool in Vulkan?

I've read several articles on the CPU-GPU (using fences) and GPU-GPU (using semaphores) synchronization mechanisms, but I still have trouble understanding how I should implement a simple render loop.
Please take a look at the simple render() function below. If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
However, is this really safe? All operations are asynchronous. So, is it really safe to "reuse" the image_available semaphore in a subsequent call of render() even though the signal request from the previous call hasn't fired yet? I would think it's not, but, on the other hand, we're using the same queues (I don't know if it matters whether the graphics and presentation queues are actually the same) and operations inside a queue should be consumed sequentially ... But if I got it right, they might not be consumed "as a whole" and could be reordered ...
The second thing is that (again, unless I'm missing something) I clearly should use one fence per swap chain image to ensure that the operation on the image corresponding to the image_index of the call to render() has finished. But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR? And do I then need dedicated image_available and rendering_finished semaphores per swap chain image? Or maybe per frame? Or maybe per command buffer/pool? I'm really confused ...
void render()
{
std::uint32_t image_index;
switch (vkAcquireNextImageKHR(device(), swap_chain().handle(),
std::numeric_limits<std::uint64_t>::max(), m_image_available, VK_NULL_HANDLE, &image_index))
{
case VK_SUBOPTIMAL_KHR:
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkAcquireNextImageKHR");
}
static VkPipelineStageFlags constexpr wait_destination_stage_mask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.waitSemaphoreCount = 1;
submit_info.pWaitSemaphores = &m_image_available;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores = &m_rendering_finished;
submit_info.pWaitDstStageMask = &wait_destination_stage_mask;
if (vkQueueSubmit(graphics_queue().handle, 1, &submit_info, VK_NULL_HANDLE) != VK_SUCCESS)
throw std::runtime_error("vkQueueSubmit");
VkPresentInfoKHR present_info{};
present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores = &m_rendering_finished;
present_info.swapchainCount = 1;
present_info.pSwapchains = &swap_chain().handle();
present_info.pImageIndices = &image_index;
switch (vkQueuePresentKHR(presentation_queue().handle, &present_info))
{
case VK_SUCCESS:
break;
case VK_ERROR_OUT_OF_DATE_KHR:
case VK_SUBOPTIMAL_KHR:
on_resized();
return;
default:
throw std::runtime_error("vkQueuePresentKHR");
}
}
EDIT: As suggested in the answers below, assume we have k "frames in flight" and hence k instances of the semaphores and the fence used in the code above, which I will denote by m_image_available[i], m_rendering_finished[i] and m_fence[i] for i = 0, ..., k - 1. Let i denote the current index of the frame in flight, which is increased by 1 after each invocation of render(), and j denote the number of invocations of render(), starting from j = 0.
Now, assume the swap chain contains three images.
If j = 0, then i = 0 and the first frame in flight is using swap chain image 0
In the same way, if j = a, then i = a and the a-th frame in flight is using swap chain image a, for a = 1, 2
Now, if j = 3, then i = 3, but since the swap chain only has three images, the fourth frame in flight is using swap chain image 0 again. I wonder whether this is problematic or not. I guess it's not, since the wait/signal semaphores m_image_available[3]/m_rendering_finished[3], used in the calls of vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR in this invocation of render(), are dedicated to this particular frame in flight.
If we reach j = k, then i = 0 again, since there are only k frames in flight. Now we potentially wait at the beginning of render(), if the call to vkQueuePresentKHR from the first invocation (i = 0) of render() hasn't signaled m_fence[0] yet.
So, besides my doubts described in the third bullet point above, the only question that remains is why I shouldn't make k as large as possible. What I theoretically could imagine is that if we are submitting work to the GPU faster than the GPU is able to consume it, the queue(s) in use might grow continually and eventually overflow (is there some kind of "max commands in queue" limit?).
If I got it right, the minimal requirement is that we ensure the GPU-GPU synchronization between vkAcquireNextImageKHR, vkQueueSubmit and vkQueuePresentKHR by a single set of semaphores image_available and rendering_finished as I've done in the example code below.
Yes, you got it right. You submit the desire to get a new image to render into via vkAcquireNextImageKHR. The presentation engine will signal the m_image_available semaphore as soon as an image to render into has become available. But you have already submitted the instruction.
Next, you submit some commands to the graphics queue via submit_info. I.e. they are also already submitted to the GPU and wait there until the m_image_available semaphore receives its signal.
Furthermore, a presentation instruction is submitted to the presentation engine that expresses the dependency that it needs to wait until the submit_info-commands have completed by waiting on the m_rendering_finished semaphore.
I.e. everything has been submitted. If nothing has been signalled yet, everything just sits there in some GPU buffers and waits for signals.
Now, if your code loops right back into the render() function and re-uses the same m_image_available and m_rendering_finished semaphores, it will only work if you are very lucky, namely if all the semaphores have already been signalled before you use them again.
The specification says the following for vkAcquireNextImageKHR:
If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending
and furthermore, it says under 7.4.2. Semaphore Waiting
the act of waiting for a binary semaphore also unsignals that semaphore.
I.e. indeed, you need to wait on the CPU until you know for sure that the previous vkAcquireNextImageKHR that uses the same m_image_available semaphore has completed.
And yes, you already got it right: You need to use a fence for that which you pass to vkQueueSubmit. If you do not synchronize on the CPU, you'll shovel ever more work to the GPU (which is a problem) and the semaphores that you are re-using might not get properly unsignalled in time (which is a problem).
What is often done is that the semaphores and fences are multiplied, e.g. to 3 each, and these sets of synchronization objects are used in sequence, so that more work can be parallelized on the GPU. The Vulkan Tutorial describes this quite nicely in its Rendering and presentation chapter. It is also explained with animation in this lecture starting at 7:59.
So first of all, as you mentioned correctly, semaphores are strictly for GPU-GPU synchronization, e.g. to make sure that one batch of commands (one submit) has finished before another one starts. Here they are used to synchronize the rendering commands with the present command, so that the presentation engine knows when to present the rendered image.
Fences are the main utility for CPU-GPU synchronization. You place a fence in a queue submit and then, on the CPU side, wait for it before you proceed. Here this is usually done so that we do not queue any new rendering/present commands while the previous frame hasn't finished.
But does that mean that I necessarily need to do a
if (vkWaitForFences(device(), 1, &fence[image_index_of_last_call], VK_FALSE, std::numeric_limits<std::uint64_t>::max()) != VK_SUCCESS)
throw std::runtime_error("vkWaitForFences");
vkResetFences(device(), 1, &fence[image_index_of_last_call]);
before my call to vkAcquireNextImageKHR?
Yes, you definitely need this in your code, otherwise your semaphores would not be safe and you would probably get validation errors.
In general, if you want your CPU to wait until your GPU has finished rendering of the previous frame, you would have only a single fence and a single pair of semaphores. You could also replace the fence by a waitIdle command of the queue or device.
However, in practice you do not want to stall the CPU; you want to record commands for the next frame in the meantime. This is done via frames in flight. It simply means that for every frame in flight (i.e. each of the frames that can be recorded in parallel to execution on the GPU), you have one fence and one pair of semaphores which synchronize that particular frame.
So in essence, for your render loop to work properly you need a pair of semaphores plus a fence per frame in flight, independent of the number of swapchain images. However, do note that the current frame index (frame in flight) and the image index (swapchain) will generally not be the same, unless you use the same number of swapchain images as frames in flight. This is because the presentation engine might give you swapchain images out of order, depending on your present mode.
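To make the bookkeeping concrete, here is a rough sketch of the per-frame-in-flight synchronization both answers describe. MAX_FRAMES_IN_FLIGHT and the array names are assumptions, and device() / swap_chain() are the helpers from the question's code; error handling is omitted:
```cpp
#include <array>
#include <cstdint>

constexpr std::uint32_t MAX_FRAMES_IN_FLIGHT = 2;   // assumed value

std::array<VkSemaphore, MAX_FRAMES_IN_FLIGHT> m_image_available;
std::array<VkSemaphore, MAX_FRAMES_IN_FLIGHT> m_rendering_finished;
std::array<VkFence,     MAX_FRAMES_IN_FLIGHT> m_fence;   // create these signalled
std::uint32_t current_frame = 0;

void render()
{
    // Wait until the GPU has finished the work submitted with this frame's fence,
    // so its two semaphores are guaranteed to be free for reuse.
    vkWaitForFences(device(), 1, &m_fence[current_frame], VK_TRUE, UINT64_MAX);
    vkResetFences(device(), 1, &m_fence[current_frame]);

    std::uint32_t image_index;
    vkAcquireNextImageKHR(device(), swap_chain().handle(), UINT64_MAX,
                          m_image_available[current_frame], VK_NULL_HANDLE, &image_index);

    // ... record and submit as in the question, but wait on
    // m_image_available[current_frame], signal m_rendering_finished[current_frame],
    // and pass m_fence[current_frame] as the last argument of vkQueueSubmit ...

    // ... present, waiting on m_rendering_finished[current_frame] ...

    current_frame = (current_frame + 1) % MAX_FRAMES_IN_FLIGHT;
}
```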

Multithreading Implementation in C++

I am a beginner using multithreading in C++, so I'd appreciate it if you can give me some recommendations.
I have a function which receives the previous frame and the current frame from a video stream (let's call this function readFrames()). The task of that function is to compute Motion Estimation.
The idea when calling readFrames() would be:
Store the previous and current frame in a buffer.
I want to compute the value of Motion between each pair of frames from the buffer but without blocking the function readFrames(), because more frames can be received while computing that value. I suppose I have to write a function computeMotionValue() and every time I want to execute it, create a new thread and launch it. This function should return some float motionValue.
Every time the motionValue returned by any thread is over a threshold, I want to +1 a common int variable, let's call it nValidMotion.
My problem is that I don't know how to "synchronize" the threads when accessing motionValue and nValidMotion.
Can you please explain to me in some pseudocode how can I do that?
and every time I want to execute it, create a new thread and launch it
That's usually a bad idea. Threads are usually fairly heavy-weight, and spawning one is usually slower than just passing a message to an existing thread pool.
Anyway, if you fall behind, you'll end up with more threads than processor cores and then you'll fall even further behind due to context-switching overhead and memory pressure. Eventually creating a new thread will fail.
My problem is that I don't know how to "synchronize" the threads when accessing motionValue and nValidMotion.
Synchronization of access to a shared resource is usually handled with std::mutex (mutex means "mutual exclusion", because only one thread can hold the lock at once).
If you need to wait for another thread to do something, use std::condition_variable to wait/signal. You're waiting-for/signalling a change in state of some shared resource, so you need a mutex for that as well.
The usual recommendation for this kind of processing is to have at most one thread per available core, all serving a thread pool. A thread pool has a work queue (protected by a mutex, and with the empty->non-empty transition signalled by a condvar).
For combining the results, you could have a global counter protected by a mutex (but this is relatively heavy-weight for a single integer), you could have each task added to the thread pool return a bool via the promise/future mechanism, or you could just make your counter atomic.
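For example, the atomic-counter variant might look roughly like this; Frame, THRESHOLD and the stubbed computeMotionValue() are placeholders for your own types and code:
```cpp
#include <atomic>
#include <thread>

struct Frame { /* image data */ };                        // placeholder type
constexpr float THRESHOLD = 0.5f;                         // assumed value

// Stub standing in for the real computation.
float computeMotionValue(const Frame&, const Frame&) { return 1.0f; }

std::atomic<int> nValidMotion{0};

// One task, runnable on any thread (or submitted to a thread pool).
void motionTask(Frame prev, Frame curr) {
    if (computeMotionValue(prev, curr) > THRESHOLD)
        ++nValidMotion;                                   // atomic increment, no mutex needed
}

int main() {
    std::thread t(motionTask, Frame{}, Frame{});
    t.join();
    return nValidMotion.load() > 0 ? 0 : 1;
}
```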
Here is a sample pseudo code you may use:
// The following thread awaits notifications from worker threads that detected motion
nValidMotion_worker_Thread()
{
while(true) { message_receive(msg_q); ++nValidMotion; }
}
// Worker thread, computing motion on 2 frames; if motion is detected, notify nValidMotion_worker_Thread using the message queue
WorkerThread(frame1, frame2)
{
x = computeMotionValue(frame1, frame2);
if x > THRESHOLD
msg_q.send();
}
// main thread
main_thread()
{
// 1. create new message Q for inter-thread communication
msg_q = new msg_q();
// start listening thread
Thread a = new nValidMotion_worker_Thread();
a.start();
while(true)
{
// collect 2 frames
frame1 = readFrames();
frame2 = readFrames();
// start worker thread
Thread b = new WorkerThread(frame1, frame2);
b.start();
}
}

How does one update the GTK+ GUI in C++ with time-consuming operations?

I am using OpenMP to perform a time-consuming operation. I am unable to update a GTK+ ProgressBar from within the time-consuming loop while the operations are carried out. The code I have updates the ProgressBar, but it does so only after everything is done, not as the code progresses.
This is my dummy code that doesn't update the ProgressBar until everything is done:
void largeTimeConsumingFunction (GtkProgressBar** progressBar) {
int extensiveOperationSize = 1000000;
#pragma omp parallel for ordered schedule(dynamic)
for (int i = 0; i < extensiveOperationSize; i++) {
// Do something that will take a lot of time with data
#pragma omp ordered
{
// Update the progress bar
gtk_progress_bar_set_fraction(*progressBar, i/(double)extensiveOperationSize);
}
}
}
When I do the same, but without using OpenMP, the same happens. It doesn't get updated until the end.
How could I get that GTK+ Widget to update at the same time the loop is working?
Edit: This is just dummy code to keep it short and readable. It has the same structure as my actual code, but in my actual code I don't know beforehand the number of items I will be processing. It could be 10 or more than 1 million items, and I will have to perform some action for each of them.
There are two potential issues here:
First, if you are performing long-running computations that might block the main thread, you have to call
while (gtk_events_pending ())
gtk_main_iteration ();
every now and then to keep the UI responsive (which includes redrawing itself).
Second, you should call GTK+ functions only from main thread.
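One common way to satisfy both points is to keep the GTK+ calls in the main loop by posting each progress update from the worker with g_idle_add(). A rough sketch (ProgressUpdate and post_progress() are just illustrative names, not existing API):
```cpp
#include <gtk/gtk.h>

// Payload handed from the worker to the main loop.
struct ProgressUpdate {
    GtkProgressBar* bar;
    double fraction;
};

// Runs in the GTK+ main thread, scheduled by the main loop.
static gboolean apply_progress(gpointer user_data) {
    ProgressUpdate* u = static_cast<ProgressUpdate*>(user_data);
    gtk_progress_bar_set_fraction(u->bar, u->fraction);
    delete u;
    return G_SOURCE_REMOVE;   // run the callback once, then drop the idle source
}

// Call this from the OpenMP loop (any thread) instead of touching the widget directly.
static void post_progress(GtkProgressBar* bar, double fraction) {
    g_idle_add(apply_progress, new ProgressUpdate{bar, fraction});
}
```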

Parse very large CSV files with C++

My goal is to parse large CSV files with C++ in a Qt project in an OS X environment.
(When I say CSV I also mean TSV and other variants, 1 GB ~ 5 GB.)
It seems like a simple task, but things get complicated when file sizes grow. I don't want to write my own parser because of the many edge cases related to parsing CSV files.
I have found various CSV processing libraries to handle this job, but parsing a 1 GB file takes about 90-120 seconds on my machine, which is not acceptable. I am not doing anything with the data right now; I just process and discard it for testing purposes.
cccsvparser is one of the libraries I have tried. But the only library fast enough was fast-cpp-csv-parser, which gives acceptable results: 15 seconds on my machine. However, it only works when the file structure is known.
Example using: fast-cpp-csv-parser
#include "csv.h"
int main(){
io::CSVReader<3> in("ram.csv");
in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
std::string vendor; int size; double speed;
while(in.read_row(vendor, size, speed)){
// do stuff with the data
}
}
As you can see, I cannot load arbitrary files and I must specifically define variables to match my file structure. I'm not aware of any method that allows me to create those variables dynamically at runtime.
The other approach I have tried is to read the CSV file line by line with the fast-cpp-csv-parser LineReader class, which is really fast (about 7 seconds to read the whole file), and then parse each line with the cccsvparser library, which can process strings. This takes about 40 seconds in total; it is an improvement over the first attempts, but still unacceptable.
I have seen various Stack Overflow questions related to CSV file parsing, but none of them takes large-file processing into account.
Also, I spent a lot of time googling for a solution to this problem, and I really miss the freedom that package managers like npm or pip offer when searching for out-of-the-box solutions.
I will appreciate any suggestion about how to handle this problem.
Edit:
When using #fbucek's approach, processing time was reduced to 25 seconds, which is a great improvement.
Can we optimize this even more?
I am assuming you are using only one thread.
Multithreading can speed up your process.
The best result so far is 40 seconds. Let's stick to that.
I have assumed that first you read, then you process (about 7 seconds to read the whole file):
7 sec for reading
33 sec for processing
First of all, you can divide your file into chunks, let's say of 50 MB each.
That means you can start processing after reading 50 MB of the file. You do not need to wait until the whole file is read.
That's 0.35 sec for reading (now it is 0.35 + 33 seconds for processing, roughly 34 sec).
When you use multithreading, you can process multiple chunks at a time. That can theoretically speed up the process by up to the number of your cores. Let's say you have 4 cores.
That's 33/4 = 8.25 sec.
I think you can speed up your processing with 4 cores to about 9 seconds in total.
Look at QThreadPool and QRunnable or QtConcurrent.
I would prefer QThreadPool.
Divide the task into parts:
First, try to loop over the file and divide it into chunks, doing nothing else with them.
Then create a "ChunkProcessor" class which can process a chunk.
Make "ChunkProcessor" a subclass of QRunnable and execute your processing in the reimplemented run() function.
Once you have chunks and a class which can process them, and that class is QThreadPool-compatible, you can pass it into the pool.
It could look like this
loopoverfile {
whenever chunk is ready {
ChunkProcessor *chunkprocessor = new ChunkProcessor(chunk);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>))); // connect before start so the signal cannot be missed
QThreadPool::globalInstance()->start(chunkprocessor);
}
}
You can use std::shared_ptr to pass the processed data, so that you do not need QMutex or something similar, and you avoid problems with multiple threads accessing the same resource.
Note: in order to use custom signal you have to register it before use
qRegisterMetaType<std::shared_ptr<ProcessedData>>("std::shared_ptr<ProcessedData>");
Edit: (based on discussion, my answer was not clear about that)
It does not matter what disk you use or how fast it is. Reading is a single-threaded operation.
This solution was suggested only because reading took 7 seconds; again, it does not matter what disk it is, 7 seconds is what counts. The only purpose is to start processing as soon as possible and not to wait until reading is finished.
You can use:
QByteArray data = file.readAll();
Or you can use the principal idea (I do not know why it takes 7 seconds to read, or what is behind it):
QFile file("in.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
return;
QByteArray* data = new QByteArray;
int count = 0;
while (!file.atEnd()) {
++count;
data->append(file.readLine());
if ( count > 10000 ) {
ChunkProcessor *chunkprocessor = new ChunkProcessor(data);
QThreadPool::globalInstance()->start(chunkprocessor);
connect(chunkprocessor, SIGNAL(finished(std::shared_ptr<ProcessedData>)), this, SLOT(readingFinished(std::shared_ptr<ProcessedData>)));
data = new QByteArray;
count = 0;
}
}
One file, one thread, read almost as fast as reading line by line "without" interruption.
What you do with the data is another problem, but it has nothing to do with I/O. It is already in memory.
So the only concern would be a 5 GB file and the amount of RAM on the machine.
It is a very simple solution: all you need is to subclass QRunnable, reimplement the run() function, emit a signal when it is finished, pass the processed data using a shared pointer, and in the main thread join that data into one structure or whatever. A simple thread-safe solution; a rough sketch of such a class is below.
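A rough sketch of what such a ChunkProcessor could look like. ProcessedData and the parsing inside run() are placeholders, and the class also inherits QObject (an assumption about the wiring) so that it can emit the finished() signal:
```cpp
#include <QByteArray>
#include <QObject>
#include <QRunnable>
#include <memory>

struct ProcessedData { /* parsed rows, counters, ... */ };   // placeholder

class ChunkProcessor : public QObject, public QRunnable
{
    Q_OBJECT
public:
    explicit ChunkProcessor(QByteArray* chunk) : m_chunk(chunk) {}

    void run() override
    {
        auto result = std::make_shared<ProcessedData>();
        // ... parse m_chunk line by line into *result ...
        delete m_chunk;                 // the chunk was handed over to this task
        emit finished(result);          // delivered to the main thread as a queued call
    }

signals:
    void finished(std::shared_ptr<ProcessedData> result);

private:
    QByteArray* m_chunk;
};
```
Note that QThreadPool deletes the runnable automatically after run() returns (autoDelete is true by default), so only the chunk itself needs explicit cleanup.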
I would propose a multi-threaded approach with a slight variation: one thread is dedicated to reading the file in chunks of a predefined (configurable) size, and it keeps feeding data to a set of processing threads (more than one, based on the number of CPU cores). Let us say that the configuration looks like this:
chunk size = 50 MB
Disk Thread = 1
Process Threads = 5
Create a class for reading data from the file. This class holds a data structure which is used to communicate with the process threads; for example, this structure contains the starting and ending offset of the read buffer for each process thread. For reading file data, the reader class holds two buffers, each of chunk size (50 MB in this case).
Create a process class which holds (shared) pointers to the read buffers and the offsets data structure.
Now create a driver (probably the main thread) which creates all the threads, waits on their completion and handles the signals.
The reader thread is invoked with the reader class, reads 50 MB of data and, based on the number of threads, creates the offsets data structure object. In this case t1 handles 0-10 MB, t2 handles 10-20 MB and so on. Once ready, it notifies the processor threads. It then immediately reads the next chunk from disk and waits for the completion notification from the processor threads.
On notification, each processor thread reads data from its buffer and processes it. Once done, it notifies the reader thread about completion and waits for the next chunk.
This process continues until all the data is read and processed. Then the reader thread notifies the main thread about completion, which sends PROCESS_COMPLETION, upon which all threads exit; or the main thread chooses to process the next file in the queue.
Note that offsets are used for easy explanation; the mapping of offsets to line delimiters needs to be handled programmatically.
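A condensed sketch of that reader/processor handshake using standard threads. It is single-buffered for brevity, the offsets-to-line-delimiter mapping is left out, and processRange() is a placeholder for the actual parsing:
```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

constexpr std::size_t CHUNK_SIZE = 50 * 1024 * 1024;  // 50 MB, as in the example config
constexpr int WORKERS = 5;

std::mutex m;
std::condition_variable cv_workers, cv_reader;
std::string chunk;       // shared read buffer (single-buffered here, for brevity)
int generation = 0;      // bumped for every new chunk
int pending = 0;         // workers still busy with the current chunk
bool done = false;

void processRange(const std::string&, std::size_t, std::size_t) { /* placeholder parsing */ }

void worker(int id) {
    int seen = 0;
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv_workers.wait(lk, [&] { return done || generation != seen; });
        if (done) return;
        seen = generation;
        std::size_t begin = chunk.size() * id / WORKERS;        // naive split; real code
        std::size_t end   = chunk.size() * (id + 1) / WORKERS;  // must align to line ends
        lk.unlock();
        processRange(chunk, begin, end);
        lk.lock();
        if (--pending == 0) cv_reader.notify_one();  // last worker wakes the reader
    }
}

void reader(const char* path) {
    std::vector<std::thread> pool;
    for (int i = 0; i < WORKERS; ++i) pool.emplace_back(worker, i);

    std::ifstream in(path, std::ios::binary);
    std::string next(CHUNK_SIZE, '\0');
    while (in.read(&next[0], CHUNK_SIZE) || in.gcount() > 0) {
        next.resize(static_cast<std::size_t>(in.gcount()));
        std::unique_lock<std::mutex> lk(m);
        cv_reader.wait(lk, [] { return pending == 0; });  // previous chunk fully processed
        chunk.swap(next);
        pending = WORKERS;
        ++generation;
        lk.unlock();
        cv_workers.notify_all();
        next.assign(CHUNK_SIZE, '\0');
    }
    std::unique_lock<std::mutex> lk(m);
    cv_reader.wait(lk, [] { return pending == 0; });
    done = true;
    lk.unlock();
    cv_workers.notify_all();
    for (auto& t : pool) t.join();
}

int main() { reader("ram.csv"); }
```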
If the parser you are using is not distributed, the approach is obviously not scalable.
I would vote for a technique like the one below:
chunk up the file into a size that can be handled by a machine / time constraint
distribute the chunks to a cluster of machines (1..*) that can meet your time/space requirements
consider dealing in block sizes within a given chunk
avoid threads on the same resource (i.e. a given block) to save yourself from all thread-related issues
use threads for non-competing (per-resource) operations, such as reading on one thread and writing to a different file on a different thread
do your parsing (now for this small chunk you can invoke your parser).
do your operations.
merge the results back, or distribute them as they are if you can.
Now, having said that, why can't you use Hadoop like frameworks?