I'm looking at how to migrate some existing C++ code from thread-based to task-based parallelism, and whether that migration is desirable. Here's my scenario:
Suppose I have some function to execute on an event. Say I have a camera and each time a frame arrives I want to do some heavy processing and save the results. Some of the processing is serial, so if I just process each frame serially in the same thread, I don't get full CPU utilization. Say the frames arrive every 33ms and the processing latency for a frame is close to 100ms.
So in my current implementation I create, say, 3 threads that process frames and assign each new frame to one of these worker threads in round-robin order. So thread T0 might process frames F0, F3, F6, etc. Now I get full CPU utilization and I don't have to drop frames to maintain real-time rates.
Since the processing requires various big, temporary resources, I can allocate those up-front for each worker thread, so they do not have to be re-allocated for every frame. This per-thread-resource strategy hits the right granularity: allocating the resources per frame would take too long, while many more worker threads would run out of resources.
I do not see a way to replace this thread-based parallelism with task-based parallelism using standard C++11 or Microsoft's PPL library. If there is a pattern for doing so that could be sketched below, I would be very happy to learn it.
The question is where to store the state - the allocated temporary resources (e.g. GPU memory) - which can be re-used for subsequent frames but must not conflict with the resources for a currently processing frame.
Is it even desirable to migrate to task-based parallelism in this kind of case?
I figured this out. Here is an example solution:
#include <chrono>
#include <iostream>
#include <memory>
#include <ppltasks.h>
#include <thread>
#include <vector>

// Stand-ins for the real state types; PipelineState would hold the big,
// reusable per-pipeline resources (e.g. GPU memory).
using PipelineState = int;
using PipelineStateArg = std::shared_ptr<PipelineState>;
using FrameState = int;

struct Pipeline
{
    PipelineStateArg state;
    concurrency::task<void> task; // tail of this pipeline's continuation chain
};

std::vector<Pipeline> pipelines;

void proc(const FrameState& fs, PipelineState& ps)
{
    std::cout << "Process frame " << fs << " in pipeline " << ps << std::endl;
}

void on_frame(int index)
{
    FrameState frame = index;
    if (index < 2)
    {
        // Start a new pipeline
        auto state = std::make_shared<PipelineState>(index);
        pipelines.push_back({state, concurrency::create_task([=]()
        {
            proc(frame, *state);
        })});
    }
    else
    {
        // Use an existing pipeline: chain onto its tail so frames that share
        // state are processed strictly one after another
        auto& pipeline = pipelines[index & 1];
        auto state = pipeline.state;
        pipeline.task = pipeline.task.then([=]()
        {
            proc(frame, *state);
        });
    }
}

int main()
{
    for (int i = 0; i < 100; ++i)
    {
        on_frame(i);
        std::this_thread::sleep_for(std::chrono::milliseconds(33));
    }
    for (auto& pipeline : pipelines)
        pipeline.task.wait();
}
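To relate this back to the question above: the reusable state lives in each Pipeline entry and is shared with the tasks through a shared_ptr that every continuation captures by value, and chaining with .then() guarantees that at most one frame is touching a given pipeline's resources at any time. Frames are still distributed round-robin (index & 1 picks one of the two pipelines), just across task chains instead of dedicated threads.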
Related
So I'm trying to create a program that implements a function that generates a random number n and, based on n, creates n threads. The main thread is responsible for printing the minimum and maximum of the leaves. The depth of the hierarchy, including the main thread, is 3.
I have written the code below:
#include <iostream>
#include <thread>
#include <time.h>
#include <string>
#include <sstream>

using namespace std;

// a structure to keep the needed information of each thread
struct ThreadInfo
{
    long randomN;
    int level;
    bool run;
    int maxOfVals;
    double minOfVals;
};

// The start address (function) of the threads
void ChildWork(void* a) {
    ThreadInfo* info = (ThreadInfo*)a;

    // Generate random value n
    srand(time(NULL));
    double n = rand() % 6 + 1;

    // initialize the thread info with the n value
    info->randomN = n;
    info->maxOfVals = n;
    info->minOfVals = n;

    // the depth of recursion should not be more than 3
    if (info->level > 3)
    {
        info->run = false;
    }

    // Create n threads and run them
    ThreadInfo* childInfo = new ThreadInfo[(int)n];
    for (int i = 0; i < n; i++)
    {
        childInfo[i].level = info->level + 1;
        childInfo[i].run = true;
        std::thread tt(ChildWork, &childInfo[i]);
        tt.detach();
    }

    // checks if any child threads are working
    bool anyRun = true;
    while (anyRun)
    {
        anyRun = false;
        for (int i = 0; i < n; i++)
        {
            anyRun = anyRun || childInfo[i].run;
        }
    }

    // once all child threads are done, we find their max and min value
    double maximum = 1, minimum = 6;
    for (int i = 0; i < n; i++)
    {
        // cout << childInfo[i].maxOfVals << endl;
        if (childInfo[i].maxOfVals >= maximum)
            maximum = childInfo[i].maxOfVals;
        if (childInfo[i].minOfVals < minimum)
            minimum = childInfo[i].minOfVals;
    }
    info->maxOfVals = maximum;
    info->minOfVals = minimum;

    // we set the info->run value to false, so that the parent thread of this thread will know that it is done
    info->run = false;
}

int main()
{
    ThreadInfo info;
    srand(time(NULL));
    double n = rand() % 6 + 1;
    cout << "n is: " << n << endl;

    // initializing thread info
    info.randomN = n;
    info.maxOfVals = n;
    info.minOfVals = n;
    info.level = 1;
    info.run = true;

    std::thread t(ChildWork, &info);
    t.join();
    while (info.run);
    info.maxOfVals = max<unsigned long>(info.randomN, info.maxOfVals);
    info.minOfVals = min<unsigned long>(info.randomN, info.minOfVals);
    cout << "Max is: " << info.maxOfVals << " and Min is: " << info.minOfVals;
}
The code compiles with no error, but when I execute it, it gives me this:
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error:
thread constructor failed: Resource temporarily unavailable
Abort trap: 6
You spawn too many threads. It looks a bit like a fork() bomb. Threads are a very heavy-weight system resource. Use them sparingly.
Within the function ChildWork I see two mistakes:
As someone already pointed out in the comments, you check the info level of a thread and then you go and create more threads regardless: the check only sets info->run = false and never returns, so the spawning code below it still runs at every level and the recursion never stops.
Within the for loop that spawns your new threads, you set the level of each freshly created ThreadInfo right before you spawn the actual thread. But note that ThreadInfo* childInfo = new ThreadInfo[(int)n] does not zero its members: everything you do not assign explicitly (randomN, maxOfVals, minOfVals) holds an indeterminate value until the child thread itself writes it, and the parent's while(anyRun) loop reads childInfo[i].run without any synchronization, which is a data race.
In general avoid using threads to achieve concurrency for I/O bound operations (*). Just use threads to achieve concurrency for independent CPU bound operations. As a rule of thumb you never need more threads than you have CPU cores in your system (**). Having more does not improve concurrency and does not improve performance.
(*) You should always use direct function calls and an event based system to run pseudo concurrent I/O operations. You do not need any threading to do so. For example a TCP server does not need any threads to serve thousands of clients.
(**) This is the ideal case. In practice your software is composed of multiple parts, developed by independent developers and maintained in different modes, so it is ok to have some threads which could be theoretically avoided.
Multithreading is still rocket science in 2019, especially in C++. Do not do it unless you know exactly what you are doing. There are good series of blog posts on threads worth reading first.
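For what it's worth, below is a minimal sketch of the same exercise without detached threads or spin-waiting on run flags. The names (Result, work) and the use of <random> are my own substitutions, not part of the original assignment:
#include <algorithm>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

struct Result { double minV; double maxV; };

// Each node draws a random n in [1,6], spawns n children until depth 3,
// then joins them and folds their min/max into its own.
Result work(int level)
{
    thread_local std::mt19937 rng(std::random_device{}());
    std::uniform_int_distribution<int> dist(1, 6);
    const int n = dist(rng);
    Result r{ double(n), double(n) };
    if (level >= 3)                    // stop *before* spawning more threads
        return r;
    std::vector<std::thread> children;
    std::vector<Result> results(n);
    for (int i = 0; i < n; ++i)
        children.emplace_back([&results, i, level] { results[i] = work(level + 1); });
    for (std::thread& t : children)    // join instead of detach + spin-wait
        t.join();
    for (const Result& c : results) {
        r.minV = std::min(r.minV, c.minV);
        r.maxV = std::max(r.maxV, c.maxV);
    }
    return r;
}

int main()
{
    const Result r = work(1);
    std::cout << "Max is: " << r.maxV << " and Min is: " << r.minV << '\n';
}
Joining bounds the total thread count to at most 6 + 36 live threads, and returning results by value removes both the uninitialized-memory and the data-race problems.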
I'm using a multithreaded websocketpp server that I configured like this:
Server::Server(int ep) {
    using websocketpp::lib::placeholders::_1;
    using websocketpp::lib::placeholders::_2;
    using websocketpp::lib::bind;

    Server::wspp_server.clear_access_channels(websocketpp::log::alevel::all);
    Server::wspp_server.init_asio();

    Server::wspp_server.set_open_handler(bind(&Server::on_open, this, _1));
    Server::wspp_server.set_close_handler(bind(&Server::on_close, this, _1));
    Server::wspp_server.set_message_handler(bind(&Server::on_message, this, _1, _2));

    try {
        Server::wspp_server.listen(ep);
    } catch (const websocketpp::exception &e) {
        std::cout << "Error in Server::Server(int): " << e.what() << std::endl;
    }
    Server::wspp_server.start_accept();
}

void Server::run(int threadCount) {
    boost::thread_group tg;
    for (int i = 0; i < threadCount; i++) {
        tg.add_thread(new boost::thread(
            &websocketpp::server<websocketpp::config::asio>::run,
            &Server::wspp_server));
        std::cout << "Spawning thread " << (i + 1) << std::endl;
    }
    tg.join_all();
}

void Server::updateClients() {
    /*
    run updates
    */
    for (websocketpp::connection_hdl hdl : Server::conns) {
        try {
            std::string message = "personalized message for this client from the ran update above";
            wspp_server.send(hdl, message, websocketpp::frame::opcode::text);
        } catch (const websocketpp::exception &e) {
            std::cout << "Error in Server::updateClients(): " << e.what() << std::endl;
        }
    }
}
void Server::on_open(websocketpp::connection_hdl hdl) {
    boost::lock_guard<boost::shared_mutex> lock(Server::conns_mutex);
    Server::conns.insert(hdl);
    //do stuff
    //when the first client connects, start the update routine
    if (conns.size() == 1) {
        // the flag is named "running" here because a bool member cannot
        // share the name "run" with the member function Server::run(int)
        Server::running = true;
        bool *running = &(Server::running);
        std::thread([running] () {
            while (*running) {
                auto nextTime = std::chrono::steady_clock::now() + std::chrono::milliseconds(15);
                Server::updateClients();
                std::this_thread::sleep_until(nextTime);
            }
        }).detach();
    }
}

void Server::on_close(websocketpp::connection_hdl hdl) {
    boost::lock_guard<boost::shared_mutex> lock(Server::conns_mutex);
    Server::conns.erase(hdl);
    //do stuff
    //stop the update loop when all clients are gone
    if (conns.size() < 1)
        Server::running = false;
}

void Server::on_message(
        websocketpp::connection_hdl hdl,
        websocketpp::server<websocketpp::config::asio>::message_ptr msg) {
    boost::lock_guard<boost::shared_mutex> lock(Server::conns_mutex);
    //do stuff
}
I start the server with:
int port = 9000;
Server server(port);
server.run(/* number of threads */);
The only substantial difference as connections are added is in the message emission [wspp_server.send(...)]. A growing number of clients doesn't really add anything to the internal computation; it's only the number of messages to be emitted that grows.
My problem is that the CPU usage doesn't seem to be that much different whether I use 1 or more threads.
It doesn't matter whether I start the server with server.run(1) or server.run(4) (both on a 4-core dedicated server). For a similar load, the CPU usage graph shows approximately the same percentage. I was expecting the usage to be lower with 4 threads running in parallel. Am I thinking of this the wrong way?
At some point, I got the sense that the parallelism really applies to the listening part more than the emission. So, I tried enclosing the send within a new thread (that I detach) so it's independent of the sequence that requires it, but it didn't change anything on the graph.
Am I not supposed to see a difference in the work that the CPU produces? Otherwise, what am I doing wrong? Is there another step that I'm missing in order to force the messages to be emitted from different threads?
"My problem is that the CPU usage doesn't seem to be that much different whether I use 1 or more threads."
That's not a problem. That's a fact. It just means that the whole thing isn't CPU bound. Which should be quite obvious, since it's network IO. In fact, high-performance servers often dedicate only 1 thread to all IO tasks, for this reason.
"I was expecting the usage to be lower with 4 threads running in parallel. Am I thinking of this the wrong way?"
Yes, it seems so. You don't expect to pay less if you split the bill 4 ways either.
In fact, much like at the diner, you often end up paying more due to the overhead of splitting the load (cost/tasks). Unless you require more CPU capacity or lower reaction times than a single thread can deliver, a single IO thread is (obviously) more efficient because there is no scheduling overhead and/or context-switch penalty.
Another mental exercise:
if you run 100 threads, the processor will schedule them all across your available cores, in the optimal case
Likewise, if there are other processes running on your system (which there, obviously, always are) then the processor might schedule your 4 threads all on the same logical core. Do you expect the CPU load to be lower? Why? (Hint: of course not).
Background: What is the difference between concurrency, parallelism and asynchronous methods?
I wanted to use threading to check multiple images in a vector at the same time. Here is the code:
boost::thread_group tGroup;
for (int line = 0; line < sourceImageData.size(); line++) {
    for (int pixel = 0; pixel < sourceImageData[line].size(); pixel++) {
        for (int im = 0; im < m_images.size(); im++) {
            tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
        }
        tGroup.join_all();
    }
}
This creates the thread group and loops through the lines of pixel data, then each pixel, and then the multiple images. It's a weird project, but anyway, I bind each thread to a method on the same instance of the class this code is in, so "this" is used. This runs through a population of about 20 images, binding a thread for each as it goes, and then, once the loop is done, join_all takes effect when the threads finish. Then it goes to the next pixel and starts over again.
I've tested running 50 threads at the same time with this simple program:
#include <iostream>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

void run(int index) {
    for (int i = 0; i < 100; i++) {
        std::cout << "Index : " << index << " " << i << std::endl;
    }
}

int main() {
    boost::thread_group tGroup;
    for (int i = 0; i < 50; i++) {
        tGroup.create_thread(boost::bind(run, i));
    }
    tGroup.join_all();
    int done;
    std::cin >> done;
    return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated, it shouldn't be as slow as it is. It takes about 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading, so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.
The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but one per scan-line, for example), as in the sketch below.
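For illustration, a hedged sketch of that per-scan-line split with the thread count capped at the core count; processPixel and the container types here stand in for the asker's ClassX members, which aren't shown:
#include <algorithm>
#include <cstddef>
#include <vector>
#include <boost/thread.hpp>

using Image = std::vector<std::vector<int>>;

// Stand-in for ClassX::ClassXFunction
void processPixel(std::size_t line, std::size_t pixel, std::size_t im) {
    /* heavy per-pixel work would go here */
}

void processAllLines(const Image& source, std::size_t imageCount) {
    const std::size_t batch = std::max(1u, boost::thread::hardware_concurrency());
    for (std::size_t start = 0; start < source.size(); start += batch) {
        const std::size_t end = std::min(start + batch, source.size());
        boost::thread_group group;
        // One thread per scan-line, at most one core's worth of threads per batch
        for (std::size_t line = start; line < end; ++line) {
            group.create_thread([&source, line, imageCount] {
                for (std::size_t pixel = 0; pixel < source[line].size(); ++pixel)
                    for (std::size_t im = 0; im < imageCount; ++im)
                        processPixel(line, pixel, im);
            });
        }
        group.join_all(); // join once per batch of scan-lines, not once per pixel
    }
}
The thread-creation and join cost is then paid once per batch of scan-lines rather than once per pixel.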
I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs: you are basically pausing execution of any new threads until ALL the threads that need to be synchronized (which in this case is all the active threads) are done running.
If the iterations of the innermost loop(the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.
One of my threads writes data to a circular buffer and another thread needs to process this data ASAP. I was thinking of writing a simple spin like the following. Pseudo-code!
while (true) {
    while (!a[i]) {
        /* do nothing - just keep checking over and over */
    }
    // process b[i]
    i++;
    if (i >= MAX_LENGTH) {
        i = 0;
    }
}
Above I'm using a to indicate that the data stored in b is available for processing. Probably I should also set thread affinity for such a "hot" process. Of course such a spin is very expensive in terms of CPU, but that's OK for me, as my primary requirement is latency.
The question is: should I really write something like that, or do boost or the STL offer something that:
Easier to use.
Has roughly the same (or even better?) latency while occupying fewer CPU resources?
I think that my pattern is so general that there should be some good implementation somewhere.
upd It seems my question is still too complicated. Let's just consider the case where I need to write some items to an array in arbitrary order while another thread reads them in the right order as they become available. How do I do that?
upd2
I'm adding a test program to demonstrate what I want to achieve and how. At least on my machine it happens to work. I'm using rand to show you that I cannot use a general queue and need an array-based structure:
#include "stdafx.h"
#include <string>
#include <boost/thread.hpp>
#include "windows.h" // for Sleep
const int BUFFER_LENGTH = 10;
int buffer[BUFFER_LENGTH];
short flags[BUFFER_LENGTH];
void ProcessorThread() {
for (int i = 0; i < BUFFER_LENGTH; i++) {
while (flags[i] == 0);
printf("item %i received, value = %i\n", i, buffer[i]);
}
}
int _tmain(int argc, _TCHAR* argv[])
{
memset(flags, 0, sizeof(flags));
boost::thread processor = boost::thread(&ProcessorThread);
for (int i = 0; i < BUFFER_LENGTH * 10; i++) {
int x = rand() % BUFFER_LENGTH;
buffer[x] = x;
flags[x] = 1;
Sleep(100);
}
processor.join();
return 0;
}
Output:
item 0 received, value = 0
item 1 received, value = 1
item 2 received, value = 2
item 3 received, value = 3
item 4 received, value = 4
item 5 received, value = 5
item 6 received, value = 6
item 7 received, value = 7
item 8 received, value = 8
item 9 received, value = 9
Is my program guaranteed to work? How would you redesign it, probably using some existing structures from boost/STL instead of raw arrays? Is it possible to get rid of the "spin" without affecting latency?
If the consuming thread is put to sleep, it takes a few microseconds for it to wake up. This is process-scheduler latency you cannot avoid unless the thread busy-spins as you do. The thread also needs to be real-time FIFO so that it is never put to sleep while it is ready to run but has exhausted its time quantum.
So, there is no alternative that could match latency of busy spinning.
(Surprising that you are using Windows; it is best avoided if you are serious about HFT.)
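One caveat worth adding: as posted, the plain short flags make the test program a data race, so it is not guaranteed to work. A sketch of the same busy spin with std::atomic flags, which keeps the latency profile while making the hand-off well defined (portable std::thread used here in place of boost):
#include <atomic>
#include <cstdio>
#include <thread>

const int BUFFER_LENGTH = 10;

int buffer[BUFFER_LENGTH];
// static storage duration, so the atomics start out zero
std::atomic<short> flags[BUFFER_LENGTH];

void ProcessorThread() {
    for (int i = 0; i < BUFFER_LENGTH; i++) {
        while (flags[i].load(std::memory_order_acquire) == 0)
            ; // busy spin, as in the original
        std::printf("item %i received, value = %i\n", i, buffer[i]);
    }
}

int main() {
    std::thread processor(ProcessorThread);
    for (int i = 0; i < BUFFER_LENGTH; i++) {
        buffer[i] = i;                                // write the data first...
        flags[i].store(1, std::memory_order_release); // ...then publish it
    }
    processor.join();
    return 0;
}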
This is what Condition Variables were designed for. std::condition_variable is defined in the C++11 standard library.
What exactly is fastest for your purposes depends on your problem; you can attack it from several angles, but CVs (or derivative implementations) are a good starting point for understanding the subject and approaching an implementation.
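To make that concrete, here is a minimal condition-variable sketch of the same in-order hand-off as the upd2 test program (producer fills slots in arbitrary order, consumer processes them in index order); the names are mine:
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

const int BUFFER_LENGTH = 10;

int buffer[BUFFER_LENGTH];
bool ready[BUFFER_LENGTH] = {};
std::mutex m;
std::condition_variable cv;

void Consumer() {
    for (int i = 0; i < BUFFER_LENGTH; i++) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [i] { return ready[i]; }); // sleep instead of spinning
        std::printf("item %i received, value = %i\n", i, buffer[i]);
    }
}

int main() {
    std::thread consumer(Consumer);
    // fill the slots in reverse order to show arbitrary-order production
    for (int i = BUFFER_LENGTH - 1; i >= 0; i--) {
        {
            std::lock_guard<std::mutex> lock(m);
            buffer[i] = i;   // write the data first...
            ready[i] = true; // ...then mark the slot available
        }
        cv.notify_one();
    }
    consumer.join();
    return 0;
}
The consumer sleeps inside cv.wait() instead of burning a core, at the cost of the wake-up latency discussed above.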
Consider using the C++11 standard library if your compiler supports it, or the Boost analog if not. In your case, look especially at std::future with std::promise.
There is a good book about threading and the C++11 threading library:
Anthony Williams. C++ Concurrency in Action (2012)
Example from cppreference.com:
#include <iostream>
#include <future>
#include <thread>

int main()
{
    // future from a packaged_task
    std::packaged_task<int()> task([](){ return 7; }); // wrap the function
    std::future<int> f1 = task.get_future();           // get a future
    std::thread(std::move(task)).detach();             // launch on a thread

    // future from an async()
    std::future<int> f2 = std::async(std::launch::async, [](){ return 8; });

    // future from a promise
    std::promise<int> p;
    std::future<int> f3 = p.get_future();
    std::thread([](std::promise<int>& p){ p.set_value(9); },
                std::ref(p)).detach();

    std::cout << "Waiting..." << std::flush;
    f1.wait();
    f2.wait();
    f3.wait();
    std::cout << "Done!\nResults are: "
              << f1.get() << ' ' << f2.get() << ' ' << f3.get() << '\n';
}
If you want a fast method then simply drop to making OS calls. Any C++ library wrapping them is going to be slower.
e.g. On Windows your consumer can call WaitForSingleObject(), and your data-producing thread can wake the consumer using SetEvent(). http://msdn.microsoft.com/en-us/library/windows/desktop/ms687032(v=vs.85).aspx
For Unix, here is a similar question with answers: Windows Event implementation in Linux using conditional variables?
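A minimal Windows sketch of that event hand-off (CreateEvent / SetEvent / WaitForSingleObject); the producer work is elided to a Sleep:
#include <windows.h>
#include <cstdio>

HANDLE dataReady; // auto-reset event

DWORD WINAPI Consumer(LPVOID) {
    // sleeps in the kernel until the producer signals the event
    WaitForSingleObject(dataReady, INFINITE);
    std::printf("woken by producer\n");
    return 0;
}

int main() {
    dataReady = CreateEvent(NULL, FALSE, FALSE, NULL); // auto-reset, initially unsignaled
    HANDLE consumer = CreateThread(NULL, 0, Consumer, NULL, 0, NULL);
    Sleep(100);          // ... produce the data here ...
    SetEvent(dataReady); // wake the consumer
    WaitForSingleObject(consumer, INFINITE);
    CloseHandle(consumer);
    CloseHandle(dataReady);
    return 0;
}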
Do you really need threading?
A single-threaded app is trivially simple and eliminates all the issues with thread safety and the overhead of launching threads. I did a study of threaded vs. non-threaded code appending text to a log file. The non-threaded code was better in every measure of performance.
I have created an application which uses the pipeline pattern to do some processing.
However, I noticed that when the pipeline is run multiple times in a row it tends to get slower and slower.
This is also the case when no actual processing is done in the pipeline stages - so I am curious if maybe my pipeline implementation has a problem.
This is a simple test program which reproduces the effect:
#include <iostream>
#include <boost/thread.hpp>

class Pipeline {
    void processStage(int i) {
        return;
    }
public:
    void run() {
        boost::thread_group threads;
        for (int i = 0; i < 8; ++i) {
            threads.add_thread(new boost::thread(&Pipeline::processStage, this, i));
        }
        threads.join_all();
    }
};

int main() {
    Pipeline pipeline;
    int n = 2000;
    for (int i = 0; i < n; ++i) {
        pipeline.run();
        if (((i + 1) * 100) / n > (i * 100) / n)
            std::cout << "\r" << ((i + 1) * 100) / n << " %";
    }
}
In my understanding the threads are created in run() and at the end of run() they are terminated. So the state of the program at the beginning of the outer loop in the main program should always be the same...
But what I observe is an increasing slowdown when processing this loop.
I know that it would be more efficient to keep the pipeline threads alive throughout the whole program - but I need to know if there is a problem with my pipeline implementation.
Thanks!
Constantin
I do not know the exact reason for the slowdown in run(), but when I use the code above and insert a little sleep (500 ms) at the end of the loop in main(), the slowdown of run() is gone. So the system seems to need some "recovery time" before it is able to create new threads.
Since you do new boost::thread(), did you try to clean them up? If you run on Windows, check in the Task Manager how many threads the process has open, and close the thread handles if required. I suspect the number of threads created by the system keeps increasing.
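As the asker already suspects, keeping the threads alive across run() calls sidesteps the problem entirely. A hedged sketch of that variant (PersistentPipeline and its members are invented names; the actual processStage work is elided to a comment):
#include <boost/thread.hpp>

class PersistentPipeline {
    boost::thread_group threads;
    boost::mutex m;
    boost::condition_variable cv;
    int generation = 0; // incremented once per run()
    int pending = 0;    // workers still busy in the current run()
    bool stopping = false;

    void worker(int stage) {
        int seen = 0;
        boost::unique_lock<boost::mutex> lock(m);
        for (;;) {
            cv.wait(lock, [&] { return stopping || generation != seen; });
            if (stopping) return;
            seen = generation;
            lock.unlock();
            // processStage(stage) would run here
            lock.lock();
            if (--pending == 0) cv.notify_all(); // last worker wakes run()
        }
    }
public:
    PersistentPipeline() {
        for (int i = 0; i < 8; ++i)
            threads.create_thread([this, i] { worker(i); });
    }
    void run() {
        boost::unique_lock<boost::mutex> lock(m);
        pending = 8;
        ++generation;                              // release all workers once
        cv.notify_all();
        cv.wait(lock, [&] { return pending == 0; }); // wait for this run to finish
    }
    ~PersistentPipeline() {
        { boost::lock_guard<boost::mutex> lock(m); stopping = true; }
        cv.notify_all();
        threads.join_all();
    }
};
With this shape, pipeline.run() can be called 2000 times in the main loop without creating a single new thread, so any slowdown caused by thread creation disappears by construction.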