One asynchronous thread per task?


My application currently has a list of "tasks" (each task consists of a function, which can be as simple as printing something out but can also be far more complex) which gets looped through. (Additional note: most tasks send a packet after having been executed.) As some of these tasks could take quite some time, I thought about using a separate asynchronous thread for each task, thus letting all the tasks run concurrently.
Would that be a smart thing to do or not? One problem is that I can't possibly know the number of tasks beforehand, so it could result in quite a few threads being created, and I read somewhere that every piece of hardware has its limitations. I'm planning to run my application on a Raspberry Pi, and I think that I will always have to run between 5 and 20 tasks.
Also, some of the tasks have a lower "priority" than others.
Should I just run the important tasks first and then the less important ones? (The problem here is that if the total time needed for all tasks exceeds the interval at which some specific, important task should be run, my application won't be accurate anymore.) Or implement everything in asynchronous threads? Or just try to make everything a little bit faster by only having the "packet-sending" in an asynchronous thread, thus not having to wait until the packets actually get sent?

There are a number of questions you will need to ask yourself before you can reasonably design a solution.
Do you wish to limit the number of concurrent tasks?
Is it possible that in future the number of concurrent tasks will increase in a way that you cannot predict today?
... and probably a whole lot more.
If the answer to any of these is "yes" then you have a few options:
A producer/consumer queue with a fixed number of threads draining the queue (not recommended IMHO)
Write your tasks as asynchronous state machines around an event dispatcher such as boost::io_service (this is much more scalable).
If you know it's only going to be 20 concurrent tasks, you'll probably get away with std::async, but it's a sloppy way to write code.
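As a rough illustration of the last option, here is a minimal sketch of running a handful of tasks with std::async. The Task alias and the run_all_async helper are hypothetical names made up for this example, and the sketch assumes each task returns an int status code. It is only reasonable for a small, bounded number of tasks such as the 5 to 20 mentioned in the question.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Hypothetical task type: any callable that returns a status code.
using Task = std::function<int()>;

// Launch every task asynchronously and collect the results in submission
// order. Fine for a few dozen tasks, not for thousands.
std::vector<int> run_all_async(const std::vector<Task>& tasks) {
    std::vector<std::future<int>> pending;
    pending.reserve(tasks.size());
    for (const Task& task : tasks) {
        assert(task && "empty task");
        // std::launch::async runs the task as if on a new thread,
        // instead of deferring it until get() is called.
        pending.push_back(std::async(std::launch::async, task));
    }
    std::vector<int> results;
    results.reserve(pending.size());
    for (auto& f : pending) {
        results.push_back(f.get());  // blocks until that task has finished
    }
    return results;
}
```

Note that nothing here limits the number of threads or honours task priority, which is part of why this approach does not scale.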

Related

How to limit Boost.Asio memory

I'm having trouble managing the work .post()'ed to Boost.Asio's io_context, having multiple questions about it (newbie warning).
Background: I'm writing a library that connects to a large number of different hosts for a short period each (connect, send data, receive answer, close), and I figured I'd use Boost.Asio. The documentation is scarce (too DRY?)
My current approach is this (assuming a quad core machine): two physical cores run CPU bound sync operations, and post() additional work items to io_context. Two other threads are .run()ing and performing completion handlers.
1- The work scheduler
As per this amazing answer,
Boost.Asio may start some of the work as soon as it has been told about it, and other times it may wait to do the work at a later point in time.
When does boost.asio do what? On what basis is the queued work later processed?
2- Multiple Producers/ Multiple Consumers
As per This article,
At its core, Boost Asio provides a task execution framework that you can use to perform operations of any kind. You create your tasks as function objects and post them to a task queue maintained by Boost Asio. You enlist one or more threads to pick these tasks (function objects) and invoke them. The threads keep picking up tasks, one after the other till the task queues are empty at which point the threads do not block but exit.
I am failing to find a way to put a cap on the length of this task queue. This answer gives a couple of solutions, but they both involve locking, something I'd like to avoid as much as possible.
3- Are strands really necessary? How do I "disable them"
As detailed in this answer, boost uses an implicit strand per connection. Since I am making potentially millions of connections, the memory savings from "bypassing" strands make sense to me. The requests I make are independent (a different host for each request), and the operations I make within a single connection are already serialized (callback chain), so I have no overlapping reads and writes, and no synchronization is expected from Boost.Asio. Does it make sense for me to try and bypass strands? If so, how?
4- Scaling design approach (A bit vague because I have no clue)
As stated in my background section, I'm running two io_contexts on two physical cores, each with two threads, one for writing and one for reading. My goal here is to spew packets as fast as I can, and I have already
Compiled asio with BoringSSL (OpenSSL is a serious bottleneck)
Wrote my own c-ares resolver service to avoid async-ish DNS queries running in a thread loop.
But it still happens that my network driver starts timing out when multiple connections are opened. So how do I dynamically adjust boost.asio's throughput so that the network adapter can cope with it?
My question(s) is most likely ill-informed as I'm no expert in network programming, and I know this is a complex problem. I'd appreciate it if someone left pointers for me to look at before closing the question or making it "dead".
Thank you.

How do I multiplex many asynchronous state machines over a fixed number of threads with boost::statechart?

Suppose I have many asynchronous state machines defined with boost::statechart. The clearly documented mechanism for running multiple asynchronous state machines is to fix one or more of them to a thread. However, for my purpose I need to run many, many asynchronous state machines, and one per thread will not do. Moreover, the amount of work done by any given state machine is unpredictable, so assigning state machines to fixed threads will lead to imbalance.
Instead, I'd like to have a thread pool where an idle thread can pick up some amount of work off of a queue. Some care needs to be taken here so that events to a given state machine are delivered in order. Presumably the place to start would be something involving implementing the Scheduler and perhaps the FifoWorker concepts to do what I want as an alternative to the fifo_scheduler and fifo_worker classes, respectively. However, I wonder if this problem has already been solved by someone else, or if I'm just asking the wrong question.

Answering my own question, now that I've had some time to think about it. This is pretty simple:
Every state machine gets its own fifo_scheduler
When we want the state machine to start running, a function is posted to the thread pool that:
Checks scheduler.terminated() and stops if so.
Runs scheduler(n), where n is some implementation-dependent value. We need that to prevent starvation.
Posts itself back to the thread pool.
This also ensures that events are delivered in order without resorting to other means.
This isn't the greatest answer, since the service function will occupy a space in the queue and be called even when there's no work to do.
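The steps above can be sketched without Boost to show the shape of the self-reposting service function. Everything here is a stand-in and an assumption of this example: RunQueue replaces the thread pool with a plain single-threaded FIFO, Machine replaces a state machine plus its fifo_scheduler, and process(n) plays the role of running the scheduler with an event limit. For simplicity the stand-in machine counts as terminated once it has no events left.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a thread pool: a plain FIFO of jobs drained by run().
// A real pool would execute the jobs on several worker threads.
struct RunQueue {
    std::deque<std::function<void()>> jobs;
    void post(std::function<void()> job) { jobs.push_back(std::move(job)); }
    void run() {
        while (!jobs.empty()) {
            auto job = std::move(jobs.front());
            jobs.pop_front();
            job();
        }
    }
};

// Stand-in for one state machine plus its private scheduler.
struct Machine {
    std::deque<int> events;    // pending events, oldest first
    std::vector<int> handled;  // events in the order they were processed
    bool terminated() const { return events.empty(); }
    // Process at most n events, like running the scheduler with a limit.
    void process(std::size_t n) {
        for (std::size_t i = 0; i < n && !events.empty(); ++i) {
            handled.push_back(events.front());
            events.pop_front();
        }
    }
};

// The service function: stop if terminated, run a bounded slice, repost.
void service(RunQueue& pool, std::shared_ptr<Machine> m, std::size_t slice) {
    assert(slice > 0);
    if (m->terminated()) return;
    m->process(slice);
    pool.post([&pool, m, slice] { service(pool, m, slice); });
}
```

Because only one service call per machine is ever in the queue at a time, a machine's events are handled in order even if different pool threads pick up successive calls. The bounded slice is what keeps one busy machine from starving the others.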

Find minimum queue size among threads

I am trying to implement a new scheduling technique with multiple threads. Each thread has its own private local queue. The idea is that each time a task is created from the program thread, it should find the smallest queue (the queue with the fewest tasks) among the queues and enqueue the task in it.
This is a way of load balancing among threads, where less busy queues receive more tasks.
Can you please suggest some logic or an idea for finding the smallest queue among the given queues dynamically, from a programming point of view?
I am working in Visual Studio 2008, in C++, on our own multithreading library implementing a multi-rate synchronous data flow paradigm.

As you see, trying to find the least loaded queue is cumbersome and could be inefficient, as you may add more work to queues holding only one heavy task, whereas queues with small tasks will have no more jobs and quickly become inactive.
You'd better use a work-stealing heuristic: when a thread is done with its own jobs, it looks at the other threads' queues and "steals" some work instead of remaining idle or being terminated.
Then the system will be auto-balanced with each thread being active until there is not enough work for everyone.
You should not have a situation with idle threads and work waiting for processing.
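A minimal sketch of the work-stealing idea might look like the following. It is an illustration under simplifying assumptions, not a production scheduler: the WorkerQueue and run_with_stealing names are made up here, the queues are protected by plain mutexes rather than lock-free deques, all jobs are queued before the workers start, and jobs do not enqueue further jobs.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Job = std::function<void()>;

// One mutex-protected deque per worker. The owner pops from the front,
// thieves take from the back to reduce interference with the owner.
struct WorkerQueue {
    std::mutex m;
    std::deque<Job> jobs;

    bool pop_front(Job& out) {
        std::lock_guard<std::mutex> lock(m);
        if (jobs.empty()) return false;
        out = std::move(jobs.front());
        jobs.pop_front();
        return true;
    }
    bool steal_back(Job& out) {
        std::lock_guard<std::mutex> lock(m);
        if (jobs.empty()) return false;
        out = std::move(jobs.back());
        jobs.pop_back();
        return true;
    }
};

// Run all preloaded jobs. Each worker takes from its own queue first and
// otherwise tries to steal; it exits once it finds every queue empty.
void run_with_stealing(std::vector<WorkerQueue>& queues) {
    assert(!queues.empty());
    std::vector<std::thread> workers;
    for (std::size_t self = 0; self < queues.size(); ++self) {
        workers.emplace_back([&queues, self] {
            Job job;
            for (;;) {
                bool found = queues[self].pop_front(job);
                for (std::size_t i = 1; !found && i < queues.size(); ++i) {
                    found = queues[(self + i) % queues.size()].steal_back(job);
                }
                if (!found) return;  // nothing left anywhere
                job();
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

Even if every job is initially placed in a single queue, the other workers pick up part of the load, so no explicit "find the smallest queue" step is needed at enqueue time.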

If you really want to try this, can each queue not just keep a public 'int count' member, updated with atomic inc/dec as tasks are pushed/popped?
Whether such a design is worth the management overhead and the occasional 'mistakes' when a task is queued to a thread that happens to be running a particularly lengthy job when another thread is just about to dequeue a very short job, is another issue.
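For what it's worth, the counter idea could be sketched roughly as below. CountedQueue and least_loaded are hypothetical names for this example, and integer task ids stand in for real work items. The result of least_loaded is only a snapshot, which is exactly the source of the occasional "mistakes" mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// A per-thread queue that publishes its length through an atomic counter,
// so the dispatcher can compare queues without taking their locks.
struct CountedQueue {
    std::atomic<int> count{0};
    std::mutex m;
    std::deque<int> tasks;  // task ids stand in for real work items

    void push(int task) {
        std::lock_guard<std::mutex> lock(m);
        tasks.push_back(task);
        count.fetch_add(1, std::memory_order_relaxed);
    }
    bool pop(int& task) {
        std::lock_guard<std::mutex> lock(m);
        if (tasks.empty()) return false;
        task = tasks.front();
        tasks.pop_front();
        count.fetch_sub(1, std::memory_order_relaxed);
        return true;
    }
};

// Return the index of the queue with the smallest advertised count.
// Counts may change right after they are read, so this is a best guess.
std::size_t least_loaded(const std::vector<CountedQueue>& queues) {
    assert(!queues.empty());
    std::size_t best = 0;
    int best_count = queues[0].count.load(std::memory_order_relaxed);
    for (std::size_t i = 1; i < queues.size(); ++i) {
        int c = queues[i].count.load(std::memory_order_relaxed);
        if (c < best_count) {
            best = i;
            best_count = c;
        }
    }
    return best;
}
```

The count says nothing about how long each queued task will take, so a queue with one lengthy job still looks less loaded than a queue with three trivial ones.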

Why aren't the threads fetching their work from a 'master' work queue ?
If you are really trying to distribute work items from a master source, to a set of workers, you are then doing load balancing, as you say. In that case, you really are talking about scheduling, unless you simply do round-robin style balancing. Scheduling is a very deep subject in Computing, you can easily spend weeks, or months learning about it.

You could synchronise a counter among the threads. But I guess this isn't what you want.
Since you want to implement everything using dataflow, everything should be queues.
Your first option is to query the number of jobs inside a queue. I think this is not easy, if you want a single reader/writer pattern, because you probably have to use lock for this operation, which is not what you want. Note: I'm just guessing, that you can't use lock-free queues here; either you have a counter or take the difference of two pointers, either way you have a lock.
Your second option (which can be done with lock-free code) is to send a command back to the dispatcher thread, telling it that worker thread x has consumed a job. Using this approach you have n more queues, one from each worker thread to the dispatcher thread.

Intel TBB tasks for serving network connections - good model?

I'm developing a backend for a networking product that serves dozens of clients (N = 10-100). Each connection requires 2 periodic tasks, the heartbeat and the downloading of telemetry via SSH, each at H Hz. There are also extra events of different kinds coming from the frontend. By the nature of each task, a large part of the time is spent waiting in a select call on each connection's socket, which allows the OS to switch between threads often to serve other clients while waiting for a response.
In my initial implementation, I create 3 threads per connection (heartbeat, telemetry, extra), each waiting on a single condition variable, which is poked every time there is something to do in a workqueue. The workqueue is filled with the above-mentioned periodic events using a timer and commands from the frontend.
I have a few questions here.
Would it be a good idea to switch from the worker thread approach to Intel TBB tasks? If so, with what number of threads should I initialize tbb::task_scheduler_init?
In the current approach, 300 threads wait on a condition variable that is signaled N * H * 3 times per second, which is likely to become a scalability bottleneck (especially on the side which calls signal). Are there any better approaches for waking up just one worker per task?
How is waking of a worker thread implemented in TBB?
Thanks for your suggestions!

It's difficult to say whether switching to TBB would be a good approach or not. What are your performance requirements, and what are the performance numbers for the current implementation? If the current solution is good enough, then it's probably not worthwhile to switch.
If you want to compare the two (current implementation vs. TBB) to find out which gives better performance, then you could do what is called a "tracer bullet" (from the book The Pragmatic Programmer) for each implementation and compare the results. In simpler terms, build a reduced prototype of each and compare the results.
As mentioned in this answer, it's typically not a good idea to attempt performance improvements without concrete evidence that what you're going to change will actually help.
Besides all of that, you could consider making a thread pool with the number of threads being some function of the number of CPU cores (maybe a factor of 1 or 1.5 threads per core). The threads would take tasks off a common work queue. There would be 3 types of tasks: heartbeat, telemetry, extra. This should reduce the negative impact of context switching that comes with large numbers of threads.
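As a rough sketch of that last suggestion, a small pool draining one common work queue could look like this. ThreadPool and suggested_pool_size are hypothetical names for this example, the 1.5 threads per core factor is just the figure floated above rather than a measured optimum, and the three task types would simply be different callables passed to submit().

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t threads) {
        assert(threads > 0);
        for (std::size_t i = 0; i < threads; ++i) {
            workers_.emplace_back([this] { work(); });
        }
    }
    // Finish every queued task, then stop and join the workers.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();  // wake one idle worker, not all of them
    }

private:
    void work() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and fully drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Roughly 1.5 threads per core; hardware_concurrency() may return 0.
std::size_t suggested_pool_size() {
    std::size_t cores = std::max(1u, std::thread::hardware_concurrency());
    return cores + cores / 2;
}
```

A pool would then be created with something like ThreadPool pool(suggested_pool_size()), and the timer and frontend would call pool.submit(...) for each heartbeat, telemetry, or extra event instead of signalling hundreds of dedicated threads.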

More threads, better performance?

When I write a message-driven app, much like a standard Windows app except that it uses messaging extensively for internal operations, what would be the best approach to threading?
As I see it, there are basically three approaches (if you have any other setup in mind, please share):
Having a single thread process all of the messages.
Having separate threads for separate message types (General, UI, Networking, etc...)
Having multiple threads that share and process a single message queue.
So, would there be any significant performance differences between the three?
Here are some general thoughts:
Obviously, the last two options benefit from a situation where there's more than one processor. Plus, if any thread is waiting for an external event, other threads can still process unrelated messages. But ignoring that, seems that multiple threads only add overhead (Thread switches, not to mention more complicated sync situations).
And another question: Would you recommend to implement such a system upon the standard Windows messaging system, or to implement a separate queue mechanism, and why?

The specific choice of threading model should be driven by the nature of the problem you are trying to solve. There isn't necessarily a single "correct" approach to designing the threading model for such an application. However, if we adopt the following assumptions:
messages arrive frequently
messages are independent and don't rely too heavily on shared resources
it is desirable to respond to an arriving message as quickly as possible
you want the app to scale well across processing architectures (i.e. multicore/multi-CPU systems)
scalability is the key design requirement (e.g. more messages at a faster rate)
resilience to thread failure / long operations is desirable
Then, in my experience, the most effective threading architecture would be to employ a thread pool. All messages arrive on a single queue, and multiple threads wait on the queue and process messages as they arrive. A thread pool implementation can model all three thread-distribution examples you have.
#1 Single thread processes all messages => thread pool with only one thread
#2 Thread per N message types => thread pool with N threads, each thread peeks at the queue to find appropriate message types
#3 Multiple threads for all messages => thread pool with multiple threads
The benefit of this design is that you can scale the number of threads in the pool in proportion to the processing environment or the message load. The number of threads can even scale at runtime to adapt to the real-time message load being experienced.
There are many good thread pooling libraries available for most platforms, including .NET, C++/STL, Java, etc.
As to your second question, whether to use the standard Windows message dispatch mechanism: this mechanism comes with significant overhead and is really only intended for pumping messages through a Windows application's UI loop. Unless this is the problem you are trying to solve, I would advise against using it as a general message dispatching solution. Furthermore, Windows messages carry very little data; it is not an object-based model. Each Windows message has a code and two pointer-sized parameters (WPARAM and LPARAM). This may not be enough to base a clean messaging model on. Finally, the Windows message queue is not designed to handle cases like queue saturation, thread starvation, or message re-queuing; these are cases that often arise in implementing a decent message queuing solution.

We can't tell you much for sure without knowing the workload (ie, the statistical distribution of events over time) but in general
a single queue with multiple servers is at least as fast, and usually faster, so options 1 and 3 would be preferable to 2.
multiple threads in most languages add complexity because of the need to avoid contention and multiple-writer problems
long duration processes can block processing for other things that could get done quicker.
So horseback guess is that having a single event queue, with several server threads taking events off the queue, might be a little faster.
Make sure you use a thread-safe data structure for the queue.

It all depends.
For example:
Events in a GUI queue are best handled by a single thread, as there is an implied order in the events and thus they need to be handled serially. This is why most GUI apps have a single thread to handle events, though potentially multiple threads to create them (and it does not preclude the event thread from creating a job and handing it off to a worker pool (see below)).
Events on a socket can potentially be handled in parallel (assuming HTTP), as each request is stateless and can thus be handled independently (OK, I know that is oversimplifying HTTP).
Work jobs where each job is independent and placed on a queue. This is the classic case for using a set of worker threads. Each thread does a potentially long operation independently of the other threads. On completion it comes back to the queue for another job.

In general, don't worry about the overhead of threads. It's not going to be an issue if you're talking about merely a handful of them. Race conditions, deadlocks, and contention are a bigger concern, and if you don't know what I'm talking about, you have a lot of reading to do before you tackle this.
I'd go with option 3, using whatever abstractions my language of choice offers.

Note that there are two different performance goals, and you haven't stated which you are targeting: throughput and responsiveness.
If you're writing a GUI app, the UI needs to be responsive. You don't care how many clicks per second you can process, but you do care about showing some response within a 10th of a second or so (ideally less). This is one of the reasons it's best to have a single thread devoted to handling the GUI (other reasons have been mentioned in other answers). The GUI thread needs to basically convert Windows messages into work items and let your worker queue handle the heavy work. Once the worker is done, it notifies the GUI thread, which then updates the display to reflect any changes. It does things like painting a window, but not rendering the data to be displayed. This gives the app a quick "snappiness" that is what most users want when they talk about performance. They don't care if it takes 15 seconds to do something hard, as long as when they click on a button or a menu, it reacts instantly.
The other performance characteristic is throughput. This is the number of jobs you can process in a specific amount of time. Usually this type of performance tuning is only needed on server-type applications, or other heavy-duty processing. It measures how many webpages can be served up in an hour, or how long it takes to render a DVD. For these sorts of jobs, you want to have 1 active thread per CPU. Fewer than that, and you're going to be wasting idle clock cycles. More than that, and the threads will be competing for CPU time and tripping over each other. Take a look at the second graph in this DDJ article for the trade-off you're dealing with. Note that the ideal thread count is higher than the number of available CPUs due to things like blocking and locking. The key is the number of active threads.

A good place to start is to ask yourself why you need multiple threads.
The well-thought-out answer to this question will lead you to the best answer to the subsequent question, "how should I use multiple threads in my application?"
And that must be a subsequent question, not a primary question. The first question must be why, not how.

I think it depends on how long each thread will be running. Does each message take the same amount of time to process? Or will certain messages take a few seconds for example. If I knew that Message A was going to take 10 seconds to complete I would definitely use a new thread because why would I want to hold up the queue for a long running thread...
My 2 cents.

I think option 2 is the best. Having each thread do independent tasks would give you the best results. The 3rd approach can cause more delays if multiple threads are doing some I/O operation like disk reads, reading common sockets and so on.
Whether to use the Windows messaging framework for processing requests depends on the workload each thread would have. I think Windows restricts the number of messages that can be queued to at most 10000. For most cases this should not be an issue. But if you have lots of messages to be queued, this might be something to take into consideration.
A separate queue gives better control, in the sense that you may reorder it the way you want (perhaps depending on priority).
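To illustrate the reordering point, here is a minimal sketch of a custom queue that serves messages by priority. The Message, LowerPriority and PriorityMessageQueue names are hypothetical, as is the convention that larger priority values are served first; a sequence number keeps messages of equal priority in arrival order.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical message: larger priority values are served first.
struct Message {
    int priority;
    std::uint64_t seq;  // arrival order, keeps equal priorities FIFO
    std::string body;
};

// Ordering for std::priority_queue: true when a should be served after b.
struct LowerPriority {
    bool operator()(const Message& a, const Message& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.seq > b.seq;  // the older message wins among equals
    }
};

class PriorityMessageQueue {
public:
    void push(int priority, std::string body) {
        std::lock_guard<std::mutex> lock(m_);
        queue_.push(Message{priority, next_seq_++, std::move(body)});
    }
    // Non-blocking pop; returns false when the queue is empty.
    bool try_pop(Message& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (queue_.empty()) return false;
        out = queue_.top();
        queue_.pop();
        return true;
    }

private:
    std::mutex m_;
    std::uint64_t next_seq_ = 0;
    std::priority_queue<Message, std::vector<Message>, LowerPriority> queue_;
};
```

This kind of control is not available when messages are pumped through the Windows message queue, which is one argument for a separate queue.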

Yes, there will be performance differences between your choices.
(1) introduces a bottle-neck for message processing
(3) introduces locking contention because you'll need to synchronize access to your shared queue.
(2) is starting to go in the right direction... though a queue for each message type is a little extreme. I'd probably recommend starting with a queue for each model in your app and adding queues where it makes sense to do so for improved performance.
If you like option #2, it sounds like you would be interested in implementing a SEDA architecture. It is going to take some reading to understand what is going on, but I think the architecture fits well with your line of thinking.
BTW, Yield is a good C++/Python hybrid implementation.

I'd have a thread pool servicing the message queue, and make the number of threads in the pool easily configurable (perhaps even at runtime). Then test it out with expected load.
That way you can see what the actual correlation is - and if your initial assumptions change, you can easily change your approach.
A more sophisticated approach would be for the system to introspect its own performance traits and adapt its use of resources, threads in particular, as it goes. Probably overkill for most custom application code, but I'm sure there are products out there that do that.
As for the windows events question - I think that's probably an application specific question that there is no right or wrong answer to in the general case. That said, I usually implement my own queue as I can tailor it to the specific characteristics of the task at hand. Sometimes that might involve routing events via the windows message queue.
