I'm developing a backend for a networking product that serves tens of clients (N = 10-100). Each connection requires two periodic tasks, a heartbeat and downloading of telemetry via SSH, each at H Hz. There are also extra events of various kinds coming from the frontend. By the nature of these tasks, a large part of each one is spent waiting in a select call on the connection's socket, which lets the OS switch between threads often to serve other clients while waiting for a response.
In my initial implementation, I create 3 threads per connection (heartbeat, telemetry, extra), each waiting on a single condition variable, which is signaled every time there is something to do in a work queue. The work queue is filled with the above-mentioned periodic events by a timer, and with commands from the frontend.
I have a few questions here.
Would it be a good idea to switch from the worker-thread approach to Intel TBB tasks? If so, with how many threads do I need to initialize tbb::task_scheduler_init?
In the current approach, with 300 threads waiting on a condition variable that is signaled N * H * 3 times per second, the condition variable is likely to become a scalability bottleneck (especially on the side that calls signal). Are there any better approaches for waking up just one worker per task?
How is waking of a worker thread implemented in TBB?
Thanks for your suggestions!
It's difficult to say whether switching to TBB would be a good approach or not. What are your performance requirements, and what are the performance numbers for the current implementation? If the current solution is good enough, then it's probably not worthwhile to switch.
If you want to compare the two (the current implementation vs. TBB) to know which gives better performance, you could build what is called a "tracer bullet" (from the book The Pragmatic Programmer) for each implementation and compare the results. In simpler terms, build a reduced prototype of each and compare.
As mentioned in this answer, it's typically not a good idea to attempt performance improvements without concrete evidence that the change you're about to make will actually help.
Besides all of that, you could consider making a thread pool with the number of threads being some function of the number of CPU cores (maybe a factor of 1 or 1.5 threads per core). The threads would take tasks off a common work queue. There would be 3 types of tasks: heartbeat, telemetry, and extra. This should reduce the negative impact of context switching caused by using a large number of threads.
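For illustration, here is a minimal C++ sketch of that idea (WorkerPool, TaskKind and the sizing comment are made up for this example; the task bodies are placeholders): a fixed-size pool of threads draining one shared queue that carries the three task kinds.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// The queue carries the three task kinds from the question; the pool size is
// derived from the core count (the 1x-1.5x factor mentioned above).
enum class TaskKind { Heartbeat, Telemetry, Extra };

struct Task {
    TaskKind kind;
    std::function<void()> work;   // e.g. the select()-based heartbeat/telemetry code
};

class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void post(Task t) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(t)); }
        cv_.notify_one();            // wake exactly one idle worker
    }
private:
    void run() {
        for (;;) {
            Task t;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
                if (stop_ && q_.empty()) return;
                t = std::move(q_.front());
                q_.pop();
            }
            t.work();                // run the heartbeat / telemetry / extra task
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Task> q_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};

// Example sizing: roughly 1.5 threads per core.
// WorkerPool pool(std::thread::hardware_concurrency() * 3 / 2);
```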
Related
I am a C++ backend developer. I develop the server side for a realtime game, so the application architecture looks like this:
1) I have a class Client, which processes requests from the game client. Examples of requests: login, buy something in the store (the game's internal store), or perform some action. This Client also handles user input events from the game client (these events are very frequent, sent about ten times per second from the game client to the server while the player is in gameplay).
2) I have a thread pool. When a game client connects to the server, I create a Client instance and bind it to one of the threads from the pool, so we have a one-to-many relationship: one thread, many Clients. Round-robin is used to choose the thread for binding.
3) I use libev to manage all events inside the server. This means that when a Client instance receives data from the game client over the network, handles some request, or tries to send data over the network to the game client, it ties up its thread. While it does this work, the other Clients that share the same thread are blocked.
So the thread pool is the bottleneck of the application. To increase the number of concurrent players on the server who can play without lag, I need to increase the number of threads in the thread pool.
The application currently runs on a server with 24 logical CPUs (according to cat /proc/cpuinfo), and I set the thread pool size to 24 (1 processor - 1 thread). This means that with the current 2000 players online, every thread serves about 84 Client instances. top says the processors are used at less than 10 percent.
Now the question: if I increase the number of threads in the thread pool, will it increase or decrease server performance (context-switching overhead vs. blocked Clients per thread)?
UPD
1) The server has async IO (libev + epoll), so when I say that a Client is locked while sending and receiving data, I mean copying to buffers.
2) The server also has background threads for slow tasks: database operations, heavy calculations, ...
Well, a few issues.
2) I have a thread pool. When a game client connects to the server, I create a Client instance and bind it to one of the threads from the pool, so we have a one-to-many relationship: one thread, many Clients. Round-robin is used to choose the thread for binding.
You didn't mention asynchronous IO in any of the points. I believe your true bottleneck here is not the thread count, but the fact that a thread is blocked because of an IO action. By using asynchronous IO (which is not an IO action on another thread), the throughput of your server increases by huge magnitudes.
3) I use libev to manage all events inside the server. This means that when a Client instance receives data from the game client over the network, handles some request, or tries to send data over the network to the game client, it ties up its thread. While it does this work, the other Clients that share the same thread are blocked.
Again, without asynchronous IO this architecture is very much a 90's server-side architecture (a la Apache). For maximum performance, your threads should only do CPU-bound tasks and should not wait for any IO actions.
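To make the "don't wait for IO" point concrete, here is a rough Linux-only sketch of a single epoll-based event loop (error handling omitted; event_loop and the worker hand-off are illustrative, not part of libev or any other library):

```cpp
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

// One epoll loop multiplexes many non-blocking sockets, so no thread is ever
// parked inside a blocking read(). `listen_fd` is assumed to be an already
// bound, listening, non-blocking socket.
void event_loop(int listen_fd) {
    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);    // the only place we block
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {                 // new connection
                int client = accept(listen_fd, nullptr, nullptr);
                fcntl(client, F_SETFL, O_NONBLOCK);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            } else {                               // data ready: read won't block
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) { close(fd); continue; }
                // hand `buf` off to a CPU-bound worker / the game logic here
            }
        }
    }
}
```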
So the thread pool is the bottleneck of the application. To increase the number of concurrent players on the server who can play without lag, I need to increase the number of threads in the thread pool.
Dead wrong. Read about the C10k problem (handling 10k concurrent connections).
Now the question: if I increase the number of threads in the thread pool, will it increase or decrease server performance (context-switching overhead vs. blocked Clients per thread)?
So, the rule of thumb about setting the number of threads to the number of cores is only valid when your threads do only CPU-bound tasks, are never blocked, and are 100% saturated with CPU work. If your threads are also being blocked by locks or IO actions, this rule breaks down.
If we take a look at common server-side architectures, we can determine which design best fits your needs.
Apache-style architecture:
A fixed-size thread pool, assigning a thread to each connection in the connection queue. No asynchronous IO.
Pros: none.
Cons: extremely bad throughput.
Nginx/Node.js architecture:
A single-threaded, multi-process application using asynchronous IO.
Pros: a simple architecture that eliminates multi-threading problems. Goes extremely well with servers that serve static data.
Cons: if the processes have to share data, a huge amount of CPU time is burned on serializing, passing, and deserializing data between processes. Also, a multi-threaded application can increase performance if done correctly.
Modern .NET architecture:
A multi-threaded, single-process application using asynchronous IO.
Pros: if done correctly, the performance can blast!
Cons: it's somewhat tricky to tune a multi-threaded application and use it without corrupting shared data.
So to sum it up, I think that in your specific case you should definitely use asynchronous IO only, plus a thread pool with the number of threads equal to the number of cores.
If you're using Linux, Facebook's Proxygen can manage everything we talked about (multithreaded code with asynchronous IO) beautifully for you. Hey, Facebook is using it!
Many factors can affect the overall performance, including how much each thread has to do per client, how much cross thread communication is required, whether there is any resource contention between threads and so on. The best thing to do is to:
Decide on the performance parameters you want to measure and ensure you have them instrumented - you mentioned lag, so you need a mechanism for measuring worst-case lag and/or the distribution of lag across all clients from the server side.
Build a stress scenario. This can be as simple as a tool that replays real client behaviour or random behaviour, but the more representative of real load the better.
Benchmark the server under stress and change the number of threads (or, even more radically, change the design) and see which design or configuration leads to the minimal lag.
This has the added benefit that you can use the same stress test alongside a profiler to determine whether you can extract any more performance from your implementation.
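As a sketch of the instrumentation step, something as small as the following hypothetical LagStats class is enough to pull worst-case and 99th-percentile lag out of a stress run (the percentile choice is arbitrary):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <vector>

// Call record() with each request's measured lag during the stress run,
// then report() afterwards to get worst-case and 99th-percentile lag.
class LagStats {
public:
    void record(std::chrono::microseconds lag) {
        std::lock_guard<std::mutex> lock(mutex_);
        samples_.push_back(lag.count());
    }
    void report() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (samples_.empty()) return;
        std::sort(samples_.begin(), samples_.end());
        long long worst = samples_.back();
        long long p99 = samples_[samples_.size() * 99 / 100];
        std::printf("samples=%zu worst=%lld us p99=%lld us\n",
                    samples_.size(), worst, p99);
    }
private:
    std::mutex mutex_;
    std::vector<long long> samples_;
};
```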
The optimal number of threads is most often equal to either the number of cores in your machine or twice the number of cores. To get the maximum possible throughput, there must be as few points of contention between the threads as possible; depending on how much contention there is, the optimal count floats somewhere between the number of cores and twice the number of cores.
I would recommend running trials to figure out how to milk optimal performance out of it.
Starting with the idea to have one thread per core can be nice.
In addition, in some cases, calculating the WCET (Worst-Case Execution Time) is a way to decide which configuration is faster (cores don't always run at the same frequency). You can measure it easily with timers: take a timestamp at the beginning of the function and another at the end, and subtract the values to obtain the result in ms.
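A minimal sketch of that measurement using std::chrono (the function under test and the number of repetitions are placeholders):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>

// Time one invocation of f in milliseconds using a monotonic clock.
template <typename F>
long long time_once_ms(F&& f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
    long long worst = 0;
    for (int i = 0; i < 1000; ++i)   // repeat to approximate the worst case
        worst = std::max(worst, time_once_ms([] { /* function under test */ }));
    std::printf("observed worst-case execution time: %lld ms\n", worst);
}
```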
In my case, I also had to work on power consumption, as it was an embedded system. Some tools allow measuring the CPU consumption and thus deciding which configuration is the most interesting in this specific case.
The optimal number of threads depends on how your clients are using the cpu.
If cpu is the only bottleneck and every core running a thread is constantly at top load, then setting the number of threads to the number of cores is a good idea.
If your clients are doing I/O (network, file, even page swapping) or any other operation which blocks your thread, then it will be necessary to set a higher number of threads, because some of them will be blocked even while CPU time is available.
In your scenario I would think it is the second case. The threads are blocked because 24 client events are active but only use 10% of the CPU (so the events processed by a thread waste 90% of its CPU resource). If this is the case, it would be a good idea to raise the thread count to something like 240 (number of cores * 100 / average load) so that another thread can run on the idling CPU.
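As a back-of-the-envelope check of that heuristic (the 10% figure is the measured load from the question, not something the code can discover on its own):

```cpp
#include <cstdio>
#include <thread>

int main() {
    unsigned cores = std::thread::hardware_concurrency();   // 24 on the machine above
    if (cores == 0) cores = 24;                              // fallback if unknown
    unsigned avg_load_percent = 10;                          // measured, e.g. with top
    unsigned suggested = cores * 100 / avg_load_percent;     // 24 * 100 / 10 = 240
    std::printf("suggested pool size: %u threads\n", suggested);
}
```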
But be warned: if clients are pinned to a single thread (thread A handles clients 1, 2, 3 and thread B handles clients 4, 5, 6), increasing the thread pool will help, but there may still be sporadic lag if two client events have to be processed by the same thread.
My application currently has a list of "tasks" (each task consists of a function - that function can be as simple as printing something out, but also far more complex) which gets looped through. (Additional note: most tasks send a packet after having been executed.) As some of these tasks could take quite some time, I thought about using a different, asynchronous thread for each task, thus letting all the tasks run concurrently.
Would that be a smart thing to do or not? One problem is that I can't possibly know the number of tasks beforehand, so it could result in quite a few threads being created, and I read somewhere that each piece of hardware has its limitations. I'm planning to run my application on a Raspberry Pi, and I think I will always have to run between 5 and 20 tasks.
Also, some of the tasks have a lower "priority" of running than others.
Should I just run the important tasks first and then the less important ones? (The problem here is that if the sum of the time needed for all tasks together exceeds the time within which some specific, important task should run, my application won't be accurate anymore.) Or should I implement everything in asynchronous threads? Or just try to make everything a little bit faster by only having the "packet sending" in an asynchronous thread, thus not having to wait until the packets actually get sent?
There are a number of questions you will need to ask yourself before you can reasonably design a solution.
Do you wish to limit the number of concurrent tasks?
Is it possible that in future the number of concurrent tasks will increase in a way that you cannot predict today?
... and probably a whole lot more.
If the answer to any of these is "yes" then you have a few options:
A producer/consumer queue with a fixed number of threads draining the queue (not recommended IMHO)
Write your tasks as asynchronous state machines around an event dispatcher such as boost::io_service (this is much more scalable).
If you know it's only going to be 20 concurrent tasks, you'll probably get away with std::async, but it's a sloppy way to write code.
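If std::async is good enough for your ~20 tasks, a minimal sketch looks like this (run_task and its int result are stand-ins for your real task functions):

```cpp
#include <cstdio>
#include <future>
#include <vector>

// Hypothetical task; in the real application this would do the actual work
// (and possibly send a packet afterwards).
int run_task(int id) { return id * 2; }

int main() {
    std::vector<std::future<int>> results;
    for (int id = 0; id < 20; ++id)               // roughly 5-20 concurrent tasks
        results.push_back(std::async(std::launch::async, run_task, id));
    for (auto& f : results)                        // wait for every task to finish
        std::printf("result: %d\n", f.get());
}
```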
I recently began experimenting with the pseudo-boost threadpool (pseudo because it hasn't been officially accepted yet).
As a simple exercise, I initialized the threadpool with a maximum of two threads.
Each task does two things:
a CPU-intensive calculation
writes out the result to disk
Question
How do I modify the model into a threadpool that does:
a CPU-intensive calculation
and a single I/O thread which listens for completion from the threadpool - takes the resultant memory and simply:
writes out the result to disk
Should I simply have the task communicate with the I/O thread (spawned as std::thread) through a std::condition_variable (essentially a mutexed queue of calculation results), or is there a way to do it all within the threadpool library?
Or is the gcc 4.6.1 implementation of future and promise mature enough for me to pull this off?
Answer
It looks like a simple mutex queue with a condition variable works fine.
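For reference, a stripped-down version of that queue (all names are illustrative): the compute threads call submit(), and one dedicated writer thread drains the queue and performs all the disk I/O.

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>

// Results are pushed by the compute tasks; a single writer thread owns the file.
struct ResultQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::string> q;
    bool done = false;
};

void writer(ResultQueue& rq, const std::string& path) {
    std::ofstream out(path);
    std::unique_lock<std::mutex> lock(rq.m);
    for (;;) {
        rq.cv.wait(lock, [&] { return !rq.q.empty() || rq.done; });
        while (!rq.q.empty()) {
            std::string item = std::move(rq.q.front());
            rq.q.pop();
            lock.unlock();          // write without holding the lock
            out << item << '\n';
            lock.lock();
        }
        if (rq.done) break;
    }
}

void submit(ResultQueue& rq, std::string result) {
    {
        std::lock_guard<std::mutex> lock(rq.m);
        rq.q.push(std::move(result));
    }
    rq.cv.notify_one();             // wake the single writer
}

// Usage sketch: run writer() on a std::thread, have each pool task call
// submit(), then set done under the lock, notify, and join the writer.
```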
By grouping read access and writes, in addition to using the threadpool, I got the following improvements:
2 core machine: 1h14m down to 33m (46% reduction in runtime)
4 core vm: 40m down to 18m (55% reduction in runtime)
Thanks to Martin James for his thoughtful answer. Before this exercise, I thought that my next computational server should have dual processors and a ton of memory. But now, with so much processing power inherent in the multiple cores and hyperthreading, I realize that money will probably be better spent dealing with the I/O bottleneck.
As Martin mentioned, having multiple drives or RAID configurations would probably help. I will also look into adjusting I/O buffer settings at the kernel level.
If there is only one local disk, one writer thread on the end of a producer-consumer queue would be my favourite. Seeks, networked-disk delays and other hiccups will not leave any pooled threads that have finished their calculation stuck trying to write to the disk. Other disk operations (e.g. selecting another location/file/folder) are also easier/quicker if only one thread is accessing it - the queue will take up the slack and allow seamless calculation during the latency.
Writing directly from the calculation task or submitting the result-write as a separate task would work OK, but you would need more threads in the pool to achieve pause-free operation.
Everything changes if there is more than one disk. More than one writer thread would then become a worthwhile proposition because of the increased overall performance. I would then probably go with an array/list of queues/write-threads, one for each disk.
When I write a message-driven app, much like a standard Windows app except that it uses messaging extensively for internal operations, what would be the best approach regarding threading?
As I see it, there are basically three approaches (if you have any other setup in mind, please share):
Having a single thread process all of the messages.
Having separate threads for separate message types (General, UI, Networking, etc...)
Having multiple threads that share and process a single message queue.
So, would there be any significant performance differences between the three?
Here are some general thoughts:
Obviously, the last two options benefit from a situation where there's more than one processor. Plus, if any thread is waiting for an external event, other threads can still process unrelated messages. But ignoring that, it seems that multiple threads only add overhead (thread switches, not to mention more complicated synchronization situations).
And another question: Would you recommend to implement such a system upon the standard Windows messaging system, or to implement a separate queue mechanism, and why?
The specific choice of threading model should be driven by the nature of the problem you are trying to solve. There isn't necessarily a single "correct" approach to designing the threading model for such an application. However, if we adopt the following assumptions:
messages arrive frequently
messages are independent and don't rely too heavily on shared resources
it is desirable to respond to an arriving message as quickly as possible
you want the app to scale well across processing architectures (i.e. multicore/multi-CPU systems)
scalability is the key design requirement (e.g. more messages at a faster rate)
resilience to thread failure / long operations is desirable
In my experience, the most effective threading architecture would be to employ a thread pool. All messages arrive on a single queue, multiple threads wait on the queue and process messages as they arrive. A thread pool implementation can model all three thread-distribution examples you have.
#1 Single thread processes all messages => thread pool with only one thread
#2 Thread per N message types => thread pool with N threads, each thread peeks at the queue to find appropriate message types
#3 Multiple threads for all messages => thread pool with multiple threads
The benefit of this design is that you can scale the number of threads in the pool in proportion to the processing environment or the message load. The number of threads can even scale at runtime to adapt to the real-time message load being experienced.
There are many good thread pooling libraries available for most platforms, including .NET, C++/STL, Java, etc.
As to your second question, whether to use the standard Windows message dispatch mechanism: this mechanism comes with significant overhead and is really only intended for pumping messages through a Windows application's UI loop. Unless this is the problem you are trying to solve, I would advise against using it as a general message-dispatching solution. Furthermore, Windows messages carry very little data - it is not an object-based model. Each Windows message has a code and a 32-bit parameter. This may not be enough to base a clean messaging model on. Finally, the Windows message queue is not designed to handle cases like queue saturation, thread starvation, or message re-queuing; these are cases that often arise when implementing a decent message queuing solution.
We can't tell you much for sure without knowing the workload (ie, the statistical distribution of events over time) but in general
a single queue with multiple servers is at least as fast, and usually faster, so options 1 and 3 would be preferable to 2.
multiple threads in most languages add complexity because of the need to avoid contention and multiple-writer problems
long duration processes can block processing for other things that could get done quicker.
So my horseback guess is that having a single event queue, with several server threads taking events off the queue, might be a little faster.
Make sure you use a thread-safe data structure for the queue.
It all depends.
For example:
Events in a GUI queue are best handled by a single thread, as there is an implied order in the events and they therefore need to be processed serially. This is why most GUI apps have a single thread to handle events, though there may be multiple threads creating them (and it does not preclude the event thread from creating a job and handing it off to a worker pool - see below).
Events on a socket can potentially be handled in parallel (assuming HTTP), as each request is stateless and can thus be handled independently (OK, I know that is oversimplifying HTTP).
Work jobs, where each job is independent and placed on a queue. This is the classic case for using a set of worker threads. Each thread carries out a potentially long operation independently of the other threads and, on completion, comes back to the queue for another job.
In general, don't worry about the overhead of threads. It's not going to be an issue if you're talking about merely a handful of them. Race conditions, deadlocks, and contention are a bigger concern, and if you don't know what I'm talking about, you have a lot of reading to do before you tackle this.
I'd go with option 3, using whatever abstractions my language of choice offers.
Note that there are two different performance goals, and you haven't stated which you are targeting: throughput and responsiveness.
If you're writing a GUI app, the UI needs to be responsive. You don't care how many clicks per second you can process, but you do care about showing some response within a 10th of a second or so (ideally less). This is one of the reasons it's best to have a single thread devoted to handling the GUI (other reasons have been mentioned in other answers). The GUI thread needs to basically convert Windows messages into work-items and let your worker queue handle the heavy work. Once the worker is done, it notifies the GUI thread, which then updates the display to reflect any changes. It does things like painting a window, but not rendering the data to be displayed. This gives the app a quick "snappiness" that is what most users want when they talk about performance. They don't care if it takes 15 seconds to do something hard, as long as when they click on a button or a menu, it reacts instantly.
The other performance characteristic is throughput. This is the number of jobs you can process in a specific amount of time. Usually this type of performance tuning is only needed on server-type applications, or other heavy-duty processing. This measures how many webpages can be served up in an hour, or how long it takes to render a DVD. For these sorts of jobs, you want to have 1 active thread per CPU. Fewer than that, and you're going to be wasting idle clock cycles. More than that, and the threads will be competing for CPU time and tripping over each other. Take a look at the second graph in this DDJ article for the trade-off you're dealing with. Note that the ideal thread count is higher than the number of available CPUs due to things like blocking and locking. The key is the number of active threads.
A good place to start is to ask yourself why you need multiple threads.
The well-thought-out answer to this question will lead you to the best answer to the subsequent question, "how should I use multiple threads in my application?"
And that must be a subsequent question; not a primary question. The first question must be why, not how.
I think it depends on how long each thread will be running. Does each message take the same amount of time to process? Or will certain messages take a few seconds, for example? If I knew that message A was going to take 10 seconds to complete, I would definitely use a new thread, because why would I want to hold up the queue for a long-running task...
My 2 cents.
I think option 2 is the best. Having each thread do independent tasks would give you the best results. The 3rd approach can cause more delays if multiple threads are doing I/O operations like disk reads, reading from common sockets, and so on.
Whether to use the Windows messaging framework for processing requests depends on the workload each thread would have. I think Windows restricts the number of messages that can be queued to at most 10,000. For most cases this should not be an issue, but if you have lots of messages to queue this might be something to take into consideration.
A separate queue gives better control in the sense that you may reorder it the way you want (maybe depending on priority).
Yes, there will be performance differences between your choices.
(1) introduces a bottle-neck for message processing
(3) introduces locking contention because you'll need to synchronize access to your shared queue.
(2) is starting to go in the right direction... though a queue for each message type is a little extreme. I'd probably recommend starting with a queue for each model in your app and adding queues where it makes sense to do so for improved performance.
If you like option #2, it sounds like you would be interested in implementing a SEDA architecture. It is going to take some reading to understand what is going on, but I think the architecture fits well with your line of thinking.
BTW, Yield is a good C++/Python hybrid implementation.
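For what it's worth, a very reduced sketch of the SEDA idea (not using Yield; Stage, the event type and the wiring below are all made up): each stage owns its queue and worker thread and forwards events to the next stage.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// One stage = one queue + one worker; the handler may enqueue into the next stage.
template <typename Event>
class Stage {
public:
    explicit Stage(std::function<void(Event)> handler)
        : handler_(std::move(handler)), worker_([this] { run(); }) {}
    ~Stage() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
    void enqueue(Event e) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(e)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            Event e;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
                if (stop_ && q_.empty()) return;
                e = std::move(q_.front());
                q_.pop();
            }
            handler_(std::move(e));
        }
    }
    std::function<void(Event)> handler_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Event> q_;
    bool stop_ = false;
    std::thread worker_;   // started last, after the members above
};

// Wiring two stages: raw "network" events flow into a "logic" stage.
// Stage<std::string> logic([](std::string msg) { /* process message */ });
// Stage<std::string> network([&](std::string raw) { logic.enqueue(std::move(raw)); });
```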
I'd have a thread pool servicing the message queue, and make the number of threads in the pool easily configurable (perhaps even at runtime). Then test it out with expected load.
That way you can see what the actual correlation is - and if your initial assumptions change, you can easily change your approach.
A more sophisticated approach would be for the system to introspect its own performance traits and adapt its use of resources, threads in particular, as it goes. Probably overkill for most custom application code, but I'm sure there are products out there that do that.
As for the windows events question - I think that's probably an application specific question that there is no right or wrong answer to in the general case. That said, I usually implement my own queue as I can tailor it to the specific characteristics of the task at hand. Sometimes that might involve routing events via the windows message queue.
I have a networking Linux application which receives RTP streams from multiple destinations, does very simple packet modification and then forwards the streams to the final destination.
How do I decide how many threads I should have to process the data? I suppose, I cannot open a thread for each RTP stream as there could be thousands. Should I take into account the number of CPU cores? What else matters?
Thanks.
It is important to understand the purpose of using multiple threads on a server: many threads in a server serve to decrease latency rather than to increase speed. You don't make the CPU any faster by having more threads, but you make it more likely that a thread will always be available within a given period to handle a request.
Having a bunch of threads which just move data in parallel is a rather inefficient shotgun approach (creating one thread per request naturally just fails completely). Using the thread pool pattern can be a more effective, focused approach to decreasing latency.
Now, in the thread pool, you want to have at least as many threads as you have CPUs/cores. You can have more than this but the extra threads will again only decrease latency and not increase speed.
Think of the problem of organizing server threads as akin to organizing a line in a supermarket. Would you like to have a lot of cashiers who work more slowly, or one cashier who works super fast? The problem with the fast cashier isn't speed, but rather that one customer with a lot of groceries might still take up a lot of their time. The need for many threads comes from the possibility that a few requests will take a lot of time and block all your threads. By this reasoning, whether you benefit from many slower cashiers depends on whether your customers have the same number of groceries or wildly different numbers. Getting back to the basic model, this means that you have to play with your thread count to figure out what is optimal given the particular characteristics of your traffic, looking at the time taken to process each request.
Classically, the number of reasonable threads depends on the number of execution units, the ratio of IO to computation, and the available memory.
Number of Execution Units (XU)
This counts how many threads can be active at the same time. Depending on your computations, it might or might not count things like hyperthreads -- mixed instruction workloads work better.
Ratio of IO to Computation (%IO)
If the threads never wait for IO but always compute (%IO = 0), using more threads than XUs only increases the overhead of memory pressure and context switching. If the threads always wait for IO and never compute (%IO = 1), then using a variant of poll() or select() might be a good idea.
For all other situations, XU / (1 - %IO) gives an approximation of how many threads are needed to fully use the available XUs (for example, with 8 XUs and %IO = 0.75, roughly 8 / 0.25 = 32 threads).
Available Memory (Mem)
This is more of an upper limit. Each thread uses a certain amount of system resources (MemUse); Mem / MemUse gives you an approximation of how many threads can be supported by the system.
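Putting the two formulas together as a quick calculation (every input below is an assumed measurement, not a recommendation):

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    double xu      = 8.0;    // execution units (cores / hyperthreads)
    double io_frac = 0.75;   // fraction of a thread's time spent waiting on IO
    double mem     = 8e9;    // memory available to the server, bytes
    double mem_use = 32e6;   // memory consumed per thread (stack, buffers), bytes

    double by_cpu = xu / (1.0 - io_frac);   // XU / (1 - %IO): keep the XUs busy
    double by_mem = mem / mem_use;          // Mem / MemUse: system upper limit
    std::printf("suggested threads: %.0f (memory cap: %.0f)\n",
                std::min(by_cpu, by_mem), by_mem);
}
```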
Other Factors
The performance of the whole system can still be constrained by other factors, even if you can guess or (better) measure the numbers above. For example, there might be another service running on the system which uses some of the XUs and memory. Another problem is the generally available IO bandwidth (IOCap). If you need fewer computing resources per transferred byte than your XUs provide, you'll obviously need to care less about using them completely and more about increasing IO throughput.
For more about this latter problem, see this Google Talk about the Roofline Model.
I'd say, try using just ONE thread; it makes programming much easier. Although you'll need to use something like libevent to multiplex the connections, you won't have any unexpected synchronisation issues.
Once you've got a working single-threaded implementation, you can do performance testing and make a decision on whether a multi-threaded one is necessary.
Even if a multithreaded implementation is necessary, it may be easier to break it into several processes instead of threads (i.e. not sharing address space; either fork() or exec multiple copies of the process from a parent) if they don't have a lot of shared data.
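A bare-bones pre-fork sketch of that multi-process variant (the worker count and run_event_loop() are placeholders; each child would run its own single-threaded event loop):

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const int kWorkers = 4;                 // arbitrary; pick based on core count
    for (int i = 0; i < kWorkers; ++i) {
        pid_t pid = fork();
        if (pid == 0) {                     // child: no shared address space
            std::printf("worker %d (pid %d) started\n", i, static_cast<int>(getpid()));
            // run_event_loop();            // hypothetical per-process loop
            _exit(0);
        }
    }
    while (wait(nullptr) > 0) {}            // parent: reap workers as they exit
    return 0;
}
```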
You could also consider using something like Python's "Twisted" to make implementation easier (this is what it's designed for).
Really there's probably not a good case for using threads over processes - but maybe there is in your case, it's difficult to say. It depends how much data you need to share between threads.
I would look into a thread pool for this application.
http://threadpool.sourceforge.net/
Allow the thread pool to manage your threads and the queue.
You can tweak the maximum and minimum number of threads used based on performance profiling later.
Listen to the people advising you to use libevent (or OS-specific utilities such as epoll/kqueue). In the case of many connections this is an absolute must because, as you said, creating that many threads would be an enormous performance hit, and select() also doesn't quite cut it.
Let your program decide. Add code to it that measures throughput and increases/decreases the number of threads dynamically to maximize it.
This way, your application will always perform well, regardless of the number of execution cores and other factors
It is a good idea to avoid trying to create one (or even N) threads per client request. This approach is classically non-scalable, and you will definitely run into problems with memory usage or context switching. You should look at using a thread pool approach instead and treat the incoming requests as tasks for any thread in the pool to handle. The scalability of this approach is then limited by the ideal number of threads in the pool - usually this is related to the number of CPU cores. You want each thread to use exactly 100% of the CPU on a single core, so in the ideal case you would have 1 thread per core; this would reduce context switching to zero. Depending on the nature of the tasks, this might not be possible: maybe the threads have to wait for external data, or read from disk, or whatever, so you may find that the number of threads needs to be increased by some scaling factor.