My professor casually mentioned that we should write multi-threaded programs even when we are using a single-core processor; however, because of a lack of time, he did not elaborate on it.
I would like to know: what are the benefits of a multi-threaded program on a single-core processor?
It won't be as significant as on a multi-core system, but it can still provide some benefits.
Mainly, the benefits you get come from the context switch that happens when the currently executing thread has to wait for something. The executing thread may be waiting for anything: a hardware resource, a branch misprediction, or a data transfer after a cache miss.
At that point a waiting thread can be executed to make use of this "waiting time". But of course the context switch itself takes some time, and managing threads in your code rather than computing sequentially adds some extra complexity to your program. And, as has been said, some applications need to be multi-threaded, so in some cases there is no escaping the context switch.
Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores; it's also about performing multiple tasks at once.
Take Skype, for example: the GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single-threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.
Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.
The benefits can take different forms.
One widely used example is a GUI application that has to perform some kind of computation. With a single thread, the user has to wait for the result before doing anything else with the application; if you run the computation in a separate thread, the user interface stays available to the user during the computation. So a multi-threaded program can emulate a multi-tasking environment even on a single-core system. That's one of the points.
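A minimal sketch of that idea, using std::thread and an atomic flag; the console loop and do_heavy_computation below are only placeholders for a real GUI event loop and a real computation:

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Placeholder for the expensive work the GUI should not wait for.
    int do_heavy_computation() {
        std::this_thread::sleep_for(std::chrono::seconds(3));
        return 42;
    }

    int main() {
        std::atomic<bool> done{false};
        int result = 0;

        // Run the computation in a separate thread.
        std::thread worker([&] {
            result = do_heavy_computation();
            done = true;
        });

        // Stand-in for the GUI event loop: the interface stays responsive
        // while the worker runs, even on a single core.
        while (!done) {
            std::cout << "handling user input / repainting...\n";
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
        }

        worker.join();
        std::cout << "computation finished: " << result << "\n";
    }

Even on one core, the OS interleaves the two threads, so the "GUI" keeps reacting while the computation makes progress.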
As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. Futures and boost::asio provide ways of doing non-blocking work without necessarily resorting to multiple threads.
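For illustration, a minimal single-threaded Boost.Asio sketch (assuming Boost 1.66 or later for io_context); the steady_timer stands in for any asynchronous operation such as a socket read:

    #include <boost/asio.hpp>
    #include <chrono>
    #include <iostream>

    int main() {
        boost::asio::io_context io;

        // Stand-in for any asynchronous operation (socket read, timer, ...).
        boost::asio::steady_timer timer(io, std::chrono::seconds(1));
        timer.async_wait([](const boost::system::error_code& ec) {
            if (!ec)
                std::cout << "timer fired\n";
        });

        std::cout << "async_wait returned immediately; no second thread involved\n";

        // run() dispatches completion handlers on this single thread.
        io.run();
    }

Everything, including the completion handler, runs on the one thread inside io.run().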
I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking or long-running call you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code, for example background polling for updates, monitoring some resource (e.g. free storage), or checking internet connectivity. If you keep these in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program; the sole communication with the main logic is usually a single 'event' (see the sketch after this list).
In some environments you might get more processing time. This mainly depends on how your OS scheduling system works, but if this allocates time per thread, the more threads you have the more your app will be scheduled.
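As a sketch of the second point, a background monitor thread that communicates with the main logic through a single flag; resource_ok is a hypothetical check standing in for a disk-space or connectivity test:

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Hypothetical check; in a real program this might query free disk
    // space or test internet connectivity.
    bool resource_ok() { return true; }

    int main() {
        std::atomic<bool> stop{false};
        std::atomic<bool> alarm{false};   // the single "event" shared with the main logic

        // Background monitor: lives in its own little 'runtime' and barely
        // interacts with the rest of the program.
        std::thread monitor([&] {
            while (!stop) {
                if (!resource_ok())
                    alarm = true;
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
        });

        // Main program logic; it only ever looks at the flag.
        for (int i = 0; i < 5; ++i) {
            if (alarm)
                std::cout << "resource problem reported by the monitor\n";
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }

        stop = true;
        monitor.join();
    }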
Some benefits long-term:
Where it's not hard to do, you benefit when your hardware evolves. You never know what's going to happen: today your app runs on a single-core embedded device, tomorrow that embedded device gets a quad core. Programming with threads from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash all related operations end up in the same thread. The advantage on a single core is 'small', but it's not hard to do, since you need few synchronization primitives, so the overhead stays small.
That said, I think there are situations where it's very ill-advised:
As soon as your required synchronization mechanism with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). Multi-threading might still give you a benefit once you actually move to multiple CPUs, but the overhead is huge, both on your single core and in your programming time.
For instance, think about operations that block because of slow peripheral devices (hard disk access, etc.). While these are waiting, even a single core can do other things asynchronously.
In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for the completion of IO requests (user input, network/disk IO), for critical resources to become available, or for any sort of asynchronously triggered event, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchronous IO, coroutines, or fibers come to mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.
As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The average number of cores will double every 18 months in the future. People have learned single-threaded programming for 50 years now, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming. This refers to low-level aspects like avoiding race conditions and proper synchronization, as well as the high-level aspects like the general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.
Is the following correct?
The disruptor pattern has better parallel performance and scalability if each entry has to be processed in multiple ways (IO operations or annotations), since that can be parallelized across multiple consumers without contention.
Conversely, work stealing (i.e. storing entries locally and stealing entries from other threads) has better parallel performance and scalability if each entry has to be processed in only a single way, since disjointly distributing the entries onto multiple threads in the disruptor pattern causes contention.
(And is the disruptor pattern still so much faster than other lockless multi-producer multi-consumer queues (e.g. from boost) when multiple producers (i.e. CAS operations) are involved?)
My situation in detail:
Processing an entry can produce several new entries, which must be processed eventually, too. Performance has highest priority, entries being processed in FIFO order has second priority.
In the current implementation, each thread uses a local FIFO, where it adds its new entries. Idle threads steal work from other thread's local FIFO. Dependencies between the thread's processing are resolved using a lockless, mechanically sympathetic hash table (CASs on write, with bucket granularity). This results in pretty low contention but FIFO order is sometimes broken.
Using the disruptor pattern would guarantee FIFO order. But wouldn't distributing the entries onto the threads cause much higher contention (e.g. CAS on a read cursor) than for local FIFOs with work stealing (each thread's throughput is about the same)?
References I've found
The performance tests in the standard technical paper on the disruptor (Chapter 5 + 6) do not cover disjoint work distribution.
https://groups.google.com/forum/?fromgroups=#!topic/lmax-disruptor/tt3wQthBYd0 is the only reference I've found on the disruptor + work stealing. It states that a queue per thread is dramatically slower if there is any shared state, but does not go into detail or explain why. I doubt that that statement applies to my situation, where:
shared state is resolved with a lockless hash table;
entries have to be disjointly distributed amongst consumers;
except for work stealing, each thread reads and writes only its own local queue.
Update - Bottom line up front: for maximum performance you need to write both, each in its idiomatic style (disruptor and work stealing), and then benchmark.
To me, the distinction is primarily in the split between message focus and task focus, and therefore in the way you want to think about the problem. Try to solve your problem, and if it is task-focused then the Disruptor is a good fit. If the problem is message-focused, then you might be better suited to another technique such as work stealing.
Use work stealing when your implementation is message-focused. Each thread can pick up a message and run it through to completion. For example, in an HTTP server, each inbound HTTP request is allocated a thread, and that thread is focused on handling the request start to finish: logging the request, checking security controls, doing the vhost lookup, fetching the file, sending the response, and closing the connection.
Use disruptor when your implementation is task focused. Each thread can work on a particular stage of the processing. Alternative example: for a task focus, the processing would be split into stages, so you would have a thread that does logging, a thread for security controls, a thread for vhost lookup, etc; each thread focused on its task and passes the request to the next thread in the pipeline. Stages may be parallelised but the overall structure is a thread focused on a specific task and hands the message along between threads.
Of course, you can change your implementation to suit each approach better.
In your case, I would structure the problem differently if you wanted to use Disruptor. Generally you would eliminate shared state by having a single thread own the state and pass all tasks through that thread of work - look up SEDA for lots of diagrams like this. This can have lots of benefits, but again, is really down to your implementation.
Some more verbosity:
Disruptor - very useful when strict ordering of stages is required, with additional benefits when all tasks are of a consistent length, e.g. no blocking on an external system and a very similar amount of processing per task. In this scenario, you can assume that all threads will work evenly through the system, and therefore arrange N threads to process every N messages. I like to think of the Disruptor as an efficient way to implement SEDA-like systems where threads process stages. You can of course have an application with a single stage and multiple parallel units performing the same work at each stage; however, this is not really the point in my view. This would totally avoid the cost of shared state.
Work stealing - use this when tasks are of varied duration and the order of message processing is not important, as it allows threads that are free and have already consumed their own messages to make progress on another thread's task queue. This way, if for example you have 10 threads and 1 is blocked on IO, the remainder will still complete their processing.
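For concreteness, a minimal mutex-based sketch of the work-stealing side (not the lock-free, mechanically sympathetic structures discussed above; names such as WorkerQueue and worker_loop are made up for illustration): each worker consumes its own deque in FIFO order, and idle workers steal from the other end of someone else's deque.

    #include <atomic>
    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <thread>
    #include <vector>

    using Task = std::function<void()>;

    // One deque per worker; a mutex per deque keeps the sketch simple.
    struct WorkerQueue {
        std::mutex m;
        std::deque<Task> q;

        void push(Task t) {                    // owner adds new work at the back
            std::lock_guard<std::mutex> lk(m);
            q.push_back(std::move(t));
        }
        std::optional<Task> pop() {            // owner consumes in FIFO order
            std::lock_guard<std::mutex> lk(m);
            if (q.empty()) return std::nullopt;
            Task t = std::move(q.front());
            q.pop_front();
            return t;
        }
        std::optional<Task> steal() {          // thieves take from the back
            std::lock_guard<std::mutex> lk(m);
            if (q.empty()) return std::nullopt;
            Task t = std::move(q.back());
            q.pop_back();
            return t;
        }
    };

    void worker_loop(std::vector<WorkerQueue>& queues, std::size_t me,
                     std::atomic<bool>& done) {
        while (!done) {
            if (auto t = queues[me].pop()) { (*t)(); continue; }
            // Local queue empty: try to steal from the others.
            bool stole = false;
            for (std::size_t i = 0; i < queues.size() && !stole; ++i) {
                if (i == me) continue;
                if (auto t = queues[i].steal()) { (*t)(); stole = true; }
            }
            if (!stole) std::this_thread::yield();
        }
    }

A real implementation would replace the mutexes with a lock-free deque (e.g. Chase-Lev) and add termination detection; this only shows the shape of the scheme.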
I am implementing a custom server that needs to maintain a very large number (100K or more) of long-lived connections. The server simply passes messages between sockets and doesn't do any serious data processing. Messages are small, but many of them are received/sent every second. Reducing latency is one of the goals. I realize that using multiple cores won't improve performance, and therefore I decided to run the server in a single thread by calling the run_one or poll methods of the io_service object. In any case, a multi-threaded server would be much harder to implement.
What are the possible bottlenecks? Syscalls, bandwidth, completion queue / event demultiplexing? I suspect that dispatching handlers may require locking (done internally by the asio library). Is it possible to disable event queue locking (or any other locking) in boost.asio?
EDIT: related question. Does syscall performance improve with multiple threads? My feeling is that, because syscalls are atomic/synchronized by the kernel, adding more threads won't improve speed.
You might want to read my question from a few years ago, I asked it when first investigating the scalability of Boost.Asio while developing the system software for the Blue Gene/Q supercomputer.
Scaling to 100k or more connections should not be a problem, though you will need to be aware of the obvious resource limitations such as the maximum number of open file descriptors. If you haven't read the seminal C10K paper, I suggest reading it.
After you have implemented your application using a single thread and a single io_service, I suggest investigating a pool of threads invoking io_service::run(), and only then investigate pinning an io_service to a specific thread and/or cpu. There are multiple examples included in the Asio documentation for all three of these designs, and several questions on SO with more information. Be aware that as you introduce multiple threads invoking io_service::run() you may need to implement strands to ensure the handlers have exclusive access to shared data structures.
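A minimal sketch of the "pool of threads invoking io_service::run()" design with a strand serializing access to shared data. It uses the older io_service/strand spelling this answer refers to (newer Boost versions prefer io_context), so treat the exact names as version-dependent:

    #include <boost/asio.hpp>
    #include <algorithm>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        boost::asio::io_service io;
        boost::asio::io_service::strand strand(io);   // serializes handlers that share state

        int shared_counter = 0;                        // only ever touched through the strand

        // Handlers posted through the strand never run concurrently,
        // even with many threads calling run().
        for (int i = 0; i < 1000; ++i)
            strand.post([&shared_counter] { ++shared_counter; });

        // The thread pool: every thread services the same io_service.
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned i = 0; i < n; ++i)
            pool.emplace_back([&io] { io.run(); });    // run() returns once the work runs out

        for (auto& t : pool) t.join();
        std::cout << "counter = " << shared_counter << "\n";   // 1000, with no mutex
    }

Because all the increments go through the strand, the counter needs no mutex even though several threads execute handlers.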
Using boost::asio you can write a single-threaded or a multi-threaded server at approximately the same development cost. You can write the single-threaded version first, then convert it to multi-threaded if needed.
Typically, the only bottleneck in boost::asio is that the epoll/kqueue reactor runs under a mutex, so only one thread is doing epoll at a time. This can decrease performance when you have a multi-threaded server that serves lots and lots of very small packets. But, IMO, it should still be faster than a plain single-threaded server.
Now about your task. If you want to just pass messages between connections, I think it should be a multi-threaded server. The problem is the syscalls (recv/send etc.). An instruction is a very easy thing for the CPU to do, but a syscall is not a very "light" operation (everything is relative, but relative to the other jobs in your task). So with a single thread you will get a big syscall overhead, which is why I recommend a multi-threaded scheme.
Also, you can use separate io_service objects, following the "io_service per thread" idiom. I think this should give the best performance, but it has a drawback: if one io_service gets too big a queue, the other threads will not help it, so some connections may slow down. On the other hand, with a single io_service, queue overrun can lead to big locking overhead. All you can do is implement both variants and measure bandwidth/latency; it should not be too difficult to implement both.
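For comparison, a rough sketch of the "io_service per thread" idiom (again using the older io_service names; connection handling is omitted and pick_service is just an illustrative helper):

    #include <boost/asio.hpp>
    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <thread>
    #include <vector>

    int main() {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());

        std::vector<std::unique_ptr<boost::asio::io_service>> services;
        std::vector<std::unique_ptr<boost::asio::io_service::work>> work;
        std::vector<std::thread> threads;

        // One io_service (and one thread) per core; the work object keeps
        // run() alive while there are no pending handlers yet.
        for (unsigned i = 0; i < n; ++i) {
            services.push_back(std::make_unique<boost::asio::io_service>());
            work.push_back(std::make_unique<boost::asio::io_service::work>(*services.back()));
        }
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back([&services, i] { services[i]->run(); });

        // Round-robin assignment: each accepted connection gets one io_service,
        // and therefore one thread, with no locking between connections.
        std::size_t next = 0;
        auto pick_service = [&]() -> boost::asio::io_service& {
            boost::asio::io_service& io = *services[next];
            next = (next + 1) % services.size();
            return io;
        };
        (void)pick_service;   // sockets/acceptors would be created against the picked service

        for (auto& w : work) w.reset();    // let each run() finish
        for (auto& t : threads) t.join();
    }

Each connection is then created against the io_service returned by pick_service, so all of its handlers run on one thread and need no locking among themselves.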
I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, since I am at it, I would like to do it properly, as opposed to just getting-it-done.
If I understood correctly: Apache uses a thread-per-connection architecture, while nginx uses an event loop and then dedicates a worker [x] to the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure if it could be applied to my situation. The analogy also seems to be very idealistic. I have rarely seen my computer run at 100% CPU (even with a umptillion Chrome tabs open, Photoshop and what-not running simultaneously).
Also, I have come across a SO post (somehow it vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. This question was related to JVM, though.
So, let's assume a fictional user base of around 5,000 users. Which approach would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread pool (approximately how big do you suggest the thread pool should be?).
Creating a thread per connection and then destroying the thread when the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.
Reactive applications certainly scale better when they are written correctly. This means:
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of your server. You typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexes, since these can block, so no shared mutable state. If you require shared state you will have to wrap it with an actor or similar so that only one thread has access to the state.
All work in the reactive threads should be CPU bound.
All IO has to be asynchronous, or be performed in a different thread pool with the results fed back into the reactor.
This means using either futures or callbacks to process replies; this style of code can quickly become unmaintainable if you are not used to it and disciplined.
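A small sketch of that idea, feeding the result of a blocking call back into the reactor as a callback. It assumes Boost 1.66+ (io_context, post, make_work_guard), and blocking_lookup is just a placeholder for any call you cannot make asynchronous:

    #include <boost/asio.hpp>
    #include <chrono>
    #include <iostream>
    #include <string>
    #include <thread>

    // Placeholder for a call you cannot make non-blocking (legacy API, disk, ...).
    std::string blocking_lookup() {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        return "result";
    }

    int main() {
        boost::asio::io_context reactor;      // small number of non-blocking threads
        boost::asio::io_context blocking_io;  // separate pool that is allowed to block

        auto reactor_work  = boost::asio::make_work_guard(reactor);
        auto blocking_work = boost::asio::make_work_guard(blocking_io);

        std::thread reactor_thread([&] { reactor.run(); });
        std::thread blocking_thread([&] { blocking_io.run(); });

        // Hand the blocking call to the blocking pool, and feed the result
        // back into the reactor as a small, CPU-bound callback.
        boost::asio::post(blocking_io, [&] {
            std::string r = blocking_lookup();
            boost::asio::post(reactor, [r] { std::cout << "reactor got: " << r << "\n"; });
        });

        std::this_thread::sleep_for(std::chrono::seconds(1));   // demo only
        reactor_work.reset();
        blocking_work.reset();
        reactor_thread.join();
        blocking_thread.join();
    }

The reactor thread never blocks; it only runs the short callback that consumes the result.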
All work in the reactive threads should be small
To maintain responsiveness of the server, all tasks in the reactor must be small (bounded in time).
On an 8-core machine you cannot allow 8 long tasks to arrive at the same time, because no other work will start until they are complete.
If a task could take a long time it must be broken up (cooperative multitasking).
Tasks in reactive applications are scheduled by the application not the operating system, that is why they can be faster and use less memory. When you write a Reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
I am a big fan of reactive architectures, but they come with costs. I am not sure I would write my first C++ application as reactive; I normally try to learn one thing at a time.
If you decide to use a reactive architecture use a good framework that will help you design and structure your code or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? Can it only come in from an external event (e.g. a network request)?
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results?
I cannot recommend a C++ library for this, I now do my server development in Scala and Akka which provide all of this with an excellent composable futures library to keep the code clean.
Best of luck learning C++, and with whichever choice you make.
Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost Thread plus Boost ASIO. You could also try the C++11 std thread library and std mutex (but Boost ASIO is better than mutexes in a lot of cases; just always call back to the same thread and you don't need protected regions). Stay away from std future, because it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core. 8 cores -> 8 threads. Plus maybe a few extra, if you think it's possible that your threadpool threads might call blocking operations sometimes.
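A tiny sketch of that sizing rule (the "+ 2" cushion is an arbitrary illustrative value to tune by measurement):

    #include <algorithm>
    #include <thread>

    // One thread per core, plus a small cushion in case pool threads
    // occasionally block on something; tune the "+ 2" by measurement.
    unsigned pool_size() {
        unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        return cores + 2;
    }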
FWIW, POCO has supported option 2 (ParallelReactor) since version 1.5.1.
I think that option 2 is the best one. As for tuning of the pool size, I think the pool should be adaptive. It should be able to spawn more threads (with some high hard limit) and remove excessive threads in times of low activity.
As the analogy you linked to (and its comments) suggests, this is somewhat application dependent. Now, what you are building here is a game server, so let's analyze that.
Game servers (generally) do a lot of I/O and relatively few calculations, so they are far from 100%-CPU applications.
On the other hand, they also usually change values in some database (a "game world" model). All players create reads and writes to this database, which is exactly the intersection problem in the analogy.
So while you may gain something from handling the I/O in separate threads, you will also lose something from having separate threads access the same database and wait for its locks.
So either option 1 or 2 is acceptable in your situation. For scalability reasons I would not recommend option 3.
When I write a message-driven app (much like a standard Windows app, except that it extensively uses messaging for internal operations), what would be the best approach regarding threading?
As I see it, there are basically three approaches (if you have any other setup in mind, please share):
Having a single thread process all of the messages.
Having separate threads for separate message types (General, UI, Networking, etc...)
Having multiple threads that share and process a single message queue.
So, would there be any significant performance differences between the three?
Here are some general thoughts:
Obviously, the last two options benefit from a situation where there's more than one processor. Plus, if any thread is waiting for an external event, other threads can still process unrelated messages. But ignoring that, it seems that multiple threads only add overhead (thread switches, not to mention more complicated sync situations).
And another question: Would you recommend to implement such a system upon the standard Windows messaging system, or to implement a separate queue mechanism, and why?
The specific choice of threading model should be driven by the nature of the problem you are trying to solve. There isn't necessarily a single "correct" approach to designing the threading model for such an application. However, if we adopt the following assumptions:
messages arrive frequently
messages are independent and don't rely too heavily on shared resources
it is desirable to respond to an arriving message as quickly as possible
you want the app to scale well across processing architectures (i.e. multicore/multi-CPU systems)
scalability is the key design requirement (e.g. more messages at a faster rate)
resilience to thread failure / long operations is desirable
In my experience, the most effective threading architecture would be to employ a thread pool. All messages arrive on a single queue, multiple threads wait on the queue and process messages as they arrive. A thread pool implementation can model all three thread-distribution examples you have.
#1 Single thread processes all messages => thread pool with only one thread
#2 Thread per N message types => thread pool with N threads, each thread peeks at the queue to find appropriate message types
#3 Multiple threads for all messages => thread pool with multiple threads
The benefit of this design is that you can scale the number of threads in the thread pool in proportion to the processing environment or the message load. The number of threads can even be scaled at runtime to adapt to the real-time message load being experienced.
There are many good thread pooling libraries available for most platforms, including .NET, C++/STL, Java, etc.
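For illustration, a minimal thread-pool sketch built on a single shared queue protected by a mutex and condition variable; MessagePool and Message are made-up names, and a real message type would replace the std::function placeholder. Constructing it with one thread gives option #1, with several threads option #3:

    #include <condition_variable>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    using Message = std::function<void()>;   // stand-in for a real message type

    class MessagePool {
    public:
        explicit MessagePool(std::size_t n) {           // n == 1 gives option #1
            for (std::size_t i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~MessagePool() {
            {
                std::lock_guard<std::mutex> lk(m_);
                stopping_ = true;
            }
            cv_.notify_all();
            for (auto& w : workers_) w.join();
        }
        void post(Message msg) {                        // producers enqueue here
            {
                std::lock_guard<std::mutex> lk(m_);
                queue_.push(std::move(msg));
            }
            cv_.notify_one();
        }

    private:
        void run() {                                    // each worker blocks on the shared queue
            for (;;) {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stopping_ || !queue_.empty(); });
                if (stopping_ && queue_.empty()) return;
                Message msg = std::move(queue_.front());
                queue_.pop();
                lk.unlock();
                msg();                                  // process outside the lock
            }
        }

        std::mutex m_;
        std::condition_variable cv_;
        std::queue<Message> queue_;
        std::vector<std::thread> workers_;
        bool stopping_ = false;
    };

Usage would look like MessagePool pool(4); pool.post([]{ /* handle one message */ });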
As to your second question, whether to use the standard Windows message dispatch mechanism: this mechanism comes with significant overhead and is really only intended for pumping messages through a Windows application's UI loop. Unless this is the problem you are trying to solve, I would advise against using it as a general message dispatching solution. Furthermore, Windows messages carry very little data - it is not an object-based model. Each Windows message has a code and a 32-bit parameter. This may not be enough to base a clean messaging model on. Finally, the Windows message queue is not designed to handle cases like queue saturation, thread starvation, or message re-queuing; these are cases that often arise when implementing a decent message queuing solution.
We can't tell you much for sure without knowing the workload (i.e., the statistical distribution of events over time), but in general:
a single queue with multiple servers is at least as fast, and usually faster, so options 1 and 3 would be preferable to 2.
multiple threads in most languages add complexity because of the need to avoid contention and multiple-writer problems
long-duration processes can block the processing of other things that could get done more quickly.
So my horseback guess is that having a single event queue, with several server threads taking events off the queue, might be a little faster.
Make sure you use a thread-safe data structure for the queue.
It all depends.
For example:
Events in a GUI queue are best handled by a single thread, as there is an implied order to the events, so they need to be processed serially. This is why most GUI apps have a single thread to handle events, though potentially multiple threads to create them (and it does not preclude the event thread from creating a job and handing it off to a worker pool (see below)).
Events on a socket can potentially be handled in parallel (assuming HTTP), as each request is stateless and can thus be handled independently (OK, I know that is oversimplifying HTTP).
Work jobs, where each job is independent and placed on a queue. This is the classic case for using a set of worker threads. Each thread performs a potentially long operation independently of the other threads, and on completion comes back to the queue for another job.
In general, don't worry about the overhead of threads. It's not going to be an issue if you're talking about merely a handful of them. Race conditions, deadlocks, and contention are a bigger concern, and if you don't know what I'm talking about, you have a lot of reading to do before you tackle this.
I'd go with option 3, using whatever abstractions my language of choice offers.
Note that there are two different performance goals, and you haven't stated which you are targeting: throughput and responsiveness.
If you're writing a GUI app, the UI needs to be responsive. You don't care how many clicks per second you can process, but you do care about showing some response within a tenth of a second or so (ideally less). This is one of the reasons it's best to have a single thread devoted to handling the GUI (other reasons have been mentioned in other answers). The GUI thread needs to basically convert Windows messages into work items and let your worker queue handle the heavy work. Once the worker is done, it notifies the GUI thread, which then updates the display to reflect any changes. It does things like painting a window, but not rendering the data to be displayed. This gives the app the quick "snappiness" that most users want when they talk about performance. They don't care if it takes 15 seconds to do something hard, as long as when they click on a button or a menu, it reacts instantly.
The other performance characteristic is throughput. This is the number of jobs you can process in a specific amount of time. Usually this type of performance tuning is only needed in server-type applications or other heavy-duty processing. It measures how many web pages can be served up in an hour, or how long it takes to render a DVD. For these sorts of jobs, you want to have 1 active thread per CPU. Fewer than that, and you're going to be wasting idle clock cycles. More than that, and the threads will be competing for CPU time and tripping over each other. Take a look at the second graph in this DDJ article for the trade-off you're dealing with. Note that the ideal thread count is higher than the number of available CPUs due to things like blocking and locking. The key is the number of active threads.
A good place to start is to ask yourself why you need multiple threads.
The well-thought-out answer to this question will lead you to the best answer to the subsequent question, "how should I use multiple threads in my application?"
And that must be a subsequent question, not a primary question. The first question must be why, not how.
I think it depends on how long each thread will be running. Does each message take the same amount of time to process, or will certain messages take a few seconds, for example? If I knew that message A was going to take 10 seconds to complete, I would definitely use a new thread, because why would I want to hold up the queue for a long-running task...
My 2 cents.
I think option 2 is the best. Having each thread do independent tasks would give you the best results. The third approach can cause more delays if multiple threads are doing I/O operations like disk reads, reading common sockets, and so on.
Whether to use the Windows messaging framework for processing requests depends on the workload each thread would have. I think Windows restricts the number of messages that can be queued to at most 10,000. For most cases this should not be an issue, but if you have lots of messages to be queued, this might be something to take into consideration.
A separate queue gives better control in the sense that you may reorder it the way you want (perhaps depending on priority).
Yes, there will be performance differences between your choices.
(1) introduces a bottle-neck for message processing
(3) introduces locking contention because you'll need to synchronize access to your shared queue.
(2) is starting to go in the right direction... though a queue for each message type is a little extreme. I'd probably recommend starting with a queue for each model in your app and adding queues where it makes sense to do so for improved performance.
If you like option #2, it sounds like you would be interested in implementing a SEDA architecture. It is going to take some reading to understand what is going on, but I think the architecture fits well with your line of thinking.
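A rough sketch of the SEDA idea applied to option #2: each stage owns a queue and a thread, and hands its output to the next stage's queue. It reuses the same mutex/condition-variable queue idea as the pool sketch above; Stage, parse, and render are made-up names for illustration:

    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // One stage = one queue + one thread; stages are chained by handing
    // the output of one stage to the input queue of the next.
    template <typename T>
    class Stage {
    public:
        explicit Stage(std::function<void(T)> handler)
            : handler_(std::move(handler)), worker_([this] { run(); }) {}
        ~Stage() {
            { std::lock_guard<std::mutex> lk(m_); stopping_ = true; }
            cv_.notify_one();
            worker_.join();                 // drains the queue before exiting
        }
        void enqueue(T item) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
            cv_.notify_one();
        }

    private:
        void run() {
            for (;;) {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stopping_ || !q_.empty(); });
                if (stopping_ && q_.empty()) return;
                T item = std::move(q_.front());
                q_.pop();
                lk.unlock();
                handler_(std::move(item));
            }
        }

        std::function<void(T)> handler_;
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<T> q_;
        bool stopping_ = false;
        std::thread worker_;
    };

    int main() {
        // Second stage: "rendering" a parsed message.
        Stage<std::string> render([](std::string s) { std::cout << "render: " << s << "\n"; });
        // First stage: "parsing", which forwards its output to the render stage.
        Stage<std::string> parse([&](std::string s) { render.enqueue("parsed(" + s + ")"); });

        parse.enqueue("hello");
        parse.enqueue("world");
    }

Destruction happens in reverse declaration order, so the parse stage drains (forwarding to render) before the render stage drains and prints.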
BTW, Yield is a good C++/Python hybrid implementation.
I'd have a thread pool servicing the message queue, and make the number of threads in the pool easily configurable (perhaps even at runtime). Then test it out with expected load.
That way you can see what the actual correlation is - and if your initial assumptions change, you can easily change your approach.
A more sophisticated approach would be for the system to introspect its own performance traits and adapt its use of resources, threads in particular, as it goes. Probably overkill for most custom application code, but I'm sure there are products out there that do that.
As for the Windows events question - I think that's probably an application-specific question to which there is no right or wrong answer in the general case. That said, I usually implement my own queue, as I can tailor it to the specific characteristics of the task at hand. Sometimes that might involve routing events via the Windows message queue.