More threads, better performance?

More threads, better performance? - c++

When I write a message driven app. much like a standard windows app only that it extensively uses messaging for internal operations, what would be the best approach regarding to threading?
As I see it, there are basically three approaches (if you have any other setup in mind, please share):
Having a single thread process all of the messages.
Having separate threads for separate message types (General, UI, Networking, etc...)
Having multiple threads that share and process a single message queue.
So, would there be any significant performance differences between the three?
Here are some general thoughts:
Obviously, the last two options benefit from a situation where there's more than one processor. Plus, if any thread is waiting for an external event, other threads can still process unrelated messages. But ignoring that, seems that multiple threads only add overhead (Thread switches, not to mention more complicated sync situations).
And another question: Would you recommend to implement such a system upon the standard Windows messaging system, or to implement a separate queue mechanism, and why?

The specific choice of threading model should be driven by the nature of the problem you are trying to solve. There isn't necessarily a single "correct" approach to designing the threading model for such an application. However, if we adopt the following assumptions:
messages arrive frequently
messages are independent and don't rely too heavily on shared resources
it is desirable to respond to an arriving message as quickly as possible
you want the app to scale well across processing architectures (i.e. multicode/multi-cpu systems)
scalability is the key design requirement (e.g. more message at a faster rate)
resilience to thread failure / long operations is desirable
In my experience, the most effective threading architecture would be to employ a thread pool. All messages arrive on a single queue, multiple threads wait on the queue and process messages as they arrive. A thread pool implementation can model all three thread-distribution examples you have.
#1 Single thread processes all messages => thread pool with only one thread
#2 Thread per N message types => thread pool with N threads, each thread peeks at the queue to find appropriate message types
#3 Multiple threads for all messages => thread pool with multiple threads
The benefits of this design is that you can scale the number of threads in the thread in proportion to the processing environment or the message load. The number of threads can even scale at runtime to adapt to the realtime message load being experienced.
There are many good thread pooling libraries available for most platforms, including .NET, C++/STL, Java, etc.
As to your second question, whether to use standard windows message dispatch mechanism. This mechanism comes with significant overhead and is really only intended for pumping messages through an windows application's UI loop. Unless this is the problem you are trying to solve, I would advise against using it as a general message dispatching solution. Furthermore, windows messages carry very little data - it is not an object-based model. Each windows message has a code, and a 32-bit parameter. This may not be enough to base a clean messaging model on. Finally, the windows message queue is not design to handle cases like queue saturation, thread starvation, or message re-queuing; these are cases that often arise in implementing a decent message queing solution.

We can't tell you much for sure without knowing the workload (ie, the statistical distribution of events over time) but in general
single queue with multiple servers is at least as fast, and usually faster, so 1,3 would be preferable to 2.
multiple threads in most languages add complexity because of the need to avoid contention and multiple-writer problems
long duration processes can block processing for other things that could get done quicker.
So horseback guess is that having a single event queue, with several server threads taking events off the queue, might be a little faster.
Make sure you use a thread-safe data structure for the queue.

It all depends.
For example:
Events in a GUI queue are best done by a single thread as there is an implied order in the events thus they need to be done serially. Which is why most GUI apps have a single thread to handle events, though potentially multiple events to create them (and it does not preclude the event thread from creating a job and handling it off to a worker pool (see below)).
Events on a socket can potentially by done in parallel (assuming HTTP) as each request is stateless and can thus by done independently (OK I know that is over simplifying HTTP).
Work Jobs were each job is independent and placed on queue. This is the classic case of using a set of worker threads. Each thread does a potentially long operation independently of the other threads. On completion comes back to the queue for another job.

In general, don't worry about the overhead of threads. It's not going to be an issue if you're talking about merely a handful of them. Race conditions, deadlocks, and contention are a bigger concern, and if you don't know what I'm talking about, you have a lot of reading to do before you tackle this.
I'd go with option 3, using whatever abstractions my language of choice offers.

Note that there are two different performance goals, and you haven't stated which you are targetting: throughput and responsiveness.
If you're writing a GUI app, the UI needs to be responsive. You don't care how many clicks per second you can process, but you do care about showing some response within a 10th of a second or so (ideally less). This is one of the reasons it's best to have a single thread devoted to handling the GUI (other reasons have been mentioned in other answers). The GUI thread needs to basically convert windows messages into work-items and let your worker queue handle the heavy work. Once the worker is done, it notifies the GUI thread, which then updates the display to reflect any changes. It does things like painting a window, but not rendering the data to be displayed. This gives the app a quick "snapiness" that is what most users want when they talk about performance. They don't care if it takes 15 seconds to do something hard, as long as when they click on a button or a menu, it reacts instantly.
The other performance characteristic is throughput. This is the number of jobs you can process in a specific amount of time. Usually this type of performance tuning is only needed on server type applications, or other heavy-duty processing. This measures how many webpages can be served up in an hour, or how long it takes to render a DVD. For these sort of jobs, you want to have 1 active thread per CPU. Fewer than that, and you're going to be wasting idle clock cycles. More than that, and the threads will be competing for CPU time and tripping over each other. Take a look at the second graph in this article DDJ articles for the trade-off you're dealing with. Note that the ideal thread count is higher than the number of available CPUs due to things like blocking and locking. The key is the number of active threads.

A good place to start is to ask yourself why you need multiple threads.
The well-thought-out answer to this question will lead you to the best answer to the subsequent question, "how should I use multiple threads in my application?"
And that must be a subsequent question; not a primary question. The fist question must be why, not how.

I think it depends on how long each thread will be running. Does each message take the same amount of time to process? Or will certain messages take a few seconds for example. If I knew that Message A was going to take 10 seconds to complete I would definitely use a new thread because why would I want to hold up the queue for a long running thread...
My 2 cents.

I think option 2 is the best. Having each thread doing independant tasks would give you best results. 3rd approach can cause more delays if multiple threads are doing some I/O operation like disk reads, reading common sockets and so on.
Whether to use Windows messaging framework for processing requests depends on the work load each thread would have. I think windows restricts the no. of messages that can be queued at the most to 10000. For most of the cases this should not be an issue. But if you have lots of messages to be queued this might be some thing to take into consideration.
Seperate queue gives a better control in a sense that you may reorder it the way you want (may be depending on priority)

Yes, there will be performance differences between your choices.
(1) introduces a bottle-neck for message processing
(3) introduces locking contention because you'll need to synchronize access to your shared queue.
(2) is starting to go in the right direction... though a queue for each message type is a little extreme. I'd probably recommend starting with a queue for each model in your app and adding queues where it makes since to do so for improved performance.
If you like option #2, it sounds like you would be interested in implementing a SEDA architecture. It is going to take some reading to understand what is going on, but I think the architecture fits well with your line of thinking.
BTW, Yield is a good C++/Python hybrid implementation.

I'd have a thread pool servicing the message queue, and make the number of threads in the pool easily configurable (perhaps even at runtime). Then test it out with expected load.
That way you can see what the actual correlation is - and if your initial assumptions change, you can easily change your approach.
A more sophisticated approach would be for the system to introspect its own performance traits and adapt it's use of resources, threads in particular, as it goes. Probably overkill for most custom application code, but I'm sure there are products that do that out there.
As for the windows events question - I think that's probably an application specific question that there is no right or wrong answer to in the general case. That said, I usually implement my own queue as I can tailor it to the specific characteristics of the task at hand. Sometimes that might involve routing events via the windows message queue.

Related

Optimal producer/consumer thread pattern in Qt

I have implemented a producer/consumer pattern using Qt threads. Multiple producer threads generate data that are combined by a consumer. Communication is implemented using signals/slots and queued connections. This works fine as long as the consumer is able to consume the data faster than the producer threads produce the data.
It is hard to make my code scale. Particularly it is easy to increase the number of producers but it is very hard to spawn more than one consumer thread.
Now the problem starts when running the software on a CPU/system that has a lot of cores. In that case I use more threads to produce data. It can sometimes happen (depending from the complexity of data generation) that the consumer is not able to handle the produced data in time. Then the Qt event queue fills rapidly with events and the memory consumption grows extremely.
I can solve this by using blocking queued connections. However this does not allow full CPU load since producers tend to wait unnecessarily for the consumer after each data emission.
In a non-Qt software I would use a queue/mailbox/ring-buffer with a fixed size that makes the producers sleep until the consumer frees space in that container. This mechanism limits memory consumption and allows best possible CPU load.
However I could not find an equivalent solution using Qt classes. The event queue is global and has no size property. Is there a Qt way to solve this optimally? If not, are there STL classes I can use to couple (Q)Threads in my way?

I think you should move away from using Qt in this case because while the event handling is quite fast, it has clearly not been designed for HPC workloads targeting many-cores (because of the centralized sequential event queue). So I think you should use a fast atomic muti-producer/multi-consumer (MPMC) queue. While you could probably write a Qt event layer on top of that, I am not sure this is a good idea performance-wise. An alternative solution is to use variable-sized chunks to reduce the amount of events (with a feedback loop between the producers and the consumers). Note that regarding your workload, it might be good to consider to use a task-based runtime (known to scale well).
If you are searching for a fast MPMC queue, there is the one of provided by Boost (boost::lockfree::queue) which is not very fast, but this is often enough. One of the best I am aware of is this one. It is based on a research paper and used in big games. This one is slightly faster on my machine on specific cases and more flexible but you should be very careful when using it as consistency is not always ensured (ie. read the docs). Note that the threading library should not matter in the choice of the queue.

Benefits of a multi thread program in a unicore system [duplicate]

This question already has answers here:
How can multithreading speed up an application (when threads can't run concurrently)?
(9 answers)
Closed 9 years ago.
My professor causally mentioned that we should program multi-thread programs even if we are using a unicore processor however because of the lack of time , he did not elaborate on it .
I would like to know what are the benefits of a multi-thread program in a unicore processor ??

It won't be as significant as a multi-core system but it can still provide some benefits.
Mainly all the benefits that you are going to get will be regarding to the context switch that will happen after a input miss to the already executing thread. Executing thread may be waiting for anything such as a hardware resource or a branch mis-prediction or even data transfer after a cache miss.
At this point the waiting thread can be executed to benefit from this "waiting time". But of course context switch will take some time. Also managing threads inside the code rather than sequential computation can create some extra complexity to your program. And as it has been said, some applications needs to be multi-threaded so there is no escape from the context switch in some cases.

Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example - The GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.

Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.

Benefits could be different.
One of the widely used examples is the application with GUI, which supposed to perform some kind of computations. If you will have a single thread - the user will have to wait the result before dealing something else with the application, but if you start it in the separate thread - user interface could be still available for user during the computation process. So, multi-thread program could emulate multi-task environment even on a unicore system. That's one of the points.

As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. "Future" and boost::asio provide ways of doing non-blocking stuff without necessarily resorting to multiple threads.

I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking/long-during call you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code. For example background polling for updates, monitoring some resource (e.g. free storage), checking internet connectivity. If you keep them in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program, the sole communication with the main logic is usually a single 'event'.
In some environments you might get more processing time. This mainly depends on how your OS scheduling system works, but if this allocates time per thread, the more threads you have the more your app will be scheduled.
Some benefits long-term:
Where it's not hard to do you benefit if your hardware evolves. You never know what's going to happen, today your app runs on a single-core embedded device, tomorrow that embedded device gets a quad core. Programming threaded from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash all related operations end up in the same thread. The advantage for single cores is 'small' but it's not hard to do as you need little synchronization primitives so the overhead stays small.
That said, I think there are situations where it's very ill advise:
As soon as your required synchronization mechanism with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). It might still be then that multi-threading gives you a benefit when effectively moving to multiple CPUs, but the overhead is huge both for your single core and your programming time.

For instance think about operations that block because of slow peripheral devices (harddisk access etc.). While these are waiting, even the single core can do other things asyncronously.

In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for completion of IO requests (user input, network/disk IO), critical resources to be available, or any sort of asynchroneously triggered events, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchroneous IO, coroutines, or fibers come into mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.

As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The average number of cores will double every 18 months in the future. People have learned single-threaded programming for 50 years now, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming. This refers to low-level aspects like avoiding race conditions and proper synchronization, as well as the high-level aspects like the general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.

When to use the disruptor pattern and when local storage with work stealing?

Is the following correct?
The disruptor pattern has better parallel performance and scalability if each entry has to be processed in multiple ways (io operations or annotations), since that can be parallelized using multiple consumers without contention.
Contrarily, work stealing (i.e. storing entries locally and stealing entries from other threads) has better parallel performance and scalability if each entry has to be processed in a single way only, since disjointly distributing the entries onto multiple threads in the disruptor pattern causes contention.
(And is the disruptor pattern still so much faster than other lockless multi-producer multi-consumer queues (e.g. from boost) when multiple producers (i.e. CAS operations) are involved?)
My situation in detail:
Processing an entry can produce several new entries, which must be processed eventually, too. Performance has highest priority, entries being processed in FIFO order has second priority.
In the current implementation, each thread uses a local FIFO, where it adds its new entries. Idle threads steal work from other thread's local FIFO. Dependencies between the thread's processing are resolved using a lockless, mechanically sympathetic hash table (CASs on write, with bucket granularity). This results in pretty low contention but FIFO order is sometimes broken.
Using the disruptor pattern would guarantee FIFO order. But wouldn't distributing the entries onto the threads cause much higher contention (e.g. CAS on a read cursor) than for local FIFOs with work stealing (each thread's throughput is about the same)?
References I've found
The performance tests in the standard technical paper on the disruptor (Chapter 5 + 6) do not cover disjoint work distribution.
https://groups.google.com/forum/?fromgroups=#!topic/lmax-disruptor/tt3wQthBYd0 is the only reference I've found on disruptor + work stealing. It states that a queue per thread is dramatically slower if there is any shared state, but does not go into detail or explain why. I doubt that this sentence applies to my situation with:
shared state being resolved with a lockless hash table;
having to disjointly distribute entries amongst consumers;
except for work stealing, each thread reads and writes only in its local queue.

Update - Bottom line up front for max performance: You need to write both in the idiomatic syntax for disruptor and work stealing, and then benchmark.
To me, I think the distinction is primarily in the split between message vs task focus, and therefore in the way you want to think of the problem. Try to solve your problem, and if it is task-focused then Disruptor is a good fit. If the problem is message focused, then you might be more suited to another technique such as work stealing.
Use work stealing when your implementation is message focused. Each thread can pick up a message and run it through to completion. For an example HTTP server - Each inbound http request is allocated a thread. That thread is focused on handling the request start to finish - logging the request, checking security controls, doing vhost lookup, fetching file, sending response, and closing connection
Use disruptor when your implementation is task focused. Each thread can work on a particular stage of the processing. Alternative example: for a task focus, the processing would be split into stages, so you would have a thread that does logging, a thread for security controls, a thread for vhost lookup, etc; each thread focused on its task and passes the request to the next thread in the pipeline. Stages may be parallelised but the overall structure is a thread focused on a specific task and hands the message along between threads.
Of course, you can change your implementation to suit each approach better.
In your case, I would structure the problem differently if you wanted to use Disruptor. Generally you would eliminate shared state by having a single thread own the state and pass all tasks through that thread of work - look up SEDA for lots of diagrams like this. This can have lots of benefits, but again, is really down to your implementation.
Some more verbosity:
Disruptor - very useful when strict ordering of stages is required, additional benefits when all tasks are of a consistent length eg: no blocking on external system, and very similar amount of processing per task. In this scenario, you can assume that all threads will work evenly through the system, and therefore arrange N threads to process every N messages. I like to think of Disruptor as an efficient way to implement SEDA-like systems where threads process stages. You can of course have an application with a single stage and multiple parallel units performing same work at each stage however this is not really the point in my view. This would totally avoid the cost of shared state.
Work stealing - use this when tasks are of a varied duration and the order of message processing is not important, as this allows threads that are free and have already consumed their messages to continue progress from another task queue. This way if for example you have 10 threads and 1 is blocked on IO, the remainder will still complete their processing.

Thread per connection vs Reactor pattern (with a thread pool)?

I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, since I am at it, I would like to do it properly, as opposed to just getting-it-done.
If I understood correctly: Apache uses a Thread-per-connection architecture, while nginx uses an event-loop and then dedicates a worker [x] for the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure if it could be applied to my situation. The analogy also seems to be very idealist. I have rarely seen my computer run at 100% CPU (even with a umptillion Chrome tabs open, Photoshop and what-not running simultaneously)
Also, I have come across a SO post (somehow it vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. This question was related to JVM, though.
So, let's estimate a fictional user-base of around 5,000 users. Which approach should would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread-pool (approximately, how big do you suggest the thread pool should be?
Creating a thread per connection and then destroying the thread the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.

Reactive Applications certainly scale better, when they are written correctly. This means
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of you server, you typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexs since these can block, so no shared mutable state. If you require shared state you will have to wrap it with an actor or similar so only one thread has access to the state.
All work in the reactive threads should be cpu bound
All IO has to be asynchronous or be performed in a different thread pool and the results feed back into the reactor.
This means using either futures or callbacks to process replies, this style of code can quickly become unmaintainable if you are not used to it and disciplined.
All work in the reactive threads should be small
To maintain responsiveness of the server all tasks in the reactor must be small (bounded by time)
On an 8 core machine you cannot cannot allow 8 long tasks arrive at the same time because no other work will start until they are complete
If a tasks could take a long time it must be broken up (cooperative multitasking)
Tasks in reactive applications are scheduled by the application not the operating system, that is why they can be faster and use less memory. When you write a Reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
I am a big fan of reactive architectures but they come with costs. I am not sure I would write my first c++ application as reactive, I normally try to learn one thing at a time.
If you decide to use a reactive architecture use a good framework that will help you design and structure your code or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? can it only come in from an external event (eg network request)
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results?
I cannot recommend a C++ library for this, I now do my server development in Scala and Akka which provide all of this with an excellent composable futures library to keep the code clean.
Best of luck learning C++ and with which ever choice you make.

Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost Thread plus Boost ASIO. You could also try a C++11 std thread library, and std mutex (but Boost ASIO is better than mutexes in a lot of cases, just always callback to the same thread and you don't need protected regions). Stay away from std future, cause it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core. 8 cores -> 8 threads. Plus maybe a few extra, if you think it's possible that your threadpool threads might call blocking operations sometimes.

FWIW, Poco supports option 2 (ParallelReactor) since version 1.5.1

I think that option 2 is the best one. As for tuning of the pool size, I think the pool should be adaptive. It should be able to spawn more threads (with some high hard limit) and remove excessive threads in times of low activity.

as the analogy you linked to (and it's comments) suggest. this is somewhat application dependent. now what you are building here is a game server. let's analyze that.
game servers (generally) do a lot of I/O and relatively few calculations, so they are far from 100% CPU applications.
on the other hand they also usually change values in some database (a "game world" model). all players create reads and writes to this database. which is exactly the intersection problem in the analogy.
so while you may gain some from handling the I/O in separate threads, you will also lose from having separate threads accessing the same database and waiting for its locks.
so either option 1 or 2 are acceptable in your situation. for scalability reasons I would not recommend option 3.

How to design multithreaded application

I have a multithreaded application. Each module is executed in a separate thread.
Modules are:
- network module - used to receive/send data from network
- parser module - encode/decode network data to internal presentation
- 2 application module - perform some application logic on the above data one after other
- counter module - used to gather statistics from other modules
- timer module - used to schedule timers
- and much more ...
All threads using message queues for inter thread communication (std::deque sync by conditional variable and mutex).
Some modules are used by others ones (e.g. all modules use timer and counter) and this for each message received from network wich should be handled in very high rates.
This is pretty complex application and the design looks "reasonable". From other hand, I'm not sure that such design, thread per module, is the "best" one? In particular, I'm afraid that such design "encorage" a lot of context switches.
What do you think?
Is there're any good guidelines or open source project to learn from how to do "correct" design of threaded application?

Thread-per-function designs are just naive: they assume that by separating tasks - by module - onto threads, that some kind of scalability will be achieved.
This kind of design is inefficient, as very few task breakdowns yield exactly as many tasks as there are CPUs.
Far more rational designs are to break tasks down into 'jobs' - and then use thread pooling mechanisms to dispatch those jobs.
Advantages over the thread-per-module approach:
Thread pools take advantage of all cores. with thread-per-module if you have modules < cores you have cores sitting idle.
Thread pools minimize contention and resources by maintaining a parity between active threads, and cores. with thread-per-module, if modules > cores you incur needless extra context switches and (on some platforms) each thread exhausts other limited per process resources (like virtual memory).
Thread pools let a "module" do multiple jobs at a time. thread-per-module means that the busiest module still only gets one core.

I wouldn't call myself an expert an multi-threaded design. But I've at least worked with threads enough to have run into various issues trying to design them to work together (communication, locking resources, waiting for threads to end, etc).
At this point, my general rule of thumb is that I must justify the existence of each new thread. For example, if the network layer I'm using provides both a synchronous and an asynchronous API, can I really justify making the network code use synchronous calls in a new thread instead of just using the asynchronous calls in the main thread? In your case, how many modules actually need a thread of their own for a specific reason. Are there any that could instead just be called in turn from the main thread?
If some threads have no good reason for existing, then you might be able to save yourself some trouble and complexity by just putting that module in the main thread.
Now of course, there are good justifiable reasons for putting things in threads. Such as making synchronous calls that may block for a long time, keeping a GUI thread responsive while performing a long task, or being able to take advantage of parallel processing of a large task on a multi-core system.
I don't know of any particular "correct" way to do it. A lot of it really comes down to the details of what your application is actually supposed to do.

A good guideline is to put operations that might block (such as I/O) in its own thread. Your network module is a definite candidate here. Have your network thread use select (I assume UNIX here) to block on input.
Asynchronous events are good in separate threads as well. Your timer module looks like a good candidate here.
You might want to put your other modules in one thread to decrease complexity of your application. BUT, you might want to split them up if you have a multi-processor system.
Have a good strategy for locking resources and mutex handling to prevent deadlocks. A dependency graph (using a whiteboard!) might help here to get your design correct.
Good luck! Sounds like a complex system which will cause many hours of fun development!

For what platform?
For instance a Win32 applications the best model for back-end servers (like yours seems to be) is the thread pool and IO Completion Port. This is not just some hear say and opinion, there are strong facts behind this claim. Rick Vicik of the Windows Performance team has posted a series of articles describing in greater detail why high end servers need to follow this model, see High Performance Windows Programs.
There are other factors that come into play, like for instance the typo of protocol your network module has to handle. Request-Response protocols are often handled by one-thread-per-request metaphor and they do well enough, but high-throughput high-scale protocols don't fare well in that model, specifically because of boxcaring requirements.
Ultimately, whether your design is sound or not is hard to tell just from this brief description. Personally I tend o favor an IO completion driven threading model, as opposed to logical-module driven one, but that's just me.

Just to add to the other answers, lets reason every single thread in your dessign:
network module
Accepted.
parser module + 2 application module
Are you sure that these 3 threads can't be merged into one, main data processing thread? If that were the case, you could then benefit of a thread pool like others sugested, having this processing performed by N threads.
timer module
This one probably is reasonable in most platforms, as you will need a message processing loop to dispatch timer events. Also, if you ever need a GUI that could be the place.
counter module
This is the one that most annoys me. I can't find the reason for having a separate thread for this. Depending on how much you increment it, it will be a nice bottleneck for the application.
I'll suggest keeping separate counters in each thread and poll(message queue) for them when you need it.
and much more ...
Hope not!

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js