I'm parsing a financial market data feed that has lots of data.
I will receive a string and extract a value, which I will store in a globally declared variable.
The program works like this: When data (string) arrives, a thread is invoked. This thread checks a value passed alongside the string to understand what kind of string it is.
Conditioned on it being the kind of string for my interest, my question is this:
Is it better to pass the string into a queue, for parsing, or to simply parse directly within the invoked thread.
Conceptually, I am worried that if I ask the invoked thread to do work then it may not be available for subsequent market data events, which occur at high frequency, and I will lose data.
Were I to place the string in a queue, I would of course need another thread that popped items off the queue and parsed them.
I have a very fast PC and speed is my interest here. Does the board have experience and know what is the best approach here?
The real question is, do you care more about the latency of your system, or about the throughput.
If you optimize for latency, meaning that you want to respond as soon as the event occured (which is usually the case in HFT), you'll probably try to avoid passing variables to another thread, as this will generate unnecesary slowdown and will increase your latency. In particular, while optimizing for latency, you want your CPU cache not to be invalidated, which often may be the case if you use multithreading (well, at least more often, than if it's single threaded). Besides, checking what specific string contains is really fast in comparison to sending things over a network, meaning that if that's the only operation you shouldn't be worried about too many messages coming in and waiting to be processed. Even if it happens, the message will gently wait in kernels queue for a microsecond, while you're processing the previous one. To put it in perspective, I'd assume that if you're getting less than 300 thousand messages per second (and I'd bet you do) and the only thing you're doing is checking if the string contains a pattern, you should not go for threads.
On the other hand, if you care more about throughput - meaning that you really have a lot messages (think hunderds of thousands per second), or you're doing more heavy computations, than just searching for a pattern in a string, then probably it's better to use queue and process events in another thread.
Related
I'm trying to get my head around multithreading in C++, to come up with a general purpose implementation that suits me. Everyone has a different implementation, Awesome CPP lists 39 libraries. It seems to me though that this is a logistical problem that is of the same ilk as any logistical scheduling problem in any field.
In my head, there are two obvious ways to repeatedly perform the job abc:
Split abc into 3 separate tasks: a, b & c. Spawn x threads. Have a queue. Jobs coming in get added to the queue. Each thread grabs the next task from the queue, and at the end of the task puts it back into the queue for the next task. They can either access the queue directly, or they can all communicate with a central 'manager' or 'scheduler' thread that serves them with their tasks.
Perform abc sequentially on x separate threads independently (parallelism.)
(1) has the problem that there is potentially a lot of overhead in keeping a queue and dealing with race conditions on it. (1) is otherwise intuitive and makes sense to me. It's what I would do in real life with a real life problem. It's literally how companies work in the real world.
(2) has the problem that any blocking causes the whole thread to block, idling the CPU thread. And (2) is far less flexible and applicable in less use-cases. On the plus side is has no overhead between tasks.
Question 1: Doesn't (1) also have the same blocking problem? If a thread reads from a file, it'll have to wait for the disk. How is that usually addressed, is there some way to yield back temporarily while its doing something like reading or writing from disk, or is this usually addressed simply by having more threads running than there are CPU threads and hoping not too many block at once?
It seems to me that (1) is clearly the better solution, except that it restricts the tasks only to medium to large scaled tasks. It would be pointless to use it to do something like parallelizing straightforward math (just an example) because handling the queue would take longer than the actual processing of the task. Hence the value of (1) for any given task is inversely proportionate to the difference between the overhead of the storage mechanism (the queue) and the size of the task. This sounds fine on the surface, until you realize that the efficiency of splitting into tasks is itself proportionate to the size of the task. To put it simply: you want each task to be small for overall efficiency in theory, but in practice you want each task to be larger so as to minimize the overhead of the queue.
Its obvious that some storage mechanism is required because you can't keep track of something without a recording mechanism, it doesn't have to be strictly a queue, but any form of recording the task in memory while it waits to be picked up. The optimization of the queue (I'm using the word loosely, not strictly a queue type) is then the #1 important factor here. The cheaper a task can receive its payload, the better.
Which leads me to Question 2: is this what C++20 coroutines are useful for? I've spent hours reading tutorials on coroutines, but it's still unclear what useful they're for. I think I get what they do. If I have it right they allow a special type of function (coroutine) to pause itself in the middle, yield its processing back to the caller along with a payload, and the caller can later resume it. But why would I want to do that? And can't I do that just by splitting the function into two?
Question 3: Are coroutines meant to be used by a task scheduler thread to somehow optimize the queuing? Or is the point just to allow you to write code linearly and then put those yields in it to break it up? In which case it wouldn't be useful for me if I already had my jobs split up into separate tasks by design?
Question 4: Am I trying to reinvent the wheel here? Has this problem already been solved? And if so, why are there so many different implementations?
Q1: No, it more likely has a different blocking problem.
Q2: Co-routines have many applications; try substituting for X in "is this what X is for?" X = { while, if, return, pointer, ... }. Don't look to standards bodies (particularly that one) for insight; they are best at punctuation and spell checking.
Q3: Co-routines can be used to optimise various constructions, but the real goal of using such a formalism is to make your program as natural an expression of the problem as possible. One of the better examples of how Co-routines can be intelligently used are the Go-routines of Go.
Q4: Probably; almost definitely; because many of the solutions are inadequate.
Q1+Q4. There is no single blocking problem, some that come to mind are: Deadlock, Livelock, unnecessarily sequential, non-Scalable, Slow. Some structures {{ threads, coroutines, threads + coroutines } * { locks, conditions, message passing }} help solve some of these problems, but induce others. My favourite is { (threads + coroutines) * (message passing) }, which is typically good for everything but Slow.
I'm writing a multithreaded application that have 2 threads.
One of the threads receives data from a queue and aggregates it and the other one send the aggregated data to a server.
I want to be able to know the last time that a data was received so I use:
time_t last_data = time(NULL)
to get the correct time on each event (I dont need it to be super accurate but I need it to be fast) and then the other send this value with the aggregated data.
My questions are:
Do I have to synchronize this even if this is not very important that I get the most recent update?
I tested it with std::atomic<time_t> and it seems to have some performance issues, is there any other faster way?
What would be the worst case that can happen if I will not synchronize the read/write?
Is there a faster way to get the current time then time(NULL) (don't have to be super accurate)?
UPDATE
Here is an explanation of my application workflow.
Application needs:
1. Consume data from external sources using IPC (currently nanomsg).
2. Aggregate the data to bulks.
3. Send the aggregated data to remote server every given interval (1 second).
Current implementation:
Create 2 buffers to hold the aggregated data (one for receiving and one for sending).
Create a consumer thread to consume data from IPC and fill the receiving buffer.
Create a sending thread that will send the data to the server.
Every iteration of the interval the sending thread will swap the buffers (swap pointers and locking using mutex) and send the data to the server.
I don't want that the consumer will wait on network IO so I have created this flow.
Can I use event driven here instead of this complex mechanism without all the locking (currently it is working fine but i'm sure it can be better)?
Don't do it that way. You only need one thread. You can use select/poll/epoll. These can wait on your inputs, and at the same time for you output to finish. You will be doing event driven programming, and non-blocking output. It is something worth learning. It is a bit harder at first, but soon makes life easier i.e. not having the problem that you have now. Also the program will be faster.
Supposing one thread executes:
last_data = time(NULL);
And the other uses last_data but there is no synchronization event between the two then there are no guarantees when or if the revised value of last_data will become visible to the reading thread.
However the most serious possibility is that write of time_t (maybe long) isn't atomic and another thread could read a corrupt 'part written' value.
That could cause glitches in delay and time calculations that might foul downstream process.
You might analyse your program and find that because the two interact there is a sufficient memory fence at some point that guarantees eventual update.
NB: This is an odd situation where I'm suspecting you think something isn't synchronized but it is! The usual experience is the other way around...
Basically there's not really enough information to understand what problem you're having.
For example, if the reader thread is the only process to read the time I'd expect to see code like:
Thread 1:
If data received, lock L, update time, add to queue, unlock L.
Thread 2:
If items in queue L, read queue and update time, unlock L .. process item.
In which case the time will be synchronized already.
Please provide a minimum, complete, verifiable example...
I ran recently into a requirement in which there is a need for multithreaded application whose threads run at different rates.
The questions then become, since i am still learning multithreading:
A scenario is given to put things into perspective:
Say 1st thread runs at 100 Hz "real time"
2nd runs at 10 Hz
and say that the 1st thread provides data "myData" to the 2nd thread.
How is myData going to be provided to the 2nd thread, is the common practice to just read whatever is available from the first thread, or there need to be some kind of decimation to reduce the rate.
Does the myData need to be some kind of Singleton with locking mechanism. Although myData isn't shared, but rather updated by the first thread and used in the second thread.
How about the opposite case, when the data used in one thread need to be used at higher rate in a different thread.
How is myData going to be provided to the 2nd thread
One common method is to provide a FIFO queue -- this could be a std::dequeue or a linked list, or whatever -- and have the producer thread push data items onto one end of the queue while the consumer thread pops the data items off of the other end of the queue. Be sure to serialize all accesses to the FIFO queue (using a mutex or similar locking mechanism), to avoid race conditions.
Alternatively, instead of a queue you could have a single shared data object (essentially a queue of length one) and have your producer thread overwrite the object every time it generates new data. This could be done in cases where it's not important that the consumer thread sees every piece of data that was generated, but rather it's only important that it sees the most recent data. You'd still need to do the locking, though, to avoid the risk of the consumer thread reading from the data object at the same time the producer thread is in the middle of writing to it.
or does there need to be some kind of decimation to reduce the rate.
There doesn't need to be any decimation -- the second thread can just read in as much data as there is available to read, whenever it wakes up.
Does the myData need to be some kind of Singleton with locking
mechanism.
Singleton isn't necessary (although it's possible to do it that way). The locking mechanism is necessary, unless you have some kind of lock-free synchronization mechanism (and if you're asking this level of question, you don't have one and you don't want to try to get one either -- keep things simple for now!)
How about the opposite case, when the data used in one thread need to
be used at higher rate in a different thread.
It's the same -- if you're using a proper inter-thread communications mechanism, the rates at which the threads wake up doesn't matter, because the communications mechanism will do the right thing regardless of when or how often the the threads wake up.
Any multithreaded program has to cope with the possibility that one of the threads will work faster than another - by any ratio - even if they're executing on the same CPU with the same clock frequency.
Your choices include:
producer-consumer container than lets the first thread enqueue data, and the second thread "pop" it off for processing: you could let the queue grow as large as memory allows, or put some limit on the size after which either data would be lost or the 1st thread would be forced to slow down and wait to enqueue further values
there are libraries available (e.g. boost), or if you want to implement it yourself google some tutorials/docs on mutex and condition variables
do something conceptually similar to the above but where the size limit is 1 so there's just the single myData variable rather than a "container" - but all the synchronisation and delay choices remain the same
The Singleton pattern is orthogonal to your needs here: the two threads do need to know where the data is, but that would normally be done using e.g. a pointer argument to the function(s) run in the threads. Singleton's easily overused and best avoided unless reasons stack up high....
Is the following correct?
The disruptor pattern has better parallel performance and scalability if each entry has to be processed in multiple ways (io operations or annotations), since that can be parallelized using multiple consumers without contention.
Contrarily, work stealing (i.e. storing entries locally and stealing entries from other threads) has better parallel performance and scalability if each entry has to be processed in a single way only, since disjointly distributing the entries onto multiple threads in the disruptor pattern causes contention.
(And is the disruptor pattern still so much faster than other lockless multi-producer multi-consumer queues (e.g. from boost) when multiple producers (i.e. CAS operations) are involved?)
My situation in detail:
Processing an entry can produce several new entries, which must be processed eventually, too. Performance has highest priority, entries being processed in FIFO order has second priority.
In the current implementation, each thread uses a local FIFO, where it adds its new entries. Idle threads steal work from other thread's local FIFO. Dependencies between the thread's processing are resolved using a lockless, mechanically sympathetic hash table (CASs on write, with bucket granularity). This results in pretty low contention but FIFO order is sometimes broken.
Using the disruptor pattern would guarantee FIFO order. But wouldn't distributing the entries onto the threads cause much higher contention (e.g. CAS on a read cursor) than for local FIFOs with work stealing (each thread's throughput is about the same)?
References I've found
The performance tests in the standard technical paper on the disruptor (Chapter 5 + 6) do not cover disjoint work distribution.
https://groups.google.com/forum/?fromgroups=#!topic/lmax-disruptor/tt3wQthBYd0 is the only reference I've found on disruptor + work stealing. It states that a queue per thread is dramatically slower if there is any shared state, but does not go into detail or explain why. I doubt that this sentence applies to my situation with:
shared state being resolved with a lockless hash table;
having to disjointly distribute entries amongst consumers;
except for work stealing, each thread reads and writes only in its local queue.
Update - Bottom line up front for max performance: You need to write both in the idiomatic syntax for disruptor and work stealing, and then benchmark.
To me, I think the distinction is primarily in the split between message vs task focus, and therefore in the way you want to think of the problem. Try to solve your problem, and if it is task-focused then Disruptor is a good fit. If the problem is message focused, then you might be more suited to another technique such as work stealing.
Use work stealing when your implementation is message focused. Each thread can pick up a message and run it through to completion. For an example HTTP server - Each inbound http request is allocated a thread. That thread is focused on handling the request start to finish - logging the request, checking security controls, doing vhost lookup, fetching file, sending response, and closing connection
Use disruptor when your implementation is task focused. Each thread can work on a particular stage of the processing. Alternative example: for a task focus, the processing would be split into stages, so you would have a thread that does logging, a thread for security controls, a thread for vhost lookup, etc; each thread focused on its task and passes the request to the next thread in the pipeline. Stages may be parallelised but the overall structure is a thread focused on a specific task and hands the message along between threads.
Of course, you can change your implementation to suit each approach better.
In your case, I would structure the problem differently if you wanted to use Disruptor. Generally you would eliminate shared state by having a single thread own the state and pass all tasks through that thread of work - look up SEDA for lots of diagrams like this. This can have lots of benefits, but again, is really down to your implementation.
Some more verbosity:
Disruptor - very useful when strict ordering of stages is required, additional benefits when all tasks are of a consistent length eg: no blocking on external system, and very similar amount of processing per task. In this scenario, you can assume that all threads will work evenly through the system, and therefore arrange N threads to process every N messages. I like to think of Disruptor as an efficient way to implement SEDA-like systems where threads process stages. You can of course have an application with a single stage and multiple parallel units performing same work at each stage however this is not really the point in my view. This would totally avoid the cost of shared state.
Work stealing - use this when tasks are of a varied duration and the order of message processing is not important, as this allows threads that are free and have already consumed their messages to continue progress from another task queue. This way if for example you have 10 threads and 1 is blocked on IO, the remainder will still complete their processing.
When I write a message driven app. much like a standard windows app only that it extensively uses messaging for internal operations, what would be the best approach regarding to threading?
As I see it, there are basically three approaches (if you have any other setup in mind, please share):
Having a single thread process all of the messages.
Having separate threads for separate message types (General, UI, Networking, etc...)
Having multiple threads that share and process a single message queue.
So, would there be any significant performance differences between the three?
Here are some general thoughts:
Obviously, the last two options benefit from a situation where there's more than one processor. Plus, if any thread is waiting for an external event, other threads can still process unrelated messages. But ignoring that, seems that multiple threads only add overhead (Thread switches, not to mention more complicated sync situations).
And another question: Would you recommend to implement such a system upon the standard Windows messaging system, or to implement a separate queue mechanism, and why?
The specific choice of threading model should be driven by the nature of the problem you are trying to solve. There isn't necessarily a single "correct" approach to designing the threading model for such an application. However, if we adopt the following assumptions:
messages arrive frequently
messages are independent and don't rely too heavily on shared resources
it is desirable to respond to an arriving message as quickly as possible
you want the app to scale well across processing architectures (i.e. multicode/multi-cpu systems)
scalability is the key design requirement (e.g. more message at a faster rate)
resilience to thread failure / long operations is desirable
In my experience, the most effective threading architecture would be to employ a thread pool. All messages arrive on a single queue, multiple threads wait on the queue and process messages as they arrive. A thread pool implementation can model all three thread-distribution examples you have.
#1 Single thread processes all messages => thread pool with only one thread
#2 Thread per N message types => thread pool with N threads, each thread peeks at the queue to find appropriate message types
#3 Multiple threads for all messages => thread pool with multiple threads
The benefits of this design is that you can scale the number of threads in the thread in proportion to the processing environment or the message load. The number of threads can even scale at runtime to adapt to the realtime message load being experienced.
There are many good thread pooling libraries available for most platforms, including .NET, C++/STL, Java, etc.
As to your second question, whether to use standard windows message dispatch mechanism. This mechanism comes with significant overhead and is really only intended for pumping messages through an windows application's UI loop. Unless this is the problem you are trying to solve, I would advise against using it as a general message dispatching solution. Furthermore, windows messages carry very little data - it is not an object-based model. Each windows message has a code, and a 32-bit parameter. This may not be enough to base a clean messaging model on. Finally, the windows message queue is not design to handle cases like queue saturation, thread starvation, or message re-queuing; these are cases that often arise in implementing a decent message queing solution.
We can't tell you much for sure without knowing the workload (ie, the statistical distribution of events over time) but in general
single queue with multiple servers is at least as fast, and usually faster, so 1,3 would be preferable to 2.
multiple threads in most languages add complexity because of the need to avoid contention and multiple-writer problems
long duration processes can block processing for other things that could get done quicker.
So horseback guess is that having a single event queue, with several server threads taking events off the queue, might be a little faster.
Make sure you use a thread-safe data structure for the queue.
It all depends.
For example:
Events in a GUI queue are best done by a single thread as there is an implied order in the events thus they need to be done serially. Which is why most GUI apps have a single thread to handle events, though potentially multiple events to create them (and it does not preclude the event thread from creating a job and handling it off to a worker pool (see below)).
Events on a socket can potentially by done in parallel (assuming HTTP) as each request is stateless and can thus by done independently (OK I know that is over simplifying HTTP).
Work Jobs were each job is independent and placed on queue. This is the classic case of using a set of worker threads. Each thread does a potentially long operation independently of the other threads. On completion comes back to the queue for another job.
In general, don't worry about the overhead of threads. It's not going to be an issue if you're talking about merely a handful of them. Race conditions, deadlocks, and contention are a bigger concern, and if you don't know what I'm talking about, you have a lot of reading to do before you tackle this.
I'd go with option 3, using whatever abstractions my language of choice offers.
Note that there are two different performance goals, and you haven't stated which you are targetting: throughput and responsiveness.
If you're writing a GUI app, the UI needs to be responsive. You don't care how many clicks per second you can process, but you do care about showing some response within a 10th of a second or so (ideally less). This is one of the reasons it's best to have a single thread devoted to handling the GUI (other reasons have been mentioned in other answers). The GUI thread needs to basically convert windows messages into work-items and let your worker queue handle the heavy work. Once the worker is done, it notifies the GUI thread, which then updates the display to reflect any changes. It does things like painting a window, but not rendering the data to be displayed. This gives the app a quick "snapiness" that is what most users want when they talk about performance. They don't care if it takes 15 seconds to do something hard, as long as when they click on a button or a menu, it reacts instantly.
The other performance characteristic is throughput. This is the number of jobs you can process in a specific amount of time. Usually this type of performance tuning is only needed on server type applications, or other heavy-duty processing. This measures how many webpages can be served up in an hour, or how long it takes to render a DVD. For these sort of jobs, you want to have 1 active thread per CPU. Fewer than that, and you're going to be wasting idle clock cycles. More than that, and the threads will be competing for CPU time and tripping over each other. Take a look at the second graph in this article DDJ articles for the trade-off you're dealing with. Note that the ideal thread count is higher than the number of available CPUs due to things like blocking and locking. The key is the number of active threads.
A good place to start is to ask yourself why you need multiple threads.
The well-thought-out answer to this question will lead you to the best answer to the subsequent question, "how should I use multiple threads in my application?"
And that must be a subsequent question; not a primary question. The fist question must be why, not how.
I think it depends on how long each thread will be running. Does each message take the same amount of time to process? Or will certain messages take a few seconds for example. If I knew that Message A was going to take 10 seconds to complete I would definitely use a new thread because why would I want to hold up the queue for a long running thread...
My 2 cents.
I think option 2 is the best. Having each thread doing independant tasks would give you best results. 3rd approach can cause more delays if multiple threads are doing some I/O operation like disk reads, reading common sockets and so on.
Whether to use Windows messaging framework for processing requests depends on the work load each thread would have. I think windows restricts the no. of messages that can be queued at the most to 10000. For most of the cases this should not be an issue. But if you have lots of messages to be queued this might be some thing to take into consideration.
Seperate queue gives a better control in a sense that you may reorder it the way you want (may be depending on priority)
Yes, there will be performance differences between your choices.
(1) introduces a bottle-neck for message processing
(3) introduces locking contention because you'll need to synchronize access to your shared queue.
(2) is starting to go in the right direction... though a queue for each message type is a little extreme. I'd probably recommend starting with a queue for each model in your app and adding queues where it makes since to do so for improved performance.
If you like option #2, it sounds like you would be interested in implementing a SEDA architecture. It is going to take some reading to understand what is going on, but I think the architecture fits well with your line of thinking.
BTW, Yield is a good C++/Python hybrid implementation.
I'd have a thread pool servicing the message queue, and make the number of threads in the pool easily configurable (perhaps even at runtime). Then test it out with expected load.
That way you can see what the actual correlation is - and if your initial assumptions change, you can easily change your approach.
A more sophisticated approach would be for the system to introspect its own performance traits and adapt it's use of resources, threads in particular, as it goes. Probably overkill for most custom application code, but I'm sure there are products that do that out there.
As for the windows events question - I think that's probably an application specific question that there is no right or wrong answer to in the general case. That said, I usually implement my own queue as I can tailor it to the specific characteristics of the task at hand. Sometimes that might involve routing events via the windows message queue.