Real-time data streaming with 1 writer and N concurrent readers - C++

A server controls one writer continuously producing data frames in real time, and N possible concurrent read requests. Whenever a reader makes a request to the server, the reader should be able to get the most recently produced frame, or wait for it if one is not available. Although N different readers are allowed to concurrently "consume" the same frame, each individual reader must not read the same frame more than once.
Is there any well-known algorithm or strategy for the above problem which does not waste too many resources and gives the readers good throughput?
For now my idea is to use so-called "triple buffering" (one buffer per frame), where two buffers are filled alternately by the writer and one buffer is shared by the concurrent readers. If the number of concurrent readers is 0, then once a frame has been produced, the corresponding buffer can be swapped with the buffer dedicated to the readers. It seems an easy model, although all the concurrent readers might be affected by the timing of the slowest reader in the group. The problem of making sure that one reader cannot get the same frame twice still has to be solved with some sort of synchronisation, in a clean way which fits the above model.
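To make the requirements concrete, here is a rough sketch of the semantics I need -- not triple buffering itself, just a mutex-guarded shared_ptr plus a per-reader sequence counter (Frame and FrameExchange are placeholder names I made up):

#include <condition_variable>
#include <cstdint>
#include <memory>
#include <mutex>

struct Frame { /* frame payload */ };

class FrameExchange {
public:
    // Writer side: publish the newest frame and wake any waiting readers.
    void publish(std::shared_ptr<const Frame> f) {
        {
            std::lock_guard<std::mutex> lk(m_);
            latest_ = std::move(f);
            ++seq_;
        }
        cv_.notify_all();
    }

    // Reader side: block until a frame newer than last_seen exists, return
    // it, and bump last_seen so the same frame is never delivered twice.
    std::shared_ptr<const Frame> next(std::uint64_t& last_seen) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return seq_ > last_seen; });
        last_seen = seq_;
        return latest_; // the shared_ptr keeps the frame alive while it is read
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::shared_ptr<const Frame> latest_;
    std::uint64_t seq_ = 0;
};

Slow readers never block the writer here (they simply miss frames), which is the property I want from triple buffering, but the single mutex is a contention point I would like to avoid.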
If you have any other idea, or code (modern C++ preferred), or a C++ library suggestion... I'd appreciate it.

Martin Thompson, the lead of the Disruptor project, has a new project called Aeron, and it's super fast. What's more, it already supports a C++ API. Check out the introduction video and the article from High Scalability:
https://www.youtube.com/watch?v=tM4YskS94b0
http://highscalability.com/blog/2014/11/17/aeron-do-we-really-need-another-messaging-system.html

If I understood your question correctly, you can use the Disruptor pattern here. It uses ring buffers to effectively pass data between threads. See the multicast events section here. The LMAX Disruptor was originally written in Java, though some implementations exist for C++. See the pure C version, a C++11 version and another C++ version. Also, have you seen the Intel Threading Building Blocks library? It has some useful and highly effective concurrent data structures, a scheduler, and synchronization primitives for C++. Hope this helps...
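To give a flavour of the ring-buffer idea, here is a minimal single-producer/single-consumer sketch -- far simpler than the Disruptor's multi-consumer, batched design, and SpscRing is a made-up name (C++17 for std::optional):

#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>
class SpscRing {
public:
    bool push(const T& v) { // called from the producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == Capacity)
            return false; // full: the consumer has not caught up yet
        buf_[head % Capacity] = v;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() { // called from the consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt; // empty
        T v = buf_[tail % Capacity];
        tail_.store(tail + 1, std::memory_order_release);
        return v;
    }

private:
    std::array<T, Capacity> buf_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

The Disruptor generalises this idea with per-consumer sequence numbers, so many consumers can track the same ring at their own pace.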

Related

Thread per connection vs Reactor pattern (with a thread pool)?

I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, while I am at it, I would like to do it properly, as opposed to just getting it done.
If I understood correctly: Apache uses a Thread-per-connection architecture, while nginx uses an event-loop and then dedicates a worker [x] for the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure if it could be applied to my situation. The analogy also seems to be very idealistic. I have rarely seen my computer run at 100% CPU (even with an umptillion Chrome tabs open, Photoshop and what-not running simultaneously).
Also, I have come across an SO post (somehow it vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. That question was related to the JVM, though.
So, let's estimate a fictional user base of around 5,000 users. Which approach would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread pool (approximately how big do you suggest the thread pool should be?).
Creating a thread per connection and then destroying the thread when the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.
Reactive applications certainly scale better when they are written correctly. This means:
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of your server; you typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexes, since these can block, so no shared mutable state. If you require shared state you will have to wrap it in an actor or similar so that only one thread has access to the state.
All work in the reactive threads should be CPU-bound:
All I/O has to be asynchronous, or be performed in a different thread pool with the results fed back into the reactor.
This means using either futures or callbacks to process replies; this style of code can quickly become unmaintainable if you are not used to it and disciplined.
All work in the reactive threads should be small:
To maintain the responsiveness of the server, all tasks in the reactor must be small (bounded in time).
On an 8-core machine you cannot allow 8 long-running tasks to arrive at the same time, because no other work will start until they are complete.
If a task could take a long time it must be broken up (cooperative multitasking; see the sketch after the next paragraph).
Tasks in reactive applications are scheduled by the application not the operating system, that is why they can be faster and use less memory. When you write a Reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
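As an illustration of the "break it up" rule above, here is a minimal sketch of a long job that processes a bounded slice and then re-posts itself to the executor. I am using Boost.Asio's thread_pool as a stand-in executor; the function name and chunk size are made up:

#include <algorithm>
#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <cstddef>
#include <iostream>
#include <memory>
#include <vector>

// Each invocation handles at most kChunk items, then re-posts itself,
// yielding the thread back to any other queued tasks.
void processChunked(boost::asio::thread_pool& pool,
                    std::shared_ptr<std::vector<int>> data,
                    std::size_t offset) {
    constexpr std::size_t kChunk = 1024;
    const std::size_t end = std::min(offset + kChunk, data->size());
    for (std::size_t i = offset; i < end; ++i)
        (*data)[i] *= 2; // stand-in for the real work
    if (end < data->size())
        boost::asio::post(pool, [&pool, data, end] {
            processChunked(pool, data, end);
        });
}

int main() {
    boost::asio::thread_pool pool(2);
    auto data = std::make_shared<std::vector<int>>(1000000, 1);
    boost::asio::post(pool, [&pool, data] { processChunked(pool, data, 0); });
    pool.join(); // blocks until the task queue drains
    std::cout << (*data)[0] << "\n"; // prints 2
}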
I am a big fan of reactive architectures but they come with costs. I am not sure I would write my first c++ application as reactive, I normally try to learn one thing at a time.
If you decide to use a reactive architecture use a good framework that will help you design and structure your code or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? Can it only come in from an external event (e.g. a network request)?
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results?
I cannot recommend a C++ library for this; I now do my server development in Scala and Akka, which provide all of this with an excellent composable futures library to keep the code clean.
Best of luck learning C++, and with whichever choice you make.
Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost.Thread plus Boost.Asio. You could also try the C++11 std::thread library and std::mutex (but Boost.Asio is better than mutexes in a lot of cases: just always call back on the same thread and you don't need protected regions -- see the sketch after this answer). Stay away from std::future, because it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core: 8 cores -> 8 threads. Plus maybe a few extra, if you think it's possible that your thread-pool threads might sometimes call blocking operations.
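To illustrate the "always call back on the same thread" style mentioned above, here is a minimal Boost.Asio sketch (the variable names are made up; io_context is the modern spelling, older Asio versions call it io_service). One io_context run by a single thread serialises all handlers, so the shared counter needs no mutex:

#include <boost/asio.hpp>
#include <iostream>
#include <thread>

int main() {
    boost::asio::io_context io;
    int counter = 0; // only ever touched by handlers on the io thread

    auto guard = boost::asio::make_work_guard(io); // keep run() alive
    std::thread ioThread([&] { io.run(); });

    // Any thread may post work; the handlers all run on the one io thread,
    // so nothing protects `counter` and nothing needs to.
    std::thread producer([&] {
        for (int i = 0; i < 100; ++i)
            boost::asio::post(io, [&counter] { ++counter; });
    });

    producer.join();
    boost::asio::post(io, [&counter] { std::cout << counter << "\n"; }); // 100
    guard.reset(); // let run() return once the queue is empty
    ioThread.join();
}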
FWIW, Poco supports option 2 (ParallelReactor) since version 1.5.1
I think that option 2 is the best one. As for tuning the pool size, I think the pool should be adaptive: it should be able to spawn more threads (up to some high hard limit) and remove excess threads in times of low activity.
As the analogy you linked to (and its comments) suggests, this is somewhat application-dependent. What you are building here is a game server, so let's analyze that.
Game servers (generally) do a lot of I/O and relatively few calculations, so they are far from 100%-CPU applications.
On the other hand, they also usually change values in some database (a "game world" model). All players create reads and writes to this database, which is exactly the intersection problem in the analogy.
So while you may gain something from handling the I/O in separate threads, you will also lose something from having separate threads accessing the same database and waiting for its locks.
So either option 1 or option 2 is acceptable in your situation. For scalability reasons I would not recommend option 3.

Good Multi-Thread Model for a BitTorrent client?

I am currently writing a BitTorrent client. I am getting to the stage in my program where I need to start thinking about whether multiple threads would improve my program, and how many I would need.
I assume that I would assign one thread to deal with the trackers, because the program may be in contact with several of them (roughly 1-5) at once, but will only need to contact them at an interval assigned by the tracker (around 20 minutes), so this won't be very demanding on the program.
The program will be in regular contact with numerous peers to download pieces of files from them. The following is taken from the Bittorrent Specification Wiki:
Implementer's Note: Even 30 peers is plenty, the official client version 3 in fact only actively forms new connections if it has less than 30 peers and will refuse connections if it has 55. This value is important to performance. When a new piece has completed download, HAVE messages (see below) will need to be sent to most active peers. As a result the cost of broadcast traffic grows in direct proportion to the number of peers. Above 25, new peers are highly unlikely to increase download speed. UI designers are strongly advised to make this obscure and hard to change as it is very rare to be useful to do so.
It suggests that I should be in contact with roughly 30 peers. What would be a good thread model for my BitTorrent client? Obviously I don't want to assign a thread to each peer and each tracker, but I will probably need more than just the main thread. What do you suggest?
I don't see a lot of need for multithreading here. Having too many threads also means a lot of communication between them to make sure everyone is doing the right thing at the right time.
For the networking, keep everything on one thread and just multiplex using non-blocking I/O. On Unix systems this would be a setup with select/poll (or platform-specific extensions such as epoll); on Windows this would be I/O completion ports.
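For illustration, a minimal Linux-flavoured sketch of that single-threaded multiplexed loop (epoll; error handling mostly omitted, and eventLoop/listenFd are placeholder names -- listenFd is assumed to be a bound, listening TCP socket):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void eventLoop(int listenFd) {
    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listenFd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenFd, &ev);

    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1); // sleep until activity
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listenFd) { // a new peer is connecting
                int peer = accept(listenFd, nullptr, nullptr);
                epoll_event pev{};
                pev.events = EPOLLIN;
                pev.data.fd = peer;
                epoll_ctl(epfd, EPOLL_CTL_ADD, peer, &pev);
            } else { // data from an existing peer
                char buf[4096];
                ssize_t got = read(fd, buf, sizeof buf);
                if (got <= 0) { close(fd); continue; } // peer went away
                // ... feed buf to the protocol handler ...
            }
        }
    }
}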
You can even add the disk I/O into this, which would make the communication between the threads trivial since there isn't any :-)
If you want to consider threads to be containers for separate components, the disk I/O could go into another thread. You could use blocking I/O in this case, since there isn't a lot of multiplexing anyway.
Likewise, in such a scenario, tracker handling could go into a different thread as well since it's a different component from peer handling. Same for DHT.
You might want to offload the checksum-checking to a separate thread. Not quite sure how complex this gets, but if there's significant CPU use involved then putting it away from the I/O stuff doesn't sound that bad.
As you tagged your question [C++], I suggest std::thread from C++11. A nice tutorial (among lots of others) can be found here.
Concerning the number of threads: you can use 30 threads without any problem, having each one check whether there is something for it to do and sleep for a reasonable time between checks. The operating system will take care of the rest.

How to synchronize three dependent threads

If I have
1. mainThread: write data A,
2. Thread_1: read A and write it into a Buffer;
3. Thread_2: read from the Buffer.
How can I synchronize these three threads safely, without much performance loss? Is there any existing solution to use? I use C/C++ on Linux.
IMPORTANT: the goal is to know the synchronization mechanism or algorithms for this particular case, not how mutex or semaphore works.
First, I'd consider the possibility of building this as three separate processes, using pipes to connect them. A pipe is (in essence) a small buffer with locking handled automatically by the kernel. If you do end up using threads for this, most of your time/effort will be spent on creating a near-exact duplicate of the pipes that are already built into the kernel.
Second, if you decide to build this all on your own anyway, I'd give serious consideration to following a similar model anyway. You don't need to be slavish about it, but I'd still think primarily in terms of a data structure to which one thread writes data, and from which another reads the data. By strong preference, all the necessary thread locking would be built into that data structure, so most of the code in each thread stays quite simple: reading, processing, and writing data. The main difference from using normal Unix pipes would be that in this case you can maintain the data in a more convenient format, instead of all the reading and writing being in text.
As such, what I think you're looking for is basically a thread-safe queue. With that, nearly everything else involved borders on trivial (at least the threading part of it does -- the processing involved may not be, but at least building it with multiple threads isn't adding much to the complexity).
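For concreteness, a minimal C++11 sketch of such a queue (SafeQueue is a placeholder name; a production version would also want shutdown support and perhaps a capacity bound):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class SafeQueue {
public:
    void push(T value) { // producer side
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(value));
        }
        cv_.notify_one();
    }

    T pop() { // consumer side: blocks until an item is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

With two of these (mainThread -> Thread_1 and Thread_1 -> Thread_2), each thread becomes a simple read-process-write loop, mirroring the pipe model described above.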
It's hard to say how much experience with C/C++ threads you have. I hate to just point to a link but have you read up on pthreads?
https://computing.llnl.gov/tutorials/pthreads/
And for a shorter example with code and simple mutexes (the lock objects you need to synchronize data):
http://students.cs.byu.edu/~cs460ta/cs460/labs/pthreads.html
I would suggest Boost.Thread for this purpose. It is quite a good framework with mutexes and semaphores, and it is multiplatform. Here you can find a very good tutorial about it.
How exactly to synchronize these threads is another problem and needs more information about your use case.
Edit: The simplest solution would be to use two mutexes -- one on A and a second on the Buffer. You don't have to worry about deadlocks in this particular case. Just:
MainThread enters mutex_A; Thread_1 waits for the mutex to be released.
MainThread leaves mutex_A; Thread_1 enters mutex_A and mutex_Buffer, reads from A, and writes it to the Buffer.
Thread_1 releases both mutexes. MainThread can enter mutex_A and write new data, and Thread_2 can enter mutex_Buffer and safely read the data from the Buffer.
This is obviously the simplest solution and can probably be improved, but without more knowledge about the problem, it is the best I can come up with.
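A minimal sketch of that two-mutex scheme (the variable names follow the answer above; note that mutexes alone only provide exclusion, so a real version would add condition variables so each thread knows when fresh data is actually present -- that part is my assumption):

#include <mutex>

int A = 0;      // written by MainThread, read by Thread_1
int Buffer = 0; // written by Thread_1, read by Thread_2
std::mutex mutex_A;
std::mutex mutex_Buffer;

void mainThreadWrites(int value) {
    std::lock_guard<std::mutex> lk(mutex_A);
    A = value; // step 1: produce A
}

void thread1Copies() {
    std::scoped_lock lk(mutex_A, mutex_Buffer); // step 2: take both locks
    Buffer = A;                                 // (C++17, deadlock-free)
}                                               // step 3: both released here

void thread2Reads(int& out) {
    std::lock_guard<std::mutex> lk(mutex_Buffer);
    out = Buffer; // safe read of the Buffer
}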

Multithreaded application concept

I have a small architectural question about organizing code into separate functional units (most probably threads?). The application being developed is supposed to do the following tasks:
Display some images on a screen (i.e. slideshow)
Read the data from external device through the USB port
Match received data against the corresponding image (stimulus)
Do some data analysis
Plot the results of data analysis
My thoughts were to organize the application into the following modules:
GUI thread (+ image slideshow)
USB thread buffering the received data
Thread for analyzing/plotting data (the main GUI thread should not be blocked while plotting the data, which might take some time)
So, what do you generally think about this concept? Is there anything else you think that might be a better fit in this particular scenario?
You can probably get away with combining 1 & 2, since the slide-show feature is essentially GUI-oriented anyway.
For #3, you may be able to make do with some kind of asynchronous I/O methodology, so that you don't need to dedicate a polling thread. Not sure if you can do this with USB, but you can certainly get async I/O with serial and network interfaces, so it's worth looking into.
It's probably a good idea to move heavyweight tasks like 4 & 5 to their own thread. If you aren't doing the analysis and plotting concurrently, maybe one thread for them both. However, you should really consider how much CPU time these activities will need. If the worst-case analyze-and-plot takes much less than half a second, you might even just perform these actions with a call from the GUI. Conversely, if there are cases where this will take longer than that, a separate thread is favorable, because your users won't like a laggy GUI.
Just bear in mind that the dark side of threads lies in the inevitable challenge of coordinating them.
Because of the way the Windows API works, especially with regard to user input and window ownership, you can really only do UI on a single thread. If you try to use multiple threads, they just end up locking each other out and only one thread runs at a time. There are some specialized exceptions, but you have to be a real master of the API to pull them off.
So.
GUI thread, owns the Window, and handles all user input.
USB listening thread, you would know better than I whether this makes sense
Thread(s) for analyzing/plotting data -- once again, I can't speak to this, but I'm skeptical that they will really both be running at the same time. It seems more likely that it would be analyze-then-plot, so one thread.
Thread for rendering frames for a slideshow.
I'm not sure how plotting isn't the same thing as the slideshow, but I do think you can have a background thread for drawing the slideshow, as long as it doesn't display the images.
You can render (i.e. draw to a bitmap or DirectX surface) in a background thread; you just can't show it in a window. But you can hand completed bitmaps off to the GUI thread and have it do the actual displaying of the bitmap. This is essentially how a lot of video playback code works.
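A minimal sketch of that handoff (platform details omitted; Bitmap, renderLoop and takeLatest are made-up names -- on Windows the renderer would typically also post a message so the GUI thread knows a new frame is ready):

#include <memory>
#include <mutex>
#include <vector>

struct Bitmap { std::vector<unsigned char> pixels; }; // placeholder

std::mutex slotMutex;
std::shared_ptr<Bitmap> latestFrame; // written by the renderer, shown by the GUI

// Background renderer: draw offscreen, then publish the finished bitmap.
void renderLoop() {
    for (;;) {
        auto bmp = std::make_shared<Bitmap>();
        // ... draw the next slide into bmp->pixels ...
        std::lock_guard<std::mutex> lk(slotMutex);
        latestFrame = std::move(bmp);
    }
}

// GUI thread (e.g. in the paint handler): grab the newest bitmap and blit it.
std::shared_ptr<Bitmap> takeLatest() {
    std::lock_guard<std::mutex> lk(slotMutex);
    return latestFrame;
}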
A lot of this depends on how much is involved in performing 4 (do some data analysis) and 5 (plot the results of the analysis).
My instincts would be:
Definitely have a separate thread for reading the data off the USB port. Assuming for a moment that 3 (matching the received data) depends on reading the data, I would do 3 in the same thread as the reading; this will simplify your signaling to the GUI when the data is ready. This also assumes the processing is quick and won't block the USB port (how is that being read? I/O completion ports?). If the processing takes time, then you need a separate thread.
Likewise, if preparing the slideshow images takes a long time, this should be done in a separate thread. If it can be recalculated quickly, say in a paint function, I would keep it as part of the main GUI.
There is some overhead in thread context switching, and each added thread brings extra signaling complexity. So I would only add a thread to stop the GUI or the USB port from blocking. It may be possible to do all of this in just two threads.
4 and 5 are definitely good ideas. That being said, avoid using low-level threads unless you absolutely must.
I'd check out Boost and Boost::Thread. Not only does it make your code more portable, but I haven't worked with an easier library for threading.
If you are using Builder 2009, you should look at TThread. It has some stuff to simplify thread coding.
I can't help thinking that you may be going a bit overboard here. A USB port can't really deliver data terribly quickly -- its theoretical bandwidth is only 480 Mbit/s, and realistically it's a pretty rare USB device that can get very close to that.
Unless the analysis you've mentioned is quite a bit more complex than you've implied, my guess is that a single thread is probably entirely adequate. I'd think hard about using overlapped I/O to read the data, and MsgWaitForMultipleObjects for the main message loop.
It seems to me that the main place you stand a good chance of gaining a lot is in plotting the data after it's processed. It might be worth considering something like OpenGL or DirectX Graphics to do the drawing. Especially if you're producing quite a bit of output, this can give a really substantial speed improvement. In an ideal situation, multiple threads might multiply your speed by the number of available cores -- typically 2 or 4 on today's machines. Drawing the output is likely to be the slowest part of the job, and hardware acceleration can easily speed that up by a considerably larger factor -- 10x is at the low end of what you can typically expect, and 100x is fairly common.

How do I tell a multi-core / multi-CPU machine to process function calls in a loop in parallel?

I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.
Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.
Provided I have a loop that iterates over a large number of data chunks arriving from the db and for each one call a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but rather sets the next calls going, so they could be processed in parallel? A naive approach to demonstrate the principle would do me for now.
I have read Google's MapReduce paper and while I could use the overall principle in a number of places, I won't, for now, target large clusters, rather it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure if I can actually use the library or would have to roll a dumbed-down basic version myself.
I am at an early stage of the design process, and so far I am targeting C-something (for the speed-critical bits) and Python (for the productivity-critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am content with my choice.
Please note that I'm aware of the fact that it might take longer to retrieve the next chunk from the database than to process the current one and the whole process would then be I/O-bound. I would, however, assume for now that it isn't and in practice use a db cluster or memory caching or something else to be not I/O-bound at this point.
Well, if .NET is an option, they have put a lot of effort into parallel computing there.
If you still plan on using Python, you might want to have a look at Processing (which later entered the standard library as multiprocessing). It uses processes rather than threads for parallel computing (due to the Python GIL) and provides classes for distributing "work items" onto several processes. Using the Pool class, you can write code like the following:
import processing  # later included in the standard library as 'multiprocessing'

def worker(i):
    return i * i

if __name__ == '__main__':  # guard needed where processes are spawned, e.g. on Windows
    num_workers = 2
    pool = processing.Pool(num_workers)
    result = pool.imap(worker, range(100000))
This is a parallel version of itertools.imap which distributes the calls across the worker processes. You can also use the apply_async method of the pool and store lazy result objects in a list:
    results = []
    for i in range(10000):
        results.append(pool.apply_async(worker, (i,)))  # args are passed as a tuple
For further reference, see the documentation of the Pool class.
Gotchas:
processing uses fork(), so you have to be careful on Win32
objects transferred between processes need to be pickleable
if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
processing.Pool uses a background thread
You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
I might be missing something here, but this seems fairly straightforward using pthreads.
Set up a small thread pool with N threads in it and have one master thread control them all.
The master thread simply sits in a loop doing something like:
Get data chunk from DB
Find the next free thread; if no thread is free, then wait
Hand over chunk to worker thread
Go back and get next chunk from DB
In the meantime the worker threads sit and do:
Mark myself as free
Wait for the master thread to give me a chunk of data
Process the chunk of data
Mark myself as free again
The method by which you implement this can be as simple as two mutex-controlled arrays. One has the worker threads in it (the thread pool) and the other indicates whether each corresponding thread is free or busy.
Tweak N to your liking ...
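A minimal C++11 sketch of that master/worker arrangement, condensed to a shared job queue plus a condition variable instead of the two flag arrays (a deliberate simplification; ThreadPool is a made-up name):

#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { runWorker(); });
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    // Master thread: hand a chunk of work to the pool.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lk(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void runWorker() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return; // pool shutting down
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // process the chunk of data
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

The master thread's loop then reduces to fetching a chunk from the DB and calling submit() with it; the queue replaces the free/busy bookkeeping.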
If you're working with a compiler that supports it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code so that certain loops will be parallelized.
It does a lot more as well, and you might find it very helpful.
Their web page reports that gcc 4.2 will support OpenMP, for example.
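For example, parallelizing the kind of deterministic, side-effect-free loop the question describes can be as small as one pragma (a sketch; with gcc, compile with -fopenmp):

#include <cstdio>
#include <vector>

int main() {
    const long n = 1000000;
    std::vector<double> chunk(n, 2.0), result(n);

    // OpenMP splits these iterations across the available cores.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        result[i] = chunk[i] * chunk[i]; // deterministic, no side effects

    std::printf("%f\n", result[0]); // prints 4.000000
}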
The same thread-pool approach is used in Java. But there the work items handed to the pool can be made serialisable, sent to other computers, and deserialised to run.
I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library; the user just has to implement Map and Reduce. It is positioned as a Boost library, but has not yet been accepted as an official one. Check out http://www.craighenderson.co.uk/mapreduce
You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.
Intel's TBB or boost::mpi might be of interest to you also.