C++ TBB concurrent_bounded_queue - pop timeout

C++ TBB concurrent_bounded_queue - pop timeout - c++

I noticed that TBB concurrent_bounded_queue blocking pop has no timeout. We are moving to TBB from another implementation where we had timed wait and hence looking for the same functionalities here.
In any case, it's often useful to have timed-wait, any suggestion will be appreciated.
Thanks

According to Arch Robinson, the architect of TBB, timeouts were never a priority:
The initial design of TBB targeted a paradigm of task-based programming for parallel speedup. It is what I think of as "classical" parallel algorithms like parallel_for, parallel_reduce, etc.The containers and mutexes were designed with that in mind; i.e., to avoid race conditions.Blocking was expected to be short, otherwise the program would not scale. Therefore timeouts were not a priority.
There's an old article where one of the TBB engineers discusses timed mutexes. You probably won't be able to use the workaround sketched there directly, but it can help you implement a blocking dequeue yourself, basing it off on concurrent_queue's not blocking try_pop.
I wouldn't expect the performance to be anywhere near that of just using the TBB queues, and it won't be that trivial (probably 100+ LOC), but if you really want it, it can be done.
P.S. Java BlockingQueue-style blocking poll/offer with a timeout is something that you probably don't want to be using on your fast paths. A couple of times I have started an implementation with such a method, only to find after some period of time (no pun intended), that either producer/consumer pressure is not enough and just blocking is at least as efficient, or I need to investigate what happens and rethink some part of my code.

Related

Does Boost have support for Windows EnterCriticalSection API?

I know Boost has support for mutexes and lock_guard, which can be used to implement critical sections.
But Windows has a special API for critical sections (see EnterCriticalSection and LeaveCriticalSection) which is a LOT faster than a mutex (for rarely contended, short sections of code).
Hence my question - it is possible in Boost to take advantage of this API, and fallback to spinlock/mutex/futex-based implementation on other platforms?

The simple answer is no.
Here's some relevant background from an old mailing list thread:
BTW. I am agree that mutex is more universal solution from a
performance point of view. But to be fair - CS are faster in simple
design. I believe that possibility to support them should be at
least
taken in account.
This was the article that someone pointed me to. The conclusion was
that CS are only faster if:
There are less than 8 threads total in the process.
You weren't running in the background.
You weren't on an dual processor machine.
To me this means that simple testing yields good CS performance
results, but any real world program is better off with a full blown
mutex.
I'm not adverse to supporting a CS implementation. However, I
originally chose not to for the following reasons:
You get either construction and destruction hits from using a PIMPL
idiom or you must include Windows.h in the Boost.Threads headers,
which I simply don't want to do. (This can be worked around by
emulating a CS ala OPTEX from the MSDN.)
According to this research paper most programs won't benefit from
a CS design.
It's trivial to code a (non-portable) critical_section class that
follows the Mutex model if you truly can make use of this.
For now I think I've made the right choice, though down the road we
may change the implementation to use a critical section or OPTEX.
Bill Kempf

Speaking as someone who helps out maintaining Boost.Thread, and as someone who failed to get an event object into Boost.Thread, I don't think critical sections have ever been added nor would be added to Boost for these reasons:
A Win32 critical section is trivially easy to build using a boost::atomic and a boost::condition_variable, so much so it isn't really worth having an official one. Here is probably the most complex one you could imagine, but extremely configurable including being constexpr ready (don't ask!): https://github.com/ned14/boost.outcome/blob/master/include/boost/outcome/v1/spinlock.hpp#L331
You can build your own simply by matching (Basic)Lockable concept and using atomic compare_exchange (non-x86/x64) or atomic exchange (x86/x64) and then grab it using a lock_guard around the critical section.
Some may object that a win32 critical section is not this. I am afraid it is: it simply spins on an atomic for a spin count, and then lazily tries to allocate a win32 event object which it then waits upon. Nothing special.
As much as you might think critical sections (really user mode mutexes) are better/faster/whatever, they probably are not as great as you might think. boost::mutex is a big vast heavyweight thing on Windows internally using a win32 semaphore as the kernel wait object because of the need to emulate thread cancellation and to behave well in a general purpose use context. It's easy to write a concurrency structure which is faster than another for some single use case, but it is very very hard to write a concurrency structure which is all of:
Faster than a standard implementation in the uncontended case.
Faster than a standard implementation in the lightly contended case.
Faster than a standard implementation in the heavily contended case.
Even if you manage all three of the above, that still isn't enough: you also need some guarantees on worst case progression ordering, so whether certain patterns of locks, waits and unlocks produce predictable outcomes. This is why threading facilities can appear to look slow in narrow use case scenarios, so Boost.Thread much as the STL can appear to be much slower than hand rolled locking code in say an uncontended use case.
Boost.Thread already does substantial work in user mode to avoid going to kernel sleep on Windows. On POSIX any of the major pthreads implementations also does substantial work to avoid kernel sleeps and hence Boost.Thread doesn't replicate that work. In other words, critical sections don't gain you anything in terms of scaling to load behaviours, though inevitably Boost.Thread v4 especially on Windows does a ton load of work a naive implementation does not (the planned rewrite of Boost.Thread is vastly more efficient on Windows as it can assume Windows Vista or above).

So, it looks like the default Boost mutex doesn't support it, but asio::detail::mutex does.
So I ended up using that:
#include <boost/asio/detail/mutex.hpp>
#include <boost/thread.hpp>
using boost::asio::detail::mutex;
using boost::lock_guard;
int myFunc()
{
static mutex mtx;
lock_guard<mutex> lock(mtx);
. . .
}

Thread per connection vs Reactor pattern (with a thread pool)?

I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, since I am at it, I would like to do it properly, as opposed to just getting-it-done.
If I understood correctly: Apache uses a Thread-per-connection architecture, while nginx uses an event-loop and then dedicates a worker [x] for the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure if it could be applied to my situation. The analogy also seems to be very idealist. I have rarely seen my computer run at 100% CPU (even with a umptillion Chrome tabs open, Photoshop and what-not running simultaneously)
Also, I have come across a SO post (somehow it vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. This question was related to JVM, though.
So, let's estimate a fictional user-base of around 5,000 users. Which approach should would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread-pool (approximately, how big do you suggest the thread pool should be?
Creating a thread per connection and then destroying the thread the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.

Reactive Applications certainly scale better, when they are written correctly. This means
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of you server, you typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexs since these can block, so no shared mutable state. If you require shared state you will have to wrap it with an actor or similar so only one thread has access to the state.
All work in the reactive threads should be cpu bound
All IO has to be asynchronous or be performed in a different thread pool and the results feed back into the reactor.
This means using either futures or callbacks to process replies, this style of code can quickly become unmaintainable if you are not used to it and disciplined.
All work in the reactive threads should be small
To maintain responsiveness of the server all tasks in the reactor must be small (bounded by time)
On an 8 core machine you cannot cannot allow 8 long tasks arrive at the same time because no other work will start until they are complete
If a tasks could take a long time it must be broken up (cooperative multitasking)
Tasks in reactive applications are scheduled by the application not the operating system, that is why they can be faster and use less memory. When you write a Reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
I am a big fan of reactive architectures but they come with costs. I am not sure I would write my first c++ application as reactive, I normally try to learn one thing at a time.
If you decide to use a reactive architecture use a good framework that will help you design and structure your code or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? can it only come in from an external event (eg network request)
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results?
I cannot recommend a C++ library for this, I now do my server development in Scala and Akka which provide all of this with an excellent composable futures library to keep the code clean.
Best of luck learning C++ and with which ever choice you make.

Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost Thread plus Boost ASIO. You could also try a C++11 std thread library, and std mutex (but Boost ASIO is better than mutexes in a lot of cases, just always callback to the same thread and you don't need protected regions). Stay away from std future, cause it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core. 8 cores -> 8 threads. Plus maybe a few extra, if you think it's possible that your threadpool threads might call blocking operations sometimes.

FWIW, Poco supports option 2 (ParallelReactor) since version 1.5.1

I think that option 2 is the best one. As for tuning of the pool size, I think the pool should be adaptive. It should be able to spawn more threads (with some high hard limit) and remove excessive threads in times of low activity.

as the analogy you linked to (and it's comments) suggest. this is somewhat application dependent. now what you are building here is a game server. let's analyze that.
game servers (generally) do a lot of I/O and relatively few calculations, so they are far from 100% CPU applications.
on the other hand they also usually change values in some database (a "game world" model). all players create reads and writes to this database. which is exactly the intersection problem in the analogy.
so while you may gain some from handling the I/O in separate threads, you will also lose from having separate threads accessing the same database and waiting for its locks.
so either option 1 or 2 are acceptable in your situation. for scalability reasons I would not recommend option 3.

What are the limitations of MS Concurrency Runtime?

I am using a simple Concurrency Runtime task_group in Visual Studio 2010 to run a single working thread to separate the work from the GUI thread.
However one of my colleagues told me that I'm using CR wrong: it was designed for parallelizing lightweight tasks with small context and not for separating bulky and I/O-dependent threads from the GUI. He said that he'd taken this from the documentation, but failed to provide any specific links.
So, what are the limitations of Microsoft Concurrency Runtime and to solve what problems I should NOT use it?
Of course CR is not portable, but let's leave it out: I'm talking about situations, when you code compiles, but you get problems nevertheless.

The concurrency runtime is a cooperative scheduling infrastructure. If you're not going to take advantage of cooperative scheduling, then you're better off creating threads when you need to, and letting the OS take care of scheduling.
If you are into cooperative scheduling, then there's really no point to wait for an IO operation to complete, because you're blocking a thread which could have otherwise been used for running other tasks, which do not depend on this IO operation to complete. If other tasks depend on the IO task to complete, you can simply make them continuations, and the ConcRT scheduler will make sure to run them when their time comes.
So it's really not about limitations here. It's simply about knowing what you're trying to achieve, and picking the right tool for the job.

As Yam mentioned, concurrency runtime does not provide the parallel execution guarantee, it just makes a potential possibility, and that is the difference between notions of tasks and threads. If you get your tasks right (not too granular to spend much time on switching between tasks, and not too coarse to always have some work for all the cores - in your case - just one), then the overhead will not be significant, and your program will be ready for running on a multi-core or a multi-processor platform, "future proof" as MSFT people like to say.

TBB vs. Homegrown Workqueue

I know TBB (Thread Building Blocks) claim to have a sophisticated engine, but from the algorithmic point of view:
If we had (say on Linux) a workqueue that has N working-threads (POSIX threads, N is the number of cores) and a mutex-synchronized queue of tasks, each working thread then taking a task from the queue when idle, also some synchronization calls, what else could TBB offer, not counting nice C++ syntax? I don't see a better algorithm than greedy assignment of tasks to cores.

As somebody who has developed their own work-stealing scheduler, I can say the following:
Don’t write your own scheduler (and a work-queue counts here).
You’ll either do it inefficiently, or you’ll do it wrong.
In fact, it’s not that hard to write a correct scheduler. Unfortunately, it is hard if you want to do it efficiently. An efficient scheduler effectively precludes the use of locks (except perhaps in very specific, well-specified situations) and lock-free cross-thread communication is a world of pain.
As an anecdote, I actually implemented one scheduler where I essentially had to copy the existing algorithm into code and I still managed to introduce almost any race condition imaginable into the code. Debugging this code was a mixture of
writing huge, convoluted test cases (just to pick up the occasional failure which only occurred in < 1% of the runs),
spending hours on end just staring at the code, trying to figure out the error by applying logic
tracing each single line in the debugger (which would crash without stack trace once an error occurred), keeping track of the state of all variables in all threads manually just to be sure that the actual state of the program matched the expected state
reducing the code several times essentially down to zero and rebuilding, commenting out single lines or pairs of lines to see the effect (huge combinatorial space), and
running against walls, head first.

Not knowing the precise implementation of TBB, I cannot say what exactly it offers, but since you said "what could it offer"...
Among others,
It could offer lockfree queueing and unqueueing instead of one syscall and context switch per task. This is harder to implement than it sounds.
It could, in addition, offer blocking of worker threads if the queue is empty. This again, is harder to implement than it sounds.
It could offer work stealing.
It could offer LIFO task-to-thread assignment in the same way Windows completion ports work (improving cache efficiency).
It could be bug-free. This, again, is something harder to implement than you think.

What should I know about multithreading and when to use it, mainly in c++

I have never come across multithreading but I hear about it everywhere. What should I know about it and when should I use it? I code mainly in c++.

Mostly, you will need to learn about MT libraries on OS on which your application needs to run. Until and unless C++0x becomes a reality (which is a long way as it looks now), there is no support from the language proper or the standard library for threads. I suggest you take a look at the POSIX standard pthreads library for *nix and Windows threads to get started.

This is my opinion, but the biggest issue with multithreading is that it is difficult. I don't mean that from an experienced programmer point of view, I mean it conceptually. There really are a lot of difficult concurrency problems that appear once you dive into parallel programming. This is well known, and there are many approaches taken to make concurrency easier for the application developer. Functional languages have become a lot more popular because of their lack of side effects and idempotency. Some vendors choose to hide the concurrency behind API's (like Apple's Core Animation).
Multitheaded programs can see some huge gains in performance (both in user perception and actual amount of work done), but you do have to spend time to understand the interactions that your code and data structures make.

MSDN Multithreading for Rookies article is probably worth reading. Being from Microsoft, it's written in terms of what Microsoft OSes support(ed in 1993), but most of the basic ideas apply equally to other systems, with suitable renaming of functions and such.

That is a huge subject.
A few points...
With multi-core, the importance of multi-threading is now huge. If you aren't multithreading, you aren't getting the full performance capability of the machine.
Multi-threading is hard. Communicating and synchronization between threads is tricky to get right. Problems are often intermittent, hard to diagnose, and if the design isn't right for multi-threading, hard to fix.
Multi-threading is currently mostly non-portable and platform specific.
There are portable libraries with wrappers around threading APIs. Boost is one. wxWidgets (mainly a GUI library) is another. It can be done reasonably portably, but you won't have all the options you get from platform-specific APIs.

I've got an introduction to multithreading that you might find useful.
In this article there isn't a single
line of code and it's not aimed at
teaching the intricacies of
multithreaded programming in any given
programming language but to give a
short introduction, focusing primarily
on how and especially why and when
multithreaded programming would be
useful.

Here's a link to a good tutorial on POSIX threads programming (with diagrams) to get you started. While this tutorial is pthread specific, many of the concepts transfer to other systems.
To understand more about when to use threads, it helps to have a basic understanding of parallel programming. Here's a link to a tutorial on the very basics of parallel computing intended for those who are just becoming acquainted with the subject.

The other replies covered the how part, I'll briefly mention when to use multithreading.
The main alternative to multithreading is using a timer. Consider for example that you need to update a little label on your form with the existence of a file. If the file exists, you need to draw a special icon or something. Now if you use a timer with a low timeout, you can achieve basically the same thing, a function that polls if the file exists very frequently and updates your ui. No extra hassle.
But your function is doing a lot of unnecessary work, isn't it. The OS provides a "hey this file has been created" primitive that puts your thread to sleep until your file is ready. Obviously you can't use this from the ui thread or your entire application would freeze, so instead you spawn a new thread and set it to wait on the file creation event.
Now your application is using as little cpu as possible because of the fact that threads can wait on events (be it with mutexes or events). Say your file is ready however. You can't update your ui from different threads because all hell would break loose if 2 threads try to change the same bit of memory at the same time. In fact this is so bad that windows flat out rejects your attempts to do it at all.
So now you need either a synchronization mechanism of sorts to communicate with the ui one after the other (serially) so you don't step on eachother's toes, but you can't code the main thread part because the ui loop is hidden deep inside windows.
The other alternative is to use another way to communicate between threads. In this case, you might use PostMessage to post a message to the main ui loop that the file has been found and to do its job.
Now if your work can't be waited upon and can't be split nicely into little bits (for use in a short-timeout timer), all you have left is another thread and all the synchronization issues that arise from it.
It might be worth it. Or it might bite you in the ass after days and days, potentially weeks, of debugging the odd race condition you missed. It might pay off to spend a long time first to try to split it up into little bits for use with a timer. Even if you can't, the few cases where you can will outweigh the time cost.

You should know that it's hard. Some people think it's impossibly hard, that there's no practical way to verify that a program is thread safe. Dr. Hipp, author of sqlite, states that thread are evil. This article covers the problems with threads in detail.
The Chrome browser uses processes instead of threads, and tools like Stackless Python avoid hardware-supported threads in favor of interpreter-supported "micro-threads". Even things like web servers, where you'd think threading would be a perfect fit, and moving towards event driven architectures.
I myself wouldn't say it's impossible: many people have tried and succeeded. But there's no doubt writting production quality multi-threaded code is really hard. Successful multi-threaded applications tend to use only a few, predetermined threads with just a few carefully analyzed points of communication. For example a game with just two threads, physics and rendering, or a GUI app with a UI thread and background thread, and nothing else. A program that's spawning and joining threads throughout the code base will certainly have many impossible-to-find intermittent bugs.
It's particularly hard in C++, for two reasons:
the current version of the standard doesn't mention threads at all. All threading libraries and platform and implementation specific.
The scope of what's considered an atomic operation is rather narrow compared to a language like Java.
cross-platform libraries like boost Threads mitigate this somewhat. The future C++0x will introduce some threading support. But boost also has good interprocess communication support you could use to avoid threads altogether.
If you know nothing else about threading than that it's hard and should be treated with respect, than you know more than 99% of programmers.
If after all that, you're still interested in starting down the long hard road towards being able to write a multi-threaded C++ program that won't segfault at random, then I recommend starting with Boost threads. They're well documented, high level, and work cross platform. The concepts (mutexes, locks, futures) are the same few key concepts present in all threading libraries.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js