Large number of simultaneous long-running operations in Qt - c++

I have some long-running operations that number in the hundreds. At the moment they are each on their own thread. My main goal in using threads is not to speed these operations up. The more important thing in this case is that they appear to run simultaneously.
I'm aware of cooperative multitasking and fibers. However, I'm trying to avoid anything that would require touching the code in the operations, e.g. peppering them with things like yieldToScheduler(). I also don't want to prescribe that these routines be stylized to be coded to emit queues of bite-sized task items...I want to treat them as black boxes.
For the moment I can live with these downsides:
Maximum # of threads tend to be O(1000)
Cost per thread is O(1MB)
To address the bad cache performance due to context-switches, I did have the idea of a timer which would juggle the priorities such that only idealThreadCount() threads were ever at Normal priority, with all the rest set to Idle. This would let me widen the timeslices, which would mean fewer context switches and still be okay for my purposes.
Question #1: Is that a good idea at all? One certain downside is it won't work on Linux (docs say no QThread::setPriority() there).
Question #2: Any other ideas or approaches? Is QtConcurrent thinking about this scenario?
(Some related reading: how-many-threads-does-it-take-to-make-them-a-bad-choice, many-threads-or-as-few-threads-as-possible, maximum-number-of-threads-per-process-in-linux)

IMHO, this is a very bad idea. If I were you, I would try really, really hard to find another way to do this. You're combining two really bad ideas: creating a truck load of threads, and messing with thread priorities.
You mention that these operations only need to appear to run simultaneously. So why not try to find a way to make them appear to run simultaneously, without literally running them simultaneously?

It's been 6 months, so I'm going to close this.
Firstly I'll say that threads serve more than one purpose. One is speedup...and a lot of people are focusing on that in the era of multi-core machines. But another is concurrency, which can be desirable even if it slows the system down when taken as a whole. Yet concurrency can be achieved using mechanisms more lightweight than threads, although it may complicate the code.
So this is just one of those situations where the tradeoff of programmer convenience against user experience must be tuned to fit the target environment. It's how Google's approach to a process-per-tab with Chrome would have been ill-advised in the era of Mosaic (even if process isolation was preferable with all else being equal). If the OS, memory, and CPU couldn't give a good browsing experience...they wouldn't do it that way now.
Similarly, creating a lot of threads when there are independent operations you want to be concurrent saves you the trouble of sticking in your own scheduler and yield() operations. It may be the cleanest way to express the code, but if it chokes the target environment then something different needs to be done.
So I think I'll settle on the idea that in the future when our hardware is better than it is today, we'll probably not have to worry about how many threads we make. But for now I'll take it on a case-by-case basis. i.e. If I have 100 of concurrent task class A, and 10 of concurrent task class B, and 3 of concurrent task class C... then switching A to a fiber-based solution and giving it a pool of a few threads is probably worth the extra complication.

Related

Benefits of a multi thread program in a unicore system [duplicate]

This question already has answers here:
How can multithreading speed up an application (when threads can't run concurrently)?
(9 answers)
Closed 9 years ago.
My professor causally mentioned that we should program multi-thread programs even if we are using a unicore processor however because of the lack of time , he did not elaborate on it .
I would like to know what are the benefits of a multi-thread program in a unicore processor ??
It won't be as significant as a multi-core system but it can still provide some benefits.
Mainly all the benefits that you are going to get will be regarding to the context switch that will happen after a input miss to the already executing thread. Executing thread may be waiting for anything such as a hardware resource or a branch mis-prediction or even data transfer after a cache miss.
At this point the waiting thread can be executed to benefit from this "waiting time". But of course context switch will take some time. Also managing threads inside the code rather than sequential computation can create some extra complexity to your program. And as it has been said, some applications needs to be multi-threaded so there is no escape from the context switch in some cases.
Some applications need to be multi-threaded. Multi-threading isn't just about improving performance by using more cores, it's also about performing multiple tasks at once.
Take Skype for example - The GUI needs to be able to accept the text you're entering, display it on the screen, listen for new messages coming from the user you're talking to, and display them. This wouldn't be a trivial task in a single threaded application.
Even if there's only one core available, the OS thread scheduler will give you the illusion of parallelism.
Usually it is about not blocking. Running many threads on a single core still gives the illusion of concurrency. So you can have, say, a thread doing IO while another one does user interactions. The user interaction thread is not blocked while the other does IO, so the user is free to carry on interacting.
Benefits could be different.
One of the widely used examples is the application with GUI, which supposed to perform some kind of computations. If you will have a single thread - the user will have to wait the result before dealing something else with the application, but if you start it in the separate thread - user interface could be still available for user during the computation process. So, multi-thread program could emulate multi-task environment even on a unicore system. That's one of the points.
As others have already mentioned, not blocking is one application. Another one is separation of logic for unrelated tasks that are to be executed simultaneously. Using threads for that leaves handling of scheduling these tasks to the OS.
However, note that it may also be possible to implement similar behavior using asynchronous operations in a single thread. "Future" and boost::asio provide ways of doing non-blocking stuff without necessarily resorting to multiple threads.
I think it depends a bit on how exactly you design your threads and which logic is actually in the thread. Some benefits you can even get on a single core:
A thread can wrap a blocking/long-during call you can't circumvent otherwise. For some operations there are polling mechanisms, but not for all.
A thread can wrap an almost standalone part of your application that has virtually no interaction with other code. For example background polling for updates, monitoring some resource (e.g. free storage), checking internet connectivity. If you keep them in a separate thread you can keep the code relatively simple in its own 'runtime' without caring too much about the impact on the main program, the sole communication with the main logic is usually a single 'event'.
In some environments you might get more processing time. This mainly depends on how your OS scheduling system works, but if this allocates time per thread, the more threads you have the more your app will be scheduled.
Some benefits long-term:
Where it's not hard to do you benefit if your hardware evolves. You never know what's going to happen, today your app runs on a single-core embedded device, tomorrow that embedded device gets a quad core. Programming threaded from the beginning improves your future scalability.
One example is an environment where you can deterministically assign work to a thread, e.g. based on some hash all related operations end up in the same thread. The advantage for single cores is 'small' but it's not hard to do as you need little synchronization primitives so the overhead stays small.
That said, I think there are situations where it's very ill advise:
As soon as your required synchronization mechanism with other threads becomes complex (e.g. multiple locks, lots of critical sections, ...). It might still be then that multi-threading gives you a benefit when effectively moving to multiple CPUs, but the overhead is huge both for your single core and your programming time.
For instance think about operations that block because of slow peripheral devices (harddisk access etc.). While these are waiting, even the single core can do other things asyncronously.
In a lot of applications the bottleneck is not CPU processing power. So when the program flow is waiting for completion of IO requests (user input, network/disk IO), critical resources to be available, or any sort of asynchroneously triggered events, the CPU can be scheduled to do other work instead of just blocking.
In this case you don't necessarily need multiple threads that can actually run in parallel. Cooperative multi-tasking concepts like asynchroneous IO, coroutines, or fibers come into mind.
If however the application's bottleneck is CPU processing power (constantly 100% CPU usage), then it makes sense to increase the number of CPUs available to the application. At that point it is easier to scale the application up to use more CPUs if it was designed to run in parallel upfront.
As far as I can see, one answer was not yet given:
You will have to write multithreaded applications in the future!
The average number of cores will double every 18 months in the future. People have learned single-threaded programming for 50 years now, and now they are confronted with devices that have multiple cores. The programming style in a multi-threaded environment differs significantly from single-threaded programming. This refers to low-level aspects like avoiding race conditions and proper synchronization, as well as the high-level aspects like the general algorithm design.
So in addition to the points already mentioned, it's also about writing future-proof software, scalability and the development of the skills that are required to achieve these goals.

Threading vs Task-Based vs Asynchronous Programming

I'm new to this concept. Are these the same or different things? What is the difference? I really like the idea of being able to run two processes at once, for example if I have several large files to load into my program I'd love to load as many of them simultaneously as possible instead of waiting for one at a time. And when working with a large file, such as wav file, it would be great to break it into pieces and do processing on several chunks at once and then put them back together. What do I want to look into to learn how to do this sort of thing?
Edit: Also, I know using more than one core on a multicore processor fits in here somewhere, but apparently asynchronous programming doesn't necessarily mean you are using multiple cores? Why would you do this if you didn't have multiple cores to take advantage of?
They are related but different.
Threading, normally called multi-threading, refers to the use of multiple threads of execution within a single process. This usually refers to the simple case of using a small set of threads each doing different tasks that need to be, or could benefit from, running simultaneously. For example, a GUI application might have one thread draw elements, another thread respond to events like mouse clicks, and another thread do some background processing.
However, when the number of threads, each doing their own thing, is taken to an extreme, we usually start to talk about an Agent-based approach.
The task-based approach refers to a specific strategy in software engineering where, in abstract terms, you dynamically create "tasks" to be accomplished, and these tasks are picked up by a task manager that assigns the tasks to threads that can accomplish them. This is more of a software architectural thing. The advantage here is that the execution of the whole program is a succession of tasks being relayed (task A finished -> trigger task B, when both task B and task C are done -> trigger task D, etc..), instead of having to write a big function or program that executes each task one after the other. This gives flexibility when it is unclear which tasks will take more time than others, and when tasks are only loosely coupled. This is usually implemented with a thread-pool (threads that are waiting to be assigned a task) and some message-passing interface (MPI) to communicate data and task "contracts".
Asynchronous programming does not refer to multi-threaded programming, although the two are very often associated (and work well together). A synchronous program must complete each step before moving on to the next. An asynchronous program starts a step, moves on to other steps that don't require the result of the first step, then checks on the result of the first step when its result is required.
That is, a synchronous program might go a little bit like this: "do this task", "wait until done", "do something with the result", and "move on to something else". By contrast, an asynchronous program might go a little more like this: "I'm gonna start a task, and I'll need the result later, but I don't need it just now", "in the meantime, I'll do something else", "I can't do anything else until I have the result of the first step now, so I'll wait for it, if it isn't ready", and "move on to something else".
Notice that "asynchronous" refers to a very broad concept, that always involves some form of "start some work and tell me when it's done" instead of the traditional "do it now!". This does not require multi-threading, in which case it just becomes a software design choice (which often involves callback functions and things like that to provide "notification" of the asynchronous result). With multiple threads, it becomes more powerful, as you can do various things in parallel while the asynchronous task is working. Taken to the extreme, it can become a more full-blown architecture like a task-based approach (which is one kind of asynchronous programming technique).
I think the thing that you want corresponds more to yet another concept: Parallel Computing (or parallel processing). This approach is more about splitting a large processing task into smaller parts and processing all parts in parallel, and then combining the results. You should look into libraries like OpenMP or OpenCL/CUDA (for GPGPU). That said, you can use multi-threading for parallel processing.
but apparently asynchronous programming doesn't necessarily mean you are using multiple cores?
Asynchronous programming does not necessarily involve anything happening concurrently in multiple threads. It could mean that the OS is doing things on your behalf behind the scenes (and will notify you when that work is finished), like in asynchronous I/O, which happens without you creating any threads. It boils down to being a software design choice.
Why would you do this if you didn't have multiple cores to take advantage of?
If you don't have multiple cores, multi-threading can still improve performance by reusing "waiting time" (e.g., don't "block" the processing waiting on file or network I/O, or waiting on the user to click a mouse button). That means the program can do useful work while waiting on those things. Beyond that, it can provide flexibility in the design and make things seem to run simultaneously, which often makes users happier. Still, you are correct that before multi-core CPUs, there wasn't as much of an incentive to do multi-threading, as the gains often do not justify the overhead.
I think in general, all these are design related rather than language related. Same apply to multicore programming.
To reflect Jim, it's not only the file load scenario. Generally, you need to design the whole software to run concurrently in order to feel the real benefit of multi-threading, task based or asynchronous programming.
Try see things from a grand picture point of view. Understand the over all modelling of a specific example and see how these methodologies are implemented. It'll easy to see the difference and help understand when and where to use which.

Thread per connection vs Reactor pattern (with a thread pool)?

I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, since I am at it, I would like to do it properly, as opposed to just getting-it-done.
If I understood correctly: Apache uses a Thread-per-connection architecture, while nginx uses an event-loop and then dedicates a worker [x] for the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure if it could be applied to my situation. The analogy also seems to be very idealist. I have rarely seen my computer run at 100% CPU (even with a umptillion Chrome tabs open, Photoshop and what-not running simultaneously)
Also, I have come across a SO post (somehow it vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. This question was related to JVM, though.
So, let's estimate a fictional user-base of around 5,000 users. Which approach should would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread-pool (approximately, how big do you suggest the thread pool should be?
Creating a thread per connection and then destroying the thread the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.
Reactive Applications certainly scale better, when they are written correctly. This means
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of you server, you typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexs since these can block, so no shared mutable state. If you require shared state you will have to wrap it with an actor or similar so only one thread has access to the state.
All work in the reactive threads should be cpu bound
All IO has to be asynchronous or be performed in a different thread pool and the results feed back into the reactor.
This means using either futures or callbacks to process replies, this style of code can quickly become unmaintainable if you are not used to it and disciplined.
All work in the reactive threads should be small
To maintain responsiveness of the server all tasks in the reactor must be small (bounded by time)
On an 8 core machine you cannot cannot allow 8 long tasks arrive at the same time because no other work will start until they are complete
If a tasks could take a long time it must be broken up (cooperative multitasking)
Tasks in reactive applications are scheduled by the application not the operating system, that is why they can be faster and use less memory. When you write a Reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
I am a big fan of reactive architectures but they come with costs. I am not sure I would write my first c++ application as reactive, I normally try to learn one thing at a time.
If you decide to use a reactive architecture use a good framework that will help you design and structure your code or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? can it only come in from an external event (eg network request)
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results?
I cannot recommend a C++ library for this, I now do my server development in Scala and Akka which provide all of this with an excellent composable futures library to keep the code clean.
Best of luck learning C++ and with which ever choice you make.
Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost Thread plus Boost ASIO. You could also try a C++11 std thread library, and std mutex (but Boost ASIO is better than mutexes in a lot of cases, just always callback to the same thread and you don't need protected regions). Stay away from std future, cause it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core. 8 cores -> 8 threads. Plus maybe a few extra, if you think it's possible that your threadpool threads might call blocking operations sometimes.
FWIW, Poco supports option 2 (ParallelReactor) since version 1.5.1
I think that option 2 is the best one. As for tuning of the pool size, I think the pool should be adaptive. It should be able to spawn more threads (with some high hard limit) and remove excessive threads in times of low activity.
as the analogy you linked to (and it's comments) suggest. this is somewhat application dependent. now what you are building here is a game server. let's analyze that.
game servers (generally) do a lot of I/O and relatively few calculations, so they are far from 100% CPU applications.
on the other hand they also usually change values in some database (a "game world" model). all players create reads and writes to this database. which is exactly the intersection problem in the analogy.
so while you may gain some from handling the I/O in separate threads, you will also lose from having separate threads accessing the same database and waiting for its locks.
so either option 1 or 2 are acceptable in your situation. for scalability reasons I would not recommend option 3.

TBB vs. Homegrown Workqueue

I know TBB (Thread Building Blocks) claim to have a sophisticated engine, but from the algorithmic point of view:
If we had (say on Linux) a workqueue that has N working-threads (POSIX threads, N is the number of cores) and a mutex-synchronized queue of tasks, each working thread then taking a task from the queue when idle, also some synchronization calls, what else could TBB offer, not counting nice C++ syntax? I don't see a better algorithm than greedy assignment of tasks to cores.
As somebody who has developed their own work-stealing scheduler, I can say the following:
Don’t write your own scheduler (and a work-queue counts here).
You’ll either do it inefficiently, or you’ll do it wrong.
In fact, it’s not that hard to write a correct scheduler. Unfortunately, it is hard if you want to do it efficiently. An efficient scheduler effectively precludes the use of locks (except perhaps in very specific, well-specified situations) and lock-free cross-thread communication is a world of pain.
As an anecdote, I actually implemented one scheduler where I essentially had to copy the existing algorithm into code and I still managed to introduce almost any race condition imaginable into the code. Debugging this code was a mixture of
writing huge, convoluted test cases (just to pick up the occasional failure which only occurred in < 1% of the runs),
spending hours on end just staring at the code, trying to figure out the error by applying logic
tracing each single line in the debugger (which would crash without stack trace once an error occurred), keeping track of the state of all variables in all threads manually just to be sure that the actual state of the program matched the expected state
reducing the code several times essentially down to zero and rebuilding, commenting out single lines or pairs of lines to see the effect (huge combinatorial space), and
running against walls, head first.
Not knowing the precise implementation of TBB, I cannot say what exactly it offers, but since you said "what could it offer"...
Among others,
It could offer lockfree queueing and unqueueing instead of one syscall and context switch per task. This is harder to implement than it sounds.
It could, in addition, offer blocking of worker threads if the queue is empty. This again, is harder to implement than it sounds.
It could offer work stealing.
It could offer LIFO task-to-thread assignment in the same way Windows completion ports work (improving cache efficiency).
It could be bug-free. This, again, is something harder to implement than you think.

Multithreading vs multiprocessing

I am new to this kind of programming and need your point of view.
I have to build an application but I can't get it to compute fast enough. I have already tried Intel TBB, and it is easy to use, but I have never used other libraries.
In multiprocessor programming, I am reading about OpenMP and Boost for the multithreading, but I don't know their pros and cons.
In C++, when is multi threaded programming advantageous compared to multiprocessor programming and vice versa?Which is best suited to heavy computations or launching many tasks...? What are their pros and cons when we build an application designed with them? And finally, which library is best to work with?
Multithreading means exactly that, running multiple threads. This can be done on a uni-processor system, or on a multi-processor system.
On a single-processor system, when running multiple threads, the actual observation of the computer doing multiple things at the same time (i.e., multi-tasking) is an illusion, because what's really happening under the hood is that there is a software scheduler performing time-slicing on the single CPU. So only a single task is happening at any given time, but the scheduler is switching between tasks fast enough so that you never notice that there are multiple processes, threads, etc., contending for the same CPU resource.
On a multi-processor system, the need for time-slicing is reduced. The time-slicing effect is still there, because a modern OS could have hundred's of threads contending for two or more processors, and there is typically never a 1-to-1 relationship in the number of threads to the number of processing cores available. So at some point, a thread will have to stop and another thread starts on a CPU that the two threads are sharing. This is again handled by the OS's scheduler. That being said, with a multiprocessors system, you can have two things happening at the same time, unlike with the uni-processor system.
In the end, the two paradigms are really somewhat orthogonal in the sense that you will need multithreading whenever you want to have two or more tasks running asynchronously, but because of time-slicing, you do not necessarily need a multi-processor system to accomplish that. If you are trying to run multiple threads, and are doing a task that is highly parallel (i.e., trying to solve an integral), then yes, the more cores you can throw at a problem, the better. You won't necessarily need a 1-to-1 relationship between threads and processing cores, but at the same time, you don't want to spin off so many threads that you end up with tons of idle threads because they must wait to be scheduled on one of the available CPU cores. On the other hand, if your parallel tasks requires some sequential component, i.e., a thread will be waiting for the result from another thread before it can continue, then you may be able to run more threads with some type of barrier or synchronization method so that the threads that need to be idle are not spinning away using CPU time, and only the threads that need to run are contending for CPU resources.
There are a few important points that I believe should be added to the excellent answer by #Jason.
First, multithreading is not always an illusion even on a single processor - there are operations that do not involve the processor. These are mainly I/O - disk, network, terminal etc. The basic form for such operation is blocking or synchronous, i.e. your program waits until the operation is completed and then proceeds. While waiting, the CPU is switched to another process/thread.
if you have anything you can do during that time (e.g. background computation while waiting for user input, serving another request etc.) you have basically two options:
use asynchronous I/O: you call a non-blocking I/O providing it with a callback function, telling it "call this function when you are done". The call returns immediately and the I/O operation continues in the background. You go on with the other stuff.
use multithreading: you have a dedicated thread for each kind of task. While one waits for the blocking I/O call, the other goes on.
Both approaches are difficult programming paradigms, each has its pros and cons.
with async I/O the logic of the program's logic is less obvious and is difficult to follow and debug. However you avoid thread-safety issues.
with threads, the challange is to write thread-safe programs. Thread safety faults are nasty bugs that are quite difficult to reproduce. Over-use of locking can actually lead to degrading instead of improving the performance.
(coming to the multi-processing)
Multithreading made popular on Windows because manipulating processes is quite heavy on Windows (creating a process, context-switching etc.) as opposed to threads which are much more lightweight (at least this was the case when I worked on Win2K).
On Linux/Unix, processes are much more lightweight. Also (AFAIK) threads on Linux are implemented actually as a kind of processes internally, so there is no gain in context-switching of threads vs. processes. However, you need to use some form of IPC (inter-process communications), as shared memory, pipes, message queue etc.
On a more lite note, look at the SQLite FAQ, which declares "Threads are evil"! :)
To answer the first question:
The best approach is to just use multithreading techniques in your code until you get to the point where even that doesn't give you enough benefit. Assume the OS will handle delegation to multiple processors if they're available.
If you actually are working on a problem where multithreading isn't enough, even with multiple processors (or if you're running on an OS that isn't using its multiple processors), then you can worry about discovering how to get more power. Which might mean spawning processes across a network to other machines.
I haven't used TBB, but I have used IPP and found it to be efficient and well-designed. Boost is portable.
Just wanted to mention that the Flow-Based Programming ( http://www.jpaulmorrison.com/fbp ) paradigm is a naturally multiprogramming/multiprocessing approach to application development. It provides a consistent application view from high level to low level. The Java and C# implementations take advantage of all the processors on your machine, but the older C++ implementation only uses one processor. However, it could fairly easily be extended to use BOOST (or pthreads, I assume) by the use of locking on connections. I had started converting it to use fibers, but I'm not sure if there's any point in continuing on this route. :-) Feedback would be appreciated. BTW The Java and C# implementations can even intercommunicate using sockets.