C++ Concurrency, Coroutines & Job Scheduling?

I'm trying to get my head around multithreading in C++, to come up with a general-purpose implementation that suits me. Everyone has a different implementation; Awesome CPP lists 39 libraries. It seems to me, though, that this is a logistical problem of the same ilk as any logistical scheduling problem in any other field.
In my head, there are two obvious ways to repeatedly perform the job abc:
Split abc into 3 separate tasks: a, b & c. Spawn x threads. Have a queue. Jobs coming in get added to the queue. Each thread grabs the next task from the queue and, at the end of the task, puts the job back into the queue for its next task. The threads can either access the queue directly, or they can all communicate with a central 'manager' or 'scheduler' thread that serves them their tasks. (A rough sketch of this follows below.)
Perform abc sequentially and independently on x separate threads (parallelism).
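To illustrate (1), here's a rough sketch of what I mean, with placeholder stages and no shutdown or blocking logic (the Job and worker names are just mine for illustration):

    #include <array>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <queue>

    struct Job { int stage = 0; int payload = 0; };

    // The three sub-tasks a, b, c of the job "abc" (placeholder bodies).
    const std::array<std::function<void(Job&)>, 3> stages = {
        [](Job& j) { j.payload += 1; },   // a
        [](Job& j) { j.payload *= 2; },   // b
        [](Job& j) { j.payload -= 3; },   // c
    };

    std::mutex m;
    std::queue<Job> jobs;

    // Body of each of the x worker threads.
    void worker() {
        for (;;) {
            std::optional<Job> job;
            {
                std::lock_guard<std::mutex> lk(m);
                if (!jobs.empty()) { job = jobs.front(); jobs.pop(); }
            }
            if (!job) continue;            // a real queue would block here, not spin
            stages[job->stage](*job);      // run the current sub-task
            if (++job->stage < 3) {        // not done: re-queue for the next sub-task
                std::lock_guard<std::mutex> lk(m);
                jobs.push(*job);
            }
        }
    }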
(1) has the problem that there is potentially a lot of overhead in keeping a queue and dealing with race conditions on it. (1) is otherwise intuitive and makes sense to me. It's what I would do with a real-life problem; it's literally how companies work in the real world.
(2) has the problem that any blocking causes the whole thread to block, idling that CPU thread. (2) is also far less flexible, and applicable in fewer use cases. On the plus side, it has no overhead between tasks.
Question 1: Doesn't (1) also have the same blocking problem? If a thread reads from a file, it'll have to wait for the disk. How is that usually addressed? Is there some way to yield back temporarily while a task is doing something like reading from or writing to disk, or is this usually addressed simply by having more threads running than there are CPU threads and hoping not too many block at once?
It seems to me that (1) is clearly the better solution, except that it restricts the tasks to medium or large granularity. It would be pointless to use it for something like parallelizing straightforward math (just an example) because handling the queue would take longer than the actual processing of the task. Hence the value of (1) for any given task shrinks as the overhead of the storage mechanism (the queue) grows relative to the size of the task. This sounds fine on the surface, until you realize that the benefit of splitting work into tasks is itself better the smaller each task is. To put it simply: in theory you want each task to be small for overall efficiency, but in practice you want each task to be larger so as to minimize the overhead of the queue.
It's obvious that some storage mechanism is required, because you can't keep track of something without a recording mechanism. It doesn't have to be strictly a queue, but it must be some form of recording the task in memory while it waits to be picked up. The optimization of this queue (I'm using the word loosely; it need not strictly be a queue type) is then the #1 important factor here. The cheaper a task can receive its payload, the better.
Which leads me to Question 2: is this what C++20 coroutines are useful for? I've spent hours reading tutorials on coroutines, but it's still unclear what they're useful for. I think I get what they do. If I have it right, they allow a special type of function (a coroutine) to pause itself in the middle, yield control back to the caller along with a payload, and be resumed by the caller later. But why would I want to do that? And can't I do that just by splitting the function into two?
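To check my understanding, here's a minimal hand-rolled sketch of that pause/yield/resume cycle (C++20 ships no generator type; std::generator only arrives in C++23, so the Generator type below is my own illustrative scaffolding):

    #include <coroutine>
    #include <cstdio>
    #include <exception>

    struct Generator {
        struct promise_type {
            int current = 0;
            Generator get_return_object() {
                return Generator{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            std::suspend_always yield_value(int v) { current = v; return {}; }
            void return_void() {}
            void unhandled_exception() { std::terminate(); }
        };
        std::coroutine_handle<promise_type> h;
        explicit Generator(std::coroutine_handle<promise_type> h) : h(h) {}
        Generator(const Generator&) = delete;
        ~Generator() { if (h) h.destroy(); }
        bool next() { h.resume(); return !h.done(); }   // resume up to the next co_yield
        int value() const { return h.promise().current; }
    };

    Generator counter(int limit) {
        for (int i = 0; i < limit; ++i)
            co_yield i;   // pause here; the caller decides when to resume
    }

    int main() {
        Generator g = counter(3);
        while (g.next())
            std::printf("%d\n", g.value());   // prints 0, 1, 2
    }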
Question 3: Are coroutines meant to be used by a task scheduler thread to somehow optimize the queuing? Or is the point just to allow you to write code linearly and then put those yields in it to break it up? In which case it wouldn't be useful for me if I already had my jobs split up into separate tasks by design?
Question 4: Am I trying to reinvent the wheel here? Has this problem already been solved? And if so, why are there so many different implementations?

Q1: No, it more likely has a different blocking problem.
Q2: Co-routines have many applications; try substituting for X in "is this what X is for?" X = { while, if, return, pointer, ... }. Don't look to standards bodies (particularly that one) for insight; they are best at punctuation and spell checking.
Q3: Co-routines can be used to optimise various constructions, but the real goal of using such a formalism is to make your program as natural an expression of the problem as possible. One of the better examples of Co-routines used intelligently is the Go-routines of Go.
Q4: Probably; almost definitely; because many of the solutions are inadequate.
Q1+Q4. There is no single blocking problem, some that come to mind are: Deadlock, Livelock, unnecessarily sequential, non-Scalable, Slow. Some structures {{ threads, coroutines, threads + coroutines } * { locks, conditions, message passing }} help solve some of these problems, but induce others. My favourite is { (threads + coroutines) * (message passing) }, which is typically good for everything but Slow.

Related

Threading vs Task-Based vs Asynchronous Programming

I'm new to this concept. Are these the same or different things? What is the difference? I really like the idea of being able to run two processes at once, for example if I have several large files to load into my program I'd love to load as many of them simultaneously as possible instead of waiting for one at a time. And when working with a large file, such as a WAV file, it would be great to break it into pieces and process several chunks at once and then put them back together. What do I want to look into to learn how to do this sort of thing?
Edit: Also, I know using more than one core on a multicore processor fits in here somewhere, but apparently asynchronous programming doesn't necessarily mean you are using multiple cores? Why would you do this if you didn't have multiple cores to take advantage of?
They are related but different.
Threading, normally called multi-threading, refers to the use of multiple threads of execution within a single process. This usually refers to the simple case of using a small set of threads each doing different tasks that need to be, or could benefit from, running simultaneously. For example, a GUI application might have one thread draw elements, another thread respond to events like mouse clicks, and another thread do some background processing.
However, when the number of threads, each doing their own thing, is taken to an extreme, we usually start to talk about an Agent-based approach.
The task-based approach refers to a specific strategy in software engineering where, in abstract terms, you dynamically create "tasks" to be accomplished, and these tasks are picked up by a task manager that assigns them to threads that can accomplish them. This is more of a software-architecture thing. The advantage here is that the execution of the whole program becomes a succession of tasks being relayed (task A finished -> trigger task B; when both task B and task C are done -> trigger task D; etc.), instead of a big function or program that executes each task one after the other. This gives flexibility when it is unclear which tasks will take more time than others, and when tasks are only loosely coupled. This is usually implemented with a thread pool (threads that are waiting to be assigned a task) and some message-passing mechanism to communicate data and task "contracts".
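As a minimal sketch of that relaying with standard C++ futures (taskA through taskD are hypothetical placeholders; a real system would dispatch to a thread pool rather than through bare std::async):

    #include <future>

    int taskA() { return 1; }
    int taskB(int a) { return a + 1; }
    int taskC(int a) { return a * 2; }
    int taskD(int b, int c) { return b + c; }

    int main() {
        int a = taskA();                                     // task A finished...
        auto fb = std::async(std::launch::async, taskB, a);  // ...triggers task B
        auto fc = std::async(std::launch::async, taskC, a);  // ...and task C
        return taskD(fb.get(), fc.get());                    // both done -> task D
    }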
Asynchronous programming does not refer to multi-threaded programming, although the two are very often associated (and work well together). A synchronous program must complete each step before moving on to the next. An asynchronous program starts a step, moves on to other steps that don't require the result of the first step, then checks on the result of the first step when its result is required.
That is, a synchronous program might go a little bit like this: "do this task", "wait until done", "do something with the result", and "move on to something else". By contrast, an asynchronous program might go a little more like this: "I'm gonna start a task, and I'll need the result later, but I don't need it just now", "in the meantime, I'll do something else", "I can't do anything else until I have the result of the first step now, so I'll wait for it, if it isn't ready", and "move on to something else".
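In standard C++, that narration looks roughly like this (compute_answer stands in for the slow first step):

    #include <future>
    #include <iostream>

    int compute_answer() { return 42; }   // placeholder for a slow operation

    int main() {
        // "I'm gonna start a task, and I'll need the result later."
        std::future<int> answer = std::async(std::launch::async, compute_answer);

        // "In the meantime, I'll do something else."
        std::cout << "doing other work...\n";

        // "I can't do anything else until I have the result, so I'll wait for it."
        std::cout << "answer: " << answer.get() << "\n";
    }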
Notice that "asynchronous" refers to a very broad concept, that always involves some form of "start some work and tell me when it's done" instead of the traditional "do it now!". This does not require multi-threading, in which case it just becomes a software design choice (which often involves callback functions and things like that to provide "notification" of the asynchronous result). With multiple threads, it becomes more powerful, as you can do various things in parallel while the asynchronous task is working. Taken to the extreme, it can become a more full-blown architecture like a task-based approach (which is one kind of asynchronous programming technique).
I think the thing that you want corresponds more to yet another concept: Parallel Computing (or parallel processing). This approach is more about splitting a large processing task into smaller parts and processing all parts in parallel, and then combining the results. You should look into libraries like OpenMP or OpenCL/CUDA (for GPGPU). That said, you can use multi-threading for parallel processing.
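For the WAV-chunking example from the question, a rough OpenMP sketch might look like this (compile with -fopenmp; process_chunk is a hypothetical stand-in for the real per-chunk work):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Hypothetical per-chunk work, e.g. some filter over WAV samples.
    void process_chunk(std::vector<float>& samples, std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            samples[i] *= 0.5f;   // placeholder: halve the amplitude
    }

    void process_all(std::vector<float>& samples, std::size_t chunk) {
        const long long nchunks =
            static_cast<long long>((samples.size() + chunk - 1) / chunk);
        #pragma omp parallel for   // chunks are distributed across cores
        for (long long c = 0; c < nchunks; ++c) {
            const std::size_t begin = static_cast<std::size_t>(c) * chunk;
            const std::size_t end = std::min(begin + chunk, samples.size());
            process_chunk(samples, begin, end);
        }
    }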
but apparently asynchronous programming doesn't necessarily mean you are using multiple cores?
Asynchronous programming does not necessarily involve anything happening concurrently in multiple threads. It could mean that the OS is doing things on your behalf behind the scenes (and will notify you when that work is finished), like in asynchronous I/O, which happens without you creating any threads. It boils down to being a software design choice.
Why would you do this if you didn't have multiple cores to take advantage of?
If you don't have multiple cores, multi-threading can still improve performance by reusing "waiting time" (e.g., don't "block" the processing waiting on file or network I/O, or waiting on the user to click a mouse button). That means the program can do useful work while waiting on those things. Beyond that, it can provide flexibility in the design and make things seem to run simultaneously, which often makes users happier. Still, you are correct that before multi-core CPUs, there wasn't as much of an incentive to do multi-threading, as the gains often do not justify the overhead.
I think that, in general, all of these are design-related rather than language-related. The same applies to multicore programming.
To echo Jim, it's not only the file-loading scenario. Generally, you need to design the whole piece of software to run concurrently in order to feel the real benefit of multi-threading, task-based, or asynchronous programming.
Try to see things from a big-picture point of view: understand the overall modelling of a specific example and see how these methodologies are implemented. That makes it easy to see the differences, and helps you understand when and where to use which.

TBB vs. Homegrown Workqueue

I know TBB (Threading Building Blocks) claims to have a sophisticated engine, but from the algorithmic point of view:
If we had (say, on Linux) a workqueue with N worker threads (POSIX threads, where N is the number of cores) and a mutex-synchronized queue of tasks, with each worker thread taking a task from the queue when idle, plus some synchronization calls, what else could TBB offer, not counting the nice C++ syntax? I don't see a better algorithm than greedy assignment of tasks to cores.
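Concretely, I mean something like this rough sketch (no error handling; idle workers block on a condition variable):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class WorkQueue {
    public:
        explicit WorkQueue(unsigned n = std::thread::hardware_concurrency()) {
            for (unsigned i = 0; i < n; ++i)
                workers_.emplace_back([this] { run(); });
        }
        ~WorkQueue() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_all();
            for (auto& t : workers_) t.join();
        }
        void submit(std::function<void()> task) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(task)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                    if (done_ && q_.empty()) return;
                    task = std::move(q_.front());
                    q_.pop();
                }
                task();   // greedy: run the task outside the lock
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> q_;
        std::vector<std::thread> workers_;
        bool done_ = false;
    };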
As somebody who has developed their own work-stealing scheduler, I can say the following:
Don’t write your own scheduler (and a work-queue counts here).
You’ll either do it inefficiently, or you’ll do it wrong.
In fact, it’s not that hard to write a correct scheduler. Unfortunately, it is hard if you want to do it efficiently. An efficient scheduler effectively precludes the use of locks (except perhaps in very specific, well-specified situations) and lock-free cross-thread communication is a world of pain.
As an anecdote, I actually implemented one scheduler where I essentially had to copy an existing algorithm into code, and I still managed to introduce almost every race condition imaginable. Debugging this code was a mixture of:
writing huge, convoluted test cases (just to pick up the occasional failure which only occurred in < 1% of the runs),
spending hours on end just staring at the code, trying to figure out the error by applying logic
tracing each single line in the debugger (which would crash without stack trace once an error occurred), keeping track of the state of all variables in all threads manually just to be sure that the actual state of the program matched the expected state
reducing the code several times essentially down to zero and rebuilding, commenting out single lines or pairs of lines to see the effect (huge combinatorial space), and
running against walls, head first.
Not knowing the precise implementation of TBB, I cannot say what exactly it offers, but since you said "what could it offer"...
Among others,
It could offer lock-free queueing and dequeueing instead of one syscall and context switch per task. This is harder to implement than it sounds.
It could, in addition, offer blocking of worker threads if the queue is empty. This again, is harder to implement than it sounds.
It could offer work stealing.
It could offer LIFO task-to-thread assignment in the same way Windows completion ports work (improving cache efficiency).
It could be bug-free. This, again, is something harder to implement than you think.

Testing concurrent data structures

What are some methods for testing concurrent data structures to make sure they behave correctly when accessed from multiple threads?
All of the other answers have focused on actually testing the code by putting it through its paces and actually running it in one form or another or politely saying "don't do it yourself, use an existing library".
This is great and all, but IMO the most important test (practical tests are important too) is to look at the code line by line and, for every line, ask "what happens if I get interrupted by another thread here?" Imagine another thread running just about any of the other lines/functions during this interruption. Do things still stay consistent? When competing for resources, do the other threads block or spin?
This is what we did in school when learning about concurrency and it is a surprisingly effective approach. Bottom line, I feel that taking the time to prove to yourself that things are consistent and work as expected in all states is the first technique you should use when dealing with this stuff.
Concurrent systems are probabilistic and errors are often difficult to replicate. Therefore you need to run various input/output cases, each tested over time (hours, days, etc) in order to detect possible errors.
Tests for a concurrent data structure involve examining the container's state before and after expected events such as insert and delete.
Use a pre-existing, pre-tested library that meets your needs if possible.
Make sure that the code has appropriate self-consistency checks (preferably fast sanity checks), and run your code on as many different types of hardware as possible to help narrow down interesting timing problems.
Have multiple people peer review the code, preferably without a pre-explanation of how it's supposed to work. That way they have to grok the code which should help catch more bugs.
Set up a bunch of threads that do nothing but random operations on the data structures and check for consistency at some rate (a sketch of this follows below).
Start with the assumption that your calls to access/modify data are not thread safe and use locks to ensure only a single thread can access/modify any part of the data at a time. Only after you can prove to yourself that a specific type of access is safe outside of the lock by multiple threads at once should you move that code outside of the lock.
Assume worst case scenarios, e.g. that your code will stop right in the middle of some pointer manipulation or another critical point, and that another thread will encounter that data in mid-transition. If that would have a bad result, leave it within the lock.
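Here is a minimal sketch of the random-operations stress test mentioned above (SafeCounter is a hypothetical stand-in for the structure under test):

    #include <atomic>
    #include <cassert>
    #include <thread>
    #include <vector>

    struct SafeCounter {   // stand-in for the concurrent structure under test
        std::atomic<long> value{0};
        void add(long d) { value.fetch_add(d, std::memory_order_relaxed); }
    };

    int main() {
        SafeCounter c;
        const int threads = 8, iters = 100000;
        std::vector<std::thread> pool;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back([&] {
                for (int i = 0; i < iters; ++i) c.add(1);
            });
        for (auto& t : pool) t.join();
        // Consistency check: every increment must be accounted for.
        assert(c.value.load() == static_cast<long>(threads) * iters);
    }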
I normally test these kinds of things by interjecting sleep() calls at appropriate places in the distributed threads/processes.
For instance, to test a lock, put sleep(2) in all your threads at the point of contention, and spawn two threads roughly 1 second apart. The first one should obtain the lock, and the second should have to wait for it.
Most race conditions can be tested by extending this method, but if your system has too many components it may be difficult or impossible to know every possible condition that needs to be tested.
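A minimal C++ sketch of that lock test (the timings are illustrative only):

    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    std::mutex m;

    void contender(const char* name) {
        std::lock_guard<std::mutex> lk(m);
        std::printf("%s acquired the lock\n", name);
        std::this_thread::sleep_for(std::chrono::seconds(2));   // hold the lock
    }

    int main() {
        std::thread t1(contender, "first");
        std::this_thread::sleep_for(std::chrono::seconds(1));   // ~1 second apart
        std::thread t2(contender, "second");                    // must wait ~1 second
        t1.join();
        t2.join();
    }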
Run your concurrent threads for one or a few days and look at what happens. (It sounds strange, but finding race conditions is such a complex topic that simply trying it out is the best approach.)

Large number of simultaneous long-running operations in Qt

I have some long-running operations that number in the hundreds. At the moment they are each on their own thread. My main goal in using threads is not to speed these operations up. The more important thing in this case is that they appear to run simultaneously.
I'm aware of cooperative multitasking and fibers. However, I'm trying to avoid anything that would require touching the code in the operations, e.g. peppering them with things like yieldToScheduler(). I also don't want to prescribe that these routines be restyled to emit queues of bite-sized task items... I want to treat them as black boxes.
For the moment I can live with these downsides:
Maximum # of threads tend to be O(1000)
Cost per thread is O(1MB)
To address the bad cache performance due to context-switches, I did have the idea of a timer which would juggle the priorities such that only idealThreadCount() threads were ever at Normal priority, with all the rest set to Idle. This would let me widen the timeslices, which would mean fewer context switches and still be okay for my purposes.
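Roughly, the juggling I have in mind is something like this sketch, invoked periodically from a timer (a real version would also rotate which threads get to be active):

    #include <QList>
    #include <QThread>

    // Keep at most idealThreadCount() operations at Normal priority,
    // demote everything else to Idle.
    void jugglePriorities(const QList<QThread*>& threads) {
        const int active = QThread::idealThreadCount();
        for (int i = 0; i < threads.size(); ++i)
            threads[i]->setPriority(i < active ? QThread::NormalPriority
                                               : QThread::IdlePriority);
    }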
Question #1: Is that a good idea at all? One certain downside is it won't work on Linux (docs say no QThread::setPriority() there).
Question #2: Any other ideas or approaches? Is QtConcurrent thinking about this scenario?
(Some related reading: how-many-threads-does-it-take-to-make-them-a-bad-choice, many-threads-or-as-few-threads-as-possible, maximum-number-of-threads-per-process-in-linux)
IMHO, this is a very bad idea. If I were you, I would try really, really hard to find another way to do this. You're combining two really bad ideas: creating a truck load of threads, and messing with thread priorities.
You mention that these operations only need to appear to run simultaneously. So why not try to find a way to make them appear to run simultaneously, without literally running them simultaneously?
It's been 6 months, so I'm going to close this.
Firstly I'll say that threads serve more than one purpose. One is speedup...and a lot of people are focusing on that in the era of multi-core machines. But another is concurrency, which can be desirable even if it slows the system down when taken as a whole. Yet concurrency can be achieved using mechanisms more lightweight than threads, although it may complicate the code.
So this is just one of those situations where the trade-off of programmer convenience against user experience must be tuned to fit the target environment. It's like how Google's process-per-tab approach with Chrome would have been ill-advised in the era of Mosaic (even if process isolation was preferable, all else being equal). If the OS, memory, and CPU couldn't deliver a good browsing experience that way, they wouldn't do it that way now.
Similarly, creating a lot of threads when there are independent operations you want to be concurrent saves you the trouble of sticking in your own scheduler and yield() operations. It may be the cleanest way to express the code, but if it chokes the target environment then something different needs to be done.
So I think I'll settle on the idea that in the future, when our hardware is better than it is today, we probably won't have to worry about how many threads we make. But for now I'll take it on a case-by-case basis. I.e., if I have 100 instances of concurrent task class A, 10 of class B, and 3 of class C... then switching A to a fiber-based solution and giving it a pool of a few threads is probably worth the extra complication.

More threads, better performance?

When I write a message-driven app, much like a standard Windows app except that it uses messaging extensively for internal operations, what would be the best approach to threading?
As I see it, there are basically three approaches (if you have any other setup in mind, please share):
Having a single thread process all of the messages.
Having separate threads for separate message types (General, UI, Networking, etc...)
Having multiple threads that share and process a single message queue.
So, would there be any significant performance differences between the three?
Here are some general thoughts:
Obviously, the last two options benefit from a situation where there's more than one processor. Plus, if any thread is waiting for an external event, other threads can still process unrelated messages. But ignoring that, it seems that multiple threads only add overhead (thread switches, not to mention more complicated synchronization situations).
And another question: would you recommend implementing such a system on top of the standard Windows messaging system, or implementing a separate queue mechanism? And why?
The specific choice of threading model should be driven by the nature of the problem you are trying to solve. There isn't necessarily a single "correct" approach to designing the threading model for such an application. However, if we adopt the following assumptions:
messages arrive frequently
messages are independent and don't rely too heavily on shared resources
it is desirable to respond to an arriving message as quickly as possible
you want the app to scale well across processing architectures (i.e. multicore/multi-CPU systems)
scalability is the key design requirement (e.g. more messages at a faster rate)
resilience to thread failure / long operations is desirable
In my experience, the most effective threading architecture would be to employ a thread pool. All messages arrive on a single queue, multiple threads wait on the queue and process messages as they arrive. A thread pool implementation can model all three thread-distribution examples you have.
#1 Single thread processes all messages => thread pool with only one thread
#2 Thread per message type => thread pool with N threads (one per type), each thread peeking at the queue for its appropriate message type
#3 Multiple threads for all messages => thread pool with multiple threads
The benefit of this design is that you can scale the number of threads in the pool in proportion to the processing environment or the message load. The number of threads can even be scaled at runtime to adapt to the real-time message load being experienced.
There are many good thread pooling libraries available for most platforms, including .NET, C++/STL, Java, etc.
As to your second question, whether to use the standard Windows message dispatch mechanism: this mechanism comes with significant overhead and is really only intended for pumping messages through a Windows application's UI loop. Unless this is the problem you are trying to solve, I would advise against using it as a general message-dispatching solution. Furthermore, Windows messages carry very little data: it is not an object-based model. Each Windows message has a code and a 32-bit parameter, which may not be enough to base a clean messaging model on. Finally, the Windows message queue is not designed to handle cases like queue saturation, thread starvation, or message re-queuing; these cases often arise when implementing a decent message-queuing solution.
We can't tell you much for sure without knowing the workload (i.e., the statistical distribution of events over time), but in general:
a single queue with multiple servers is at least as fast, and usually faster, so (1) and (3) would be preferable to (2).
multiple threads in most languages add complexity because of the need to avoid contention and multiple-writer problems
long duration processes can block processing for other things that could get done quicker.
So my horseback guess is that having a single event queue, with several server threads taking events off the queue, might be a little faster.
Make sure you use a thread-safe data structure for the queue.
It all depends.
For example:
Events in a GUI queue are best handled by a single thread, as there is an implied order to the events and thus they need to be processed serially. This is why most GUI apps have a single thread to handle events, though potentially multiple threads to create them (and it does not preclude the event thread from creating a job and handing it off to a worker pool; see below).
Events on a socket can potentially be handled in parallel (assuming HTTP), as each request is stateless and can thus be processed independently (OK, I know that is oversimplifying HTTP).
Work jobs, where each job is independent and placed on a queue: this is the classic case for a set of worker threads. Each thread performs a potentially long operation independently of the other threads, and on completion comes back to the queue for another job.
In general, don't worry about the overhead of threads. It's not going to be an issue if you're talking about merely a handful of them. Race conditions, deadlocks, and contention are a bigger concern, and if you don't know what I'm talking about, you have a lot of reading to do before you tackle this.
I'd go with option 3, using whatever abstractions my language of choice offers.
Note that there are two different performance goals, and you haven't stated which you are targeting: throughput and responsiveness.
If you're writing a GUI app, the UI needs to be responsive. You don't care how many clicks per second you can process, but you do care about showing some response within a tenth of a second or so (ideally less). This is one of the reasons it's best to have a single thread devoted to handling the GUI (other reasons have been mentioned in other answers). The GUI thread needs to basically convert Windows messages into work items and let your worker queue handle the heavy work. Once a worker is done, it notifies the GUI thread, which then updates the display to reflect any changes. It does things like painting a window, but not rendering the data to be displayed. This gives the app the quick "snappiness" that most users want when they talk about performance. They don't care if it takes 15 seconds to do something hard, as long as when they click on a button or a menu, it reacts instantly.
The other performance characteristic is throughput. This is the number of jobs you can process in a specific amount of time. Usually this type of performance tuning is only needed on server-type applications or other heavy-duty processing. It measures how many web pages can be served up in an hour, or how long it takes to render a DVD. For these sorts of jobs, you want to have 1 active thread per CPU. Fewer than that, and you're going to be wasting idle clock cycles. More than that, and the threads will be competing for CPU time and tripping over each other. Take a look at the second graph in this DDJ article for the trade-off you're dealing with. Note that the ideal total thread count is higher than the number of available CPUs, due to things like blocking and locking; the key is the number of active threads.
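As a starting point, the active-thread budget can be derived from the hardware in standard C++ (a rough sketch; tune upward to cover blocked threads, as noted above):

    #include <algorithm>
    #include <thread>

    // hardware_concurrency() may return 0 when the count is unknown,
    // hence the fallback to a single worker.
    unsigned active_thread_budget() {
        return std::max(1u, std::thread::hardware_concurrency());
    }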
A good place to start is to ask yourself why you need multiple threads.
The well-thought-out answer to this question will lead you to the best answer to the subsequent question, "how should I use multiple threads in my application?"
And that must be a subsequent question, not the primary question. The first question must be why, not how.
I think it depends on how long each task will run. Does each message take the same amount of time to process, or will certain messages take a few seconds, for example? If I knew that message A was going to take 10 seconds to complete, I would definitely use a new thread, because why would I want to hold up the queue for a long-running task...
My 2 cents.
I think option 2 is the best. Having each thread do independent tasks would give you the best results. The third approach can cause more delays if multiple threads are doing I/O operations like disk reads, reading common sockets, and so on.
Whether to use the Windows messaging framework for processing requests depends on the workload each thread would have. I think Windows restricts the number of messages that can be queued to at most 10,000. For most cases this should not be an issue, but if you have lots of messages to be queued, it might be something to take into consideration.
A separate queue gives better control, in the sense that you may reorder it the way you want (perhaps depending on priority).
Yes, there will be performance differences between your choices.
(1) introduces a bottleneck for message processing
(3) introduces locking contention because you'll need to synchronize access to your shared queue.
(2) is starting to go in the right direction... though a queue for each message type is a little extreme. I'd probably recommend starting with a queue for each model in your app and adding queues where it makes sense to do so for improved performance.
If you like option #2, it sounds like you would be interested in implementing a SEDA architecture. It is going to take some reading to understand what is going on, but I think the architecture fits well with your line of thinking.
BTW, Yield is a good C++/Python hybrid implementation.
I'd have a thread pool servicing the message queue, and make the number of threads in the pool easily configurable (perhaps even at runtime). Then test it out with expected load.
That way you can see what the actual correlation is - and if your initial assumptions change, you can easily change your approach.
A more sophisticated approach would be for the system to introspect its own performance traits and adapt its use of resources, threads in particular, as it goes. Probably overkill for most custom application code, but I'm sure there are products out there that do that.
As for the Windows events question: I think that's an application-specific question to which there is no right or wrong answer in the general case. That said, I usually implement my own queue, as I can tailor it to the specific characteristics of the task at hand. Sometimes that might involve routing events via the Windows message queue.