When should I use concurrency in Go? - concurrency

So, besides handling multiple server requests, is there any other time that concurrency is relevant? I ask because it's so built into the language that I feel wasteful if I don't use it, but I can barely find a use for it.

Not an expert in Go (yet) but I'd say:
Whenever it is easiest to do so.
The beauty of the concurrency model in Go is that it is not fundamentally a multi-core architecture with checks and balances where things usually break - it is a multi-threaded paradigm that not only fits well into a multi-core architecture, it also fits well into a distributed system architecture.
You do not have to make special arrangements for multiple goroutines to work together harmoniously - they just do!
Here's an example of a naturally concurrent algorithm - I want to merge multiple channels into one. Once all of the input channels are exhausted I want to close the output channel.
It is just simpler to use concurrency - in fact it doesn't even look like concurrency - it looks almost procedural.
/*
Multiplex a number of channels into one.
*/
func Mux(channels []chan big.Int) chan big.Int {
    // Count down as each channel closes. When hits zero - close ch.
    var wg sync.WaitGroup
    wg.Add(len(channels))
    // The channel to output to.
    ch := make(chan big.Int, len(channels))
    // Make one go per channel.
    for _, c := range channels {
        go func(c <-chan big.Int) {
            // Pump it.
            for x := range c {
                ch <- x
            }
            // It closed.
            wg.Done()
        }(c)
    }
    // Close the channel when the pumping is finished.
    go func() {
        // Wait for everyone to be done.
        wg.Wait()
        // Close.
        close(ch)
    }()
    return ch
}
The only concession I have to make to concurrency here is the sync.WaitGroup, a counter that can safely be decremented from multiple goroutines.
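For illustration, a hypothetical caller might look like the following. The producer setup is invented for this sketch, and it assumes the Mux function above lives in the same package:
package main

import (
    "fmt"
    "math/big"
)

// Assumes Mux from the snippet above is defined in this package.
func main() {
    // Three hypothetical producers, each sending a few values and then closing its channel.
    inputs := make([]chan big.Int, 3)
    for i := range inputs {
        inputs[i] = make(chan big.Int)
        go func(c chan big.Int, start int64) {
            for j := int64(0); j < 3; j++ {
                c <- *big.NewInt(start + j)
            }
            close(c) // tells Mux this producer is finished
        }(inputs[i], int64(i*100))
    }
    // Drain the merged channel; it closes once every input channel has closed.
    for x := range Mux(inputs) {
        fmt.Println(x.String())
    }
}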
Note that this is not purely my own work - I had a great deal of help with this here.

Here is a good example from one of Go's inventors, Rob Pike, of using concurrency because it is an easier way to express the solution to a problem:
Lexical Scanning in Go
Generalizing on that a bit, any producer-consumer problem is a natural fit for 2 goroutines using a channel to pass outputs from the producer to the consumer.
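A minimal sketch of that shape (the names here are illustrative, not taken from the linked article):
package main

import "fmt"

// produce is the producer: it sends work items on a channel and closes it when done.
func produce(out chan<- int) {
    for i := 0; i < 5; i++ {
        out <- i * i
    }
    close(out)
}

func main() {
    items := make(chan int)
    go produce(items) // the producer runs in its own goroutine
    // The consumer is simply the main goroutine ranging over the channel.
    for v := range items {
        fmt.Println("consumed", v)
    }
}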
Another good use for concurrency is interacting with multiple input/output sources (disks, network, terminal, etc.). Your program should be able to wake up and do some work whenever a result comes from any of these sources. It is possible to do this with one thread and a system call like poll(2) or select(2). When your thread wakes up, it must figure out which result came in, find where it left off in the relevant task, and pick up from there. That's a lot of code you need to write.
Writing that code is much easier using one goroutine per task. Then the state of that task is captured implicitly in the goroutine, and picking up where it left off is as simple as waking up and running.
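As a hedged sketch of that idea (the input sources here are faked with timers rather than real disks or sockets), each source gets its own goroutine that is free to block, and a shared channel carries results back to whoever cares:
package main

import (
    "fmt"
    "time"
)

// watchSource stands in for a task that blocks on a disk, socket, terminal, etc.
// Its state between results is held implicitly in the goroutine's stack.
func watchSource(name string, delay time.Duration, results chan<- string) {
    for i := 1; i <= 3; i++ {
        time.Sleep(delay) // simply block; no poll/select bookkeeping required
        results <- fmt.Sprintf("%s: result %d", name, i)
    }
}

func main() {
    results := make(chan string)
    go watchSource("disk", 200*time.Millisecond, results)
    go watchSource("network", 300*time.Millisecond, results)
    go watchSource("terminal", 500*time.Millisecond, results)

    // Wake up and handle whichever result arrives next.
    for i := 0; i < 9; i++ {
        fmt.Println(<-results)
    }
}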

My 2 cents... If you think about channels/goroutines only in the context of concurrency, you are missing the boat.
While Go is not an object-oriented language or strictly a functional language, it does allow you to take design features from both and apply them.
One of the basic tenets of object-oriented design is the Single Responsibility Principle. Applying this principle forces you to think about design in terms of messages, rather than complex object behavior. These same design constraints can be used in Go, to allow you to start thinking about "messages on channels" connecting single-purpose functions.
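For instance, a toy pipeline of single-purpose stages joined by channels might look like this (the stage names are invented for illustration):
package main

import "fmt"

// Each function below has a single responsibility and talks to the rest of the
// program only through the messages on its channels.
func generate(out chan<- int) {
    for i := 1; i <= 5; i++ {
        out <- i
    }
    close(out)
}

func square(in <-chan int, out chan<- int) {
    for n := range in {
        out <- n * n
    }
    close(out)
}

func main() {
    nums := make(chan int)
    squares := make(chan int)
    go generate(nums)
    go square(nums, squares)
    for s := range squares {
        fmt.Println(s)
    }
}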
This is just one example, but if you start thinking this way, you'll see many more.

Also not an expert at Go, so some of my approaches may be non-canonical, but here are some ways I've found concurrency useful so far:
Doing operations while waiting for a network request, disk I/O, or database query to finish
Executing divide-and-conquer algorithms more quickly (a sketch follows this list)
Since goroutines run ordinary functions, and functions are first-class citizens in Go, you can pass them around as variables. This is convenient when your program has many autonomous pieces. (For example, I'm playing around with simulating a city's traffic system. Each vehicle is its own goroutine, and they communicate with intersections and other vehicles by using channels. Each one does its own thing.)
Simultaneous I/O operations across different devices
I used concurrency to run Dijkstra's algorithm on a set of points in an image to draw "intelligent scissors" lines -- one goroutine per point made this implementation significantly faster.
GoConvey uses concurrency to run tests across packages at the same time, to respond faster to changes while debugging using the web UI. (As an inherent bonus, this adds some pseudo-randomness to the testing sequence, so you can be more confident that your test results are truly consistent and don't secretly depend on execution order.)
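Picking up the divide-and-conquer point from the list above, here is a minimal sketch (the cut-off and data are made up) where each half of the work is summed in its own goroutine:
package main

import "fmt"

// sum splits the slice in half, sums one half in a new goroutine and the other
// half in the current one, and recurses until the pieces are small enough to
// add up directly.
func sum(xs []int) int {
    if len(xs) < 1000 { // arbitrary cut-off for this sketch
        total := 0
        for _, x := range xs {
            total += x
        }
        return total
    }
    mid := len(xs) / 2
    left := make(chan int)
    go func() { left <- sum(xs[:mid]) }()
    right := sum(xs[mid:])
    return <-left + right
}

func main() {
    xs := make([]int, 100000)
    for i := range xs {
        xs[i] = i
    }
    fmt.Println(sum(xs)) // 4999950000
}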
Concurrency could be (read: "is sometimes, but not necessarily always") useful when you have operations that could run independently of each other but would otherwise run sequentially. Even if those operations depend on data or some sort of signal from other goroutines at certain points, you can communicate that all with channels.
For some inspiration and an important distinction in these matters, and for some funny gopher pictures, see Concurrency is not Parallelism.

Any time an item of data doesn't depend on the previous one.

Related

C++ Concurrency, Coroutines & Job Scheduling?

I'm trying to get my head around multithreading in C++, to come up with a general purpose implementation that suits me. Everyone has a different implementation, Awesome CPP lists 39 libraries. It seems to me though that this is a logistical problem that is of the same ilk as any logistical scheduling problem in any field.
In my head, there are two obvious ways to repeatedly perform the job abc:
Split abc into 3 separate tasks: a, b & c. Spawn x threads. Have a queue. Jobs coming in get added to the queue. Each thread grabs the next task from the queue, and at the end of the task puts it back into the queue for the next task. They can either access the queue directly, or they can all communicate with a central 'manager' or 'scheduler' thread that serves them with their tasks.
Perform abc sequentially on x separate threads independently (parallelism.)
(1) has the problem that there is potentially a lot of overhead in keeping a queue and dealing with race conditions on it. (1) is otherwise intuitive and makes sense to me. It's what I would do in real life with a real life problem. It's literally how companies work in the real world.
(2) has the problem that any blocking causes the whole thread to block, idling that CPU thread. And (2) is far less flexible and applicable in fewer use-cases. On the plus side, it has no overhead between tasks.
Question 1: Doesn't (1) also have the same blocking problem? If a thread reads from a file, it'll have to wait for the disk. How is that usually addressed, is there some way to yield back temporarily while its doing something like reading or writing from disk, or is this usually addressed simply by having more threads running than there are CPU threads and hoping not too many block at once?
It seems to me that (1) is clearly the better solution, except that it restricts the tasks to medium- or large-scale ones. It would be pointless to use it for something like parallelizing straightforward math (just an example), because handling the queue would take longer than the actual processing of the task. Hence the value of (1) for any given task is inversely proportional to the difference between the overhead of the storage mechanism (the queue) and the size of the task. This sounds fine on the surface, until you realize that the efficiency of splitting into tasks is itself proportional to the size of the task. To put it simply: in theory you want each task to be small for overall efficiency, but in practice you want each task to be larger so as to minimize the overhead of the queue.
It's obvious that some storage mechanism is required, because you can't keep track of something without a recording mechanism. It doesn't have to be strictly a queue; any form of recording the task in memory while it waits to be picked up will do. The optimization of the queue (I'm using the word loosely, not strictly a queue type) is then the #1 important factor here. The cheaper a task can receive its payload, the better.
Which leads me to Question 2: is this what C++20 coroutines are useful for? I've spent hours reading tutorials on coroutines, but it's still unclear what they're useful for. I think I get what they do. If I have it right, they allow a special type of function (a coroutine) to pause itself in the middle, yield its processing back to the caller along with a payload, and the caller can later resume it. But why would I want to do that? And can't I do that just by splitting the function into two?
Question 3: Are coroutines meant to be used by a task scheduler thread to somehow optimize the queuing? Or is the point just to allow you to write code linearly and then put those yields in it to break it up? In which case it wouldn't be useful for me if I already had my jobs split up into separate tasks by design?
Question 4: Am I trying to reinvent the wheel here? Has this problem already been solved? And if so, why are there so many different implementations?
Q1: No, it more likely has a different blocking problem.
Q2: Co-routines have many applications; try substituting for X in "is this what X is for?" X = { while, if, return, pointer, ... }. Don't look to standards bodies (particularly that one) for insight; they are best at punctuation and spell checking.
Q3: Co-routines can be used to optimise various constructions, but the real goal of using such a formalism is to make your program as natural an expression of the problem as possible. One of the better examples of how co-routines can be intelligently used is the goroutines of Go.
Q4: Probably; almost definitely; because many of the solutions are inadequate.
Q1+Q4. There is no single blocking problem, some that come to mind are: Deadlock, Livelock, unnecessarily sequential, non-Scalable, Slow. Some structures {{ threads, coroutines, threads + coroutines } * { locks, conditions, message passing }} help solve some of these problems, but induce others. My favourite is { (threads + coroutines) * (message passing) }, which is typically good for everything but Slow.
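Since that answer points at Go's goroutines as the cleaner formalism, here is a rough sketch of approach (1) in Go - a channel acts as the queue and a small pool of workers pulls tasks from it. All of the names and sizes are illustrative, not prescriptive:
package main

import (
    "fmt"
    "sync"
)

type job struct {
    id int
}

// worker pulls jobs from the shared queue until it is closed.
func worker(id int, jobs <-chan job, wg *sync.WaitGroup) {
    defer wg.Done()
    for j := range jobs {
        // Stand-in for the real work; a blocking call here parks only this worker.
        fmt.Printf("worker %d finished job %d\n", id, j.id)
    }
}

func main() {
    jobs := make(chan job, 16) // the "queue"; the channel handles the locking for us
    var wg sync.WaitGroup
    for w := 0; w < 4; w++ { // spawn x workers
        wg.Add(1)
        go worker(w, jobs, &wg)
    }
    for i := 0; i < 10; i++ { // jobs coming in get added to the queue
        jobs <- job{id: i}
    }
    close(jobs) // no more jobs; the workers drain the queue and exit
    wg.Wait()
}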

Threading vs Task-Based vs Asynchronous Programming

I'm new to this concept. Are these the same or different things? What is the difference? I really like the idea of being able to run two processes at once; for example, if I have several large files to load into my program, I'd love to load as many of them simultaneously as possible instead of waiting for one at a time. And when working with a large file, such as a wav file, it would be great to break it into pieces and process several chunks at once and then put them back together. What do I want to look into to learn how to do this sort of thing?
Edit: Also, I know using more than one core on a multicore processor fits in here somewhere, but apparently asynchronous programming doesn't necessarily mean you are using multiple cores? Why would you do this if you didn't have multiple cores to take advantage of?
They are related but different.
Threading, normally called multi-threading, refers to the use of multiple threads of execution within a single process. This usually refers to the simple case of using a small set of threads each doing different tasks that need to be, or could benefit from, running simultaneously. For example, a GUI application might have one thread draw elements, another thread respond to events like mouse clicks, and another thread do some background processing.
However, when the number of threads, each doing their own thing, is taken to an extreme, we usually start to talk about an Agent-based approach.
The task-based approach refers to a specific strategy in software engineering where, in abstract terms, you dynamically create "tasks" to be accomplished, and these tasks are picked up by a task manager that assigns the tasks to threads that can accomplish them. This is more of a software architectural thing. The advantage here is that the execution of the whole program is a succession of tasks being relayed (task A finished -> trigger task B, when both task B and task C are done -> trigger task D, etc..), instead of having to write a big function or program that executes each task one after the other. This gives flexibility when it is unclear which tasks will take more time than others, and when tasks are only loosely coupled. This is usually implemented with a thread-pool (threads that are waiting to be assigned a task) and some message-passing mechanism to communicate data and task "contracts".
Asynchronous programming does not refer to multi-threaded programming, although the two are very often associated (and work well together). A synchronous program must complete each step before moving on to the next. An asynchronous program starts a step, moves on to other steps that don't require the result of the first step, then checks on the result of the first step when its result is required.
That is, a synchronous program might go a little bit like this: "do this task", "wait until done", "do something with the result", and "move on to something else". By contrast, an asynchronous program might go a little more like this: "I'm gonna start a task, and I'll need the result later, but I don't need it just now", "in the meantime, I'll do something else", "I can't do anything else until I have the result of the first step now, so I'll wait for it, if it isn't ready", and "move on to something else".
Notice that "asynchronous" refers to a very broad concept, that always involves some form of "start some work and tell me when it's done" instead of the traditional "do it now!". This does not require multi-threading, in which case it just becomes a software design choice (which often involves callback functions and things like that to provide "notification" of the asynchronous result). With multiple threads, it becomes more powerful, as you can do various things in parallel while the asynchronous task is working. Taken to the extreme, it can become a more full-blown architecture like a task-based approach (which is one kind of asynchronous programming technique).
I think the thing that you want corresponds more to yet another concept: Parallel Computing (or parallel processing). This approach is more about splitting a large processing task into smaller parts and processing all parts in parallel, and then combining the results. You should look into libraries like OpenMP or OpenCL/CUDA (for GPGPU). That said, you can use multi-threading for parallel processing.
but apparently asynchronous programming doesn't necessarily mean you are using multiple cores?
Asynchronous programming does not necessarily involve anything happening concurrently in multiple threads. It could mean that the OS is doing things on your behalf behind the scenes (and will notify you when that work is finished), like in asynchronous I/O, which happens without you creating any threads. It boils down to being a software design choice.
Why would you do this if you didn't have multiple cores to take advantage of?
If you don't have multiple cores, multi-threading can still improve performance by reusing "waiting time" (e.g., don't "block" the processing waiting on file or network I/O, or waiting on the user to click a mouse button). That means the program can do useful work while waiting on those things. Beyond that, it can provide flexibility in the design and make things seem to run simultaneously, which often makes users happier. Still, you are correct that before multi-core CPUs, there wasn't as much of an incentive to do multi-threading, as the gains often do not justify the overhead.
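To tie this back to the original question about loading several large files at once, a minimal Go sketch might overlap the waits like this (the file names are placeholders):
package main

import (
    "fmt"
    "os"
    "sync"
)

func main() {
    files := []string{"a.wav", "b.wav", "c.wav"} // placeholder paths
    var wg sync.WaitGroup
    for _, name := range files {
        wg.Add(1)
        go func(name string) {
            defer wg.Done()
            data, err := os.ReadFile(name) // each read waits on the disk independently
            if err != nil {
                fmt.Println("load failed:", err)
                return
            }
            fmt.Printf("loaded %s (%d bytes)\n", name, len(data))
        }(name)
    }
    wg.Wait() // all loads are in flight at once, so we only pay for the slowest
}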
I think that, in general, all of these are design-related rather than language-related. The same applies to multicore programming.
To echo Jim, it's not only the file-loading scenario. Generally, you need to design the whole piece of software to run concurrently in order to see the real benefit of multi-threaded, task-based, or asynchronous programming.
Try to see things from a big-picture point of view. Understand the overall modelling of a specific example and see how these methodologies are implemented. It will be easier to see the differences and to understand when and where to use each.

Thread per connection vs Reactor pattern (with a thread pool)?

I want to write a simple multiplayer game as part of my C++ learning project.
So I thought, since I am at it, I would like to do it properly, as opposed to just getting-it-done.
If I understood correctly: Apache uses a Thread-per-connection architecture, while nginx uses an event-loop and then dedicates a worker [x] for the incoming connection. I guess nginx is wiser, since it supports a higher concurrency level. Right?
I have also come across this clever analogy, but I am not sure if it could be applied to my situation. The analogy also seems to be very idealistic. I have rarely seen my computer run at 100% CPU (even with an umptillion Chrome tabs open, Photoshop, and what-not running simultaneously).
Also, I have come across a SO post (somehow it vanished from my history) where a user asked how many threads they should use, and one of the answers was that it's perfectly acceptable to have around 700, even up to 10,000 threads. This question was related to JVM, though.
So, let's estimate a fictional user base of around 5,000 users. Which approach would be the "most concurrent" one?
A reactor pattern running everything in a single thread.
A reactor pattern with a thread pool (approximately how big do you suggest the thread pool should be?)
Creating a thread per connection and then destroying the thread when the connection closes.
I admit option 2 sounds like the best solution to me, but I am very green in all of this, so I might be a bit naive and missing some obvious flaw. Also, it sounds like it could be fairly difficult to implement.
PS: I am considering using POCO C++ Libraries. Suggesting any alternative libraries (like boost) is fine with me. However, many say POCO's library is very clean and easy to understand. So, I would preferably use that one, so I can learn about the hows of what I'm using.
Reactive Applications certainly scale better, when they are written correctly. This means
Never blocking in a reactive thread:
Any blocking will seriously degrade the performance of your server; you typically use a small number of reactive threads, so blocking can also quickly cause deadlock.
No mutexes, since these can block, so no shared mutable state. If you require shared state you will have to wrap it with an actor or similar so only one thread has access to the state.
All work in the reactive threads should be cpu bound
All IO has to be asynchronous or be performed in a different thread pool and the results feed back into the reactor.
This means using either futures or callbacks to process replies; this style of code can quickly become unmaintainable if you are not used to it and disciplined.
All work in the reactive threads should be small
To maintain responsiveness of the server all tasks in the reactor must be small (bounded by time)
On an 8-core machine you cannot allow 8 long tasks to arrive at the same time, because no other work will start until they are complete
If a task could take a long time it must be broken up (cooperative multitasking)
Tasks in reactive applications are scheduled by the application not the operating system, that is why they can be faster and use less memory. When you write a Reactive application you are saying that you know the problem domain so well that you can organise and schedule this type of work better than the operating system can schedule threads doing the same work in a blocking fashion.
I am a big fan of reactive architectures, but they come with costs. I am not sure I would write my first C++ application as reactive; I normally try to learn one thing at a time.
If you decide to use a reactive architecture use a good framework that will help you design and structure your code or you will end up with spaghetti. Things to look for are:
What is the unit of work?
How easy is it to add new work? Can it only come in from an external event (e.g. a network request)?
How easy is it to break work up into smaller chunks?
How easy is it to process the results of this work?
How easy is it to move blocking code to another thread pool and still process the results?
I cannot recommend a C++ library for this, I now do my server development in Scala and Akka which provide all of this with an excellent composable futures library to keep the code clean.
Best of luck learning C++, and with whichever choice you make.
Option 2 will most efficiently occupy your hardware. Here is the classic article, ten years old but still good.
http://www.kegel.com/c10k.html
The best library combination these days for structuring an application with concurrency and asynchronous waiting is Boost Thread plus Boost ASIO. You could also try the C++11 std::thread library and std::mutex (but Boost ASIO is better than mutexes in a lot of cases; just always call back on the same thread and you don't need protected regions). Stay away from std::future, because it's broken:
http://bartoszmilewski.com/2009/03/03/broken-promises-c0x-futures/
The optimal number of threads in the thread pool is one thread per CPU core. 8 cores -> 8 threads. Plus maybe a few extra, if you think it's possible that your threadpool threads might call blocking operations sometimes.
FWIW, Poco supports option 2 (ParallelReactor) since version 1.5.1
I think that option 2 is the best one. As for tuning of the pool size, I think the pool should be adaptive. It should be able to spawn more threads (with some high hard limit) and remove excessive threads in times of low activity.
As the analogy you linked to (and its comments) suggests, this is somewhat application-dependent. Now, what you are building here is a game server, so let's analyze that.
Game servers generally do a lot of I/O and relatively few calculations, so they are far from 100%-CPU applications.
On the other hand, they also usually change values in some database (a "game world" model). All players create reads and writes to this database, which is exactly the intersection problem in the analogy.
So while you may gain some from handling the I/O in separate threads, you will also lose from having separate threads accessing the same database and waiting for its locks.
So either option 1 or 2 is acceptable in your situation. For scalability reasons I would not recommend option 3.

Issue with Mutual Exclusion of Concurrent Go Routines

In my code there are three concurrent routines. Here is a brief overview of my code:
Routine 1 {
    do something
    *Send int to Routine 2
    Send int to Routine 3
    Print Something
    Print Something*
    do something
}
Routine 2 {
    do something
    *Send int to Routine 1
    Send int to Routine 3
    Print Something
    Print Something*
    do something
}
Routine 3 {
    do something
    *Send int to Routine 1
    Send int to Routine 2
    Print Something
    Print Something*
    do something
}
main {
    routine1
    routine2
    routine3
}
I want that, while the code between the two stars (the sending and printing events) is executing, the flow of control must not go to the other goroutines. For example, when routine1 is executing the events between the two stars, routines 2 and 3 must be blocked (meaning the flow of execution does not pass from routine 1 to routine 2 or 3). After the last print event completes, the flow of execution may pass to routine 2 or 3. Can anybody help me by specifying how I can achieve this? Is it possible to implement the above specification with a WaitGroup? Can anybody show me, with a simple example, how to implement the above using a WaitGroup? Thanks.
NB: Maybe this is a repeat of this question. I tried using that sync-lock mechanism; however, maybe because I have a large code base, I could not place the lock/unlock calls properly, and it's creating a deadlock situation (or maybe my method is error-prone). Can anybody help me with a simple procedure so that I can achieve this? I give a simple example of my code here, where I want to put the two prints and the sending event inside a mutex (for routine 1) so that routine 2 can't interrupt it. Can you help me with how this is possible? One possible solution was given at
http://play.golang.org/p/-uoQSqBJKS which gives an error.
Why do you want to do this?
The deadlock problem is this: if you don't allow other goroutines to be scheduled, then your channel sends can't proceed unless there's buffering. Go's channels have finite buffering, so you end up in a race to drain them before a send on a full channel blocks. You could introduce infinite buffering, or put each send in its own goroutine, but it again comes down to: why are you trying to do this; what are you trying to achieve?
Another thing: if you only want to ensure mutual exclusion of the three sets of code between *s, then yes, you can use mutexes. If you want to ensure that no code interrupts your block, regardless of where it was suspended, then you might need to use runtime.LockOSThread and runtime.UnlockOSThread. These are fairly low level and you need to know what you're doing, and they're rarely needed. If you want there to be no other goroutines running, you'll have to have runtime.GOMAXPROCS(1), which is currently the default.
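For the plain mutual-exclusion reading of the question, a hedged sketch with sync.Mutex might look like this. The routine bodies are stand-ins for the asker's pseudo-code, and note that sending on an unbuffered channel while holding the lock could reintroduce the deadlock described above:
package main

import (
    "fmt"
    "sync"
)

var critical sync.Mutex // guards the "between the stars" section of every routine

func routine(name string, wg *sync.WaitGroup) {
    defer wg.Done()
    // do something (freely interleaved with the other routines)
    critical.Lock()
    // The sends and prints between the stars would go here. While this routine
    // holds the lock, the others cannot enter their own starred sections.
    fmt.Println(name, "inside its critical section")
    fmt.Println(name, "leaving its critical section")
    critical.Unlock()
    // do something
}

func main() {
    var wg sync.WaitGroup
    for _, name := range []string{"routine1", "routine2", "routine3"} {
        wg.Add(1)
        go routine(name, &wg)
    }
    wg.Wait()
}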
The problem in answering your question is that it seems no one understands what your problem really is. I see you're asking repeatedly about roughly the same thing, though no progress has been made. No offense is meant in saying this; it's an attempt to help you by suggesting that you reformulate your problem in a way comprehensible to others. As a nice possible side effect, some problems solve themselves while being explained to others in an understandable way. I've experienced that many times myself.
Another hint could be in the suspicious mix of explicit syncing and channel communication. That doesn't mean the design is necessarily broken. It just doesn't happen in a typical/simple case. Once again, your problem might be atypical/non-trivial.
Perhaps it's somehow possible to redesign your problem using only channels. Actually, I believe that every problem involving explicit synchronization (in Go) could be coded using only channels. That said, it is true that some problems are written very easily with explicit synchronization. Also, channel communication, as cheap as it is, is not as cheap as most synchronization primitives. But that can be looked at later, once the code works. If a "pattern" for, say, a sync.Mutex visibly emerges in the code, it should be possible to switch to it - and that is much easier to do when the code already works and hopefully has tests to watch your steps while making the adjustments.
Try to think about your goroutines like independently acting agents which:
Exclusively own the data received from a channel. The language will not enforce this; you must apply your own discipline.
Never touch data they have already sent to a channel. This follows from the first rule, but it is important enough to be explicit.
Interact with other agents (goroutines) through data types which encapsulate a whole unit of workflow/computation. This eliminates, for example, your earlier struggle with getting the right number of channel messages before the "unit" is complete.
For every channel they use, it must be absolutely clear in before if the channel must be unbuffered, must be buffered for fixed number of items or if it may be unbound.
Don't have to think (or know) about what other agents are doing, beyond receiving a message from them when that is needed for the agent to do its own task - part of the bigger picture.
Using even these few rules of thumb should hopefully produce code which is easier to reason about and which usually doesn't require any other synchronization. (I'm intentionally ignoring the performance issues of mission-critical applications here.)
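As a small illustration of the third rule, here is a hedged sketch in which one agent hands a whole unit of work to another over a channel (the struct and its fields are invented for the example):
package main

import "fmt"

// workUnit bundles everything one step of the workflow needs, so the receiving
// agent never has to count individual messages to know the unit is complete.
type workUnit struct {
    id     int
    inputs []int
    reply  chan int // where the result should be sent back
}

// adder owns each workUnit from the moment it receives it until it replies.
func adder(requests <-chan workUnit) {
    for u := range requests {
        total := 0
        for _, x := range u.inputs {
            total += x
        }
        u.reply <- total
    }
}

func main() {
    requests := make(chan workUnit)
    go adder(requests)

    reply := make(chan int, 1)
    requests <- workUnit{id: 1, inputs: []int{1, 2, 3}, reply: reply}
    fmt.Println("unit 1 result:", <-reply)
    close(requests)
}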

Large number of simultaneous long-running operations in Qt

I have some long-running operations that number in the hundreds. At the moment they are each on their own thread. My main goal in using threads is not to speed these operations up. The more important thing in this case is that they appear to run simultaneously.
I'm aware of cooperative multitasking and fibers. However, I'm trying to avoid anything that would require touching the code in the operations, e.g. peppering them with things like yieldToScheduler(). I also don't want to prescribe that these routines be restructured to emit queues of bite-sized task items... I want to treat them as black boxes.
For the moment I can live with these downsides:
Maximum # of threads tend to be O(1000)
Cost per thread is O(1MB)
To address the bad cache performance due to context-switches, I did have the idea of a timer which would juggle the priorities such that only idealThreadCount() threads were ever at Normal priority, with all the rest set to Idle. This would let me widen the timeslices, which would mean fewer context switches and still be okay for my purposes.
Question #1: Is that a good idea at all? One certain downside is it won't work on Linux (docs say no QThread::setPriority() there).
Question #2: Any other ideas or approaches? Is QtConcurrent thinking about this scenario?
(Some related reading: how-many-threads-does-it-take-to-make-them-a-bad-choice, many-threads-or-as-few-threads-as-possible, maximum-number-of-threads-per-process-in-linux)
IMHO, this is a very bad idea. If I were you, I would try really, really hard to find another way to do this. You're combining two really bad ideas: creating a truck load of threads, and messing with thread priorities.
You mention that these operations only need to appear to run simultaneously. So why not try to find a way to make them appear to run simultaneously, without literally running them simultaneously?
It's been 6 months, so I'm going to close this.
Firstly I'll say that threads serve more than one purpose. One is speedup...and a lot of people are focusing on that in the era of multi-core machines. But another is concurrency, which can be desirable even if it slows the system down when taken as a whole. Yet concurrency can be achieved using mechanisms more lightweight than threads, although it may complicate the code.
So this is just one of those situations where the tradeoff of programmer convenience against user experience must be tuned to fit the target environment. It's like how Google's process-per-tab approach with Chrome would have been ill-advised in the era of Mosaic (even if process isolation was preferable, all else being equal). If the OS, memory, and CPU couldn't give a good browsing experience... they wouldn't do it that way now.
Similarly, creating a lot of threads when there are independent operations you want to be concurrent saves you the trouble of sticking in your own scheduler and yield() operations. It may be the cleanest way to express the code, but if it chokes the target environment then something different needs to be done.
So I think I'll settle on the idea that in the future, when our hardware is better than it is today, we probably won't have to worry about how many threads we make. But for now I'll take it on a case-by-case basis; i.e., if I have 100 instances of concurrent task class A, 10 of concurrent task class B, and 3 of concurrent task class C... then switching A to a fiber-based solution and giving it a pool of a few threads is probably worth the extra complication.