C++ threading vs. visibility issues - what's the common engineering practice?


From my studies I know the concepts of starvation, deadlock, fairness and other concurrency issues. However, theory differs from practice, to an extent, and real engineering tasks often involve greater detail than academic blah blah...
As a C++ developer I've been concerned about threading issues for a while...
Suppose you have a shared variable x which refers to some larger portion of the program's memory. The variable is shared between two threads A and B.
Now, if we consider read/write operations on x from both A and B threads, possibly at the same time, there is a need to synchronize those operations, right? So the access to x needs some form of synchronization which can be achieved for example by using mutexes.
Now let's consider another scenario where x is initially written by thread A, then passed to thread B (somehow) and that thread only reads x. The thread B then produces a response to x called y and passes it back to the thread A (again, somehow). My question is: what synchronization primitives should I use to make this scenario thread-safe? I've read about atomics and, more importantly, memory fences - are these the tools I should rely on?
This is not a typical scenario in which there is a "critical section". Instead some data is passed between threads with no possibility of concurrent writes in the same memory location. So, after being written, the data should first be "flushed" somehow, so that the other threads could see it in a valid and consistent state before reading. How is it called in the literature, is it "visibility"?
What about pthread_once and its Boost/std counterpart i.e. call_once. Does it help if both x and y are passed between threads through a sort of "message queue" which is accessed by means of "once" functionality. AFAIK it serves as a sort of memory fence but I couldn't find any confirmation for this.
What about CPU caches and their coherency? What should I know about that from the engineering point of view? Does such knowledge help in the scenario mentioned above, or any other scenario commonly encountered in C++ development?
I know I might be mixing a lot of topics but I'd like to better understand what is the common engineering practice so that I could reuse the already known patterns.
This question is primarily related to the situation in C++03 as this is my daily environment at work. Since my project mainly involves Linux then I may only use pthreads and Boost, including Boost.Atomic. But I'm also interested if anything concerning such matters has changed with the advent of C++11.
I know the question is abstract and not that precise but any input could be useful.

you have a shared variable x
That's where you've gone wrong. Threading is MUCH easier if you hand off ownership of work items using some sort of threadsafe consumer-producer queue, and from the perspective of the rest of the program, including all the business logic, nothing is shared.
Message passing also helps prevent cache contention: there is no true sharing except of the producer-consumer queue itself (which has a trivial effect on performance if the unit of work is large), and organizing the data into messages helps reduce false sharing.
Parallelism scales best when you separate the problem into subproblems. Small subproblems are also much easier to reason about.
You seem to already be thinking along these lines, but no, threading primitives like atomics, mutexes, and fences are not very good for applications using message passing. Find a real queue implementation (queue, circular ring, Disruptor, they go under different names but all meet the same need). The primitives will be used inside the queue implementation, but never by application code.
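As a rough illustration of the hand-off described above, here is a minimal sketch of a blocking queue built on a mutex and a condition variable, assuming C++11 (with Boost.Thread the same shape should be possible in C++03). BlockingQueue and round_trip are hypothetical names, and the "work" (doubling x) is just a stand-in. Note that the primitives appear only inside the queue; the calling code shares nothing else.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical minimal blocking queue; not a production implementation.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();  // wake one waiting consumer
    }

    // Blocks until an item is available, then hands ownership to the caller.
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        T item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> items_;
};

// Thread A sends x; thread B replies with y = x * 2 (arbitrary stand-in work).
int round_trip(int x) {
    BlockingQueue<int> requests;
    BlockingQueue<int> replies;
    std::thread worker([&] { replies.push(requests.pop() * 2); });
    requests.push(x);
    int y = replies.pop();
    worker.join();
    assert(y == x * 2);
    return y;
}
```

The mutex inside push and pop is also what makes the data written by one thread visible to the other, so the application code needs no fences of its own.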

Related

Lock-free data structures in C++ = just use atomics and memory-ordering?

I used to see the term "lock free data structure" and think "ooooo that must be really complex". However, I have been reading "C++ Concurrency in Action" and it seems to write a lock-free data structure all you do is stop using mutexes/locks and replace them with atomic code (along with possible memory-ordering barriers).
So my question is- am I missing something here? Is it really that much simpler due to C++11? Is writing a lock-free data structure just a case of replacing the locks with atomic operations?

Ooooo but that is really complex.
If you don't see the difference between a mutex and an atomic access, there is something wrong with the way you look at parallel processing, and there will soon be something wrong with the code you write.
Most likely it will run slower than the equivalent blocking version, and if you (or rather your coworkers) are really unlucky, it will spout the occasional inconsistent data and crash randomly.
Even more likely, it will propagate real-time constraints to large parts of your application, forcing your coworkers to waste a sizeable amount of their time coping with arbitrary requirements they and their software would have quite happily lived without, and resort to various superstitions (sorry, "good practices") to obfuscate their code into submission.
Oh well, as long as the template guys and the wait-free guys had their little fun...
Parallel processing, be it blocking or supposedly wait-free, is inherently resource consuming, complex and costly to implement. Designing a software architecture that takes real advantage of non-trivial parallel processing is a job for specialists.
A good software design should on the contrary limit the parallelism to the bare minimum, leaving most of the programmers free to implement linear, sequential code.
As for C++, I find this whole philosophy of wrapping indifferently a string, a thread and a coffee machine in the same syntactic goo a disastrous design choice.
C++ is allowing you to create a multiprocessor synchronization object out of about anything, like you would allocate a mere string, which is akin to presenting an assault rifle next to a squirt gun in the same display case.
No doubt a lot of people are making a living by selling the idea that an assault rifle and a squirt gun are, after all, not so different. But still, they are.

Two things to consider:
Only a single operation is atomic when using C++11 atomics, but often you want a mutex to protect a larger region of code.
If you use std::atomic with a type that the compiler cannot convert to an atomic operation in machine code, then the compiler will have to insert a mutex for that operation.
Overall you probably want to stick with using mutexes and only use lock-free code for performance critical sections, or if you were implementing your own structures to use for synchronization.
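To make the first point concrete, here is a small sketch of an update that touches two values and must appear as one step. Balances and its members are hypothetical names. Making each field a separate std::atomic would not help, because a reader could still observe the state between the two writes; a mutex around the whole region does.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical pair of values with an invariant: a_ + b_ never changes.
class Balances {
public:
    Balances(long a, long b) : a_(a), b_(b) {}

    // Both writes happen under one lock, so no reader sees a half-done transfer.
    void transfer(long amount) {
        std::lock_guard<std::mutex> lock(mutex_);
        a_ -= amount;
        b_ += amount;
    }

    long total() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return a_ + b_;
    }

private:
    mutable std::mutex mutex_;
    long a_;
    long b_;
};

long total_after_concurrent_transfers() {
    Balances balances(1000, 0);
    auto work = [&balances] {
        for (int i = 0; i < 10000; ++i) balances.transfer(1);
    };
    std::thread t1(work);
    std::thread t2(work);
    t1.join();
    t2.join();
    long total = balances.total();
    assert(total == 1000);  // the invariant survived concurrent updates
    return total;
}
```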

You are missing something. While lock-free data structures do use the primitives you mention, simply invoking their existence will not provide you with a lock-free queue.

Lock-free code is not simpler because of C++. Without C++, operating systems and compilers often provide similar facilities for memory ordering and fencing in C or assembly.
C++ provides a better and easier-to-use interface (and a standardized one, so the same interface works across multiple operating systems and machine architectures), but writing lock-free code in C++ is no simpler than without it if you target only one specific OS/machine architecture.

What is the purpose of thread-safe data structure classes? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
AFAIK, a major goal of multi-threaded programming is increasing performance by utilizing multiple processing cores. The point is maximizing parallel execution.
When I see thread-safe generic data structure classes, I feel some irony. Because thread-safety means enforcing serial execution (lock, atomic operation, or whatever), so it's anti-parallel. Thread-safe classes means that serialization is encapsulated and hidden into the class, so we will get more chance to force serial execution - losing performance. It would be better to manage those critical section in larger (or largest) unit - application logic.
So why do people want thread-safe classes? What's the real benefit of them?
P.S.
By a thread-safe class I mean a class that has only thread-safe methods, which are safe to call from multiple threads simultaneously. Safe means it guarantees correct read/write results. Correct means the result is equal to the result under single-threaded execution (for example, avoiding the ABA problem).
So I think the term thread-safety in my question contains serial execution by definition. And that's why I was confused for its purpose and asked this question.

I think your question has a false assumption: synchronization operations are simply not anti-parallel. There is simply no way to build a parallel mutable data structure without some form of synchronization. Yes, heavy use of those synchronization mechanisms will detract from the ability of the code to run in parallel. But without those mechanisms it wouldn't be possible to write the code in the first place.
The one form of thread safe data structure that doesn't require synchronization are immutable values. However they only work for a subset of scenarios (parallel reads, data passing, etc ...)
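A minimal sketch of the immutable-value case, assuming C++11: the data is built once, then handed to readers through a pointer-to-const, and no lock or atomic appears in the reading code. The function name and the sample values are made up for illustration.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <thread>
#include <vector>

long sum_from_two_readers() {
    // Built before any reader starts; the pointer-to-const prevents
    // mutation through this handle afterwards.
    std::shared_ptr<const std::vector<int>> data =
        std::make_shared<std::vector<int>>(std::vector<int>{1, 2, 3, 4});

    long s1 = 0;
    long s2 = 0;
    // Each reader gets its own copy of the handle and its own output slot.
    std::thread r1([data, &s1] { s1 = std::accumulate(data->begin(), data->end(), 0L); });
    std::thread r2([data, &s2] { s2 = std::accumulate(data->begin(), data->end(), 0L); });
    r1.join();
    r2.join();

    assert(s1 == s2);  // both readers saw the same, fully built data
    return s1 + s2;
}
```

Thread creation and join provide the only ordering needed here: the vector is complete before the readers start, and the results are read only after they finish.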

Thread safe data structures can be implemented without serialization. It's tricky to get right, but it's doable and is done. Then you have the benefits of concurrency without any bottleneck.

Thread-safe classes means that serialization is encapsulated and hidden into the class, so we will get more chance to force serial execution - losing performance.
Making thread safety the client's responsibility defeats encapsulation (not always). Depending on the context/design, thread safety can be either very complex, or is susceptible to change over time (breaks your program when APIs change), or they are simply not uniform. Abstracting synchronization does not have to equate to a loss; it also has the potential for great benefits -- especially because it is not a subject for novices.
It would be better to manage those critical section in larger (or largest) unit - application logic.
I'm not sure who told you that, but that is not necessarily ideal for all scenarios. Once you get down to implementing concurrent systems, you will realize that choosing the best granularity of synchronization within your designs can make a huge difference in how it operates. Note that the 'best' general design is not always the best for a given usage.
There is not a hard and fast rule here -- Small and shortest (potentially using and acquiring a higher number of locks, however) is better for many designs, whereas largest-unit can increase contention and result in significant blocking. It's really easy to begin an update then spend a lot of time doing things within that update which do not require sustained synchronization of the entire structure during the update. Locking down the whole graph at each access is not always better, and certain components of the structure may be thread safe independent of other components. Therefore, the largest unit approach is frequently likely to enforce serialization which impacts performance, especially as size and complexity grows.
So why do people want thread-safe classes? What's the real benefit of them?
A few good reasons come to mind:
They can be hard to implement correctly, diagnose, and test. High performance concurrent designs are not concepts learned by attending a talk or going through a few online tutorials. It takes a lot of mistakes and time-invested to understand what goes into a good design.
Some structures are very specialized. These may be non-blocking, rely on atomics, or use less typical concurrent patterns or synchronization forms. Example: By default, you may just reach for a mutex when you need a lock, but sometimes a rwlock or spinlock would be better. Sometimes immutability may be better.
Some contexts or domains are very specialized. Designing a single component is often a simple task, but designing an entire system and how components interact is a much larger challenge, and the system may need to operate under special constraints -- relying on that design's synchronization can save you a lot of headache. You may not take the time to benchmark under many different workloads, whereas the person who wrote it has invested the time to understand the implementation and its execution.
It just works. Some people don't want to spend their energy obsessing over concurrency issues. They would rather use a proven, reliable implementation and focus on other aspects of their program. In some cases, people whose software you wind up using may not understand some of these concepts well enough, and you will be grateful when they had chosen to use a proven (or even familiar) design.
Encapsulation. Sometimes encapsulation can result in big performance boosts in concurrent systems. Example: a member or parameter may be conditionally immutable, and that trait may be taken advantage of. In other cases, encapsulation can result in lower acquisitions or reduced blocking. Another case is that encapsulation can reduce the complexity of using the interface -- entire categories of potential threading issues may be removed (although you may be left with a smaller set of constraints).
Less to comprehend. Reuse a well known implementation and understand how it operates, and you have less to learn compared to reviewing an implementation which was hand written (e.g. by your colleague who departed last year).
There are of course downsides, but that is not your question ;)

Often this is why performance-critical multi-threaded code avoids using "thread-safe" containers. Containers like std::vector, etc., are not thread safe. If an application needs shared access to these containers amongst different threads, then the application is responsible for managing that access.
On the other hand, sometimes performance is not the driver for multi-threading. GUI programs benefit from keeping the UI thread separate from the thread that is doing the work. Other threads may be spun off for all sorts of reasons. Generally this can allow a nice separation of responsibilities in the code, and give better overall liveliness to the application. In these cases the goal often isn't high performance per se. Using thread-safe containers may be a perfectly natural choice for these applications.
Of course the best option is to have your cake and eat it too, like some lock-free queue implementations, which allow one thread to feed the queue, another to consume, with no locking (relying only on the atomic nature of certain basic operations).
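For a feel of what such a queue can look like, here is a sketch of a single-producer/single-consumer ring buffer using C++11 atomics. SpscRing and sum_through_ring are hypothetical names, and this is only one possible design: it is safe only with exactly one pushing thread and one popping thread, and it stores at most Capacity - 1 items.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical SPSC ring buffer; a sketch, not a tuned implementation.
template <typename T, std::size_t Capacity>
class SpscRing {
public:
    bool try_push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        slots_[head] = value;
        head_.store(next, std::memory_order_release);  // publish the filled slot
        return true;
    }

    bool try_pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> head_{0};  // written only by the producer
    std::atomic<std::size_t> tail_{0};  // written only by the consumer
};

// Pushes 1..count from this thread and sums them on a consumer thread.
long sum_through_ring(int count) {
    assert(count >= 0);
    SpscRing<int, 8> ring;
    long sum = 0;
    std::thread consumer([&] {
        for (int received = 0; received < count;) {
            int v;
            if (ring.try_pop(v)) { sum += v; ++received; }
            else std::this_thread::yield();
        }
    });
    for (int i = 1; i <= count; ++i)
        while (!ring.try_push(i)) std::this_thread::yield();
    consumer.join();
    return sum;
}
```

The release store on one index paired with the acquire load on the other side is what guarantees the consumer sees the slot contents written before the index moved.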

It all rather depends on what the class is.
Consider a queue. Not every queue needs to be thread-safe. But there is most certainly a need in some cases for a data structure that you can push "stuff" into from one thread, and have the other thread pull "stuff" out of. This improves the parallelism of the threads, because it focuses inter-thread communication into a single location: the inter-thread queue. One side stuffs a sequence of commands in, and the other reads them and executes them when it can. If there are no commands available, it blocks or does whatever.
That demands, on some level, to have a thread-safe class. And since users will likely want to customize it with different kinds of "stuff", a generic implementation provided by the standard library is not unreasonable. Granted, no such thing exists in the C++ standard today, but it's almost certainly coming.
This is not "anti-parallel"; it improves parallelism. Without it, you would have to find some other way for the two threads to communicate. One that will more than likely force one of them to block more often.
Consider a shared_ptr. The cost of making shared_ptr's reference counter thread-safe is trivial next to the very likely possibility of someone screwing it up. It isn't free of course; an atomic increment/decrement isn't free. But it's far from "enforcing serial execution", since any moment of "serial execution" is so short as to be irrelevant in any real program.
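To show how small the synchronized part of a reference count is, here is a hand-rolled sketch. RefCounted is a hypothetical class and is not a claim about how any particular standard library implements shared_ptr; it only illustrates that each update is a single atomic instruction rather than a locked region.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical intrusive reference counter, starting with one owner.
class RefCounted {
public:
    void add_ref() { count_.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when the caller just dropped the last reference.
    bool release() { return count_.fetch_sub(1, std::memory_order_acq_rel) == 1; }

private:
    std::atomic<int> count_{1};
};

// Four threads take and drop references; the original owner releases last.
bool last_release_detected() {
    RefCounted object;
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) {
        threads.emplace_back([&object] {
            for (int i = 0; i < 10000; ++i) {
                object.add_ref();
                bool was_last = object.release();
                assert(!was_last);  // the original owner still holds a reference
            }
        });
    }
    for (auto& th : threads) th.join();
    return object.release();  // only the original reference is left
}
```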
So no, these things are not "anti-parallel".

Alternatives for locks for synchronisation

I'm currently in the process of developing my own little threading library, mainly for learning purposes, and am at the part of the message queue which will involve a lot of synchronisation in various places. Previously I've mainly used locks, mutexes and condition variables a bit which all are variations of the same theme, a lock for a section that should only be used by one thread at a time.
Are there any solutions to synchronisation other than using locks? I've read about lock-free synchronization in places, but some consider hiding the locks in containers to be lock-free, which I disagree with: you just don't explicitly use the locks yourself.

Lock-free algorithms typically involve using compare-and-swap (CAS) or similar CPU instructions that update some value in memory not only atomically, but also conditionally and with an indicator of success. That way you can code something like this:
1 do
2 {
3 current_value = the_variable
4 new_value = ...some expression using current_value...
5 } while(!compare_and_swap(the_variable, current_value, new_value));
compare_and_swap() atomically checks whether the_variable's value is still current_value, and only if that's so will it update the_variable's value to new_value and return true
exact calling syntax will vary with the CPU, and may involve assembly language or system/compiler-provided wrapper functions (use the latter if available - there may be other compiler optimisations or issues that their usage restricts to safe behaviours); generally, check your docs
The significance is that when another thread updates the variable after the read on line 3 but before the CAS on line 5 attempts the update, the compare and swap instruction will fail because the state from which you're updating is not the one you used to calculate the desired target state. Such do/while loops can be said to "spin" rather than lock, as they go round and round the loop until CAS succeeds.
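In C++11 the same loop can be written with std::atomic and compare_exchange_weak. The sketch below assumes the "expression" is simply doubling the value; atomic_double and double_concurrently are hypothetical names. One difference from the pseudocode: on failure, compare_exchange_weak stores the value it actually found into its first argument, so the re-read is implicit.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Atomically replaces the stored value v with v * 2.
void atomic_double(std::atomic<int>& the_variable) {
    int current_value = the_variable.load();
    int new_value;
    do {
        new_value = current_value * 2;  // recomputed after every failed attempt
        // On failure, current_value is updated to what another thread wrote.
    } while (!the_variable.compare_exchange_weak(current_value, new_value));
}

// Starts at 1 and lets each thread double the value once.
int double_concurrently(int threads) {
    assert(threads >= 0 && threads < 31);  // keep the result inside int range
    std::atomic<int> value(1);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&value] { atomic_double(value); });
    for (auto& th : pool) th.join();
    return value.load();
}
```

No doubling is ever lost: if two threads read the same value, only one CAS succeeds and the other retries from the new value.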
Crucially, your existing threading library can be expected to have a two-stage locking approach for mutex, read-write locks etc. involving:
First stage: spinning using CAS or similar (i.e. spin on { read the current value, if it's not set then cas(current = not set, new = set) }) - which means other threads doing a quick update often won't result in your thread swapping out to wait, and all the relatively time-consuming overheads associated with that.
The second stage is only used if some limit of loop iterations or elapsed time is exceeded: it asks the operating system to queue the thread until it knows (or at least suspects) the lock is free to acquire.
The implication of this is that if you're using a mutex to protect access to a variable, then you are unlikely to do any better by implementing your own CAS-based "mutex" to protect the same variable.
Lock free algorithms come into their own when you are working directly on a variable that's small enough to update directly with the CAS instruction itself. Instead of being...
get a mutex (by spinning on CAS, falling back on slower OS queue)
update variable
release mutex
...they're simplified (and made faster) by simply having the spin on CAS do the variable update directly. Of course, you may find the work to calculate new value from old painful to repeat speculatively, but unless there's a LOT of contention you're not wasting that effort often.
This ability to update only a single location in memory has far-reaching implications, and work-arounds can require some creativity. For example, if you had a container using lock-free algorithms, you may decide to calculate a potential change to an element in the container, but can't sync that with updating a size variable elsewhere in memory. You may need to live without size, or be able to use an approximate size where you do a CAS-spin to increment or decrement the size later, but any given read of size may be slightly wrong. You may need to merge two logically-related data structures - such as a free list and the element-container - to share an index, then bit-pack the core fields for each into the same atomically-sized word at the start of each record. These kinds of data optimisations can be very invasive, and sometimes won't get you the behavioural characteristics you'd like. Mutexes et al are much easier in this regard, and at least you know you won't need a rewrite to mutexes if requirements evolve just that step too far. That said, clever use of a lock-free approach really can be adequate for a lot of needs, and yield a very gratifying performance and scalability improvement.
A core (good) consequence of lock-free algorithms is that one thread can't be holding the mutex then happen to get swapped out by the scheduler, such that other threads can't work until it resumes; rather - with CAS - they can spin safely and efficiently without an OS fallback option.
Things that lock free algorithms can be good for include updating usage/reference counters, modifying pointers to cleanly switch the pointed-to data, free lists, linked lists, marking hash-table buckets used/unused, and load-balancing. Many others of course.
As you say, simply hiding use of mutexes behind some API is not lock free.

There are a lot of different approaches to synchronization. There are various variants of message-passing (for example, CSP) or transactional memory.
Both of these may be implemented using locks, but that's an implementation detail.
And then of course, for some purposes, there are lock-free algorithms or data-structures, which make do with just a few atomic instructions (such as compare-and-swap), but this isn't really a general-purpose replacement for locks.

There are several implementations of some data structures, which can be implemented in a lock free configuration. For example, the producer/consumer pattern can often be implemented using lock-free linked list structures.
However, most lock-free solutions require significant thought on the part of the person designing the specific program/specific problem domain. They aren't generally applicable for all problems. For examples of such implementations, take a look at Intel's Threading Building Blocks library.
Most important to note is that no lock-free solution is free. You're going to give something up to make that work, at the bare minimum in implementation complexity, and probably performance in scenarios where you're running on a single core (for example, a linked list is MUCH slower than a vector). Make sure you benchmark before using lock free on the base assumption that it would be faster.
Side note: I really hope you're not using condition variables, because there's no way to ensure that their access operates as you wish in C and C++.

Yet another library to add to your reading list: Fast Flow
What's interesting in your case is that they are based on lock-free queues. They have implemented a simple lock-free queue and then have built more complex queues out of it.
And since the code is free, you can peruse it and get the code for the lock-free queue, which is far from trivial to get right.

performance penalty of message passing as opposed to shared data

There is a lot of buzz these days about not using locks and using Message passing approaches like Erlang. Or about using immutable datastructures like in Functional programming vs. C++/Java.
But what I am concerned with is the following:
AFAIK, Erlang does not guarantee Message delivery. Messages might be lost. Won't the algorithm and code bloat and be complicated again if you have to worry about loss of messages? Whatever distributed algorithm you use must not depend on guaranteed delivery of messages.
What if the Message is a complicated object? Isn't there a huge performance penalty in copying and sending the messages vs. say keeping it in a shared location (like a DB that both processes can access)?
Can you really totally do away with shared states? I don't think so. For e.g. in a DB, you have to access and modify the same record. You cannot use message passing there. You need to have locking or assume Optimistic concurrency control mechanisms and then do rollbacks on errors. How does Mnesia work?
Also, it is not the case that you always need to worry about concurrency. Any project will also have a large piece of code that doesn't have to do anything with concurrency or transactions at all (but they do have performance and speed as a concern). A lot of these algorithms depend on shared states (that's why pass-by-reference or pointers are so useful).
Given this fact, writing programs in Erlang etc. is a pain because you are prevented from doing any of these things. Maybe it makes programs robust, but for things like solving a linear programming problem or computing the convex hull, performance is more important, and forcing immutability on the algorithm when it has nothing to do with concurrency or transactions is a poor decision. Isn't it?

That's real life : you need to account for this possibility regardless of the language / platform. In a distributed world (the real world), things fail: live with it.
Of course there is a cost: nothing is free in our universe. But shouldn't you use another medium (e.g. file, db) instead of shuttling "big objects" in communication pipes? You can always use "message" to refer to "big objects" stored somewhere.
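Within a single process, the closest analogue of "a message that refers to a big object" is to send a handle and transfer ownership instead of copying the payload. A minimal C++11 sketch, assuming the payload is a heap-allocated vector; send_by_handle is a hypothetical name, and std::async stands in for whatever messaging channel is actually used:

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <memory>
#include <utility>
#include <vector>

// Sends a large buffer to another thread by moving a pointer to it.
// Only the pointer travels; the n elements are never copied.
std::size_t send_by_handle(std::size_t n) {
    std::unique_ptr<std::vector<int>> big(new std::vector<int>(n, 1));

    std::future<std::size_t> reply = std::async(
        std::launch::async,
        [](std::unique_ptr<std::vector<int>> message) { return message->size(); },
        std::move(big));

    assert(!big);  // the sender gave up ownership, so nothing is shared
    return reply.get();
}
```

Because the sender's pointer is empty after the move, the two threads cannot touch the payload at the same time, which keeps the no-shared-state property without the copying cost.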
Of course not: the idea behind functional programming / Erlang OTP is to "isolate" as much as possible the areas where "shared state" is manipulated. Furthermore, having clearly marked places where shared state is mutated helps testability & traceability.
I believe you are missing the point: there is no such thing as a silver bullet. If your application cannot be successfully built using Erlang then don't do it. You can always build some other part of the overall system in another fashion, i.e. use a different language / platform. Erlang is no different from another language in this respect: use the right tool for the right job.
Remember: Erlang was designed to help solve concurrent, asynchronous and distributed problems. It isn't optimized for working efficiently on a shared block of memory for example... unless you count interfacing with nif functions working on shared blocks part of the game :-)

Real-world systems are always hybrids anyway: I don't believe the modern paradigms try, in practice, to get rid of mutable data and shared state.
The objective, however, is not to need concurrent access to this shared state. Programs can be divided into the concurrent and the sequential, and use message-passing and the new paradigms for the concurrent parts.
Not every code will get the same investment: There is concern that threads are fundamentally "considered harmful". Something like Apache may need traditional concurrent threads and a key piece of technology like that may be carefully refined over a period of years so it can blast away with fully concurrent shared state. Operating system kernels are another example where "solve the problem no matter how expensive it is" may make sense.
There is no benefit to fast-but-broken: But for new code, or code that doesn't get so much attention, it may be the case that it simply isn't thread-safe, and it will not handle true concurrency, and so the relative "efficiency" is irrelevant. One way works, and one way doesn't.
Don't forget testability: Also, what value can you place on testing? Thread-based shared-memory concurrency is simply not testable. Message-passing concurrency is. So now you have the situation where you can test one paradigm but not the other. So, what is the value in knowing that the code has been tested? The danger in not even knowing if the other code will work in every situation?

A few comments on the misunderstanding you have of Erlang:
Erlang guarantees that messages will not be lost, and that they will arrive in the order sent. A basic error situation is that machine A cannot speak to machine B. When that happens process monitors and links will trigger, and system node-down messages will be sent to the processes that registered for them. Nothing will be silently dropped. Processes will "crash" and supervisors (if any) will try to restart them.
Objects can not be mutated, so they are always copied. One way to secure immutability is by copying values to other erlang process' heaps. Another way is to allocate objects in a shared heap, message references to them and simply not have any operations that mutate them. Erlang does the first for performance! Realtime suffers if you need to stop all processes to garbage collect a shared heap. Ask Java.
There is shared state in Erlang. Erlang is not proud of it, but it is pragmatic about it. One example is the local process registry which is a global map that maps a name to a process so that system processes can be restarted and claim their old name. Erlang just tries to avoid shared state if it possibly can. ETS tables that are public are another example.
Yes, sometimes Erlang is too slow. This happens in all languages. Sometimes Java is too slow. Sometimes C++ is too slow. Just because a tight loop in a game had to drop down to assembly to kick off some serious SIMD-based vector mathematics, you can't deduce that everything should be written in assembly because it is the only language that is fast when it matters. What matters is being able to write systems that have good performance, and Erlang manages quite well. See benchmarks on yaws or rabbitmq.
Your facts are not facts about Erlang. Even if you think Erlang programming is a pain, you will find other people create some awesome software thanks to it. You should attempt writing an IRC server in Erlang, or something else very concurrent. Even if you're never going to use Erlang again, you would have learned to think about concurrency another way. But of course, you will, because Erlang is awesome easy.
Those that do not understand Erlang are doomed to re-implement it badly.
Okay, the original was about Lisp, but... it's true!

There are some implicit assumptions in your questions - you assume that all the data can fit on one machine and that the application is intrinsically localised to one place.
What happens if the application is so large it cannot fit on one machine? What happens if the application outgrows one machine?
You don't want to have one way to program an application if it fits on one machine and a completely different way of programming it as soon as it outgrows one machine.
What happens if you want to make a fault-tolerant application? To make something fault-tolerant you need at least two physically separated machines and no sharing.
When you talk about sharing and databases you omit to mention that things like mySQL cluster achieve fault-tolerance precisely by maintaining synchronised copies of the data in physically separated machines - there is a lot of message passing and copying that you don't see on the surface - Erlang just exposes this.
The way you program should not suddenly change to accommodate fault-tolerance and scalability.
Erlang was designed primarily for building fault-tolerant applications.
Shared data on a multi-core has its own set of problems - when you access shared data you need to acquire a lock - if you use a global lock (the easiest approach) you can end up stopping all the cores while you access the shared data. Shared data access on a multicore can be problematic due to caching problems: if the cores have local data caches then accessing "far away" data (in some other processor's cache) can be very expensive.
Many problems are intrinsically distributed and the data is never available in one place
at the same time so - these kind of problems fit well with the Erlang way of thinking.
In a distributed setting "guaranteeing message delivery" is impossible - the destination machine might have crashed. Erlang cannot thus guarantee message delivery -
it takes a different approach - the system will tell you if it failed to deliver a message
(but only if you have used the link mechanism) - then you can write you own custom error
recovery.)
For pure number crunching Erlang is not appropriate - but in a hybrid system Erlang
is good at managing how computations get distributed to available processors, so we see a lot of systems where Erlang manages the distribution and fault-tolerent aspects of the problem, but the problem itself is solved in a different language.
and other languages are used

For e.g. in a DB, you have to access and modify the same record
But that is handled by the DB. As a user of the database, you simply execute your query, and the database ensures it is executed in isolation.
As for performance, one of the most important things about eliminating shared state is that it enables new optimizations. Shared state is not particularly efficient. You get cores fighting over the same cache lines, and data has to be written through to memory where it could otherwise stay in a register or in CPU cache.
Many compiler optimizations rely on absence of side effects and shared state as well.
You could say that a stricter language guaranteeing these things requires more optimizations to be performant than something like C, but it also makes these optimizations much much easier for the compiler to implement.
Many concerns similar to concurrency issues arise in single-threaded code. Modern CPUs are pipelined, execute instructions out of order, and can run 3-4 of them per cycle. So even in a single-threaded program, it is vital that the compiler and CPU are able to determine which instructions can be interleaved and executed in parallel.

For correctness, shared is the way to go, and keep the data as normalized as possible. For immediacy, send messages to inform of changes, but always back them up with polling. Messages get dropped, duplicated, re-ordered, delayed - don't rely on them.
If speed is what you're worried about, first do it single-thread and tune the daylights out of it. Then if you've got multiple cores and know how to split up the work, use parallelism.

Erlang provides supervisors and gen_server callbacks for synchronous calls, so you will know about it if a message isn't delivered: either the gen_server call returns a timeout, or your whole node will be brought down and up if the supervisor is triggered.
Usually, if the processes are on the same node, message-passing languages optimise away the data copying, so it's almost like shared memory. The exception is when the object is changed and used by both processes afterward, which cannot be done safely with shared memory either.
There is some state which is kept by processes by passing it around to themselves in the recursive tail-calls, also some state can be of course passed through messages. I don't use mnesia much, but it is a transactional database, so once you have passed the operation to mnesia (and it has returned) you are pretty much guaranteed it will go through..
Which is why it is easy to tie such applications into erlang with the use of ports or drivers. The easiest are the ports, it's much like a unix pipe, though I think performance isn't that great...and as said, message-passing usually ends up just being pointer passing anyways as the VM/compiler optimise the memory copy out.

What are the "things to know" when diving into multi-threaded programming in C++

I'm currently working on a wireless networking application in C++ and it's coming to a point where I'm going to want to multi-thread pieces of software under one process, rather than have them all in separate processes. Theoretically, I understand multi-threading, but I've yet to dive in practically.
What should every programmer know when writing multi-threaded code in C++?

I would focus on designing the thing to be as partitioned as possible, so that you have the minimal amount of shared state across threads. If you make sure you don't have statics and other resources shared among threads (other than those that you would be sharing if you designed this with processes instead of threads) you will be fine.
Therefore, while yes, you have to have in mind concepts like locks, semaphores, etc, the best way to tackle this is to try to avoid them.

I am no expert at all in this subject. Just some rule of thumb:
Design for simplicity, bugs really are hard to find in concurrent code even in the simplest examples.
C++ offers you a very elegant paradigm to manage resources (mutex, semaphore, ...): RAII. I observed that it is much easier to work with boost::thread than to work with POSIX threads.
Build your code to be thread-safe. If you don't, your program could behave strangely.
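
The RAII point above can be sketched roughly as follows. This is a minimal illustration, assuming a C++11 compiler and using the standard library's std::mutex and std::lock_guard rather than boost::thread; the Counter class and count_in_parallel function are hypothetical names invented for the example.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example class: a counter guarded by a mutex.
class Counter {
public:
    void increment() {
        // lock_guard locks in its constructor and unlocks in its destructor,
        // so the mutex is released on every exit path, including exceptions.
        std::lock_guard<std::mutex> guard(mutex_);
        ++value_;
    }
    long value() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }
private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Hypothetical driver: num_threads threads each increment per_thread times.
long count_in_parallel(int num_threads, int per_thread) {
    Counter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) counter.increment();
        });
    }
    for (auto& th : threads) th.join();
    return counter.value();
}
```

Because the unlock lives in a destructor, there is no code path on which the lock can be forgotten.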

I am exactly in this situation: I wrote a library with a global lock (many threads, but only one running at a time in the library) and am refactoring it to support concurrency.
I have read books on the subject but what I learned stands in a few points:
1. think parallel: imagine a crowd passing through the code. What happens when a method is called while already in action?
2. think shared: imagine many people trying to read and alter shared resources at the same time.
3. design: avoid the problems that points 1 and 2 can raise.
4. never think you can ignore edge cases, they will bite you hard.
Since you cannot proof-test a concurrent design (because thread execution interleaving is not reproducible), you have to ensure that your design is robust by carefully analyzing the code paths and documenting how the code is supposed to be used.
Once you understand how and where you should bottleneck your code, you can read the documentation on the tools used for this job:
Mutex (exclusive access to a resource)
Scoped Locks (good pattern to lock/unlock a Mutex)
Semaphores (passing information between threads)
ReadWrite Mutex (many readers, exclusive access on write)
Signals (how to 'kill' a thread or send it an interrupt signal, how to catch these)
Parallel design patterns: boss/worker, producer/consumer, etc (see Schmidt)
platform specific tools: OpenMP, C blocks, etc
Good luck! Concurrency is fun, just take your time...
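
Of the tools listed above, the read/write mutex is perhaps the least familiar, so here is a minimal sketch of it. It assumes C++17 (std::shared_mutex); the SettingsTable class and its methods are hypothetical names, not taken from any library.

```cpp
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical example: a settings table that is read often, written rarely.
class SettingsTable {
public:
    void set(const std::string& key, int value) {
        // unique_lock on a shared_mutex: exclusive access for the writer.
        std::unique_lock<std::shared_mutex> lock(mutex_);
        values_[key] = value;
    }
    int get_or(const std::string& key, int fallback) const {
        // shared_lock: any number of readers may hold the lock together.
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = values_.find(key);
        return it == values_.end() ? fallback : it->second;
    }
private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, int> values_;
};
```

Both lock types are scoped locks, so the same RAII pattern covers the plain mutex case as well.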

You should read about locks, mutexes, semaphores and condition variables.
One word of advice, if your app has any form of UI make sure you always change it from the UI thread. Most UI toolkits/frameworks will crash (or behave unexpectedly) if you access them from a background thread. Usually they provide some form of dispatching method to execute some function in the UI thread.

Never assume that external APIs are threadsafe. If it is not explicitly stated in their docs, do not call them concurrently from multiple threads. Instead, limit your use of them to a single thread or use a mutex to prevent concurrent calls (this is rather similar to the aforementioned GUI libraries).
The next point is language-related. Remember that C++ before C++11 has no well-defined approach to threading: the language says nothing about threads, so the compiler/optimizer does not know whether code might be called concurrently. (C++11 and later define a memory model and provide std::thread, std::mutex and std::atomic.) The volatile keyword prevents certain optimizations (e.g. caching of memory fields in CPU registers), but it is not a synchronization mechanism: it provides neither atomicity nor ordering between threads.
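
To illustrate what a real synchronization mechanism looks like compared with volatile, here is a small sketch assuming C++11 or later. It publishes an ordinary variable from one thread to another through a std::atomic flag with release/acquire ordering; publish_and_read is a hypothetical function name.

```cpp
#include <atomic>
#include <thread>

// Publishes a plain int from a writer thread to a reader thread and
// returns the value the reader observed.
int publish_and_read() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // synchronization flag
    int seen = -1;

    std::thread reader([&] {
        // acquire: once this load sees true, everything written before the
        // matching release store is guaranteed to be visible here.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    std::thread writer([&] {
        payload = 42;                                  // write the data first
        ready.store(true, std::memory_order_release);  // then publish it
    });

    writer.join();
    reader.join();  // join() makes the reader's write to `seen` visible
    return seen;
}
```

Replacing the atomic flag with a volatile bool would make this a data race with undefined behaviour, even if it appeared to work in testing.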
I'd recommend boost for synchronization primitives. Don't mess with platform APIs. They make your code difficult to port because they have similar functionality on all major platforms, but slightly different detail behaviour. Boost solves these problems by exposing only common functionality to the user.
Furthermore, if there's even the smallest chance that a data structure could be written to by two threads at the same time, use a synchronization primitive to protect it. Even if you think it will only happen once in a million years.

One thing I've found very useful is to make the application configurable with regard to the actual number of threads it uses for various tasks. For example, if you have multiple threads accessing a database, make the number of those threads be configurable via a command line parameter. This is extremely handy when debugging - you can exclude threading issues by setting the number to 1, or force them by setting it to a high number. It's also very handy when working out what the optimal number of threads is.
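
A rough sketch of this idea, assuming C++11: the thread count comes from a command-line style string and the work is split accordingly. The names parse_thread_count and parallel_sum are hypothetical, and summing a vector merely stands in for the real task.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <numeric>
#include <thread>
#include <vector>

// Parses a thread count from a command-line argument; falls back to 1
// when the argument is missing or not a positive number.
unsigned parse_thread_count(const char* arg) {
    if (arg == nullptr) return 1;
    long n = std::strtol(arg, nullptr, 10);
    return n > 0 ? static_cast<unsigned>(n) : 1;
}

// Sums data using num_threads workers, each on its own contiguous slice.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    num_threads = std::max(1u, num_threads);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            // Each worker writes only its own slot, so no lock is needed.
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Running with a count of 1 gives a single-worker baseline to compare against when a threading bug is suspected.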

Make sure you test your code in a single-cpu system and a multi-cpu system.
Based on the comments:-
Single socket, single core
Single socket, two cores
Single socket, more than two cores
Two sockets, single core each
Two sockets, combination of single, dual and multi core cpus
Multiple sockets, combination of single, dual and multi core cpus
The limiting factor here is going to be cost. Ideally, concentrate on the types of system your code is going to run on.

In addition to the other things mentioned, you should learn about asynchronous message queues. They can elegantly solve the problems of data sharing and event handling. This approach works well when you have concurrent state machines that need to communicate with each other.
I'm not aware of any message passing frameworks tailored to work only at the thread level. I've only seen home-brewed solutions. Please comment if you know of any existing ones.
EDIT:
One could use the lock-free queues from Intel's TBB, either as-is, or as the basis for a more general message-passing queue.
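
For comparison, a home-brewed thread-level message queue can be quite small. The sketch below assumes C++11 and uses a mutex plus a condition variable rather than TBB's lock-free queues; MessageQueue and round_trip are hypothetical names.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical blocking queue: push never blocks, pop waits for a message.
template <typename T>
class MessageQueue {
public:
    void push(T msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        ready_.notify_one();  // wake one waiting consumer
    }
    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate guards against spurious wake-ups.
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T msg = std::move(queue_.front());
        queue_.pop();
        return msg;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
};

// Thread A sends x to a worker thread, which replies with y = 2 * x.
int round_trip(int x) {
    MessageQueue<int> requests, replies;
    std::thread worker([&] { replies.push(requests.pop() * 2); });
    requests.push(x);
    int y = replies.pop();
    worker.join();
    return y;
}
```

The mutex inside the queue is what makes a pushed message visible to the thread that pops it, so the two sides need no further synchronization for the message itself.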

Since you are a beginner, start simple. First make it work correctly, then worry about optimizations. I've seen people try to optimize by increasing the concurrency of a particular section of code (often using dubious tricks), without ever looking to see if there was any contention in the first place.
Second, you want to be able to work at as high a level as you can. Don't work at the level of locks and mutexes if you can use an existing master-worker queue. Intel's TBB looks promising, being slightly higher level than pure threads.
Third, multi-threaded programming is hard. Reduce the areas of your code where you have to think about it as much as possible. If you can write a class such that objects of that class are only ever operated on in a single thread, and there is no static data, it greatly reduces the things that you have to worry about in the class.

A few of the answers have touched on this, but I wanted to emphasize one point:
If you can, make sure that as much of your data as possible is only accessible from one thread at a time. Message queues are a very useful construct to use for this.
I haven't had to write much heavily-threaded code in C++, but in general, the producer-consumer pattern can be very helpful in utilizing multiple threads efficiently, while avoiding the race conditions associated with concurrent access.
If you can use someone else's already-debugged code to handle thread interaction, you're in good shape. As a beginner, there is a temptation to do things in an ad-hoc fashion - to use a "volatile" variable to synchronize between two pieces of code, for example. Avoid that as much as possible. It's very difficult to write code that's bulletproof in the presence of contending threads, so find some code you can trust, and minimize your use of the low-level primitives as much as you can.
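
One example of already-debugged code for this is the standard library itself. The sketch below assumes C++14 and uses std::async and std::future to hand data to another thread and get a result back; sum_in_background is a hypothetical function name.

```cpp
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: the caller hands x to a background task and
// receives the result y through a std::future.
long long sum_in_background(std::vector<int> x) {
    // The init-capture moves x into the task, so from here on only the
    // background task touches the data: no concurrent access, no locks.
    std::future<long long> y = std::async(
        std::launch::async,
        [data = std::move(x)] {
            return std::accumulate(data.begin(), data.end(), 0LL);
        });
    // get() blocks until the task finishes and makes its result visible.
    return y.get();
}
```

The data is only ever accessible from one thread at a time, which is the property the answer above recommends.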

My top tips for threading newbies:
If you possibly can, use a task-based parallelism library, Intel's TBB being the most obvious one. This insulates you from the grungy, tricky details and is more efficient than anything you'll cobble together yourself. The main downside is this model doesn't support all uses of multithreading; it's great for exploiting multicores for compute power, less good if you wanted threads for waiting on blocking I/O.
Know how to abort threads (or in the case of TBB, how to make tasks complete early when you decide you didn't want the results after all). Newbies seem to be drawn to thread kill functions like moths to a flame. Don't do it... Herb Sutter has a great short article on this.
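
A cooperative way of stopping a thread, as opposed to killing it, might look like the sketch below. It assumes C++11; the Worker class is a hypothetical name, and the sleep merely stands in for a unit of real work.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical worker that runs until it is asked to stop.
class Worker {
public:
    Worker() : thread_([this] { run(); }) {}
    ~Worker() { stop(); }

    // Requests a stop and waits for the thread to finish; safe to call twice.
    void stop() {
        stop_requested_.store(true);
        if (thread_.joinable()) thread_.join();
    }
    long iterations() const { return iterations_.load(); }

private:
    void run() {
        // The thread checks the flag between units of work and exits on its
        // own, so destructors run and no lock is left held.
        while (!stop_requested_.load()) {
            iterations_.fetch_add(1);
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }
    std::atomic<bool> stop_requested_{false};
    std::atomic<long> iterations_{0};
    std::thread thread_;  // declared last so the flags exist before run() starts
};
```

The cost is latency: the thread only stops at the next flag check, so long blocking calls inside the loop need their own timeout or wake-up.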

Make sure to explicitly know what objects are shared and how they are shared.
As much as possible make your functions purely functional. That is they have inputs and outputs and no side effects. This makes it much simpler to reason about your code. With a simpler program it isn't such a big deal but as the complexity rises it will become essential. Side effects are what lead to thread-safety issues.
Play devil's advocate with your code. Look at some code and think: how could I break this with some well-timed thread interleaving? At some point this case will happen.
First learn thread-safety. Once you get that nailed down then you move onto the hard part: Concurrent performance. This is where moving away from global locks is essential. Figuring out ways to minimize and remove locks while still maintaining the thread-safety is hard.

Keep things dead simple as much as possible. It's better to have a simpler design (maintenance, fewer bugs) than a more complex solution that might have slightly better CPU utilization.
Avoid sharing state between threads as much as possible, this reduces the number of places that must use synchronization.
Avoid false-sharing at all costs (google this term).
Use a thread pool so you're not frequently creating/destroying threads (that's expensive and slow).
Consider using OpenMP, Intel and Microsoft (possibly others) support this extension to C++.
If you are doing number crunching, consider using Intel IPP, which internally uses optimized SIMD functions (this isn't really multi-threading, but is parallelism of a related sort).
Have tons of fun.
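
For the false-sharing item in the list above, here is a minimal sketch of the usual remedy: give each thread's data a cache line of its own. It assumes C++17 (so that std::vector allocates over-aligned types correctly) and a 64-byte cache line, which is typical on x86-64 but not universal. PaddedCounter and count_with_padding are hypothetical names.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines (typical on x86-64, not universal).
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned and padded to occupy a cache line of its own,
// so two threads never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

// Each thread increments only its own counter; totals are merged at the end.
long count_with_padding(unsigned num_threads, long per_thread) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counters, t, per_thread] {
            for (long i = 0; i < per_thread; ++i) ++counters[t].value;
        });
    }
    for (auto& th : threads) th.join();
    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Without the alignas, adjacent counters would share a cache line and the cores would keep invalidating each other's copies, even though no two threads touch the same variable.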

Stay away from MFC and its multithreading + messaging library.
In fact if you see MFC and threads coming toward you - run for the hills (*)
(*) Unless of course if MFC is coming FROM the hills - in which case run AWAY from the hills.

The biggest "mindset" difference between single-threaded and multi-threaded programming in my opinion is in testing/verification. In single-threaded programming, people will often bash out some half-thought-out code, run it, and if it seems to work, they'll call it good, and often get away with it using it in a production environment.
In multithreaded programming, on the other hand, the program's behavior is non-deterministic, because the exact combination of timing of which threads are running for which periods of time (relative to each other) will be different every time the program runs. So just running a multithreaded program a few times (or even a few million times) and saying "it didn't crash for me, ship it!" is entirely inadequate.
Instead, when doing a multithreaded program, you always should be trying to prove (at least to your own satisfaction) that not only does the program work, but that there is no way it could possibly not work. This is much harder, because instead of verifying a single code-path, you are effectively trying to verify a near-infinite number of possible code-paths.
The only realistic way to do that without having your brain explode is to keep things as bone-headedly simple as you can possibly make them. If you can avoid using multithreading totally, do that. If you must do multithreading, share as little data between threads as possible, and use proper multithreading primitives (e.g. mutexes, thread-safe message queues, wait conditions) and don't try to get away with half-measures (e.g. trying to synchronize access to a shared piece of data using only boolean flags will never work reliably, so don't try it)
What you want to avoid is the multithreading hell scenario: the multithreaded program that runs happily for weeks on end on your test machine, but crashes randomly, about once a year, at the customer's site. That kind of race-condition bug can be nearly impossible to reproduce, and the only way to avoid it is to design your code extremely carefully to guarantee it can't happen.
Threads are strong juju. Use them sparingly.

You should have an understanding of basic systems programming, in particular:
Synchronous vs Asynchronous I/O (blocking vs. non-blocking)
Synchronization mechanisms, such as lock and mutex constructs
Thread management on your target platform

I found viewing the introductory lectures on OS and systems programming here by John Kubiatowicz at Berkeley useful.

Part of my graduate study area relates to parallelism.
I read this book and found it a good summary of approaches at the design level.
At the basic technical level, you have 2 basic options: threads or message passing. Threaded applications are the easiest to get off the ground, since pthreads, windows threads or boost threads are ready to go. However, it brings with it the complexity of shared memory.
Message-passing usability seems mostly limited at this point to the MPI API. It sets up an environment where you can run jobs and partition your program between processors. It's more for supercomputer/cluster environments where there's no intrinsic shared memory. You can achieve similar results with sockets and so forth.
At another level, you can use language type pragmas: the popular one today is OpenMP. I've not used it, but it appears to build threads in via preprocessing or a link-time library.
The classic problem is synchronization here; all the problems in multiprogramming come from the non-deterministic nature of multiprograms, which can not be avoided.
See the Lamport timing methods for a further discussion of synchronizations and timing.
Multithreading is not something that only Ph.D.s and gurus can do, but you will have to be pretty decent to do it without making insane bugs.

I'm in the same boat as you, I am just starting multi threading for the first time as part of a project and I've been looking around the net for resources. I found this blog to be very informative. Part 1 is pthreads, but I linked starting on the boost section.

I have written a multithreaded server application and a multithreaded shellsort. They were both written in C and use NT's threading functions "raw" that is without any function library in-between to muddle things. They were two quite different experiences with different conclusions to be drawn. High performance and high reliability were the main priorities although coding practices had a higher priority if one of the first two was judged to be threatened in the long term.
The server application had both a server and a client part and used IOCPs (I/O completion ports) to manage requests and responses. When using IOCPs it is important never to use more threads than you have cores. Also I found that requests to the server part needed a higher priority so as not to lose any requests unnecessarily. Once they were "safe" I could use lower priority threads to create the server responses. I judged that the client part could have an even lower priority. I asked the questions "what data can't I lose?" and "what data can I allow to fail because I can always retry?" I also needed to be able to interface to the application's settings through a window and it had to be responsive. The trick was that the UI had normal priority, the incoming requests one less and so on. My reasoning behind this was that since I will use the UI so seldom it can have the highest priority so that when I use it it will respond immediately. Threading here turned out to mean that all separate parts of the program in the normal case would/could be running simultaneously but when the system was under higher load, processing power would be shifted to the vital parts due to the prioritization scheme.
I've always liked shellsort, so please spare me from pointers about quicksort this or that or blablabla. Or about how shellsort is ill-suited for multithreading. Having said that, the problem I had had to do with sorting a semi-large list of units in memory (for my tests I used a reverse-sorted list of one million units of forty bytes each). Using a single-threaded shellsort I could sort them at a rate of roughly one unit every two microseconds. My first attempt to multithread was with two threads (though I soon realized that I wanted to be able to specify the number of threads) and it ran at about one unit every 3.5 microseconds, that is to say SLOWER. Using a profiler helped a lot, and one bottleneck turned out to be the statistics logging (i.e. compares and swaps) where the threads would bump into each other. Dividing up the data between the threads in an efficient way turned out to be the biggest challenge and there is definitely more I can do there, such as dividing the vector containing the indices to the units in cache-line size adapted chunks and perhaps also comparing all indices in two cache lines before moving to the next line (at least I think there is something I can do there; the algorithms get pretty complicated). In the end, I achieved a rate of one unit every microsecond with three simultaneous threads (four threads about the same, I only had four cores available).
As to the original question my advice to you would be
If you have the time, learn the threading mechanism at the lowest possible level.
If performance is important learn the related mechanisms that the OS provides. Multi-threading by itself is seldom enough to achieve an application's full potential.
Use profiling to understand the quirks of multiple threads working on the same memory.
Sloppy architectural work will kill any app, regardless of how many cores and systems you have executing it and regardless of the brilliance of your programmers.
Sloppy programming will kill any app, regardless of the brilliance of the architectural foundation.
Understand that using libraries lets you reach the development goal faster but at the price of less understanding and (usually) lower performance.

Before giving any advice on dos and don'ts about multi-threaded programming in C++, I would like to ask a question: is there any particular reason you want to write the application in C++?
There are other programming paradigms where you utilize multiple cores without getting into multi-threaded programming. One such paradigm is functional programming. Write each piece of your code as functions without any side effects. Then it is easy to run it in multiple threads without worrying about synchronization.
I am using Erlang for my development purposes. It has increased my productivity by at least 50%. The code may not run as fast as code written in C++. But I have noticed that for most back-end offline data processing, speed is not as important as distribution of work and utilizing the hardware as much as possible. Erlang provides a simple concurrency model where you can execute a single function in multiple threads without worrying about synchronization issues. Writing multi-threaded code is easy, but debugging it is time consuming. I have done multi-threaded programming in C++, but I am currently happy with the Erlang concurrency model. It is worth looking into.

Make sure you know what volatile means and its uses (which may not be obvious at first).
Also, when designing multithreaded code, it helps to imagine that an infinite number of processors is executing every single line of code in your application at once. (Er, every single line of code that is possible according to the logic in your code.) And that for everything that isn't marked volatile, the compiler does a special optimization so that only the thread that changed it can read/set its true value and all the other threads get garbage.
