How do I detect memory access violation and/or memory race conditions?

How do I detect memory access violation and/or memory race conditions? - c++

I have a target platform reporting when memory is read from or written to as well as when locks(think mutex for example) are taken/freed. It reports the program counter, data address and read/write flag. I am writing a program to use this information on a separate host machine where the reports are received so it does not interfere with the target. The target already reports this data so I am not changing the target code at all.
Are there any references or already available algorithms that do this kind of detection? For example, some way of detecting race conditions when multiple threads try to write to a global variable without protecting it first.
I am currently brewing my own but I convince myself there is definitely some code out there that does this already. Or at least some proven algorithm of how to go about it.
Note This is not to detect memory leaks.
Note Implementation language is C++
I am trying to make the detection code I write platform agnostic so I am using STL and just Standard C++ with libraries like boost, poco, loki.
Any leads will help
thanks.

It is probably too late to talk you out of this, but this does not work. Threading races are caused by subtle timing issues between threads. You can never diagnose timing related problems with logging. Heisenbergian, just logging alters the timing of a thread. Especially the kind you are contemplating. Infamously, there's plenty of software that shipped with logging kept turned on because it would nosedive with it turned off.
Flushing out threading bugs is hard. The kind of tool that works is one that intentionally injects random delays in code. Microsoft CHESS is an example, works on native code too.

To address only part of your question, race conditions are extremely nasty precisely because there is no good way to test for them. By definition they're unpredictable sequences of events that are quite difficult to diagnose. Detection code depends on the fact that the race condition is actually happening, and in that case it's likely that you'll see errant behavior anyway. Any test code you add may make them more or less likely to appear, or possibly even change the timing such that they never appear at all.
Instead of trying to detect race conditions, what about attempting program design that helps make you more resilient to having them in the first place?
For example if your global variable were simply encapsulated in an object that knows all the proper protection that needs to happen on access, then it's impossible for threads to concurrently write to it, because such a interface doesn't exist. Programmatically preventing race conditions is going to be easier than trying to detect them algorithmically (chances are you'll still catch some during unit/subsystem testing).

Related

Race condition detection tools

I would like to test a big and complex (over 1.3M LOC) server application for race conditions. The application is written in C and C++ and running on a 64 bit Linux. I did some research and came up with some dynamic tools (e.g., Intel inspector, Tsan, Helgrind & DRD) and some static tools (e.g., RELAY, RacerX).
The dynamic tools are supposed to be more accurate (less false positives) and can handle custom synchronization mechanisms, but impose a significant runtime overhead that will trigger the application's timeouts. The problem with the static tools is that it seems mostly academic and not maintained (e.g., RELAY's latest version is from 2010).
Currently I'm thinking to use Tsan and stretch the application's timers to accommodate for the added overhead. Did anyone face similar challenges and have some insights I might have missed?

Unfortunately, I think this might be a past the line of "opinion-based" questions, but I'll take a shot.
Without understanding anything about the application, it is nearly impossible to say what you might need to consider when using tsan. On a smaller (103k LOC) project that I work on, designed for high throughput network stuff, it's nearly always been sufficient to design tests to exercise various code paths and test them. I've never needed to stretch timers or timeouts. I imagine this might be problematic if you have some hard real-time constraints (I do not). I haven't experienced tsan overhead to be prohibitively large.
One thing I will note is that tsan does not play well with concurrent data structures (such as those provided by concurrencykit and others). This is because the implementation of these concurrent data structures frequently rely on detection of data races to determine execution behavior.
Consider, for instance, a full ring buffer with two concurrent consumers. The readers will likely be flagged as racing on temporary reads of the front of the ring, because they do. However, the consumers linearize on an atomic comare-and-swap operation to set an incremented, racy-read value to the next index of the ring. If the swap fails, the operation is retried. Therefore, although the reads and writes may race, correctness is guaranteed.
From the perspective of tsan, these aren't considered false positives because they are actual data races. On the other hand, they are false positives for all practical purposes because they don't actually cause any incorrect or undefined behavior. There are ways you can instrument your code to avoid this, but it has been more hassle than it's worth when I've tried it. It depends on how noisy your output is.
Also note that if your application is calling into uninstrumented libraries (libc, openssl, whatever), you will miss potential races. If a race happens with concurrent calls to an uninstrumented library, you will miss the race.
If using tsan, don't forget to use -fno-omit-frame-pointer (and don't forget to place that after any -Olevel option). Otherwise you'll be in hell with addr2line, or forced to rebuild.
Unfortunately, I don't have any experience with the other utilities you've listed, but since your question seems to be about tsan specifically, I hope this is helpful.

Is it not possible to make a C++ application "Crash Proof"?

Let's say we have an SDK in C++ that accepts some binary data (like a picture) and does something. Is it not possible to make this SDK "crash-proof"? By crash I primarily mean forceful termination by the OS upon memory access violation, due to invalid input passed by the user (like an abnormally short junk data).
I have no experience with C++, but when I googled, I found several means that sounded like a solution (use a vector instead of an array, configure the compiler so that automatic bounds check is performed, etc.).
When I presented this to the developer, he said it is still not possible.. Not that I don't believe him, but if so, how is language like Java handling this? I thought the JVM performs everytime a bounds check. If so, why can't one do the same thing in C++ manually?
UPDATE
By "Crash proof" I don't mean that the application does not terminate. I mean it should not abruptly terminate without information of what happened (I mean it will dump core etc., but is it not possible to display a message like "Argument x was not valid" etc.?)

You can check the bounds of an array in C++, std::vector::at does this automatically.
This doesn't make your app crash proof, you are still allowed to deliberately shoot yourself in the foot but nothing in C++ forces you to pull the trigger.

No. Even assuming your code is bug free. For one, I have looked at many a crash reports automatically submitted and I can assure you that the quality of the hardware out there is much bellow what most developers expect. Bit flips are all too common on commodity machines and cause random AVs. And, even if you are prepared to handle access violations, there are certain exceptions that the OS has no choice but to terminate the process, for example failure to commit a stack guard page.

By crash I primarily mean forceful termination by the OS upon memory access violation, due to invalid input passed by the user (like an abnormally short junk data).
This is what usually happens. If you access some invalid memory usually OS aborts your program.
However the question what is invalid memory... You may freely fill with garbage all the memory in heap and stack and this is valid from OS point of view, it would not be valid from your point of view as you created garbage.
Basically - you need to check the input data carefully and relay on this. No OS would do this for you.
If you check your input data carefully you would likely to manage the data ok.

I primarily mean forceful termination
by the OS upon memory access
violation, due to invalid input passed
by the user
Not sure who "the user" is.
You can write programs that won't crash due to invalid end-user input. On some systems, you can be forcefully terminated due to using too much memory (or because some other program is using too much memory). And as Remus says, there is no language which can fully protect you against hardware failures. But those things depend on factors other than the bytes of data provided by the user.
What you can't easily do in C++ is prove that your program won't crash due to invalid input, or go wrong in even worse ways, creating serious security flaws. So sometimes[*] you think that your code is safe against any input, but it turns out not to be. Your developer might mean this.
If your code is a function that takes for example a pointer to the image data, then there's nothing to stop the caller passing you some invalid pointer value:
char *image_data = malloc(1);
free(image_data);
image_processing_function(image_data);
So the function on its own can't be "crash-proof", it requires that the rest of the program doesn't do anything to make it crash. Your developer also might mean this, so perhaps you should ask him to clarify.
Java deals with this specific issue by making it impossible to create an invalid reference - you don't get to manually free memory in Java, so in particular you can't retain a reference to it after doing so. It deals with a lot of other specific issues in other ways, so that the situations which are "undefined behavior" in C++, and might well cause a crash, will do something different in Java (probably throw an exception).
[*] let's face it: in practice, in large software projects, "often".

I think this is a case of C++ codes not being managed codes.
Java, C# codes are managed, that is they are effectively executed by an Interpreter which is able to perform bound checking and detect crash conditions.
With the case of C++, you need to perform bound and other checking yourself. However, you have the luxury of using Exception Handling, which will prevent crash during events beyond your control.
The bottom line is, C++ codes themselves are not crash proof, but a good design and development can make them to be so.

In general, you can't make a C++ API crash-proof, but there are techniques that can be used to make it more robust. Off the top of my head (and by no means exhaustive) for your particular example:
Sanity check input data where possible
Buffer limit checks in the data processing code
Edge and corner case testing
Fuzz testing
Putting problem inputs in the unit test for regression avoidance

If "crash proof" only mean that you want to ensure that you have enough information to investigate crash after it occurred solution can be simple. Most cases when debugging information is lost during crash resulted from corruption and/or loss of stack data due to illegal memory operation by code running in one of threads. If you have few places where you call library or SDK that you don't trust you can simply save the stack trace right before making call into that library at some memory location pointed to by global variable that will be included into partial or full memory dump generated by system when your application crashes. On windows such functionality provided by CrtDbg API.On Linux you can use backtrace API - just search doc on show_stackframe(). If you loose your stack information you can then instruct your debugger to use that location in memory as top of the stack after you loaded your dump file. Well it is not very simple after all, but if you haunted by memory dumps without any clue what happened it may help.
Another trick often used in embedded applications is cycled memory buffer for detailed logging. Logging to the buffer is very cheap since it is never saved, but you can get idea on what happen milliseconds before crash by looking at content of the buffer in your memory dump after the crash.

Actually, using bounds checking makes your application more likely to crash!
This is good design because it means that if your program is working, it's that much more likely to be working /correctly/, rather than working incorrectly.
That said, a given application can't be made "crash proof", strictly speaking, until the Halting Problem has been solved. Good luck!

Testing approach for multi-threaded software

I have a piece of mature geospatial software that has recently had areas rewritten to take better advantage of the multiple processors available in modern PCs. Specifically, display, GUI, spatial searching, and main processing have all been hived off to seperate threads. The software has a pretty sizeable GUI automation suite for functional regression, and another smaller one for performance regression. While all automated tests are passing, I'm not convinced that they provide nearly enough coverage in terms of finding bugs relating race conditions, deadlocks, and other nasties associated with multi-threading. What techniques would you use to see if such bugs exist? What techniques would you advocate for rooting them out, assuming there are some in there to root out?
What I'm doing so far is running the GUI functional automation on the app running under a debugger, such that I can break out of deadlocks and catch crashes, and plan to make a bounds checker build and repeat the tests against that version. I've also carried out a static analysis of the source via PC-Lint with the hope of locating potential dead locks, but not had any worthwhile results.
The application is C++, MFC, mulitple document/view, with a number of threads per doc. The locking mechanism I'm using is based on an object that includes a pointer to a CMutex, which is locked in the ctor and freed in the dtor. I use local variables of this object to lock various bits of code as required, and my mutex has a time out that fires my a warning if the timeout is reached. I avoid locking where possible, using resource copies where possible instead.
What other tests would you carry out?
Edit: I have cross posted this question on a number of different testing and programming forums, as I'm keen to see how the different mind-sets and schools of thought would approach this issue. So apologies if you see it cross-posted elsewhere. I'll provide a summary links to responses after a week or so

Some suggestions:
Utilize the law of large numbers and perform the operation under test not only once, but many times.
Stress-test your code by exaggerating the scenarios. E.g. to test your mutex-holding class, use scenarios where the mutex-protected code:
is very short and fast (a single instruction)
is time-consuming (Sleep with a large value)
contains explicit context switches (Sleep (0))
Run your test on various different architectures. (Even if your software is Windows-only, test it on single- and multicore processors with and without hyperthreading, and a wide range of clock speeds)
Try to design your code such that most of it is not exposed to multithreading issues. E.g. instead of accessing shared data (which requires locking or very carefully designed lock-avoidance techniques), let your worker threads operate on copies of the data, and communicate with them using queues. Then you only have to test your queue class for thread-safety
Run your tests when the system is idle as well as when it is under load from other tasks (e.g. our build server frequently runs multiple builds in parallel. This alone revealed many multithreading bugs that happened when the system was under load.)
Avoid asserting on timeouts. If such an assert fails, you don't know whether the code is broken or whether the timeout was too short. Instead, use a very generous timeout (just to ensure that the test eventually fails). If you want to test that an operation doesn't take longer than a certain time, measure the duration, but don't use a timeout for this.

Whilst I agree with #rstevens answer in that there's currently no way to unit test threading issues with 100% certainty there are some things that I've found useful.
Firstly whatever tests you have make sure you run them on lots of different spec boxes. I have several build machines, all different, multi-core, single core, fast, slow, etc. The good thing about how diverse they are is that different ones will throw up different threading issues. I've regularly been surprised to add a new build machine to my farm and suddenly have a new threading bug exposed; and I'm talking about a new bug being exposed in code that has run 10000s of times on the other build machines and which shows up 1 in 10 on the new one...
Secondly most of the unit testing that you do on your code needn't involve threading at all. The threading is, generally, orthogonal. So step one is to tease the code apart so that you can test the actual code that does the work without worrying too much about the threaded nature. This usually means creating an interface that the threading code uses to drive the real code. You can then test the real code in isolation.
Thridly you can test where the threaded code interacts with the main body of code. This means writing a mock for the interface that you developed to separate the two blocks of code. By now the threading code is likely much simpler and you can then often place synchronisation objects in the mock that you've made so that you can control the code under test. So, you'd spin up your thread and wait for it to set an event by calling into your mock and then have it block on another event which your test code controls. The test code can then step the threaded code from one point in your interface to the next.
Finally (if you've decoupled things enough that you can do the earlier stuff then this is easy) you can then run larger pieces of the multi-threaded parts of the app under test and make sure you get the results that you expect; you can play with the priority of the threads and maybe even add a couple of test threads that simply eat CPU to stir things up a bit.
Now you run all of these tests many many times on different hardware...
I've also found that running the tests (or the app) under something like DevPartner BoundsChecker can help a lot as it messes with the thread scheduling such that it sometimes shakes out hard to find bugs. I also wrote a deadlock detection tool which checks for lock inversions during program execution but I only use that rarely.
You can see an example of how I test multi-threaded C++ code here: http://www.lenholgate.com/blog/2004/05/practical-testing.html

Not really an answer:
Testing multithreaded bugs is very difficult. Most bugs only show up if two (or more) threads go to specific places in code in a specific order.
If and when this condition is met may depend on the timing of the process running. This timing may change due to one of the following pre-conditions:
Type of processor
Processor speed
Number of processors/cores
Optimization level
Running inside or outside the debugger
Operating system
There are for sure more pre-conditions that I forgot.
Because MT-bugs so highly depend on the exact timing of the code running Heisenberg's "Uncertainty principle" comes in here: If you want to test for MT bugs you change the timing by your "measures" which may prevent the bug from occurring...
The timing thing is what makes MT bugs so highly non-deterministic.
In other words: You may have a software that runs for months and then crashes some day and after that may run for years. If you don't have some debug logs/core dumps etc. you may never know why it crashes.
So my conclusion is: There is no really good way to Unit-Test for thread-safety. You always have to keep your eyes open when programming.
To make this clear I will give you a (simplified) example from real life (I encountered this when changing my employer and looking on the existing code there):
Imagine you have a class. You want that class to automatically deleted if no-one uses it anymore. So you build a reference-counter into that class:
(I know it is a bad style to delete an instance of a class in one of it's methods. This is because of the simplification of the real code which uses a Ref class to handle counted references.)
class A {
private:
int refcount;
public:
A() : refcount(0) {
}
void Ref() {
refcount++;
}
void Release() {
refcount--;
if (refcount == 0) {
delete this;
}
}
};
This seams pretty simple and nothing to worry about. But this is not thread-safe!
It's because "refcount++" and "refcount--" are not atomic operations but both are three operations:
read refcount from memory to register
increment/decrement register
write refcount from register to memory
Each of those operations can be interrupted and another thread may, at the same time manipulate the same refcount. So if for example two threads want to incremenet refcount the following COULD happen:
Thread A: read refcount from memory to register (refcount: 8)
Thread A: increment register
CONTEXT CHANGE -
Thread B: read refcount from memory to register (refcount: 8)
Thread B: increment register
Thread B: write refcount from register to memory (refcount: 9)
CONTEXT CHANGE -
Thread A: write refcount from register to memory (refcount: 9)
So the result is: refcount = 9 but it should have been 10!
This can only be solved by using atomic operations (i.e. InterlockedIncrement() & InterlockedDecrement() on Windows).
This bug is simply untestable! The reason is that it is so highly unlikely that there are two threads at the same time trying to modify the refcount of the same instance and that there are context switches in between that code.
But it can happen! (The probability increases if you have a multi-processor or multi-core system because there is no context switch needed to make it happen).
It will happen in some days, weeks or months!

Looks like you are using Microsoft tools. There's a group at Microsoft Research that has been working on a tool specifically designed to shake out concurrency bugz. Check out CHESS. Other research projects, in their early stages, are Cuzz and Featherlite.
VS2010 includes a very good looking concurrency profiler, video is available here.

As Len Holgate mentions, I would suggest refactoring (if needed) and creating interfaces for the parts of the code where different threads interact with objects carrying a state. These parts of the code can then be tested separate from the code containing the actual functionality. To verify such a unit test, I would consider using a code coverage tool (I use gcov and lcov for this) to verify that everything in the thread safe interface is covered.
I think this is a pretty convenient way of verifying that new code is covered in the tests.
The next step is then to follow the advice of the other answers regarding how to run the tests.

Firstly, many thanks for the responses. For the responses posted across different forumes see;
http://www.sqaforums.com/showflat.php?Cat=0&Number=617621&an=0&page=0#Post617621
Testing approach for multi-threaded software
http://www.softwaretestingclub.com/forum/topics/testing-approach-for?xg_source=activity
and the following mailing list; software-testing#yahoogroups.com
The testing took significantly longer than expected, hence this late reply, leading me to the conclusion that adding multi-threading to existing apps is liable to be very expensive in terms of testing, even if the coding is quite straightforward. This could prove interesting for the SQA community, as there is increasingly more multi-threaded development going on out there.
As per Joe Strazzere's advice, I found the most effective way of hitting bugs was via automation with varied input. I ended up doing this on three PCs, which have ran a bank of tests repeatedly with varied input over about six weeks. Initially, I was seeing crashes one or two times per PC per day. As I tracked these down, it ended up with one or two per week between the three PCs, and we haven't had any further problems for the last two weeks. For the last two weeks we have also had a version with users beta testing, and are using the software in-house.
In addition to varying the input under automation, I also got good results from the following;
Adding a test option that allowed mutex time-outs to be read from a configuration file, which in turn could be controlled by my automation.
Extending mutex time-outs beyond the typical time expected to execute a section of thread code, and firing a debug exception on time-out.
Running the automation in conjunction with a debugger (VS2008) such that when a problem occurred there was a better chance of tracking it down.
Running without a debugger to ensure that the debugger was not hiding other timing related bugs.
Running the automation against normal release, debug, and fully optimised build. FWIW, the optimised build threw up errors not reproducible in the other builds.
The type of bugs uncovered tended to be serious in nature, e.g. dereferencing invalid pointers, and even under the debugger took quite a bit of tracking down. As has been discussed elsewhere, the SuspendThread and ResumeThread functions ended up being major culprits, and all use of these functions were replaced by mutexes. Similarly all critical sections were removed due to lack of time-outs. Closing documents and exiting the program were also a bug source, where in one instance a document was destroyed with a worker thread still active. To overcome this a single mutex was added per thread to control the life of the thread, and aquired by the document destructor to ensure the thread had terminated as expected.
Once again, many thanks for the all the detailed and varied responses. Next time I take on this type of activity, I'll be better prepared.

How best to test a Mutex implementation?

What is the best way to test an implementation of a mutex is indeed correct? (It is necessary to implement a mutex, reuse is not a viable option)
The best I have come up with is to have many (N) concurrent threads iteratively attempting to access the protected region (I) times, which has a side effect (e.g. update to a global) so that the number of accesses + writes can be counted to ensure that the number of updates to the global is exactly (N)*(I).
Any other suggestions?

Formal proof is better than testing for this kind of thing.
Testing will show you that -- as long as you're not unlucky -- it all worked. But a test is a blunt instrument; it can fail to execute the exact right sequence to cause failure.
It's too hard to test every possible sequence of operations available in the hardware to be sure your mutex works under all circumstances.
Testing is not without value; it shows that you didn't make any obvious coding errors.
But you really need a more formal code inspection to demonstrate that it does the right things at the right times so that one client will atomically seize the lock resource required for proper mutex. In many platforms, there are special instructions to implement this, and if you're using one of those, you have a fighting chance of getting it right.
Similarly, you have to show that the release is atomic.

With something like a mutex, we get back to the old rule that testing can only demonstrate the presence of bugs, not the absence. A year of testing will probably tell you less than simply putting the code up for inspection, and asking if anybody sees a problem with it.

I'm with everyone else that this is incredibly difficult to prove conclusively, I have no idea how to do it - not helpful I know!
When you say implement a mutex and that reuse is not an option, is that for technical reasons like there is no Mutex implementation on the platform/OS that you are using or some other reason? Is wrapping some form of OS level 'lock' and calling it your Mutex implementation an option, eg CriticalSection on windoze, posix Condition Variables? If you can wrap a lower level OS lock then your chance of getting it right are much higher.
In case you have not already done so go and read Herb Sutter's Effective Concurrency articles. There should be some stuff in these worth something to you.
Anyway, some things to consider in your tests:
If your mutex is recursive (ie can the same thread lock it multiple
times) then ensure you do some
reference counting tests.
With your global variable that you
modify it would be best if this was a
varible that can't be written to
atomically. For example if you are
on an 8 bit platform then use a 16 or
32 bit variable that requires
multiple assembly instructions to write.
Carefully examine the assembly listing. Depending on the hardware platform though the assembly does not translate directly to how the code might be optimised...
Get someone else, who didn't write the code, to also write some tests.
test on as many different machines with different specs as you can (assuming that this is 'general purpose' and not for a specific hardware setup)
Good luck!

This reminds me of this question about FIFO semaphore test. In a nutshell my answer was:
Even if you have a specification, maybe it doesn't convey your intention exactly
You can prove that the algorithm fulfills the specification, but not the code (D. Knuth)
Test reveal only the presence of bug, not their absence (Dijkstra)
So your proposition seem reasonably the best to do. If you want to increase your confidence, use fuzzing to randomize scheduling and input.

If the proof stuff doesn't work out for you, then go with the testing one. Be sure to test all the possible use cases. Find out how exactly this thing will be used, who will be using it and, again, how it will be used. When you go the testing route be sure to run each test for each scenario a ton of times (millions, billions, as many as you can possibly get in with the testing time you have).
Try to be random because randomness will give you the best chance to cover all scenarios in a limited number of tests. Be sure to use data that will be used and data that may not be used but could be used and make sure the data doesn't mess up the locks.
BTW, unless you know a ton about mathematics and formal methods you will have no chance of actually coming up with a proof.

What are the "things to know" when diving into multi-threaded programming in C++

I'm currently working on a wireless networking application in C++ and it's coming to a point where I'm going to want to multi-thread pieces of software under one process, rather than have them all in separate processes. Theoretically, I understand multi-threading, but I've yet to dive in practically.
What should every programmer know when writing multi-threaded code in C++?

I would focus on design the thing as much as partitioned as possible so you have the minimal amount of shared things across threads. If you make sure you don't have statics and other resources shared among threads (other than those that you would be sharing if you designed this with processes instead of threads) you would be fine.
Therefore, while yes, you have to have in mind concepts like locks, semaphores, etc, the best way to tackle this is to try to avoid them.

I am no expert at all in this subject. Just some rule of thumb:
Design for simplicity, bugs really are hard to find in concurrent code even in the simplest examples.
C++ offers you a very elegant paradigm to manage resources(mutex, semaphore,...): RAII. I observed that it is much easier to work with boost::thread than to work with POSIX threads.
Build your code as thread-safe. If you don't do so, your program could behave strangely

I am exactly in this situation: I wrote a library with a global lock (many threads, but only one running at a time in the library) and am refactoring it to support concurrency.
I have read books on the subject but what I learned stands in a few points:
think parallel: imagine a crowd passing through the code. What happens when a method is called while already in action ?
think shared: imagine many people trying to read and alter shared resources at the same time.
design: avoid the problems that points 1 and 2 can raise.
never think you can ignore edge cases, they will bite you hard.
Since you cannot proof-test a concurrent design (because thread execution interleaving is not reproducible), you have to ensure that your design is robust by carefully analyzing the code paths and documenting how the code is supposed to be used.
Once you understand how and where you should bottleneck your code, you can read the documentation on the tools used for this job:
Mutex (exclusive access to a resource)
Scoped Locks (good pattern to lock/unlock a Mutex)
Semaphores (passing information between threads)
ReadWrite Mutex (many readers, exclusive access on write)
Signals (how to 'kill' a thread or send it an interrupt signal, how to catch these)
Parallel design patterns: boss/worker, producer/consumer, etc (see schmidt)
platform specific tools: openMP, C blocks, etc
Good luck ! Concurrency is fun, just take your time...

You should read about locks, mutexes, semaphores and condition variables.
One word of advice, if your app has any form of UI make sure you always change it from the UI thread. Most UI toolkits/frameworks will crash (or behave unexpectedly) if you access them from a background thread. Usually they provide some form of dispatching method to execute some function in the UI thread.

Never assume that external APIs are threadsafe. If it is not explicitly stated in their docs, do not call them concurrently from multiple threads. Instead, limit your use of them to a single thread or use a mutex to prevent concurrent calls (this is rather similar to the aforementioned GUI libraries).
Next point is language-related. Remember, C++ has (currently) no well-defined approach to threading. The compiler/optimizer does not know if code might be called concurrently. The volatile keyword is useful to prevent certain optimizations (i.e. caching of memory fields in CPU registers) in multi-threaded contexts, but it is no synchronization mechanism.
I'd recommend boost for synchronization primitives. Don't mess with platform APIs. They make your code difficult to port because they have similar functionality on all major platforms, but slightly different detail behaviour. Boost solves these problems by exposing only common functionality to the user.
Furthermore, if there's even the smallest chance that a data structure could be written to by two threads at the same time, use a synchronization primitive to protect it. Even if you think it will only happen once in a million years.

One thing I've found very useful is to make the application configurable with regard to the actual number of threads it uses for various tasks. For example, if you have multiple threads accessing a database, make the number of those threads be configurable via a command line parameter. This is extremely handy when debugging - you can exclude threading issues by setting the number to 1, or force them by setting it to a high number. It's also very handy when working out what the optimal number of threads is.

Make sure you test your code in a single-cpu system and a multi-cpu system.
Based on the comments:-
Single socket, single core
Single socket, two cores
Single socket, more than two cores
Two sockets, single core each
Two sockets, combination of single, dual and multi core cpus
Mulitple sockets, combination of single, dual and multi core cpus
The limiting factor here is going to be cost. Ideally, concentrate on the types of system your code is going to run on.

In addition to the other things mentioned, you should learn about asynchronous message queues. They can elegantly solve the problems of data sharing and event handling. This approach works well when you have concurrent state machines that need to communicate with each other.
I'm not aware of any message passing frameworks tailored to work only at the thread level. I've only seen home-brewed solutions. Please comment if you know of any existing ones.
EDIT:
One could use the lock-free queues from Intel's TBB, either as-is, or as the basis for a more general message-passing queue.

Since you are a beginner, start simple. First make it work correctly, then worry about optimizations. I've seen people try to optimize by increasing the concurrency of a particular section of code (often using dubious tricks), without ever looking to see if there was any contention in the first place.
Second, you want to be able to work at as high a level as you can. Don't work at the level of locks and mutexs if you can using an existing master-worker queue. Intel's TBB looks promising, being slightly higher level than pure threads.
Third, multi-threaded programming is hard. Reduce the areas of your code where you have to think about it as much as possible. If you can write a class such that objects of that class are only ever operated on in a single thread, and there is no static data, it greatly reduces the things that you have to worry about in the class.

A few of the answers have touched on this, but I wanted to emphasize one point:
If you can, make sure that as much of your data as possible is only accessible from one thread at a time. Message queues are a very useful construct to use for this.
I haven't had to write much heavily-threaded code in C++, but in general, the producer-consumer pattern can be very helpful in utilizing multiple threads efficiently, while avoiding the race conditions associated with concurrent access.
If you can use someone else's already-debugged code to handle thread interaction, you're in good shape. As a beginner, there is a temptation to do things in an ad-hoc fashion - to use a "volatile" variable to synchronize between two pieces of code, for example. Avoid that as much as possible. It's very difficult to write code that's bulletproof in the presence of contending threads, so find some code you can trust, and minimize your use of the low-level primitives as much as you can.

My top tips for threading newbies:
If you possibly can, use a task-based parallelism library, Intel's TBB being the most obvious one. This insulates you from the grungy, tricky details and is more efficient than anything you'll cobble together yourself. The main downside is this model doesn't support all uses of multithreading; it's great for exploiting multicores for compute power, less good if you wanted threads for waiting on blocking I/O.
Know how to abort threads (or in the case of TBB, how to make tasks complete early when you decide you didn't want the results after all). Newbies seem to be drawn to thread kill functions like moths to a flame. Don't do it... Herb Sutter has a great short article on this.

Make sure to explicitly know what objects are shared and how they are shared.
As much as possible make your functions purely functional. That is they have inputs and outputs and no side effects. This makes it much simpler to reason about your code. With a simpler program it isn't such a big deal but as the complexity rises it will become essential. Side effects are what lead to thread-safety issues.
Plays devil's advocate with your code. Look at some code and think how could I break this with some well timed thread interleaving. At some point this case will happen.
First learn thread-safety. Once you get that nailed down then you move onto the hard part: Concurrent performance. This is where moving away from global locks is essential. Figuring out ways to minimize and remove locks while still maintaining the thread-safety is hard.

Keep things dead simple as much as possible. It's better to have a simpler design (maintenance, less bugs) than a more complex solution that might have slightly better CPU utilization.
Avoid sharing state between threads as much as possible, this reduces the number of places that must use synchronization.
Avoid false-sharing at all costs (google this term).
Use a thread pool so you're not frequently creating/destroying threads (that's expensive and slow).
Consider using OpenMP, Intel and Microsoft (possibly others) support this extension to C++.
If you are doing number crunching, consider using Intel IPP, which internally uses optimized SIMD functions (this isn't really multi-threading, but is parallelism of a related sorts).
Have tons of fun.

Stay away from MFC and it's multithreading + messaging library.
In fact if you see MFC and threads coming toward you - run for the hills (*)
(*) Unless of course if MFC is coming FROM the hills - in which case run AWAY from the hills.

The biggest "mindset" difference between single-threaded and multi-threaded programming in my opinion is in testing/verification. In single-threaded programming, people will often bash out some half-thought-out code, run it, and if it seems to work, they'll call it good, and often get away with it using it in a production environment.
In multithreaded programming, on the other hand, the program's behavior is non-deterministic, because the exact combination of timing of which threads are running for which periods of time (relative to each other) will be different every time the program runs. So just running a multithreaded program a few times (or even a few million times) and saying "it didn't crash for me, ship it!" is entirely inadequate.
Instead, when doing a multithreaded program, you always should be trying to prove (at least to your own satisfaction) that not only does the program work, but that there is no way it could possibly not work. This is much harder, because instead of verifying a single code-path, you are effectively trying to verify a near-infinite number of possible code-paths.
The only realistic way to do that without having your brain explode is to keep things as bone-headedly simple as you can possibly make them. If you can avoid using multithreading totally, do that. If you must do multithreading, share as little data between threads as possible, and use proper multithreading primitives (e.g. mutexes, thread-safe message queues, wait conditions) and don't try to get away with half-measures (e.g. trying to synchronize access to a shared piece of data using only boolean flags will never work reliably, so don't try it)
What you want to avoid is the multithreading hell scenario: the multithreaded program that runs happily for weeks on end on your test machine, but crashes randomly, about once a year, at the customer's site. That kind of race-condition bug can be nearly impossible to reproduce, and the only way to avoid it is to design your code extremely carefully to guarantee it can't happen.
Threads are strong juju. Use them sparingly.

You should have an understanding of basic systems programing, in particular:
Synchronous vs Asynchronous I/O (blocking vs. non-blocking)
Synchronization mechanisms, such as lock and mutex constructs
Thread management on your target platform

I found viewing the introductory lectures on OS and systems programming here by John Kubiatowicz at Berkeley useful.

Part of my graduate study area relates to parallelism.
I read this book and found it a good summary of approaches at the design level.
At the basic technical level, you have 2 basic options: threads or message passing. Threaded applications are the easiest to get off the ground, since pthreads, windows threads or boost threads are ready to go. However, it brings with it the complexity of shared memory.
Message-passing usability seems mostly limited at this point to the MPI API. It sets up an environment where you can run jobs and partition your program between processors. It's more for supercomputer/cluster environments where there's no intrinsic shared memory. You can achieve similar results with sockets and so forth.
At another level, you can use language type pragmas: the popular one today is OpenMP. I've not used it, but it appears to build threads in via preprocessing or a link-time library.
The classic problem is synchronization here; all the problems in multiprogramming come from the non-deterministic nature of multiprograms, which can not be avoided.
See the Lamport timing methods for a further discussion of synchronizations and timing.
Multithreading is not something that only Ph.D.`s and gurus can do, but you will have to be pretty decent to do it without making insane bugs.

I'm in the same boat as you, I am just starting multi threading for the first time as part of a project and I've been looking around the net for resources. I found this blog to be very informative. Part 1 is pthreads, but I linked starting on the boost section.

I have written a multithreaded server application and a multithreaded shellsort. They were both written in C and use NT's threading functions "raw" that is without any function library in-between to muddle things. They were two quite different experiences with different conclusions to be drawn. High performance and high reliability were the main priorities although coding practices had a higher priority if one of the first two was judged to be threatened in the long term.
The server application had both a server and a client part and used iocps to manage requests and responses. When using iocps it is important never to use more threads than you have cores. Also I found that requests to the server part needed a higher priority so as not to lose any requests unnecessarily. Once they were "safe" I could use lower priority threads to create the server responses. I judged that the client part could have an even lower priority. I asked the questions "what data can't I lose?" and "what data can I allow to fail because I can always retry?" I also needed to be able to interface to the application's settings through a window and it had to be responsive. The trick was that the UI had normal priority, the incoming requests one less and so on. My reasoning behind this was that since I will use the UI so seldom it can have the highest priority so that when I use it it will respond immediately. Threading here turned out to mean that all separate parts of the program in the normal case would/could be running simultaneously but when the system was under higher load, processing power would be shifted to the vital parts due to the prioritization scheme.
I've always liked shellsort so please spare me from pointers about quicksort this or that or blablabla. Or about how shellsort is ill-suited for multithreading. Having said that, the problem I had had to do with sorting a semi-largelist of units in memory (for my tests I used a reverse-sorted list of one million units of forty bytes each. Using a single-threaded shellsort I could sort them at a rate of roughly one unit every two us (microseconds). My first attempt to multithread was with two threads (though I soon realized that I wanted to be able to specify the number of threads) and it ran at about one unit every 3.5 seconds, that is to say SLOWER. Using a profiler helped a lot and one bottleneck turned out to be the statistics logging (i e compares and swaps) where the threads would bump into each other. Dividing up the data between the threads in an efficient way turned out to be the biggest challenge and there is definitley more I can do there such as dividing the vector containing the indeces to the units in cache-line size adapted chunks and perhaps also comparing all indeces in two cache lines before moving to the next line (at least I think there is something I can do there - the algorithms get pretty complicated). In the end, I achieved a rate of one unit every microsecond with three simultaneous threads (four threads about the same, I only had four cores available).
As to the original question my advice to you would be
If you have the time, learn the threading mechanism at the lowest possible level.
If performance is important learn the related mechanisms that the OS provides. Multi-threading by itself is seldom enough to achieve an application's full potential.
Use profiling to understand the quirks of multiple threads working on the same memory.
Sloppy architectural work will kill any app, regardless of how many cores and systems you have executing it and regardless of the brilliance of your programmers.
Sloppy programming will kill any app, regardless of the brilliance of the architectural foundation.
Understand that using libraries lets you reach the development goal faster but at the price of less understanding and (usually) lower performance .

Before giving any advice on do's and dont's about multi-thread programming in C++, I would like to ask the question Is there any particular reason you want to start writing the application in C++?
There are other programming paradigms where you utilize the multi-cores without getting into multi-threaded programming. One such paradigm is functional programming. Write each piece of your code as functions without any side effects. Then it is easy to run it in multiple thread without worrying about synchronization.
I am using Erlang for my development purpose. It has increased by productivity by at least 50%. Code running may not be as fast as the code written in C++. But I have noticed that for most of the back-end offline data processing, speed is not as important as distribution of work and utilizing the hardware as much as possible. Erlang provides a simple concurrency model where you can execute a single function in multiple-threads without worrying about the synchronization issue. Writing multi-threaded code is easy, but debugging that is time consuming. I have done multi-threaded programming in C++, but I am currently happy with Erlang concurrency model. It is worth looking into.

Make sure you know what volatile means and it's uses(which may not be obvious at first).
Also, when designing multithreaded code, it helps to imagine that an infinite amount of processors is executing every single line of code in your application at once. (er, every single line of code that is possible according to your logic in your code.) And that everything that isn't marked volatile the compiler does a special optimization on it so that only the thread that changed it can read/set it's true value and all the other threads get garbage.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js