Can the NVIDIA K20c use the old stream management behavior?

Starting with the K20, different streams are fully concurrent (they used to be concurrent only at the edges).
However, my program needs the old behavior. Otherwise I need to add a lot of synchronization to solve the dependency problem.
Is it possible to switch stream management back to the old behavior?

From the CUDA C Programming Guide section on Asynchronous Concurrent Execution:
A stream is a sequence of commands (possibly issued by different host
threads) that execute in order. Different streams, on the other hand,
may execute their commands out of order with respect to one another or
concurrently; this behavior is not guaranteed and should therefore not
be relied upon for correctness (e.g., inter-kernel communication is
undefined).
If the application relied on the Compute Capability 2.x and 3.0 implementation of streams, then the program violates the definition of streams, and any change to the CUDA driver (e.g. queuing of per-stream requests) or new hardware will break the program.
If you need a temporary workaround then I would suggest moving all work to a single user defined stream. This may impact performance but it is likely the only temporary workaround.

Can you express the kernel dependencies with cudaEvent_t objects?
The Streams and Concurrency Webinar shows some quick code snippets on how to use events. Some of the details of that presentation are only applicable to pre-Kepler hardware, but I'm assuming from the original question that you're familiar with how things have changed since Fermi now that there are multiple command queues.

Libav multi-threaded decoding

According to the documentation here, Libav provides the "infrastructure" for multithreaded decoding. But the docs are vague and confusing regarding how multithreaded decoding is implemented. Is it internally supported and just requires setting a flag in the struct, or does the user have to provide his own implementation with the provided functions? I searched a lot but could not find even one example of multithreaded video decoding with libav.

The link you referred to looks like a description aimed at codec developers rather than at end users of the FFmpeg libraries who use existing codecs.
Multi-threaded support is indeed implemented by the framework itself. It requires FFmpeg to be built with thread support (e.g. the --enable-pthreads or --enable-w32threads configure options), it varies across specific codecs (one codec may support multiple threads while others don't), and it implements different approaches (decoding multiple frames in parallel, or multiple slices within a single frame).
An end-user application may configure the number of threads to use (via the AVCodecContext::thread_count property, set before avcodec_open2()) and the threading mode (AVCodecContext::thread_type set to FF_THREAD_FRAME or FF_THREAD_SLICE). The thread pool is managed by FFmpeg itself, although some answers say that an application-provided pool can also be used.
Some documents state that an AVCodecContext::thread_count value of 0 lets FFmpeg decide automatically how many threads to use (based on the number of logical CPUs in the system), but I've never tried this (I always set this parameter manually). So it may already be doing multi-threaded decoding on your system; check the CPU load in the task manager.
What FFmpeg doesn't do is manage multiple threads for reading packets from a file, decoding different streams in different threads, and other similar things that a video player normally does; this is normally implemented by the application itself. I do recall, though, that some features simplifying these routines (like a packet queue) have been integrated into FFmpeg.

Locking a process to Cuda core

I'm just getting into GPU processing.
I was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
For example, you may have a small C program that performs an image filter on an index of images. Can you have that program running on each CUDA core, essentially running forever, reading/writing from its own memory to system memory and disk?
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
My semantics here are probably way off. I apologize if what I've said requires some interpretation. I'm not that used to GPU stuff yet.
Thanks.

All of my comments here should be prefaced with "at the moment". Technology is constantly evolving.
was wondering if it's possible to lock a new process, or 'launch' a process that is locked to a CUDA core?
process is mostly a (host) operating system term. CUDA doesn't define a process separately from the host operating system definition of it, AFAIK. CUDA threadblocks, once launched on a Streaming Multiprocessor (or SM, a hardware execution resource component inside a GPU), in many cases will stay on that SM for their "lifetime", and the SM includes an array of "CUDA cores" (a bit of a loose or conceptual term). However, there is at least one documented exception today to this in the case of CUDA Dynamic Parallelism, so in the most general sense, it is not possible to "lock" a CUDA thread of execution to a CUDA core (using core here to refer to that thread of execution forever remaining on a given warp lane within a SM).
Can you have that program running on each CUDA core that essentially runs forever
You can have a CUDA program that runs essentially forever. It is a recognized programming technique sometimes referred to as persistent threads. Such a program will naturally occupy/require one or more CUDA cores (again, using the term loosely). As already stated, that may or may not imply that the program permanently occupies a particular set of physical execution resources.
reading/writing from its own memory to system memory
Yes, that's possible, extending the train of thought. Writing to its own memory is obviously possible, by definition, and writing to system memory is possible via the zero-copy mechanism (slides 21/22), given a reasonable assumption of appropriate setup activity for this mechanism.
and disk?
No, that's not directly possible today, without host system interaction, and/or without a significant assumption of atypical external resources such as a disk controller of some sort connected via a GPUDirect interface (with a lot of additional assumptions and unspecified framework). The GPUDirect exception requires so much additional framework, that I would say, for typical usage, the answer is "no", not without host system activity/intervention. The host system (normally) owns the disk drive, not the GPU.
If this is possible, what are the implications for CPU performance - can we totally offset CPU usage or does the CPU still need to have some input/output?
In my opinion, the CPU must still be considered. One consideration is if you need to write to disk. Even if you don't, most programs derive I/O from somewhere (e.g. MPI) and so the implication of a larger framework of some sort is there. Secondly, and relatedly, the persistent threads programming model usually implies a producer/consumer relationship, and a work queue. The GPU is on the processing side (consumer side) of the work queue, but something else (usually) is on the producer side, typically the host system CPU. Again, it could be another GPU, either locally or via MPI, that is on the producer side of the work queue, but that still usually implies an ultimate producer somewhere else (i.e. the need for system I/O).
Additionally:
Can CUDA threads send packets over a network?
This is like the disk question. These questions could be viewed in a general way, in which case the answer might be "yes". But restricting ourselves to formal definitions of what a CUDA thread can do, I believe the answer is more reasonably "no". CUDA provides no direct definitions for I/O interfaces to disk or network (or many other things, such as a display!). It's reasonable to conjecture or presume the existence of a lightweight host process that simply copies packets between a CUDA GPU and a network interface. With this presumption, the answer might be "yes" (and similarly for disk I/O). But without this presumption (and/or a related, perhaps more involved presumption of a GPUDirect framework), I think the most reasonable answer is "no". According to the CUDA programming model, there is no definition of how to access a disk or network resource directly.

The actor model: Why is Erlang/OTP special? Could you use another language?

I've been looking into learning Erlang/OTP, and as a result, have been reading (okay, skimming) about the actor model.
From what I understand, the actor model is simply a set of functions (run within lightweight threads called "processes" in Erlang/OTP), which communicate with each other only via message passing.
This seems fairly trivial to implement in C++, or any other language:
    class BaseActor {
        std::queue<BaseMessage*> messages;
        CriticalSection messagecs;      // lock type assumed to exist elsewhere
        BaseMessage* Pop();
    public:
        void Push(BaseMessage* message)
        {
            auto scopedlock = messagecs.AquireScopedLock();
            messages.push(message);     // push onto the queue, not the lock
        }
        virtual void ActorFn() = 0;
        virtual ~BaseActor() {}
    };
With each of your processes being an instance of a derived BaseActor. Actors communicate with each other only via message-passing. (namely, pushing). Actors register themselves with a central map on initialization which allows other actors to find them, and allows a central function to run through them.
Now, I understand I'm missing, or rather, glossing over one important issue here, namely:
lack of yielding means a single Actor can unfairly consume excessive time. But are cross-platform coroutines the primary thing that makes this hard in C++? (Windows for instance has fibers.)
Is there anything else I'm missing, though, or is the model really this obvious?
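For reference, the design described in the question can be sketched with only the standard library. This is a minimal, hedged illustration and not a claim about how any real actor framework works: `Message`, `Actor`, `Collector`, `registry` and `RunOnce` are hypothetical names, and `std::mutex` stands in for the `CriticalSection` type. Each actor handles at most one message per pass, which gives fairness between messages but does nothing about a single handler that runs too long.

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

struct Message { std::string text; };

class Actor {
public:
    virtual ~Actor() = default;

    // Other actors communicate only by pushing messages into the inbox.
    void Push(Message m) {
        std::lock_guard<std::mutex> lock(mutex_);
        inbox_.push(std::move(m));
    }

    // Handle at most one pending message; returns false if the inbox was empty.
    bool Step() {
        Message m;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (inbox_.empty()) return false;
            m = std::move(inbox_.front());
            inbox_.pop();
        }
        Handle(m);  // run the handler outside the lock
        return true;
    }

protected:
    virtual void Handle(const Message& m) = 0;

private:
    std::mutex mutex_;
    std::queue<Message> inbox_;
};

// Example actor that just concatenates everything it receives.
class Collector : public Actor {
public:
    std::string seen;
protected:
    void Handle(const Message& m) override { seen += m.text; }
};

// Central map through which actors find each other by name.
std::map<std::string, std::shared_ptr<Actor>> registry;

// One round-robin pass: every registered actor handles at most one message.
// Returns the number of messages handled in this pass.
int RunOnce() {
    int handled = 0;
    for (auto& entry : registry)
        if (entry.second->Step()) ++handled;
    return handled;
}
```

A driver would register actors in `registry` and call `RunOnce()` in a loop. Nothing here addresses isolation, crash detection or distribution.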

The C++ code does not deal with fairness, isolation, fault detection or distribution which are all things which Erlang brings as part of its actor model.
No actor is allowed to starve any other actor (fairness)
If one actor crashes, it should only affect that actor (isolation)
If one actor crashes, other actors should be able to detect and react to that crash (fault detection)
Actors should be able to communicate over a network as if they were on the same machine (distribution)
Also the beam SMP emulator brings JIT scheduling of the actors, moving them to the core which is at the moment the one with least utilization and also hibernates the threads on certain cores if they are no longer needed.
In addition all the libraries and tools written in Erlang can assume that this is the way the world works and be designed accordingly.
These things are not impossible to do in C++, but they get increasingly hard if you add the fact that Erlang works on almost all of the major hardware and OS configurations.
edit: Just found a description by Ulf Wiger of what he sees Erlang-style concurrency as.

I don't like to quote myself, but from Virding's First Rule of Programming
Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.
With respect to Greenspun. Joe (Armstrong) has a similar rule.
The problem is not to implement actors, that's not that difficult. The problem is to get everything working together: processes, communication, garbage collection, language primitives, error handling, etc ... For example using OS threads scales badly so you need to do it yourself. It would be like trying to "sell" an OO language where you can only have 1k objects and they are heavy to create and use. From our point of view concurrency is the basic abstraction for structuring applications.
Getting carried away so I will stop here.

This is actually an excellent question, and has received excellent answers that perhaps are yet unconvincing.
To add shade and emphasis to the other great answers already here, consider what Erlang takes away (compared to traditional general purpose languages such as C/C++) in order to achieve fault-tolerance and uptime.
First, it takes away locks. Joe Armstrong's book lays out this thought experiment: suppose your process acquires a lock and then immediately crashes (a memory glitch causes the process to crash, or the power fails to part of the system). The next time a process waits for that same lock, the system has just deadlocked. This could be an obvious lock, as in the AquireScopedLock() call in the sample code; or it could be an implicit lock acquired on your behalf by a memory manager, say when calling malloc() or free().
In any case, your process crash has now halted the entire system from making progress. Fini. End of story. Your system is dead. Unless you can guarantee that every library you use in C/C++ never calls malloc and never acquires a lock, your system is not fault tolerant. Erlang systems can and do kill processes at will when under heavy load in order to make progress, so at scale your Erlang processes must be killable (at any single point of execution) in order to maintain throughput.
There is a partial workaround: using leases everywhere instead of locks, but you have no guarantee that all the libraries you utilize also do this. And the logic and reasoning about correctness gets really hairy quickly. Moreover leases recover slowly (after the timeout expires), so your entire system just got really slow in the face of failure.
Second, Erlang takes away static typing, which in turn enables hot code swapping and running two versions of the same code simultaneously. This means you can upgrade your code at runtime without stopping the system. This is how systems stay up for nine 9's or 32 msec of downtime/year. They are simply upgraded in place. Your C++ functions will have to be manually re-linked in order to be upgraded, and running two versions at the same time is not supported. Code upgrades require system downtime, and if you have a large cluster that cannot run more than one version of code at once, you'll need to take the entire cluster down at once. Ouch. And in the telecom world, not tolerable.
In addition, Erlang takes away shared memory and shared garbage collection; each lightweight process is garbage collected independently. This is a simple extension of the first point, but it emphasizes that for true fault tolerance you need processes that are not interlocked in terms of dependencies. It means your GC pauses, compared to Java, are tolerable for big systems (small, instead of pausing for half an hour for an 8GB GC to complete).

There are actual actor libraries for C++:
http://actor-framework.org/
http://www.theron-library.com/
And a list of some libraries for other languages.

It is a lot less about the actor model and a lot more about how hard it is to properly write something analogous to OTP in C++. Also, different operating systems provide radically different debugging and system tooling, and Erlang's VM and several language constructs support a uniform way of figuring out just what all those processes are up to which would be very hard to do in a uniform way (or maybe do at all) across several platforms. (It is important to remember that Erlang/OTP predates the current buzz over the term "actor model", so in some cases these sort of discussions are comparing apples and pterodactyls; great ideas are prone to independent invention.)
All this means that while you certainly can write an "actor model" suite of programs in another language (I know, I have done this for a long time in Python, C and Guile without realizing it before I encountered Erlang, including a form of monitors and links, and before I'd ever heard the term "actor model"), understanding the processes your code actually spawns and what is happening amongst them is extremely difficult. Erlang enforces rules that an OS simply can't without major kernel overhauls -- kernel overhauls that would probably not be beneficial overall. These rules manifest themselves as both general restrictions on the programmer (which can always be gotten around if you really need to) and basic promises the system guarantees for the programmer (which can be deliberately broken if you really need to also).
For example, it enforces that two processes cannot share state to protect you from side effects. This does not mean that every function must be "pure" in the sense that everything is referentially transparent (obviously not, though making as much of your program referentially transparent as practical is a clear design goal of most Erlang projects), but rather that two processes aren't constantly creating race conditions related to shared state or contention. (This is more what "side effects" means in the context of Erlang, by the way; knowing that may help you decipher some of the discussion questioning whether Erlang is "really functional or not" when compared with Haskell or toy "pure" languages.)
On the other hand, the Erlang runtime guarantees delivery of messages. This is something sorely missed in an environment where you must communicate purely over unmanaged ports, pipes, shared memory and common files which the OS kernel is the only one managing (and OS kernel management of these resources is necessarily extremely minimal compared to what the Erlang runtime provides). This doesn't mean that Erlang guarantees RPC (anyway, message passing is not RPC, nor is it method invocation!), it doesn't promise that your message is addressed correctly, and it doesn't promise that a process you're trying to send a message to exists or is alive, either. It just guarantees delivery if the thing you're sending to happens to be valid at that moment.
Built on this promise is the promise that monitors and links are accurate. And based on that the Erlang runtime makes the entire concept of "network cluster" sort of melt away once you grasp what is going on with the system (and how to use erl_connect...). This permits you to hop over a set of tricky concurrency cases already, which gives one a big head start on coding for the successful case instead of getting mired in the swamp of defensive techniques required for naked concurrent programming.
So it's not really about needing Erlang, the language; it's about the runtime and OTP already existing, being expressed in a rather clean way, and implementing anything close to it in another language being extremely hard. OTP is just a hard act to follow. In the same vein, we don't really need C++, either; we could just stick to raw binary input, Brainfuck and consider Assembler our high level language. We also don't need trains or ships, as we all know how to walk and swim.
All that said, the VM's bytecode is well documented, and a number of alternative languages have emerged that compile to it or work with the Erlang runtime. If we break the question into a language/syntax part ("Do I have to understand Moon Runes to do concurrency?") and a platform part ("Is OTP the most mature way to do concurrency, and will it guide me around the trickiest, most common pitfalls to be found in a concurrent, distributed environment?") then the answer is ("no", "yes").

Casablanca is another new kid on the actor model block. A typical asynchronous accept looks like this:
    PID replyTo;
    NameQuery request;
    accept_request().then([=](std::tuple<NameQuery,PID> request)
    {
        if (std::get<0>(request) == FirstName)
            std::get<1>(request).send("Niklas");
        else
            std::get<1>(request).send("Gustafsson");
    });
(Personally, I find that CAF does a better job at hiding the pattern matching behind a nice interface.)

What is the de facto standard for sharing variables between programs in different languages?

I've never had formal training in this area so I'm wondering what do they teach in school (if they do).
Say you have two programs written in two different languages, C++ and Python or some other combination, and you want to share a constantly updated variable on the same machine. What would you use and why? The information need not be secured but must be isochronous and should be reliable.
Eg. Program A will get a value from a hardware device and update variable X every 0.1ms, I'd like to be able to access this X from Program B as often as possible and obtain the latest values. Program A and B are written and compiled in two different (robust) languages. How do I access X from program B? Assume I have the source code from A and B and I do not want to completely rewrite or port either of them.
The methods I've seen used thus far include:
File Buffer - Read and write to a single file (e.g. C:\temp.txt).
Create a wrapper - From A to B or B to A.
Memory Buffer - Designate a specific memory address (mutex?).
UDP packets via sockets - Haven't tried it yet but looks good. Firewall?
Sorry for just throwing this out there, I don't know what the name of this technique is so I have trouble searching.

Well you can write XML and use some basic message queuing (like rabbitMQ) to pass messages around

Don't know if this will be helpful, but I'm also a student, and this is what I think you mean.
I've used marshalling to get a java class and import it into a C# program.
With marshalling you use xml to transfer code in a way so that it can be read by other coding environments.

When asking particular questions, you should aim at providing as much information as possible. You have added a use case, but the use case is incomplete.
Your particular use case seems like a very small amount of data that has to be available at a high frequency 10kHz. I would first try to determine whether I can actually make both pieces of code part of a single process, rather than two different processes. Depending on the languages (missing from the question) it might even be simple, or turn the impossible into possible --depending on the OS (missing from the question), the scheduler might not be fast enough switching from one process to another, and it might impact the availability of the latest read. Switching between threads is usually much faster.
If you cannot turn them into a single process, then you will have to use some sort of IPC (Inter Process Communication). Due to the frequency I would rule out most heavyweight protocols (avoid XML, CORBA) as the overhead will probably be too high. If the receiving end needs access only to the latest value, and that access may be less frequent than every 0.1 ms, then you don't want to use any protocol that includes queueing: you do not want to read the next element in the queue, you only care about the last one. If you did not read an element while it was fresh, avoid the cost of processing it once it is stale, i.e. it does not make sense to loop extracting from the queue and discarding.
I would be inclined to use shared memory, or a memory mapped shared file (they are probably quite similar, depending on the platform, which is missing from the question). Depending on the size of the element and the exact hardware architecture (missing from the question) you may be able to avoid locking with a mutex. As an example, in current Intel processors, read/write access to 32 bit integers in memory is guaranteed to be atomic if the variable is correctly aligned, so in that case you would not need to lock.
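The "latest value only, no queue, no mutex" idea can be sketched as follows. This is a hedged, in-process stand-in: two threads replace the two programs, and `latest_sample`, `publish`, `read_latest` and `run_demo` are hypothetical names. With two separate processes, the atomic would have to live in a shared memory region that both map, which is platform specific and not shown here.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Single slot that the writer keeps overwriting. The reader never sees a
// torn value and never has to drain stale entries.
std::atomic<std::int32_t> latest_sample{0};

// Writer side (program A in the question): publish each new reading.
void publish(std::int32_t value) {
    latest_sample.store(value, std::memory_order_release);
}

// Reader side (program B): fetch whatever is newest right now.
std::int32_t read_latest() {
    return latest_sample.load(std::memory_order_acquire);
}

// Demo: a writer thread publishes 1..n; once it has finished,
// the reader sees only the last value.
std::int32_t run_demo(std::int32_t n) {
    std::thread writer([n] {
        for (std::int32_t i = 1; i <= n; ++i) publish(i);
    });
    writer.join();
    return read_latest();
}
```

Using `std::atomic` instead of a plain aligned integer makes the atomicity assumption explicit to the compiler as well as to the hardware.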

At my school they teach CORBA. They shouldn't, it's an ancient hideous language from the eon of mainframes, it's a classic case of design-by-committee, every feature possible that you don't want is included, and some that you probably do (asynchronous calls?) aren't. If you think the c++ specification is big, think again.
Don't use it.
That said though, it does have a nice, easy-to-use interface for doing simple things.
But don't use it.

It almost always passes through a C binding.

What are the "things to know" when diving into multi-threaded programming in C++

I'm currently working on a wireless networking application in C++ and it's coming to a point where I'm going to want to multi-thread pieces of software under one process, rather than have them all in separate processes. Theoretically, I understand multi-threading, but I've yet to dive in practically.
What should every programmer know when writing multi-threaded code in C++?

I would focus on designing the thing to be as partitioned as possible, so that you have the minimal amount of shared things across threads. If you make sure you don't have statics and other resources shared among threads (other than those that you would be sharing if you designed this with processes instead of threads), you will be fine.
Therefore, while yes, you have to have in mind concepts like locks, semaphores, etc, the best way to tackle this is to try to avoid them.

I am no expert at all in this subject. Just some rules of thumb:
Design for simplicity; bugs are really hard to find in concurrent code, even in the simplest examples.
C++ offers you a very elegant paradigm to manage resources (mutex, semaphore, ...): RAII. I have observed that it is much easier to work with boost::thread than with POSIX threads.
Build your code to be thread-safe. If you don't, your program could behave strangely.
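The RAII point above can be illustrated with a small sketch. It uses `std::mutex` and `std::lock_guard` from the C++11 standard library rather than boost::thread, which is a substitution on my part; `Counter` and `count_with_threads` are hypothetical names.

```cpp
#include <mutex>
#include <thread>
#include <vector>

class Counter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);  // acquires the mutex
        ++value_;
    }                                              // released here, even if an exception is thrown

    long value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Several threads hammer the same counter; the guard keeps every
// increment exclusive without a single explicit unlock call.
long count_with_threads(int threads, int per_thread) {
    Counter counter;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) counter.increment();
        });
    for (auto& th : pool) th.join();
    return counter.value();
}
```

Because the unlock is tied to scope exit, early returns and exceptions cannot leave the mutex held.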

I am exactly in this situation: I wrote a library with a global lock (many threads, but only one running at a time in the library) and am refactoring it to support concurrency.
I have read books on the subject but what I learned stands in a few points:
think parallel: imagine a crowd passing through the code. What happens when a method is called while it is already in action?
think shared: imagine many people trying to read and alter shared resources at the same time.
design: avoid the problems that points 1 and 2 can raise.
never think you can ignore edge cases, they will bite you hard.
Since you cannot proof-test a concurrent design (because thread execution interleaving is not reproducible), you have to ensure that your design is robust by carefully analyzing the code paths and documenting how the code is supposed to be used.
Once you understand how and where you should bottleneck your code, you can read the documentation on the tools used for this job:
Mutex (exclusive access to a resource)
Scoped Locks (good pattern to lock/unlock a Mutex)
Semaphores (passing information between threads)
ReadWrite Mutex (many readers, exclusive access on write)
Signals (how to 'kill' a thread or send it an interrupt signal, how to catch these)
Parallel design patterns: boss/worker, producer/consumer, etc. (see Schmidt)
Platform-specific tools: OpenMP, C blocks, etc.
Good luck! Concurrency is fun, just take your time...

You should read about locks, mutexes, semaphores and condition variables.
One word of advice, if your app has any form of UI make sure you always change it from the UI thread. Most UI toolkits/frameworks will crash (or behave unexpectedly) if you access them from a background thread. Usually they provide some form of dispatching method to execute some function in the UI thread.

Never assume that external APIs are threadsafe. If it is not explicitly stated in their docs, do not call them concurrently from multiple threads. Instead, limit your use of them to a single thread or use a mutex to prevent concurrent calls (this is rather similar to the aforementioned GUI libraries).
Next point is language-related. Remember, before C++11 the C++ standard had no well-defined approach to threading. The compiler/optimizer does not know if code might be called concurrently. The volatile keyword prevents certain optimizations (e.g. caching of memory fields in CPU registers), but it is not a synchronization mechanism.
I'd recommend boost for synchronization primitives. Don't mess with platform APIs. They make your code difficult to port because they have similar functionality on all major platforms, but slightly different detail behaviour. Boost solves these problems by exposing only common functionality to the user.
Furthermore, if there's even the smallest chance that a data structure could be written to by two threads at the same time, use a synchronization primitive to protect it. Even if you think it will only happen once in a million years.

One thing I've found very useful is to make the application configurable with regard to the actual number of threads it uses for various tasks. For example, if you have multiple threads accessing a database, make the number of those threads be configurable via a command line parameter. This is extremely handy when debugging - you can exclude threading issues by setting the number to 1, or force them by setting it to a high number. It's also very handy when working out what the optimal number of threads is.
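A minimal sketch of that idea, assuming a hypothetical `--threads=N` command line option (the option name, `parse_thread_count` and `run_jobs` are made up for illustration):

```cpp
#include <atomic>
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

// Look for "--threads=N" among the arguments; use `fallback` when the
// option is missing or is not a positive number.
unsigned parse_thread_count(int argc, const char* const* argv, unsigned fallback) {
    const std::string prefix = "--threads=";
    for (int i = 1; i < argc; ++i) {
        const std::string arg = argv[i];
        if (arg.compare(0, prefix.size(), prefix) == 0) {
            const long n = std::strtol(arg.c_str() + prefix.size(), nullptr, 10);
            if (n > 0) return static_cast<unsigned>(n);
        }
    }
    return fallback;
}

// Spread `jobs` units of work over `threads` workers and report how many
// units were completed. With threads == 1 the run is effectively serial.
int run_jobs(unsigned threads, int jobs) {
    std::atomic<int> next{0};
    std::atomic<int> done{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            while (next.fetch_add(1) < jobs) done.fetch_add(1);
        });
    for (auto& w : workers) w.join();
    return done.load();
}
```

In `main`, something like `run_jobs(parse_thread_count(argc, argv, 4), jobs)` would let a bug be retried with `--threads=1` to rule threading in or out.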

Make sure you test your code in a single-cpu system and a multi-cpu system.
Based on the comments:-
Single socket, single core
Single socket, two cores
Single socket, more than two cores
Two sockets, single core each
Two sockets, combination of single, dual and multi core cpus
Multiple sockets, combination of single, dual and multi core cpus
The limiting factor here is going to be cost. Ideally, concentrate on the types of system your code is going to run on.

In addition to the other things mentioned, you should learn about asynchronous message queues. They can elegantly solve the problems of data sharing and event handling. This approach works well when you have concurrent state machines that need to communicate with each other.
I'm not aware of any message passing frameworks tailored to work only at the thread level. I've only seen home-brewed solutions. Please comment if you know of any existing ones.
EDIT:
One could use the lock-free queues from Intel's TBB, either as-is, or as the basis for a more general message-passing queue.
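For illustration, here is a rough sketch of a thread-level message queue. It is the home-brewed kind, built on a mutex and a condition variable, and is not the lock-free TBB queue mentioned above; `MessageQueue` and `sum_via_queue` are hypothetical names.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <typename T>
class MessageQueue {
public:
    // Called by the sending thread; never blocks for long.
    void send(T msg) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
        }
        ready_.notify_one();
    }

    // Called by the receiving thread; blocks until a message arrives.
    T receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T msg = std::move(queue_.front());
        queue_.pop();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
};

// Demo: the producer sends 1..n followed by 0 as a stop message;
// the consumer adds up everything it receives.
long sum_via_queue(int n) {
    MessageQueue<int> q;
    long total = 0;
    std::thread consumer([&] {
        for (int v = q.receive(); v != 0; v = q.receive()) total += v;
    });
    for (int i = 1; i <= n; ++i) q.send(i);
    q.send(0);
    consumer.join();
    return total;
}
```

The two threads share nothing except the queue, so all locking is confined to one small class.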

Since you are a beginner, start simple. First make it work correctly, then worry about optimizations. I've seen people try to optimize by increasing the concurrency of a particular section of code (often using dubious tricks), without ever looking to see if there was any contention in the first place.
Second, you want to be able to work at as high a level as you can. Don't work at the level of locks and mutexes if you can use an existing master-worker queue. Intel's TBB looks promising, being slightly higher level than pure threads.
Third, multi-threaded programming is hard. Reduce the areas of your code where you have to think about it as much as possible. If you can write a class such that objects of that class are only ever operated on in a single thread, and there is no static data, it greatly reduces the things that you have to worry about in the class.

A few of the answers have touched on this, but I wanted to emphasize one point:
If you can, make sure that as much of your data as possible is only accessible from one thread at a time. Message queues are a very useful construct to use for this.
I haven't had to write much heavily-threaded code in C++, but in general, the producer-consumer pattern can be very helpful in utilizing multiple threads efficiently, while avoiding the race conditions associated with concurrent access.
If you can use someone else's already-debugged code to handle thread interaction, you're in good shape. As a beginner, there is a temptation to do things in an ad-hoc fashion - to use a "volatile" variable to synchronize between two pieces of code, for example. Avoid that as much as possible. It's very difficult to write code that's bulletproof in the presence of contending threads, so find some code you can trust, and minimize your use of the low-level primitives as much as you can.

My top tips for threading newbies:
If you possibly can, use a task-based parallelism library, Intel's TBB being the most obvious one. This insulates you from the grungy, tricky details and is more efficient than anything you'll cobble together yourself. The main downside is this model doesn't support all uses of multithreading; it's great for exploiting multicores for compute power, less good if you wanted threads for waiting on blocking I/O.
Know how to abort threads (or in the case of TBB, how to make tasks complete early when you decide you didn't want the results after all). Newbies seem to be drawn to thread kill functions like moths to a flame. Don't do it... Herb Sutter has a great short article on this.
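As a rough illustration of stopping a thread cooperatively rather than killing it, the sketch below has the worker poll a `std::atomic<bool>` and exit on its own. `CancellableWorker` is a hypothetical class, and the 1 ms sleep is only a stand-in for a unit of real work. (C++20 adds `std::jthread` and `std::stop_token` for the same purpose; this version assumes only C++11.)

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical worker that runs until it is asked to stop.
class CancellableWorker {
public:
    ~CancellableWorker() { stop(); }

    void start() {
        assert(!thread_.joinable());  // start() must not be called twice
        stop_requested_.store(false);
        thread_ = std::thread([this] {
            while (!stop_requested_.load()) {
                // Stand-in for one unit of real work.
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
            // The thread reaches this point itself, so locks are released
            // and destructors run, which a forced kill cannot guarantee.
            finished_cleanly_.store(true);
        });
    }

    // Ask the worker to finish, then wait for it.
    void stop() {
        stop_requested_.store(true);
        if (thread_.joinable()) thread_.join();
    }

    bool finished_cleanly() const { return finished_cleanly_.load(); }

private:
    std::atomic<bool> stop_requested_{false};
    std::atomic<bool> finished_cleanly_{false};
    std::thread thread_;
};
```

The worker checks the flag between units of work, so the latency of a stop request is bounded by how long one unit takes.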

Make sure to explicitly know what objects are shared and how they are shared.
As much as possible make your functions purely functional. That is they have inputs and outputs and no side effects. This makes it much simpler to reason about your code. With a simpler program it isn't such a big deal but as the complexity rises it will become essential. Side effects are what lead to thread-safety issues.
Play devil's advocate with your code. Look at some code and think: how could I break this with some well-timed thread interleaving? At some point this case will happen.
First learn thread-safety. Once you get that nailed down then you move onto the hard part: Concurrent performance. This is where moving away from global locks is essential. Figuring out ways to minimize and remove locks while still maintaining the thread-safety is hard.
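A small sketch of the "purely functional" point, assuming C++11 and `std::async`: `sum_of_squares` reads its inputs and returns a value without touching shared state, so two calls can run at the same time with no locks. Both function names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Pure function: the result depends only on the arguments and nothing
// outside the function is modified.
long sum_of_squares(const std::vector<int>& values, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= values.size());
    long total = 0;
    for (std::size_t i = begin; i < end; ++i) {
        total += static_cast<long>(values[i]) * values[i];
    }
    return total;
}

long parallel_sum_of_squares(const std::vector<int>& values) {
    const std::size_t mid = values.size() / 2;
    // The vector is only read while the task runs, so no lock is needed.
    std::future<long> lower = std::async(std::launch::async, [&values, mid] {
        return sum_of_squares(values, 0, mid);
    });
    const long upper = sum_of_squares(values, mid, values.size());
    return lower.get() + upper;  // get() waits for the background task
}
```

The only synchronization point is `lower.get()`, which is where the two independent results are combined.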

Keep things dead simple as much as possible. It's better to have a simpler design (easier maintenance, fewer bugs) than a more complex solution that might have slightly better CPU utilization.
Avoid sharing state between threads as much as possible; this reduces the number of places that must use synchronization.
Avoid false-sharing at all costs (google this term).
Use a thread pool so you're not frequently creating/destroying threads (that's expensive and slow).
Consider using OpenMP. Intel and Microsoft (possibly others) support this extension to C++.
If you are doing number crunching, consider using Intel IPP, which internally uses optimized SIMD functions (this isn't really multi-threading, but it is parallelism of a related sort).
Have tons of fun.
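For the false-sharing item in the list above, here is a hedged sketch of the usual mitigation: give each thread's hot data its own cache line. The 64-byte line size is an assumption (common on x86-64, not universal), and `PaddedCounter` and `count_with_padded_counters` are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; check your target hardware.
constexpr std::size_t kAssumedCacheLine = 64;

// alignas forces each counter onto its own (assumed) cache line, so a
// write by one thread does not invalidate the line another thread is using.
struct alignas(kAssumedCacheLine) PaddedCounter {
    long value = 0;
};

long count_with_padded_counters(long increments_per_thread) {
    assert(increments_per_thread >= 0);
    constexpr int kThreads = 4;
    PaddedCounter counters[kThreads];  // one slot per thread, never shared

    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t) {
        threads.emplace_back([&counters, t, increments_per_thread] {
            for (long i = 0; i < increments_per_thread; ++i) ++counters[t].value;
        });
    }
    for (auto& th : threads) th.join();

    long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Without the `alignas`, the four `long` values would typically sit next to each other in one cache line, and the threads would contend for that line even though they never touch each other's data. Whether the difference is measurable depends on the compiler and hardware, so profile before and after.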

Stay away from MFC and its multithreading + messaging library.
In fact if you see MFC and threads coming toward you - run for the hills (*)
(*) Unless of course if MFC is coming FROM the hills - in which case run AWAY from the hills.

The biggest "mindset" difference between single-threaded and multi-threaded programming, in my opinion, is in testing/verification. In single-threaded programming, people will often bash out some half-thought-out code, run it, and if it seems to work, they'll call it good, and often get away with using it in a production environment.
In multithreaded programming, on the other hand, the program's behavior is non-deterministic, because the exact combination of timing of which threads are running for which periods of time (relative to each other) will be different every time the program runs. So just running a multithreaded program a few times (or even a few million times) and saying "it didn't crash for me, ship it!" is entirely inadequate.
Instead, when doing a multithreaded program, you always should be trying to prove (at least to your own satisfaction) that not only does the program work, but that there is no way it could possibly not work. This is much harder, because instead of verifying a single code-path, you are effectively trying to verify a near-infinite number of possible code-paths.
The only realistic way to do that without having your brain explode is to keep things as bone-headedly simple as you can possibly make them. If you can avoid using multithreading totally, do that. If you must do multithreading, share as little data between threads as possible, use proper multithreading primitives (e.g. mutexes, thread-safe message queues, wait conditions), and don't try to get away with half-measures (e.g. trying to synchronize access to a shared piece of data using only boolean flags will never work reliably, so don't try it).
What you want to avoid is the multithreading hell scenario: the multithreaded program that runs happily for weeks on end on your test machine, but crashes randomly, about once a year, at the customer's site. That kind of race-condition bug can be nearly impossible to reproduce, and the only way to avoid it is to design your code extremely carefully to guarantee it can't happen.
Threads are strong juju. Use them sparingly.
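As a small example of using a proper primitive instead of a half-measure, the sketch below guards a shared counter with a `std::mutex`. `SharedCounter` and `hammer_counter` are hypothetical names; the point is only that every access to the shared value goes through the lock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counter whose state is only ever touched under the mutex.
class SharedCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++value_;
    }
    long value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;  // mutable so value() can lock in a const method
    long value_ = 0;
};

// Several threads increment the same counter; with the mutex in place
// no increment can be lost.
long hammer_counter(int thread_count, int increments_per_thread) {
    assert(thread_count >= 0 && increments_per_thread >= 0);
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) counter.increment();
        });
    }
    for (auto& th : threads) th.join();
    return counter.value();
}
```

If `value_` were a plain `long` incremented without the lock, the program would contain a data race, and lost updates could show up only occasionally, which is exactly the once-a-year failure described above.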

You should have an understanding of basic systems programming, in particular:
Synchronous vs Asynchronous I/O (blocking vs. non-blocking)
Synchronization mechanisms, such as lock and mutex constructs
Thread management on your target platform

I found viewing the introductory lectures on OS and systems programming here by John Kubiatowicz at Berkeley useful.

Part of my graduate study area relates to parallelism.
I read this book and found it a good summary of approaches at the design level.
At the basic technical level, you have two basic options: threads or message passing. Threaded applications are the easiest to get off the ground, since pthreads, Windows threads or Boost threads are ready to go. However, threads bring with them the complexity of shared memory.
Message-passing usability seems mostly limited at this point to the MPI API. It sets up an environment where you can run jobs and partition your program between processors. It's more for supercomputer/cluster environments where there's no intrinsic shared memory. You can achieve similar results with sockets and so forth.
At another level, you can use language type pragmas: the popular one today is OpenMP. I've not used it, but it appears to build threads in via preprocessing or a link-time library.
The classic problem here is synchronization; all the problems in multiprogramming come from the non-deterministic nature of multiprograms, which cannot be avoided.
See the Lamport timing methods for a further discussion of synchronizations and timing.
Multithreading is not something that only Ph.D.s and gurus can do, but you will have to be pretty decent to do it without making insane bugs.

I'm in the same boat as you, I am just starting multi threading for the first time as part of a project and I've been looking around the net for resources. I found this blog to be very informative. Part 1 is pthreads, but I linked starting on the boost section.

I have written a multithreaded server application and a multithreaded shellsort. They were both written in C and use NT's threading functions "raw" that is without any function library in-between to muddle things. They were two quite different experiences with different conclusions to be drawn. High performance and high reliability were the main priorities although coding practices had a higher priority if one of the first two was judged to be threatened in the long term.
The server application had both a server and a client part and used I/O completion ports (IOCPs) to manage requests and responses. When using IOCPs it is important never to use more threads than you have cores. I also found that requests to the server part needed a higher priority so as not to lose any requests unnecessarily. Once they were "safe" I could use lower-priority threads to create the server responses. I judged that the client part could have an even lower priority. I asked the questions "what data can't I lose?" and "what data can I allow to fail because I can always retry?" I also needed to be able to interface to the application's settings through a window, and it had to be responsive. The trick was that the UI had normal priority, the incoming requests one less, and so on. My reasoning was that since I use the UI so seldom, it can have the highest priority, so that when I do use it, it responds immediately. Threading here turned out to mean that all separate parts of the program would normally be running simultaneously, but when the system was under higher load, processing power would be shifted to the vital parts by the prioritization scheme.
I've always liked shellsort, so please spare me pointers about quicksort this or that or blablabla, or about how shellsort is ill-suited for multithreading. Having said that, the problem I had was sorting a semi-large list of units in memory (for my tests I used a reverse-sorted list of one million units of forty bytes each). Using a single-threaded shellsort I could sort them at a rate of roughly one unit every two microseconds. My first attempt to multithread was with two threads (though I soon realized that I wanted to be able to specify the number of threads), and it ran at about one unit every 3.5 microseconds, that is to say SLOWER. Using a profiler helped a lot, and one bottleneck turned out to be the statistics logging (i.e. compares and swaps), where the threads would bump into each other. Dividing up the data between the threads in an efficient way turned out to be the biggest challenge, and there is definitely more I can do there, such as dividing the vector containing the indices to the units into cache-line-sized chunks, and perhaps also comparing all indices in two cache lines before moving to the next line (at least I think there is something I can do there - the algorithms get pretty complicated). In the end, I achieved a rate of one unit every microsecond with three simultaneous threads (four threads were about the same; I only had four cores available).
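The statistics bottleneck described above has a common remedy that can be sketched independently of the sort itself: give each thread a contiguous chunk of the data and a private tally, and combine the tallies only after the threads have finished. The code below is not the shellsort from this answer; `count_greater_than` is a hypothetical stand-in that just counts elements, written to show the chunking and the per-thread statistics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Counts the elements greater than `threshold`, splitting `data` into one
// contiguous chunk per thread.
std::size_t count_greater_than(const std::vector<int>& data, int threshold,
                               std::size_t thread_count) {
    assert(thread_count > 0);
    std::vector<std::size_t> partial(thread_count, 0);
    std::vector<std::thread> threads;
    // Round up so that every element falls into some chunk.
    const std::size_t chunk = (data.size() + thread_count - 1) / thread_count;

    for (std::size_t t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        threads.emplace_back([&data, &partial, threshold, begin, end, t] {
            std::size_t local = 0;  // thread-private tally: no contention in the hot loop
            for (std::size_t i = begin; i < end; ++i) {
                if (data[i] > threshold) ++local;
            }
            partial[t] = local;  // a single shared write per thread
        });
    }
    for (auto& th : threads) th.join();

    std::size_t total = 0;
    for (std::size_t p : partial) total += p;
    return total;
}
```

Each thread writes to shared memory exactly once, so the counters no longer cause the threads to bump into each other while the real work is running.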
As to the original question my advice to you would be
If you have the time, learn the threading mechanism at the lowest possible level.
If performance is important learn the related mechanisms that the OS provides. Multi-threading by itself is seldom enough to achieve an application's full potential.
Use profiling to understand the quirks of multiple threads working on the same memory.
Sloppy architectural work will kill any app, regardless of how many cores and systems you have executing it and regardless of the brilliance of your programmers.
Sloppy programming will kill any app, regardless of the brilliance of the architectural foundation.
Understand that using libraries lets you reach the development goal faster, but at the price of less understanding and (usually) lower performance.

Before giving any advice on the dos and don'ts of multi-threaded programming in C++, I would like to ask: is there any particular reason you want to write the application in C++?
There are other programming paradigms where you utilize multiple cores without getting into multi-threaded programming. One such paradigm is functional programming. Write each piece of your code as functions without any side effects. Then it is easy to run it in multiple threads without worrying about synchronization.
I am using Erlang for my development. It has increased my productivity by at least 50%. The code may not run as fast as code written in C++, but I have noticed that for most back-end offline data processing, speed is not as important as distributing the work and utilizing the hardware as much as possible. Erlang provides a simple concurrency model where you can execute a single function in multiple threads without worrying about synchronization issues. Writing multi-threaded code is easy, but debugging it is time-consuming. I have done multi-threaded programming in C++, but I am currently happy with Erlang's concurrency model. It is worth looking into.

Make sure you know what volatile means and its uses (which may not be obvious at first).
Also, when designing multithreaded code, it helps to imagine that an infinite number of processors is executing every single line of code in your application at once (er, every single line of code that is reachable according to the logic in your code). Imagine, too, that the compiler applies a special optimization to everything that isn't marked volatile, so that only the thread that changed it can read or set its true value and all the other threads get garbage.
