I am planning to enhance my knowledge about parallel and concurrent programming.
Can somebody help me find some online learning resources?
Thanks,
If you're using a POSIX-based system (Linux, FreeBSD, Mac OS X, etc.), you'll want to check out pthreads (link to tutorial). Pthreads have been around for a long time and are the de facto standard for concurrent programming on POSIX-based platforms.
There is a newcomer, though, known as Grand Central Dispatch (link to tutorial). The technology was developed by Apple (in Snow Leopard) in an attempt to solve some of the tedious problems associated with pthreads and multithreaded programming in general. Concretely:
Blocks (anonymous functions) are introduced to the C language (and by extension, C++ and Objective-C). This allows you to completely avoid using context structs. Using pthreads, you might write something like this:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct { int val1; int val2; } context;

void *myFunct(void *arg){
    context *c = arg;
    printf("Contrived example %d %d\n", c->val1, c->val2);
    return NULL;
}

int main(){
    int firstval = 5;
    int secondval = 2;
    context *c = malloc(sizeof(context));
    c->val1 = firstval;
    c->val2 = secondval;
    pthread_t thread;
    pthread_create(&thread, NULL, myFunct, c);
    pthread_join(thread, NULL);
    free(c);
}
That involved a lot of work: we had to create the context, set up the values, and make sure our function handled receiving the context correctly. Not so with GCD. We can instead write the following code:
#include <stdio.h>
#include <dispatch/dispatch.h>

int main(){
    int firstval = 5;
    int secondval = 2;
    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_async(queue, ^{
        printf("Contrived example %d %d\n", firstval, secondval);
    });
    dispatch_main(); // keep the process alive so the block gets a chance to run
}
Notice how much simpler that was! No contexts, not even a separate function.
GCD lets the kernel manage the thread count. Each thread on your system consumes some kernel resources. On a laptop, excess threads translate into reduced battery life; on any computer, excess threads translate into reduced performance. What does "excess" mean? Spawning hundreds of threads on a 2-core machine. When using pthreads, you have to manage the thread count explicitly, making sure that you aren't overloading the system, which is very hard to do. With GCD, you simply tell the kernel "execute this block of work when you have the chance"; the kernel decides when it has enough free resources to run the code, so you don't have to worry about it.
In addition to providing great basic multithreading support, GCD also allows your program to interact with "sources" via blocks. So, you can enqueue a file descriptor and tell GCD "run this block of code when there's new data to be read." And so the kernel will let your program sit idle until a significant enough amount of data comes in, and then enqueue your block automatically!
And I've only scratched the surface of what GCD can do. It's a truly amazing technology, and I highly recommend you check out the docs. It's currently available on Mac OS X and FreeBSD, and it's open source - so if you want it to run on Linux, you can port it :).
If you're looking for raw power for data-parallel applications, Apple developed another great technology (also for Snow Leopard) called OpenCL, which allows you to harness the power of the GPU in a very simple C-like language (it's almost exactly C, with a few caveats). I haven't had too much experience with it, but from everything I've heard, it's easy to use and very powerful. OpenCL is an open standard, with implementations on Mac OS X and Windows.
So, to sum it up: pthreads for all POSIX-based systems (ugly, but the de facto standard), GCD for Mac OS X and FreeBSD, and OpenCL for data-parallel applications where you need all the power you can get!
Have a look at 'Java Concurrency in Practice' by Brian Goetz. A standard book.
Herb Sutter writes a number of good articles about this subject. His site may be a good place to start.
It's been a couple of decades since I've done any programming. As a matter of fact, the last time I programmed was in an MS-DOS environment before Windows came out. I've had this programming idea that I have wanted to try for a few years now, and I thought I would give it a try. The amount of calculations are enormous. Consequently I want to run it in the fastest environment available to a general hobby programmer.
I'll be using a 64-bit machine. Currently it is running Windows 7. Years ago a program ran much slower in the Windows environment than in MS-DOS mode. My personal programming experience has been in Fortran, Pascal, Basic, and machine language for the 6800 Motorola series processors. I'm basically willing to try anything. I've fooled around with Ubuntu also. No objections to learning something new; I just want to take advantage of speed. I'd prefer to spend no money on this project, so I'm looking for a free or very-close-to-free compiler. I've downloaded Microsoft Visual Studio C++ Express. But I've got a feeling that the compiled code will have to be run in the Windows environment, which I'm sure slows the processing speed considerably.
So I'm looking for ideas or pointers to what is available.
Thank you,
Have a Great Day!
Jim
Speed generally comes with the price of either portability or complexity.
If your programming idea involves lots of computation and you're using an Intel CPU, you might want to use Intel's compiler, which may benefit from processor features that make your program faster. Otherwise, if portability is your goal, use GCC (the GNU Compiler Collection), which can cross-compile well-optimized executables for practically any platform available on earth. If your computation is parallelizable, you might want to look at SIMD (Single Instruction, Multiple Data) and GPGPU/CUDA/OpenCL (using the graphics card for computation) techniques.
However, I'd recommend you just try your idea in a simpler language first, e.g. Python, Java, C#, or Basic, and see if the speed is good enough. Since you haven't programmed in decades, it's likely that what you remember as an enormous computation is now minuscule, given the increase in processor speed and RAM. Nowadays, there is not much noticeable difference between running in a GUI environment and in a command-line environment.
There is no substantial performance penalty to operating under Windows, and a large number of extremely high performance applications do so. With new compiler advances and new optimization techniques, Windows is no longer the up-and-coming, new, poorly optimized technology it was twenty years ago.
The simple fact is that if you haven't programmed for 20 years, then you won't have any realistic performance picture at all. You should do what most people do: start with an easy-to-learn (if not very fast) programming language like C#, create the program, prove that it runs too slowly, then make several optimization passes with tools such as profilers; only then may you decide that the language is too slow. If you haven't written a line of code in two decades, the overwhelming probability is that any program you write will be slow because you're a novice programmer by modern standards, not because of your choice of language or environment. Creating very high performance applications requires a detailed understanding of the target platform, the language of choice, AND the operations of the program.
I'd definitely recommend Visual C++. The Express Edition is free and Visual Studio 2010 can produce some unreasonably fast code. Windows is not a slow platform - even if you handwrote your own OS, it'd probably be slower, and even if you produced one that was faster, the performance gain would be negligible unless your program takes days or weeks to complete a single execution.
The OS does not make your program magically run slower. True, the OS does eat a few clock cycles here and there, but it's really not enough to be at all noticeable (and it does so in order to provide you with services you most likely need, and would otherwise have to re-implement yourself).
Windows doesn't, as some people seem to believe, eat 50% of your CPU. It might eat 0.5%, but so do Linux and OS X. And if you were to ditch all existing OSes and instead write your own from scratch, you'd end up with a buggy, less capable OS which also eats a bit of CPU time.
So really, the environment doesn't matter.
What does matter is what hardware you run the program on (and here, running it on the GPU might be worth considering) and how well you utilize the hardware (concurrency is pretty much a must if you want to exploit modern hardware).
What code you write, and how you compile it does make a difference. The hardware you're running on makes a difference. The choice of OS does not.
A digression: the claim that the OS doesn't matter for performance is, in general, obviously false. Citing CPU utilization when idle seems a quite "peculiar" argument to me: of course one hopes that when no jobs are running the OS is not wasting energy. What should be measured instead is the speed/throughput of an OS when it is providing a service (i.e. mediating access to hardware/resources).
To avoid an annoying MS Windows vs Linux vs Mac OS X battle, I will refer to a research OS concept: exokernels. The point of exokernels is that a traditional OS is not just a mediator for resource access: it implements policies. Such policies do not always favor the performance of your application-specific access pattern for a resource. With the exokernel concept, researchers proposed to "exterminate all operating system abstractions" (.pdf) while retaining its multiplexer role. In this way:
… The results show that common unmodified UNIX applications can enjoy the benefits of exokernels: applications either perform comparably on Xok/ExOS and the BSD UNIXes, or perform significantly better. In addition, the results show that customized applications can benefit substantially from control over their resources (e.g., a factor of eight for a Web server). …
So bypassing the usual OS access policies they gained, for a customized web server, an increase of about 800% in performance.
Returning to the original question: it's generally true that an application is executed with no or negligible OS overhead when:
it has a compute-intensive kernel, where such kernel does not call the OS API;
memory is enough or data is accessed in a way that does not cause excessive paging;
all inessential services running on the same systems are switched off.
There are possibly other factors, depending on the hardware/OS/application.
I assume that the OP is correct in his rough estimate of the computing power required. The OP does not specify the nature of such intensive computation, so it's difficult to give suggestions. But he wrote:
The amount of calculations are enormous
"Calculations" seems to allude to compute-intensive kernels, for which I think is required a compiled language or a fast interpreted language with native array operators, like APL, or modern variant such as J, A+ or K (potentially, at least: I do not know if they are taking advantage of modern hardware).
Anyway, the first advice is to spend some time researching fast algorithms for your specific problem (but when comparing algorithms, remember that asymptotic notation disregards constant factors that are sometimes not negligible).
For the sequential part of your program a good utilization of CPU caches is crucial for speed. Look into cache conscious algorithms and data structures.
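As a tiny illustration of why cache-conscious traversal matters, the two functions below (names are mine, for illustration) compute the same sum, but the row-major one walks memory sequentially and typically runs several times faster on large arrays:

```cpp
#include <cassert>
#include <vector>

// Sum a matrix stored as rows of contiguous ints.
// Row-major traversal uses each fetched cache line fully before moving on;
// column-major traversal jumps between rows and wastes most of each line.
long long sum_row_major(const std::vector<std::vector<int>>& m) {
    long long s = 0;
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < m[i].size(); ++j)
            s += m[i][j];
    return s;
}

long long sum_col_major(const std::vector<std::vector<int>>& m) {
    long long s = 0;
    for (std::size_t j = 0; j < m[0].size(); ++j)
        for (std::size_t i = 0; i < m.size(); ++i)
            s += m[i][j];
    return s;
}
```

Time both over a matrix of a few thousand rows and columns to see the gap on your own hardware.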
For the parallel part, if such program is amenable to parallelization (remember both Amdahl's law and Gustafson's law), there are different kinds of parallelism to consider (they are not mutually exclusive):
Instruction-level parallelism: it is taken care of by the hardware/compiler;
data parallelism:
bit-level: sometimes the acronym SWAR (SIMD Within A Register) is used for this kind of parallelism. It applies to problems (or parts of them) for which a data representation can be formulated that maps onto bit vectors (each value represented by 1 or more bits), so that each instruction in the instruction set is potentially a parallel instruction operating on multiple data items (SIMD). Especially interesting on a machine with 64-bit (or larger) registers. Possible on CPUs and some GPUs. No compiler support required;
fine-grain medium parallelism: ~10 operations in parallel on x86 CPUs with SIMD instruction set extensions like SSE, successors, predecessors and similar; compiler support required;
fine-grain massive parallelism: hundreds of operations in parallel on GPGPUs (using common graphic cards for general-purpose computations), programmed with OpenCL (open standard), CUDA (NVIDIA), DirectCompute (Microsoft), BrookGPU (Stanford University) and Intel Array Building Blocks. Compiler support or use of a dedicated API is required. Note that some of these have back-ends for SSE instructions also;
coarse-grain modest parallelism (at the level of threads, not single instructions): it's not unusual for CPUs on current desktops/laptops to have more than one core (2/4) sharing the same memory pool (shared memory). The standard for shared-memory parallel programming is the OpenMP API, where, for example in C/C++, #pragma directives are used around loops. If I am not mistaken, this can be considered data parallelism emulated on top of task parallelism;
task parallelism: each core in one (or multiple) CPU(s) has its independent flow of execution and possibly operates on different data. Here one can use the concept of "thread" directly or a more high-level programming model which masks threads.
I will not go into details of these programming models here because apparently it is not what the OP needs.
I think this is enough for the OP to evaluate by himself how various languages and their compilers/run-times / interpreters / libraries support these forms of parallelism.
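To make the bit-level (SWAR) idea above concrete, here is the classic branch-free population count: every line operates on many "lanes" packed into one 64-bit register at once, so a single ordinary instruction does the work of dozens:

```cpp
#include <cassert>
#include <cstdint>

// Classic SWAR popcount: treats one 64-bit register as 32 2-bit lanes,
// then 16 4-bit lanes, then 8 byte lanes, summing within lanes in parallel.
uint64_t popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           // 2-bit sums
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // byte sums
    return (x * 0x0101010101010101ULL) >> 56;                             // add the bytes
}
```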
Just my two cents about DOS vs. Windows.
Years ago (something like 1998?), I had the same assumption.
I have some program written in QBasic (this was before I discovered C), which did intense calculations (neural network back-propagation). And it took time.
A friend offered to rewrite the thing in Visual Basic. I objected, because, you know, all those gizmos, widgets and fancy windows, you know, would slow down the execution of, you know, the important code.
The Visual Basic version so much outperformed the QBasic one that it became the default application (I won't mention the "hey, even in Excel's VBA, you are outperformed" because of my wounded pride, but...).
The point here, is the "you know" part.
You don't know.
The OS here is not important. As others explained in their answers, choose your hardware and your language, and write your code in a clear way, because nowadays compilers are better at optimizing code than developers are, unless you're John Carmack (premature optimization is the root of all evil).
Then, if you're not happy with the result, use a profiler to test your code. Consider multithreading (which will help you if you have multiple cores... TBB comes to mind).
What are you trying to do? I believe everything should be compiled in 64-bit mode by default. Computers have gotten a lot faster. Speed should not be a problem for the most part.
Side note: for computation-intensive stuff you may want to look into OpenCL or CUDA. OpenCL and CUDA take advantage of the GPU, which can process far more data at a time than the CPU.
If your last points of reference are M68K and PCs running DOS then I'd suggest that you start with C/C++ on a modern processor and OS. If you run into performance problems and can prove that they are caused by running on Linux / Windows or that the compiler / optimizer generated code isn't sufficient, then you could look at other OSes and/or hand coded ASM. If you're looking for free, Linux / gcc is a good place to start.
I am the original poster of this thread.
I am once again emphasizing that this program will perform an enormous number of calculations.
Windows and Ubuntu are multi-tasking environments. There are processes running, and many of them are using processor resources. True, many of them are seen as inactive, but the Windows environment, by the nature of multi-tasking, is constantly monitoring whether each process needs to run. For example, currently there are 62 processes showing in the Windows Task Manager. According to the Task Manager, three are consuming CPU resources, and the other 59 show as active but consuming no CPU. So that is 62 processes being monitored by Windows, and then there is Windows itself, which is also checking on various things.
I was hoping to find some way to just be able to run a program at the bare machine level, sidestepping all the Windows (Ubuntu) involvement.
The idea is very calculation intensive.
Thank you all for taking the time to respond.
Have a Great Day,
Jim
Having programmed microcontrollers before and being interested in trying my hand at building an NES emulator at some point, I was really wondering how interrupts are implemented in C++?
How, for example, does a program know how to react when I speak into my mic or move my mouse? Is it constantly polling these ports?
When emulating an interrupt for a hardware device (say, for an NES emulator), do you have to constantly poll or is there another way to do this?
This is an implementation-specific question, but broad strokes: direct access to hardware via interrupts is generally limited to the OS (specifically, to the kernel.) Your code will not have such access on any modern system. Rather, you'd use OS-specific APIs to access hardware.
In short: desktop operating systems do not behave like embedded, microcontrolled devices.
User input is generally handled on modern systems by something called an event loop. Your application does something like this, the specific implementation varying between OSes:
int main(int argc, char *argv[]) {
    Event *e = NULL;
    while (e = App::GetNextEvent()) {
        switch (e->getType()) {
            case E_MOUSEUP:
            case E_RIGHTMOUSEDOWN:
            case E_KEYDOWN:
                // etc.
        }
    }
    return EXIT_SUCCESS;
}
In this example, App::GetNextEvent() doesn't busy-wait: it simply sits and does nothing until signalled internally by the OS that an event has arrived. In this way, no constant polling is required and the OS can dole out time slices to processes more efficiently.
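A toy version of such a blocking GetNextEvent can be sketched with a mutex and a condition variable (the class and method names here are mine; real operating systems do this inside the kernel):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Minimal blocking event queue: Get() sleeps (no busy-waiting) until
// another thread calls Post() and signals the condition variable.
class EventQueue {
public:
    void Post(int event) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            events_.push(event);
        }
        cv_.notify_one();  // wake a thread waiting in Get()
    }
    int Get() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !events_.empty(); });  // sleeps here
        int event = events_.front();
        events_.pop();
        return event;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<int> events_;
};
```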
However... the NES is not a modern system; emulating one means you need to emulate the interrupts and hardware of the NES. Doing so is a very large undertaking and has no relation to the underlying hardware or interrupts of the host operating system. The 6502 processor in the NES is very simple by modern standards; there are books available that discuss emulating similar simple processors, though titles escape me at the moment.
Does that help?
I am building an application that will do some object tracking from a video camera feed and use information from that to run a particle system in OpenGL. The code to process the video feed is somewhat slow, 200 - 300 milliseconds per frame right now. The system that this will be running on has a dual core processor. To maximize performance I want to offload the camera processing stuff to one processor and just communicate relevant data back to the main application as it is available, while leaving the main application kicking on the other processor.
What do I need to do to offload the camera work to the other processor and how do I handle communication with the main application?
Edit:
I am running Windows 7 64-bit.
Basically, you need to multithread your application. Each thread of execution can only saturate one core. Separate threads tend to be run on separate cores. If you are insistent that each thread ALWAYS execute on a specific core, then each operating system has its own way of specifying this (affinity masks & such)... but I wouldn't recommend it.
OpenMP is great, but it's a tad fat in the ass, especially when joining back up from a parallelization. YMMV. It's easy to use, but not at all the best performing option. It also requires compiler support.
If you're on Mac OS X 10.6 (Snow Leopard), you can use Grand Central Dispatch. It's interesting to read about, even if you don't use it, as its design implements some best practices. It also isn't optimal, but it's better than OpenMP, even though it also requires compiler support.
If you can wrap your head around breaking up your application into "tasks" or "jobs," you can shove these jobs down as many pipes as you have cores. Think of batching your processing as atomic units of work. If you can segment it properly, you can run your camera processing on both cores, and your main thread at the same time.
If communication is minimized for each unit of work, then your need for mutexes and other locking primitives will be minimized. Coarse-grained threading is much easier than fine-grained. And you can always use a library or framework to ease the burden. Consider Boost's Thread library if you take the manual approach. It provides portable wrappers and a nice abstraction.
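For the camera case specifically, the "communicate relevant data back as it is available" part can be as simple as a one-slot mailbox that the worker overwrites and the render loop polls without blocking. A sketch using standard C++ primitives (all names are mine):

```cpp
#include <mutex>

// One-slot mailbox: the camera thread publishes its latest result and the
// render thread takes it when convenient; neither waits for the other
// beyond a brief lock. Stale results are simply overwritten.
class LatestResult {
public:
    void publish(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = value;
        fresh_ = true;
    }
    // Returns true and fills `out` only if a new result has arrived since
    // the last take; the render loop can call this once per frame.
    bool try_take(int& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!fresh_) return false;
        out = value_;
        fresh_ = false;
        return true;
    }
private:
    std::mutex mutex_;
    int value_ = 0;
    bool fresh_ = false;
};
```

In your application the `int` payload would be whatever tracking data the camera thread produces.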
It depends on how many cores you have. If you have only 2 cores (CPUs, processors, hyperthreads, you know what I mean), then OpenMP cannot give a tremendous increase in performance, but it will help. The maximum gain you can have is dividing your time by the number of processors, so it will still take 100 - 150 ms per frame.
The equation is
parallel time = (([total time to perform a task] - [code that cannot be parallelized]) / [number of cpus]) + [code that cannot be parallelized]
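Plugging in numbers (hypothetically assuming a 300 ms frame of which 60 ms cannot be parallelized): two cores give (300 - 60) / 2 + 60 = 180 ms, not the 150 ms a naive halving would suggest. As a one-liner:

```cpp
#include <cassert>

// Amdahl's law as stated above, with times in milliseconds.
double parallel_time(double total_ms, double serial_ms, int cpus) {
    return (total_ms - serial_ms) / cpus + serial_ms;
}
```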
Basically, OpenMP rocks at parallel loop processing. It's rather easy to use:
#pragma omp parallel for
for (i = 0; i < N; i++)
    a[i] = 2 * i;
and bang, your for loop is parallelized. It does not work for every case; not every algorithm can be parallelized this way, but many can be rewritten (hacked) to be compatible. The key principle is applying the same instructions to multiple data items (in the SIMD spirit): applying the same convolution code to multiple pixels, for example.
But simply applying this cookbook recipe goes against the rules of optimization:
1. Benchmark your code.
2. Find the REAL bottlenecks with "scientific" evidence (numbers) instead of simply guessing where you think there is a bottleneck.
3. If it is really processing loops, then OpenMP is for you.
Maybe simple optimizations on your existing code can give better results, who knows?
Another road would be to run OpenGL in one thread and data processing in another. This will help a lot if OpenGL or your particle rendering system takes a lot of power, but remember that threading can lead to other kinds of synchronization bottlenecks.
I would recommend against OpenMP; OpenMP is aimed more at numerical codes than at the producer/consumer model that you seem to have.
I think you can do something simple using Boost threads to spawn a worker thread, a common segment of memory (for communicating the acquired data), and some notification mechanism to signal when your data is available (look into Boost thread interrupts).
I do not know what kind of processing you do, but you may want to take a look at Intel Threading Building Blocks and Intel Integrated Performance Primitives; they have several functions for video processing which may be faster (assuming they have your functionality).
You need some kind of framework for handling multicores. OpenMP seems a fairly simple choice.
Like what Pestilence said, you just need your app to be multithreaded. Lots of frameworks like OpenMP have been mentioned, so here's another one:
Intel Threading Building Blocks
I've never used it before, but I hear great things about it.
Hope this helps!
As part of a project at work, I implemented a Read/Write lock class in C++. Before pushing my code to production, what sort of tests should I run on my class to be sure it will function correctly?
I've obviously performed some sanity tests on my class (making sure only one writer can access at a time, making sure releases and claims increment and decrement properly, etc.)
I'm looking for tests that will guarantee the stability of my class and to prevent edge cases. It seems testing multi-threaded code is much harder than standard code.
It is very difficult to test multi-threaded code, so you should supplement your tests with a detailed code review by colleagues experienced in writing multi-threaded applications.
Make sure you try your stress test on a machine that truly has multiple CPUs. That will usually uncover more multithreading problems than anything run on a single-CPU machine.
Then test it on machines that are 64-bit, have faster CPUs, have more CPUs, etc.
And as #onebyone.livejournal.com says, use a machine with non-coherent memory caches; although, according to the NUMA article on Wikipedia, that may be difficult to find.
Certainly using the code on as many different machines as possible can't hurt, and is also a good way to uncover issues.
I guess that you can start by looking at tests included in well-established code. For example the pthreads implementation of GNU libc (nptl) includes read-write locks and some tests.
$ ls nptl/tst-rwlock*
nptl/tst-rwlock1.c
nptl/tst-rwlock10.c
nptl/tst-rwlock11.c
nptl/tst-rwlock12.c
nptl/tst-rwlock13.c
nptl/tst-rwlock14.c
nptl/tst-rwlock2.c
nptl/tst-rwlock3.c
nptl/tst-rwlock4.c
nptl/tst-rwlock5.c
nptl/tst-rwlock6.c
nptl/tst-rwlock7.c
nptl/tst-rwlock8.c
nptl/tst-rwlock9.c
If you are able to use Boost in your work code at all, you should use the shared_mutex class, which implements read/write locking.
Even if it doesn't 100% suit your needs, you should use the ideas in the code for your code, and, if the Boost code has tests for the shared_mutex (I haven't checked), you should add them to the tests you have.
Generally I would advise against implementing your own locks, unless you have proven that an existing, stable implementation doesn't meet your performance needs.
Testing and building synchronization primitives can be tricky and non-intuitive.
The guidance to use boost::shared_mutex is quite wise, if you're on the Win32 platform I would guide you to use the Slim Reader Writer Locks if possible because they are robust and fast.
Though it won't help for a production product today, in Visual Studio 2010 Beta 1 we've added a reader_writer class which, while not cross-platform, will be part of the VS2010 redist.
Since you've implemented a read/write lock, you should obviously test it in a multi-threaded environment. Test scenarios such as "multiple reading threads are not blocked when there are no write operations" and "thousands of read/write operations over several hours cause no deadlock" might be a good start.
Use ASSERT/assert often to test all of your assumptions and to check pre and post conditions.
Run several threads that hammer the lock and assert invariants that must always hold. For example (using the same hypothetical names as before), have writers increment a shared counter under the write lock while readers assert the counter never appears to go backwards or tear:

long long int counter = 0;  // shared; protected by your read/write lock
                            // (64 bits on purpose, so plain updates aren't atomic)

void writerThread() {
    for (long long int i = 0; i < 100000; ++i) {
        YourWriteLock();
        ++counter;
        YourWriteUnlock();
        SleepSomeRandomPeriod( );
    }
}

void readerThread() {
    long long int last = 0;
    for (long long int i = 0; i < 100000; ++i) {
        YourReadLock();
        long long int now = counter;
        YourReadUnlock();
        assert(now >= last);  // must never observe the counter going backwards
        last = now;
        SleepSomeRandomPeriod( );
    }
}
IBM have an analysis tool for detecting threading issues in Java. Perhaps there is something similar for C++?
Make sure you test on a machine with multiple CPU cores, or at least a CPU with hyperthreading. There are many problems with multiple threads that only occur, or occur much more frequently, when threads are really running in parallel on different CPUs.
I just stumbled on Protothreads. They seem superior to native threads since context switches are explicit.
My question is: does this make multi-threaded programming an easy task again?
(I think so. But have I missed something?)
They're not "superior" - they're just different and fit another purpose. Protothreads are simulated, and hence aren't real threads. They won't run on multiple cores, and they will all block on a single system call (socket recv() and such). Hence you shouldn't see it as a "silver bullet" that solves all multithreading problems. Such threads have existed for Java, Ruby and Python for quite some time now.
On the other hand, they're very lightweight so they do make some tasks quicker and simpler. They're suitable for small embedded systems because of low code and memory footprint. If you design your whole system (including an "OS", as is customary on small embedded devices) from the ground up, protothreads can provide a simple way to achieve concurrency.
Also read up on green threads.