I have written a C++ program (using the STL) and, due to large computations, it takes about an hour to produce its output. I looked into parallelizing on the GPU and on the CPU. I have an ATI graphics card and a Core i7 processor. Which one should I parallelize on for better results?
Also, can you suggest reading material on how to set up my compiler for parallelizing on either of these platforms, and on how to start parallelizing?
For general libraries regarding multi-core/GPU programming:
Thrust, for GPU/CPU programming with an STL-like interface (a minimal sketch follows this list)
OpenMP for multi-threaded parallel code
TBB (Intel Threading Building Blocks), with many primitive data structures for parallel programming
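As a taste of the first of these, here is a minimal, hedged sketch of Thrust's STL-like style. Note that Thrust's default backend is CUDA, so it targets NVIDIA GPUs; there are also OpenMP/TBB backends for running the same code on the CPU, which matters here since you have an ATI card.

// Minimal Thrust sketch, assuming a CUDA toolkit is installed (compile with nvcc).
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main() {
    thrust::device_vector<float> d(1000, 1.0f);  // data lives on the GPU
    // Same idiom as std::transform, but executed in parallel on the device:
    thrust::transform(d.begin(), d.end(), d.begin(), thrust::negate<float>());
    return 0;
}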
In general, this area is vast, and no single answer can do justice to the topic. There are many ways to approach parallelization, and it begins with analysing your logic, looking for parts that can be efficiently computed in parallel, and designing (or redesigning) your algorithms around those results.
You could also consider recoding your numerical kernels in OpenCL (and its ATI Stream implementation for your graphics card).
From this https://software.intel.com/en-us/videos/purpose-of-the-mic-architecture I understand that applications with complex or numerous random memory accesses are not well suited to the Intel Xeon Phi. This is because the architecture has 61 cores and 8 memory controllers, and on L1 and L2 cache misses it can take hundreds of cycles to fetch the line from memory and make it ready for use by the CPU. Such applications are called latency-bound.
Then the tutorial mentions that the many-core architecture (the Xeon Phi coprocessor only) is well suited for highly parallel homogeneous code. Two questions from there:
What is referred to as homogeneous code?
What are real-world applications which can fully benefit from the MIC architecture?
I see the Intel MIC architecture as an "x86-based GPGPU", and if you are familiar with the concept of GPGPU you will find yourself familiar with the Intel MIC.
A homogeneous cluster is a system infrastructure with multiple execution units (i.e. CPUs) that all have the same features. For example, a multicore system that has four Intel Xeon processors is homogeneous.
A heterogeneous cluster is a system infrastructure with multiple execution units with different features (e.g. a CPU and a GPU). For example, my Lenovo Z510 with its Intel i7 Haswell (4 CPUs), its Nvidia GT740M (GPU) and its Intel HD Graphics 4600 (GPU) is a heterogeneous system.
An example of heterogeneous code could be a video game.
A video game has control code, executed by one core of one CPU, that controls what the other agents do; it sends shaders to execute on the GPUs, physics computations to be performed on other cores or GPUs, and so on.
In this example you need to write code that runs on the CPU (so it is "CPU aware") and code that runs on the GPU (so it is "GPU aware"). This is actually done by using different tools, different programming languages and different programming models!
Homogeneous code is code that does not need to be aware of n different programming models, one for each different kind of agent. It is just the same programming model, language and tool.
Take a look at this very simple sample code for the MPI library.
The code is all written in C; it is the same program, which just takes a different flow depending on the process rank.
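In case the link rots, here is a minimal sketch of the same idea (standard MPI calls only): every process runs the same binary, and the rank decides which flow each process takes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // how many processes are there?

    if (rank == 0)
        printf("Coordinator: managing %d workers\n", size - 1);
    else
        printf("Worker %d: doing my share of the work\n", rank);

    MPI_Finalize();
    return 0;
}

Compile with mpicc and run with, e.g., mpirun -np 4 ./a.out. This is homogeneous in the sense above: one programming model, one language, one tool.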
About the applications: well, that is really a broad question...
As said above, I see the Intel MIC as a GPGPU based on the x86 ISA (part of it, at least).
An SDK particularly useful for working with clustered systems (and listed in the video you linked) is OpenCL. It can be used for fast processing of images and for computer vision, and basically for anything that needs the same algorithm to be run billions of times with different inputs (such as cryptography applications or brute-forcing).
If you search the web for OpenCL-based projects you will get an idea.
To answer your second question, it is better to ask ourselves "What could not take advantage of the MIC architecture?" We will soon find that the further an algorithm is from the concept of stream processing and its related topics, including that of a kernel, the less suitable it is for the MIC.
First, a straightforward answer to your direct question: to get the most out of the coprocessor, your code should be able to use a large number of threads and should vectorize. How many threads? Well, you have 60 cores (+/- depending on which version you get) and 4 threads per core, with a sweet spot around 2 threads per core on many codes. Sometimes you can get good performance even if you don't use every single core. But vectorization is extremely important; the long (512-bit) vectors are a big source of speed on the coprocessor.
Now, on to programming. The Intel Xeon Phi coprocessor uses two different kinds of programming - offload and native.
In the offload model, you write a program, determine which parts of that code have enough parallelism to make use of the large number of cores on the coprocessor and mark those sections with offload directives. Then inside those offloaded sections, you write the code using some form of parallelism, like OpenMP. (Heterogeneous)
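As a hedged sketch of what an offloaded section can look like (the pragma below is Intel's Language Extensions for Offload as supported by the Intel compiler; treat the exact clauses as illustrative):

#include <omp.h>

void scale(float* a, int n) {
    // "inout(a:length(n))" copies the array to the coprocessor and back;
    // the loop below runs on the MIC, parallelized across its cores.
    #pragma offload target(mic) inout(a:length(n))
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}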
In native code, you do not use any offload directives; instead, you build with the -mmic compiler option and run the code directly on the coprocessor. The code you write will use some form of parallelism, like OpenMP, to make use of the large number of cores the coprocessor has. (Homogeneous)
Another variation on these programming models is to use MPI, often in addition to OpenMP. You can use the offload programming model, in which case the nodes in your MPI system will be the host nodes in your system. (Hybrid) Alternatively, you can use the native programming model, in which case you treat the coprocessor as just another node in your system. (Heterogeneous if hosts and coprocessors are both nodes; homogeneous if only coprocessors are used.)
You may have noticed that nothing I have said implies a separate programming style for the host and coprocessor. There are some optimizations you can make that will keep code written for the coprocessor from running on the processor as well but, in general, the code you write for the coprocessor can also be compiled for and run on the host by just changing the compiler options.
As far as real world apps, see https://software.intel.com/en-us/mic-developer/app-catalogs
My program has a bunch of matrix multiplications and inversions, which are time-consuming.
My computer: CPU: Intel i7; GPU: 512 MB NVIDIA Quadro NVS 3100M.
Which one is better for improving computing speed, OpenMP or CUDA?
(P.S. I think that, in general, the GPU has more cores than the CPU, so CUDA could give a multiple-times-larger improvement than OpenMP?)
From my experience (I worked with both for a school project): in most conditions, for a medium-sized array -- say, less than 2000 x 2000 -- the calculation time is almost the same. The actual calculation time depends on the load on your machine (when working with OpenMP you usually share a cluster with other people, so make sure you are running your application alone to get a better result).
But if you are good at CUDA, the GPU is very powerful for these kinds of calculations. When I was working on my CUDA project there were lots of good materials on the official website. OpenMP is only a library, and if you are good at C or C++ it should not be any problem for you to use (but OpenMP compiler support can be buggy -- don't trust it blindly; try to log everything).
And if you have no experience with CUDA, it is not hard to find some good examples, I think. But CUDA is really hard to debug, so I recommend you try OpenMP first; it should be easier.
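For reference, a minimal, hedged sketch of what the OpenMP route looks like for the matrix multiplication in the question (a naive triple loop; for serious work a tuned BLAS will usually beat both this and hand-written CUDA):

#include <omp.h>

// Naive C = A * B for square n x n row-major matrices.
void matmul(const double* A, const double* B, double* C, int n) {
    #pragma omp parallel for  // rows of C are computed independently
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}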
I'd guess it depends on what your application is and how you go about implementing improvements. Keep in mind that every optimization has trade-offs. For instance, GPUs typically favor lower-precision floating point (single rather than double precision), and there are compiler options that let you bypass some aspects of the IEEE standard, which brings some extra speed at the expense of precision, etc.
Can someone recommend approaches to parallelization in C++ when the data to be acted upon is huge? I have been reading about OpenMP and Intel's TBB for parallelization in C++, but have not experimented with them yet. Which of these is better for parallel data processing? Any other libraries/approaches?
"large" and "data processing" cover a lot of ground here, and it's hard to give a sensible answer without more information.
If the data processing is "embarrassingly parallel" -- if it involves doing lots and lots of calculations that are completely independent of each other -- then there are a million things that will work, and it's just a matter of finding something that matches your code and background.
If it isn't embarrassingly parallel, but nearly so -- the computations take a big chunk of data but just distill it into a handful of numbers -- there are fewer options, but still lots.
If the calculation is more tightly coupled than this -- where you need the processors to work in tandem on big chunks of data -- then you're probably stuck with the standbys: the OpenMP features of your compiler if it will work on a single machine (there's TBB, too, but usually for number crunching OpenMP is faster and easier), or MPI if it needs several machines simultaneously. You mentioned C++; Boost has a very nice MPI layer.
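As a quick, hedged illustration of that Boost layer (RAII setup plus C++-typed communicators instead of the raw C API):

#include <boost/mpi.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);  // MPI_Init/MPI_Finalize via RAII
    boost::mpi::communicator world;           // wraps MPI_COMM_WORLD
    std::cout << "process " << world.rank()
              << " of " << world.size() << std::endl;
    return 0;
}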
But thinking about which library to use for parallelization is probably thinking about the wrong end of the problem first. In many cases, you don't necessarily need to deal with these layers directly. If the number crunching involves lots of linear algebra (for instance), then PLASMA for multicore machines ( http://icl.cs.utk.edu/plasma/ ) or PETSc, which has support for distributed-memory machines, e.g., multiple computers ( http://www.mcs.anl.gov/petsc/petsc-as/ ), are good choices that can completely hide the actual details of the parallel implementation from you. Other sorts of techniques have their own libraries, too. It's probably best to think about what sort of analysis you need to do and look to see whether existing toolkits have the amount of parallelization you need. Only once you've determined that the answer is no should you start to worry about how to roll your own.
Both OpenMP and Intel TBB are for local use: they help in writing multithreaded applications.
If you have truly huge datasets, you may need to split load over several machines -- and then libraries like Open MPI for parallel programming with MPI come into play. Open MPI has a C++ interface, but you now also face a networking component and some administrative issues you do not have with a single computer.
MPI is also useful on a single local machine. It will run a job across multiple cores/CPUs; while this is probably overkill compared to threading, it does mean you can move the job to a cluster with no changes. Most MPI implementations also optimize a local job to use shared memory instead of TCP for data connections.
I am building an application that will do some object tracking from a video camera feed and use that information to run a particle system in OpenGL. The code to process the video feed is somewhat slow, 200-300 milliseconds per frame right now. The machine this will run on has a dual-core processor. To maximize performance I want to offload the camera processing to one core and just communicate relevant data back to the main application as it becomes available, while leaving the main application running on the other core.
What do I need to do to offload the camera work to the other core, and how do I handle communication with the main application?
Edit:
I am running Windows 7 64-bit.
Basically, you need to multithread your application. Each thread of execution can only saturate one core. Separate threads tend to be run on separate cores. If you are insistent that each thread ALWAYS execute on a specific core, then each operating system has its own way of specifying this (affinity masks & such)... but I wouldn't recommend it.
OpenMP is great, but it carries noticeable overhead, especially when joining back up after a parallel section. YMMV. It's easy to use, but not at all the best-performing option. It also requires compiler support.
If you're on Mac OS X 10.6 (Snow Leopard), you can use Grand Central Dispatch. It's interesting to read about, even if you don't use it, as its design implements some best practices. It also isn't optimal, but it's better than OpenMP, even though it also requires compiler support.
If you can wrap your head around breaking your application into "tasks" or "jobs", you can shove these jobs down as many pipes as you have cores. Think of batching your processing into atomic units of work. If you can segment it properly, you can run your camera processing on both cores and your main thread at the same time.
If communication is minimized for each unit of work, then your need for mutexes and other locking primitives will be minimized too. Coarse-grained threading is much easier than fine-grained. And you can always use a library or framework to ease the burden. Consider Boost's Thread library if you take the manual approach; it provides portable wrappers and a nice abstraction.
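A minimal sketch of that coarse-grained handoff, written here with std::thread (the standard-library descendant of the Boost.Thread wrappers; the names are illustrative): the camera thread publishes its newest result under a mutex, and the render loop grabs a snapshot whenever it wants one.

#include <thread>
#include <mutex>
#include <atomic>

struct TrackingData { /* positions, velocities, ... */ };

std::mutex dataMutex;
TrackingData latest;               // most recent camera result
std::atomic<bool> running{true};

void cameraLoop() {
    while (running) {
        TrackingData d;
        // ... process one frame here (the 200-300 ms of work) ...
        std::lock_guard<std::mutex> lock(dataMutex);
        latest = d;                // publish for the main thread
    }
}

TrackingData snapshot() {          // called from the main/render loop
    std::lock_guard<std::mutex> lock(dataMutex);
    return latest;
}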
It depends on how many cores you have. If you have only 2 cores (CPUs, processors, hyperthreads -- you know what I mean), then OpenMP cannot give a tremendous increase in performance, but it will help. The maximum gain is to divide your time by the number of processors, so it will still take 100-150 ms per frame.
The equation is:
parallel time = ([total time for the task] - [time in code that cannot be parallelized]) / [number of CPUs] + [time in code that cannot be parallelized]
Basically, OpenMP rocks at parallel loop processing. It's rather easy to use:
#pragma omp parallel for
for (int i = 0; i < N; i++)  // the loop variable is private to each thread
    a[i] = 2 * i;
and bang, your for loop is parallelized. It does not work for every case; not every algorithm can be parallelized this way, but many can be rewritten (hacked) to be compatible. The key principle is Single Instruction, Multiple Data (SIMD) -- applying the same convolution code to multiple pixels, for example.
But simply applying this cookbook recipe goes against the rules of optimization:
1. Benchmark your code.
2. Find the REAL bottlenecks, with "scientific" evidence (numbers) instead of simply guessing where you think there is one.
3. If it is really processing loops, then OpenMP is for you.
Maybe simple optimizations of your existing code would give better results, who knows?
Another road would be to run OpenGL in one thread and data processing in another. This will help a lot if OpenGL or your particle rendering system takes a lot of power, but remember that threading can lead to other kinds of synchronization bottlenecks.
I would recommend against OpenMP; OpenMP is more for numerical codes than for the producer/consumer model that you seem to have.
I think you can do something simple using Boost threads: spawn a worker thread, share a common segment of memory (for communicating the acquired data), and use some notification mechanism to tell you when data is available (look into Boost thread interruption).
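A hedged sketch of that notification mechanism, using a condition variable (shown with the standard-library types; boost::condition_variable is nearly identical, and "Frame" stands in for whatever your tracker actually produces):

#include <condition_variable>
#include <mutex>
#include <queue>

struct Frame { /* tracked-object data */ };

std::mutex m;
std::condition_variable dataReady;
std::queue<Frame> results;

void publish(Frame f) {            // called by the camera thread
    {
        std::lock_guard<std::mutex> lock(m);
        results.push(f);
    }
    dataReady.notify_one();        // wake the consumer: data is available
}

Frame consume() {                  // called by the main thread
    std::unique_lock<std::mutex> lock(m);
    dataReady.wait(lock, [] { return !results.empty(); });
    Frame f = results.front();
    results.pop();
    return f;
}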
I do not know what kind of processing you do, but you may want to take a look at Intel Threading Building Blocks and Intel Integrated Performance Primitives; they have several functions for video processing which may be faster (assuming they cover your functionality).
You need some kind of framework for handling multicores. OpenMP seems a fairly simple choice.
Like Pestilence said, you just need your app to be multithreaded. Lots of frameworks like OpenMP have been mentioned, so here's another one:
Intel Threading Building Blocks
I've never used it before, but I hear great things about it.
Hope this helps!
I have a serial solver in C++ for optimization problems, and I am supposed to parallelize it with different parameters to see whether that can improve its performance. Now I am not sure whether I should use TBB or MPI. From a TBB book I read, I feel TBB is more suitable for loops or fine-grained code. Since I do not have much experience with TBB, I feel it would be difficult to divide my code into small parts in order to parallelize it. In addition, in the literature I find that many authors used MPI to parallelize several solvers and make them cooperate, so I guess maybe MPI fits my needs better. Since I do not have much knowledge of either TBB or MPI: can anyone tell me whether my feeling is right? Will MPI fit me better? If so, what material is good for starting to learn MPI? I have no experience with MPI, and I use Windows and C++. Thanks a lot.
The basic decision you need to make is between shared memory and distributed memory.
Shared memory is when you have more than one process (normally more than one thread within a process) that can access a common memory. This can be quite fine-grained, and it is normally simpler to adapt a single-threaded program to use several threads. You will need to design the program so that the threads work most of the time on separate parts of the memory (exploiting data parallelism) and the shared part is protected against concurrent accesses using locks.
Distributed memory means that you have different processes that might be executed on one or several distributed computers, but these processes have a common goal and share data through message passing (data communication). There is no common memory space, and any data one process needs from another process will require communication.
It is a more general approach but, because of the communication requirements, it calls for coarse grains.
TBB is a library supporting thread-based shared-memory parallelism, while MPI is a library for distributed-memory parallelism (it has simple primitives for communication, as well as launchers for running processes on several different nodes).
The most important thing is for you to identify the parallelism within your solver and then choose the best solution. Do you have data parallelism (different threads/processes could work in parallel on different chunks of data without the need to communicate or share parts of that data)? Task parallelism (different threads/processes could perform different transformations of your data, or different steps of the data processing, in a pipeline or graph fashion)?
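To make the data-parallel case concrete, here is a minimal, hedged sketch of the TBB side (one process, shared memory, a loop split across cores; the function is illustrative):

#include <tbb/parallel_for.h>
#include <vector>

void evaluate(std::vector<double>& x) {
    // TBB splits the index range across worker threads; all threads share
    // the same address space, so nothing is copied or communicated.
    tbb::parallel_for(std::size_t(0), x.size(), [&](std::size_t i) {
        x[i] = x[i] * x[i];        // independent per-element work
    });
}

MPI, by contrast, would run several copies of the solver as separate processes and exchange parameters and results explicitly with sends and receives.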