Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I understand the basic difference between concurrency and parallelism. I often use std::async in my programs, which gives me concurrency.
Are there any reliable/notable libraries that provide parallelism in C++? (I know this is a likely feature of C++17.) If so, what are your experiences with them?
Thanks!
Barbra
Threading Building Blocks (TBB) is a C++ template library for task parallelism. The library contains various algorithms and data structures specialized for task parallelism. I have had success using parallel_for as well as parallel_pipeline to greatly speed up computations. With a little extra coding, TBB's parallel_for can take a serial for loop that is suitable for parallel execution and run it in parallel (see example here). TBB's parallel_pipeline can execute a chain of dependent tasks, with each stage optionally running in parallel or serially (see example here). There are many more examples on the web, especially at software.intel.com and here on Stack Overflow (see here).
OpenMP is an API for thread parallelism that is accessed primarily through compiler directives. Although I prefer the richer feature set provided by TBB, OpenMP can be a quick way of testing out parallel algorithms and code (just add a pragma and set some build settings). Once things have been tested and experimented with, I have found that converting certain uses of OpenMP to TBB can be done fairly easily. This isn't to say that OpenMP is not meant for serious coding. In fact, there may be instances in which one would prefer OpenMP over TBB (one is that, because it relies primarily on pragmas, switching to serial execution can be easier than with TBB). A number of open source projects that use OpenMP can be found in this discussion. There are a number of examples (e.g., on Wikipedia) and tutorials on the web for OpenMP, including many questions here on Stack Overflow.
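To make the "just add a pragma" point concrete, here is a minimal sketch. The function name sum_of_squares is made up for illustration. When the file is built with OpenMP enabled (for example -fopenmp on GCC/Clang or /openmp on MSVC) the loop is split across threads; when it is built without that option the pragma is ignored and the same code runs serially.

```cpp
#include <cstddef>
#include <vector>

// Sums the squares of the inputs.
// With OpenMP enabled, iterations are divided among threads and the
// per-thread partial sums are combined by the reduction clause.
// Without OpenMP, the pragma is ignored and the loop runs serially.
double sum_of_squares(const std::vector<double>& values) {
    double total = 0.0;
    // A signed loop index keeps the loop acceptable to older OpenMP versions.
    const long long n = static_cast<long long>(values.size());
    #pragma omp parallel for reduction(+:total)
    for (long long i = 0; i < n; ++i) {
        const double v = values[static_cast<std::size_t>(i)];
        total += v * v;
    }
    return total;
}
```

Because the serial and parallel versions are the same source, comparing them is only a matter of toggling the build option.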
I previously neglected a discussion of SIMD (single instruction, multiple data), which provides data parallelism. As pointed out in the comments below, OpenMP is an option for exploring SIMD (check this link). Extensions to instruction sets such as SSE and AVX (both extensions to the x86 instruction set architecture) as well as NEON (ARM architecture) are also worth exploring. I have had good and bad experiences with SSE and AVX. The good is that they can provide a nice speedup for certain algorithms (in particular, I have used Intel intrinsics). The bad is that these instructions depend on specific CPU support: running them on a processor that lacks the extension fails at run time with an illegal-instruction fault.
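One common way to avoid that failure is to check the CPU at run time and fall back to portable code. The sketch below assumes GCC or Clang on x86, where the __builtin_cpu_supports builtin and the target function attribute are available; on any other compiler or architecture it simply uses the scalar path. The names dot, dot_scalar, dot_avx and the macro HAVE_X86_DISPATCH are made up for illustration.

```cpp
#include <cstddef>

// Assumption: GCC or Clang on x86. Elsewhere only the scalar path is built.
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

// Portable fallback that runs on every CPU.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#ifdef HAVE_X86_DISPATCH
// AVX version: only this function is compiled with AVX enabled,
// so the rest of the program stays safe on older CPUs.
__attribute__((target("avx")))
float dot_avx(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {  // eight floats per iteration
        acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                               _mm256_loadu_ps(b + i)));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (float lane : lanes) sum += lane;   // horizontal sum of the lanes
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
    return sum;
}
#endif

// Chooses the implementation at run time based on what the CPU reports.
float dot(const float* a, const float* b, std::size_t n) {
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx")) return dot_avx(a, b, n);
#endif
    return dot_scalar(a, b, n);
}
```

The two paths add the products in a different order, so results for general floating-point data can differ in the last bits.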
Specifically with respect to parallelism and mathematics, I have had good experiences using Intel MKL (which now has a no-cost option) as well as OpenBLAS. These libraries provide optimized, parallel, and/or vectorized implementations of common mathematical functions/routines (e.g., BLAS and LAPACK). Many more mathematics libraries exist that involve optimized parallelism to some extent. While they may not provide lower-level building blocks of parallelism (e.g., the ability to manipulate threads or schedule tasks), it is very worthwhile to utilize (and contribute to) the immense amount of research and work in the field of computational mathematics. A similar statement could be made for areas of interest outside of mathematics.
What I mean: take Unreal Engine 4, for example. It runs well on Intel but lags considerably on AMD (both the editor and many games). Is there any difference between the two when it comes to coding? How can I write highly optimized code for both of them?
Thanks.
As ALWAYS for optimising any code, the biggest gains are in changing the algorithm to the most efficient one for your data-set. Do this before you do any other performance optimisation.
The second step in improving performance is to figure out which parts of your code account for the biggest "hit", and concentrate on those. Of course, that becomes a "peeling an onion" problem: when you improve the performance of one function, you end up with something else being the slowest part...
I'm not going to search out and link to the various performance optimisation pages (documents, etc) available. Both Intel and AMD have optimisation guides, with comments on what each of their processor models can do and which code sequences to use (as does, for example, ARM for its various processor models). All compiler vendors have lists of which options affect code generation in what way (e.g. enabling SSE, AVX, etc). Different compilers are more or less good at actually USING "new" instructions available in the latest versions of processors.
Optimising code for one processor subarchitecture [the difference between processors within for example x86, ARM, etc] is not terribly hard. Writing code for multiple processor subarchitectures gets quite hard, especially if you want to squeeze that last little bit of performance out of the processor, because the tricks you have to use are specific to each subarchitecture. There are a few classes of problems:
Different features available in different processors require code to be compiled with the right code-generation options enabled (e.g. SSE, AVX, etc). So, you need to "split" the code into generic code and code that can make use of vector instructions, and either make the compiler vectorise it, or hand-write assembler to make best use of the instructions.
Minor architectural differences make different instruction sequences more or less good. So on processor X, you should use instructions A, B and C to replace instruction M (because M is unusually slow), but on processor Y, the one instruction M is faster than A, B and C. So again, you have to pick which one you make it fast for, or compile the same code multiple times.
Caches are different in different architectures, meaning that optimisation to make something like "copy this data" fast on one architecture may not show the same improvement on another architecture.
Beyond that, you really need to ask a more specific question about some specific code that you know is slow.
I have an application that involves large amounts of matrix multiplications written using Eigen. I would like to make a GPU computational backend for it, while maintaining the ability to run on the CPU only and be accelerated by MKL when available.
The problem:
Add a GPU computational backend in a way that shares as much code as possible with the CPU backend.
The easiest way to achieve GPU acceleration is through Eigen-magma. However, this is quite limited, since there are unnecessary copies back and forth between main memory and GPU memory on every operation, which limits the performance gain one could get from the GPU.
I know that I would have to ditch Eigen completely and rewrite the application, but what would be the best way to do so without having completely separate code paths for the CPU and GPU computational backends?
With CUDA 6 and its automatic managed memory feature you can easily avoid these extra copies by letting the driver perform them only when really necessary. It should be easy to adapt Eigen-magma to take advantage of that feature. This is the way we plan to natively support CUDA within Eigen.
There are several approaches you can take for the more generic case:
CUDA allows one to compile functions for both the host and the device (using __host__ __device__). A runtime switch would be able to change whether your resulting code has a for loop which iterates over elements calling your __host__ __device__ function or a __global__ function calling it.
OpenACC is probably the closest you'll get to a performance portable approach. It's directive based, and so the back-end has a bit more leeway to optimise for the targeted platform. That said, if you are targeting a specific architecture, you'll probably be able to outperform it with hand-coded implementations. How much so will vary with the problem and the programmer's ability. Being directive based, to compile for the host you just don't use the OpenACC flags.
OpenCL. OpenCL, like OpenACC, can target several platforms, but parallelism is expressed explicitly (like CUDA) rather than through directives. The pros and cons of OpenCL vs CUDA are outside the scope of the question.
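The first approach (one function compiled for host and device) can be sketched roughly as follows. The macro name HOST_DEVICE and the functions saxpy_element, saxpy_kernel and saxpy_host are made up for illustration. The file builds as ordinary C++ with a normal compiler; the kernel is only compiled when the file goes through nvcc, which defines __CUDACC__.

```cpp
#include <cstddef>
#include <vector>

// HOST_DEVICE is a hypothetical macro name: under nvcc it expands to the
// CUDA qualifiers, under an ordinary C++ compiler it expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// The per-element work lives in one place and is shared by both back-ends.
HOST_DEVICE inline float saxpy_element(float a, float x, float y) {
    return a * x + y;
}

#ifdef __CUDACC__
// GPU back-end: one thread per element calls the shared function.
__global__ void saxpy_kernel(float a, const float* x, float* y, std::size_t n) {
    const std::size_t i =
        blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) y[i] = saxpy_element(a, x[i], y[i]);
}
#endif

// CPU back-end: a plain loop calls the same function.
void saxpy_host(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < y.size() && i < x.size(); ++i) {
        y[i] = saxpy_element(a, x[i], y[i]);
    }
}
```

The runtime switch mentioned above would then pick between calling saxpy_host and launching saxpy_kernel; only the thin driver code differs between the two back-ends.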
The inevitable problem with sharing code between different architectures is performance. You cannot expect code to be both fast and performance portable across x86/NVIDIA/AMD/Xeon Phi. With a single code base intended to run on multiple architectures I think OpenACC is probably the best way to go, but you'll still be leaving a lot of performance on the table relative to code written with a specific architecture in mind.
With the rapid development of C++ compilers, especially the Intel ones, and the ability to apply SIMD functions directly in your C/C++ code, does Fortran still hold any real advantage in the world of numerical computation?
I come from an applied maths background. My job involves a lot of numerical analysis, computation, optimisation and such, with strictly defined performance requirements.
I hardly know anything about Fortran. I have some experience in C/CUDA/MATLAB (if you consider the latter a programming language to begin with), and my daily work involves analysing very large data sets (e.g. a 10 GB matrix). The program seems to spend at least 2/3 of its time on memory access (that's why I send some of its work to the GPU). Do you think it would be worth the effort to try the Fortran route for at least some performance-critical parts of my code?
Because of the complexity and the amount of work involved, I will only go down that route if there is a significant performance benefit. Thanks in advance.
Fortran has stricter aliasing rules than C++ and has been aggressively tuned for numerical performance for decades. Algorithms that use the CPU to work with arrays of data often have the potential to benefit from a Fortran implementation.
The programming languages shootout should not be taken too seriously, but of the 15 benchmarks, Fortran ranks #1 for speed on four of them (for Intel Q6600 one core), more than any other single language. You can see that the benchmarks where Fortran shines are the heavily numerical ones:
spectral norm 27% faster
fasta 67% faster
mandelbrot 56% faster
pidigits 18% faster
Counterexample:
k-nucleotide 500% slower (this benchmark focuses heavily on more sophisticated data structures and string processing, which is not Fortran's strength)
You can also see a summary page "how many times slower" that shows that out of all implementations, the Fortran code is on average closest to the fastest implementation for each benchmark -- although the quantile bars are much larger than for C++, indicating Fortran is unsuited for some tasks that C++ is good at, but you should know that already.
So the questions you will need to ask yourself are:
Is the speed of this function so critical that reimplementing it in Fortran is worth my time?
Is performance so important that my investment in learning Fortran will pay off?
Is it possible to use a library like ATLAS instead of writing the code myself?
Answering these questions would require detailed knowledge of your code base and business model, so I can't answer those. But yes, Fortran implementations are often faster than C++ implementations.
Another factor in your decision is the amount of sample code and the quantity of reference implementations available. Fortran's strong history means that there is a wealth of numerical code available for download and even with a trip to the library. As always you will need to sift through it to find the good stuff.
The complete and correct answer to your question is, "yes, Fortran does hold some advantages".
C++ also holds some, different, advantages. So do Python, R, etc etc. They're different languages. It's easier and faster to do some things in one language, and some in others. All are widely used in their communities, and for very good reasons.
Anything else, in the absence of more specific questions, is just noise and language-war-bait, which is why I've voted to close the question and hope others will too.
Fortran is just naturally suited for numerical programming. You tend to have a large amount of numbers in such programs, typically arranged in arrays. Arrays are first-class citizens in Fortran, and it is often pretty straightforward to translate numerical kernels from Matlab into Fortran.
Regarding potential performance advantages, see the other answers, which cover this quite nicely. The bottom line is probably that you can create highly efficient numerical applications with most compiled languages today, but you might have to jump through some hoops to get there. Fortran was carefully designed so that its language features allow the compiler to recognize most opportunities for optimization. Of course you can also write arbitrarily slow code in any compiled language, including Fortran.
In any case you should pick the tool that suits the job. Fortran suits numerical applications, C suits system-related development. As a final remark, learning Fortran basics is not hard, and it is always worthwhile to have a look at other languages. This opens a different view on the problems you want to solve.
Also worth mentioning is that Fortran is a lot easier to master than C++. In fact, Fortran has a shorter language spec than plain C and its syntax is arguably simpler. You can pick it up very quickly.
Meaning that if you are only interested in learning C++ or Fortran to solve a single specific problem you have at the moment (say, to speed up the bottlenecks in something you wrote in a prototyping language), Fortran might give you a better return on investment.
Fortran code is generally better for matrix- and vector-type operations. But you can also get similar performance from C/C++ code by passing hints to the compiler so that it produces vector instructions of similar quality. One option that gave me a good boost was telling the compiler not to assume memory aliasing among input variables that are array objects. This way, the compiler can aggressively unroll and pipeline inner loops for ILP, overlapping load and store operations across loop iterations with the right prefetches.
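As an illustration of the no-aliasing hint, here is a minimal sketch using __restrict. Note that __restrict is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++, and the function name axpy is only an example. The qualifier is a promise from the programmer: if x and y do overlap, the behaviour is undefined.

```cpp
#include <cstddef>

// Computes y[i] += a * x[i] for i in [0, n).
// __restrict tells the compiler that x and y never overlap, so a store to
// y[i] cannot change any x[j]. That is the guarantee Fortran gives for
// array arguments by default, and it frees the compiler to vectorise and
// reorder loads and stores across iterations.
void axpy(std::size_t n, double a,
          const double* __restrict x, double* __restrict y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Whether this actually helps depends on the compiler and the loop; checking the generated assembly or the compiler's vectorisation report is the only way to be sure.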
I want to compare PPL and OpenMP with regard to performance, but I can't find a detailed investigation on the web. I believe there are not many people who are experienced with PPL.
I'm developing my software on Windows, using Visual Studio 2010, and don't want to port it anywhere else in the short term.
If portability is not an issue and the only concern is performance, what do you think about these two methods?
On MSDN there is a great comparison of the properties of OpenMP and ConcRT (the core of PPL):
The OpenMP model is an especially good match for high-performance computing, where very large computational problems are distributed across the processing resources of a single computer. In this scenario, the hardware environment is known and the developer can reasonably expect to have exclusive access to computing resources when the algorithm is executed.
However, other, less constrained computing environments may not be a good match for OpenMP. For example, recursive problems (such as the quicksort algorithm or searching a tree of data) are more difficult to implement by using OpenMP. The Concurrency Runtime complements the capabilities of OpenMP by providing the Parallel Patterns Library (PPL) and the Asynchronous Agents Library. Unlike OpenMP, the Concurrency Runtime provides a dynamic scheduler that adapts to available resources and adjusts the degree of parallelism as workloads change.
So, the main disadvantages of OpenMP are:
a static scheduling model.
no cancellation mechanism (a very big disadvantage; many concurrent algorithms require cancellation).
no concurrent agents approach.
trouble with exceptions in parallel code.
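For the recursive case that the MSDN excerpt mentions, a task-based formulation is usually the natural fit. The sketch below is not PPL code; it uses standard std::async as a stand-in to show the shape of a recursive divide-and-conquer. The function name parallel_sum, the depth budget and the 1024-element cutoff are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// Recursively sums data[first, last). Each level hands the left half to a
// new task until the depth budget is spent or the range is small, then
// falls back to a serial sum.
long long parallel_sum(const std::vector<int>& data, std::size_t first,
                       std::size_t last, int depth) {
    if (depth <= 0 || last - first < 1024) {
        return std::accumulate(
            data.begin() + static_cast<std::ptrdiff_t>(first),
            data.begin() + static_cast<std::ptrdiff_t>(last), 0LL);
    }
    const std::size_t mid = first + (last - first) / 2;
    // Left half runs as a separate task; right half runs on this thread.
    auto left = std::async(std::launch::async, parallel_sum,
                           std::cref(data), first, mid, depth - 1);
    const long long right = parallel_sum(data, mid, last, depth - 1);
    return left.get() + right;
}
```

A depth of 3 spawns at most 7 extra tasks. A runtime with a dynamic scheduler, such as the one the excerpt describes, makes that kind of manual budget less necessary.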
It probably depends on your algorithm; however, this research indicates that PPL may be faster than OpenMP:
http://www.codeproject.com/Articles/373305/Visual-Cplusplus-11-Beta-Benchmark-of-Parallel-Loo
Serial : 72ms
OpenMP : 16ms
PPL : 12ms
If your only concern is performance then what I think about the two approaches is completely irrelevant. This is a question resolvable by an empirical approach, not by argumentation.
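A minimal way to run that experiment is to wrap each candidate in a small timing helper and compare the numbers on your own data. The helper name median_ms is made up; it uses only the standard library.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Runs `work` `repeats` times and returns the median wall-clock time in
// milliseconds. The median is less sensitive than the mean to a single
// slow run (cold caches, scheduler noise).
template <typename Work>
double median_ms(Work&& work, int repeats) {
    std::vector<double> samples;
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples.empty() ? 0.0 : samples[samples.size() / 2];
}
```

You would call it once with a lambda containing the OpenMP loop and once with the PPL loop, using the same input and an optimised build for both.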
I'm looking to add some genetic algorithms to an operations research project I have been involved in. Currently we have a program that aids in optimizing some scheduling, and we want to add some heuristics in the form of genetic algorithms. Are there any good libraries for generic genetic programming/algorithms in C++? Or would you recommend I just code my own?
I should add that while I am not new to C++, I am fairly new to doing this sort of mathematical optimization work in C++, as the group I worked with previously tended to use a proprietary optimization package.
We have a fitness function that is fairly computationally intensive to evaluate, and we have a cluster to run this on, so parallelized code is highly desirable.
So is C++ a good language for this? If not, please recommend some other ones, as I am willing to learn another language if it makes life easier.
thanks!
I would recommend rolling your own. 90% of the work in a GP is coding the genotype, how it gets operated on, and the fitness calculation. These are parts that change for every different problem/project. The actual evolutionary algorithm part is usually quite simple.
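To show how small the generic part can be, here is a minimal sketch of a genetic algorithm on a toy problem ("OneMax": maximise the number of 1-bits). The genotype, the fitness function and all parameter choices are placeholders; in a real project those are exactly the parts you would replace. The sketch assumes bits >= 1 and pop_size >= 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

using Genome = std::vector<int>;  // problem-specific genotype: a bit string

// Problem-specific fitness: number of 1-bits (toy "OneMax" problem).
int fitness(const Genome& g) {
    return static_cast<int>(std::count(g.begin(), g.end(), 1));
}

// Generic evolutionary loop: tournament selection, one-point crossover,
// per-bit mutation, and elitism (the best genome always survives).
Genome evolve(std::size_t bits, std::size_t pop_size, int generations,
              unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> bit(0, 1);
    std::uniform_int_distribution<std::size_t> pick(0, pop_size - 1);
    std::uniform_int_distribution<std::size_t> cut(0, bits);
    std::bernoulli_distribution mutate(1.0 / static_cast<double>(bits));

    std::vector<Genome> pop(pop_size, Genome(bits));
    for (auto& g : pop)
        for (auto& b : g) b = bit(rng);  // random initial population

    auto worse = [](const Genome& a, const Genome& b) {
        return fitness(a) < fitness(b);
    };
    auto tournament = [&]() -> const Genome& {  // best of two random picks
        const Genome& a = pop[pick(rng)];
        const Genome& b = pop[pick(rng)];
        return fitness(a) >= fitness(b) ? a : b;
    };

    for (int gen = 0; gen < generations; ++gen) {
        std::vector<Genome> next;
        next.push_back(*std::max_element(pop.begin(), pop.end(), worse));
        while (next.size() < pop_size) {
            Genome child = tournament();
            const Genome& other = tournament();
            for (std::size_t i = cut(rng); i < bits; ++i) child[i] = other[i];
            for (auto& b : child)
                if (mutate(rng)) b = 1 - b;  // flip the bit
            next.push_back(child);
        }
        pop.swap(next);
    }
    return *std::max_element(pop.begin(), pop.end(), worse);
}
```

For example, evolve(32, 40, 60, 42u) runs 60 generations on a population of 40 genomes of 32 bits each and returns the best genome found.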
There are several GP libraries out there ( http://en.wikipedia.org/wiki/Symbolic_Regression#Implementations ). I would use these as examples and references though.
C++ is a good choice for GP because they tend to be very computationally intensive. Usually, the fitness function is the bottleneck, so it's worthwhile to at least make this part compiled/optimized.
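Since the fitness function is usually the bottleneck, evaluating the population in parallel is often the first thing worth doing. A minimal single-machine sketch with std::async follows; the name evaluate_all and the chunking scheme are assumptions, and the fitness callable must be safe to call from several threads at once. Spreading the work over a cluster would need something like MPI on top, which is not shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Evaluates `fitness` for every individual, using one std::async task per
// contiguous chunk of the population. Each task writes to its own slice of
// `scores`, so no locking is needed.
template <typename Individual, typename Fitness>
std::vector<double> evaluate_all(const std::vector<Individual>& population,
                                 Fitness fitness, std::size_t tasks) {
    std::vector<double> scores(population.size());
    if (tasks == 0) tasks = 1;
    const std::size_t chunk = (population.size() + tasks - 1) / tasks;
    std::vector<std::future<void>> pending;
    for (std::size_t begin = 0; begin < population.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, population.size());
        pending.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                scores[i] = fitness(population[i]);
            }
        }));
    }
    for (auto& p : pending) p.get();  // wait; rethrows task exceptions
    return scores;
}
```

A reasonable starting value for tasks is std::thread::hardware_concurrency(), adjusted after measuring.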
I use GAUL.
It's a C library with everything you want:
( pthread/fork/openmp/mpi )
( various crossover / mutation functions )
( non-GA optimisation: Hill-Climbing, N-M Simplex, Simulated annealing, Tabu, ... )
Why build your own library when there are such powerful tools?
I haven't used this personally yet, but the Age Layered Population Structure (ALPS) method has been used to generate human competitive results and has been shown to outperform several popular methods in finding optimal solutions in rough fitness landscapes. Additionally, the link contains source code in C++ FTW.
I have had similar problems. I had a complicated problem for which defining a solution as a fixed-length vector was not desirable. Even a variable-length vector did not look attractive. Most libraries focus on cases where the cost function is cheap to calculate, which did not match my problem. Lack of parallelism is another pitfall. Expecting the user to allocate memory for the library to use adds insult to injury. My cases were even more complicated because most libraries check the nonlinear constraints before evaluation, whereas I needed to check them during or after the evaluation, based on its result. It was also undesirable that I had to evaluate the solution to calculate its cost and then recalculate the solution to present it. In most cases, I had to write the cost function twice: once for the GA and once for presentation.
Having all of these problems, I eventually designed my own openGA library, which is now mature.
This library is written in C++ and distributed under the free Mozilla Public License 2.0. This guarantees that using the library does not limit your project: it can be used for commercial or non-commercial purposes for free without asking for any permission. Not all libraries are transparent in this sense.
It supports three modes of single objective, multiple objective (NSGA-III) and Interactive Genetic Algorithm (IGA).
The solution is not mandated to be a vector. It can be any structure with any customized design containing any optional values with variable length. This feature makes this library suitable for Genetic Programming (GP) applications.
C++11 is used. Templates allow flexibility in the design of the solution structure.
The standard library is enough to use this library. There is no dependency beyond that. The entire library is also a single header file for ease of use.
The library supports parallelism by default unless you turn it off. If you have an N-core CPU, the number of threads is set to N by default. You can change the settings. You can also choose whether the solution evaluations are distributed equally between threads or assigned to any thread which has finished its job and is currently idle.
The solution evaluation is separated from calculation of the final cost. This means that your evaluation function can simulate the system and keep a lot of information. Your cost function is called later and reports the cost based on the evaluation, while the evaluation results are kept for the user to use later. You do not need to recalculate them.
You can reject a solution at any time during the evaluation. No waste of time. In fact, the evaluation and constraint check are integrated.
The GA assist feature helps you produce the C++ code base from the information you provide.
If these features match what you need, I recommend having a look at the user manual and the examples of openGA.
The number of readers and citations of the related publication, as well as its GitHub stars, is increasing, and its usage keeps growing.
I suggest you have a look at the Matlab optimization toolkit. It comes with GAs out of the box; you only have to code the fitness function (and possibly a function to generate the initial population), and I believe Matlab has some C++ interoperability, so you could code your functions in C++. I am using it for my experiments, and a very nice feature is that you get all sorts of charts out of the box as well.
That said, if your aim is to learn about genetic algorithms you're better off coding it yourself, but if you just want to run experiments, Matlab and C++ (or even just Matlab) is a good option.