Why doesn't Eigen support OpenMP for coefficient-wise operations? - c++

This post along with some tests made it clear Eigen does not apply multiprocessing to coefficient-wise operations, such as cwiseProduct or Array multiplication, although matrix-matrix products can exploit multiple cores.
Still, with some optimization Eigen seems to be quite fast, and even if I try to write my own matrix library for particular purposes, I doubt if it will be faster than Eigen even if OpenMP for my library is enabled.
Why doesn't Eigen support OpenMP when it comes to coefficient-wise operations? Is it some sort of a blunder by the developer or are there some specific reasons to avoid multiprocessing for specific operations?
Can I manually include OpenMP support for such operations? The code for Eigen seems complicated so it is hard to find the exact implementation of a particular function, even through the use of Visual Studio instruments.

Why doesn't Eigen support OpenMP when it comes to coefficient-wise operations? Is it some sort of a blunder by the developer or are there some specific reasons to avoid multiprocessing for specific operations?
Using multiple threads by default would not be a good idea as users can already use multiple threads in their applications and having nested parallel loops is clearly not efficient. Moreover, sharing the work for each operation can introduce a significant overhead that is not great for basic operations small/medium-sized arrays. Eigen is meant to be fast for both small and big arrays. Using OpenMP on top of Eigen is better in practice. This is especially true one Numa systems due to the first touch policy: hiding the multithreading can introduce surprising overheads due to the remote accesses or page migrations.
For complex operations like LU decomposition, this is not reasonable to ask to the user to parallelise the operation as it would need to rewrite most of the algorithm. This is why Eigen try to parallelize such algorithms.
Can I manually include OpenMP support for such operations? The code for Eigen seems complicated so it is hard to find the exact implementation of a particular function, even through the use of Visual Studio instruments.
Element-wise operation are trivial to implement in parallel in OpenMP. You can simply use a #pragma omp parallel for directive (possibly combined with a collapse clause) on a loops iterating for 0 to array.rows() (or array.cols()) itself using basic Eigen array indexing operators.
Note that while using multiple threads for basic operations seems interesting at first glance, in practice, it is often disappointing in term of speed up on most machines. Indeed, reading/writing data in the RAM is very expensive compare to applying arithmetic operations (eg. add/sub/mul) on each items as long as the operation is already vectorized using SIMD instruction. For example, on my machine 1 core can reach a throughput of 20-25 GiB/s while the maximum throughput is 40 GiB/s with 6 cores (ie. speed-up less than 2x with 6x threads).

Related

What is the expected speed-up of using parallelization in C++ (not OpenMp, but <thread>)

What is the expected theoretical speed-up of using parallelization in C++?
For example, say I have 2 cores, and 4 logical processors. If I use a fully optimized parallel program to execute some tasks for me using 4 threads working at maximum capacity, how much of a speed-up over the serial code can I expect? Twice as fast? Four times as fast?
Please provide a reference for your answer.
And please do not close this question as being too broad or not containing a code sample. Providing a code sample would defeat the purpose of the question, since I am in search of a general, theoretical answer that might be used in a sales pitch for parallel computing. I am NOT wondering about the particular efficiency of some particular piece of code.
There is no limit imposed by using <thread>. It creates OS threads so can scale linearly with how many cores you have.
Now for the question of real cores vs. logical processors (Hyperthreading, SMT) you might find https://superuser.com/a/279803/112292 interesting. There is also lots of other benchmarks out there.
SMT is generally good when it can hide memory latency. So the speedup of SMT you can gain is purely dependent on your application (is it compute heavy, is it memory heavy?) and the only way to find is benchmark.
There is no specific number.
More practically, there is nothing in std::thread that has to impede linear scaling. And that translates to the real world. Using dozens of CPU cores is trivial with STD: thread.

Working with many fixed-size matrices in CUDA kernels

I am looking to work about 4000 fixed-size (3x3, 4x4) matrices, doing things such as matrix inversion and eigendecomposition.
It seems to me the best way to parallelize this would be to let each of the many GPU threads work on a single instance of the problem.
Is there a reasonable way to do this? I have read: http://www.culatools.com/blog/2011/12/09/batched-operations/ but as far as I can tell, it's always something that is "being worked on" with no solution in sight. Three years later, I hope there is a good solution.
So far, I have looked at:
Using Eigen in CUDA kernels: http://eigen.tuxfamily.org/dox-devel/TopicCUDA.html. But this is in its infancy: thus, it doesn't seem to work well and some things are not implemented. Moreover, I am not sure if it is optimized for CUDA at all. There is almost no documentation and the only example of code is a test file (eigen/test/cuda_basic.cu). When I tried using Eigen in CUDA kernels, simple things like declaring an Eigen::MatrixXf in a kernel did not survive compilation with nvcc V7.0.27 and Eigen 3.2.90 (mercurial).
Using the cuBLAS device API library to run BLAS routines within a kernel. It seems cuBLAS and its ilk are written to be parallelized even for small matrices, which seems overkill and likely slow for the 3x3 and 4x4 matrices I am interested in. Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
Batch processing kernels using CUDA streams. In Section 2.1.7 "Batching Kernels" of the cuBLAS documentation for the CUDA Toolkit v7.0, this is suggested. But """in practice it is not possible to have more than 16 concurrent kernels executing at the same time""" and consequently it would be terrible for processing 4000 small matrices. In an aforementioned link to the CULA blog post, I quote, """One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; [...] Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU."""
Implementing my own matrix multiplication and eigendecomposition in kernels. This is likely to be very slow, and may in addition be time consuming to implement.
At this point I am tempted to give up on doing this on the GPU at all. It is a pity, since I was hoping for real time performance for an algorithm that requires inverting 4000 3x3 matrices about 100 times every 0.1 seconds.
The cublas functions getrfBatched and getriBatched are designed for batch inversion of small matrices. This should be quicker than either dynamic parallelism or streams (your 2nd and 3rd approaches.) Also a batch solver is available in source code form that can do matrix inversions. You will need to log in as a registered developer at developer.nvidia.com to access this link.
Also, I'm not sure if there is anything like cuBLAS that can also do eigendecomposition or SVD. (As far as I know, CULA does not support calling its routines from within kernels).
Cusolver provides some eigen solver functions. However they are not batched nor callable from device code, so you're faced with streams as the only option beyond that.

Speed of C++ operators/ simple math

I'm working on a physics engine and feel it would help having a better understanding of the speed and performance effects of performing many simple or complex math operations.
A large part of a physics engine is weeding out the unnecessary computations, but at what point are the computations small enough that a comparative checks aren't necessary?
eg: Testing if two line segments intersect. Should there be check on if they're near each other before just going straight into the simple math, or would the extra operation slow down the process in the long run?
How much time do different mathematical calculations take
eg: (3+8) vs (5x4) vs (log(8)) etc.
How much time do inequality checks take?
eg: >, <, =
You'll have to do profiling.
Basic operations, like additions or multiplications should only take one asm instructions.
EDIT: As per the comments, although taking one asm instruction, multiplications can expand to microinstructions.
Logarithms take longer.
Also one asm instruction.
Unless you profile your code, there's no way to tell where your bottlenecks are.
Unless you call math operations millions of times (and probably even if you do), a good choice of algorithms or some other high-level optimization will results in a bigger speed gain than optimizing the small stuff.
You should write code that is easy to read and easy to modify, and only if you're not satisfied with the performance then, start optimizing - first high-level, and only afterwards low-level.
You might also want to try dynamic programming or caching.
As regards 2 and 3, I could refer you to the Intel® 64 and IA-32 Architectures Optimization Reference Manual. Appendix C presents the latencies and the throughput of various instructions.
However, unless you hand-code assembly code, your compiler will apply its own optimizations, so using this information directly would be rather difficult.
More importantly, you could use SIMD to vectorize your code and run computations in parallel. Also, memory performance can be a bottleneck if your memory layout is not ideal. The document I linked to has chapters on both issues.
However, as #Ph0en1x said, the first step would be choosing (or writing) an efficient algorithm, making it work for your problem. Only then should you start wondering about low-level optimizations.
As for 1, in a general case I'd say that if your algorithm works in such a way that it has some adjustable thresholds for when to execute certain tests, you could do some profiling and print out a performance graph of some kind, and determine the optimal values for those thresholds.
Well, this depends on your hardware. Very nice tables with instruction latency are http://www.agner.org/optimize/instruction_tables.pdf
1. it depends on the code a lot. Also don't forget it doesn't depend only on computations, but how well the comparison results can be predicted.
2. Generally addition/subtraction is very fast, multiplication of floats is a bit slower. Float division is rather slow (if you need to divide by a constant c, it's often better to precompute 1/c and multiply by it). The library functions are usually (I'd dare to say always) slower than simple operators, unless the compiler decides to use SSE. For example sqrt() and 1/sqrt() can be computed using one SSE instruction.
3. From about one cycle to several dozens of cycles. The current processors does the prediction on conditions. If the prediction is right right, it will be fast. However, if the prediction is wrong, the processor has to throw away all the preloaded instructions (IIRC Sandy Bridge preloads up to 30 instructions) and start processing new instructions.
That means if you have a code, where a condition is met most of the time, it will be fast. Similarly if you have code where the condition is not met most the time, it will be fast. Simple alternating conditions (TFTFTF…) are usually fast too.
This depends on the scenario you are trying to simulate. How many objects do you have and how close are they? Are they clustered or distributed evenly? Do your objects move around alot, or are they static? You will have to run tests. Possible data-structures for fast checking of proximity are kd-trees or locality-sensitive hashes (there may be others). I am not sure if these are appropriate for your application, you'd have to check if the maintenance of the data-structure and the lookup-cost are OK for you.
You will have to run tests. Consider checking if you can use vectorization, or if you can even run some of the computations in a GPU using CUDA or something like that.
Same as above - you have to test.
You can generally consider inequality checks, increment, decrement, bit shifts, addition and subtraction to be really cheap. Multiplication and division are generally a little more expensive. Complex math operations like logarithms are much more expensive.
Benchmark on your platform to be sure. Be careful about benchmarking using artificial tests with tight loops -- that tends to give you misleading results. Try to benchmark in code that's as realistic as possible. Ideally, profile the actual code under realistic conditions.
As for the optimizations for things like line intersection, it depends on the data set. If you do a lot of checks and most of your lines are short, it may be worth a quick check to rule out cases where the X or Y ranges don't overlap.
as much as I know all "inequality checks" take the same time.
regarding the rest calculations, I would advice you to run some tests like
take time stamp A
make 1,000,000 "+" calculation (or any other).
take time stamp B
calculate the diff between A and B.
then you can compare the calculations.
take in mind:
using different mathematical lib may change it (some math lib are more performance oriented and some more precision oriented)
the compiler optimization may change it.
each processor is doing it differently.

How to improve computational time by parallelizing

I have written a c++ code (using STL) and due to large computations it takes about one hour for the output to come. I checked on parallelizing on GPU and CPU. I have a ATI graphics card and a core i7 processor. On which one should i parallelize for better results.
Also can you please suggest reading material on how to set up my compiler for parallelizing on any of these platforms and how do i start parallelizing
For general libraries regarding multi-core/GPU programming:
Thrust for GPU/CPU STL-like interface programming
OpenMP for multi-threaded parallel code
TBB Intel Threading Building Blocks, lots of primitive data structures for parallel programming
in general, this area is absolutely vast, and no answer can make justice of the topic. There are many ways to approach parallelization, and that begins with analysing your logic and looking in parts that can be efficiently computed in parallel, and design (or redesign) your algorithms around those results.
You could also consider recoding your numerical kernels using OpenCL (and its ATI Stream implementation for your graphical card).

What are the recommended C++ parallelization libraries for large data processing

Can some one recommend approaches to parallelize in C++, when the data to be acted up on is huge. I have been reading about openMP and Intel's TBB for parallelization in C++, but have not experimented with them yet. Which of these is better for parallel data processing ? Any other libraries/ approaches ?
"large" and "data processing" cover a lot of ground here, and it's hard to give a sensible answer without more information.
If the data processing is "embarrassingly parallel" -- if it involves doing lots and lots of calculations that are completely independant of each other -- then there's a million things that will work and it's just a matter of finding something that matches your code and background.
If it isn't embarrasingly parallel, but nearly so - the computations take a big chunk of data but just distill it into a handfull of numbers - there's fewer, but still lots of options.
If the calculation is more tightly coupled than this - where you need the processors to work on tandem on big chunks of data then you're probably stuck with the standbys - the OpenMP features of your compiler if it will work on a single machine (there's TBB, too, but usually for number crunching OpenMP is faster and easier) or MPI if it needs several machines simultaneously. You mentioned C++; Boost has a very nice MPI layer.
But thinking about which library to use for parallelization is probably thinking about the wrong end of the problem first. In many cases, you don't necessarily need to deal with these layers directly. If the number crunching involves lots of linear algebra (for instance), then PLASMA (for multicore machines - http://icl.cs.utk.edu/plasma/ ) or PetSC, which has support for distributed memory machines, eg, multiple computers ( http://www.mcs.anl.gov/petsc/petsc-as/ ) are good choices, which can completely hide the actual details of the parallel implementation from you. Other sorts of techniques have other libraries, too. It's probably best to think about what sort of analysis you need to do, and look to see if existing toolkits have the amount of parallization you need. Only once you've determined the answer is no should you start to worry about how to roll your own.
Both OpenMP and Intel TBB are for local use as they help in writing multithreaded applications.
If you have truly huge datasets, you may need to split load over several machines -- and then libraries like Open MPI for parallel programming with MPI come into play. Open MPI has a C++ interface, but you now also face a networking component and some administrative issues you do not have with a single computer.
MPI is also useful on a single local machine. It will run a job across multiple cores/CPUs, while this is probably overkill compared to threading it does mean you can move the job to a cluster with no changes. Most MPI implementations also optimize a local job to use shared memory instead of TCP for data connections.