Fortran-style multidimensional arrays in C++ - c++

Is there a C++ library which provides Fortran-style multidimensional arrays with support for slicing, passing as a procedure parameter, and decent documentation? I've looked into Blitz++, but it's dead!

I highly suggest Armadillo:
Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use
It is a C++ template library:
A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries; this is automatically accomplished through template meta-programming
A simple example from the web page:
#include <iostream>
#include <armadillo>

int main(int argc, char** argv)
{
    arma::mat A = arma::randu<arma::mat>(4, 5);
    arma::mat B = arma::randu<arma::mat>(4, 5);

    std::cout << A * B.t() << std::endl;

    return 0;
}
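Since the question asks specifically about slicing: Armadillo also provides submatrix and column/row views. A brief sketch (untested, based on the submat/col accessors in the Armadillo documentation):

#include <armadillo>

int main()
{
    arma::mat A = arma::randu<arma::mat>(4, 5);

    arma::mat S = A.submat(1, 1, 2, 3);  // rows 1..2, cols 1..3 of A
    arma::vec c = A.col(2);              // a single column as a vector

    S.print("S:");
    c.print("c:");

    return 0;
}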

If you are running OS X then you can use the vDSP libraries for free.
If you want to deploy on Windows targets then either license the Intel equivalent (MKL) or, I think, the AMD vector math libraries (ACML) are free.

Related

C++ Eigen execution time difference

So I'm calculating a lot of statistical distances in my application, written in C++ (11/14). I use the Eigen library for linear algebra calculations. My code was initially compiled on macOS, specifically Big Sur. Since I need to make my results reproducible, I was trying to get the same results under another OS, particularly Fedora 32. However, there are significant differences in the results, which I cannot attribute to anything specific after trying various things.
So I made a sample code...
#include <iostream>
#include <chrono>
#include <cmath>   // for sqrtf
#include <Eigen/Core>
#include <Eigen/Dense>

using namespace std;
using namespace std::chrono;
using namespace Eigen;

int main()
{
    MatrixXd cov(2,2);
    cov << 1.5, 0.2, 0.2, 1.5;
    VectorXd mean(2), ne(2);
    mean << 10, 10;
    ne << 10.2, 10.2;

    auto start = high_resolution_clock::now();
    for (int i = 0; i < 2000000; i++) {
        MatrixXd icov = cov.inverse();
        VectorXd delta = ne - mean;
        double N0 = delta.transpose() * (icov * delta);
        double res = sqrtf(N0);
    }
    auto stop = high_resolution_clock::now();

    cout << "Mahalanobis calculations in "
         << duration_cast<milliseconds>(stop - start).count()
         << " ms." << endl;
    return 0;
}
which was compiled with
clang++ -std=c++14 -w -O2 -I'....Eigen/include' -DNDEBUG -m64 -o benchmark benchmark.cpp
on both macOS and Fedora 32. Yes, I downloaded and installed clang on Fedora, just to be sure I'm using the same compiler. On macOS I have clang version 12.0.0, and on Fedora 10.0.1!
The difference between these test cases is almost 2x:
macOS:
Mahalanobis calculations in 2833 ms.
Fedora:
Mahalanobis calculations in 1490 ms.
When it comes to my specific application, the difference is almost 30x, which is quite unusual. In the meantime I checked for the following:
OpenMP support - tried switching it on and off, at compile time and at run time (setting the number of threads before the test code chunk)
various compiler flags and architectures
adding OpenMP support to macOS
toggling the EIGEN_USE_BLAS, EIGEN_USE_LAPACKE, and EIGEN_DONT_PARALLELIZE flags
Nothing helped. Any ideas where the problem is?
Maybe something with memory management?
Finally, to answer the question for all those who encounter the same problem: the issue is memory management. As someone pointed out, there is a big difference between dynamically and statically allocated Eigen objects. So
MatrixXd cov(2,2);
tends to be much slower than
Matrix<double,2,2> cov;
since the first approach uses the heap to dynamically allocate the needed memory. At the end of the day, it all comes down to how the OS handles memory. It seems that Linux does it better than macOS or Windows (no surprises there, actually).
I know that it is not always possible to use Matrix2d over the good old MatrixXd. Some developers have even reported that Eigen matrix math tends to be slower than their own home-made simple solutions, but this comes down to the choice between doing everything yourself and taking an all-purpose linear algebra library. It depends on what you are doing...
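To illustrate the fix described above, a minimal sketch (untested; the function name is mine): with fixed-size Eigen types everything lives on the stack, so the hot loop performs no heap allocations.

#include <Eigen/Dense>
#include <cmath>

// Same Mahalanobis step as in the benchmark, but with fixed-size
// 2x2 / 2x1 types instead of MatrixXd/VectorXd: no heap traffic.
double mahalanobis_fixed(const Eigen::Matrix2d& cov,
                         const Eigen::Vector2d& mean,
                         const Eigen::Vector2d& ne)
{
    Eigen::Matrix2d icov = cov.inverse();   // stack-allocated temporary
    Eigen::Vector2d delta = ne - mean;
    return std::sqrt(delta.dot(icov * delta));
}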

fp16 support in cuda thrust

I am not able to find anything about fp16 support in the thrust CUDA template library.
Even the roadmap page has nothing about it:
https://github.com/thrust/thrust/wiki/Roadmap
But I assume somebody has probably figured out how to overcome this problem, since fp16 support has been in CUDA for more than six months.
As of today, I rely heavily on thrust in my code and have templated nearly every class I use in order to ease fp16 integration. Unfortunately, absolutely nothing works out of the box for the half type, not even this simple sample code:
// STL
#include <iostream>
#include <cstdlib>

// CUDA
#include <cuda_runtime_api.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cuda_fp16.h>

#define T half // works when float is used instead

int main(int argc, char* argv[])
{
    thrust::device_vector<T> a(10, 1.0f);
    float t = thrust::reduce(a.cbegin(), a.cend(), (float)0);
    std::cout << "test = " << t << std::endl;
    return EXIT_SUCCESS;
}
This code cannot compile because there seems to be no implicit conversion from float to half or from half to float. However, there are intrinsics in CUDA that allow for an explicit conversion.
Why can't I simply overload the half and float constructors in some header file in CUDA, to wrap the intrinsics like this:
// (Not valid C++: constructors cannot be added to built-in types and do
// not return values. Shown only to illustrate the idea.)
float::float(half a)
{
    return __half2float(a);
}

half::half(float a)
{
    return __float2half(a);
}
My question may seem basic, but I don't understand why I haven't found much documentation about it.
Thank you in advance
The very short answer is that what you are looking for doesn't exist.
The slightly longer answer is that thrust is intended to work on fundamental and POD types only, and the CUDA fp16 half is not a POD type. It might be possible to make two custom classes (one for the host and one for the device) which implement all the required object semantics and arithmetic operators to work correctly with thrust, but it would be a significant effort to do so (and it would require writing or adapting an existing FP16 host library).
Note also that the current FP16 support is only in device code and only on compute capability 5.3 and newer devices. So unless you have a Tegra TX1, you can't use the FP16 library in device code anyway.
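If all you need is something like the reduction in the sample above, one untested workaround sketch (my own, not part of thrust) is to do the half-to-float conversion explicitly on the device with the __half2float intrinsic the question mentions, inside a thrust::transform_reduce. This still presupposes a compute 5.3+ device and a device_vector<half> that you managed to populate by other means:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cuda_fp16.h>

// Device-side functor: widen each half to float via the CUDA intrinsic.
struct half_to_float
{
    __device__ float operator()(half h) const
    {
        return __half2float(h);
    }
};

// Sum a vector of halves by accumulating in float.
float sum_halves(const thrust::device_vector<half>& v)
{
    return thrust::transform_reduce(v.cbegin(), v.cend(),
                                    half_to_float(), 0.0f,
                                    thrust::plus<float>());
}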

Element-wise operations in C++

Is there a preexisting library that will let me create array-like objects which have the following properties:
Run-time size specification (chosen at instantiation, not grown or shrunk afterwards)
Operators overloaded to perform element-wise operations (i.e. c=a+b will result in a vector c with c[i]=a[i]+b[i] for all i, and similarly for *, -, /, etc.)
A good set of functions which act elementwise, for example x=sqrt(vec) will have elements x[i]=sqrt(vec[i])
Provide "summarising" functions such as sum(vec), mean(vec) etc
(Optional) Operations can be sent to a GPU for processing.
Basically something like the way arrays work in Fortran, with all of the implementation hidden. Currently I am using vector from the STL and manually overloading the operators, but I feel like this is probably a solved problem.
In the dusty corners of the standard library, long forgotten by everyone, sits a class called valarray. Look it up and see if it suits your needs.
From manual page at cppreference.com:
std::valarray is the class for representing and manipulating arrays of values. It supports element-wise mathematical operations and various forms of generalized subscript operators, slicing and indirect access.
A code snippet for illustration:
#include <valarray>
#include <algorithm>
#include <iterator>
#include <iostream>

int main()
{
    std::valarray<int> a { 1, 2, 3, 4, 5 };
    std::valarray<int> b = a;
    std::valarray<int> c = a + b;

    std::copy(begin(c), end(c),
              std::ostream_iterator<int>(std::cout, " "));
}
Output: 2 4 6 8 10
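valarray also covers the element-wise functions and summarising operations from the question; a quick sketch (untested, using the sqrt overload and sum() member that <valarray> provides):

#include <valarray>
#include <iostream>

int main()
{
    std::valarray<double> v { 1.0, 4.0, 9.0, 16.0 };

    std::valarray<double> r = std::sqrt(v);   // element-wise sqrt: 1 2 3 4
    std::cout << r.sum() / r.size() << '\n';  // "summarising": mean = 2.5
}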
You can use the Cilk Plus extensions (https://www.cilkplus.org/), which provide array notation for applying element-wise operations to arrays of the same shape in C/C++. They exploit the vector parallelism of your processor as well as coprocessors.
Example:
Standard C code:
for (i = 0; i < MAX; i++)
    c[i] = a[i] + b[i];
Cilk Plus - Array notation:
c[i:MAX]=a[i:MAX]+b[i:MAX];
Strided sections look like this:
float d[10] = {0,1,2,3,4,5,6,7,8,9};
float x[3];
x[:] = d[0:3:2]; //x contains 0,2,4 values
You can use reductions on array sections:
_sec_reduce_add(a[0:n]);
Interesting reading:
http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-array-notations
The Thrust library, which is part of the CUDA toolkit, provides an STL-like interface for vector operations on GPUs. It also has an OpenMP back end; however, the GPU support uses CUDA, so you are limited to NVIDIA GPUs. You will have to do your own wrapping (say, with expression templates) if you want expressions like c=a+b to work for vectors.
https://code.google.com/p/thrust/
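A small sketch of what that wrapping boils down to for a single element-wise operation (untested; the sizes and values are arbitrary examples):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

int main()
{
    thrust::device_vector<float> a(1000, 1.0f);
    thrust::device_vector<float> b(1000, 2.0f);
    thrust::device_vector<float> c(1000);

    // c[i] = a[i] + b[i], executed on the GPU
    thrust::transform(a.begin(), a.end(), b.begin(),
                      c.begin(), thrust::plus<float>());
}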
The ViennaCL library takes a more high-level approach, providing vector and matrix operations like you want. It has both CUDA and OpenCL back ends, so you can use GPUs (and multi-core CPUs) from different vendors.
http://viennacl.sourceforge.net/
The VexCL library also looks very promising (again with support for both OpenCL and CUDA):
https://github.com/ddemidov/vexcl

How to optimize matrix multiplication operation [duplicate]

This question already has answers here:
Optimized matrix multiplication in C
(14 answers)
Closed 4 years ago.
I need to perform a lot of matrix operations in my application. The most time consuming is matrix multiplication. I implemented it this way
template<typename T>
Matrix<T> Matrix<T>::operator * (Matrix& matrix)
{
    Matrix<T> multipliedMatrix = Matrix<T>(this->rows, matrix.GetColumns(), 0);

    for (int i = 0; i < this->rows; i++)
    {
        for (int j = 0; j < matrix.GetColumns(); j++)
        {
            multipliedMatrix.datavector.at(i).at(j) = 0;
            for (int k = 0; k < this->columns; k++)
            {
                multipliedMatrix.datavector.at(i).at(j) +=
                    datavector.at(i).at(k) * matrix.datavector.at(k).at(j);
            }
        }
    }
    return multipliedMatrix;
}
Is there any way to write it in a better way? So far, matrix multiplication takes most of the time in my application. Maybe there is a good/fast library for doing this kind of stuff?
However, I can't use libraries which use the graphics card for mathematical operations, because I work on a laptop with an integrated graphics card.
Eigen is by far one of the fastest, if not the fastest, linear algebra libraries out there. It is well written and of high quality. Also, it uses expression templates, which makes the code more readable. The just-released version 3 uses OpenMP for data parallelism.
#include <iostream>
#include <Eigen/Dense>

using Eigen::MatrixXd;

int main()
{
    MatrixXd m(2,2);
    m(0,0) = 3;
    m(1,0) = 2.5;
    m(0,1) = -1;
    m(1,1) = m(1,0) + m(0,1);
    std::cout << m << std::endl;
}
Boost uBLAS I think is definitely the way to go with this sort of thing. Boost is well designed, well tested and used in a lot of applications.
Consider GNU Scientific Library, or MV++
If you're okay with C, BLAS is a low-level library that provides both C and C-wrapped Fortran interfaces and is used by a huge number of higher-level math libraries.
I don't know anything about this, but another option might be Meschach which seems to have decent performance.
Edit: With respect to your comment about not wanting to use libraries that use your graphics card, I'll point out that in many cases, the libraries that use your graphics card are specialized implementations of standard (non-GPU) libraries. For example, various implementations of BLAS are listed on its Wikipedia page, and only some of them are designed to leverage your GPU.
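For reference, the BLAS routine that does this job is dgemm. A minimal sketch via the CBLAS interface (untested; assumes an implementation such as OpenBLAS is installed, linked with e.g. -lcblas):

#include <cblas.h>
#include <vector>

int main()
{
    const int n = 4;  // square matrices, for brevity
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major, leading dimension n
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
}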
There is a book called Introduction to Algorithms. You may like to check the chapter on dynamic programming; it has an excellent matrix-chain multiplication algorithm using dynamic programming. It's worth a read. Well, this info is in case you want to write your own logic instead of using a library.
There are plenty of algorithms for efficient matrix multiplication.
Algorithms for efficient matrix multiplication
Look at the algorithms and find an implementation.
You can also make a multi-threaded implementation for it.
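If you do roll your own, one cheap improvement over the naive triple loop in the question is reordering the loops so the innermost accesses are sequential in memory. A minimal sketch (my own, untested; assumes a flat row-major layout rather than the question's vector-of-vectors):

#include <vector>
#include <cstddef>

// C = A * B for n x n row-major matrices; C must be zero-initialized.
// The i-k-j order walks B and C contiguously in the innermost loop,
// which is much friendlier to the cache than the classic i-j-k order.
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
        {
            const double a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}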

Really big number

First of all, apologies if there is already a topic like this, but I have not found one... I need to know how to handle a really big number, such as the result of 789^2346:
#include <iostream>
#include <cmath>

using namespace std;

int main()
{
    cout << pow(789, 2346) << endl;  // overflows double and prints "inf"
}
You could try the GNU MP Bignum Library or ttmath. This link points to some samples. It is very easy to use.
You need a "big number" library. A popular choice is GNU's Multiple Precision Arithmetic Library, which has a C interface. It's also been around for a while. Another one, for C++, is the Big Integer Library.
I'm sure there is a list of bignum libraries on SO somewhere, but I cannot find it. There is a tag you could stroll through.
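A minimal sketch with GMP's C interface (untested; assumes libgmp is installed, compile with -lgmp):

#include <gmp.h>
#include <cstdio>

int main()
{
    mpz_t r;
    mpz_init(r);

    mpz_ui_pow_ui(r, 789, 2346);  // r = 789^2346, several thousand digits
    gmp_printf("%Zd\n", r);

    mpz_clear(r);
    return 0;
}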
You can consider NTL (Number Theory Library) for C++ - http://www.shoup.net/ntl/ . It's very easy to use.
If you can relax the C++ requirement, Perl and Python support big integers natively. PHP supports them via the bcmath or gmp extensions.