fp16 support in CUDA Thrust - C++

I am not able to find anything about fp16 support in the Thrust CUDA template library.
Even the roadmap page has nothing about it:
https://github.com/thrust/thrust/wiki/Roadmap
But I assume somebody has probably figured out how to overcome this problem, since fp16 support in CUDA has been around for more than 6 months.
As of today, I heavily rely on Thrust in my code and have templated nearly every class I use to ease fp16 integration. Unfortunately, absolutely nothing works out of the box for the half type, even this simple sample code:
// STL
#include <iostream>
#include <cstdlib>

// CUDA
#include <cuda_runtime_api.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cuda_fp16.h>

#define T half // works when float is used

int main(int argc, char* argv[])
{
    thrust::device_vector<T> a(10, 1.0f);
    float t = thrust::reduce(a.cbegin(), a.cend(), (float)0);
    std::cout << "test = " << t << std::endl;
    return EXIT_SUCCESS;
}
This code cannot compile, because it seems that there is no implicit conversion from float to half or from half to float. However, it seems that there are intrinsics in CUDA that allow for an explicit conversion.
Why can't I simply overload the half and float constructors in some header file in CUDA, adding the previous intrinsics like this:
float::float( half a )
{
    return __half2float( a ) ;
}

half::half( float a )
{
    return __float2half( a ) ;
}
My question may seem basic, but I don't understand why I haven't found much documentation about it.
Thank you in advance

The very short answer is that what you are looking for doesn't exist.
The slightly longer answer is that Thrust is intended to work on fundamental and POD types only, and the CUDA fp16 half is not a POD type. It might be possible to make two custom classes (one for the host and one for the device) which implement all the required object semantics and arithmetic operators to work correctly with Thrust, but it would not be an insignificant effort to do it (and it would require writing or adapting an existing fp16 host library).
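For illustration only, here is a minimal, untested sketch of what the device side of such a wrapper might look like (half_t is a hypothetical name, and this covers only the float conversions asked about, not the arithmetic operators or host-side support a real solution would need):

// Minimal sketch (assumption: device-only fp16 intrinsics, as in early CUDA releases).
#include <cuda_fp16.h>

struct half_t
{
    half value;

    __device__ half_t() {}
    __device__ half_t(float f) : value(__float2half(f)) {}             // float -> half
    __device__ operator float() const { return __half2float(value); }  // half -> float
};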
Note also that the current fp16 support is only in device code, and only on compute capability 5.3 and newer devices. So unless you have a Tegra TX1, you can't use the fp16 library in device code anyway.

Related

Comparing floating point numbers in VS C++ vs C++ Builder

I am working with some astronomy code originally compiled in Visual C++. I am compiling it in C++Builder XE4 on the 32bit VCL platform.
In this code, there are a lot of comparisons of very small numbers, all defined as double. The code snippet below shows the headers and some sample comparisons from the VC++ code. I need the results to be the same in VC++ and C++Builder, so I have some questions about comparing floating point numbers:
1. Does C++Builder compare floating point numbers the same as VC++?
2. In C++Builder, do I need to rewrite the code using the CompareValue(double, double) function?
3. Will I get the same result if I switch from #include <cmath> to using #include <math.h> and #include <math.hpp>?
Any suggestions for getting the same results in both compilers would be helpful.
#include "stdafx.h"
#include <cmath>
#include <cassert>
using namespace std;
...
else if ((fgamma > 0.9972) && (fgamma < (1.5433 + details.u)))
{
if ((fgamma > 0.9972) && (fgamma < (0.9972 + fabs(details.u))))
{
if (details.u < 0)
...
Short answer
1. No.
2. Depends on the compiler settings in both + the thread environment.
3. Yes, but see #2.
Long answer
Compiler settings
The most important compiler setting is the target instruction set. Depending on the setting, double-precision floating-point code can be compiled into legacy x87 instructions, into SSE2, or higher (SSE 4, AVX, etc.).
The funny thing is, some compilers with some settings compile into both: within the same program, they may use x87 for some things and SSE for others.
There are other relevant compiler switches, e.g. /fp in Visual C++.
Thread environment
For x87 code, the interesting part of the thread state is the x87 FPU control register. For Visual C++, see the _controlfp_s API.
The SSE components of the CPU use a similar mechanism, the MXCSR register.
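For example, a minimal sketch (untested, MSVC-specific, and only meaningful for 32-bit x87 code) of pinning the x87 precision control to 53 bits, so intermediates round to double precision the way SSE2 code does:

#include <float.h>   // _controlfp_s, _PC_53, _MCW_PC (MSVC-specific)

int main()
{
    // Force 53-bit (double) precision for x87 intermediate results. Note the
    // precision-control mask is only supported on 32-bit x86 targets.
    unsigned int control = 0;
    _controlfp_s(&control, _PC_53, _MCW_PC);

    // ... run the floating-point comparisons after setting the control word ...
    return 0;
}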

Vector class library for processing speed

I am looking at parallel processing algorithms for a processing speed improvement.
I want to test Agner Fog's vector class library, VCL.
I am wondering how to select between the different vector classes, for example Vec16c (SSE2 instruction set) and Vec32c (AVX instruction set).
I am using Intel® Atom™ x5-Z8350 Processor and according to the specs, it supports SSE4.2 instruction sets.
How can I effectively choose a vector class with regard to the hardware support? Say, for my processor, can I use Vec32c, which is recommended for the AVX instruction set?
You can use compiler defined macros to detect what instruction-sets are enabled for the target you're compiling for, such as:
// Assume SSE2 as a baseline
#include <vectori128.h>
#if defined(__AVX2__)
#include <vectori256.h>
using vector_type = Vec32c;
#else
// Vec16c uses whatever is enabled, so you don't have to check for SSE4 yourself
using vector_type = Vec16c;
#endif
This doesn't do run-time detection, so only enable AVX2 if you want to make a binary that only runs on CPUs with AVX2.
If you want your code to work on non-x86 platforms, or x86 without SSE2 where VCL isn't supported at all, you need to protect the #include <vectori128.h> with #if as well.
AVX is required for 32-byte vectors (and AVX2 for 32B integer vectors like Vec32c). Since your Atom doesn't have AVX, don't include Agner's vectori256.h or vectorf256.h, just the 128-bit headers.
Compile with -march=native to get the compiler to enable all the instruction-sets your host-CPU supports.
The implementations of the Vec16c functions will automatically use SSE4.2 intrinsics when they're enabled, because Vectorclass checks macros to see what's enabled. So just use Vec16c and you will automatically get the best implementations of every function that your target supports.
(This is true since you're doing compile-time CPU / target options. If you wanted to do run-time dispatching yourself, it would be harder.)
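As a rough usage sketch (assuming the vector_type alias from the snippet above is in scope; add_bytes is an illustrative name):

#include <cstdint>
#include <cstddef>

// Adds two byte arrays using whichever vector width was selected at compile time.
void add_bytes(const int8_t* a, const int8_t* b, int8_t* out, size_t n)
{
    const size_t step = vector_type::size();   // 16 for Vec16c, 32 for Vec32c
    vector_type va, vb;
    size_t i = 0;
    for (; i + step <= n; i += step)
    {
        va.load(a + i);               // VCL load/store accept unaligned pointers
        vb.load(b + i);
        (va + vb).store(out + i);
    }
    for (; i < n; ++i)                // scalar tail for the leftover elements
        out[i] = (int8_t)(a[i] + b[i]);
}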
The vector class library has since been updated and improved. It has moved to GitHub:
https://github.com/vectorclass

Using Boost.Units and Boost.Multiprecision

I am attempting to write a molecular dynamics program, and I thought that Boost.Units was a logical choice for the variables. I also decided that Boost.Multiprecision offered a better option than double or long double with respect to round-off errors. A combination of the two seems fairly straightforward until I attempt to use a constant; then it breaks down.
#include <iostream>

#include <boost/multiprecision/gmp.hpp>
#include <boost/units/io.hpp>
#include <boost/units/pow.hpp>
#include <boost/units/quantity.hpp>
#include <boost/units/systems/si.hpp>
#include <boost/units/systems/si/codata/physico-chemical_constants.hpp>

namespace units = boost::units;
namespace si = boost::units::si;
namespace mp = boost::multiprecision;

int main()
{
    units::quantity<si::mass, mp::mpf_float_50> mass = 1.0 * si::kilogram;
    units::quantity<si::temperature, mp::mpf_float_50> temperature = 300. * si::kelvin;
    auto k_B = si::constants::codata::k_B; // Boltzmann constant
    units::quantity<si::velocity, mp::mpf_float_50> velocity = units::root<2>(temperature * k_B / mass);
    std::cout << velocity << std::endl;
}
The output will be 1 M S^-1. If I use long double in lieu of mp::mpf_float_50, then the result is 2.87818e-11 m s^-1. I know that the problem lies within the conversion between the constant and the other data, because the constant defaults to a double. I have thought about creating my own Boltzmann constant, but I prefer to use the predefined value if possible.
My question, therefore, is how do I go about using Boost.Multiprecision when I have predefined constants from Boost.Units? If I must concede to using double or long double, then I will, but I suspect that a way exists to convert the constants or otherwise make use of them.
I am working with Mac OS X 10.7, Xcode 4.6.2, Clang 3.2, Boost 1.53.0 and the C++11 extensions.
I appreciate any help that can be offered.
I'd advise you not to use multiple precision arithmetic for molecular dynamics simulations because the time-step integration will be painfully slow. If the goal is to preserve total energy as much as possible, then just use Verlet or any other symplectic integrator. Multiple precision arithmetic (or long double, or compensated summation with plain double) may be useful for aggregating ensemble averages, though.
Besides, if you write your simulation code using dimensionless (reduced) units you will also get rid of the dependency on Boost.Units.
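To make the suggestion concrete, here is a minimal sketch of one velocity Verlet step in reduced units (all names are illustrative; particle masses are taken as 1):

#include <vector>
#include <cstddef>

// One velocity Verlet (kick-drift-kick) step; compute_forces fills f from x.
void verlet_step(std::vector<double>& x, std::vector<double>& v,
                 std::vector<double>& f, double dt,
                 void (*compute_forces)(const std::vector<double>&,
                                        std::vector<double>&))
{
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        v[i] += 0.5 * dt * f[i];   // half kick (mass = 1 in reduced units)
        x[i] += dt * v[i];         // drift
    }
    compute_forces(x, f);          // forces at the new positions
    for (std::size_t i = 0; i < x.size(); ++i)
        v[i] += 0.5 * dt * f[i];   // second half kick
}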

Fortran-style multidimensional arrays in C++

Is there a C++ library which provides Fortran-style multidimensional arrays with support for slicing, passing as a procedural parameter, and decent documentation? I've looked into Blitz++, but it's dead!
I highly suggest Armadillo:
Armadillo is a C++ linear algebra library (matrix maths) aiming towards a good balance between speed and ease of use
It is a C++ template library:
A delayed evaluation approach is employed (at compile-time) to combine several operations into one and reduce (or eliminate) the need for temporaries; this is automatically accomplished through template meta-programming
A simple example from the web page:
#include <iostream>
#include <armadillo>

int main(int argc, char** argv)
{
    arma::mat A = arma::randu<arma::mat>(4,5);
    arma::mat B = arma::randu<arma::mat>(4,5);
    std::cout << A*B.t() << std::endl;
    return 0;
}
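Since the question specifically asks about slicing, here is a short sketch of Armadillo's submatrix views (drawn from its documented API; untested):

#include <armadillo>

int main()
{
    arma::mat A = arma::randu<arma::mat>(6, 6);

    arma::vec col2  = A.col(2);                               // a single column
    arma::mat block = A(arma::span(1, 3), arma::span(0, 2));  // like A(2:4, 1:3) in Fortran
    A.row(0) += 1.0;                                          // views write through to A

    return 0;
}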
If you are running OSX, then you can use the vDSP libs for free.
If you want to deploy on Windows targets, then either license the Intel equivalents (MKL) or, I think, the AMD vector math libs (ACML) are free.

How to optimize matrix multiplication operation [duplicate]

This question already has answers here:
Optimized matrix multiplication in C
I need to perform a lot of matrix operations in my application. The most time-consuming is matrix multiplication. I implemented it this way:
template<typename T>
Matrix<T> Matrix<T>::operator * (Matrix& matrix)
{
    Matrix<T> multipliedMatrix = Matrix<T>(this->rows, matrix.GetColumns(), 0);
    for (int i = 0; i < this->rows; i++)
    {
        for (int j = 0; j < matrix.GetColumns(); j++)
        {
            multipliedMatrix.datavector.at(i).at(j) = 0;
            for (int k = 0; k < this->columns; k++)
            {
                multipliedMatrix.datavector.at(i).at(j) +=
                    datavector.at(i).at(k) * matrix.datavector.at(k).at(j);
            }
        }
    }
    return multipliedMatrix;
}
Is there any way to write it in a better way? So far, matrix multiplication operations take most of the time in my application. Maybe there is a good/fast library for doing this kind of stuff?
However, I can't really use libraries which use the graphics card for mathematical operations, because I work on a laptop with an integrated graphics card.
Eigen is by far one of the fastest, if not the fastest, linear algebra libraries out there. It is well written and of high quality. It also uses expression templates, which makes for more readable code. Version 3, just released, uses OpenMP for data parallelism.
#include <iostream>
#include <Eigen/Dense>

using Eigen::MatrixXd;

int main()
{
    MatrixXd m(2,2);
    m(0,0) = 3;
    m(1,0) = 2.5;
    m(0,1) = -1;
    m(1,1) = m(1,0) + m(0,1);
    std::cout << m << std::endl;
}
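Since the question is about matrix multiplication specifically, here is a minimal sketch of replacing the hand-written triple loop with Eigen's optimized product (the sizes are arbitrary):

#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(256, 256);
    Eigen::MatrixXd b = Eigen::MatrixXd::Random(256, 256);
    Eigen::MatrixXd c = a * b;   // dispatches to Eigen's vectorized kernels
    std::cout << c(0, 0) << std::endl;
    return 0;
}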
I think Boost uBLAS is definitely the way to go with this sort of thing. Boost is well designed, well tested, and used in a lot of applications.
Consider GNU Scientific Library, or MV++
If you're okay with C, BLAS is a low-level library that incorporates both C and C-wrapped FORTRAN routines, and it is used by a huge number of higher-level math libraries.
I don't know anything about this, but another option might be Meschach which seems to have decent performance.
Edit: With respect to your comment about not wanting to use libraries that use your graphics card, I'll point out that in many cases, the libraries that use your graphics card are specialized implementations of standard (non-GPU) libraries. For example, various implementations of BLAS are listed on its Wikipedia page, and only some of them are designed to leverage your GPU.
There is a book called Introduction to Algorithms. You may like to check the chapter on Dynamic Programming. It has an excellent matrix multiplication algorithm using dynamic programming. It's worth a read. Well, this info was in case you want to write your own logic instead of using a library.
There are plenty of algorithms for efficient matrix multiplication:
Algorithms for efficient matrix multiplication
Look at the algorithms and find an implementation. You can also make a multi-threaded implementation; a rough sketch follows.
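A rough sketch of such a multi-threaded version, assuming flat row-major std::vector<double> storage (all names are illustrative):

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Computes c += a * b for the row block [row_begin, row_end); the k-before-j
// loop order keeps the inner loop sequential in memory.
void matmul_rows(const std::vector<double>& a, const std::vector<double>& b,
                 std::vector<double>& c, int n, int row_begin, int row_end)
{
    for (int i = row_begin; i < row_end; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Splits the rows of the result across hardware threads; threads write to
// disjoint row blocks, and c must start zero-initialized (std::vector does).
void matmul_threaded(const std::vector<double>& a, const std::vector<double>& b,
                     std::vector<double>& c, int n)
{
    unsigned int num_threads = std::thread::hardware_concurrency();
    if (num_threads == 0) num_threads = 2;
    int rows_per_thread = (n + (int)num_threads - 1) / (int)num_threads;

    std::vector<std::thread> workers;
    for (int begin = 0; begin < n; begin += rows_per_thread)
    {
        int end = std::min(begin + rows_per_thread, n);
        workers.emplace_back(matmul_rows, std::cref(a), std::cref(b),
                             std::ref(c), n, begin, end);
    }
    for (std::thread& t : workers)
        t.join();
}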