I'm trying to compare the performance of FMA (fma() in math.h) against naive multiplication and addition in floating-point computation. The test is simple: I repeat the same calculation for a large number of iterations. There are three things I have to achieve for a precise measurement.
No other computation should be included in the timed region.
The naive multiplication and addition should not be optimized into an FMA.
The iteration should not be optimized away, i.e. the loop should execute exactly as many times as I intended.
To achieve the above, I did the following:
The functions are inline and contain only the required computation.
Used the g++ -O0 option so the multiplication is not optimized. (But when I look into the dump file, it seems to generate almost the same code for both.)
Used volatile.
But the results show almost no difference, or fma() is even slower than naive multiplication and addition. Is this the result I should expect (i.e. they are not really different in terms of speed), or am I doing something wrong?
Spec
Ubuntu 14.04.2
G++ 4.8.2
Intel(R) Core(TM) i7-4770 (3.4GHz, 8MB L3 cache)
My Code
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
using namespace chrono;
inline double rand_gen() {
    return static_cast<double>(rand()) / RAND_MAX;
}

volatile double a, b, c;

inline void pure_fma_func() {
    fma(a, b, c);
}

inline void non_fma_func() {
    a * b + c;
}

int main() {
    int n = 100000000;
    a = rand_gen();
    b = rand_gen();
    c = rand_gen();

    auto t1 = system_clock::now();
    for (int i = 0; i < n; i++) {
        non_fma_func();
    }
    auto t2 = system_clock::now();
    for (int i = 0; i < n; i++) {
        pure_fma_func();
    }
    auto t3 = system_clock::now();

    cout << "non fma" << endl;
    cout << duration_cast<microseconds>(t2 - t1).count() / 1000.0 << "ms" << endl;
    cout << "fma" << endl;
    cout << duration_cast<microseconds>(t3 - t2).count() / 1000.0 << "ms" << endl;
}
Yes, you are doing something completely wrong. At least two somethings. But let's keep it simple.
Used the g++ -O0 option so the multiplication is not optimized
This renders your whole result completely irrelevant. Fun fact: the cost of the function call is probably greater than the cost of the calculation in either case.
Fundamentally, the results of benchmarks compiled without optimizations are completely meaningless. You can't just turn optimizations off and hope for the best; they absolutely must be enabled.
Secondly, FMA versus a regular multiply-and-add is a complex situation: there are matters like latency versus throughput where multiply-and-add can come out the winner.
In short, your benchmark is not a benchmark at all; it's just a bunch of random instructions that produce meaningless junk.
If you want an accurate benchmark, you must accurately reproduce the actual usage circumstances, entirely: the surrounding code, compiler optimizations, the whole shebang.
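For illustration only (this is a sketch, not a drop-in fix; the names madd_loop/fma_loop and the constants are made up here): a more meaningful comparison would be compiled with optimizations enabled (e.g. g++ -O2 -mfma), would feed each result into the next iteration so the loop cannot be deleted, and would print the final value so the work stays observable.

#include <chrono>
#include <cmath>
#include <cstdio>

// Each iteration depends on the previous result, so the compiler cannot
// remove the loop or collapse the dependent chain of operations.
static double madd_loop(double a, double b, double c, long n) {
    double acc = c;
    for (long i = 0; i < n; ++i)
        acc = a * acc + b;          // plain multiply then add
    return acc;
}

static double fma_loop(double a, double b, double c, long n) {
    double acc = c;
    for (long i = 0; i < n; ++i)
        acc = std::fma(a, acc, b);  // fused multiply-add
    return acc;
}

int main() {
    const long n = 100000000;
    const double a = 1.0000000001, b = 1e-9, c = 0.5;

    auto t0 = std::chrono::steady_clock::now();
    double r1 = madd_loop(a, b, c, n);
    auto t1 = std::chrono::steady_clock::now();
    double r2 = fma_loop(a, b, c, n);
    auto t2 = std::chrono::steady_clock::now();

    // Printing the results keeps both computations observable.
    std::printf("madd: %g (%lld us)\n", r1, (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    std::printf("fma:  %g (%lld us)\n", r2, (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
}

Note that a dependent chain like this measures latency rather than throughput, which is exactly the latency-versus-throughput distinction mentioned above, and that without -ffp-contract=off the compiler is allowed to fuse a * acc + b into an FMA on its own, so check the generated assembly.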
Related
I'm trying to make a simple benchmarking algorithm to compare different operations. Before I moved on to the actual functions I wanted to check a trivial case with a well-documented outcome: multiplication vs. division.
Division should lose by a fair margin according to the literature I have read. When I compiled and ran the algorithm, the times were just about 0. I added an accumulator that is printed to make sure the operations are actually carried out and tried again. Then I changed the loop, the numbers, shuffled the data, and more, all in order to prevent anything that could cause "divide" to do anything but floating-point division. To no avail: the times are still basically equal.
At this point I don't see where it could weasel its way out of the floating-point divide, and I give up. It wins. But I am really curious why the times are so close, what caveats/bugs I missed, and how to fix them.
(I know filling the vector with random data and then shuffling is redundant, but I wanted to make sure the data was accessed and not just initialized before the loop.)
("String compares are evil", I am aware. If it is the cause of the equal times, I will gladly join the witch hunt. If not, please don't mention it.)
compile:
g++ -std=c++14 main.cc
tests:
./a.out multiply
2.42202e+09
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.218529
Average length of function : 2.18529e-07 seconds
./a.out divide
2.56147e+06
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.242061
Average length of function : 2.42061e-07 seconds
the code:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <random>
#include <sys/time.h>
#include <sys/resource.h>
double get_time()
{
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec * 1e-6;
}

double multiply(double lhs, double rhs){
    return lhs * rhs;
}

double divide(double lhs, double rhs){
    return lhs / rhs;
}

int main(int argc, char *argv[]){
    if (argc == 1)
        return 0;

    double grounder = 0; // prevent optimizations

    std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(1.0, 100.0);

    size_t loop1 = argc > 2 ? std::stoi(argv[2]) : 1000;
    size_t loop2 = argc > 3 ? std::stoi(argv[3]) : 1000;

    std::vector<size_t> vecL1(loop1);
    std::generate(vecL1.begin(), vecL1.end(), [generator, distribution] () mutable { return distribution(generator); });
    std::vector<size_t> vecL2(loop2);
    std::generate(vecL2.begin(), vecL2.end(), [generator, distribution] () mutable { return distribution(generator); });

    double (*fp)(double, double);
    std::string function(argv[1]);
    if (function == "multiply")
        fp = (*multiply);
    if (function == "divide")
        fp = (*divide);

    std::random_shuffle(vecL1.begin(), vecL1.end());
    std::random_shuffle(vecL2.begin(), vecL2.end());

    double t1 = get_time();
    for (auto outer = vecL1.begin(); outer != vecL1.end(); outer++)
        for (auto inner = vecL2.begin(); inner != vecL2.end(); inner++)
            grounder += (*fp)(*inner, *outer);
    double t2 = get_time();

    std::cout << grounder << '\n';
    std::cout << (loop1 * loop2) << '\n';
    std::cout << "t1 = " << t1 << "\tt2 = " << t2
              << "\ndifference = " << (t2 - t1) << '\n';
    std::cout << "Average length of function : " << (t2 - t1) * 1 / (loop1 * loop2) << " seconds \n";
    return 0;
}
You aren't just measuring the speed of multiplication/division. If you put your code into https://godbolt.org/ you can see the assembly generated.
You are measuring the speed of calling a function and then doing a multiply/divide inside that function. The time taken by the single multiply/divide instruction is tiny compared to the cost of the function call, so it gets lost in the noise. If you move the loop inside your function you'll probably see more of a difference. Note that with the loop inside the function, your compiler may decide to vectorise the code; that will still show whether there is a difference between multiply and divide, but it won't be measuring the difference for a single mul/div instruction.
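For illustration (a sketch only; multiply_all/divide_all are names invented here, not part of the question's code), moving the loop inside the function might look like this:

#include <cstddef>
#include <vector>

// The loop over the data lives inside the function, so the call overhead
// is paid once per vector rather than once per element.
double multiply_all(const std::vector<double>& xs, double factor) {
    double acc = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i)
        acc += xs[i] * factor;
    return acc;
}

double divide_all(const std::vector<double>& xs, double divisor) {
    double acc = 0.0;
    for (std::size_t i = 0; i < xs.size(); ++i)
        acc += xs[i] / divisor;
    return acc;
}

Timing each of these over the same data (and using the returned accumulator, e.g. by printing it) removes most of the per-call noise; as noted above, the compiler may vectorise the loops, so the result then reflects vectorised throughput rather than the latency of one instruction.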
I am playing around with Eigen, doing some calculations with matrices and logs/exp, but I find the expressions I end up with a bit clumsy (and possibly slower?). Is there a better way to write calculations like this?
MatrixXd m = MatrixXd::Random(3,3);
m = m * (m.array().log()).matrix();
That is, not having to convert to arrays, then back to a matrix ?
If you are mixing array and matrix operations you can't really avoid the conversions, except for the few functions that have a cwise variant which works directly on matrices (e.g., cwiseSqrt(), cwiseAbs()).
However, neither .array() nor .matrix() will have an impact on runtime when compiled with optimization (on any reasonable compiler).
If you consider that more readable, you can work with unaryExpr().
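For instance (a sketch, assuming the same m as above and that <cmath> is included), the unaryExpr() form could look like:

// Element-wise log via unaryExpr, staying in the matrix world.
MatrixXd result = m * m.unaryExpr([](double v) { return std::log(v); });

The lambda is applied to every coefficient of m, so there is no explicit .array()/.matrix() round trip.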
I agree fully with chtz's answer, and reiterate that there is no runtime cost to the "casts." You can confirm using the following toy program:
#include "Eigen/Core"
#include <iostream>
#include <chrono>
using namespace Eigen;
int main()
{
typedef MatrixXd matType;
//typedef MatrixXf matType;
volatile int vN = 1024 * 4;
int N = vN;
auto startAlloc = std::chrono::system_clock::now();
matType m = matType::Random(N, N).array().abs();
matType r1 = matType::Zero(N, N);
matType r2 = matType::Zero(N, N);
auto finishAlloc = std::chrono::system_clock::now();
r1 = m * (m.array().log()).matrix();
auto finishLog = std::chrono::system_clock::now();
r2 = m * m.unaryExpr<float(*)(float)>(&std::log);
auto finishUnary = std::chrono::system_clock::now();
std::cout << (r1 - r2).array().abs().maxCoeff() << '\n';
std::cout << "Allocation\t" << std::chrono::duration<double>(finishAlloc - startAlloc).count() << '\n';
std::cout << "Log\t\t" << std::chrono::duration<double>(finishLog - finishAlloc).count() << '\n';
std::cout << "unaryExpr\t" << std::chrono::duration<double>(finishUnary - finishLog).count() << '\n';
return 0;
}
On my computer, there is a slight advantage (~4%) to the first form, which probably has to do with how the memory is loaded (unchecked). Beyond that, the reason for "casting" the type is to remove any ambiguity. For a clear example, consider operator *: in the matrix form it means matrix multiplication, whereas in the array form it means coefficient-wise multiplication. The ambiguity in the case of exp and log is between the matrix exponential/logarithm and their element-wise counterparts. Presumably you want the element-wise exp and log, and therefore the cast is necessary.
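A small illustration of that ambiguity (just a sketch with two random 3×3 matrices):

MatrixXd a = MatrixXd::Random(3, 3);
MatrixXd b = MatrixXd::Random(3, 3);
MatrixXd mat_prod   = a * b;                             // matrix multiplication
MatrixXd cwise_prod = (a.array() * b.array()).matrix();  // coefficient-wise multiplication

The same * symbol means two different things depending on whether the operands are viewed as matrices or as arrays, which is exactly why the explicit conversion exists.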
This question spawned from a separate question, which turned out to have some apparently machine specific quirks. When I run the C++ code listed below for recording the timing differences between tanh and exp, I see the following result:
tanh: 5.22203
exp: 14.9393
tanh runs ~3x as fast as exp. This is somewhat surprising given the mathematical definition of tanh (and I am ignorant of how it is actually implemented algorithmically).
What's more, this happens on my laptop (Ubuntu 16.04, Intel Core i7-3517U CPU @ 1.90GHz × 4), but does not occur on my desktop (same OS, not sure about the CPU specs right now).
I compiled the code below with g++. The above times were with no compiler optimization, although the trend remains if I use -On for each n. I also fiddled with a and b values to see if the range of values being evaluated was having an effect. This doesn't seem to matter.
What would cause tanh to be faster than exp on different machines?
#include <iostream>
#include <cmath>
#include <ctime>
using namespace std;
int main() {
    double a = -5;
    double b = 5;
    int N = 10001;
    double x[10001];
    double y[10001];
    double h = (b - a) / (N - 1);

    clock_t begin, end;

    for(int i = 0; i < N; i++)
        x[i] = a + i*h;

    begin = clock();
    for(int i = 0; i < N; i++)
        for(int j = 0; j < N; j++)
            y[i] = tanh(x[i]);
    end = clock();
    cout << "tanh: " << double(end - begin) / CLOCKS_PER_SEC << "\n";

    begin = clock();
    for(int i = 0; i < N; i++)
        for(int j = 0; j < N; j++)
            y[i] = exp(x[i]);
    end = clock();
    cout << "exp: " << double(end - begin) / CLOCKS_PER_SEC << "\n";

    return 0;
}
edit: some assembly output
This is the output I get when I compile the simplified code below with g++ -g -O -Wa,-aslh nothing2.cpp > stuff.txt.
#include <cmath>
int main() {
    double x = 0.0;
    double y, z;
    y = tanh(x);
    z = exp(x);
    return 0;
}
edit: another update
Assume nothing2.cpp contains the simplified code in the previous edit. I run:
g++ -o nothing2.so -shared -fPIC nothing2.cpp
objdump -d nothing2.so > stuff.txt
Here is the contents of stuff.txt
There are various possible explanations, and the one applicable in your case depends on which platform you're using and exactly which math library is in use. But one possible explanation is:
First of all, the calculation of tanh does not rely on the textbook definition of tanh. Instead it is expressed in terms of exp(-2*x) or expm1(2*x), which means only one exponential has to be calculated, and that is probably the heavy operation (in addition there's a division and some additions).
Second, and this may be the trick: for largish values of x this reduces to (exp(2*x)-1)/(exp(2*x)+1) = 1 - 2/(expm1(2*x)+2). The advantage here is that since the second term is smallish, it doesn't have to be calculated to the same relative accuracy to get the same final accuracy. This translates into not needing the full accuracy of expm1 that one would need in general.
Also, for smallish values of x there's a similar trick in rewriting it as (1-exp(-2*x))/(1+exp(-2*x)) = -expm1(-2*x)/(expm1(-2*x)+2). Here the denominator stays close to 2, so it doesn't need to be computed to full relative accuracy either; the accuracy requirement falls mostly on the expm1(-2*x) factor itself.
In addition, there are optimizations available for large values of x that aren't possible for exp, for basically the same reason. With large x the correction term 2/(expm1(2*x)+2) becomes so small that it can simply be discarded, while for exp the exponential still has to be calculated (and this is even the case for large negative x). For these values tanh can immediately be decided to be 1, while exp must actually be evaluated.
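As a toy sketch of the kind of formulation described above (not the actual libm implementation; the cutoff of 20 is an arbitrary value chosen for illustration):

#include <cmath>

// tanh expressed via a single expm1 call, with an early-out for large |x|
// where the result saturates to +/-1 without computing any exponential.
double tanh_sketch(double x) {
    if (x > 20.0)  return  1.0;
    if (x < -20.0) return -1.0;
    double u = std::expm1(-2.0 * std::fabs(x));  // u is in (-1, 0]
    double t = -u / (u + 2.0);                   // tanh(|x|)
    return x < 0 ? -t : t;
}

The point is that only one call to expm1 is needed, and for large arguments even that can be skipped, whereas exp always has to be evaluated.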
I am trying to calculate the combination C(40, 20) in C++; however, the data types in C++ seem unable to handle this calculation correctly, even though I have used the long long data type. The following is my code:
#include <iostream>
long long fac(int x) {
    register long long i, f = 1; // Optimize with regFunction
    for(i = 1; i <= x; i++)
        f *= i;
    std::cout << f << std::endl;
    return f;
}

// C(n,r) = n!/r!(n-r)!
long long C(long long n, long long r) {
    return fac(n) / (fac(r) * fac(n - r));
}

int main(int argc, char const *argv[]) {
    std::cout << C(40, 20) << std::endl;
    return 0;
}
Any idea to solve this problem?
Compute C at once by executing division immediately after multiplication:
long long C(long long n, long long r)
{
    long long f = 1; // Optimize with regFunction
    for (auto i = 0; i < r; i++)
        f = (f * (n - i)) / (i + 1);
    return f;
}
The result should be exact (each division leaves no remainder, until the final result itself overflows), because after processing i the value of f is C(n, i + 1), which is always an integer: f * (n - i) = C(n, i) * (n - i) = C(n, i + 1) * (i + 1), so the division by (i + 1) is exact. For example, with n = 6 and r = 3 the loop computes (1 × 6)/1 = 6, then (6 × 5)/2 = 15, then (15 × 4)/3 = 20 = C(6, 3).
Your numbers are growing too large, and that is a common problem in this kind of calculation; I am afraid there is no straightforward solution. Even if you manage to reduce the number of multiplications a bit, you will probably still end up with an overflow in long long.
You might want to check those out:
https://mattmccutchen.net/bigint/
https://gmplib.org/
I know there are different algorithmic approaches to this matter. I remember there were some solutions that use strings to store integer representations and so on, but as @Konrad mentioned this might be a poor approach to the matter.
The problem is that factorials get big very quickly. 40! is too large to be stored in a long long. Luckily, you don't actually need to compute this number here, since you can reduce the fraction in the calculation of C(n, r) before computing it. This yields the equation (from Wikipedia): C(n, k) = (n × (n − 1) × ⋯ × (n − k + 1)) / (k × (k − 1) × ⋯ × 1).
This works much better since k! (r! in your code) is a much smaller number than n!. However, at some point it will also break down.
Alternatively, you can also use the recurrence definition by implementing a recursive algorithm. However, this will be very inefficient (exponential running time) unless you memoise intermediate results.
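For instance, a minimal sketch of the memoised recurrence C(n, r) = C(n - 1, r - 1) + C(n - 1, r) (the intermediate values are themselves binomial coefficients, so nothing larger than the final answer is ever stored, though long long still limits how far this scales):

#include <iostream>
#include <vector>

// Pascal's rule with memoisation; memo[n][r] == 0 means "not computed yet".
long long C(int n, int r, std::vector<std::vector<long long>>& memo) {
    if (r == 0 || r == n) return 1;
    long long& cached = memo[n][r];
    if (cached != 0) return cached;
    return cached = C(n - 1, r - 1, memo) + C(n - 1, r, memo);
}

int main() {
    int n = 40, r = 20;
    std::vector<std::vector<long long>> memo(n + 1, std::vector<long long>(n + 1, 0));
    std::cout << C(n, r, memo) << std::endl;  // prints 137846528820
}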
A lazy way out would be to use a library that supports multiple precision, for example GNU GMP.
Once you have installed it correctly (available from the repositories on most Linux distributions), it comes down to:
adding #include <gmpxx.h> to your source file
replacing long long with mpz_class
compiling with -lgmpxx -lgmp
The source:
#include <iostream>
#include <gmpxx.h>
mpz_class fac(mpz_class x) {
    int i;
    mpz_class f(1); // Optimize with regFunction
    for(i = 1; i <= x; i++)
        f *= i;
    std::cout << f << std::endl;
    return f;
}

// C(n,r) = n!/r!(n-r)!
mpz_class C(mpz_class n, mpz_class r) {
    return fac(n) / (fac(r) * fac(n - r));
}

int main(int argc, char const *argv[]) {
    std::cout << C(40, 20) << std::endl;
    return 0;
}
Compiling and running:
$ g++ comb.cpp -lgmpxx -lgmp -o comb
$ ./comb
2432902008176640000
2432902008176640000
815915283247897734345611269596115894272000000000
137846528820
If you want to be thorough, you can do a lot more, but this will get you answers.
Even if you used a 64-bit unsigned type (uint64_t, a.k.a. unsigned long long here), the maximum value is 18446744073709551615, whereas 40! is 815915283247897734345611269596115894272000000000, which is a bit bigger.
I recommend you use GMP for this kind of maths.
I am at the moment trying to code a titration curve simulator. But I am running into some trouble with comparing two values.
I have created a small working example that perfectly replicates the bug that I encounter:
#include <iostream>
#include <math.h>
using namespace std;
int main()
{
    double a, b;
    a = 5;
    b = 0;

    for(double i = 0; i <= (2*a); i += 0.1){
        b = i;
        cout << "a=" << a << "; b=" << b;
        if(a == b)
            cout << "Equal!" << endl;
        else
            cout << endl;
    }
    return 0;
}
The output at the relevant section is
a=5; b=5
However, if I change the iteration increment from i+=0.1 to i+=1 or i+=0.5 I get an output of
a=5; b=5Equal!
as you would expect.
I am compiling with g++ on Linux with no further flags, and I am frankly at a loss as to how to solve this problem. Any pointers (or even a full-blown solution to my problem) are much appreciated.
Unlike integers, multiplying and adding floats/doubles does not generally produce exact results, so two values you expect to be equal may differ in the last bits.
The best practice is therefore to check whether the absolute value of their difference is small enough.
If you have some idea on the size of the numbers, you can use a constant:
if (fabs(a - b) < EPS) // equal
If you don't (much slower!):
float a1 = fabs(a), b1 = fabs(b);
float mn = min(a1,b1), mx = max(a1,b1);
if (mn / mx > (1- EPS)) // equal
Note:
In your code, you can use std::abs instead. Same for std::min/max. The code is clearer/shorter when using the C++ functions.
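A common way to combine the absolute and relative checks into one helper (a sketch; the tolerance values are arbitrary and should be tuned to your data):

#include <algorithm>
#include <cmath>

// True when a and b differ by less than an absolute tolerance (useful near
// zero) or by less than a relative tolerance (useful for large magnitudes).
bool nearly_equal(double a, double b,
                  double abs_eps = 1e-12, double rel_eps = 1e-9) {
    double diff = std::fabs(a - b);
    if (diff < abs_eps) return true;
    return diff < rel_eps * std::max(std::fabs(a), std::fabs(b));
}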
I would recommend restructuring your loop to iterate using integers and then converting the integers into doubles, like this:
double step = 0.1;
for(int i = 0; i*step <= 2*a; ++i){
    b = i*step;
    cout << "a=" << a << "; b=" << b;
    if(a == b)
        cout << "Equal!" << endl;
    else
        cout << endl;
}
This still isn't perfect. You possibly have some loss of precision in the multiplication; however, the floating point errors don't accumulate like they do when iterating using floating point values.
Floating point arithmetic is... interesting. Testing floats/doubles for equality is annoying in most languages because many numbers cannot be represented exactly in IEEE floating-point math. Basically, where you expect an expression to come out as 5.0, it may actually be computed as 4.9999999..., because that is the closest representable value.
Because these numbers are slightly different, you end up with an inequality. Since it's unmaintainable to try to predict which of the two values you will actually get, you can't/shouldn't attempt to hard-code either of them into your source to test equality against. As a hard rule, avoid directly checking floating-point numbers for equality.
Instead, test that they are extremely close to being equal with something like the following:
template<typename T>
bool floatEqual(const T& a, const T& b) {
    auto delta = std::abs(a) * T(0.03);  // std::abs (from <cmath>) uses the magnitude, so negative a works too
    auto minAccepted = a - delta;
    auto maxAccepted = a + delta;
    return b > minAccepted && b < maxAccepted;
}
This checks whether b is within roughly plus or minus 3% of the magnitude of a.