This question spawned from a separate question, which turned out to have some apparently machine-specific quirks. When I run the C++ code listed below to record the timing difference between tanh and exp, I see the following result:
tanh: 5.22203
exp: 14.9393
tanh runs ~3x as fast as exp. This is somewhat surprising given the mathematical definition of tanh (though I am ignorant of how it is actually implemented).
What's more is that this happens on my laptop (Ubuntu 16.04, Intel Core i7-3517U CPU @ 1.90GHz × 4), but does not occur on my desktop (same OS, not sure about CPU specs right now).
I compiled the code below with g++. The above times were with no compiler optimization, although the trend remains if I use -On for each n. I also fiddled with a and b values to see if the range of values being evaluated was having an effect. This doesn't seem to matter.
What would cause tanh to be faster than exp on different machines?
#include <iostream>
#include <cmath>
#include <ctime>
using namespace std;
int main() {
    double a = -5;
    double b = 5;
    int N = 10001;
    double x[10001];
    double y[10001];
    double h = (b - a) / (N - 1);

    clock_t begin, end;

    for (int i = 0; i < N; i++)
        x[i] = a + i * h;

    begin = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] = tanh(x[i]);
    end = clock();
    cout << "tanh: " << double(end - begin) / CLOCKS_PER_SEC << "\n";

    begin = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] = exp(x[i]);
    end = clock();
    cout << "exp: " << double(end - begin) / CLOCKS_PER_SEC << "\n";

    return 0;
}
edit: some assembly output
This is the output I get when I compile the simplified code below with g++ -g -O -Wa,-aslh nothing2.cpp > stuff.txt.
#include <cmath>
int main() {
    double x = 0.0;
    double y, z;
    y = tanh(x);
    z = exp(x);
    return 0;
}
edit: another update
Assume nothing2.cpp contains the simplified code in the previous edit. I run:
g++ -o nothing2.so -shared -fPIC nothing2.cpp
objdump -d nothing2.so > stuff.txt
Here are the contents of stuff.txt
There are various possible explanations, and the one that applies in your case depends on which platform you're using and exactly which math library is in use. But one possible explanation is:

First of all, the calculation of tanh does not rely on the textbook definition of tanh; instead it is expressed in terms of exp(-2*x) or expm1(2*x), which means only one exponential has to be calculated, and that is probably the heavy operation (in addition there's a division and some additions).

Second, and this may be the trick: for largish values of x this reduces to (exp(2*x)-1)/(exp(2*x)+1) = 1 - 2/(expm1(2*x)+2). The advantage here is that since the second term is smallish, it doesn't have to be calculated to the same relative accuracy to get the same final accuracy. This means that expm1 doesn't have to be computed to the full relative accuracy that would be needed in general.

There's a similar trick for x of largish magnitude but negative sign (so that exp(-2*x) is large): rewrite tanh as (1-exp(-2*x))/(1+exp(-2*x)) = -1 + 2/(expm1(-2*x)+2), where again the second term doesn't need full relative accuracy. In practice one doesn't even have to calculate it in that form; the equivalent expression -expm1(-2*x)/(2+expm1(-2*x)) works with the same relaxed accuracy requirement on expm1.

In addition, there are other shortcuts available for x of large magnitude that aren't possible for exp, of basically the same origin. Once x is large enough, expm1(2*x) dominates so completely that the correction term can simply be discarded, whereas exp still has to be computed (this is even the case for large negative x). For these values tanh is immediately decided to be 1 (or -1), while exp must actually be evaluated.
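To make the idea concrete, here is a small, self-contained sketch (an illustration of the identities above, not the actual libm implementation) that computes tanh from a single expm1 call and short-circuits for large |x|; the cutoff 22 is chosen because tanh(22) already rounds to 1 in double precision:

#include <cmath>
#include <cstdio>

// Illustrative only: tanh(x) via one expm1 call, using tanh(|x|) = -u/(u+2)
// with u = expm1(-2*|x|), plus an early exit where the result rounds to +/-1.
double tanh_via_expm1(double x) {
    if (x > 22.0)  return  1.0;   // 2*exp(-44) is far below double epsilon
    if (x < -22.0) return -1.0;
    double u = std::expm1(-2.0 * std::fabs(x));   // the single "heavy" call
    double t = -u / (u + 2.0);                    // tanh(|x|)
    return x < 0 ? -t : t;
}

int main() {
    const double xs[] = {-30.0, -3.0, -0.001, 0.001, 3.0, 30.0};
    for (double x : xs)
        std::printf("x=%g  std::tanh=%.17g  sketch=%.17g\n",
                    x, std::tanh(x), tanh_via_expm1(x));
}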
Related
I am implementing Greedy Approach to TSP:
Start from first node.
Go to nearest node not visited yet. (If multiple, go to the one with the lowest index.)
Don't forget to include distance from node 1 to last node visited.
However, my C++ code gives the wrong answer. I implemented the same algorithm in Python, and the Python code gives the right answer.
In my problem, the nodes are coordinates on 2-D plane and the distance is the Euclidean Distance.
I even changed everything to long double because it's more precise.
In fact, if I reverse the direction of the for loop and add an additional if statement to handle ties (we want the minimum-index nearest node), it gives a very different answer.
Is this because of precision issues?
(Note: I have to print floor(ans))
INPUT: Link
EXPECTED OUTPUT: 1203406
ACTUAL OUTPUT: 1200403
#include <iostream>
#include <cstdio>   // for freopen
#include <cmath>
#include <vector>
#include <cassert>
#include <functional>
using namespace std;

int main() {
    freopen("input.txt", "r", stdin);
    int n;
    cin >> n;
    vector<pair<long double, long double>> points(n);
    for (int i = 0; i < n; ++i) {
        int x;
        cin >> x;
        assert(x == i + 1);
        cin >> points[i].first >> points[i].second;
    }
    // Returns the squared Euclidean distance between points x and y
    function<long double(int, int)> dis = [&](int x, int y) {
        long double ans = (points[x].first - points[y].first) * (points[x].first - points[y].first);
        ans += (points[x].second - points[y].second) * (points[x].second - points[y].second);
        return ans;
    };
    long double ans = 0;
    int last = 0;
    int cnt = n - 1;
    vector<int> taken(n, 0);
    taken[0] = 1;
    while (cnt > 0) {
        pair<long double, int> mn = {1e18, 1e9};
        for (int i = 0; i < n; ++i) {
            if (!taken[i]) {
                mn = min(mn, {dis(i, last), i});
            }
        }
        int nex = mn.second;
        taken[nex] = 1;
        cnt--;
        ans += sqrt(mn.first);
        last = nex;
    }
    ans += sqrt(dis(0, last));
    cout << ans << '\n';
    return 0;
}
UPD: Python Code:
import math

file = open("input.txt", "r")
n = int(file.readline())
a = []
for i in range(n):
    data = file.readline().split(" ")
    a.append([float(data[1]), float(data[2])])

for c in a:
    print(c)

def dis(x, y):
    cur_ans = (a[x][0] - a[y][0]) * (a[x][0] - a[y][0])
    cur_ans += (a[x][1] - a[y][1]) * (a[x][1] - a[y][1])
    cur_ans = math.sqrt(cur_ans)
    return cur_ans

ans = 0.0
last = 0
cnt = n - 1
take = []
for i in range(n):
    take.append(0)
take[0] = 1

while cnt > 0:
    idx = -1
    cur_dis = 1e18
    for i in range(n):
        if take[i] == 0:
            if dis(i, last) < cur_dis:
                cur_dis = dis(i, last)
                idx = i
    assert(idx != -1)
    take[idx] = 1
    cnt -= 1
    ans += cur_dis
    last = idx

ans += dis(0, last)
print(ans)
file.close()
# 1203406
Yes, the difference is due to round-off error, and the C++ code produces the more accurate result because of your use of long double. If you change your C++ code so that it uses the same precision as Python (IEEE 754 double precision), you get exactly the same round-off errors in both codes. Here is a demonstrator in the Godbolt compiler explorer, with your example boiled down to 4000 points: https://godbolt.org/z/rddrdT54n
If I run the same code on the whole input file I get 1203406.5012708856 in both C++ and Python (I had to try this offline, because Godbolt understandably killed the process).
Note that in theory your Python code and C++ code are not completely analogous, because std::min compares pairs lexicographically. So if two distances are ever exactly equal, the std::min call will choose the smaller of the two indices. In practice, this does not make a difference, though.
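For illustration, here is a tiny example of that lexicographic tie-breaking (the values are made up):

#include <algorithm>
#include <iostream>
#include <utility>

int main() {
    // Equal first elements (distances), different second elements (indices):
    std::pair<long double, int> a{2.5L, 7}, b{2.5L, 3};
    std::cout << std::min(a, b).second << '\n';   // prints 3, the smaller index
}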
Now, I don't think you can really get rid of the rounding errors, but there are a few tricks to minimize them:
Using higher precision (long double) is one option. But this also makes your code slower; it's a tradeoff.
Rescale your points so that they are relative to the centroid of all points, and choose units that reflect your problem (i.e. don't think in mm, miles, km or whatever, but rather in "variance of your data set"). You can't get rid of numerical cancellation in the calculation of the Euclidean distance, but if the relative distances are small compared to the absolute values of the coordinates, the cancellation is typically more severe. Here is a small demonstration:
#include <iostream>
#include <iomanip>

int main() {
    std::cout
        << std::setprecision(17)
        << (1000.0001 - 1000) / 0.0001
        << std::endl
        << (1.0001 - 1) / 0.0001
        << std::endl;
    return 0;
}
0.99999999974897946
0.99999999999988987
Finally, there are some tricks and algorithms to better control the error accumulation in large sums (https://en.wikipedia.org/wiki/Pairwise_summation, https://en.wikipedia.org/wiki/Kahan_summation_algorithm)
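For example, here is a minimal sketch of Kahan (compensated) summation applied to a list of edge lengths; the names are illustrative, not taken from your code:

#include <cstdio>
#include <vector>

long double kahan_sum(const std::vector<long double>& values) {
    long double sum = 0.0L, c = 0.0L;        // c accumulates the lost low-order bits
    for (long double v : values) {
        long double y = v - c;               // compensate for the previous error
        long double t = sum + y;             // big + small: low bits of y may be lost
        c = (t - sum) - y;                   // recover what was lost
        sum = t;
    }
    return sum;
}

int main() {
    std::vector<long double> edges{1e10L, 3.14159L, 2.71828L, 1e-6L};
    std::printf("%.10Lf\n", kahan_sum(edges));
}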
One final comment, a bit unrelated to your question: Use auto with lambdas, i.e.
auto dis = [&](int x, int y) {
// ...
};
C++ has many different kinds of callable objects (functions, function pointers, functors, lambdas, ...) and std::function is a useful wrapper to have one type representing all kinds of callables with the same signature. This comes at some computational overhead (runtime polymorphism, type erasure) and the compiler will have a hard time optimizing your code. So if you don't need the type erasing functionality of std::function, just store your lambda in a variable declared with auto.
I need to find some way to deal with infinitesimal double values.
For example:
exp(-0.00000000000000000000000000000100000000000000000003)= 0.99999999999999999999999999999899999999999999999997
But the exp function produces the result = 1.000000000000000000000000000000
So my first thought was to write my own exp function. Unfortunately I get the same output.
#include <cmath>

double my_exp(double x)
{
    bool minus = x < 0;
    x = std::fabs(x);
    double exp = (double)1 + x;
    double temp = x;
    for (int i = 2; i < 100000; i++)
    {
        temp *= x / (double)i;
        exp = exp + temp;
    }
    // exp now holds e^|x|; for a negative input return its reciprocal
    return minus ? (double)1 / exp : exp;
}
I found that the issue is that such small numbers, like 1.00000000000000000003e-030, don't survive addition or subtraction: whether I add such a number to 1 or subtract it from 1, the result is always exactly 1.
Do you have any idea how to deal with this?
Try using std::expm1
Computes the e (Euler's number, 2.7182818) raised to the given power
arg, minus 1.0. This function is more accurate than the expression
std::exp(arg)-1.0 if arg is close to zero.
#include <iostream>
#include <cmath>

int main()
{
    std::cout << "expm1(-0.00000000000000000000000000000100000000000000000003) = "
              << std::expm1(-0.00000000000000000000000000000100000000000000000003) << '\n';
}
Run the example at the source link below, changing the argument to your very small numbers.
Source: https://en.cppreference.com/w/cpp/numeric/math/expm1
I think the best way of dealing with such small numbers is to use an existing arbitrary-precision library. You could try GMP, starting with their example that calculates billions of digits of pi. Another library, MPFR, which is based on GMP, also seems to be a good choice. I don't know when to choose one over the other.
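As a rough illustration (assuming MPFR and GMP are installed and you link with -lmpfr -lgmp), something along these lines evaluates exp of your tiny argument at 200 bits of precision; the function names are the standard MPFR API:

#include <mpfr.h>

int main() {
    mpfr_t x, y;
    mpfr_init2(x, 200);                     // 200-bit significand
    mpfr_init2(y, 200);
    mpfr_set_str(x, "-1.00000000000000000003e-30", 10, MPFR_RNDN);
    mpfr_exp(y, x, MPFR_RNDN);              // y = exp(x), correctly rounded at 200 bits
    mpfr_printf("exp(x) = %.60Rf\n", y);    // enough digits to see past double precision
    mpfr_clear(x);
    mpfr_clear(y);
    return 0;
}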
I am trying to benchmark linear and binary search as a part of an assignment. I have written the necessary search and randomizer functions. But when I try to benchmark them I get 0 delay even for higher array sizes.
The code:
#include <iostream>
#include <time.h>
#include <windows.h>
using namespace std;

double getTime()
{
    LARGE_INTEGER t, f;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&f);
    return (double)t.QuadPart / (double)f.QuadPart;
}

int linearSearch(int arr[], int len, int target){
    int resultIndex = -1;
    for(int i = 0; i < len; i++){
        if(arr[i] == target){
            resultIndex = i;
            break;
        }
    }
    return resultIndex;
}

void badSort(int arr[], int len){
    for(int i = 0; i < len; i++){
        int indexToSwapWith = i;
        for(int j = i + 1; j < len; j++){
            if(arr[j] < arr[indexToSwapWith])
                indexToSwapWith = j;
        }
        if(indexToSwapWith != i){
            int t = arr[i];
            arr[i] = arr[indexToSwapWith];
            arr[indexToSwapWith] = t;
        }
    }
}

int binSearch(int arr[], int len, int target){
    int resultIndex = -1;
    int first = 0;
    int last = len - 1;   // inclusive upper bound of the search range
    int mid = first;
    while(first <= last){
        mid = (first + last) / 2;
        if(target < arr[mid])
            last = mid - 1;
        else if(target > arr[mid])
            first = mid + 1;
        else
            break;
    }
    if(arr[mid] == target)
        resultIndex = mid;
    return resultIndex;
}

void fillArrRandomly(int arr[], int len){
    srand(time(NULL));
    for(int i = 0; i < len; i++){
        arr[i] = rand();
    }
}

void benchmarkRandomly(int len){
    float startTime = getTime();
    int arr[len];
    fillArrRandomly(arr, len);
    badSort(arr, len);
    /*
    for(auto i : arr)
        cout << i << "\n";
    */
    float endTime = getTime();
    float timeElapsed = endTime - startTime;
    cout << "prep took " << timeElapsed << endl;

    int target = rand();
    startTime = getTime();
    int result = linearSearch(arr, len, target);
    endTime = getTime();
    timeElapsed = endTime - startTime;
    cout << "linear search result for " << target << ":" << result << " after " << startTime << " to " << endTime << ":" << timeElapsed << "\n";

    startTime = getTime();
    result = binSearch(arr, len, target);
    endTime = getTime();
    timeElapsed = endTime - startTime;
    cout << "binary search result for " << target << ":" << result << " after " << startTime << " to " << endTime << ":" << timeElapsed << "\n";
}

int main(){
    benchmarkRandomly(30000);
}
Sample output:
prep took 0.9375
linear search result for 29445:26987 after 701950 to 701950:0
binary search result for 29445:26987 after 701950 to 701950:0
I have tried using clock_t as well, but the result was the same. Do I need an even larger array, or am I benchmarking the wrong way?
In this course I have to implement most of the machinery myself; that's why I'm not using the STL. I'm not sure whether std::chrono is allowed, but I'd first like to make sure the problem doesn't lie elsewhere.
Edit: In case it isn't clear, I can't include the time for sorting and random generation in the benchmark.
One problem is that you set startTime = getTime() before you fill your test array with random values. If the random number generation is slow, this may dominate the returned result. The main effort is sorting your array; the search time will be extremely low compared to that.
It is probably too coarse an interval, as you suggest. A binary search over 30k items takes only about 15 iterations, so on a modern machine that is on the order of tens of nanoseconds at most, which is approximately zero at the resolution you're printing.
Increasing the number of array entries won't help much, but you could try increasing the array size until you get near the memory limit. But then your problem will be that the preparatory random number generation and sorting take forever.
I would suggest either:
A. Checking a very large number of items:
unsigned int total = 0;
startTime = getTime();
for (int i = 0; i < 10000000; i++)
    total += binSearch(arr, len, rand());
endTime = getTime();
B. Modifying your code to count the number of times you compare elements and using that information instead of timing (a sketch follows below).
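As a rough sketch of option B (using a global counter purely for illustration), the binary search could be instrumented like this:

long long comparisons = 0;   // incremented once per probe of the array

int binSearchCounted(int arr[], int len, int target) {
    int first = 0, last = len - 1;
    while (first <= last) {
        int mid = (first + last) / 2;
        ++comparisons;                        // one three-way comparison against arr[mid]
        if (target < arr[mid])      last = mid - 1;
        else if (target > arr[mid]) first = mid + 1;
        else return mid;
    }
    return -1;
}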
It looks like you're using the search result (by printing it with cout outside the timed region; that's good). And the data and key are randomized, so the search shouldn't be getting optimized away at compile time. (Benchmarking with optimization disabled is pointless, so you need tricks like this.)
Have you looked at timeElapsed with a debugger? Maybe it's a very small float that prints as 0 with default cout settings?
Or maybe float endTime - float startTime actually is equal to 0.0f because rounding to the nearest float made them equal. Subtracting two large nearby floating-point numbers produces "catastrophic cancellation".
Remember that float only has 24 bits of significand, so regardless of the frequency you divide by, if the PerformanceCounter values differ in less than 1 part in 2^24, you'll get zero. (If that function returns raw counts from x86 rdtsc, then that will happen if your system's last reboot was more than 2^24 times longer ago than the time interval. x86 TSC starts at zero when the system boots, and (on CPUs in the last ~10 years) counts at a "reference frequency" that's (approximately) equal to your CPU's rated / "sticker" frequency, regardless of turbo or idle clock speeds. See Get CPU cycle count?)
double might help, but it's much better to subtract in the integer domain before dividing. Also, rewriting that part takes QueryPerformanceFrequency out of the timed interval! A minimal sketch of that follows.
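Here is a minimal sketch of that integer-domain approach with the Windows APIs you're already using (the surrounding names are illustrative):

#include <windows.h>
#include <iostream>

int main() {
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);     // ticks per second, constant at runtime
    QueryPerformanceCounter(&t0);
    // ... code under test ...
    QueryPerformanceCounter(&t1);
    // Subtract the raw 64-bit tick counts first, divide by the frequency once.
    double seconds = double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
    std::cout << "elapsed: " << seconds << " s\n";
}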
As @Jon suggests, it's often better to put the code under test into a repeat loop inside one longer timed interval, so (code) caches and branch prediction can warm up.
But then you have the problem of making sure repeated calls aren't optimized away, and of randomizing the search key inside the loop. (Otherwise a smart compiler might hoist the search out of the loop).
Something like volatile int result = binSearch(...); can help, because assigning to (or initializing) a volatile is a visible side-effect that can't be optimized away. So the compiler needs to actually materialize each search result in a register.
For some compilers, e.g. ones that support GNU C inline asm, you can use inline asm to require the compiler to produce a value in a register without adding any overhead of storing it anywhere. AFAIK this isn't possible with MSVC inline asm.
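A commonly used sketch of such a "sink" in GNU C looks like this (gcc/clang only; not portable to MSVC):

// Forces `value` to be materialized in a register without storing it anywhere
// or generating any instructions; the empty asm just consumes it.
static inline void do_not_optimize_away(int value) {
    asm volatile("" : : "r"(value) : "memory");
}

// Illustrative usage inside a timed loop:
//     do_not_optimize_away(binSearch(arr, len, rand()));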
I'm trying to compare the performance of FMA (fma() in math.h) versus naive multiplication and addition in floating-point computation. The test is simple: I iterate the same calculation a large number of times. There are a few things I have to ensure for a precise measurement:
No other computing should be included in counting time.
Naive multiplication and addition should not be optimized to FMA
Iteration should not be optimized. i.e. iteration should be carried out exactly as much as I intended.
To achieve the above, I did the following:
The functions are inline and contain only the required computation.
Used the g++ -O0 option so the multiplication is not optimized. (But when I look into the dump file it seems to generate almost the same code for both.)
Used volatile.
But the results show almost no difference, and fma() is sometimes even slower than naive multiplication and addition. Is this the expected result (i.e. they really aren't different in terms of speed), or am I doing something wrong?
Spec
Ubuntu 14.04.2
G++ 4.8.2
Intel(R) Core(TM) i7-4770 (3.4GHz, 8MB L3 cache)
My Code
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <chrono>
using namespace std;
using namespace chrono;

inline double rand_gen() {
    return static_cast<double>(rand()) / RAND_MAX;
}

volatile double a, b, c;

inline void pure_fma_func() {
    fma(a, b, c);
}

inline void non_fma_func() {
    a * b + c;
}

int main() {
    int n = 100000000;
    a = rand_gen();
    b = rand_gen();
    c = rand_gen();

    auto t1 = system_clock::now();
    for (int i = 0; i < n; i++) {
        non_fma_func();
    }
    auto t2 = system_clock::now();
    for (int i = 0; i < n; i++) {
        pure_fma_func();
    }
    auto t3 = system_clock::now();

    cout << "non fma" << endl;
    cout << duration_cast<microseconds>(t2 - t1).count() / 1000.0 << "ms" << endl;
    cout << "fma" << endl;
    cout << duration_cast<microseconds>(t3 - t2).count() / 1000.0 << "ms" << endl;
}
Yes, you are doing something completely wrong. At least two somethings. But let's keep it simple.
Used g++ -O0 option not to optimize the multiplication
This renders your results completely irrelevant. Fun fact: the cost of the function call is probably higher than the cost of the calculation in either case.
Fundamentally, the results of benchmarks without optimizations enabled are completely meaningless. You can't just turn them off and hope for the best. They absolutely must be enabled.
Secondly, FMA vs regular multiply-and-add is a complex situation: there are factors like latency vs throughput and other matters where multiply-and-add can be a winner.
In short, your benchmark is not a benchmark at all, it's just a bunch of random instructions that produce meaningless junk.
If you want an accurate benchmark, you must accurately reproduce the actual usage circumstances, entirely: including surrounding code, compiler optimizations, the whole shebang.
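To illustrate, here is a hedged sketch of a more meaningful comparison: compile with optimizations (e.g. g++ -O2), keep a dependent chain so the loops can't be collapsed, and print the accumulators so nothing is optimized away. It measures latency rather than throughput, and depending on -ffp-contract the compiler may fuse the "naive" version into an FMA anyway, so treat the numbers with care:

#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 100000000;
    double x = 1.0000001, acc1 = 0.0, acc2 = 0.0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        acc1 = acc1 * x + 1.0;           // separate multiply and add (may still be contracted)
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i)
        acc2 = std::fma(acc2, x, 1.0);   // explicit fused multiply-add
    auto t2 = std::chrono::steady_clock::now();

    std::printf("mul+add: %.1f ms (acc1=%g)\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(), acc1);
    std::printf("fma    : %.1f ms (acc2=%g)\n",
                std::chrono::duration<double, std::milli>(t2 - t1).count(), acc2);
}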
I am trying to generate a series of double random numbers with high precision. For example, 0.856365621 (which has 9 digits after the decimal point).
I've found some methods on the internet; they do generate double random numbers, but the precision is not as good as I need (only 6 digits after the decimal point).
So, how can I achieve my goal?
In C++11 you can use the <random> header; in this example, using std::uniform_real_distribution, I am able to generate random numbers with more than 6 digits. To set the number of digits that will be printed via std::cout we need std::setprecision:
#include <iostream>
#include <random>
#include <iomanip>

int main()
{
    std::random_device rd;
    std::mt19937 e2(rd());
    std::uniform_real_distribution<> dist(1, 10);
    for (int i = 0; i < 10; ++i)
    {
        std::cout << std::fixed << std::setprecision(10) << dist(e2) << std::endl;
    }
    return 0;
}
You can use std::numeric_limits<double>::digits10 to determine the number of decimal digits of precision available:
std::cout << std::numeric_limits<double>::digits10 << std::endl;
On a typical system, RAND_MAX is 2^31 - 1 or something similar to that. So your "precision" from using a method like:
double r = static_cast<double>(rand()) / RAND_MAX;
would be 1/(2^31 - 1), which should give you 8-9 digits of "precision" in the random number. Make sure you print with high enough precision:
cout << r << endl;
will not do. This will work better:
cout << fixed << setprecision(15) << r << endl;
Of course, there are some systems out there with a much smaller RAND_MAX, in which case the results may be less "precise"; however, you should still get digits down in the 9-12 range, just that they are more likely to be "samey".
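If you are on such a system, one rough sketch (illustrative only, not a recommendation over <random>) is to combine two rand() calls to get more random bits before scaling into [0, 1):

#include <cstdlib>
#include <cstdio>

double rand01_more_bits() {
    // Each call yields RAND_MAX + 1 distinct values; two calls give their product.
    double span = RAND_MAX + 1.0;
    return (std::rand() * span + std::rand()) / (span * span);   // in [0, 1)
}

int main() {
    for (int i = 0; i < 5; ++i)
        std::printf("%.12f\n", rand01_more_bits());
}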
Why not create your value out of multiple calls of the random function instead?
For instance:
const int numDecimals = 9;
double result = 0.0;
double div = 1.0;
double mul = 1.0;
for (int n = 0; n < numDecimals; ++n)
{
    int t = rand() % 10;
    result += t * mul;
    mul *= 10.0;
    div /= 10.0;
}
result = result * div;
I would personally try a different implementation of the rand function, though, or at least mix in the current time or something.
In my case, I'm using MQL5, a very close derivative of C++ for a specific market, whose only random generator produces a random integer from 0 to 32767 (= (2^15)-1). Far too low precision.
So I've adapted his idea -- randomly generate a string of digits of any length I want -- to solve my problem more reliably (and arguably more randomly too) than anything else I could find or think of. My version builds a string and converts it to a double at the end, which avoids any potential math/rounding errors along the way (because we all know 0.1 + 0.2 != 0.3 😉 )
Posting it here in case it helps anyone.
(Disclaimer: The following is valid MQL5. MQL5 and C++ are very close, but there are some differences, e.g. no RAND_MAX constant (so I've hard-coded the 32767). I'm not entirely sure of all the differences, so there may be C++ syntax errors here. Please adapt accordingly.)
const int RAND_MAX_INCL = 32767;
const int RAND_MAX_EXCL = RAND_MAX_INCL + 1;

int iRandomDigit() {
    const double dRand = rand() / (double)RAND_MAX_EXCL;   // double 0.0 <= dRand < 1.0 (cast avoids integer division)
    return (int)(dRand * 10);                               // int 0 <= result < 10
};

double dRandom0IncTo1Exc(const int iPrecisionDigits) {
    int iPrecisionDigits2 = iPrecisionDigits;
    if ( iPrecisionDigits > DBL_DIG ) {   // DBL_DIG == "Number of significant decimal digits for double type"
        Print("WARNING: Can't generate random number with precision > ", DBL_DIG, ". Adjusted precision to ", DBL_DIG, " accordingly.");
        iPrecisionDigits2 = DBL_DIG;
    };

    string sDigits = "";
    for (int i = 0; i < iPrecisionDigits2; i++) {
        sDigits += (string)iRandomDigit();
    };

    const string sResult = "0." + sDigits;
    const double dResult = StringToDouble(sResult);
    return dResult;
}
As noted in a comment on @MasterPlanMan's answer: the other answers use more "official" methods designed for the question, from the standard library, etc. However, I think conceptually this is a good solution when faced with limitations that the other answers can't address.