Performance of pow(x,3.0f) vs x*x*x? - c++

The following program...
int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <cmath>
int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += std::pow(x,3.0f);
    }
    return t;
}
...takes about 6600ms to complete.
I'm kind of surprised that the optimizer doesn't inline the std::pow function so that the two programs produce the same code and have identical performance.
Any insights? How do you account for the roughly 7x performance difference?
For reference, I'm using gcc -O3 on Linux x86.
Update: (C Version)
int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}
...takes about 900ms to complete on my machine. Whereas...
#include <math.h>
int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += powf(x,3.0f);
    }
    return t;
}
...takes about 6600ms to complete.
Update 2
The following program:
#include <math.h>
int main() {
float t = 0;
for (int i = 0; i < 1000000000; i++) {
const float x = i;
t += __builtin_powif(x,3.0f);
}
return t;
}
runs in 900ms like the first program.
Why isn't powf being inlined to __builtin_powif?
Update 3:
With -ffast-math the following program:
#include <math.h>
#include <iostream>
int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += powf(x, 3.0f);
    }
    std::cout << t;
}
runs in 227ms (as does the x*x*x version). That's roughly 230 picoseconds per iteration. Using -fopt-info, it reports "optimized: loop vectorized using 16 byte vectors" and "optimized: loop with 2 iterations completely unrolled", so I guess that means it's doing iterations in batches of 4 for SSE and doing 2 iterations at once via pipelining (for a total of 8 iterations at once), or something like that?

The GCC documentation page about built-in functions is explicit (emphasis mine):
Built-in Function: double __builtin_powi (double, int)
Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.
Built-in Function: float __builtin_powif (float, int)
Similar to __builtin_powi, except the argument and return types are float.
Since __builtin_powif performs on par with a plain product, the additional time is spent on the checks pow needs in order to meet its guarantees about precision and rounding.
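To see what those guarantees cost in terms of results rather than time, here is a small illustrative check (the sample values are an arbitrary choice of mine, not from the question): the library powf result and the twice-rounded x*x*x product may differ by an ULP or so, and may also be identical for many inputs.
#include <cmath>
#include <cstdio>

int main() {
    // Compare the library cube against the inlined product for a few floats.
    // Any nonzero diff is the rounding slack the product version trades away
    // (the difference may well be zero for some inputs).
    const float xs[] = {3.3f, 1234.5678f, 9999991.0f};
    for (float x : xs) {
        float via_pow  = std::pow(x, 3.0f);
        float via_mult = x * x * x;
        std::printf("x=%g  pow=%.9g  x*x*x=%.9g  diff=%g\n",
                    x, via_pow, via_mult, via_pow - via_mult);
    }
    return 0;
}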

Assuming your compiler chose to just call pow from the shared library, as in https://godbolt.org/z/re3baK (without -ffast-math):
I did not take a look at how pow(float, float) is implemented, but I see some points.
x*x*x is inlined, while pow cannot be, since it lives in a shared library - a function-call overhead difference.
Is the exponent 3.0 known to be constant? If the compiler knows something is constant, it is likely to generate more efficient code:
x*x*x: just generates assembly for two float multiplications.
pow: this must handle all possible exponent values, so it probably has general code (less efficient, possibly including loops).
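To illustrate that last point, here is a minimal sketch of what a specialized integer-exponent power routine can look like (an assumption about the general shape, not GCC's actual __builtin_powif or the libm pow): exponentiation by squaring needs only a handful of multiplications and none of the exp/log range reduction, special-case handling, or errno bookkeeping that a fully general pow(float, float) has to provide.
// Hypothetical sketch of an integer-power routine: O(log exp) multiplications,
// with no rounding guarantees beyond what each individual multiply gives.
static float powi_sketch(float base, int exp) {
    float result = 1.0f;
    bool invert = exp < 0;
    unsigned e = invert ? 0u - (unsigned)exp : (unsigned)exp;
    while (e != 0) {
        if (e & 1) result *= base;   // fold in the current bit of the exponent
        base *= base;                // square for the next bit
        e >>= 1;
    }
    return invert ? 1.0f / result : result;
}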

Related

Random generation with TRNG

For the following code, which generates random numbers for a Monte Carlo simulation, I need to receive the exact sum for each run, but this will not happen, although I have fixed the seed. I would appreciate it if anyone could point out the problem with this code.
#include <cmath>
#include <random>
#include <iostream>
#include <chrono>
#include <cfloat>
#include <iomanip>
#include <cstdlib>
#include <omp.h>
#include <trng/yarn2.hpp>
#include <trng/mt19937_64.hpp>
#include <trng/uniform01_dist.hpp>
using namespace std;
using namespace chrono;
const double landa = 1;
const double exact_solution = landa / (pow(landa, 2) + 1);
double function(double x) {
return cos(x) / landa;
}
int main() {
int rank;
const int N = 1000000;
double sum = 0.0;
trng::yarn2 r[6];
for (int i = 0; i <6; i++)
{
r[i].seed(0);
}
for (int i = 0; i < 6; i++)
{
r[i].split(6,i);
}
trng::uniform01_dist<double> u;
auto start = high_resolution_clock::now();
#pragma omp parallel num_threads(6)
{
rank=omp_get_thread_num();
#pragma omp for reduction (+: sum)
for (int i = 0; i<N; ++i) {
//double x = distribution(g);
double x= u(r[rank]);
x = (-1.0 / landa) * log(1.0 - x);
sum = sum+function(x);
}
}
double app = sum / static_cast<double> (N);
auto end = high_resolution_clock::now();
auto diff=duration_cast<milliseconds>(end-start);
cout << "Approximation is: " <<setprecision(17) << app << "\t"<<"Time: "<< setprecision(17) << diff.count()<<" Error: "<<(app-exact_solution)<< endl;
return 0;
}
TL;DR The problem is two-fold:
Floating-point addition is not associative;
You are generating different random numbers for each thread.
I need to receive the exact sum for each run, but this will not
happen, although I have fixed the seed. I would appreciate it if
anyone could point out the problem with this code
First, you have a race condition on rank = omp_get_thread_num(); the variable rank is shared among all threads. To fix that, declare rank inside the parallel region, making it private to each thread.
#pragma omp parallel num_threads(6)
{
int rank=omp_get_thread_num();
...
}
In your code, you should not expect the value of the sum to be the same for different numbers of threads. Why? Because you are adding doubles in parallel:
double sum = 0.0;
...
#pragma omp for reduction (+: sum)
for (int i = 0; i<N; ++i) {
//double x = distribution(g);
double x= u(r[rank]);
x = (-1.0 / landa) * log(1.0 - x);
sum = sum+function(x);
}
and from What Every Computer Scientist Should Know About Floating-Point Arithmetic one can read:
Another grey area concerns the interpretation of parentheses. Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers. For example, the
expression (x+y)+z has a totally different answer than x+(y+z) when
x = 1e30, y = -1e30 and z = 1 (it is 1 in the former case, 0 in the
latter).
Hence, floating-point addition is not associative, and that is why you might get different sum values for different numbers of threads.
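A minimal demonstration of the quoted example, using the same values as in the quote:
#include <cstdio>

int main() {
    double x = 1e30, y = -1e30, z = 1.0;
    // Mathematically identical sums, different floating-point results:
    std::printf("(x+y)+z = %g\n", (x + y) + z);   // prints 1
    std::printf("x+(y+z) = %g\n", x + (y + z));   // prints 0
    return 0;
}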
You are generating different random values per thread:
for (int i = 0; i < 6; i++)
{
r[i].split(6,i);
}
Consequently, for different numbers of threads, the variable sum gets different results as well.
As kindly pointed out by jérôme-richard in the comments:
Note that a more precise algorithm like Kahan summation can significantly reduce the rounding issue while still being relatively fast.
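For reference, a minimal sketch of Kahan (compensated) summation; this is a hypothetical helper, not part of the original code, and it would be applied to the values each thread accumulates:
#include <vector>

// Kahan summation keeps a running correction term that recovers the
// low-order bits lost when adding a small value to a large running sum.
double kahan_sum(const std::vector<double>& values) {
    double sum = 0.0, c = 0.0;
    for (double v : values) {
        double y = v - c;      // apply the pending correction
        double t = sum + y;    // low-order bits of y may be lost here
        c = (t - sum) - y;     // recover what was lost
        sum = t;
    }
    return sum;
}
Note that aggressive flags such as -ffast-math can optimize the correction term away.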

Cilk Plus code result depends on number of workers

I have a small piece of code that I would like to parallelize as I upscale. I've been using cilk_for from Cilk Plus to run the multithreading. The trouble is that I get a different result depending on the number of workers.
I've read that this might be due to a race condition, but I'm not sure what specifically about the code causes that or how to ameliorate it. Also, I realize that long and __float128 are overkill for this problem, but might be necessary in the upscaling.
Code:
#include <assert.h>
#include "cilk/cilk.h"
#include <cstring>
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <string>
#include <vector>
using namespace std;
__float128 direct(const vector<double>& Rpct, const vector<unsigned>& values, double Rbase, double toWin) {
unsigned count = Rpct.size();
__float128 sumProb = 0.0;
__float128 rProb = 0.0;
long nCombo = static_cast<long>(pow(2, count));
// for (long j = 0; j < nCombo; ++j) { //over every combination
cilk_for (long j = 0; j < nCombo; ++j) { //over every combination
vector<unsigned> binary;
__float128 prob = 1.0;
unsigned point = Rbase;
for (unsigned i = 0; i < count; ++i) { //over all the individual events
long exp = static_cast<long>(pow(2, count-i-1));
bool odd = (j/exp) % 2;
if (odd) {
binary.push_back(1);
point += values[i];
prob *= static_cast<__float128>(Rpct[i]);
} else {
binary.push_back(0);
prob *= static_cast<__float128>(1.0 - Rpct[i]);
}
}
sumProb += prob;
if (point >= toWin) rProb += prob;
assert(sumProb >= rProb);
}
//print sumProb
cout << " sumProb = " << (double)sumProb << endl;
assert( fabs(1.0 - sumProb) < 0.01);
return rProb;
}
int main(int argc, char *argv[]) {
vector<double> Rpct;
vector<unsigned> value;
value.assign(20,1);
Rpct.assign(20,0.25);
unsigned Rbase = 22;
unsigned win = 30;
__float128 rProb = direct(Rpct, value, Rbase, win);
cout << (double)rProb << endl;
return 0;
}
Sample output for export CILK_NWORKERS=1 && ./code.exe:
sumProb = 1
0.101812
Sample output for export CILK_NWORKERS=4 && ./code.exe:
sumProb = 0.948159
Assertion failed: (fabs(1.0 - sumProb) < 0.01), function direct, file code.c, line 61.
Abort trap: 6
It is because of a race condition. cilk_for is an implementation of the parallel-for pattern. If you want to use a parallel for, the iterations must be independent (independent data). This is very important. You have to use Cilk reducers for your case: https://www.cilkplus.org/tutorial-cilk-plus-reducers
To clarify, there is at least one race on sumProb. Each of the parallel workers will do a read/modify/write on that location. As sribin mentioned above, solving problems like this is what reducers are for.
It's entirely possible that there's more than one race in your program. The only way to be sure is to run it under a race detector, since finding races is one of the things that computers are much better at than humans. A free possibility is the Cilkscreen race detector, available from the cilkplus.org website. Unfortunately it doesn't support gcc/g++.
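A minimal sketch of the reducer approach, assuming the Intel Cilk Plus reducer headers; __float128 is replaced with double and the per-combination probability math is elided, so this is only the accumulation skeleton, not a drop-in replacement for the original direct():
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <cassert>

double direct_skeleton(long nCombo, double toWin) {
    // Each worker gets a private view of the reducers; the runtime merges the
    // views, so there is no shared read/modify/write on sumProb or rProb.
    cilk::reducer_opadd<double> sumProb(0.0);
    cilk::reducer_opadd<double> rProb(0.0);
    cilk_for (long j = 0; j < nCombo; ++j) {
        double prob  = 0.0;   // ... probability of combination j (elided) ...
        double point = 0.0;   // ... points scored by combination j (elided) ...
        sumProb += prob;
        if (point >= toWin) rProb += prob;
    }
    // Read the merged totals only after the loop; the per-iteration
    // assert(sumProb >= rProb) has to move out here as well.
    assert(sumProb.get_value() >= rProb.get_value());
    return rProb.get_value();
}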

Rounding errors giving incorrect results in DFT?

I have been beating my head against the wall on this DFT. It should print out 8,0,0,0,0,0,0,0, but instead I get 8 and then very, very tiny numbers. Are these rounding errors? Is there anything I can do? My radix-2 FFT gives correct results, so it seems silly that a DFT could not also work.
I started with complex numbers, so I know there is a good bit missing; I tried to strip it down to illustrate the problem.
#include <cstdlib>
#include <math.h>
#include <iostream>
#include <complex>
#include <cassert>
#define SIZE 8
#define M_PI 3.14159265358979323846
void fft(const double src[], double dst[], const unsigned int n)
{
for(int i=0; i < SIZE; i++)
{
const double ph = -(2*M_PI) / n;
const int gid = i;
double res = 0.0f;
for (int k = 0; k < n; k++) {
double t = src[k];
const double val = ph * k * gid;
double cs = cos(val);
double sn = sin(val);
res += ((t * cs) - (t * sn));
int a = 1;
}
dst[i] = res;
std::cout << dst[i] << std::endl;
}
}
int main(void)
{
double array1[SIZE];
double array2[SIZE];
for(int i=0; i < SIZE; i++){
array1[i] = 1;
array2[i] = 0;
}
fft(array1, array2, SIZE);
return 666;
}
An FFT can actually produce more accurate results than a straight DFT calculation, as the fewer arithmetic ops usually allow fewer opportunities for arithmetic quantization errors to accumulate. There's a paper by one of the FFTW authors on this topic.
Since the DFT/FFT deal with a transcendental basis function, the results will never (except perhaps in a few special cases, or by lucky accident) be exactly correct using any non-symbolic and finite computer number format. So values very close (within a few LSB) to zero should simply be ignored as noise, or considered to be the same as zero.
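A minimal sketch of that suggestion, applied to the dst array from the question; the tolerance is an assumption on my part and should be scaled to your input magnitude and transform length:
#include <cmath>

// Flush DFT bins that are within a small tolerance of zero, treating them as
// the rounding noise they are rather than as meaningful output.
void flush_near_zeros(double dst[], unsigned n, double input_scale) {
    const double tol = 1e-12 * input_scale * n;   // illustrative choice of slack
    for (unsigned i = 0; i < n; ++i) {
        if (std::fabs(dst[i]) < tol) dst[i] = 0.0;
    }
}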

Multi-threaded Simulated Annealing

I wrote a multithreaded simulated annealing program, but it's not running. I am not sure whether the code is correct. The code compiles, but when I run it, it crashes; it's just a runtime error.
#include <stdio.h>
#include <time.h>
#include <iostream>
#include <stdlib.h>
#include <math.h>
#include <string>
#include <vector>
#include <algorithm>
#include <fstream>
#include <ctime>
#include <windows.h>
#include <process.h>
using namespace std;
typedef vector<double> Layer; //defines a vector type
typedef struct {
Layer Solution1;
double temp1;
double coolingrate1;
int MCL1;
int prob1;
}t;
//void SA(Layer Solution, double temp, double coolingrate, int MCL, int prob){
double Rand_NormalDistri(double mean, double stddev) {
//Random Number from Normal Distribution
static double n2 = 0.0;
static int n2_cached = 0;
if (!n2_cached) {
// choose a point x,y in the unit circle uniformly at random
double x, y, r;
do {
// scale two random integers to doubles between -1 and 1
x = 2.0*rand()/RAND_MAX - 1;
y = 2.0*rand()/RAND_MAX - 1;
r = x*x + y*y;
} while (r == 0.0 || r > 1.0);
{
// Apply Box-Muller transform on x, y
double d = sqrt(-2.0*log(r)/r);
double n1 = x*d;
n2 = y*d;
// scale and translate to get desired mean and standard deviation
double result = n1*stddev + mean;
n2_cached = 1;
return result;
}
} else {
n2_cached = 0;
return n2*stddev + mean;
}
}
double FitnessFunc(Layer x, int ProbNum)
{
int i,j,k;
double z;
double fit = 0;
double sumSCH;
if(ProbNum==1){
// Ellipsoidal function
for(j=0;j< x.size();j++)
fit+=((j+1)*(x[j]*x[j]));
}
else if(ProbNum==2){
// Schwefel's function
for(j=0; j< x.size(); j++)
{
sumSCH=0;
for(i=0; i<j; i++)
sumSCH += x[i];
fit += sumSCH * sumSCH;
}
}
else if(ProbNum==3){
// Rosenbrock's function
for(j=0; j< x.size()-1; j++)
fit += 100.0*(x[j]*x[j] - x[j+1])*(x[j]*x[j] - x[j+1]) + (x[j]-1.0)*(x[j]-1.0);
}
return fit;
}
double probl(double energychange, double temp){
double a;
a= (-energychange)/temp;
return double(min(1.0,exp(a)));
}
int random (int min, int max){
int n = max - min + 1;
int remainder = RAND_MAX % n;
int x;
do{
x = rand();
}while (x >= RAND_MAX - remainder);
return min + x % n;
}
//void SA(Layer Solution, double temp, double coolingrate, int MCL, int prob){
void SA(void *param){
t *args = (t*) param;
Layer Solution = args->Solution1;
double temp = args->temp1;
double coolingrate = args->coolingrate1;
int MCL = args->MCL1;
int prob = args->prob1;
double Energy;
double EnergyNew;
double EnergyChange;
Layer SolutionNew(50);
Energy = FitnessFunc(Solution, prob);
while (temp > 0.01){
for ( int i = 0; i < MCL; i++){
for (int j = 0 ; j < SolutionNew.size(); j++){
SolutionNew[j] = Rand_NormalDistri(5, 1);
}
EnergyNew = FitnessFunc(SolutionNew, prob);
EnergyChange = EnergyNew - Energy;
if(EnergyChange <= 0){
Solution = SolutionNew;
Energy = EnergyNew;
}
if(probl(EnergyChange ,temp ) > random(0,1)){
//cout<<SolutionNew[i]<<endl;
Solution = SolutionNew;
Energy = EnergyNew;
cout << temp << "=" << Energy << endl;
}
}
temp = temp * coolingrate;
}
}
int main ()
{
srand ( time(NULL) ); //seed for getting different numbers each time the prog is run
Layer SearchSpace(50); //declare a vector of 20 dimensions
//for(int a = 0;a < 10; a++){
for (int i = 0 ; i < SearchSpace.size(); i++){
SearchSpace[i] = Rand_NormalDistri(5, 1);
}
t *arg1;
arg1 = (t *)malloc(sizeof(t));
arg1->Solution1 = SearchSpace;
arg1->temp1 = 1000;
arg1->coolingrate1 = 0.01;
arg1->MCL1 = 100;
arg1->prob1 = 3;
//cout << "Test " << ""<<endl;
_beginthread( SA, 0, (void*) arg1);
Sleep( 100 );
//SA(SearchSpace, 1000, 0.01, 100, 3);
//}
return 0;
}
Please help.
Thanks
Avinesh
As leftaroundabout pointed out, you're using malloc in C++ code. This is the source of your crash.
Malloc will allocate a block of memory, but since it was really designed for C, it doesn't call any C++ constructors. In this case, the vector<double> is never properly constructed. When
arg1->Solution1 = SearchSpace;
is called, the member variable Solution1 has an undefined state and the assignment operator crashes.
Instead of malloc try
arg1 = new t;
This will accomplish roughly the same thing but the "new" keyword also calls any necessary constructors to ensure the vector<double> is properly initialized.
This also brings up another minor issue, that this memory you've newed also needs to be deleted somewhere. In this case, since arg1 is passed to another thread, it should probably be cleaned up like
delete args;
by your "SA" function after it's done with the args variable.
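Putting both suggestions together, a minimal self-contained sketch; the struct is renamed Args and the thread call is replaced by a direct call purely for illustration:
#include <vector>

struct Args {                          // same shape as the original struct t
    std::vector<double> Solution1;
    double temp1, coolingrate1;
    int MCL1, prob1;
};

void SA_sketch(void *param) {
    Args *args = static_cast<Args*>(param);
    // ... run the annealing loop using *args ...
    delete args;                       // runs ~Args(), releasing the vector's storage
}

int main() {
    Args *arg1 = new Args;             // new runs the constructors that malloc skips
    arg1->Solution1.assign(50, 0.0);
    arg1->temp1 = 1000;
    arg1->coolingrate1 = 0.01;
    arg1->MCL1 = 100;
    arg1->prob1 = 3;
    SA_sketch(arg1);                   // in the real program: _beginthread(SA, 0, arg1);
    return 0;
}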
While I don't know the actual cause for your crashes I'm not really surprised that you end up in trouble. For instance, those "cached" static variables in Rand_NormalDistri are obviously vulnerable to data races. Why don't you use std::normal_distribution? It's almost always a good idea to use standard library routines when they're available, and even more so when you need to consider multithreading trickiness.
Even worse, you're heavily mixing C and C++. malloc is something you should virtually never use in C++ code – it doesn't know about RAII, which is one of the few intrinsically safe things you can cling onto in C++.
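A minimal sketch of the std::normal_distribution suggestion, with a thread_local engine so there is no shared hidden state like the static variables in Rand_NormalDistri (the seeding choice here is an arbitrary assumption):
#include <random>

// Drop-in style replacement for Rand_NormalDistri(mean, stddev): each thread
// owns its own engine, so concurrent calls never race on shared state.
double rand_normal(double mean, double stddev) {
    thread_local std::mt19937 engine(std::random_device{}());
    std::normal_distribution<double> dist(mean, stddev);
    return dist(engine);
}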

Theoretical and practical matrix multiplication FLOPs

My system:
Intel Core 2 Duo E4500, about 3.7 GB of memory, 2 MB L2 cache, x64, Fedora 17.
How I measure flops/MFLOPS
Well, I use the PAPI library (to read hardware performance counters) to measure the flops and MFLOPS of my code. It returns the real time, the process time, the flop count, and finally flops divided by process time, which equals MFLOPS. The library uses hardware counters to count floating-point instructions or floating-point operations, plus total cycles, to produce the final result containing flops and MFLOPS.
My computational kernel
I used a three-loop matrix-matrix multiplication (square matrices) and a three-level nested loop which does some operations on a 1D array in its innermost loop.
First Kernel MM
float a[size][size];
float b[size][size];
float c[size][size];
start_calculate_MFlops();
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; k+=1) {
c[i][j]=c[i][j]+a[i][k] * b[k][j];
}
}
}
stop_calculate_MFlops();
Second kernel with 1d array
float d[size];
float e[size];
float f[size];
float g[size];
float r = 3.6541;
start_calculate_MFlops();
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
d[k]=d[k]+e[k]+f[k]+g[k]+r;
}
}
}
stop_calculate_MFlops();
What I know about flops
Matrix-matrix multiplication (MM) does 2 floating-point operations in its inner loop, and since there are 3 loops that each iterate size times, in theory we have a total of 2*n^3 flops for MM.
In the second kernel we have 3 loops, and the innermost loop does some computation on a 1D array. There are 4 floating-point operations in that loop, hence a total of 4*n^3 flops in theory.
I know that the flop count we calculate above is not exactly what happens on a real machine. On a real machine there are other operations, like loads and stores, which add to the theoretical flops.
Questions:
When I use a 1D array, as in the second kernel, the theoretical flop count is the same as (or close to) the flops I get by executing the code and measuring it; the measured flops equal the number of operations in the innermost loop multiplied by n^3. But when I use my first kernel, MM, which uses 2D arrays, the theoretical count is 2*n^3, yet the measured value is much higher: about (4 + the 2 operations in the innermost loop of the matrix multiplication)*n^3 = 6*n^3.
I changed the matrix multiplication line in the innermost loop to just the code below:
A[i][j]++;
The theoretical flop count for this code in 3 nested loops is 1 operation * n^3 = n^3. Again, when I ran the code, the result was much higher than expected: (2 + the 1 operation in the innermost loop)*n^3 = 3*n^3.
Sample results for a matrix of size 512x512:
Real_time: 1.718368 Proc_time: 1.227672 Total flpops: 807,107,072 MFLOPS: 657.429016
Real_time: 3.608078 Proc_time: 3.042272 Total flpops: 807,024,448 MFLOPS: 265.270355
Theoretical flops: 2*512*512*512 = 268,435,456
Measured flops: 807,107,072 ≈ 6*512^3
Sample result for the 1D array operation in 3 nested loops:
Real_time: 1.282257 Proc_time: 1.155990 Total flpops: 536,872,000 MFLOPS: 464.426117
Theoretical flops: 4*n^3 = 536,870,912
Measured flops: 536,872,000 ≈ 4*512^3 plus overhead (other operations?)
I cannot find any reason for this behaviour. Is my assumption correct?
I hope this is much simpler than the earlier description. By "practical" I mean the real flop count measured by executing the code.
Code:
void countFlops() {
int size = 512;
int itr = 20;
float a[size][size];
float b[size][size];
float c[size][size];
/* float d[size];
float e[size];
float f[size];
float g[size];*/
float r = 3.6541;
float real_time, proc_time, mflops;
long long flpops;
float ireal_time, iproc_time, imflops;
long long iflpops;
int retval;
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
a[j][j] = b[j][j] = c[j][j] = 1.0125;
}
}
/* for (int i = 0; i < size; ++i) {
d[i]=e[i]=f[i]=g[i]=10.235;
}*/
if ((retval = PAPI_flops(&ireal_time, &iproc_time, &iflpops, &imflops))
< PAPI_OK) {
printf("Could not initialise PAPI_flops \n");
printf("Your platform may not support floating point operation event.\n");
printf("retval: %d\n", retval);
exit(1);
}
for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; k+=16) {
c[i][j]=c[i][j]+a[i][k] * b[k][j];
}
}
}
/* for (int i = 0; i < size; ++i) {
for (int j = 0; j < size; ++j) {
for (int k = 0; k < size; ++k) {
d[k]=d[k]+e[k]+f[k]+g[k]+r;
}
}
}*/
if ((retval = PAPI_flops(&real_time, &proc_time, &flpops, &mflops))
< PAPI_OK) {
printf("retval: %d\n", retval);
exit(1);
}
string flpops_tmp;
flpops_tmp = output_formatted_string(flpops);
printf(
"calculation: Real_time: %f Proc_time: %f Total flpops: %s MFLOPS: %f\n",
real_time, proc_time, flpops_tmp.c_str(), mflops);
}
thank you
If you need to count the number of operations, you can write a simple class that acts like a floating-point value and gathers statistics. It will be interchangeable with the built-in types.
LIVE DEMO:
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/operators.hpp>
#include <iostream>
#include <ostream>
#include <utility>
#include <cstddef>
#include <vector>
using namespace boost;
using namespace std;
class Statistic
{
size_t ops = 0;
public:
Statistic &increment()
{
++ops;
return *this;
}
size_t count() const
{
return ops;
}
};
template<typename Domain>
class Profiled: field_operators<Profiled<Domain>>
{
Domain value;
static vector<Statistic> stat;
void stat_increment()
{
stat.back().increment();
}
public:
struct StatisticScope
{
StatisticScope()
{
stat.emplace_back();
}
Statistic &current()
{
return stat.back();
}
~StatisticScope()
{
stat.pop_back();
}
};
template<typename ...Args>
Profiled(Args&& ...args)
: value{forward<Args>(args)...}
{}
Profiled& operator+=(const Profiled& x)
{
stat_increment();
value+=x.value;
return *this;
}
Profiled& operator-=(const Profiled& x)
{
stat_increment();
value-=x.value;
return *this;
}
Profiled& operator*=(const Profiled& x)
{
stat_increment();
value*=x.value;
return *this;
}
Profiled& operator/=(const Profiled& x)
{
stat_increment();
value/=x.value;
return *this;
}
};
template<typename Domain>
vector<Statistic> Profiled<Domain>::stat{1};
int main()
{
typedef Profiled<double> Float;
{
Float::StatisticScope s;
Float x = 1.0, y = 2.0, res = 0.0;
res = x+y*x+y;
cout << s.current().count() << endl;
}
{
using namespace numeric::ublas;
Float::StatisticScope s;
matrix<Float> x{10, 20},y{20,5},res{10,5};
res = prod(x,y);
cout << s.current().count() << endl;
}
}
Output is:
3
2000
P.S. Your matrix loop is not cache-friendly and is, as a result, very inefficient.
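For reference, a common remedy is to interchange the inner loops so the innermost loop walks both c and b row by row; a minimal sketch, assuming n x n row-major float matrices (flattened into 1D arrays here to stay within standard C++):
// i-k-j loop order: the innermost loop touches c[i][*] and b[k][*]
// contiguously, so cache lines are reused instead of strided through.
void matmul_ikj(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            const float aik = a[i*n + k];
            for (int j = 0; j < n; ++j)
                c[i*n + j] += aik * b[k*n + j];
        }
}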
P.P.S.
int size = 512;
float a[size][size];
This is not legal C++ code; C++ does not support VLAs (variable-length arrays).