This C++11 program takes on average between 7.42s and 7.79s to run.
#include <iostream>
#include <chrono>
using namespace std;
using c = chrono::system_clock;
using s = chrono::duration<double>;
void func() {
    int n = 0;
    const auto before = c::now();
    for (int i = 0; i < 2000000000; i++) {
        n += i;
    }
    const s duration = c::now() - before;
    cout << duration.count();
}
If I replace n += i with n += 2 * i * i, it takes between 5.80s and 5.96s. How come?
I ran each version of the program 20 times, alternating between the two. Here are the results:
n += i | n += 2 * i * i
---------+----------------
7.77047 | 5.87978
7.69226 | 5.83551
7.77375 | 5.84888
7.73748 | 5.84629
7.72988 | 5.84356
7.69736 | 5.83784
7.72597 | 5.84246
7.72722 | 5.81678
7.73291 | 5.81237
7.71871 | 5.81016
7.7478 | 5.80119
7.64906 | 5.80058
7.7253 | 5.9078
7.42734 | 5.96399
7.72573 | 5.84733
7.65591 | 5.81793
7.76619 | 5.83116
7.76963 | 5.84424
7.79928 | 5.87078
7.79274 | 5.84689
I compiled it with GCC 9.1.1 20190503 (Red Hat 9.1.1-1), with no optimization flags:
g++ -std=c++11
We know that the maximum int is ~2 billion, so 2 * i * i exceeds it once i is around 32000. Can we say that the compiler predicts the calculation will overflow?
https://godbolt.org/z/B3zIsv
You'll notice that with -O2, the code used to calculate 'n' is removed completely. So the real questions should be:
Why are you profiling code without -O2?
Why are you profiling code that has no observable side effects? ('n' can be removed completely - e.g. printing the value of 'n' at the end would be more useful here)
Why are you not profiling code in a profiler?
The timing results you have come from a deeply flawed methodology.
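For illustration, here is one way the benchmark could be restructured so that the work has an observable side effect and survives -O2. This is a sketch, not the canonical fix; it widens n to long long so the sum itself cannot overflow:

#include <chrono>
#include <iostream>

int main() {
    using clk = std::chrono::steady_clock;
    long long n = 0;  // wide enough that the sum of i below cannot overflow
    const auto before = clk::now();
    for (int i = 0; i < 2000000000; i++) {
        n += i;
    }
    const std::chrono::duration<double> duration = clk::now() - before;
    // Printing n makes the result observable, so the optimizer
    // cannot simply delete the computation.
    std::cout << "n = " << n << " in " << duration.count() << "s\n";
}

Compiled with g++ -O2 -std=c++11, the loop's result can no longer be discarded, because it is printed (the optimizer may still transform the loop itself).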
Related
I want to know the details about the execution of recursive functions.
#include <iostream>

int a = 0;

int fac(int n) {
    if (n <= 1)
        return n;
    int temp = fac(n-2) + fac(n-1);
    a++;
    return temp;
}

int main() {
    fac(4);
    std::cout << a;
}
The output is 4.
I want to know what happens when int temp = fac(n-2) + fac(n - 1); is executed. For example, fac(4-2) + fac(4-1) ---> fac(2) + fac(3): at this point, will fac(2) be finished first, or are they finished together?
I'm not good at English; I hope my question is still clear.
Analysing this code purely in an algorithmic sense, with no regard to C++ implementation intricacies:
fac(4)
= fac(2)            + fac(3)
= (fac(0) + fac(1)) + (fac(1) + fac(2))
= (  0    +   1   ) + (  1    + (fac(0) + fac(1)))
= (  0    +   1   ) + (  1    + (  0    +   1   ))
How can I create a trace which shows the call order where there is recursion?
First, note that the machine code produced by the compiler will not match one-to-one with the code you write. The compiler applies different levels of optimization based on the flags provided to it, with the highest level being -O3 and the default being -O0, but those seem out of scope for this question. Adding a trace influences this process itself, as the compiler now needs to meet your expectations of what the trace looks like. The only true way to trace the actual execution flow is to read the assembly produced by the compiler.
Knowing that, we can apply a trace to see the call order by printing to screen when execution enters the called method. Note that I have removed a, as it does not really trace anything and only adds to the complexity of the explanation.
#include <iostream>

int fac(int n) {
    std::cout << "fac(" << n << ")" << std::endl;
    if (n <= 1)
        return n;
    int temp = fac(n - 2) + fac(n - 1);
    return temp;
}

int main() {
    fac(4);
}
// Output
fac(4)
fac(2)
fac(0)
fac(1)
fac(3)
fac(1)
fac(2)
fac(0)
fac(1)
As seen from this output on my PC, execution proceeded from left to right, depth first. We can number our call tree with this order to obtain a better picture:
// Call order
                    1. fac(4)
          2. fac(2)     +     5. fac(3)
          /        \          /        \
  3. fac(0) + 4. fac(1)  6. fac(1) + 7. fac(2)
                                     /        \
                             8. fac(0) + 9. fac(1)
Note: this does not mean that the results will be the same on every implementation, nor that the order of execution is preserved when you remove the trace and allow compiler optimisations, but it demonstrates how recursion works in computer programming.
First fac(2) will be finished, and then fac(3). The output is 4.
The call tree looks like this (fac(4) at the root on the left; each call fans out into the two recursive calls it makes):
fac(4) --|---- fac(2) --|---- fac(0)
         |              |---- fac(1)
         |
         |---- fac(3) --|---- fac(1)
                        |---- fac(2) --|---- fac(0)
                                       |---- fac(1)
The fac() function returns without further recursion when n <= 1.
First of all, I know what the problem with the code is and how to get it to work. I'm mainly looking for an explanation of why my output is what it is.
The following piece of code replicates the behaviour:
void setup() {
    Serial.begin(9600);
}

void loop() {
    Serial.println("start");
    for (int i = 0; i < 70000; i++) {
        if ((i % 2000) == 0)
            Serial.println(i);
    }
}
Obviously the for loop will run forever, because i (a 16-bit int on this board) will overflow at 32,767. I would expect it to overflow and then print -32000.
Expected | Actually printed
0 | 0
2000 | 2000
4000 | 4000
... | ...
30000 | 30000
32000 | 32000
-32000 | 33536
-30000 | 35536
It looks like it prints the actual iteration count, since if you overflow and count up to -32000 you have done 33536 iterations, but I can't figure out how it's able to print 33536.
The same thing happens every 65536 iterations:
95536 | 161072 | 226608 | 292144 | 357680
97536 | 163072 | 228608 | 294144 | 359680
99072 | 164608 | 230144 | 295680 | 361216
101072 | 166608 | 232144 | 297680 | 363216
EDIT
When I change the loop to add 10,000 every iteration and only print every 1,000,000 to speed it up, the Arduino crashes (or at least, the prints stop) at 2,147,000,000, which seems to point to the 32-bit idea of svtag.
EDIT2
The edits I made for the 2,147,000,000 check:
void loop() {
    Serial.println("start");
    for (int i = 0; i < 70000; i += 10000) {
        if ((i % 1000000) == 0)
            Serial.println(i);
    }
}
They behave in the same way as the previous example, printing 0, 1000000, 2000000, ...
However, when I update the AVR package from 1.6.17 to the latest 1.6.23 I get only 0's. The original example (% 2000 & i++) still gives the same output.
The compiler may have implicitly converted i to unsigned int when doing the println, or the value is going to the wrong overload. Try using Serial.println((int)i);
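As a plain C++ illustration of where 33536 could come from (a sketch for a desktop compiler, not the Arduino toolchain): the bit pattern of -32000 in a 16-bit signed int, read as unsigned, is 65536 - 32000 = 33536.

#include <cstdint>
#include <iostream>

int main() {
    int16_t i = -32000;                     // what the 16-bit counter holds
    uint16_t u = static_cast<uint16_t>(i);  // the same bits, read as unsigned
    std::cout << u << "\n";                 // prints 33536 (65536 - 32000)
}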
#include <future>
#include <iostream>
#include <vector>
#include <cstdint>
#include <algorithm>
#include <random>
#include <chrono>
#include <utility>
#include <type_traits>
template <class Clock = std::chrono::high_resolution_clock, class Task>
double timing(Task&& t, typename std::result_of<Task()>::type* r = nullptr)
{
    using namespace std::chrono;
    auto begin = Clock::now();
    if (r != nullptr) *r = std::forward<Task>(t)();
    auto end = Clock::now();
    return duration_cast<duration<double>>(end - begin).count();
}
template <typename Num>
double sum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
    double s = 0;  // note: s must be initialized
    for (auto i = l; i <= h; i++) s += v[i];
    return s;
}
template <typename Num>
double asum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
    auto m = (l + h) / 2;
    auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
    auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);
    return s1.get() + s2.get();
}
int main()
{
    std::vector<unsigned int> v(1000);
    auto s = std::chrono::system_clock::now().time_since_epoch().count();
    std::generate(v.begin(), v.end(), std::minstd_rand0(s));
    double r;
    std::cout << 1000 * timing([&]() -> double { return asum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
    std::cout << 1000 * timing([&]() -> double { return sum(v, 0, v.size() - 1); }, &r) << " msec | rst " << r << std::endl;
}
Hi,
So above are two functions for summing up a vector of random numbers.
I did several runs, but it seems that I did not benefit from std::async. Below are some results I got.
0.130582 msec | rst 1.09015e+12
0.001402 msec | rst 1.09015e+12
0.23185 msec | rst 1.07046e+12
0.002308 msec | rst 1.07046e+12
0.18052 msec | rst 1.07449e+12
0.00244 msec | rst 1.07449e+12
0.190455 msec | rst 1.08319e+12
0.002315 msec | rst 1.08319e+12
In all four cases the async version took more time. But ideally it should be about two times faster, right?
Did I miss anything in my code?
By the way I am running on OS X 10.10.4 Macbook Air with 1.4 GHz Intel Core i5.
Thanks,
Edits:
compiler flags: g++ -o asum asum.cpp -std=c++11
I changed the flags to include -O3 and the vector size to be 10000000, but the results are still weird.
72.1743 msec | rst 1.07349e+16
14.3739 msec | rst 1.07349e+16
58.3542 msec | rst 1.07372e+16
12.1143 msec | rst 1.07372e+16
57.1576 msec | rst 1.07371e+16
11.9332 msec | rst 1.07371e+16
59.9104 msec | rst 1.07395e+16
11.9923 msec | rst 1.07395e+16
64.032 msec | rst 1.07371e+16
12.0929 msec | rst 1.07371e+16
here
auto s1 = std::async(std::launch::async, sum<Num>, v, l, m);
auto s2 = std::async(std::launch::async, sum<Num>, v, m+1, h);
std::async will store its own copy of the vector, twice. You should use std::cref, and make sure the futures are retrieved before the vector dies (as they are in your current code) and that accesses get properly synchronized (as they are in your current code).
As mentioned in comments, thread creation overhead may further slow down the code.
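A minimal sketch of that fix, applied to the asum function from the question:

template <typename Num>
double asum(const std::vector<Num>& v, const std::size_t l, const std::size_t h)
{
    auto m = (l + h) / 2;
    // std::cref passes a reference_wrapper, so both threads read the
    // caller's vector instead of each std::async call copying it.
    auto s1 = std::async(std::launch::async, sum<Num>, std::cref(v), l, m);
    auto s2 = std::async(std::launch::async, sum<Num>, std::cref(v), m + 1, h);
    return s1.get() + s2.get();  // both futures retrieved before v can die
}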
Well, this is the simplest possible example, and the results should not be taken as binding, for the following reasons:
When you create a thread, it takes some extra CPU cycles to create the thread context and stack. Those cycles are added to the cost of the sum function.
When the main thread runs this code, it is otherwise idle and not doing anything other than the sum.
We go for a multithreaded solution only when we cannot accomplish something in a single thread, or when we need to wait synchronously for some input.
Just by creating threads, you don't increase performance. You increase performance by carefully designing multithreaded applications where otherwise-idle CPUs can be used while some other work is waiting, for example on IO.
First, the performance of your original async function is bad compared with the sequential one because it makes one more copy of your test data, as mentioned in other answers. Second, you might not see an improvement even after fixing the copying issue, because creating threads is not cheap and can kill your performance gain.
From the benchmark results, I can see the async version is 1.88 times faster than the sequential version for N = 1000000. However, if I use N = 10000 then the async version is 3.55 times slower. Both non-iterator and iterator solutions produce similar results.
Besides that, you should use iterators when writing your code: this approach is more flexible (for example, you can try different container types), gives similar performance to the C-style version, and is also more elegant IMHO :).
Benchmark results:
-----------------------------------------------------------------------------------------------------------------------------------------------
Group | Experiment | Prob. Space | Samples | Iterations | Baseline | us/Iteration | Iterations/sec |
-----------------------------------------------------------------------------------------------------------------------------------------------
sum | original | Null | 50 | 10 | 1.00000 | 1366.30000 | 731.90 |
sum | original_async | Null | 50 | 10 | 0.53246 | 727.50000 | 1374.57 |
sum | iter | Null | 50 | 10 | 1.00022 | 1366.60000 | 731.74 |
sum | iter_async | Null | 50 | 10 | 0.53261 | 727.70000 | 1374.19 |
Complete.
Celero
Timer resolution: 0.001000 us
-----------------------------------------------------------------------------------------------------------------------------------------------
Group | Experiment | Prob. Space | Samples | Iterations | Baseline | us/Iteration | Iterations/sec |
-----------------------------------------------------------------------------------------------------------------------------------------------
sum | original | Null | 50 | 10 | 1.00000 | 13.60000 | 73529.41 |
sum | original_async | Null | 50 | 10 | 3.55882 | 48.40000 | 20661.16 |
sum | iter | Null | 50 | 10 | 1.00000 | 13.60000 | 73529.41 |
sum | iter_async | Null | 50 | 10 | 3.53676 | 48.10000 | 20790.02 |
Complete.
Complete code sample
#include <algorithm>
#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <type_traits>
#include <vector>
#include "celero/Celero.h"
constexpr int NumberOfSamples = 50;
constexpr int NumberOfIterations = 10;
template <typename Container>
double sum(const Container &v, const std::size_t begin, const std::size_t end) {
    double s = 0;
    for (auto idx = begin; idx < end; ++idx) {
        s += v[idx];
    }
    return s;
}

template <typename Container>
double sum_async(const Container &v, const std::size_t begin, const std::size_t end) {
    auto middle = (begin + end) / 2;
    // Removing std::cref will slow down this function because it makes two copies of v.
    auto s1 = std::async(std::launch::async, sum<Container>, std::cref(v), begin, middle);
    auto s2 = std::async(std::launch::async, sum<Container>, std::cref(v), middle, end);
    return s1.get() + s2.get();
}

template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter(Iter begin, Iter end) {
    typename std::iterator_traits<Iter>::value_type results = 0.0;
    std::for_each(begin, end, [&results](auto const item) { results += item; });
    return results;
}

template <typename Iter>
typename std::iterator_traits<Iter>::value_type sum_iter_async(Iter begin, Iter end) {
    Iter middle = begin + std::distance(begin, end) / 2;
    auto s1 = std::async(std::launch::async, sum_iter<Iter>, begin, middle);
    auto s2 = std::async(std::launch::async, sum_iter<Iter>, middle, end);
    return s1.get() + s2.get();
}

template <typename T> auto create_test_data(const size_t N) {
    auto s = std::chrono::system_clock::now().time_since_epoch().count();
    std::vector<T> v(N);
    std::generate(v.begin(), v.end(), std::minstd_rand0(s));
    return v;
}

// Create test data
constexpr size_t N = 10000;
using value_type = double;
auto data = create_test_data<value_type>(N);
using container_type = decltype(data);

CELERO_MAIN

BASELINE(sum, original, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum<container_type>(data, 0, N));
}

BENCHMARK(sum, original_async, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum_async<container_type>(data, 0, N));
}

BENCHMARK(sum, iter, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum_iter(data.cbegin(), data.cend()));
}

BENCHMARK(sum, iter_async, NumberOfSamples, NumberOfIterations) {
    celero::DoNotOptimizeAway(sum_iter_async(data.cbegin(), data.cend()));
}
I am about to generate an array of normally distributed pseudo-random numbers. As far as I know, the standard library offers the following code for that:
std::random_device rd;
std::mt19937 gen(rd());
std::normal_distribution<> d(mean,std);
...
double number = d(gen);
The problem is that I want to use a Sobol' quasi-random sequence instead of the Mersenne Twister pseudo-random generator. So, my question is:
Is it possible to run the std::normal_distribution with a user-defined random generator (with a Sobol' quasi-random sequence generator in my case)?
More details: I have a class called RandomGenerator, which is used to generate Sobol' quasi-random numbers:
RandomGenerator randgen;
double number = randgen.sobol(0,1);
Yes, it is possible. Just make it comply with the requirements of a uniform random number generator (§26.5.1.3 paragraphs 2 and 3):
2 A class G satisfies the requirements of a uniform random number
generator if the expressions shown in Table 116 are valid and have the
indicated semantics, and if G also satisfies all other requirements
of this section. In that Table and throughout this section:
a) T is the type named by G's associated result_type, and
b) g is a value of G.
Table 116 — Uniform random number generator requirements
Expression | Return type | Pre/post-condition | Complexity
----------------------------------------------------------------------
G::result_type | T | T is an unsigned integer | compile-time
| | type (§3.9.1). |
----------------------------------------------------------------------
g() | T | Returns a value in the | amortized constant
| | closed interval |
| | [G::min(), G::max()]. |
----------------------------------------------------------------------
G::min() | T | Denotes the least value | compile-time
| | potentially returned by |
| | operator(). |
----------------------------------------------------------------------
G::max() | T | Denotes the greatest value | compile-time
| | potentially returned by |
| | operator(). |
3 The following relation shall hold: G::min() < G::max().
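For illustration, a minimal engine shaped like this satisfies those requirements and can be passed to std::normal_distribution. The next_integer() body is a placeholder counter, not a real Sobol' source; you would replace it with whatever your RandomGenerator uses to produce the next integer of the sequence:

#include <cstdint>
#include <iostream>
#include <limits>
#include <random>

struct SobolEngine {
    using result_type = std::uint64_t;  // 64-bit on purpose; see the caution below

    static constexpr result_type min() { return 0; }
    static constexpr result_type max() { return std::numeric_limits<result_type>::max(); }

    result_type operator()() { return next_integer(); }

    // Placeholder stand-in, NOT a Sobol' sequence: replace with the next
    // integer of your quasi-random sequence.
    result_type next_integer() { return state_ += 0x9e3779b97f4a7c15ull; }

    result_type state_ = 0;
};

int main() {
    SobolEngine eng;
    std::normal_distribution<> d(0.0, 1.0);
    std::cout << d(eng) << "\n";  // the distribution accepts the custom engine
}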
A word of caution here: I came across a big gotcha when I implemented this. It seems that if the return types of max()/min()/operator() are not 64-bit, the distribution will resample. My (unsigned) 32-bit Sobol implementation was getting sampled twice per deviate, thus destroying the properties of the numbers. This code reproduces the issue:
#include <random>
#include <limits>
#include <iostream>
#include <cstdint>

typedef uint32_t rng_int_t;

int requested = 0;
int sampled = 0;

struct Quasi
{
    rng_int_t operator()()
    {
        ++sampled;
        return 0;
    }
    rng_int_t min() const
    {
        return 0;
    }
    rng_int_t max() const
    {
        return std::numeric_limits<rng_int_t>::max();
    }
};

int main()
{
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    Quasi q;
    double total = 0.0;
    for (size_t i = 0; i < 10; ++i)
    {
        dist(q);
        ++requested;
    }
    std::cout << "requested: " << requested << std::endl;
    std::cout << "sampled: " << sampled << std::endl;
}
Output (using g++ 5.4):
requested: 10
sampled: 20
The output is the same even when compiled with -m32. If you change rng_int_t to a 64-bit type, the problem goes away. My workaround is to stick the 32-bit value into the most significant bits of the return value, e.g.
return uint64_t(val) << 32;
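Applied to the Quasi example above, the workaround might look like this (a sketch; with the full 64-bit range advertised, the distribution draws one sample per deviate):

struct Quasi64
{
    uint64_t operator()()
    {
        ++sampled;
        rng_int_t val = 0;           // stand-in for the next 32-bit Sobol' value
        return uint64_t(val) << 32;  // stick the 32 bits into the most significant bits
    }
    uint64_t min() const { return 0; }
    uint64_t max() const { return std::numeric_limits<uint64_t>::max(); }
};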
You can now generate Sobol sequences directly with Boost. See boost/random/sobol.hpp.
While practicing recursive functions, I wrote this code for the Fibonacci series and the factorial. The program does not run and shows the error "Control reaches end of non-void function". I suspect this is about the last iteration reaching zero and not knowing what to do toward negative integers. I have tried return 0 and return 1, but no good. Any suggestions?
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <ctime>
using namespace std;

int fib(int n) {
    int x;
    if (n <= 1) {
        cout << "zero reached \n";
        x = 1;
    } else {
        x = fib(n-1) + fib(n-2);
        return x;
    }
}

int factorial(int n) {
    int x;
    if (n == 0) {
        x = 1;
    } else {
        x = (factorial(n) * factorial(n-1));
        return x;
    }
}
Change

if (n <= 1) {
    cout << "zero reached \n";
    x = 1;
}

to

if (n <= 1) {
    cout << "zero reached \n";
    return 1;
}
Then fib() should work for all non-negative numbers. If you want to simplify it and have it work for all numbers, go with something like:
int fib(int n) {
    if (n <= 1) {
        cout << "zero reached \n";
        return 1;
    } else {
        return fib(n-1) + fib(n-2);
    }
}
"Control reaches end of non-void function"
This is a compile-time warning (which can be treated as an error with appropriate compiler flags). It means that you have declared your function as non-void (in this case, int) and yet there is a path through your function for which there is no return (in this case, the if (n <= 1) branch).
One of the reasons that some programmers prefer to have exactly one return statement per function, at the very last line of the function...
return x;
}
...is that it is easy to see that their functions return appropriately. This can also be achieved by keeping functions very short.
You should also check the logic in your factorial() implementation; you have infinite recursion therein.
Presumably the factorial function should be returning n * factorial(n-1) for n > 0.
x=(factorial(n)*factorial(n-1)) should read x = n * factorial(n-1)
In your base case branch, you never return x; or return 1;.
The else section of your factorial() function starts:
x=(factorial(n)*factorial(n-1));
This leads to infinite recursion. It should be:
x=(n*factorial(n-1));
Sometimes, your compiler is not able to deduce that your function actually has no missing return. In such cases, several solutions exist:
Assume
if (foo == 0) {
    return bar;
} else {
    return frob;
}
Restructure your code
if (foo == 0) {
    return bar;
}
return frob;
This works well if you can interpret the if-statement as a kind of firewall or precondition.
abort()
(see David Rodríguez's comment.)
if (foo == 0) {
    return bar;
} else {
    return frob;
}
abort(); return -1;  // unreachable
Return something else accordingly. The comment tells fellow programmers and yourself why this is there.
throw
#include <stdexcept>

if (foo == 0) {
    return bar;
} else {
    return frob;
}
throw std::runtime_error("impossible");
Not a counter measure: Single Function Exit Point
Do not fall back to one-return-per-function a.k.a. single-function-exit-point as a workaround. This is obsolete in C++ because you almost never know where the function will exit:
void foo(int&);

int bar() {
    int ret = -1;
    foo(ret);
    return ret;
}
Looks nice and looks like SFEP, but reverse engineering the 3rd party proprietary libfoo reveals:
void foo(int&) {
    if (rand() % 2) throw ":P";
}
Also, this can imply an unnecessary performance loss:
Frob bar()
{
    Frob ret;
    if (...) ret = ...;
    ...
    if (...) ret = ...;
    else if (...) ret = ...;
    return ret;
}
because:
class Frob { char c[1024]; }; // copy lots of data upon copy
And, every mutable variable increases the cyclomatic complexity of your code. It means more code and more state to test and verify, which in turn means more state the maintainer must keep in their head, which in turn means poorer maintainability.
Last but not least: some classes have no default construction, and you would have to write really bogus code, if it is possible at all:
File mogrify() {
    File f("/dev/random");  // need bogus init
    ...
}
Do not do this.
There is nothing wrong with if-else statements. C++ code applying them looks similar to other languages. In order to emphasize the expressiveness of C++, one could write the factorial (as an example) like this:
int factorial(int n) { return (n > 1) ? n * factorial(n - 1) : 1; }
This illustration, using the "truly" C/C++ conditional operator ?:, and the other suggestions above lack production strength. Measures would be needed against overflowing the placeholder (int or unsigned int) for the result and, with recursive solutions, against overflowing the call stack. Clearly, the maximum n for which the factorial still fits can be computed in advance and used as protection against bad inputs; however, this check could also live at other indirection levels that control the n passed to the factorial function. The version above returns 1 for any negative n. Using unsigned int would prevent processing negative inputs, but it would not prevent a conversion situation created by a user, so measures against negative inputs might be desirable too.
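As a sketch of such guarding (the bound 12 assumes a 32-bit int, for which 13! already overflows):

#include <stdexcept>

int checked_factorial(int n) {
    if (n < 0) throw std::invalid_argument("factorial of a negative number");
    if (n > 12) throw std::overflow_error("factorial overflows 32-bit int");  // 13! > INT_MAX
    return (n > 1) ? n * checked_factorial(n - 1) : 1;
}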
While the author is asking what is technically wrong with the recursive function computing the Fibonacci numbers, I would like to note that the recursive function outlined above will solve the task in time growing exponentially with n. I do not want to discourage creating perfect C++ from it; however, it is known that the task can be computed faster. This is less important for small n. You will need to refresh your matrix multiplication knowledge in order to understand the explanation. Consider the evaluation of powers of this matrix:
power n = 1    | 1 1 |
               | 1 0 |   = M^1

power n = 2    | 1 1 |   | 1 1 |   | 2 1 |
               | 1 0 | * | 1 0 | = | 1 1 |   = M^2

power n = 3    | 2 1 |   | 1 1 |   | 3 2 |
               | 1 1 | * | 1 0 | = | 2 1 |   = M^3

power n = 4    | 3 2 |   | 1 1 |   | 5 3 |
               | 2 1 | * | 1 0 | = | 3 2 |   = M^4
Do you see that the matrix elements of the result resemble the Fibonacci numbers? Continue
power n = 5    | 5 3 |   | 1 1 |   | 8 5 |
               | 3 2 | * | 1 0 | = | 5 3 |   = M^5
Your guess is right (it can be proved by mathematical induction; try it, or just use the result):
power n    | 1 1 |^n          | F(n + 1)  F(n)     |
           | 1 0 |    = M^n = | F(n)      F(n - 1) |
When multiplying the matrices, apply at least the so-called "exponentiation by squaring". As a reminder:
M^n = M * (M^2)^((n - 1) / 2)   if n is odd  (n % 2 != 0)
M^n = (M^2)^(n / 2)             if n is even (n % 2 == 0)
Without this, your implementation will have the time behaviour of an iterative procedure (which is still better than exponential growth). Add your lovely C++ and it will give a decent result. However, since there is no limit to perfection, this too can be improved. In particular, there is the so-called "fast doubling" for the Fibonacci numbers. It has the same asymptotic complexity but a better constant factor in the dependence on time. When one says O(N) or O(N^2), the actual constant coefficients determine further differences; one O(N) can still be better than another O(N).
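To make this concrete, here is a compact sketch of the matrix approach (exponentiation by squaring, O(log n) matrix multiplications; long long overflows for n > 92):

#include <iostream>

// 2x2 matrix {a, b, c, d} representing | a b |
//                                      | c d |
struct M2 { long long a, b, c, d; };

M2 mul(const M2& x, const M2& y) {
    return { x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
             x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d };
}

// M^n by exponentiation by squaring: O(log n) multiplications.
M2 mpow(M2 m, unsigned n) {
    M2 r = {1, 0, 0, 1};                // identity matrix
    while (n > 0) {
        if (n % 2 != 0) r = mul(r, m);  // odd case: fold one factor of m in
        m = mul(m, m);                  // square
        n /= 2;
    }
    return r;
}

long long fib(unsigned n) {
    // M^n = | F(n+1) F(n)   |, so F(n) is the off-diagonal element.
    //       | F(n)   F(n-1) |
    return mpow({1, 1, 1, 0}, n).b;
}

int main() {
    for (unsigned n = 0; n <= 10; ++n)
        std::cout << fib(n) << " ";     // prints: 0 1 1 2 3 5 8 13 21 34 55
    std::cout << "\n";
}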