combinatorial exercise in C++ seeing speed decrement vs VBA

Just for recreation I thought I'd model a combinatorial problem with a Monte Carlo implementation. I did an implementation in VBA, and then as an exercise I thought I'd try to write it in C++ (I am a complete novice) to check speed differences etc. Other than my not knowing advanced coding techniques/tricks, I had naively thought that as long as the model was faithfully transferred to C++, with mirroring functions/loops/variable types as far as possible, then minor tweaks aside, the power of C++ would give me an immediate speed improvement, as I am running a lot of sims with lots of embedded sorting. Well, quite the opposite is occurring, so there must be something seriously wrong with the C++ implementation, which is about half as fast at best depending on parameters. They both converge to the same answer, so I am happy mathematically that they work.
The problem:
Suppose you have N days in which to allocate k exams randomly, with, e.g., 2 exam slots per day (AM/PM). What is the probability that, say, 2 days are full exam days? I think I have a closed form for this, which I believe for now; anyway, I wanted to test it with MC.
Algorithm Heuristic:
Quite simply, say we have 18 days, 6 exams, 2 slots a day, and we want to know the probability we'll have 2 full days.
(i) simulate 6 uniforms U_i
(ii) allocate slots to the exams by drawing randomly amongst the remaining slots using the uniforms, adjusting for slots already allotted. As an example, if Exam 4 got allocated slot 4 in 34-slot space but 3 and 5 were already taken, then in 36-slot space Exam_4 would be allotted slot 6 (that would be the first free slot after rebasing). I have implemented this with some embedded sorting (in VB bubblesort/quicksort makes a negligible difference; so far in C++ I'm just using bubblesort).
(iii) just convert the slots into days, then count the sims that hit the target.
Phew - that's just there for background. The spirit of this is not really to optimise the algorithm, just to help me understand what I've done wrong to make it so much slower when 'mirrored' in C++!!
The Code!
// monte carlo
#include "stdafx.h"
#include "AllocateSlots.h"
#include <vector>
#include <string>
#include <iostream>
#include <cmath>
#include <ctime>
#include <cstdlib> // for system(), rand(), srand()
using namespace std;

int main()
{
    int i, j, k, m;
    int days, exams, slotsperday, filledslotsperday, targetfulldays, filleddays;
    long sims, count, simctr;
    cout << "Days?: "; cin >> days;
    cout << "Exams?: "; cin >> exams;
    cout << "Slots Per Day?: "; cin >> slotsperday;
    cout << "Filled Slots?: "; cin >> filledslotsperday;
    cout << "Target Full Days?: "; cin >> targetfulldays;
    cout << "No. of sims?: "; cin >> sims;
    system("PAUSE");

    // timer
    clock_t start;
    start = clock();

    double randomvariate;
    // define intervals for remaining slots
    vector<double> interval(exams);
    int totalslots = (days * slotsperday);
    for (k = 1; k <= exams; k++)
    {
        interval[k - 1] = 1 / (static_cast<double>(totalslots - k + 1));
    }

    vector<int> slots(exams);         // allocated slots
    vector<int> previousslots(exams); // previously allocated slots
    vector<int> slotdays(exams);      // days on which slots fall

    srand((int) time(0)); // generates seed from current system time
    count = 0;
    for (simctr = 1; simctr <= sims; simctr++)
    {
        vector<int> daycounts(days); // initialised at 0
        for (i = 1; i <= exams; i++)
        {
            // rand() generates integers in [0, RAND_MAX] (32767 with MSVC)
            randomvariate = (static_cast<double>(rand() + 1)) / (static_cast<double>(RAND_MAX + 1));
            j = 1;
            while (j <= totalslots - i + 1)
            {
                if (randomvariate < j * interval[i - 1]) break;
                j++;
            }
            slots[i - 1] = j;
        }
        for (i = 2; i <= exams; i++)
        {
            previousslots.resize(i - 1);
            for (m = 1; m <= i - 1; m++)
            {
                previousslots[m - 1] = slots[m - 1];
            }
            BubbleSort(previousslots);
            for (k = 1; k <= i - 1; k++)
            {
                if (slots[i - 1] >= previousslots[k - 1])
                {
                    slots[i - 1]++;
                }
            }
        }
        // convert slots into days
        for (i = 1; i <= exams; i++)
        {
            slotdays[i - 1] = SlottoDays(slots[i - 1], slotsperday);
        }
        // calculate the filled days
        filleddays = 0;
        for (j = 1; j <= days; j++)
        {
            for (k = 1; k <= exams; k++)
            {
                if (slotdays[k - 1] == j)
                {
                    daycounts[j - 1]++;
                }
            }
            if (daycounts[j - 1] == filledslotsperday)
            {
                filleddays++;
            }
        }
        // check if target is hit
        if (filleddays == targetfulldays)
        {
            count++;
        }
    }
    cout << count << endl;
    cout << "Time: " << (clock() - start) / (double)(CLOCKS_PER_SEC) << " s" << endl;
    //cout << (static_cast<double>(count)) / (static_cast<double>(sims));
    system("PAUSE");
    return 0;
}
And the 2 ancillary functions:
#include "stdafx.h"
#include"AllocateSlots.h"
#include<iostream>
#include<cmath>
#include<vector>
using namespace std;
//returns day for a given slot
int SlottoDays(int &examslot, int &slotsperday)
{
return((examslot % slotsperday == 0) ? examslot/ slotsperday: examslot/ slotsperday + 1);
}
//BubbleSort Algorithm
vector <int> BubbleSort(vector <int> &values)
{
int i;
int j;
int tmpSort;
int N = values.size();
for (i = 0; i < N;i++)
{
for (j = i + 1; j < N; j++)
{
if (values[i] > values[j])
{
tmpSort = values[j];
values[j] = values[i];
values[i] = tmpSort;
}
}
}
return values;
}
So there it is - like I say, the algorithm is common to the C++ and the VBA; happy to post the VBA, but in the first instance I just wondered if there was anything glaringly obvious in the above. This is pretty much the first time I have done this (used vectors etc.), unaided and self-'taught', so I have definitely screwed something up even though I have managed to make it run by some miracle! I'd be very grateful for some words of wisdom - I'm trying to teach myself C++ with exercises like this, but what I really want to get to is speed (and mathematical accuracy, of course!) for much larger projects.
FYI, in my example of 18 days, 6 exams, 2 slots per day, and 2 days to get filled, it should converge to about 3.77%, which it does: 1 million sims take 38s in VBA and 145s in the implementation above, on my dual-core 2.7GHz i7, 4GB RAM laptop on x64 Windows 7.

From the discussion in comments it sounds like you may be running your program in Debug mode. This turns off a number of optimisations and even generates some extra code.
To run in Release mode look for the Solution Configurations drop down in the Standard Tool Bar and use the drop down to change from Debug to Release.
Then rebuild your Solution and rerun your tests.
To explore program performance further in Visual Studio you'll want to use the Performance Profiler tool. There's a tutorial on using the Performance Profiler (including a video) on the Microsoft documentation site: Profile application performance in Visual Studio. There's also Quickstart: First look at profiling tools and a whole bunch more: all under the Profiling in Visual Studio section.
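If you are unsure which configuration a given binary was actually built with, here is a minimal sketch that checks the relevant macros at compile time (_DEBUG is MSVC-specific; NDEBUG is the standard macro that disables assert in release-style builds):

#include <iostream>

int main() {
#ifdef _DEBUG
    std::cout << "Built in Debug mode (MSVC defines _DEBUG)\n";
#elif defined(NDEBUG)
    std::cout << "NDEBUG is defined: assertions are off, typical of Release builds\n";
#else
    std::cout << "Neither _DEBUG nor NDEBUG detected\n";
#endif
    return 0;
}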

Related

Iterator dereferencing costs a huge amount of time

I solved a problem with set operations like upper_bound, iterator dereferencing, etc. It runs in around 20 seconds. The general problem: I iterate over the numbers of the form i*(i-1)/2 while they are less than 2*10^5, and fill in a DP vector. In my algorithm, for each number x I get the upper_bound "up", then iterate over the numbers from the beginning until reaching "up". The reference solution does the same, but instead of calling upper_bound and dereferencing iterators it directly computes i*(i-1)/2, which I previously calculated and stored in vset. The number of operations for both algorithms is almost the same, around 80*10^6, which is not a super big number. But my code takes 20 seconds, while the solution needs 2 seconds.
Please look at my code and let me know if you need more information about this:
1- vset has 600 numbers, which are all the numbers of the form i*(i-1)/2 less than 2*10^5.
2- vset is already sorted, as it is increasing.
3- The final vector "v" is exactly the same in both algorithms.
4- cnt, the number of operations, is almost the same for both: around 80,000,000.
5- You can test the codes with n = 199977.
6- On my computer (Core i7, 32G RAM) it takes 20 seconds; on the server it was accepted at around 200 milliseconds, which is very strange to me.
#include <iostream>
#include <vector>
#include <climits>
using namespace std;

typedef long long int llint;

int main()
{
    int n; cin >> n;
    vector<llint> v(n + 1, INT_MAX);
    llint p = 1;
    llint node = 2;
    llint cnt = 0;
    for (int i = 1; i <= n; i++)
    {
        if (v[i] == INT_MAX)
        {
            for (int s = 1; (s * (s - 1)) / 2 <= i; ++s)
                v[i] = min(v[i], v[i - (s * (s - 1)) / 2] + s), cnt++;
        }
        else cnt++;
    }
    cout << cnt << endl; // works in less than 2 seconds
    return 0;
}
The second snippet (my code) takes 20 seconds:
#include <iostream>
#include <vector>
#include <algorithm>
#include <climits>
using namespace std;

typedef long long int llint;

int main()
{
    int n; cin >> n;
    vector<llint> v(n + 1, INT_MAX);
    llint p = 1;
    llint node = 2;
    vector<int> vset;
    while (p <= n) // only 600 numbers
    {
        v[p] = node;
        vset.push_back(p);
        node++;
        p = node * (node - 1) / 2;
    }
    llint cnt = 0;
    for (int i = 1; i <= n; i++)
    {
        if (v[i] == INT_MAX)
        {
            auto up = upper_bound(vset.begin(), vset.end(), i);
            for (auto it = vset.begin(); it != up; it++) // at most 600 iterations
            {
                cnt++;
                int j = *it;
                v[i] = min(v[j] + v[i - j], v[i]);
            }
        }
        else cnt++;
    }
    cout << cnt << endl; // cnt for both is around 84,000,000
    return 0;
}
So the question is about something I cannot figure out: which operation(s) here are expensive?
Going through the iterator? Dereferencing the iterator? There is no other difference, but the time is TEN TIMES more. Thanks.
Thanks to all the guys who commented and helped me figure out the issue. I realized that the reason I had slow performance was Debug mode. I changed it to Release mode and it works in less than 2 seconds. There is a similar question that may help you more. I used Visual Studio C++ on Windows 10.
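Note that MSVC Debug builds also enable checked iterators (iterator debugging), which makes iterator-heavy code like the upper_bound loop disproportionately slower than plain index arithmetic. If you want to time a section of code independently of the IDE, a minimal sketch using std::chrono (the dummy workload below is just a placeholder for the loop being measured):

#include <chrono>
#include <iostream>

int main() {
    auto t0 = std::chrono::steady_clock::now();

    volatile long long sink = 0;                     // dummy workload; put the DP loop here
    for (int i = 0; i < 100000000; ++i) sink += i;

    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    return 0;
}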

Why is 1 for-loop slower than 2 for-loops in a problem related to a prefix sum matrix?

I was recently doing this problem, taken directly and translated from day 1, task 3 of IOI 2010, "Quality of life", when I encountered a weird phenomenon.
I was setting up a 0-1 matrix and using that to calculate a prefix sum matrix in 1 loop:
for (int i = 1; i <= m; i++)
{
    for (int j = 1; j <= n; j++)
    {
        if (a[i][j] < x) { lower[i][j] = 0; } else { lower[i][j] = 1; }
        b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
    }
}
and I got TLE (time limit exceeded) on 4 tests (the time limit is 2.0s). Meanwhile, using 2 for-loops separately:
for (int i = 1; i <= m; i++)
{
    for (int j = 1; j <= n; j++)
    {
        if (a[i][j] < x) { lower[i][j] = 0; } else { lower[i][j] = 1; }
    }
}
for (int i = 1; i <= m; i++)
{
    for (int j = 1; j <= n; j++)
    {
        b[i][j] = b[i-1][j] + b[i][j-1] - b[i-1][j-1] + lower[i][j];
    }
}
got me full AC (accepted).
As we can see from the 4 pictures here:
TLE result, picture 1 : https://i.stack.imgur.com/9o5C2.png
TLE result, picture 2 : https://i.stack.imgur.com/TJwX5.png
AC result, picture 1 : https://i.stack.imgur.com/1fo2H.png
AC result, picture 2 : https://i.stack.imgur.com/CSsZ2.png
the 2-for-loops code generally ran a bit faster (even on accepted test cases), contradicting my logic that the single for-loop should be quicker. Why did this happen?
Full code (AC): https://pastebin.com/c7at11Ha (Please ignore all the nonsense bits and stuff like using namespace std;, as this is from a competitive programming contest).
Note : The judge server, lqdoj.edu.vn is built on dmoj.ca, a global competitive programming contest platform.
If you look at the assembly, you'll see the source of the difference:
Single loop:
{
    if (a[i][j] < x)
    {
        lower[i][j] = 0;
    }
    else
    {
        lower[i][j] = 1;
    }
    b[i][j] = b[i-1][j]
            + b[i][j-1]
            - b[i-1][j-1]
            + lower[i][j];
}
In this case, there's a data dependency. The assignment to b depends on the value from the assignment to lower. So the operations go sequentially in the loop - first assignment to lower, then to b. The compiler can't optimize this code significantly because of the dependency.
Separation of assignments into 2 loops:
The assignment to lower is now independent, and the compiler can use SIMD instructions, which leads to a performance boost in the first loop. The second loop's assembly stays more or less the same as the original.
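To make that concrete, here is a standalone sketch of the kind of first pass the compiler can vectorize; it mirrors the question's first loop (the function name and the vector-of-vectors layout are my own illustration, not the poster's code):

#include <vector>

// The first pass has no dependency between iterations, so the compiler
// is free to process several j values at once with SIMD compares/stores.
void build_lower(const std::vector<std::vector<int>>& a,
                 std::vector<std::vector<int>>& lower,
                 int m, int n, int x)
{
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            lower[i][j] = (a[i][j] >= x) ? 1 : 0; // comparison instead of if/else
}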

Why is this C++ program slower than Node.js equivalent?

I'm learning C++ and decided to remake an old Node.js program to see how much faster it would be, as C++ should to my knowledge be a lot faster due to being compiled.
This program is very simple; it just finds prime numbers. It uses the exact same logic as my Node.js program, but it takes 8 to 9 seconds, whereas the Node.js version took only 4 to 5 seconds.
#include <iostream>
#include <string>
#include <cmath> // for sqrt
#include <ctime>
using namespace std;

// Declare functions
int main();
bool isPrime(int num);
bool is6n_1(int num);

// Define variables
int currentNum = 5; // Start at 5 so we iterate over odd numbers; we add 2 and 3 manually
int primesToFind = 1000000;
int primesFound = 2;
int* primes = NULL;

// Main
int main() {
    // Create dynamic memory primes array
    primes = new int[1000000];
    primes[0] = 2;
    primes[1] = 3;
    cout << "Finding primes..." << endl;
    time_t start_time = time(NULL);
    // Main execution loop
    for (; primesFound < primesToFind; currentNum += 2) {
        if (isPrime(currentNum)) {
            primes[primesFound] = currentNum;
            primesFound++;
        }
    }
    time_t end_time = time(NULL);
    cout << "Finished" << endl;
    cout << end_time - start_time << endl;
    return 0;
}

// Check whether a number is prime
// Dependent on primes[]
bool isPrime(int num) {
    // We divide it by every previous prime number smaller than the sqrt
    // and check the remainder
    for (int i = 1; i <= sqrt(num) && i < primesFound; i++) { // Start i at 1 to skip the first unnecessary modulo with 2
        if (num % primes[i] == 0) { // because we increment by 2
            return false;
        }
    }
    return true;
}
Because I'm so new to C++, I don't know if this is due to inefficient code (probably) or to some settings in the compiler or the Visual Studio IDE.
I'm using Visual Studio 2019 Community, a Release build, and the x64 architecture with O2 optimization.
How can I make this program faster?
Regarding compiler settings, I can only say:
Use x64, not x86 (32-bit) as a target
Use Release as configuration, not Debug
Enable aggressive optimizations
(Edit) It seems you already have these settings in the compiler, so there should be nothing obvious left to do regarding the compiler.
Also, there is probably a lot of optimization possible, because it does not seem that you are using the Sieve of Eratosthenes (a sketch follows below). You could then further skip all multiples of two, and increase the step size to three.
And you would of course have to provide the node.js code. I am almost certain that it does not use the exact same logic.
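For illustration, a minimal sketch of a classic Sieve of Eratosthenes (my example, not the poster's code). To collect the first 1,000,000 primes you would sieve up to 15,485,863, which is the 1,000,000th prime:

#include <iostream>
#include <vector>

int main() {
    const int limit = 15485863;                    // the 1,000,000th prime
    std::vector<bool> composite(limit + 1, false);
    std::vector<int> primes;
    primes.reserve(1000000);
    for (int i = 2; i <= limit; ++i) {
        if (!composite[i]) {
            primes.push_back(i);
            for (long long j = 1LL * i * i; j <= limit; j += i)
                composite[j] = true;               // strike out multiples of i
        }
    }
    std::cout << primes.size() << " primes found, largest = "
              << primes.back() << '\n';
    return 0;
}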

optimizing code: fibonacci algorithm

I'm working on a Fibonacci algorithm for really big numbers (the 100,000th number). I need to make it run faster, if only by a couple of seconds, but I have run out of ideas. Is there any way to make it faster? Thanks for the help.
#include <iostream>
#include <string>
using namespace std;

int main() {
    string elem_major = "1";
    string elem_minor = "0";
    short elem_maj_int;
    short elem_min_int;
    short sum;
    int length = 1;
    int ten = 0;
    int n;
    cin >> n;
    for (int i = 1; i < n; i++)
    {
        for (int j = 0; j < length; j++)
        {
            elem_maj_int = short(elem_major[j] - 48);
            elem_min_int = short(elem_minor[j] - 48);
            sum = elem_maj_int + elem_min_int + ten;
            ten = 0;
            if (sum > 9)
            {
                sum -= 10;
                ten = 1;
                if (elem_major[j + 1] == NULL)
                {
                    elem_major += "0";
                    elem_minor += "0";
                    length++;
                }
            }
            elem_major[j] = char(sum + 48);
            elem_minor[j] = char(elem_maj_int + 48);
        }
    }
    for (int i = length - 1; i >= 0; i--)
    {
        cout << elem_major[i];
    }
    return 0;
}
No matter how good the optimizations you perform on a given piece of code, without changing the underlying algorithm you can only improve it marginally. Your approach has linear complexity, and for big values it will quickly become slow. A faster way to compute Fibonacci numbers is to do exponentiation by squaring on the matrix:
0 1
1 1
This approach has logarithmic complexity, which is asymptotically better. Compute a few powers of this matrix and you'll notice that the (n+1)-st Fibonacci number is in its lower right corner.
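As a minimal sketch of the idea using plain 64-bit integers (my illustration; results overflow beyond F(93), so the really big numbers in the question would still need a bigint type):

#include <cstdint>
#include <iostream>

// 2x2 matrix [[a, b], [c, d]]; for M = [[0,1],[1,1]], M^n = [[F(n-1),F(n)],[F(n),F(n+1)]].
struct Mat { std::uint64_t a, b, c, d; };

Mat mul(Mat x, Mat y) {
    return { x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
             x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d };
}

std::uint64_t fib(unsigned n) {       // F(0) = 0, F(1) = 1
    Mat r = { 1, 0, 0, 1 };           // identity
    Mat m = { 0, 1, 1, 1 };
    for (; n; n >>= 1) {              // exponentiation by squaring
        if (n & 1) r = mul(r, m);
        m = mul(m, m);
    }
    return r.b;                       // the F(n) entry of M^n
}

int main() {
    std::cout << fib(90) << '\n';     // 2880067194370816120
}

This does O(log n) matrix multiplications instead of n additions; with a bigint type plugged in for uint64_t the same structure carries over.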
I suggest you use something like cpp-bigint (http://sourceforge.net/projects/cpp-bigint/) for your big numbers.
The code would then look like this:
#include <iostream>
#include <utility> // for std::move
#include "bigint.h"
using namespace std;

int main() {
    BigInt::Rossi num1(0);
    BigInt::Rossi num2(1);
    BigInt::Rossi num_next(1);
    int n = 100000;
    for (int i = 0; i < n - 1; ++i)
    {
        num_next = num1 + num2;
        num1 = std::move(num2);
        num2 = std::move(num_next);
    }
    cout << num2.toStrDec() << endl; // num2 holds the result; num_next was moved from
    return 0;
}
Quick benchmark on my machine:
time ./yourFib
real 0m8.310s
user 0m8.301s
sys 0m0.005s
time ./cppBigIntFib
real 0m2.004s
user 0m1.993s
sys 0m0.006s
I would save some precomputed points (especially since you are looking for really big numbers), i.e., say I saved the 500th and 501st Fibonacci numbers. Then if someone asks me for the 600th Fibonacci number, I would start computing from 502 rather than from 1. This would really save time.
Now the question is: how many points would you save, and how would you select the points to save?
The answer totally depends on the application and the probable distribution of queries.
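A minimal sketch of that checkpoint idea (my illustration; with plain uint64_t it only reaches F(93), and a real version for big numbers would store bigints instead):

#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

// Checkpoints: index -> (F(index), F(index+1)).
static const std::map<unsigned, std::pair<std::uint64_t, std::uint64_t>> checkpoints = {
    { 0,  { 0ULL, 1ULL } },
    { 50, { 12586269025ULL, 20365011074ULL } },      // F(50), F(51)
};

std::uint64_t fib(unsigned n) {
    auto it = checkpoints.upper_bound(n);            // first checkpoint beyond n
    --it;                                            // safe: checkpoint 0 always exists
    unsigned i = it->first;
    std::uint64_t a = it->second.first, b = it->second.second; // F(i), F(i+1)
    for (; i < n; ++i) {                             // iterate up from the checkpoint
        std::uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}

int main() {
    std::cout << fib(60) << '\n';                    // 1548008755920, computed from F(50)
}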

What operations are time consuming and how to avoid them?

I am fairly new to C++ and I want to learn how to optimize the speed of my programs. I am currently working on a program that computes the perfect numbers from 1 to 1,000,000. A perfect number is one where the sum of its proper divisors is equal to the number itself. E.g., 28 is a perfect number because 1+2+4+7+14=28. Below is my code:
#include <iostream>
#include <cstdlib> // for system()
using namespace std;

int main() {
    int a = 1000000;
    for (int i = 1; i <= a; ++i)
    {
        int sum = 0;
        int q;
        // The biggest proper divisor is at most half the number itself
        if (i % 2 == 0) q = i / 2;
        else q = (i + 1) / 2;
        for (int j = 1; j <= q; ++j)
        {
            if (i % j == 0) sum += j;
        }
        // Condition for perfect number
        if (sum == i) cout << i << " is a perfect number!" << endl;
    }
    system("pause");
    return 0;
}
What operations in this code are time consuming? How can I improve the speed of the program? In general, how do I learn which operations are time consuming and how to avoid them?
The only way to really know what operations are time consuming and are limiting the execution speed of your program is to run the program through a profiler. This tool will tell you where each second of the execution time was spent (on a function call basis, usually).
To answer your question specifically: the most time in this program will be spent at this line:
system("pause");
because, aside from the fact that this is a horrible line of code you should get rid of, it is actually user input, and as we all know, the thing between the chair and the screen is what slows things down.
You may trade off computation for memory consumption with the following:
#include <iostream>
#include <vector>

int main() {
    const int max = 1000000;
    std::vector<std::size_t> a(max);
    for (std::size_t i = 1; i != a.size(); ++i) {
        for (std::size_t j = 2 * i; j < a.size(); j += i) {
            a[j] += i; // accumulate i as a proper divisor of each multiple j
        }
    }
    for (std::size_t i = 1; i != a.size(); ++i) {
        if (a[i] == i) {
            std::cout << i << " is a perfect number!" << std::endl;
        }
    }
    return 0;
}
Live example
Branches: ifs, loops, function calls and goto are costly. They tend to distract the processor from performing data transfers and math operations.
For example, you can eliminate the if statement:
q = (i + (i % 2)) / 2; // This expression is not simplified.
Research loop unrolling, although the compiler may perform this on higher optimization settings.
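For illustration, a standalone sketch of the divisor-candidate loop from the question, unrolled by hand (my example; at /O2 the compiler may already do something similar):

// Sum the divisors of i among candidates 1..q, four candidates per iteration.
int divisor_sum_unrolled(int i, int q)
{
    int sum = 0;
    int j = 1;
    for (; j + 3 <= q; j += 4) {           // unrolled by 4
        if (i % j == 0)       sum += j;
        if (i % (j + 1) == 0) sum += j + 1;
        if (i % (j + 2) == 0) sum += j + 2;
        if (i % (j + 3) == 0) sum += j + 3;
    }
    for (; j <= q; ++j)                    // handle the 0-3 leftover candidates
        if (i % j == 0) sum += j;
    return sum;
}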
I/O operations are costly, especially using formatted I/O.
Try this:
if(sum == i)
{
static const char text[] = " is a perfect number!\n";
cout << i;
cout.write(text, sizeof(text) - 1); // Output as raw data.
}
Division and modulo operations are costly.
If you can divide by a power of 2, you can convert the division into a shift right.
You may be able to avoid modulo operations by using binary AND.
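For example (my illustration; these identities hold for unsigned values and power-of-two divisors):

unsigned x = 1234;
unsigned quotient  = x >> 3;  // same as x / 8 for unsigned values
unsigned remainder = x & 7;   // same as x % 8 for unsigned values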
My rules of thumb:
conditional branches (i.e. comparisons) are costly
divisions are costly (as well as modulo %)
prefer integer operations over floating-point
How to avoid them? Well, there is no simple answer. In many cases you just can't.
You avoid conditional branches by using branchless expressions, or by improving the program logic (see the sketch below).
You avoid divisions by using shifts or lookup tables, or by rewriting expressions (when possible).
You avoid floating-point by simulating fixed-point.
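As one concrete illustration tied to the question's inner loop (my sketch): the branch can be folded into arithmetic, although the modulo itself remains:

// Branchless form of: if (i % j == 0) sum += j;
// The comparison evaluates to 1 for divisors and 0 otherwise.
sum += j * (i % j == 0);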
In the given example, you have to focus on the body of the innermost loop. That is the line that is most frequently executed (about 125,000,000,000 times vs. 1,000,000 for the others). Unfortunately, it contains a comparison and a division, which are hard to remove.
Optimizations of other parts of the code will have no measurable effect. In particular, don't worry about the cout statement: it will be called 4 times in total.