iterator dereferencing cost a huge time

I solved a problem using set-style operations such as upper_bound and iterator dereferencing, and my solution takes around 20 seconds. The general problem: I iterate over the numbers of the form i*(i-1)/2 that are less than 2 * 10^5 and use them to fill a DP vector. In my algorithm, for each number "x" I get the upper_bound, "up", and then iterate over the numbers from the beginning until I reach "up". The reference solution does the same thing, but instead of running upper_bound and dereferencing an iterator it computes i*(i-1)/2 directly, which I had calculated in advance and stored in vset. The number of operations in both algorithms is almost the same, around 80*10^6, which is not a very big number. Yet my code takes 20 seconds, while the reference solution needs 2 seconds.
Please look at my code and let me know if you need more information:
1- vset has 600 numbers: all numbers of the form i*(i-1)/2 that are less than 2*10^5
2- vset is already sorted, because it is built in increasing order
3- the final vector "v" is exactly the same in both algorithms
4- cnt, the number of operations, is almost the same for both: about 80,000,000
5- you can test the code with n = 199977
6- On my computer (Core i7, 32 GB RAM) it takes 20 seconds, while on the server it is accepted in around 200 milliseconds, which is very strange to me.
typedef long long int llint;
int n; cin >> n;
vector<llint> v(n+1, INT_MAX);
llint p = 1;
llint node = 2;
llint cnt = 0;
for (int i = 1; i <= n; i++)
{
    if (v[i] == INT_MAX)
    {
        for (int s = 1; (s * (s - 1)) / 2 <= i; ++s)
            v[i] = min(v[i], v[i - (s * (s - 1)) / 2] + s) , cnt++;
    }
    else cnt ++ ;
}
cout << cnt << endl; // works in less than 2 seconds
The second solution takes 20 seconds:
typedef long long int llint;
int n; cin >> n;
vector<llint> v(n+1, INT_MAX);
llint p = 1;
llint node = 2;
vector<int> vset;
while (p <= n) // only 600 numbers
{
    v[p] = node;
    vset.push_back(p);
    node++;
    p = node * (node - 1) / 2;
}
llint cnt = 0;
for (int i = 1; i <= n; i++)
{
    if (v[i] == INT_MAX)
    {
        auto up = upper_bound(vset.begin(), vset.end(), i);
        for (auto it = vset.begin(); it != up; it++) // at most 600 iterations
        {
            cnt++;
            int j = *it;
            v[i] = min(v[j] + v[i - j], v[i]);
        }
    }
    else cnt ++ ;
}
cout << cnt << endl; // cnt for both is around 84,000,000
So the question is something I cannot figure out: which operation(s) here are expensive?
Going through the iterator? Dereferencing the iterator? There is no other difference between the two versions, but the time is TEN TIMES longer. Thanks.

Thanks to everyone who commented and helped me figure out the issue. The reason for the slow performance was Debug mode. After I switched to Release mode, the code runs in less than 2 seconds. There is a similar question that may help you more. I used Visual Studio C++ on Windows 10.
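To check a finding like this on your own machine, it helps to time the same loop under both build configurations. The following is a minimal sketch, not the asker's exact program: count_operations and time_ms are hypothetical helper names, and the loop body is copied from the slow version in the question. That Debug builds pay extra for checked iterators and disabled inlining is a plausible explanation, but it is an assumption here, not something measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <climits>
#include <vector>

// Hypothetical helper: the slow loop from the question, returning cnt.
long long count_operations(int n) {
    assert(n >= 1);
    std::vector<long long> v(n + 1, INT_MAX);
    std::vector<int> vset;
    long long p = 1, node = 2;
    while (p <= n) {                 // numbers of the form node*(node-1)/2
        v[p] = node;
        vset.push_back(static_cast<int>(p));
        ++node;
        p = node * (node - 1) / 2;
    }
    long long cnt = 0;
    for (int i = 1; i <= n; ++i) {
        if (v[i] == INT_MAX) {
            auto up = std::upper_bound(vset.begin(), vset.end(), i);
            for (auto it = vset.begin(); it != up; ++it) {
                ++cnt;
                int j = *it;
                v[i] = std::min(v[j] + v[i - j], v[i]);
            }
        } else {
            ++cnt;
        }
    }
    return cnt;
}

// Hypothetical helper: wall-clock milliseconds for one call.
double time_ms(int n) {
    auto start = std::chrono::steady_clock::now();
    volatile long long sink = count_operations(n);  // keep the call alive
    (void)sink;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms(199977) once in a Debug build and once in a Release build should make the gap visible directly.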

Related

Minimum Cost to reduce the size of array to 1

Given an array of N numbers (not necessarily sorted). We can merge any two numbers into one and the cost of merging the two numbers is equal to the sum of the two values. The task is to find the total minimum cost of merging all the numbers.
Example:
Let the array A = [1,2,3,4]
Then, we can remove 1 and 2, add them, and put the sum back in the array. The cost of this step would be (1+2) = 3.
Now, A = [3,3,4], Cost = 3
In the second step, we can remove 3 and 3, add them, and put the sum back in the array. The cost of this step would be (3+3) = 6.
Now, A = [4,6], Cost = 6
In the third step, we can remove both remaining elements and put their sum back in the array. The cost of this step would be (4+6) = 10.
Now, A = [10], Cost = 10
So, total cost turns out to be 19 (10+6+3).
We will have to pick the 2 smallest elements to minimize our total cost. A simple way to do this is using a min heap structure. We will be able to get the minimum element in O(1) and insertion will be O(log n).
The time complexity of this approach is O(n log n).
But I tried another approach, and wasn't able to find the cases where it fails. The basic idea was that the sum of two smallest elements that we will choose at any time will always be greater than the sum of the pair of elements chosen before. So the "temp" array will always be sorted, and we will be able to access the minimum elements in O(1).
As I am sorting the input array and then simply traversing the array, the complexity of my approach is O(n log n).
int minCost(vector<int>& arr) {
    sort(arr.begin(), arr.end());
    // temp array will contain the sum of all the pairs of minimum elements
    vector<int> temp;
    // index for arr
    int i = 0;
    // index for temp
    int j = 0;
    int cost = 0;
    // while we have more than 1 element combined in both the input and temp array
    while(arr.size() - i + temp.size() - j > 1) {
        int num1, num2;
        // selecting num1 (minimum element)
        if(i < arr.size() && j < temp.size()) {
            if(arr[i] <= temp[j])
                num1 = arr[i++];
            else
                num1 = temp[j++];
        }
        else if(i < arr.size())
            num1 = arr[i++];
        else if(j < temp.size())
            num1 = temp[j++];
        // selecting num2 (second minimum element)
        if(i < arr.size() && j < temp.size()) {
            if(arr[i] <= temp[j])
                num2 = arr[i++];
            else
                num2 = temp[j++];
        }
        else if(i < arr.size())
            num2 = arr[i++];
        else if(j < temp.size())
            num2 = temp[j++];
        // appending the sum of the minimum elements in the temp array
        int sum = num1 + num2;
        temp.push_back(sum);
        cost += sum;
    }
    return cost;
}
Is this approach correct? If not, please let me know what I am missing, and the test cases in which this algorithm fails.
SPOJ Link for the same problem

The logic seems very solid to me... all the computed sums will never be decreasing and therefore you only need to add up either oldest two computed sums, next two elements or oldest sum and next element.
I would just simplify the code:
#include <vector>
#include <algorithm>
#include <stdio.h>
int hsum(std::vector<int> arr) {
    int ni = arr.size(), nj = 0, i = 0, j = 0, res = 0;
    std::sort(arr.begin(), arr.end());
    std::vector<int> temp;
    auto get = [&]()->int {
        if (j == nj || (i < ni && arr[i] < temp[j])) return arr[i++];
        return temp[j++];
    };
    while ((ni-i)+(nj-j)>1) {
        int a = get(), b = get();
        res += a+b;
        temp.push_back(a + b); nj++;
    }
    return res;
}
int main() {
    fprintf(stderr, "%i\n", hsum(std::vector<int>{1,4,2,3}));
    return 0;
}
Very nice idea!
Another improvement comes from noting that the combined number of unread elements in the two arrays (the original one and the temporary one holding the sums) decreases by one at every step.
Since the first step consumes two input elements and each step appends only one sum, a "walking queue" of sums stored at the front of the array itself can never catch up with the read pointer of the input.
This means there is no need for a temporary array: the space for the sums can be found in the array itself...
int hsum(std::vector<int> arr) {
    int ni = arr.size(), nj = 0, i = 0, j = 0, res = 0;
    std::sort(arr.begin(), arr.end());
    auto get = [&]()->int {
        if (j == nj || (i < ni && arr[i] < arr[j])) return arr[i++];
        return arr[j++];
    };
    while ((ni-i)+(nj-j)>1) {
        int a = get(), b = get();
        res += a+b;
        arr[nj++] = a + b;
    }
    return res;
}
About the error on SPOJ... I briefly tried to search for the problem but didn't succeed. However, I tried generating random arrays of random lengths and checking this solution against a "brute-force" one implemented directly from the specs, and I'm reasonably confident that the algorithm is correct.
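A randomized cross-check of this kind could look roughly like the sketch below. It is not the code that was actually used for the check: brute_force_cost and agree_on_random_inputs are hypothetical names, the brute force simply tries every pair at every step, and the array sizes and value ranges are arbitrary assumptions chosen to keep the exhaustive search cheap.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <random>
#include <vector>

// The in-place two-pointer algorithm from this answer, using long long.
long long two_queue_cost(std::vector<long long> arr) {
    int ni = static_cast<int>(arr.size()), nj = 0, i = 0, j = 0;
    long long res = 0;
    std::sort(arr.begin(), arr.end());
    auto get = [&]() -> long long {
        if (j == nj || (i < ni && arr[i] < arr[j])) return arr[i++];
        return arr[j++];
    };
    while ((ni - i) + (nj - j) > 1) {
        long long a = get(), b = get();
        res += a + b;
        arr[nj++] = a + b;
    }
    return res;
}

// Exhaustive reference: try every pair at every step (tiny inputs only).
long long brute_force_cost(const std::vector<long long>& a) {
    if (a.size() <= 1) return 0;
    long long best = LLONG_MAX;
    for (std::size_t x = 0; x < a.size(); ++x) {
        for (std::size_t y = x + 1; y < a.size(); ++y) {
            std::vector<long long> next;
            for (std::size_t k = 0; k < a.size(); ++k)
                if (k != x && k != y) next.push_back(a[k]);
            next.push_back(a[x] + a[y]);
            best = std::min(best, a[x] + a[y] + brute_force_cost(next));
        }
    }
    return best;
}

// Returns true if both functions agree on `trials` random small arrays.
bool agree_on_random_inputs(int trials, unsigned seed) {
    assert(trials >= 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> len(0, 6), val(1, 50);
    for (int t = 0; t < trials; ++t) {
        std::vector<long long> arr(len(rng));
        for (auto& x : arr) x = val(rng);
        if (two_queue_cost(arr) != brute_force_cost(arr)) return false;
    }
    return true;
}
```

The exhaustive search grows very quickly with the array length, which is why the sketch caps the length at 6.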
I know at least one programming arena (Topcoder) where problems are sometimes carefully crafted so that the computation gives correct results with unsigned but not with int (or with unsigned long long but not with long long) because of integer overflow.
I don't know if SPOJ also does this kind of nonsense(1)... maybe that is the reason some hidden test case fails...
EDIT
Checking with SPOJ the algorithm passes if using long long values... this is the entry I used:
#include <stdio.h>
#include <algorithm>
#include <vector>
int main(int argc, const char *argv[]) {
    int n;
    scanf("%i", &n);
    for (int testcase=0; testcase<n; testcase++) {
        int sz; scanf("%i", &sz);
        std::vector<long long> arr(sz);
        for (int i=0; i<sz; i++) scanf("%lli", &arr[i]);
        int ni = arr.size(), nj = 0, i = 0, j = 0;
        long long res = 0;
        std::sort(arr.begin(), arr.end());
        auto get = [&]() -> long long {
            if (j == nj || (i < ni && arr[i] < arr[j])) return arr[i++];
            return arr[j++];
        };
        while ((ni-i)+(nj-j)>1) {
            long long a = get(), b = get();
            res += a+b;
            arr[nj++] = a + b;
        }
        printf("%lli\n", res);
    }
    return 0;
}
PS: This very kind of computation is also what is needed to build a Huffman tree for entropy coding given the symbol frequency table, so it's not a mere random exercise but has practical applications.
(1) I'm saying "nonsense" because Topcoder never gives problems that require 65 bits; thus it's not genuine care about overflows, just setting traps for novices.
Another thing I think is bad practice that I saw on TC is that some problems are carefully designed so that the correct algorithm in C++ barely fits within the time limit: just use another language (and get, e.g., a 2× slowdown) and you cannot solve the problem.

First of all, think simple!
When using a priority queue, the problem is easy!
In the first test case :
1 6 3 20
// after pushing to Q
1 3 6 20
// and sum two top items and pop and push!
(1 + 3) 6 20 cost = 4
(4 + 6) 20 cost = 10 + 4
(10 + 20) cost = 30 + 14
30 cost = 44
#include<iostream>
#include<queue>
using namespace std;
int main()
{
    int t;
    cin >> t;
    while (t--) {
        int n;
        cin >> n;
        priority_queue<long long int, vector<long long int>, greater<long long int>> q;
        for (int i = 0; i < n; ++i) {
            int k;
            cin >> k;
            q.push(k);
        }
        long long int sum = 0;
        while (q.size() > 1) {
            long long int a = q.top();
            q.pop();
            long long int b = q.top();
            q.pop();
            q.push(a + b);
            sum += a + b;
        }
        cout << sum << "\n";
    }
}

Basically we need to sort the list in desc order and then find its cost like this.
A.sort(reverse=True)
cost = 0
for i in range(len(A)):
    cost += A[i] * (i+1)
return cost

Compiler optimization on the traveling salesman problem

I am playing with the travelling salesman problem and am looking at the version where:
the towns are points in 2d space and there are paths from every town to all others and the lengths are the distances between the points. So it's very easy to implement the naive solution where you check all permutations of n points and calculate the length of the path.
I've found, however, that for n >= 10 the compiler does some magic and prints a value that is certainly not the actual shortest path. I compile with the Microsoft Visual Studio compiler in release mode with the default settings. For n between 10 and 30 it thinks for 30 seconds and then returns a number that seems like it could be correct but is not (I checked in different ways). And for n > 40 it calculates a result immediately, and it is always 2.14748e+09.
I am looking for an explanation of what the compiler does in the different situations (the (10,30) case is really interesting), and for an example where these optimizations are more useful than the program just spinning until the end of the world.
vector<pair<int,int>> points;
void min_len()
{
    // n is a global variable with the number of points(towns)
    double min = INT_MAX;
    // there are n! permutations of n elements
    for (auto j = 0; j < factorial(n); ++j)
    {
        double sum = 0;
        for (auto i = 0; i < n - 1; ++i)
        {
            sum += distance_points(points[i], points[i + 1]);
        }
        if (sum < min)
        {
            min = sum;
            s_path = points;
        }
        next_permutation(points.begin(), points.end());
    }
    for (auto i = 0; i < n; ++i)
    {
        cout << s_path[i].first << " " << s_path[i].second << endl;
    }
    cout << min << endl;
}
unsigned int factorial(unsigned int n)
{
    int res = 1, i;
    for (i = 2; i <= n; i++)
        res *= i;
    return res;
}

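Your factorial function is overflowing, and the compiler is not doing anything magic. A 32-bit result can no longer hold n! once n reaches 13, so the loop runs a wrapped, meaningless number of times. For n >= 34 the true value of n! is a multiple of 2^32, so the result typically wraps to 0, the loop never executes, and min keeps its initial value INT_MAX, which prints as 2.14748e+09. Try replacing the function with one returning a 64-bit type such as uint64_t and see your code taking 3 years to terminate for n > 20.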
constexpr uint64_t factorial(unsigned int n) {
    return n ? n * factorial(n-1) : 1;
}
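To see the overflow concretely, here is a small sketch that computes the factorial in both widths. The names factorial32 and factorial64 are hypothetical. It deliberately uses unsigned arithmetic, where wraparound is well defined; the question's version multiplies in a signed int, where overflow is undefined behaviour, so the values it produces in practice are only likely, not guaranteed, to match.

```cpp
#include <cassert>
#include <cstdint>

// Factorial in 32-bit unsigned arithmetic; wraps modulo 2^32 (well defined).
std::uint32_t factorial32(unsigned n) {
    std::uint32_t res = 1;
    for (unsigned i = 2; i <= n; ++i) res *= i;
    return res;
}

// Same computation in 64 bits; exact only for n <= 20.
std::uint64_t factorial64(unsigned n) {
    assert(n <= 20);
    std::uint64_t res = 1;
    for (unsigned i = 2; i <= n; ++i) res *= i;
    return res;
}
```

With these, factorial32(13) is already different from factorial64(13), and factorial32(34) is 0, which matches the loop that never runs.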
Also, you don't need to calculate this at all. The std::next_permutation function returns false once all permutations have occurred (when starting from the sorted order).
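A minimal sketch of that loop shape, assuming the points are sorted first so that every permutation is visited exactly once. The body of distance_points is an assumption (the question does not show it), and shortest_path_length is a hypothetical name. The initial minimum is infinity rather than INT_MAX, so a path longer than INT_MAX is still handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

using Point = std::pair<int, int>;

// Euclidean distance between two towns (assumed implementation).
double distance_points(const Point& a, const Point& b) {
    return std::hypot(static_cast<double>(a.first - b.first),
                      static_cast<double>(a.second - b.second));
}

// Length of the shortest open path visiting every point, by brute force.
double shortest_path_length(std::vector<Point> points) {
    assert(!points.empty());
    std::sort(points.begin(), points.end());  // start at the first permutation
    double best = std::numeric_limits<double>::infinity();
    do {
        double sum = 0;
        for (std::size_t i = 0; i + 1 < points.size(); ++i)
            sum += distance_points(points[i], points[i + 1]);
        best = std::min(best, sum);
    } while (std::next_permutation(points.begin(), points.end()));
    return best;
}
```

No factorial is computed anywhere, so nothing can overflow; the running time is still proportional to n!, which stays impractical beyond small n.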

Find the number of pairs of positive integers satisfying the inequality

I'm trying to solve a programming problem where I have to display the number of positive integer solutions of the inequality x² + y² < n, where n is given by the user. I've already written code that seems to work, but not as fast as I'd like. Is there any way to speed it up?
My current code:
#include <iostream>
#include <cmath>
using namespace std;
int main()
{
    long long n, i, r, k, p, a;
    cin >> k;
    while (k--)
    {
        r = 0;
        cin >> n;
        p = sqrt(n);
        for (i = 1; i <= p; i++)
        {
            a = sqrt(n - (i * i));
            r += a;
            if ((((i * i) + (a * a)) == n) && (a > 0))
            {
                r--;
            }
        }
        cout << r << "\n";
    }
    return 0;
}
Edit:
This is a solution for this task.
The task in English:
Find the number of natural solutions (x≥1, y≥1) of the inequality x²+y² < n, where 0 < n < 2147483647. For example, for n=10 there are 4 solutions: (1,1), (1,2), (2,1), (2,2).
Input
In the first line of input the number of test cases k is given. In the next k lines, there are the n values given.
Output
In the output, you have to display in separate lines the number of natural solutions of the inequality.
Example
Input:
2
10
11
Output:
4
6

Your solution seems fast already. The main way to reduce the time spent is to eliminate the call to sqrt in the loop. This works because the value a = sqrt(n - (i * i)) does not vary very much from one iteration to the next, so it can be updated incrementally.
Here is the code:
r = 0;
p = sqrt(n);
if ((p*p) == n) p--;
a = p;
for (long long i = 1; i <= p; i++)
{
    while ((n-i*i) <= a*a) {
        --a;
    }
    r += a;
}
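For testing, the fragment above can be wrapped into a function and compared against a direct count. This is only a sketch: count_pairs and count_pairs_slow are hypothetical names, and the two adjustment loops after the sqrt call are an added safeguard against floating-point rounding that the original fragment does not have.

```cpp
#include <cassert>
#include <cmath>

// Number of pairs (x, y) with x >= 1, y >= 1 and x*x + y*y < n.
long long count_pairs(long long n) {
    assert(n > 0);
    long long p = static_cast<long long>(std::sqrt(static_cast<double>(n)));
    while (p * p > n) --p;               // added guard: sqrt rounded up
    while ((p + 1) * (p + 1) <= n) ++p;  // added guard: sqrt rounded down
    if (p * p == n) --p;                 // x itself must satisfy x*x < n
    long long a = p, r = 0;
    for (long long i = 1; i <= p; ++i) {
        while (n - i * i <= a * a) --a;  // largest a with i*i + a*a < n
        r += a;
    }
    return r;
}

// Direct double loop, used only to cross-check small n.
long long count_pairs_slow(long long n) {
    long long r = 0;
    for (long long x = 1; x * x < n; ++x)
        for (long long y = 1; x * x + y * y < n; ++y) ++r;
    return r;
}
```

Because a only ever decreases across the whole outer loop, the inner while runs at most about sqrt(n) times in total.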

Print first 1 million primes in 1 sec with constraints program size 50000 bytes and limited Memory

I tried sieve of Eratosthenes: Following is my code:
void prime_eratos(int N) {
    int root = (int)sqrt((double)N);
    bool *A = new bool[N + 1];
    memset(A, 0, sizeof(bool) * (N + 1));
    for (int m = 2; m <= root; m++) {
        if (!A[m]) {
            printf("%d ",m);
            for (int k = m * m; k <= N; k += m)
                A[k] = true;
        }
    }
    // start at root + 1: root itself was already handled by the loop above
    for (int m = root + 1; m <= N; m++)
        if (!A[m])
            printf("%d ",m);
    delete [] A;
}
int main(){
    prime_eratos(179426549);
    return 0;
}
It took real 7.340s on my system.
I also tried the Sieve of Atkin (which I read somewhere is faster than the sieve of Eratosthenes), but in my case it took real 10.433s.
Here is the code:
int main(){
    int limit=179426549;
    int x,y,i,n,k,m;
    bool *is_prime = new bool[179426550];
    memset(is_prime, 0, sizeof(bool) * 179426550);
    /*for(i=5;i<=limit;i++){
        is_prime[i]=false;
    }*/
    int N=sqrt(limit);
    for(x=1;x<=N;x++){
        for(y=1;y<=N;y++){
            n=(4*x*x) + (y*y);
            if((n<=limit) &&(n%12 == 1 || n%12==5))
                is_prime[n]^=true;
            n=(3*x*x) + (y*y);
            if((n<=limit) && (n%12 == 7))
                is_prime[n]^=true;
            n=(3*x*x) - (y*y);
            if((x>y) && (n<=limit) && (n%12 == 11))
                is_prime[n]^=true;
        }
    }
    for(n=5;n<=N;n++){
        if(is_prime[n]){
            m=n*n;
            for(k=m;k<=limit;k+=m)
                is_prime[k]=false;
        }
    }
    printf("2 3 ");
    for(n=5;n<=limit;n++){
        if(is_prime[n])
            printf("%d ",n);
    }
    delete []is_prime;
    return 0;
}
Now I wonder whether anyone is able to output 1 million primes in 1 second.
One approach could be to store the values in an array, but the program size is limited.
Could someone suggest a way to get the first 1 million primes in less than a second while satisfying the constraints discussed above?
Thanks!

Try
int main()
{
    std::ifstream primes("Filecontaining1MillionPrimes.txt");
    std::cout << primes.rdbuf();
}

You've counted the primes incorrectly. The millionth prime is 15485863, which is a lot smaller than you suggest.
You can speed your program and save space by eliminating even numbers from your sieve.
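As a rough illustration of the odd-only idea, the sketch below stores one flag per odd number, so the table is half the size and even numbers are never visited. The function name odd_sieve and the choice to return the primes in a vector (instead of printing them) are assumptions made for this example; the limit 15485863 is the millionth prime mentioned above.

```cpp
#include <cassert>
#include <vector>

// Sieve of Eratosthenes over odd numbers only: index k stands for 2*k + 1.
std::vector<int> odd_sieve(int limit) {
    assert(limit >= 0);
    std::vector<int> primes;
    if (limit < 2) return primes;
    primes.push_back(2);                 // the only even prime
    const int half = (limit - 1) / 2;    // highest index in use
    std::vector<char> composite(half + 1, 0);
    for (int k = 1; k <= half; ++k) {
        if (composite[k]) continue;
        const long long p = 2LL * k + 1;
        primes.push_back(static_cast<int>(p));
        // cross off odd multiples of p, starting at p*p, in steps of 2*p
        for (long long m = p * p; m <= limit; m += 2 * p)
            composite[m / 2] = 1;
    }
    return primes;
}
```

Calling odd_sieve(15485863) yields the first million primes; how the output is written (for example, one buffered write instead of a printf per prime) is likely to matter as much as the sieve itself.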

The fastest way I know to check whether a number is prime is to test for compositeness. I've implemented the Miller-Rabin primality test (http://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test) with great success for RSA. It is probabilistic, and its degree of certainty depends on how many times you run it.
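For numbers in the range of this question the test does not even need to be random. The sketch below is not the RSA implementation mentioned above; it is a small variant restricted to 32-bit inputs, where the fixed bases 2, 7 and 61 are known to give an exact answer. The names pow_mod and is_prime32 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// (base^exp) mod m, for m < 2^32 so every product fits in 64 bits.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    std::uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

// Miller-Rabin with the fixed bases 2, 7, 61 (exact for 32-bit n).
bool is_prime32(std::uint32_t n) {
    if (n < 2) return false;
    for (std::uint32_t q : {2u, 3u, 5u, 7u, 61u}) {
        if (n % q == 0) return n == q;   // small factor or small prime
    }
    std::uint64_t d = n - 1;             // write n - 1 as d * 2^r, d odd
    int r = 0;
    while ((d & 1) == 0) { d >>= 1; ++r; }
    for (std::uint64_t a : {2ull, 7ull, 61ull}) {
        std::uint64_t x = pow_mod(a, d, n);
        if (x == 1 || x == n - 1) continue;
        bool composite = true;
        for (int i = 1; i < r && composite; ++i) {
            x = x * x % n;
            if (x == n - 1) composite = false;
        }
        if (composite) return false;     // base a is a witness
    }
    return true;
}
```

For listing the first million primes a sieve is still the better tool; a per-number test like this pays off when only a few large candidates need checking.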

Step 1. don't do a printf
Step 2. buy a faster computer.

weighted RNG speed problem in C++

Edit: to clarify, the problem is with the second algorithm.
I have a bit of C++ code that samples cards from a 52 card deck, which works just fine:
void sample_allcards(int table[5], int holes[], int players) {
    int temp[5 + 2 * players];
    bool try_again;
    int c, n, i;
    for (i = 0; i < 5 + 2 * players; i++) {
        try_again = true;
        while (try_again == true) {
            try_again = false;
            c = fast_rand52();
            // reject collisions
            for (n = 0; n < i + 1; n++) {
                try_again = (temp[n] == c) || try_again;
            }
            temp[i] = c;
        }
    }
    copy_cards(table, temp, 5);
    copy_cards(holes, temp + 5, 2 * players);
}
I am implementing code to sample the hole cards according to a known distribution (stored as a 2d table). My code for this looks like:
void sample_allcards_weighted(double weights[][HOLE_CARDS], int table[5], int holes[], int players) {
    // weights are distribution over hole cards
    int temp[5 + 2 * players];
    int n, i;
    // table cards
    for (i = 0; i < 5; i++) {
        bool try_again = true;
        while (try_again == true) {
            try_again = false;
            int c = fast_rand52();
            // reject collisions
            for (n = 0; n < i + 1; n++) {
                try_again = (temp[n] == c) || try_again;
            }
            temp[i] = c;
        }
    }
    for (int player = 0; player < players; player++) {
        // hole cards according to distribution
        i = 5 + 2 * player;
        bool try_again = true;
        while (try_again == true) {
            try_again = false;
            // weighted-sample c1 and c2 at once
            // h is a number < 1325
            int h = weighted_randi(&weights[player][0], HOLE_CARDS);
            // i2h uses h and sets temp[i] to the 2 cards implied by h
            i2h(&temp[i], h);
            // reject collisions
            for (n = 0; n < i; n++) {
                try_again = (temp[n] == temp[i]) || (temp[n] == temp[i+1]) || try_again;
            }
        }
    }
    copy_cards(table, temp, 5);
    copy_cards(holes, temp + 5, 2 * players);
}
My problem? The weighted sampling algorithm is a factor of 10 slower. Speed is very important for my application.
Is there a way to improve the speed of my algorithm to something more reasonable? Am I doing something wrong in my implementation?
Thanks.
edit: I was asked about this function, which I should have posted, since it is key
inline int weighted_randi(double *w, int num_choices) {
    double r = fast_randd();
    double threshold = 0;
    int n;
    for (n = 0; n < num_choices; n++) {
        threshold += *w;
        if (r <= threshold) return n;
        w++;
    }
    // shouldn't get this far
    cerr << n << "\t" << threshold << "\t" << r << endl;
    assert(n < num_choices);
    return -1;
}
...and i2h() is basically just an array lookup.

Your collision rejection is turning an O(n) algorithm into (I think) an O(n^2) one.
There are two ways to select cards from a deck: shuffle and pop, or pick sets until the elements of the set are unique. You are doing the latter, which requires a considerable amount of backtracking.
I didn't look at the details of the code, just a quick scan.
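The "shuffle and pop" alternative can be sketched as a partial Fisher-Yates shuffle: only as many positions as cards are needed get shuffled, and no draw is ever rejected. This covers uniform dealing only, not the weighted hole-card sampling from the question. The function name deal_cards and the use of std::mt19937 in place of fast_rand52 are assumptions for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Deal `count` distinct cards (0..51) with a partial Fisher-Yates shuffle.
std::vector<int> deal_cards(int count, std::mt19937& rng) {
    assert(count >= 0 && count <= 52);
    std::vector<int> deck(52);
    std::iota(deck.begin(), deck.end(), 0);   // 0, 1, ..., 51
    for (int i = 0; i < count; ++i) {
        // pick a random not-yet-dealt card and move it to position i
        std::uniform_int_distribution<int> pick(i, 51);
        std::swap(deck[i], deck[pick(rng)]);
    }
    deck.resize(count);
    return deck;
}
```

For a table of 5 cards plus 2 hole cards for each of 2 players, deal_cards(9, rng) returns 9 distinct cards after exactly 9 random draws.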

You could gain some speed by replacing all the loops that check whether a card is taken with a bit mask. E.g., for a pool of 52 cards, we prevent collisions like so:
DWORD dwMask[2] = {0}; //64 bits
//...
int nCard;
while(true)
{
    nCard = rand_52();
    if(!(dwMask[nCard >> 5] & 1 << (nCard & 31)))
    {
        dwMask[nCard >> 5] |= 1 << (nCard & 31);
        break;
    }
}
//...
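The same idea in portable form, as a sketch: a single 64-bit integer replaces the two DWORDs, and std::mt19937 stands in for rand_52. The function name draw_unique is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Draw a card 0..51 that is not yet marked in `used`, then mark it.
int draw_unique(std::uint64_t& used, std::mt19937& rng) {
    const std::uint64_t all = (std::uint64_t{1} << 52) - 1;
    assert((used & all) != all);          // at least one card must remain
    std::uniform_int_distribution<int> card(0, 51);
    for (;;) {
        const int c = card(rng);
        const std::uint64_t bit = std::uint64_t{1} << c;
        if ((used & bit) == 0) {          // not taken yet
            used |= bit;
            return c;
        }
    }
}
```

The collision check is now a single AND instead of a loop over the cards dealt so far, although the draw itself is still rejection-based.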

My guess would be the memcpy(1326*sizeof(double)) within the retry-loop. It doesn't seem to change, so should it be copied each time?

Rather than tell you what the problem is, let me suggest how you can find it. Either 1) single-step it in the IDE, or 2) randomly halt it to see what it's doing.
That said, sampling by rejection, as you are doing, can take an unreasonably long time if you are rejecting most samples.

Your inner "try_again" for loop should stop as soon as it sets try_again to true - there's no point in doing more work after you know you need to try again.
for (n = 0; n < i && !try_again; n++) {
    try_again = (temp[n] == temp[i]) || (temp[n] == temp[i+1]);
}

The second question, about picking from a weighted set, also has an algorithmic replacement with lower time complexity. It is based on the principle that what has been pre-computed does not need to be re-computed.
In an ordinary selection, you have an integral number of equal bins, which makes picking a bin an O(1) operation. Your weighted_randi function has bins of real-valued width, so selection in your current version takes O(n) time. You don't say (but do imply) that the vector of weights w is constant, so I'll assume that it is.
You aren't interested in the widths of the bins per se; you are interested in the locations of their edges, which you re-compute on every call to weighted_randi in the variable threshold. If w really is constant, pre-computing the list of edges (that is, the value of threshold after each *w) is your O(n) step, and it needs to be done only once. If you put the results in a (naturally) ordered list, a binary search on all future calls yields O(log n) time complexity, at the cost of only one extra stored value per weight (sizeof w / sizeof w[0] values in total).
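A minimal sketch of that pre-computation, assuming the weights are constant and sum to about 1. The names build_edges and pick_bin are hypothetical; r plays the role of the value returned by fast_randd() in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// One-time O(n) step: running sums of the weights (the bin edges).
std::vector<double> build_edges(const std::vector<double>& weights) {
    assert(!weights.empty());
    std::vector<double> edges(weights.size());
    std::partial_sum(weights.begin(), weights.end(), edges.begin());
    return edges;
}

// O(log n) per sample: index of the first edge with r <= edge,
// the same rule as the linear scan in weighted_randi.
int pick_bin(const std::vector<double>& edges, double r) {
    assert(!edges.empty());
    auto it = std::lower_bound(edges.begin(), edges.end(), r);
    if (it == edges.end()) --it;  // r just above the total due to rounding
    return static_cast<int>(it - edges.begin());
}
```

With 1326 hole-card combinations, each sample then needs about 11 comparisons instead of a scan over up to 1326 weights.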
