Finding the minimum value - c++

I can't begin to understand how to approach this problem. Can someone help me to just point me in the direction as to how I can approach it?
N tasks are given and there are M workers available. Each worker can take a different time to complete each task. The time taken by each worker for every task is given. At any time only one task can be worked on, by only one worker. But the condition is that once a worker has stopped working, he can't work on any task again. I want to find the minimum time required to finish all the tasks. Here's an example-
M = 3
N = 4 {T1, T2,T3,T4}
No of days required by each worker (Wi) for each task (Ti) -
     T1  T2  T3  T4
W1    1   2   2   3
W2    3   1   3   2
W3    1   1   6   6
There are many ways to finish the task, some of them are -
All the tasks are done by the W1 ===> total time taken = 1+2+2+3 = 8
All the tasks are done by the W2 ===> total time taken = 3+1+3+2 = 9
All the tasks are done by the W3 ===> total time taken = 1+1+6+6 = 14
T1,T2,T3 done by W1 and T4 done by W2 ===> total time taken = 1+2+2+2 = 7
T1,T2 done by W1 and T3,T4 done by W3 ===> total time taken = 1+2+6+6 = 15
T1,T2 done by W3, T3 done by W1 and T4 done by W2 ===> total time taken = 1+1+2+2 = 6
There are more possible ways but the one that gives the smallest time taken is the 6th one (also shown in the picture below).
I was just able to understand how to do it when the number of workers are only 2. I did it this way -
#include<iostream>
#include<climits> // for INT_MAX
using namespace std;
const int N=4,M=2; // const so that N can be used as an array size
int main()
{
int i,j,min=INT_MAX;
int sum,sum1;
int w0[N] = {1,2,2,3};
int w1[N] = {3,1,3,2};
for(i=0;i<N;i++)
{
sum=0;
sum1=0;
for(j=0;j<i;j++)
{
sum+=w0[j];
sum1+=w1[j];
}
for(j=N-1;j>=i;j--)
{
sum+=w1[j];
sum1+=w0[j];
}
if(sum<sum1)
{
if(min>sum)
min = sum;
}
else
{
if(min>sum1)
min = sum1;
}
}
cout<<min;
return 0;
}
I tried to explain it using another table below -
But this way I can only find the minimum value for 2 workers. I need help to understand the approach for more than 2 workers.
Can there also be a DP solution possible for this?

The best way I can think to solve this is using recursion. I would implement this by having a list of unusable workers and a running sum passed to each call, and a global variable of minimum value.
This would also work best if you had a matrix of values. So like matrix[0] = {1, 2, 3}; matrix[1] = {3, 4, 5}. I haven't hardcoded a matrix in a while so please forgive me if the syntax is a little off.
So, using global variables for the matrix this would look something like
int matrix[m][n];
int runningMinimum = INT_MAX; //set the runningMinimum to max so any value compared will be lower
//initial call: minimum(0, vector<int>(), -1, 0); -1 means no worker has started yet
void minimum(int i, vector<int> bannedWorkers, int currentWorker, int sum){
//test the end condition here
if (i == n){//every column (task) has been assigned
if (sum < runningMinimum){runningMinimum = sum;}
return; //we want to return at the end, whether it's found the new lowest value or not
}
//if we're not at the end, we need to move one step to the right for all possible workers
for (int j = 0; j < m; j++){//For each worker
//check to see if the worker is no longer allowed to work
bool isBanned = false;
for (int k = 0; k < bannedWorkers.size(); k++){
if (bannedWorkers[k] == j) isBanned = true;
}
if(!isBanned){
if (j == currentWorker){
minimum(i+1, bannedWorkers, currentWorker, sum+matrix[j][i]);
}else{
vector<int> newBannedWorkers = bannedWorkers; //Make sure to copy to a new vector
newBannedWorkers.push_back(currentWorker);
minimum(i+1, newBannedWorkers, j, sum + matrix[j][i]);
}
}
}
return; //after we've checked every option we want to end that call
}
This is a rough, untested idea but it should give you a solid start. Hope it helps!
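On the DP part of the question: one possible sketch, assuming the tasks must be done in the order T1..TN and that M is small enough for a bitmask over workers. The state is "which workers have been used so far" plus "which worker is active now". The name minTotalTime and the cost[w][t] layout are my own choices for illustration, not something from the question.

```cpp
#include <algorithm>
#include <climits>
#include <utility>
#include <vector>

// cost[w][t] = time worker w needs for task t (same layout as the table in the question).
// Tasks are handled in order; a worker who hands over can never come back.
int minTotalTime(const std::vector<std::vector<int>>& cost) {
    const int m = static_cast<int>(cost.size());
    const int n = m ? static_cast<int>(cost[0].size()) : 0;
    if (m == 0 || n == 0) return 0;
    const int INF = INT_MAX / 2;
    const int masks = 1 << m;  // assumes m is small (roughly 15 or fewer)

    // best[mask][w]: cheapest way to finish the tasks seen so far when the
    // workers in `mask` have been used and worker w is the one still active.
    std::vector<std::vector<int>> best(masks, std::vector<int>(m, INF));
    for (int w = 0; w < m; ++w) best[1 << w][w] = cost[w][0];

    for (int t = 1; t < n; ++t) {
        std::vector<std::vector<int>> next(masks, std::vector<int>(m, INF));
        for (int mask = 0; mask < masks; ++mask) {
            for (int w = 0; w < m; ++w) {
                const int cur = best[mask][w];
                if (cur >= INF) continue;  // state not reachable
                // Option 1: the active worker keeps going.
                next[mask][w] = std::min(next[mask][w], cur + cost[w][t]);
                // Option 2: hand over to a worker who has not worked yet.
                for (int v = 0; v < m; ++v) {
                    if (mask & (1 << v)) continue;
                    int& slot = next[mask | (1 << v)][v];
                    slot = std::min(slot, cur + cost[v][t]);
                }
            }
        }
        best = std::move(next);
    }

    int answer = INF;
    for (const auto& row : best)
        for (int value : row) answer = std::min(answer, value);
    return answer;
}
```

With the three workers from the question (rows {1,2,2,3}, {3,1,3,2}, {1,1,6,6}) this returns 6. The work is roughly N * 2^M * M^2 steps, so it only makes sense while M stays small.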

Probably not the best approach if the number of workers is large, but I think it is easy to understand and implement. I would:
get a list of all the possible combinations with repetition of W, for example using the algorithm in https://www.geeksforgeeks.org/combinations-with-repetitions/ . This would give you things like [[W1,W3,W2,W3,W1],[W3,W5,W5,W4,W5]...
Discard combinations where workers are not continuous (loop through each list counting the times each worker appears in total and the times it appears continuously; if the two counts differ, discard the list)
Use the filtered list of lists to check times using the table and keep the minimum one
A possible way to discard lists could be
bool isValid=true;
for (int kk = 0; kk < workerOrder.Length; kk++)
{
int state=0;
for (int mm = 0; mm < workerOrder.Length; mm++)
{
if (workerOrder[mm] == kk && state == 0) { state = 1; } //it has appeared
if (workerOrder[mm] != kk && state == 1 ) { state = 2; } //its run has ended
if (workerOrder[mm] == kk && state == 2) { isValid = false; break; } //it appeared again, so it is not continuous
}
if (isValid==false){break;}
}
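The same validity check can be written in C++ with a single pass. This is only a sketch: isContiguous and workerCount are hypothetical names, and it assumes workers are numbered 0..workerCount-1.

```cpp
#include <cstddef>
#include <vector>

// Returns true when every worker id in workerOrder occupies one contiguous run,
// i.e. no worker comes back after having stopped.
bool isContiguous(const std::vector<int>& workerOrder, int workerCount) {
    std::vector<bool> finished(workerCount, false);
    for (std::size_t i = 0; i < workerOrder.size(); ++i) {
        const int w = workerOrder[i];
        if (finished[w]) return false;  // worker w already stopped earlier
        // The run of w ends here if the next task goes to someone else.
        if (i + 1 < workerOrder.size() && workerOrder[i + 1] != w) finished[w] = true;
    }
    return true;
}
```

For example, {0, 0, 1, 2} is accepted while {0, 1, 0} is rejected.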

Related

Optimizing code with a lookup table

This portion of my code takes too long to run and I was looking for a way to optimize it. I think a lookup table would be the fastest way but I could be wrong. My program has a main for loop and for each iteration in the main for loop, a nested loop goes through 1,233,487 iterations and then goes through the if statements if the conditions are met. The main for loop goes through 898,281 iterations so it must go through 898,281 * 1,233,487 calculations. How would I go about creating a lookup table to optimize these calculations/is there a better way to optimize my code.
for (int i = 0; i < all_neutrinos.size(); i++)
{ //all_neutrinos.size() = 898281
int MC_count = 0; //counts per window in the Monte Carlo simulation
int count = 0; //count per window for real data
if (cosmic_ray_events.size() == MC_cosmic_ray_events.size())
{
for (int j = 0; j < cosmic_ray_events.size(); j++)
{ //cosmic_ray_events.size() = 1233487
if ((MC_cosmic_ray_events[j][1] >= (all_neutrinos[i][3] - band_width))
&& (MC_cosmic_ray_events[j][1] <= (all_neutrinos[i][3] + band_width)))
{
if ((earth_radius * fabs(all_neutrinos[i][2] - MC_cosmic_ray_events[j][0]))
<= test_arc_length)
{
MC_count++;
}
}
if ((cosmic_ray_events[j][7] >= (all_neutrinos[i][3] - band_width))
&& (cosmic_ray_events[j][7] <= (all_neutrinos[i][3] + band_width)))
{
if(earth_radius * fabs(all_neutrinos[i][2] - cosmic_ray_events[j][6])
<= test_arc_length)
{
count++;
}
}
}
MCcount_out << i << " " << MC_count << endl;
count_out << i << " " << count << endl;
}
}
First, cosmic_ray_events and MC_cosmic_ray_events are utterly unrelated. Make it two loops.
Sort MC_cosmic_ray_events by [1]. Sort cosmic_ray_events by [7]. Sort all_neutrinos by [3].
This doesn't have to be in-place sorting -- you can sort an array of pointers or indexes into them if you want.
Start with a highwater and lowwater index into your cosmic ray events set to 0.
Now, walk over all_neutrinos. For each one, advance highwater until
MC_cosmic_ray_events[highwater][1] > all_neutrinos[i][3] + band_width. Then advance lowwater until MC_cosmic_ray_events[lowwater][1] >= all_neutrinos[i][3] - band_width.
On the half-open range j = lowwater upto but not including highwater, run:
if (
(earth_radius * fabs(all_neutrinos[i][2] - MC_cosmic_ray_events[j][0]))
<= test_arc_length
) {
MC_count++;
}
Now repeat until i reaches the end of all_neutrinos.
Then repeat this process, using cosmic_ray_events and [7].
Your code takes O(NM) time. This code takes O(N lg N + M lg M + N * (average number of events that pass the band_width test)) time. If relatively few pass the band_width test, you are going to be insanely faster.
Assuming you get an average of 0.5 intersects per all_neutrinos, this will be on the order of 100000x faster.
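A rough sketch of the highwater/lowwater walk described above, for one event set. The Event struct, countMatches, and its parameter names are made up for illustration: key stands for the column compared against the band (e.g. [1] or [7]) and angle for the column compared with fabs (e.g. [0] or [6]). Counts are reported per original neutrino index, so the queries are sorted through an index array rather than in place.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Event {
    double key;    // value tested against the +/- bandWidth window
    double angle;  // value tested against the arc length
};

// For every query q, counts the events with |event.key - q.key| <= bandWidth
// and radius * |q.angle - event.angle| <= maxArc.
std::vector<int> countMatches(std::vector<Event> events,
                              const std::vector<Event>& queries,
                              double bandWidth, double radius, double maxArc) {
    std::sort(events.begin(), events.end(),
              [](const Event& a, const Event& b) { return a.key < b.key; });

    // Visit the queries in increasing key order without reordering them.
    std::vector<std::size_t> order(queries.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return queries[a].key < queries[b].key;
    });

    std::vector<int> counts(queries.size(), 0);
    std::size_t low = 0, high = 0;  // half-open window [low, high)
    for (std::size_t qi : order) {
        const Event& q = queries[qi];
        while (high < events.size() && events[high].key <= q.key + bandWidth) ++high;
        while (low < events.size() && events[low].key < q.key - bandWidth) ++low;
        for (std::size_t j = low; j < high; ++j) {
            if (radius * std::fabs(q.angle - events[j].angle) <= maxArc) ++counts[qi];
        }
    }
    return counts;
}
```

Both indexes only ever move forward, which is where the saving comes from. You would call this once for MC_cosmic_ray_events and once for cosmic_ray_events.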
There is not much to optimize. The counts are really high, and there is not much hard computation going on. There are some obvious optimizations you could do, such as storing (all_neutrinos[i][3] +/- band_width) in local variables before entering the j-loop. Your compiler probably already does this, though; doing it by hand would certainly improve performance in debug mode.
Have you tried separating the two halves of the j-loop and have two j-loops? as in:
auto all_neutrinos_2 = all_neutrinos[i][2];
// precompute the band limits once per neutrino
auto all_neutrinos_lower = all_neutrinos[i][3] - band_width;
auto all_neutrinos_higher = all_neutrinos[i][3] + band_width;
for (int j = 0; j < cosmic_ray_events.size(); j++)
{ //cosmic_ray_events.size() = 1233487
auto MC_events = MC_cosmic_ray_events[j][1];
if ((all_neutrinos_lower <= MC_events) && (MC_events <= all_neutrinos_higher))
{
if ((earth_radius * fabs(all_neutrinos_2 - MC_cosmic_ray_events[j][0]))
<= test_arc_length)
{
MC_count++;
}
}
}
for (int j = 0; j < cosmic_ray_events.size(); j++)
{ //cosmic_ray_events.size() = 1233487
auto events = cosmic_ray_events[j][7];
if ((all_neutrinos_lower <= events) && (events <= all_neutrinos_higher))
{
if(earth_radius * fabs(all_neutrinos_2 - cosmic_ray_events[j][6])
<= test_arc_length)
{
count++;
}
}
}
I have the feeling you could get some improvement from improved memory cache hits this way.
Any improvement beyond that would involve packing the input data to reduce memory cache misses, and would involve modifying the structure and code generating the MC_cosmic_ray_events and cosmic_ray_events arrays
Slicing the counts into several smaller tasks running on different threads is also a route I would look at seriously at this point. Data access is read only, and each thread can have its own counter, which can all be summed at the end.

Increment the value of a map

I need your help, ideally quickly. It is a very trivial problem, but I still can't work out exactly what I need to put in one line.
I have the following code:
for (busRequest = apointCollection.begin(); busRequest != apointCollection.end(); busRequest++)
{
double Min = DBL_MAX;
int station = 0;
for (int i = 0; i < newStations; i++)
{
distance = sqrt(pow((apointCollection2[i].x - busRequest->x1), 2) + pow((apointCollection2[i].y - busRequest->y1), 2));
if (distance < Min)
{
Min = distance;
station = i;
}
}
if (people.find(station) == people.end())
{
people.insert(pair<int, int>(station, i));
}
else
{
// how can I increment the value if the key of my station is already in the map?
}
}
Briefly, what I do: I take the first bus request, go into the second loop, and find the station with the minimum distance. After the second loop finishes, I add that station to my map. As I continue through the requests, if the same station comes up again, I need to increment its value, meaning that the station has been used two times, and so on.
I just need a hint, or the line that I need to add.
Thank you in advance.
And I think you meant Min Distance instead of i? Check and let me know.
for (busRequest = apointCollection.begin(); busRequest != apointCollection.end(); busRequest++)
{
double Min = DBL_MAX;
int station = 0;
for (int i = 0; i < newStations; i++)
{
distance = sqrt(pow((apointCollection2[i].x - busRequest->x1), 2) + pow((apointCollection2[i].y - busRequest->y1), 2));
if (distance < Min)
{
Min = distance;
station = i;
}
}
if (people.find(station) == people.end())
{
people.insert(pair<int, int>(station, i)); // here???
}
else
{
// This routine will increment the value if the key already exists. If it doesn't exist it will create it for you
YourMap[YourKey]++;
}
}
In C++ you can directly access a map key without inserting it. C++ will automatically create it with default value.
In your case, if a station is not present in people map and you will access people[station] then people[station] will automatically be set to 0 ( default value of int is 0 ).
So you can just do this:
if (people[station] == 0)
{
// Do something
people[station] = 1; // first request for this station. NOTE: i is not accessible here! check your logic
}
else
{
people[station]++;
}
Also: In your code i cannot be accessed inside IF condition to insert into people map.
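Since operator[] creates a missing key with value 0, the whole if/else can collapse into a single increment. A minimal sketch, assuming the map is meant to count how many requests picked each station; countPerStation and nearestStations are hypothetical names standing in for the outer loop of the question:

```cpp
#include <map>
#include <vector>

// nearestStations[r] is the index of the closest station for request r.
// Returns, for every station that was chosen at least once, how often it was chosen.
std::map<int, int> countPerStation(const std::vector<int>& nearestStations) {
    std::map<int, int> people;
    for (int station : nearestStations) {
        ++people[station];  // a missing key is created with value 0, then incremented
    }
    return people;
}
```

In the original loop that would simply be people[station]++; placed after the inner loop, with no find() or insert() needed.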

Program crashes, Tree too large

I'm trying to answer this problem as an exercise:
There is a set of coins of {50,25,10,5,1} cents in a box. Write a program to find the number of ways 1 dollar can be created by grouping the coins.
My solution involves making a tree with each edge having one of the values above. Each node would then hold a sum of the coins. I could then populate this tree and look for leaves that add up to 100. So here is my code
class TrieNode
{
public:
TrieNode(TrieNode* Parent=NULL,int sum=0,TrieNode* FirstChild=NULL,int children=0, bool key =false )
:pParent(Parent),pChild(FirstChild),isKey(key),Sum(sum),NoChildren(children)
{
if(Sum==100)
isKey=true;
}
void SetChildren(int children)
{
pChild = new TrieNode[children]();
NoChildren=children;
}
~TrieNode(void);
//pointers
TrieNode* pParent;
TrieNode* pChild;
int NoChildren;
bool isKey;
int Sum;
};
void Populate(TrieNode* Root, int coins[],int size)
{
//Set children
Root->SetChildren(size);
//add children
for(int i=0;i<size;i++)
{
TrieNode* child = &Root->pChild[0];
int c = Root->Sum+coins[i];
if(c<=100)
{
child = new TrieNode(Root,c);
if(!child->isKey) //recursively populate if not a key
Populate(child,coins,size);
}
else
child = NULL;
}
}
int getNumKeys(TrieNode* Root)
{
int keys=0;
if(Root == NULL)
return 0;
//increment keys if this is a key
if(Root->isKey)
keys++;
for(int i=0; i<Root->NoChildren;i++)
{
keys+= getNumKeys(&Root->pChild[i]);
}
return keys;
}
int _tmain(int argc, _TCHAR* argv[])
{
TrieNode* RootNode = new TrieNode(NULL,0);
int coins[] = {50,25,10,5,1};
int size = 5;
Populate(RootNode,coins,size);
int combos = getNumKeys(RootNode);
printf("%i",combos);
return 0;
}
The problem is that the tree is so huge that after a few seconds the program crashes. I'm running this on a windows 7, quad core, with 8gb ram. A rough calculation tells me I should have enough memory.
Are my calculations incorrect?
Does the OS limit how much memory I have access to?
Can I fix it while still using this solution?
All feedback is appreciated. Thanks.
Edit1:
I have verified that the above approach is wrong. By trying to build a tree with a set of only 1 coin.
coins[] = {1};
I found that the algorithm still failed.
After reading the post from Lenik and from João Menighin
I came up with this solution that ties both Ideas together to make a recursive solution
which takes any sized array
//N is the total the coins have to amount to
int getComobs(int coins[], int size,int N)
{
//write base cases
//N is zero: the coins chosen so far add up exactly, so this counts as one combination
if(N==0)
return 1;
//if array empty | coin value is zero
if(size==0 || coins[0]==0)
return 0;
int thisCoin = coins[0];
int atMost = N / thisCoin ;
//if only 1 coin denomination
if(size==1)
{
//if all coins fit in N
if(N%thisCoin==0)
return 1;
else
return 0;
}
int combos =0;
//write recursion: use 0..atMost (inclusive) coins of this denomination
for(int denomination =0; denomination<=atMost;denomination++)
{
//recurse on the remaining denominations
combos+= getComobs(coins+1, size-1,N-denomination*thisCoin);
}
return combos;
}
Thanks for all the feedback
Tree solution is totally wrong for this problem. It's like catching 10e6 tigers and then let go all of them but one, just because you need a single tiger. Very time and memory consuming -- 99.999% of your nodes are useless and should be ignored in the first place.
Here's another approach:
notice you cannot make a dollar that contains more than two 50 cent coins
notice again you cannot make a dollar that contains more than four 25 cent coins
notice... (you get the idea?)
Then your solution is simple:
for( int fifty=0; fifty<3; fifty++) {
for( int quarters=0; quarters<5; quarters++) {
for( int dimes=0; dimes<11; dimes++) {
for( int nickels=0; nickels<21; nickels++) {
int sum = fifty * 50 + quarters * 25 + dimes * 10 + nickels * 5;
if( sum <= 100 ) counter++; // here's a combination!!
}
}
}
}
You may ask, why did I not do anything about single cent coins? The answer is simple: whenever the sum is at most 100, the rest is filled with 1 cent coins.
ps. hope this solution is not too simple =)
Ok, this is not a full answer but might help you.
You can try perform (what i call) a sanity check.
Put a static counter in TrieNode for every node created, and see how large it grows. If you did some calculations you should be able to tell if it goes to some insane values.
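A minimal sketch of such a counter. CountedNode is a hypothetical stand-in for TrieNode reduced to the counting part; in the real class you would add the static member and the increment to the existing constructor.

```cpp
#include <cstddef>

// Stand-in for the question's TrieNode, keeping only what the sanity check needs.
struct CountedNode {
    static std::size_t created;  // total number of nodes constructed so far
    int sum;
    explicit CountedNode(int s = 0) : sum(s) { ++created; }
};

std::size_t CountedNode::created = 0;
```

Printing CountedNode::created every so often (or once the count passes some threshold) shows quickly whether the tree is growing far beyond what was estimated. Note that new TrieNode[children]() in SetChildren constructs every element of the array, so those nodes are counted as well.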
The system can limit the memory available, however that would be really bizarre. Usually the user/admin sets such limits for some purpose; this often happens on dedicated multi-user systems. Another possibility is a 32-bit app in a 64-bit Windows environment. Then the limit would be 4GB of address space at most (2GB unless the executable is linked as large-address-aware), however this would also be really strange. Anyway, I don't think being limited by the OS is the issue here.
On a side note. I hope you do realize that you kinda defeated all object oriented programming concept with this code :).
I need more time to analyze your code, but for now I can tell that this is a classic Dynamic Programming problem. You may find some interesting texts here:
http://www.algorithmist.com/index.php/Coin_Change
and here
http://www.ccs.neu.edu/home/jaa/CSG713.04F/Information/Handouts/dyn_prog.pdf
There is a much easier way to find a solution:
#include <iostream>
#include <cstring>
using namespace std;
int main() {
int w[101];
memset(w, 0, sizeof(w));
w[0] = 1;
int d[] = {1, 5, 10, 25, 50};
for (int i = 0 ; i != 5 ; i++) {
for (int k = d[i] ; k <= 100 ; k++) {
w[k] += w[k-d[i]];
}
}
cout << w[100] << endl;
return 0;
}
The idea is to incrementally build the number of ways to make change by adding coins in progressively larger denominations. Each iteration of the outer loop goes through the results that we already have, and for each amount that can be constructed using the newly added coin, it adds the number of ways to construct the amount that is smaller by the value of the current coin. For example, if the current coin is 5 and the current amount is 7, the algorithm looks up the number of ways that 2 can be constructed, and adds it to the number of ways that 7 can be constructed. If the current coin is 25 and the current amount is 73, the algorithm looks up the number of ways to construct 48 (73-25) and adds it to the previously found number of ways to construct 73. In the end, the number in w[100] represents the number of ways to make one dollar (292 ways).
I really do believe someone has to put the most efficient and simple possible implementation, it is an improvement on lenik's answer:
Memory: Constant
Running time: Considering 100 as n, then running time is about O(n (lg(n))) <-I am unsure
for(int fifty=0; fifty <= 100; fifty+=50)
for(int quarters=0; quarters <= (100 - fifty); quarters+=25)
for(int dimes=0; dimes <= (100 - fifty - quarters); dimes+=10)
counter += 1 + (100 - fifty - quarters - dimes)/5;
I think this can be solved in constant time, because any sequence sum can be represented with a linear formula.
Problem might be infinite recursion. You are not incrementing c any where and loop runs with c<=100
Edit 1: I am not sure if
int c = Root->Sum+coins[i];
is actually taking it beyond 100. Please verify that
Edit 2: I missed the Sum being initialized correctly and it was corrected in the comments below.
Edit 3: Method to debug -
One more thing that you can do to help is, Write a print function for this tree or rather print on each level as it progresses deeper in the existing code. Add a counter which terminates loop after say total 10 iterations. The prints would tell you if you are getting garbage values or your c is gradually increasing in a right direction.

Need to find a logic error in a card shuffling method

I'm trying to write a method that takes an array of integers (0-51, in that order), cuts it into two separate arrays (A and B in the below function by using the cut method, which I know for sure works) and then re-fuses the two arrays together by randomly selecting 0, 1 or 2 cards from the BOTTOM of either A or B and then adding them to the deck.
(ps- by "array" I mean linked list, I just said array because I thought it would be conceptually easier)
This is my code so far, it works, but there's a definite bias when it comes to where the cards land. Can anybody spot my logic error?
[code]
void Deck::shuffle(){
IntList *A = new IntList();
IntList *B = new IntList();
cut(A, B);
IntListNode *aMarker = new IntListNode;
aMarker = A->getSentinel()->next;
//cout<< A->getSentinel()->prev->prev->data <<'\n'<<'\n';
IntListNode *bMarker = new IntListNode;
bMarker = B->getSentinel()->next;
//cout<< B->getSentinel()->prev->data;
deckList.clear();
srand(time(NULL));
int randNum = 0, numCards = 0, totalNumCards = 0;
bool selector = true, aisDone = false, bisDone = false;
while(totalNumCards < 52){
randNum = rand() % 3;
if(randNum == 0){
selector = !selector;
continue;
}
numCards = randNum;
if(!aisDone && !bisDone){
if(selector){
for(int i = 0; i < numCards; i++){
deckList.push_back(aMarker->data);
aMarker = (aMarker->next);
if(aMarker == A->getSentinel()){
aisDone = true;
break;
}
}
selector = false;
}else{
for(int i = 0; i < numCards; i++){
deckList.push_back(bMarker->data);
bMarker = (bMarker->next);
if(bMarker == B->getSentinel()){
bisDone = true;
break;
}
}
selector = true;
}
}
if(aisDone && !bisDone){
for(int i = 0; i < (52 - totalNumCards); i++){
deckList.push_back(bMarker->data);
bMarker = (bMarker->next);
if(bMarker == B->getSentinel()){
bisDone = true;
break;
}
}
//return;
}
if(bisDone && !aisDone){
for(int i = 0; i < (52 - totalNumCards); i++){
deckList.push_back(aMarker->data);
aMarker = (aMarker->next);
if(aMarker == A->getSentinel()){
aisDone = true;
break;
}
}
//return;
}
totalNumCards += numCards;
}
int tempSum = 0;
IntListNode *tempNode = deckList.head();
for(int j = 0; j < 52; j++){
//cout<< (tempNode->data) << '\n';
tempSum += (tempNode->data);
tempNode = (tempNode ->next);
}
if(tempSum != 1326)
system("PAUSE");
return;
}
[/code]
What about just using std::random_shuffle? Yeah, it won't work for linked list, but you can change it to vector :)
If your instructor would have the moral to teach you programming the way it should be done then they'd encourage you to solve the problem like so, with four lines of code:
#include<algorithm>
#include<vector>
// ...
std::vector<int> cards; // fill it in ...
std::random_shuffle(cards.begin(), cards.end()); // removed in C++17; there, use std::shuffle with a random engine from <random>
Using the standard library is the right way of doing things. Writing code on your own when you can solve the problem with the standard library is the wrong way of doing things. Your instructor doesn't teach you right. If they want to get a point across (say, have you practice using pointers) then they should be more attentive in selecting the exercise they give you.
That speech given, here is a solution worse than the above but better than your instructor's:
52 times do the following:
Choose two random non-equal integers in the range [0,52).
Swap the values in the array corresponding to these positions.
For most random number generators, the low bits are the least random ones. So your line
randNum = rand() % 3;
should be modified to get its value more from the high- to middle-order bits from rand.
Your expectations may be off. I notice that you swap the selector if your random value is 0. Coupled with the relative non-randomness of randNum, this may be your problem. Perhaps you need to make things less random to make them appear more random, such as swapping the selector every time through the loop, and always taking 1 or more cards from the selected deck.
Comments:
srand(time(NULL));
This should only be called once during an application's run. It is usually best to call it in main() as you start.
int randNum = 0, numCards = 0, totalNumCards = 0;
bool selector = true, aisDone = false, bisDone = false;
One identifier per line. Every coding standard written has this rule. It also prevents some subtle errors that can creep in when using pointers. Get used to it.
randNum = rand() % 3;
The bottom bits of rand are the least random.
randNum = rand() / (RAND_MAX / 3 + 1);
Question:
if(!aisDone && !bisDone)
{
This can execute
and set one of the above to isDone
Example:
Exit state aisDone == false bisDone == false // OK
Exit state aisDone == true bisDone == false // Will run below
Exit state aisDone == false bisDone == true // Will run below
}
if(aisDone && !bisDone)
{
Is this allowed to run if the first block above is run?
}
if(bisDone && !aisDone)
{
Is this allowed to run if the first block above is run?
}
The rest is too complicated and I don't understand.
I can think of simpler techniques to get a good shuffle of a deck of cards:
for(loop = 0 .. 51)
{
rand = rand(51 - loop);
swap(loop, loop+rand);
}
The above simulates picking a card at random from the deck A and putting it on the top of deck B (deck B initially being empty). When the loop completes B is now A (as it was done in place).
Thus each card (from A) has the same probability of being placed at any position in B.

Unusual Speed Difference between Python and C++

I recently wrote a short algorithm to calculate happy numbers in python. The program allows you to pick an upper bound and it will determine all the happy numbers below it. For a speed comparison I decided to make the most direct translation of the algorithm I knew of from python to c++.
Surprisingly, the c++ version runs significantly slower than the python version. Accurate speed tests between the execution times for discovering the first 10,000 happy numbers indicate the python program runs on average in 0.59 seconds and the c++ version runs on average in 8.5 seconds.
I would attribute this speed difference to the fact that I had to write helper functions for parts of the calculations (for example determining if an element is in a list/array/vector) in the c++ version which were already built in to the python language.
Firstly, is this the true reason for such an absurd speed difference, and secondly, how can I change the c++ version to execute more quickly than the python version (the way it should be in my opinion).
The two pieces of code, with speed testing are here: Python Version, C++ Version. Thanks for the help.
#include <iostream>
#include <vector>
#include <string>
#include <ctime>
#include <windows.h>
using namespace std;
bool inVector(int inQuestion, vector<int> known);
int sum(vector<int> given);
int pow(int given, int power);
void calcMain(int upperBound);
int main()
{
while(true)
{
int upperBound;
cout << "Pick an upper bound: ";
cin >> upperBound;
long start, end;
start = GetTickCount();
calcMain(upperBound);
end = GetTickCount();
double seconds = (double)(end-start) / 1000.0;
cout << seconds << " seconds." << endl << endl;
}
return 0;
}
void calcMain(int upperBound)
{
vector<int> known;
for(int i = 0; i <= upperBound; i++)
{
bool next = false;
int current = i;
vector<int> history;
while(!next)
{
char* buffer = new char[10];
itoa(current, buffer, 10);
string digits = buffer;
delete[] buffer;
vector<int> squares;
for(int j = 0; j < digits.size(); j++)
{
char charDigit = digits[j];
int digit = atoi(&charDigit);
int square = pow(digit, 2);
squares.push_back(square);
}
int squaresum = sum(squares);
current = squaresum;
if(inVector(current, history))
{
next = true;
if(current == 1)
{
known.push_back(i);
//cout << i << "\t";
}
}
history.push_back(current);
}
}
//cout << "\n\n";
}
bool inVector(int inQuestion, vector<int> known)
{
for(vector<int>::iterator it = known.begin(); it != known.end(); it++)
if(*it == inQuestion)
return true;
return false;
}
int sum(vector<int> given)
{
int sum = 0;
for(vector<int>::iterator it = given.begin(); it != given.end(); it++)
sum += *it;
return sum;
}
int pow(int given, int power)
{
int original = given;
int current = given;
for(int i = 0; i < power-1; i++)
current *= original;
return current;
}
#!/usr/bin/env python
import timeit
upperBound = 0
def calcMain():
    known = []
    for i in range(0,upperBound+1):
        next = False
        current = i
        history = []
        while not next:
            digits = str(current)
            squares = [pow(int(digit), 2) for digit in digits]
            squaresum = sum(squares)
            current = squaresum
            if current in history:
                next = True
                if current == 1:
                    known.append(i)
                    ##print i, "\t",
            history.append(current)
    ##print "\nend"
while True:
    upperBound = input("Pick an upper bound: ")
    result = timeit.Timer(calcMain).timeit(1)
    print result, "seconds.\n"
For 100000 elements, the Python code took 6.9 seconds while the C++ originally took above 37 seconds.
I did some basic optimizations on your code and managed to get the C++ code above 100 times faster than the Python implementation. It now does 100000 elements in 0.06 seconds. That is 617 times faster than the original C++ code.
The most important thing is to compile in Release mode, with all optimizations. This code is literally orders of magnitude slower in Debug mode.
Next, I will explain the optimizations I did.
Moved all vector declarations outside of the loop; replaced them by a clear() operation, which is much faster than calling the constructor.
Replaced the call to pow(value, 2) by a multiplication : value * value.
Instead of having a squares vector and calling sum on it, I sum the values in-place using just an integer.
Avoided all string operations, which are very slow compared to integer operations. For instance, it is possible to compute the squares of each digit by repeatedly dividing by 10 and fetching the modulus 10 of the resulting value, instead of converting the value to a string and then each character back to int.
Avoided all vector copies, first by replacing passing by value with passing by reference, and finally by eliminating the helper functions completely.
Eliminated a few temporary variables.
And probably many small details I forgot. Compare your code and mine side-by-side to see exactly what I did.
It may be possible to optimize the code even more by using pre-allocated arrays instead of vectors, but this would be a bit more work and I'll leave it as an exercise to the reader. :P
Here's the optimized code :
#include <iostream>
#include <vector>
#include <string>
#include <ctime>
#include <algorithm>
#include <windows.h>
using namespace std;
void calcMain(int upperBound, vector<int>& known);
int main()
{
while(true)
{
vector<int> results;
int upperBound;
cout << "Pick an upper bound: ";
cin >> upperBound;
long start, end;
start = GetTickCount();
calcMain(upperBound, results);
end = GetTickCount();
for (size_t i = 0; i < results.size(); ++i) {
cout << results[i] << ", ";
}
cout << endl;
double seconds = (double)(end-start) / 1000.0;
cout << seconds << " seconds." << endl << endl;
}
return 0;
}
void calcMain(int upperBound, vector<int>& known)
{
vector<int> history;
for(int i = 0; i <= upperBound; i++)
{
int current = i;
history.clear();
while(true)
{
int temp = current;
int sum = 0;
while (temp > 0) {
sum += (temp % 10) * (temp % 10);
temp /= 10;
}
current = sum;
if(find(history.begin(), history.end(), current) != history.end())
{
if(current == 1)
{
known.push_back(i);
}
break;
}
history.push_back(current);
}
}
}
There's a new, radically faster version as a separate answer, so this answer is deprecated.
I rewrote your algorithm by making it cache whenever it finds the number to be happy or unhappy. I also tried to make it as pythonic as I could, for example by creating separate functions digits() and happy(). Sorry for using Python 3, but I get to show off a couple of useful things from it as well.
This version is much faster. It runs at 1.7s which is 10 times faster than your original program that takes 18s (well, my MacBook is quite old and slow :) )
#!/usr/bin/env python3
from timeit import Timer
from itertools import count
print_numbers = False
upperBound = 10**5 # Default value, can be overridden by user.
def digits(x:'nonnegative number') -> "yields number's digits":
    if not (x >= 0): raise ValueError('Number should be nonnegative')
    while x:
        yield x % 10
        x //= 10
def happy(number, known = {1}, happies = {1}) -> 'True/None':
    '''This function tells if the number is happy or not, caching results.
    It uses two static variables, parameters known and happies; the
    first one contains known happy and unhappy numbers; the second
    contains only happy ones.
    If you want, you can pass your own known and happies arguments. If
    you do, you should keep the assumption that is commented out on the
    first line of the body.
    '''
    # assert 1 in known and happies <= known # <= is expensive
    if number in known:
        return number in happies
    history = set()
    while True:
        history.add(number)
        number = sum(x**2 for x in digits(number))
        if number in known or number in history:
            break
    known.update(history)
    if number in happies:
        happies.update(history)
        return True
def calcMain():
    happies = {x for x in range(upperBound) if happy(x) }
    if print_numbers:
        print(happies)
if __name__ == '__main__':
    upperBound = eval(
        input("Pick an upper bound [default {0}]: "
              .format(upperBound)).strip()
        or repr(upperBound))
    result = Timer(calcMain).timeit(1)
    print ('This computation took {0} seconds'.format(result))
It looks like you're passing vectors by value to other functions. This will be a significant slowdown because the program will actually make a full copy of your vector before it passes it to your function. To get around this, pass a constant reference to the vector instead of a copy. So instead of:
int sum(vector<int> given)
Use:
int sum(const vector<int>& given)
When you do this, you'll no longer be able to use the vector::iterator because it is not constant. You'll need to replace it with vector::const_iterator.
You can also pass in non-constant references, but in this case, you don't need to modify the parameter at all.
This is my second answer, which caches things like the sum of squares for values <= 10**6:
happy_list[sq_list[x%happy_base] + sq_list[x//happy_base]]
That is,
the number is split into 3 digits + 3 digits
the precomputed table is used to get the sum of squares for both parts
these two results are added
the precomputed table is consulted to get the happiness of the number
I don't think the Python version can be made much faster than that (ok, if you throw away the fallback to the old version, that is the try: overhead, it's 10% faster).
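To make the split-and-lookup idea above concrete, here is a minimal standalone sketch. The names BASE, SQ, HAPPY, is_happy_slow and is_happy_fast are made up for this illustration and are not the ones used in my full program; it only assumes numbers below 10**6.

```python
def digit_square_sum(n):
    """Sum of the squares of the decimal digits of n (n >= 0)."""
    total = 0
    while n:
        n, d = divmod(n, 10)
        total += d * d
    return total

def is_happy_slow(n):
    """Follow the chain until it reaches 1 (happy) or repeats (unhappy)."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        n = digit_square_sum(n)
    return n == 1

BASE = 1000  # a number below 10**6 splits into two 3-digit halves
# Sum of squared digits for every possible 3-digit half.
SQ = [digit_square_sum(x) for x in range(BASE)]
# Happiness of every possible digit-square sum (at most 6 * 81 = 486).
HAPPY = [is_happy_slow(s) for s in range(6 * 81 + 1)]

def is_happy_fast(n):
    """Three list lookups and three arithmetic operations, for 0 <= n < 10**6."""
    return HAPPY[SQ[n % BASE] + SQ[n // BASE]]
```

For example, 123456 splits into 123 and 456, whose digit-square sums are 14 and 77; their total is 91, and 91 is happy (91, 82, 68, 100, 1), so 123456 is happy too.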
I think this is an excellent question which shows that, indeed,
things that have to be fast should be written in C
however, usually you don't need things to be fast (even if you needed the program to run for a day, it would be less than the combined time of programmers optimizing it)
it's easier and faster to write programs in Python
but for some problems, especially computational ones, C++ solutions like the ones above are actually more readable and more beautiful than an attempt to optimize a Python program.
Ok, here it goes (2nd version now...):
#!/usr/bin/env python3
'''Provides slower and faster versions of a function to compute happy numbers.
slow_happy() implements the algorithm as in the definition of happy
numbers (but also caches the results).
happy() uses the precomputed lists of sums of squares and happy numbers
to return result in just 3 list lookups and 3 arithmetic operations for
numbers less than 10**6; it falls back to slow_happy() for big numbers.
Utilities: digits() generator, my_timeit() context manager.
'''
from time import time # For my_timeit.
from random import randint # For example with random number.
upperBound = 10**5 # Default value, can be overridden by user.
class my_timeit:
    '''Very simple timing context manager.'''
    def __init__(self, message):
        self.message = message
        self.start = time()
    def __enter__(self):
        return self
    def __exit__(self, *data):
        print(self.message.format(time() - self.start))
def digits(x:'nonnegative number') -> "yields number's digits":
    if not (x >= 0): raise ValueError('Number should be nonnegative')
    while x:
        yield x % 10
        x //= 10
def slow_happy(number, known = {1}, happies = {1}) -> 'True/None':
    '''Tell if the number is happy or not, caching results.
    It uses two static variables, parameters known and happies; the
    first one contains known happy and unhappy numbers; the second
    contains only happy ones.
    If you want, you can pass your own known and happies arguments. If
    you do, you should keep the assumption commented out on the 1 line.
    '''
    # This is commented out because <= is expensive.
    # assert {1} <= happies <= known
    if number in known:
        return number in happies
    history = set()
    while True:
        history.add(number)
        number = sum(x**2 for x in digits(number))
        if number in known or number in history:
            break
    known.update(history)
    if number in happies:
        happies.update(history)
        return True
# This will define new happy() to be much faster ------------------------.
with my_timeit('Preparation time was {0} seconds.\n'):
    LogAbsoluteUpperBound = 6 # The maximum possible number is 10**this.
    happy_list = [slow_happy(x)
                  for x in range(81*LogAbsoluteUpperBound + 1)]
    happy_base = 10**((LogAbsoluteUpperBound + 1)//2)
    sq_list = [sum(d**2 for d in digits(x))
               for x in range(happy_base + 1)]
def happy(x):
    '''Tell if the number is happy, optimized for smaller numbers.
    This function works fast for numbers <= 10**LogAbsoluteUpperBound.
    '''
    try:
        return happy_list[sq_list[x%happy_base] + sq_list[x//happy_base]]
    except IndexError:
        return slow_happy(x)
# End of happy()'s redefinition -----------------------------------------.
def calcMain(print_numbers, upper_bound):
    happies = [x for x in range(upper_bound + 1) if happy(x)]
    if print_numbers:
        print(happies)
if __name__ == '__main__':
    while True:
        upperBound = eval(input(
            "Pick an upper bound [{0} default, 0 ends, negative number prints]: "
            .format(upperBound)).strip() or repr(upperBound))
        if not upperBound:
            break
        with my_timeit('This computation took {0} seconds.'):
            calcMain(upperBound < 0, abs(upperBound))
        single = 0
        while not happy(single):
            single = randint(1, 10**12)
        print('FYI, {0} is {1}.\n'.format(single,
              'happy' if happy(single) else 'unhappy'))
    print('Nice to see you, goodbye!')
I can see that you have quite a few unnecessary heap allocations.
For example:
while(!next)
{
char* buffer = new char[10];
This doesn't look very optimized. You probably want to pre-allocate the array once and reuse it inside your loop. This is a basic optimization technique which is easy to spot and to apply. It can turn into a mess too, so be careful with that.
You are also using the atoi() function, and I don't know how well optimized it is. Taking the value modulo 10 to get each digit might be better (you have to measure though; I didn't test this).
The fact that you have a linear search (inVector) might be bad. Replacing the vector data structure with a std::set might speed things up. A hash_set could do the trick too.
But I think that the worst problem is the string and this allocation of stuff on the heap inside that loop. That doesn't look good. I would try at those places first.
Well, I also gave it a once-over. I didn't test or even compile, though.
General rules for numerical programs:
Never process numbers as text. That's what makes lesser languages than Python slow, so if you do it in C, the program will be slower than Python.
Don't use data structures if you can avoid them. You were building an array just to add the numbers up. Better keep a running total.
Keep a copy of the STL reference open so you can use it rather than writing your own functions.
void calcMain(int upperBound)
{
vector<int> known;
for(int i = 0; i <= upperBound; i++)
{
int current = i;
vector<int> history;
do
{
int squaresum = 0;
for ( ; current; current /= 10 )
{
int digit = current % 10;
squaresum += digit * digit;
}
current = squaresum;
history.push_back(current);
} while ( ! count(history.begin(), history.end() - 1, current) );
if(current == 1)
{
known.push_back(i);
//cout << i << "\t";
}
}
//cout << "\n\n";
}
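For anyone who wants to check the first two rules on the Python side as well, here is a small sketch. The function names square_sum_text and square_sum_arith are invented for this comparison; both compute the same digit-square sum, one through a string and one with a running total.

```python
def square_sum_text(n):
    """Digit squares via string conversion (processing the number as text)."""
    return sum(int(ch) ** 2 for ch in str(n))

def square_sum_arith(n):
    """Same result with a running total: no string and no intermediate list."""
    total = 0
    while n > 0:
        n, digit = divmod(n, 10)
        total += digit * digit
    return total
```

The two can be compared with something like timeit.timeit(lambda: square_sum_arith(987654), number=100000); I make no claim about the ratio, measure it on your own machine.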
Just to get a little more closure on this issue by seeing how fast I could truly find these numbers, I wrote a multithreaded C++ implementation of Dr_Asik's algorithm. There are two things that are important to realize about the fact that this implementation is multithreaded.
More threads do not necessarily lead to better execution times; there is a happy medium for every situation depending on the volume of numbers you want to calculate.
If you compare the times between this version running with one thread and the original version, the only factors that could cause a difference in time are the overhead from starting the thread and variable system performance issues. Otherwise, the algorithm is the same.
The code for this implementation (all credit for the algorithm goes to Dr_Asik) is here. Also, I wrote some speed tests with a double check for each test to help back up these points.
Calculation of the first 100,000,000 happy numbers:
Original - 39.061 / 39.000 (Dr_Asik's original implementation)
1 Thread - 39.000 / 39.079
2 Threads - 19.750 / 19.890
10 Threads - 11.872 / 11.888
30 Threads - 10.764 / 10.827
50 Threads - 10.624 / 10.561 <--
100 Threads - 11.060 / 11.216
500 Threads - 13.385 / 12.527
From these results it looks like our happy medium is about 50 threads, plus or minus ten or so.
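The partitioning idea behind these runs can be sketched in a few lines of Python. This is not my C++ code, and the names is_happy, happy_in_range and happy_numbers_parallel are made up for the illustration. Note also that CPython threads share one interpreter lock, so this sketch shows how the range is split and merged, not the speedups listed above; a process pool would be the usual substitute there.

```python
from concurrent.futures import ThreadPoolExecutor

def is_happy(n):
    """True if repeatedly summing the squared digits of n reaches 1."""
    seen = set()
    while n != 1 and n not in seen:
        seen.add(n)
        total = 0
        while n:
            n, d = divmod(n, 10)
            total += d * d
        n = total
    return n == 1

def happy_in_range(lo, hi):
    """Happy numbers in the half-open range [lo, hi)."""
    return [n for n in range(lo, hi) if is_happy(n)]

def happy_numbers_parallel(upper_bound, workers=4):
    """Split 1..upper_bound into contiguous chunks, one task per chunk."""
    step = max(1, -(-upper_bound // workers))  # ceiling division, at least 1
    chunks = [(lo, min(lo + step, upper_bound + 1))
              for lo in range(1, upper_bound + 1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps chunk order, so the merged list stays sorted.
        parts = list(pool.map(lambda c: happy_in_range(*c), chunks))
    return [n for part in parts for n in part]
```

Because every chunk is independent, the worker count only changes how the work is cut up, which is why there is a sweet spot rather than "more is better".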
Other optimizations: by using arrays and direct access using the loop index rather than searching in a vector, and by caching prior sums, the following code (inspired by Dr Asik's answer but probably not optimized at all) runs 2445 times faster than the original C++ code, about 400 times faster than the Python code.
#include <iostream>
#include <windows.h>
#include <vector>
void calcMain(int upperBound, std::vector<int>& known)
{
int tempDigitCounter = upperBound;
int numDigits = 0;
while (tempDigitCounter > 0)
{
numDigits++;
tempDigitCounter /= 10;
}
int maxSlots = numDigits * 9 * 9;
std::vector<int> history(maxSlots + 1, -1); // -1 = never visited; a raw new[] array would be read uninitialized below
std::vector<int> cache(upperBound + 1, 0); // 0 = digit-square sum not cached yet
int current, sum, temp;
for(int i = 0; i <= upperBound; i++)
{
current = i;
while(true)
{
sum = 0;
temp = current;
bool inRange = temp <= upperBound;
if (inRange)
{
int cached = cache[temp];
if (cached)
{
sum = cached;
}
}
if (sum == 0)
{
while (temp > 0)
{
int tempMod = temp % 10;
sum += tempMod * tempMod;
temp /= 10;
}
if (inRange)
{
cache[current] = sum;
}
}
current = sum;
if(history[current] == i)
{
if(current == 1)
{
known.push_back(i);
}
break;
}
history[current] = i;
}
}
}
int main()
{
while(true)
{
int upperBound;
std::vector<int> known;
std::cout << "Pick an upper bound: ";
std::cin >> upperBound;
long start, end;
start = GetTickCount();
calcMain(upperBound, known);
end = GetTickCount();
for (size_t i = 0; i < known.size(); ++i) {
std::cout << known[i] << ", ";
}
double seconds = (double)(end-start) / 1000.0;
std::cout << std::endl << seconds << " seconds." << std::endl << std::endl;
}
return 0;
}
Stumbled over this page whilst bored and thought I'd golf it in js. The algorithm is my own, and I haven't checked it thoroughly against anything other than my own calculations (so it could be wrong). It calculates the first 1e7 happy numbers and stores them in h. If you want to change it, change both the 7s.
m=1e7,C=7*81,h=[1],t=true,U=[,,,,t],n=w=2;
while(n<m){
z=w,s=0;while(z)y=z%10,s+=y*y,z=0|z/10;w=s;
if(U[w]){if(n<C)U[n]=t;w=++n;}else if(w<n)h.push(n),w=++n;}
This will print the first 1000 items for you in console or a browser:
o=h.slice(0,m>1e3?1e3:m);
(!this.document?print(o):document.load=document.write(o.join('\n')));
155 characters for the functional part and it appears to be as fast* as Dr. Asik's offering on firefox or v8 (350-400 times as fast as the original python program on my system when running time d8 happygolf.js or js -a -j -p happygolf.js in spidermonkey).
I shall be in awe of the analytic skills of anyone who can figure out why this algorithm is doing so well without referencing the longer, commented, fortran version.
I was intrigued by how fast it was, so I learned fortran to get a comparison of the same algorithm. Be kind if there are any glaring newbie mistakes; it's my first fortran program. http://pastebin.com/q9WFaP5C
It's static memory wise, so to be fair to the others, it's in a self-compiling shell script, if you don't have gcc/bash/etc strip out the preprocessor and bash stuff at the top, set the macros manually and compile it as fortran95.
Even if you include compilation time it beats most of the others here. If you don't, it's about ~3000-3500 times as fast as the original python version (and by extension >40,000 times as fast as the C++*, although I didn't run any of the C++ programs).
Surprisingly many of the optimizations I tried in the fortran version (incl some like loop unrolling which I left out of the pasted version due to small effect and readability) were detrimental to the js version. This exercise shows that modern trace compilers are extremely good (within a factor of 7-10 of carefully optimized, static memory fortran) if you get out of their way and don't try any tricky stuff.
Finally, here's a much nicer, more recursive js version.
// Adds the square of the last digit of x
// to s, then integer divides x by 10.
// Repeats until x is 0.
function sumsq(x) {
var y,s=0;
while(x) {
y = x % 10;
s += y * y;
x = 0| x / 10;
}
return s;
}
// A boolean cache for happy().
// The terminating happy number and an unhappy number in
// the terminating sequence.
var H=[];
H[1] = true;
H[4] = false;
// Test if a number is happy.
// First check the cache, if that's empty
// Perform one round of sumsq, then check the cache
// For that. If that's empty, recurse.
function happy(x) {
// If it already exists.
if(H[x] !== undefined) {
// Return whatever is already in cache.
return H[x];
} else {
// Else calc sumsq, set and return cached val, or if undefined, recurse.
var w = sumsq(x);
return (H[x] = H[w] !== undefined? H[w]: happy(w));
}
}
//Main program loop.
var i, hN = [];
for(i = 1; i < 1e7; i++) {
if(happy(i)) { hN.push(i); }
}
Surprisingly, even though it is rather high level, it did almost exactly as well as the imperative algorithm in spidermonkey (with optimizations on), and close (1.2 times as long) in v8.
Moral of the story, I guess: spend a bit of time thinking about your algorithm if it's important. Also, high level languages already have a lot of overhead (and sometimes have tricks of their own to reduce it), so sometimes doing something more straightforward or utilizing their high level features is just as fast. Also, micro-optimization doesn't always help.
*Unless my python installation is unusually slow, direct times are somewhat meaningless as this is a first generation eee.
Times are:
12s for fortran version, no output, 1e8 happy numbers.
40s for fortran version, pipe output through gzip to disk.
8-12s for both js versions. 1e7 happy numbers, no output with full optimization
10-100s for both js versions 1e7 with less/no optimization (depending on definition of no optimization, the 100s was with eval()) no output
I'd be interested to see times for these programs on a real computer.
I am not an expert at C++ optimization, but I believe the speed difference may be due to the fact that Python lists preallocate more space up front, while your C++ vectors must reallocate and possibly copy each time they grow.
As for GMan's comment about find, I believe that the Python "in" operator is also a linear search and is the same speed.
Edit
Also I just noticed that you rolled your own pow function. There is no need to do that and the stdlib is likely faster.
Here is another way that relies on memorising all the numbers already explored.
I obtain a factor of x4-5 against DrAsik's code, which is oddly stable between upper bounds of 1000 and 1000000; I expected the cache to become more efficient the more numbers we explore. Otherwise, the same kind of classic optimizations have been applied. BTW, if the compiler supports NRVO (/RNVO? I never remember the exact term) or rvalue references, we wouldn't need to pass the vector as an out parameter.
NB: micro-optimizations are still possible IMHO, and moreover the caching is naive as it allocates much more memory than really needed.
enum Status {
never_seen,
being_explored,
happy,
unhappy
};
char const* toString[] = { "never_seen", "being_explored", "happy", "unhappy" };
inline size_t sum_squares(size_t i) {
size_t s = 0;
while (i) {
const size_t digit = i%10;
s += digit * digit;
i /= 10;
}
return s ;
}
struct Cache {
Cache(size_t dim) : m_cache(dim, never_seen) {}
void set(size_t n, Status status) {
if (m_cache.size() <= n) {
m_cache.resize(n+1, never_seen);
}
m_cache[n] = status;
// std::cout << "(c[" << n << "]<-"<<toString[status] << ")";
}
Status operator[](size_t n) const {
if (m_cache.size() <= n) {
return never_seen;
} else {
return m_cache[n];
}
}
private:
std::vector<Status> m_cache;
};
void search_happy_lh(size_t upper_bound, std::vector<size_t> & happy_numbers)
{
happy_numbers.clear();
happy_numbers.reserve(upper_bound); // it doesn't improve much the performances
Cache cache(upper_bound+1);
std::vector<size_t> current_stack;
cache.set(1,happy);
happy_numbers.push_back(1);
for (size_t i = 2; i<=upper_bound ; ++i) {
// std::cout << "\r" << i << std::flush;
current_stack.clear();
size_t s= i;
while ( s != 1 && cache[s]==never_seen)
{
current_stack.push_back(s);
cache.set(s, being_explored);
s = sum_squares(s);
// std::cout << " - " << s << std::flush;
}
const Status update_with = (cache[s]==being_explored ||cache[s]==unhappy) ? unhappy : happy;
// std::cout << " => " << s << ":" << toString[update_with] << std::endl;
for (size_t j=0; j!=current_stack.size(); ++j) {
cache.set(current_stack[j], update_with);
}
if (cache[i] == happy) {
happy_numbers.push_back(i);
}
}
}
Here's a C# version:
using System;
using System.Collections.Generic;
using System.Text;
namespace CSharp
{
class Program
{
static void Main (string [] args)
{
while (true)
{
Console.Write ("Pick an upper bound: ");
String
input = Console.ReadLine ();
uint
upper_bound;
if (uint.TryParse (input, out upper_bound))
{
DateTime
start = DateTime.Now;
CalcHappyNumbers (upper_bound);
DateTime
end = DateTime.Now;
TimeSpan
span = end - start;
Console.WriteLine ("Time taken = " + span.TotalSeconds + " seconds.");
}
else
{
Console.WriteLine ("Error in input, unable to parse '" + input + "'.");
}
}
}
enum State
{
Happy,
Sad,
Unknown
}
static void CalcHappyNumbers (uint upper_bound)
{
SortedDictionary<uint, State>
happy = new SortedDictionary<uint, State> ();
SortedDictionary<uint, bool>
happy_numbers = new SortedDictionary<uint, bool> ();
happy [1] = State.Happy;
happy_numbers [1] = true;
for (uint current = 2 ; current < upper_bound ; ++current)
{
FindState (ref happy, ref happy_numbers, current);
}
//foreach (KeyValuePair<uint, bool> pair in happy_numbers)
//{
// Console.Write (pair.Key.ToString () + ", ");
//}
//Console.WriteLine ("");
}
static State FindState (ref SortedDictionary<uint, State> happy, ref SortedDictionary<uint,bool> happy_numbers, uint value)
{
State
current_state;
if (happy.TryGetValue (value, out current_state))
{
if (current_state == State.Unknown)
{
happy [value] = State.Sad;
}
}
else
{
happy [value] = current_state = State.Unknown;
uint
new_value = 0;
for (uint i = value ; i != 0 ; i /= 10)
{
uint
lsd = i % 10;
new_value += lsd * lsd;
}
if (new_value == 1)
{
current_state = State.Happy;
}
else
{
current_state = FindState (ref happy, ref happy_numbers, new_value);
}
if (current_state == State.Happy)
{
happy_numbers [value] = true;
}
happy [value] = current_state;
}
return current_state;
}
}
}
I compared it against Dr_Asik's C++ code. For an upper bound of 100000 the C++ version ran in about 2.9 seconds and the C# version in 0.35 seconds. Both were compiled using Dev Studio 2005 using default release build options and both were executed from a command prompt.
Here's some food for thought: If given the choice of running a 1979 algorithm for finding prime numbers in a 2009 computer or a 2009 algorithm on a 1979 computer, which would you choose?
The new algorithm on ancient hardware would be the better choice by a huge margin. Have a look at your "helper" functions.
There are quite a few optimizations possible:
(1) Use const references
bool inVector(int inQuestion, const vector<int>& known)
{
for(vector<int>::const_iterator it = known.begin(); it != known.end(); ++it)
if(*it == inQuestion)
return true;
return false;
}
int sum(const vector<int>& given)
{
int sum = 0;
for(vector<int>::const_iterator it = given.begin(); it != given.end(); ++it)
sum += *it;
return sum;
}
(2) Use counting down loops
int pow(int given, int power)
{
int current = 1;
while(power--)
current *= given;
return current;
}
Or, as others have said, use the standard library code.
(3) Don't allocate buffers where not required
vector<int> squares;
for (int temp = current; temp != 0; temp /= 10)
{
squares.push_back(pow(temp % 10, 2));
}
With similar optimizations as PotatoSwatter I got time for 10000 numbers down from 1.063 seconds to 0.062 seconds (except I replaced itoa with standard sprintf in the original).
With all the memory optimizations (don't pass containers by value - in C++ you have to explicitly decide whether you want a copy or a reference; move operations that allocate memory out of inner loops; if you already have the number in a char buffer, what's the point of copying it to std::string etc) I got it down to 0.532.
The rest of the time came from using %10 to access digits, rather than converting numbers to string.
I suppose there might be another algorithmic level optimization (numbers that you have encountered while finding a happy number are themselves also happy numbers?) but I don't know how much that gains (there are not that many happy numbers in the first place) and this optimization is not in the Python version either.
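On the question in parentheses: yes, every value met on the way from a happy number to 1 is itself happy, because its own chain is just the tail of the same chain. A rough sketch of how that could be exploited follows; the names digit_square_sum and happy_up_to are invented here, and I have not timed this against the versions above.

```python
def digit_square_sum(n):
    """Sum of the squares of the decimal digits of n."""
    total = 0
    while n > 0:
        n, digit = divmod(n, 10)
        total += digit * digit
    return total

def happy_up_to(upper_bound):
    """Happy numbers in 1..upper_bound, recording a verdict for whole chains."""
    status = {1: True}  # number -> True (happy) or False (unhappy)
    for start in range(1, upper_bound + 1):
        chain = set()
        n = start
        # Walk until we hit a number with a known verdict or loop back.
        while n not in status and n not in chain:
            chain.add(n)
            n = digit_square_sum(n)
        # Looping back inside the current chain means a cycle without 1.
        verdict = status.get(n, False)
        for member in chain:
            status[member] = verdict
    return [n for n in range(1, upper_bound + 1) if status[n]]
```

For instance, the chain of 7 is 7, 49, 97, 130, 10, 1, so once 7 has been checked, 49, 97, 130 and 10 are already known to be happy.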
By the way, by not using string conversion and a list to square digits, I got the Python version from 0.825 seconds down to 0.33 too.
#!/usr/bin/env python
import timeit
upperBound = 0
def calcMain():
    known = set()
    for i in xrange(0,upperBound+1):
        next = False
        current = i
        history = set()
        while not next:
            squaresum=0
            while current > 0:
                current, digit = divmod(current, 10)
                squaresum += digit * digit
            current = squaresum
            if current in history:
                next = True
                if current == 1:
                    known.add(i)
            history.add(current)
while True:
    upperBound = input("Pick an upper bound: ")
    result = timeit.Timer(calcMain).timeit(1)
    print result, "seconds.\n"
I made a couple of minor changes to your original python code example that make a better than 16x improvement to the performance of the code.
The changes I made took the 100,000 case from about 9.64 seconds to about 3.38 seconds.
The major change was to make the mod 10 and accumulator changes to run in a while loop. I made a couple of other changes that improved execution time in only fractions of hundredths of seconds. The first minor change was changing the main for loop from a range list comprehension to an xrange iterator. The second minor change was substituting the set class for the list class for both the known and history variables.
I also experimented with iterator comprehensions and precalculating the squares but they both had negative effects on the efficiency.
I seem to be running a slower version of python or on a slower processor than some of the other contributors. I would be interested in the results of someone else's timing comparison of my python code against one of the optimized C++ versions of the same algorithm.
I also tried using the python -O and -OO optimizations but they had the reverse of the intended effect.
Why is everyone using a vector in the c++ version? Lookup time is O(N).
Even though it's not as efficient as the python set, use std::set. Lookup time is O(log(N)).
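The three lookup costs can be put side by side in Python as a rough analogy. This is only a sketch with made-up helper names: a plain list scan stands in for the vector, binary search over a sorted list stands in for the O(log(N)) lookup of std::set, and the built-in set is the hashed version.

```python
from bisect import bisect_left

def in_list(values, x):
    """Linear scan, like searching an unsorted vector: O(N)."""
    return x in values

def in_sorted(sorted_values, x):
    """Binary search over a sorted list: O(log(N))."""
    i = bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x

def in_set(value_set, x):
    """Hash lookup, as with a Python set: O(1) on average."""
    return x in value_set
```

All three answer the same question, so swapping the container does not change the algorithm, only the cost of each history check.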