c++ - Solution to 2-sum using unordered_map

Okay, so I am trying to solve the 2-SUM problem in C++. Given a file of 1,000,000 numbers in arbitrary order, I need to determine, for each t in [-10000, 10000], whether there exists a pair of integers whose sum is t. So this is basically the 2-SUM problem.
I coded up my solution in C++ using unordered_map as my hash table, and I am keeping the load on the hash table low. Still, this takes around 1 hr 15 mins to finish (successfully). Now I am wondering if it should really be that slow; reducing the load factor further did not give any considerable performance boost.
I have no idea where I can optimise the code. I tried different load factors; it doesn't help. This is a question from a MOOC, and people have been able to get it done in around 30 mins using the same hash-table approach. Can anybody help me make this code faster, or at least give a hint as to where the code might be slowing down?
Here is the code -
#include <iostream>
#include <unordered_map>
#include <fstream>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::cerr << "Usage: ./2sum <filename>" << std::endl;
        exit(1);
    }
    std::ifstream input(argv[1]);
    std::ofstream output("log.txt");
    std::unordered_map<long, int> data_map;
    data_map.max_load_factor(0.05);
    long tmp;
    while (input >> tmp) {
        data_map[tmp] += 1;
    }
    std::cerr << "input done!" << std::endl;
    std::cerr << "load factor " << data_map.load_factor() << std::endl;
    // debug print.
    for (auto iter = data_map.begin(); iter != data_map.end(); ++iter) {
        output << iter->first << " " << iter->second << std::endl;
    }
    std::cerr << "debug print done!" << std::endl;
    // solve
    long ans = 0;
    for (long i = -10000; i <= 10000; ++i) {
        // try to find a pair whose sum = i.
        // debug print.
        if (i % 100 == 0)
            std::cerr << i << std::endl;
        for (auto iter = data_map.begin(); iter != data_map.end(); ++iter) {
            long x = iter->first;
            long y = i - x;
            if (x == y)
                continue;
            auto search_y = data_map.find(y);
            if (search_y != data_map.end()) {
                ++ans;
                break;
            }
        }
    }
    std::cout << ans << std::endl;
    return 0;
}

On a uniform set where all sums are equally probable, the code below finishes in seconds. Otherwise, each missing sum takes about 0.75 s to check on my laptop.
The solution contains a minor improvement over the OP's code: it checks for duplicate values and eliminates them.
It then opens with a Monte Carlo heuristic: for about 1% of the total numbers, it picks one value from the set at random and searches for all the sums in the [minSum, maxSum] range that can be made with the picked value as one term and any of the others as the second. This pre-populates the sums set with... say... 'sums that can be found trivially'. In my tests, using 1M numbers generated randomly between -10M and 10M, this single step is all that is needed, and it takes a couple of seconds.
For pathological number distributions, in which some of the sum values are missing (or have not been found through the random heuristic), the second part runs a targeted exhaustive search over the not-yet-found sum values, very much along the same lines as the solution in the OP.
Extra explanation of the random/Monte Carlo heuristic (to address @AneeshDandime's comment):
Though i do not fully understand it at the moment
Well, it's simple. Think of it like this: the naive approach is to take all the input values and add them in pairs, retaining only the sums in [-10k, 10k]. That is, however, terribly expensive (O(N^2)). An immediate refinement: pick a value v0, then determine which other values v1 stand a chance of giving a sum in the [-10k, 10k] range. If the input values are sorted, it's easier: you only need to consider the v1-s in [-10k - v0, 10k - v0]. A good improvement, but if you keep this as the only approach, an exhaustive search is still O(N * log2(N)) for each of the 20,001 target sums.
However, this approach still has its value: if the input values are uniformly distributed, it will quickly populate the known-sums set with the most common sum values (and spend the rest of the time trying to find the infrequent or missing ones).
To capitalize on this, instead of running it to the end, one can perform a limited number of such steps and hope to populate the majority of the sums. After that, we switch focus and enter the 'targeted search for sum values', but only for the sum values not found in this step.
[Edited: previous bug corrected. The algorithm is now stable with respect to values that appear multiple times versus only once in the input.]
#include <algorithm>
#include <iostream>
#include <vector>
#include <random>
#include <unordered_set>
#include <unordered_map>

int main() {
    typedef long long value_type;
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++
    // substitute this with your input sequence from the file
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<value_type> initRnd(-5500, 10000000);
    std::vector<value_type> sorted_vals;
    for (std::size_t i = 0; i < 1000000; i++) {
        value_type rnd = initRnd(gen);
        sorted_vals.push_back(rnd);
    }
    std::cout << "Initialization end" << std::endl;
    // end of input
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++

    // use some constants instead of magic values
    const value_type sumMin = -10000, sumMax = 10000;

    // Mapping val -> number of occurrences
    std::unordered_map<value_type, size_t> hashed_vals;
    for (auto val : sorted_vals) {
        ++hashed_vals[val];
    }

    // retain only the unique values and sort them
    sorted_vals.clear();
    for (auto val = hashed_vals.begin(); val != hashed_vals.end(); ++val) {
        sorted_vals.push_back(val->first);
    }
    std::sort(sorted_vals.begin(), sorted_vals.end());

    // Store the encountered sums here
    std::unordered_set<int> sums;

    // some 1% iterations, looking at random for pairs of numbers which will
    // contribute a sum in the [-10000, 10000] range; we'll collect those sums.
    // We'll use the sorted vector of values for this purpose.
    // If we are lucky, most of the sums (if not all) will already be filled in.
    std::uniform_int_distribution<size_t> rndPick(0, sorted_vals.size() - 1);
    size_t numRandomPicks = size_t(sorted_vals.size() * 0.1);
    if (numRandomPicks > 75000) {
        numRandomPicks = 75000;
    }
    for (size_t i = 0; i < numRandomPicks; i++) {
        // pick a value index at random
        size_t randomIx = rndPick(gen);
        value_type val = sorted_vals[randomIx];
        // now search for the values between sumMin-val and sumMax-val
        auto low = std::lower_bound(sorted_vals.begin(), sorted_vals.end(), sumMin - val);
        if (low == sorted_vals.end()) {
            continue;
        }
        auto high = std::upper_bound(sorted_vals.begin(), sorted_vals.end(), sumMax - val);
        if (high == sorted_vals.begin()) {
            continue;
        }
        for (auto rangeIt = low; rangeIt != high; rangeIt++) {
            if (*rangeIt != val || hashed_vals[val] > 1) {
                // if not the same as the randomly picked value,
                // or if it is the same but that value occurred more than once in input
                auto sum = val + *rangeIt;
                sums.insert(sum);
            }
        }
        if (sums.size() == size_t(sumMax - sumMin + 1)) {
            // lucky us, we found them all
            break;
        }
    }
    // after which, if some sums are not present, we'll search for them specifically
    if (sums.size() != size_t(sumMax - sumMin + 1)) {
        std::cout << "Number of sums still missing: "
                  << size_t(sumMax - sumMin + 1) - sums.size()
                  << std::endl;
        for (int sum = sumMin; sum <= sumMax; sum++) {
            if (sums.find(sum) == sums.end()) {
                std::cout << "looking for sum: " << sum;
                // we couldn't find the sum, so we'll need to search for it.
                // We'll use the hashed_vals map this time to search for the other term
                bool found = false;
                for (auto i = sorted_vals.begin(); !found && i != sorted_vals.end(); ++i) {
                    value_type v = *i;
                    value_type other_val = sum - v;
                    if ( // v---- either two unequal terms to be summed or...
                        (other_val != v || hashed_vals[v] > 1) // ...the value occurred more than once
                        && hashed_vals.find(other_val) != hashed_vals.end() // and the other term exists
                    ) {
                        // found. Record it as such and break
                        sums.insert(sum);
                        found = true;
                    }
                }
                std::cout << (found ? " found" : " not found") << std::endl;
            }
        }
    }
    std::cout << "Total number of distinct sums found: " << sums.size() << std::endl;
}

You can reserve space for the unordered_map up front. It should increase performance a bit.
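For example, a minimal sketch (the element count and load factor come from the question; reserve() sizes the bucket array so no rehashing happens during input):

#include <unordered_map>

int main() {
    std::unordered_map<long, int> data_map;
    data_map.max_load_factor(0.05);
    // Pre-allocate buckets for the expected number of elements so the
    // table never rehashes while the 1,000,000 inputs are inserted.
    data_map.reserve(1000000);
    // ... read the file and fill data_map as in the question ...
}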

What about sorting the array first? Then, for each element in the array, use binary search to find the partner that brings the sum closest to -10000, and keep going "right" until the sum exceeds +10000.
This way you avoid scanning the whole array once for each of the ~20,000 target sums.
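A minimal sketch of this idea (the function name sums_in_range is mine; it assumes the course's requirement that x and y be distinct values):

#include <algorithm>
#include <iostream>
#include <unordered_set>
#include <vector>

// For each value v, binary-search the first partner w with v + w >= -10000,
// then walk right while v + w <= 10000, recording every sum seen.
std::unordered_set<long> sums_in_range(std::vector<long> vals) {
    std::sort(vals.begin(), vals.end());
    std::unordered_set<long> sums;
    for (long v : vals) {
        auto it = std::lower_bound(vals.begin(), vals.end(), -10000 - v);
        for (; it != vals.end() && v + *it <= 10000; ++it) {
            if (*it != v) // distinct x and y only
                sums.insert(v + *it);
        }
    }
    return sums;
}

int main() {
    std::vector<long> vals = {-9999, 3, 5, 10003};
    // prints the number of achievable targets t in [-10000, 10000]
    std::cout << sums_in_range(vals).size() << '\n';
}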

Related

Random generation of numbers doesn't work properly

I'm trying to create a program that makes sudokus. But when I try to let the program place numbers at random spots, it doesn't use every position.
I tried to use rand() with srand(time(0)),
and random number generators from <random>.
In the constructor I use this:
mt19937_64 randomGeneratorTmp(time(0));
randomGenerator = randomGeneratorTmp;
uniform_int_distribution<int> numGetterTmp(0, 8);
numGetter = numGetterTmp;
I keep randomGenerator and numGetter as member variables so I can use them in another function of the sudoku object.
And this is the function where I use the random numbers:
bool fillInNumber(int n) {
    int placedNums = 0, tries = 0;
    int failedTries[9][9];
    for (int dim1 = 0; dim1 < 9; dim1++) {
        for (int dim2 = 0; dim2 < 9; dim2++) {
            failedTries[dim1][dim2] = 0;
        }
    }
    while (placedNums < 9) {
        int dim1 = numGetter(randomGenerator);
        int dim2 = numGetter(randomGenerator);
        if (nums[dim1][dim2] == 0) {
            if (allowedLocation(n, dim1, dim2)) {
                nums[dim1][dim2] = n;
                placedNums++;
            } else {
                failedTries[dim1][dim2]++;
                tries++;
            }
        }
        if (tries > 100000000) {
            if (placedNums == 8) {
                cout << "Number: " << n << endl;
                cout << "Placing number: " << placedNums << endl;
                cout << "Dim1: " << dim1 << endl;
                cout << "Dim2: " << dim2 << endl;
                printArray(failedTries);
            }
            return false;
        }
    }
    return true;
}
(The array failedTries just shows me which positions the program tried;
most of the fields have been tried millions of times, while others not even once.)
I think the random generation just repeats itself before it has used every number combination, but I don't know what I'm doing wrong.
Don't expect random numbers to have an even distribution over your matrix - there's no guarantee they will. That would be like having a routine to randomly generate cards from a deck, and waiting until you see all 52 values - you may wait a very very long time to get every single card.
That's especially true since "random" numbers are actually produced by pseudorandom number generators, which generally work by multiplying by a very large number and adding an arbitrary constant. Depending on the algorithm, the results might cluster in unanticipated ways.
If I may make a suggestion: create an array of all of the possible matrix positions, and then shuffle that array. That's how deck shuffling algorithms are able to guarantee you have all the cards in the deck covered, and it's the same problem you're having.
For a shuffle, generate two random positions in the array and exchange the values; repeat as many times as it takes to get a suitably random result. (Since your array is limited to 9x9, I might shuffle an array of ints 0..80 and extract the rows and columns with / 9 and % 9 for each int.)
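A minimal sketch of that approach (using std::shuffle, which performs the swapping for you, instead of hand-rolled exchanges):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // One int per cell of the 9x9 grid, encoded as row * 9 + column.
    std::vector<int> positions(81);
    std::iota(positions.begin(), positions.end(), 0);

    std::mt19937_64 gen(std::random_device{}());
    std::shuffle(positions.begin(), positions.end(), gen);

    // Visit every cell exactly once, in random order.
    for (int p : positions) {
        int row = p / 9;
        int col = p % 9;
        // ... try to place a number at (row, col) ...
        (void)row; (void)col;
    }
}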
I wrote a simple program that should be equivalent to your code, and it works without issues:
#include <iostream>
#include <random>
#include <vector>
using namespace std;

int main()
{
    std::default_random_engine engine;
    std::uniform_int_distribution<int> distr(0, 80);
    std::vector<bool> vals(81, false);
    int attempts = 0;
    int trueCount = 0;
    while (trueCount < 81)
    {
        int newNum = distr(engine);
        if (!vals[newNum])
        {
            vals[newNum] = true;
            trueCount++;
        }
        attempts++;
    }
    std::cout << "attempts: " << attempts;
    return 0;
}
Usually it prints around 400 attempts, which matches the statistical expectation (the coupon-collector average, 81 x H(81) ≈ 403).
You most likely have a bug in your code. I am not sure where, though, as you don't show all of your code.

C++ source code bug -- Computing Differences in Distance and Total Sums

The purpose of this program is to take a set of double values (distances) as input and output the total distance as a sum. It's also meant to report the smallest and largest distances, as well as calculate the mean of two or more distances.
I would also like to remove the repetitive block of code in my program, which I literally copied to get the second part of the source working. Apparently there's a way to remove the duplication, but I don't know how.
Here's the source:
/* These includes are all part of a custom header designed
   by Bjarne Stroustrup as part of Programming: Principles and Practice
   Using C++
*/
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sstream>
#include <cmath>
#include <cstdlib>
#include <string>
#include <list>
#include <forward_list>
#include <vector>
#include <unordered_map>
#include <algorithm>
#include <array>
#include <regex>
#include <random>
#include <stdexcept>
// I am also using the "stdafx.h" header.
// Reading a sequence of doubles into a vector.
// This could be the distance between two areas along different paths.
int main()
{
    vector<double> dist;  // the distances entered so far
    double sum = 0;       // running total of the distances
    double min = 0;       // min dist
    double max = 0;       // max dist
    cout << "Please enter a sequence of doubles (representing distances): \n";
    double val = 0;
    while (cin >> val)
    {
        if (val <= 0)
        {
            if (dist.size() == 0)
                error("no distances");
            cout << "The total distance is: " << sum << "\n";
            cout << "The smallest distance is: " << min << "\n";
            cout << "The greatest distance is: " << max << "\n";
            cout << "The average (mean) distance is: " << sum / dist.size() << "\n";
            keep_window_open();
            return 0;
        }
        dist.push_back(val); // store the value
        // update the running values
        sum += val;
        if (val < min)
            min = val;
        if (max < val)
            max = val;
    }
    if (dist.size() == 0)
        error("no distances");
    cout << "The total distance is: " << sum << "\n";
    cout << "The smallest distance is: " << min << "\n";
    cout << "The greatest distance is: " << max << "\n";
    cout << "The average (mean) distance is: " << sum / dist.size() << "\n";
    keep_window_open();
}
Additionally, I have been trying to add a small block of code of the form catch (runtime_error e), but the compiler expects a declaration of some sort and I don't know how to get it to compile without errors.
Help with removing the replicated/repeating block of code to reduce bloat would be great, on top of everything else.
Instead of having the if statement inside the while, you should combine the two conditions to avoid duplicating that code:
while ( (cin >> val) && (val > 0) )
Also, you need to initialize min to a largest value, rather than zero, if you want the first comparison to capture the first possible value for min.
Making a function out of duplicated code is a general purpose solution that isn't a good choice in your case for two reasons: First, it isn't necessary, since it is easier and better to combine the flow of control so there is no need to invoke that code in two places. Second there are too many local variables used in the duplicated code, so if there were a reason to make the duplicated code into a function, good design would also demand collecting some or all of those local variables into an object.
If it had not been cleaner and easier to merge the two conditions, it would still have been better to merge the flow of control than to invent a function to call from two places. You could have used:
if (val <= 0)
{
break;
}
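Putting both fixes together, a minimal sketch of the de-duplicated loop (assuming the book's std_lib_facilities.h header, which supplies error(), keep_window_open(), and the std names; seeding min from <limits> is my addition):

#include <limits>
// ... plus the book header used in the question ...

int main()
{
    vector<double> dist;
    double sum = 0;
    double min = std::numeric_limits<double>::max(); // so the first value is captured
    double max = 0;
    double val = 0;
    cout << "Please enter a sequence of doubles (representing distances): \n";
    // The loop ends on EOF, bad input, or a non-positive value:
    // one exit path, so the report is written only once.
    while ((cin >> val) && (val > 0))
    {
        dist.push_back(val);
        sum += val;
        if (val < min) min = val;
        if (max < val) max = val;
    }
    if (dist.size() == 0)
        error("no distances");
    cout << "The total distance is: " << sum << "\n";
    cout << "The smallest distance is: " << min << "\n";
    cout << "The greatest distance is: " << max << "\n";
    cout << "The average (mean) distance is: " << sum / dist.size() << "\n";
    keep_window_open();
    return 0;
}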

Program will not output data to console when using a data input size greater than 30 million

I'm trying to make a program that will eventually show the runtime differences between a binary search tree and a vector on large data inputs. But before I get to that, I'm testing whether the insertion and search functions work properly. Everything seems fine, but whenever I assign SIZE to be 30 million or more, after about 10-20 seconds the program displays only "Press any key to continue..." with no other output. However, if I assign SIZE to 20 million or less, it outputs the search results as I programmed it. So what do you think is causing this problem?
Some side notes:
I'm storing unique (no duplicate) randomly generated values into the tree as well as the vector. So at the end, the tree and the vector will both hold exactly the same values. When the program runs the search portion, if a value is found in the BST, it should be found in the vector as well. So far this has worked with no problems when using 20 million values or less.
Also, I'm using randValue = rand() * rand(); to generate the random values, because I know the maximum value of rand() is 32767, so multiplying it by itself gives a range of 0 to 1,073,676,289 (32767^2). I know the insertion and searching methods I'm using are inefficient, because I'm making sure there are no duplicates, but that's not my concern right now. This is just for my own practice.
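As an aside, <random> can produce such a range directly and uniformly; a minimal sketch (not part of the original program), since rand() * rand() is heavily biased toward small products:

#include <iostream>
#include <random>

int main() {
    // Draws uniformly from [0, 1073741824], unlike rand() * rand(),
    // whose products cluster and skip many values entirely.
    std::mt19937_64 gen(std::random_device{}());
    std::uniform_int_distribution<long long> dist(0, 1073741824LL);
    std::cout << dist(gen) << '\n';
}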
I'm only posting up my main.cpp for the sake of simplicity. If you think the problem lies in one of my other files, I'll post the rest up.
Here's my main.cpp:
#include <iostream>
#include <time.h>
#include <vector>
#include "BSTTemplate.h"
#include "functions.h"
using namespace std;

int main()
{
    const long long SIZE = 30000000;
    vector<long long> vector1(SIZE);
    long long randNum;
    binarySearchTree<long long> bst1;
    srand(time(NULL));
    // inserts data into the BST and into the vector AND makes sure there are no duplicates
    for (long long i = 0; i < SIZE; i++)
    {
        randNum = randLLNum();
        bst1.insert(randNum);
        if (bst1.numDups == 1) // if the random number generated is a duplicate, don't count it and redo that iteration
        {
            i--;
            bst1.numDups = 0;
            continue;
        }
        vector1[i] = randNum;
    }
    // search for a random value in both the BST and the vector
    for (int i = 0; i < 5; i++)
    {
        randNum = randLLNum();
        cout << endl << "The random number chosen is: " << randNum << endl << endl;
        // searching with the BST
        cout << "Searching for " << randNum << " in BST..." << endl;
        if (bst1.search(randNum))
            cout << randNum << " = found" << endl;
        else
            cout << randNum << " = not found" << endl;
        // searching with linear search using the vector
        cout << endl << "Searching for " << randNum << " in vector..." << endl;
        if (containsInVector(vector1, SIZE, randNum))
            cout << randNum << " = found" << endl;
        else
            cout << randNum << " = not found" << endl;
    }
    cout << endl;
    return 0;
}
(Comments reposted as answer at OP's request)
Options include: compile 64 bit (if you're not already - may make it better or worse depending on whether RAM or address space are the issue), buy more memory, adjust your operating system's swap memory settings (letting it use more disk), design a more memory-efficient tree (but at best you'll probably only get an order of magnitude improvement, maybe less, and it could impact other things like performance characteristics), redesign your tree so it manually saves data out to disk and reads it back (e.g. with an LRU).
Here's a how-to for compiling 64 bit on VC++: msdn.microsoft.com/en-us/library/9yb4317s.aspx

How Can I Speed My C++ Program Up?

Basically I am relearning C++ and decided to create a lotto number generator.
The code creates a ticket and, if that ticket does not already exist, adds it to a vector in order to store every possible combination.
The program works, but it's just far too slow, adding roughly one entry per second, and it will get slower as it finds it more and more difficult to add unique combinations out of over 13 million possible ones.
Anyway, here is my code; any optimization tips would be appreciated:
#include <iostream>
#include <cstdlib>
#include <ctime>
#include <string>
#include <sstream>
#include <vector>
#include <algorithm>
using namespace std;

vector<string> lottoCombos;
const int NUMBERS_PER_TICKET = 6;
const int NUMBERS = 49;
const int POSSIBLE_COMBOS = 13983816;

string createTicket();
void startUp();
void getAllCombinations();

int main()
{
    lottoCombos.reserve(POSSIBLE_COMBOS);
    cout << "Random Ticket: " << createTicket() << endl;
    getAllCombinations();
    for (int i = 0; i < POSSIBLE_COMBOS; i++)
    {
        cout << endl << lottoCombos[i];
    }
    system("PAUSE");
    return 0;
}

string createTicket()
{
    srand(static_cast<unsigned int>(time(0)));
    vector<int> ticket;
    vector<int> numbers;
    vector<int>::iterator numberIterator;
    // ADD AVAILABLE NUMBERS TO VECTOR
    for (int i = 0; i < NUMBERS; i++)
    {
        numbers.push_back(i + 1);
    }
    for (int j = 0; j < NUMBERS_PER_TICKET; j++)
    {
        int ticketNumber = rand() % numbers.size();
        numberIterator = numbers.begin() + ticketNumber;
        int nm = *numberIterator;
        numbers.erase(numberIterator);
        ticket.push_back(nm);
    }
    sort(ticket.begin(), ticket.end());
    string result;
    ostringstream convert;
    convert << ticket[0] << ", " << ticket[1] << ", " << ticket[2] << ", "
            << ticket[3] << ", " << ticket[4] << ", " << ticket[5];
    result = convert.str();
    return result;
}

void getAllCombinations()
{
    int i = 0;
    cout << "Max Vector Size: " << lottoCombos.max_size() << endl;
    cout << "Creating Entries" << endl;
    while (i != POSSIBLE_COMBOS)
    {
        bool matchFound = true;
        string newNumbers = createTicket();
        for (int j = 0; j < lottoCombos.size(); j++)
        {
            if (newNumbers == lottoCombos[j])
            {
                matchFound = false;
                break;
            }
        }
        if (matchFound != false)
        {
            lottoCombos.push_back(newNumbers); // store the ticket that was actually checked
            i++;
            cout << "Entries: " << i << endl;
        }
    }
    sort(lottoCombos.begin(), lottoCombos.end());
    cout << "\nCombination generation complete!!!\n\n";
}
The reason each lottery ticket is taking a second to generate is because you are misusing srand(). By calling srand(time(0)) every time createTicket() is called, you ensure that createTicket() returns the same numbers every time it is called, until the next time the value returned by time() changes, i.e. once per second. So your reject-duplicates algorithm will almost always find a duplicate until the next second goes by. You should move your srand(time(0)) call to the top of main() instead.
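A minimal sketch of that fix (seed exactly once at startup):

#include <cstdlib>
#include <ctime>

int main()
{
    // Seed once here; reseeding inside createTicket() replays the same
    // rand() sequence until time(0) advances to the next second.
    srand(static_cast<unsigned int>(time(0)));
    // ... createTicket() can now call rand() freely and get fresh values ...
}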
That said, there are perhaps larger issues to confront here: my first question would be, is it really necessary to generate and store every possible lottery ticket? (and if so, why?) IIRC real lotteries don't do that when issuing a ticket; they just generate some random numbers and print them out (and if there are multiple winning tickets printed with the same numbers, the owners of those tickets share the prize money).
Assuming you do need to generate every possible lottery ticket for some reason, there are better ways to do it than randomly. If you've ever watched the odometer increment while driving a car, you'll get the idea for how to do it linearly; just imagine an odometer with 6 wheels, where each wheel has 49 different possible positions it can be in (rather than the traditional 10).
Finally, a vector has O(N) lookup time, and if you do a lookup in the vector for every value you generate, your algorithm is O(N^2), which is to say it's going to get really slow really quickly as you generate more tickets. So if you have to store all known tickets in a data structure, you should definitely use a data structure with quicker lookup times, for example a std::map or a std::unordered_set, or even a std::bitset as suggested by @RedAlert.
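For illustration, a minimal sketch of the odometer-style enumeration (my own code, not the OP's): each "wheel" stays strictly greater than the one to its left, which emits every one of the 13,983,816 combinations exactly once, already sorted, with no duplicate checking at all.

#include <iostream>
#include <vector>

int main()
{
    const int NUMBERS = 49, NUMBERS_PER_TICKET = 6;
    std::vector<int> ticket = {1, 2, 3, 4, 5, 6}; // odometer start position
    long count = 0;
    while (true)
    {
        ++count; // here you would print or store the current ticket
        // Advance the rightmost wheel that still has room, then reset
        // every wheel to its right to the smallest legal value.
        int i = NUMBERS_PER_TICKET - 1;
        while (i >= 0 && ticket[i] == NUMBERS - (NUMBERS_PER_TICKET - 1 - i))
            --i;
        if (i < 0)
            break; // the odometer rolled over: all combinations visited
        ++ticket[i];
        for (int j = i + 1; j < NUMBERS_PER_TICKET; j++)
            ticket[j] = ticket[j - 1] + 1;
    }
    std::cout << count << std::endl; // prints 13983816
}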

Searching Multiple Vectors

I have 4 vectors with about 45,000 records each right now. I'm looking for an efficient method to run through these 4 vectors and output how many times they match the user's input. The data needs to match at the same index in each vector.
Multiple for loops? Vector find?
Thanks!
If the elements need to match at the same location, it seems that a std::find() or std::find_if() combined with a check for the other vectors at the position is a reasonable approach:
std::vector<A> a(...);
std::vector<B> b(...);
std::vector<C> c(...);
std::vector<D> d(...);

std::size_t match(0);
for (auto it = a.begin(), end = a.end(); it != end; ) {
    it = std::find_if(it, end, conditionA);
    if (it != end) {
        if (conditionB(b[it - a.begin()])
            && conditionC(c[it - a.begin()])
            && conditionD(d[it - a.begin()])) {
            ++match;
        }
        ++it;
    }
}
What I got from the description: you have 4 vectors and lots of user data, and you need to find out how many times the user data matches the vectors at the same index.
So here goes the code (I am writing C++ for GCC 4.3.2):
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main() {
    vector<typeT> a;
    vector<typeT> b;
    vector<typeT> c;
    vector<typeT> d;
    vector<typeT> matched;
    /* I am assuming you have initialized a, b, c and d;
       now we do a pre-calculation for matching the user data and store
       the result in the vector 'matched' */
    size_t minsize = min({a.size(), b.size(), c.size(), d.size()});
    for (size_t i = 0; i < minsize; i++) {
        if (a[i] == b[i] && b[i] == c[i] && c[i] == d[i])
            matched.push_back(a[i]);
    }
    return 0;
}
That was the precalculation part. What comes next depends on the data type you are using: either sort matched and use binary search with a little extra counting, or use a better data structure that stores pairs of (value, occurrence count) and then apply binary search on it.
The time complexity will be O(n + n*log(n) + m*log(n)), where n is minsize in the code and m is the number of user inputs.
Honestly, I would have a couple of methods to maintain your database (vectors).
Essentially, do a quicksort to start out with.
Then every so often run an insertion sort (faster than quicksort for partially sorted lists).
Then just run binary search on those vectors.
edit:
I think a better way to store this, instead of using multiple vectors per entry, is to have one vector of a class that stores all the values (your current vectors):
class entry {
public:
    variable data1;
    variable data2;
    variable data3;
    variable data4;
};
Make this into a single vector. Then use the method I described above to sort these vectors.
You will have to sort by the type of data first, and then call binary search on that data.
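A minimal sketch of that layout (the int fields are my own placeholders), sorting on one field and then binary-searching it:

#include <algorithm>
#include <vector>

struct entry {
    int data1, data2, data3, data4; // placeholders for the four columns
};

int main() {
    std::vector<entry> entries = {{3, 1, 4, 1}, {5, 9, 2, 6}, {2, 7, 1, 8}};

    // Sort by the field that will be searched...
    auto byData1 = [](const entry& a, const entry& b) { return a.data1 < b.data1; };
    std::sort(entries.begin(), entries.end(), byData1);

    // ...then binary-search that field for the user's value.
    int wanted = 3;
    bool found = std::binary_search(entries.begin(), entries.end(),
                                    entry{wanted, 0, 0, 0}, byData1);
    (void)found; // true here: an entry with data1 == 3 exists
}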
You can create a lookup table for the vector with std::unordered_multimap in O(n). Then you can use unordered_multimap::count() to get the number of times the item appears in the vector and unordered_multimap::equal_range() to get the indices of the items inside your vector.
std::vector<std::string> a = {"ab", "ba", "ca", "ab", "bc", "ba"};
std::vector<std::string> b = {"fg", "fg", "ba", "eg", "gf", "ge"};
std::vector<std::string> c = {"pq", "qa", "ba", "fg", "de", "gf"};

std::unordered_multimap<std::string, int> lookup_table;
for (int i = 0; i < a.size(); i++) {
    lookup_table.insert(std::make_pair(a[i], i));
    lookup_table.insert(std::make_pair(b[i], i));
    lookup_table.insert(std::make_pair(c[i], i));
}

// count
std::string userinput;
std::cin >> userinput;
int count = lookup_table.count(userinput);
std::cout << userinput << " shows up " << count << " times" << std::endl;

// print all the places where the key shows up
auto range = lookup_table.equal_range(userinput);
for (auto it = range.first; it != range.second; it++) {
    int ind = it->second;
    std::cout << " " << it->second << " "
              << a[ind] << " "
              << b[ind] << " "
              << c[ind] << std::endl;
}
This will be the most efficient if you will be searching the lookup table for many items. If you only need to search one time, then Dietmar Kühl's approach would be the most efficient.