I have 4 vectors with about 45,000 records each right now. I'm looking for an efficient method to run through these 4 vectors and output how many times they match the user's input. The data needs to match at the same index in each vector.
Multiple for loops? Vector find?
Thanks!
If the elements need to match at the same location, then std::find() or std::find_if() combined with a check of the other vectors at that position seems a reasonable approach:
std::vector<A> a(...);
std::vector<B> b(...);
std::vector<C> c(...);
std::vector<D> d(...);

std::size_t match(0);
for (auto it = a.begin(), end = a.end(); it != end; ) {
    it = std::find_if(it, end, conditionA);
    if (it != end) {
        if (conditionB[it - a.begin()]
            && conditionC[it - a.begin()]
            && conditionD[it - a.begin()]) {
            ++match;
        }
        ++it;
    }
}
What I got from the description is that you have 4 vectors and lots of user data, and you need to find out how many times the data matches the vectors at the same index.

So here goes the code (I am writing for GCC 4.3.2):
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main() {
    vector<typeT> a;
    vector<typeT> b;
    vector<typeT> c;
    vector<typeT> d;
    vector<typeT> matched;
    /* I am assuming you have initialized a, b, c and d;
       now we do the pre-calculation for matching user data and store
       the result in the vector `matched` */
    size_t minsize = min(min(a.size(), b.size()), min(c.size(), d.size()));
    for (size_t i = 0; i < minsize; i++) {
        if (a[i] == b[i] && b[i] == c[i] && c[i] == d[i])
            matched.push_back(a[i]);
    }
    return 0;
}
That was the pre-calculation part. What comes next depends on the data type you are using: either binary search with a little bit of extra counting, or a better data structure that stores pairs of (value, recurrence), with binary search applied to that.

The time complexity will be O(n + n*log(n) + m*log(n)), where n is minsize in the code and m is the number of user inputs.
Honestly, I would have a couple of methods to maintain your database (the vectors).

Essentially, do a QuickSort to start out with.

Then every so often run an insertion sort (faster than QuickSort for partially sorted lists).

Then just run binary search on those vectors.
edit:
I think a better way to store this is a single vector of a class that holds all the values (your current vectors), instead of multiple vectors per entry:
class entry {
public:
    variable data1;
    variable data2;
    variable data3;
    variable data4;
};
Make this into a single vector, then use the method I described above to sort through it. You will have to sort by whatever type of data is searched first, then call binary search on that data.
You can create a lookup table for the vector with std::unordered_multimap in O(n). Then you can use unordered_multimap::count() to get the number of times the item appears in the vector and unordered_multimap::equal_range() to get the indices of the items inside your vector.
std::vector<std::string> a = {"ab", "ba", "ca", "ab", "bc", "ba"};
std::vector<std::string> b = {"fg", "fg", "ba", "eg", "gf", "ge"};
std::vector<std::string> c = {"pq", "qa", "ba", "fg", "de", "gf"};

std::unordered_multimap<std::string, int> lookup_table;
for (std::size_t i = 0; i < a.size(); i++) {
    lookup_table.insert(std::make_pair(a[i], i));
    lookup_table.insert(std::make_pair(b[i], i));
    lookup_table.insert(std::make_pair(c[i], i));
}

// count
std::string userinput;
std::cin >> userinput;
int count = lookup_table.count(userinput);
std::cout << userinput << " shows up " << count << " times" << std::endl;

// print all the places where the key shows up
auto range = lookup_table.equal_range(userinput);
for (auto it = range.first; it != range.second; it++) {
    int ind = it->second;
    std::cout << " " << it->second << " "
              << a[ind] << " "
              << b[ind] << " "
              << c[ind] << std::endl;
}
This will be the most efficient if you will be searching the lookup table many times. If you only need to search once, then Dietmar Kühl's approach would be most efficient.
I'm trying to set the values of an arma::mat element-wise and the value of each element depends on the multi-index (row, column) of each element.
Is there a way to retrieve the current location of an element during iteration?
Basically, I'd like to be able to do something like in the sparse matrix iterator, where it.col() and it.row() allow to retrieve the current element's location.
For illustration, the example given in the arma::sp_mat iterator documentation is:
sp_mat X = sprandu<sp_mat>(1000, 2000, 0.1);

sp_mat::const_iterator it     = X.begin();
sp_mat::const_iterator it_end = X.end();

for (; it != it_end; ++it) {
    cout << "val: " << (*it) << endl;
    cout << "row: " << it.row() << endl; // only available for arma::sp_mat, not arma::mat
    cout << "col: " << it.col() << endl; // only available for arma::sp_mat, not arma::mat
}
Of course there are a number of workarounds to get element locations during arma::mat iteration, the most straightforward ones perhaps being:
Use nested for-loops over the row and column sizes.
Use a single for loop and, using the matrix size, transform the iteration number to a row and column index.
Some form of "zipped" iteration with an object containing or computing the corresponding indices.
However, these seem rather hacky and error-prone to me, because they require working with the matrix sizes or even doing manual index juggling.
I'm looking for a cleaner (and perhaps internally optimised) solution. It feels to me like there should be a way to achieve this ...
Apart from the solution used for arma::sp_mat, other such "nice" solution for me would be using .imbue or .for_each but with a functor that accepts not only the element's current value but also its location as additional argument; this doesn't seem to be possible currently.
Looking at the armadillo source code, row_col_iterator provides row and column indices of each element. This works like the sparse matrix iterator, but doesn't skip zeros. Adapting your code:
mat X(10, 10, fill::randu);

mat::const_row_col_iterator it     = X.begin_row_col();
mat::const_row_col_iterator it_end = X.end_row_col();

for (; it != it_end; ++it) {
    cout << "val: " << (*it) << endl;
    cout << "row: " << it.row() << endl;
    cout << "col: " << it.col() << endl;
}
You seem to have already answered your question yourself. I wish armadillo provided a .imbue overload that received a functor with as many arguments as the dimensions of the armadillo object, but currently it only accepts a functor without arguments. The simplest option, then (in my opinion), is a lambda capturing the necessary information, as in the code below:
arma::umat m(3, 3);
{
    int i = 0;
    m.imbue([&i, num_rows = m.n_rows, num_cols = m.n_cols]() {
        arma::uvec sub = arma::ind2sub(arma::SizeMat{num_rows, num_cols}, i++);
        return 10 * (sub[0] + 1) + sub[1];
    });
}
In this example each element is computed as 10 times (its row index plus 1) plus its column index. I capture i here as the linear index, and put it together with the imbue call inside curly braces to limit its scope.
I also wish I could write something like auto [row_idx, col_idx] = arma::ind2sub( ... ), but unfortunately what ind2sub returns does not work with structured binding.
You can also capture m by const reference and use arma::size(m) as the first argument of arma::ind2sub, if you prefer.
Okay, so I am trying to solve the 2-SUM problem in C++. Given a file of 1,000,000 numbers in arbitrary order, I need to determine whether there exist pairs of integers whose sum is t, where t is each of [-10000, 10000]. So this is basically the 2-SUM problem.

I coded up my solution in C++ using unordered_map as my hash table, and I am ensuring a low load on the hash table. But it still takes around 1 hr 15 min to finish (successfully). Now I am wondering whether it should be that slow; further reducing the load factor did not give any considerable performance boost.

I have no idea where I can optimise the code. I tried different load factors, and it doesn't help. This is a question from a MOOC, and people have been able to get this done in around 30 min using the same hash-table approach. Can anybody help me make this code faster, or at least give a hint as to where the code might be slowing down?
Here is the code -
#include <iostream>
#include <unordered_map>
#include <fstream>
int main(int argc, char *argv[]) {
    if (argc != 2) {
        std::cerr << "Usage: ./2sum <filename>" << std::endl;
        exit(1);
    }

    std::ifstream input(argv[1]);
    std::ofstream output("log.txt");

    std::unordered_map<long, int> data_map;
    data_map.max_load_factor(0.05);

    long tmp;
    while (input >> tmp) {
        data_map[tmp] += 1;
    }
    std::cerr << "input done!" << std::endl;
    std::cerr << "load factor " << data_map.load_factor() << std::endl;

    // debug print
    for (auto iter = data_map.begin(); iter != data_map.end(); ++iter) {
        output << iter->first << " " << iter->second << std::endl;
    }
    std::cerr << "debug print done!" << std::endl;

    // solve
    long ans = 0;
    for (long i = -10000; i <= 10000; ++i) {
        // try to find a pair whose sum = i
        // debug print
        if (i % 100 == 0)
            std::cerr << i << std::endl;
        for (auto iter = data_map.begin(); iter != data_map.end(); ++iter) {
            long x = iter->first;
            long y = i - x;
            if (x == y)
                continue;
            auto search_y = data_map.find(y);
            if (search_y != data_map.end()) {
                ++ans;
                break;
            }
        }
    }
    std::cout << ans << std::endl;
    return 0;
}
On a uniform set with all sums equally probable, the code below will finish in seconds. Otherwise, for any missing sums, my laptop takes about 0.75 s to check each one.

The solution has a minor improvement in comparison with the OP's code: it checks for duplicate values and eliminates them.

Then it runs a Monte Carlo heuristic: for a fraction of the total numbers (10% in the code, capped at 75,000 picks), it randomly picks one from the set and searches for all the sums in the [minSum, maxSum] range that can be made with the picked number as one term. This pre-populates the sums set with... say... 'sums that can be found trivially'. In my tests, using 1M numbers generated randomly between -10M and 10M, this single step is all that is needed and takes a couple of seconds.

For pathological number distributions, in which some of the sum values are missing (or have not been found by the random heuristic), the second part runs a targeted exhaustive search over the not-yet-found sum values, very much along the same lines as the solution in the OP.

Extra explanation of the random/Monte Carlo heuristic (to address @AneeshDandime's comment):

Though i do not fully understand it at the moment

Well, it's simple. Think of it like this: the naive approach is to take all the input values, add them in pairs, and retain only the sums in [-10k, 10k]. It is, however, terribly expensive (O(N^2)). An immediate refinement: pick a value v0, then determine which other values v1 stand a chance of giving a sum in the [-10k, 10k] range. If the input values are sorted, it's easier: you only need to select the v1-s in [-10k - v0, 10k - v0]. A good improvement, but if you keep this as the only approach, an exhaustive search is still O(N * log2(N) * |[-10k, 10k]|).

However, this approach still has its value: if the input values are uniformly distributed, it will quickly populate the known-sums set with the most common values (and spend the rest of the time trying to find the infrequent or missing sum values).

To capitalize on it, instead of using it until the end, one can run a limited number of steps and hope to populate the majority of the sums. After that, we can switch focus and enter the 'targeted search for sum values', but only for the sum values not found in this step.

[Edited: previous bug corrected. The algorithm is now stable with respect to values that occur multiple times or only once in the input.]
#include <algorithm>
#include <iostream>
#include <random>
#include <unordered_map>
#include <unordered_set>
#include <vector>

int main() {
    typedef long long value_type;

    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++
    // substitute this with your input sequence from the file
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<value_type> initRnd(-5500, 10000000);
    std::vector<value_type> sorted_vals;
    for (std::size_t i = 0; i < 1000000; i++) {
        sorted_vals.push_back(initRnd(gen));
    }
    std::cout << "Initialization end" << std::endl;
    // end of input
    // +++++++++++++++++++++++++++++++++++++++++++++++++++++++

    // use some constants instead of magic values
    const value_type sumMin = -10000, sumMax = 10000;

    // Mapping val -> number of occurrences
    std::unordered_map<value_type, size_t> hashed_vals;
    for (auto val : sorted_vals) {
        ++hashed_vals[val];
    }

    // retain only the unique values and sort them
    sorted_vals.clear();
    for (auto val = hashed_vals.begin(); val != hashed_vals.end(); ++val) {
        sorted_vals.push_back(val->first);
    }
    std::sort(sorted_vals.begin(), sorted_vals.end());

    // Store the encountered sums here
    std::unordered_set<int> sums;

    // A limited number of iterations (10% of the unique values, capped at
    // 75,000), looking at random for pairs of numbers whose sum falls in the
    // [-10000, 10000] range, and collecting those sums.
    // We use the sorted vector of unique values for this purpose.
    // If we are lucky, most of the sums (if not all) will already be filled in.
    std::uniform_int_distribution<size_t> rndPick(0, sorted_vals.size() - 1);
    size_t numRandomPicks = size_t(sorted_vals.size() * 0.1);
    if (numRandomPicks > 75000) {
        numRandomPicks = 75000;
    }
    for (size_t i = 0; i < numRandomPicks; i++) {
        // pick a value index at random
        size_t randomIx = rndPick(gen);
        value_type val = sorted_vals[randomIx];
        // now search for the values between sumMin-val and sumMax-val
        auto low = std::lower_bound(sorted_vals.begin(), sorted_vals.end(), sumMin - val);
        if (low == sorted_vals.end()) {
            continue;
        }
        auto high = std::upper_bound(sorted_vals.begin(), sorted_vals.end(), sumMax - val);
        if (high == sorted_vals.begin()) {
            continue;
        }
        for (auto rangeIt = low; rangeIt != high; rangeIt++) {
            if (*rangeIt != val || hashed_vals[val] > 1) {
                // either not the same as the randomly picked value,
                // or the same but that value occurred more than once in input
                sums.insert(int(val + *rangeIt));
            }
        }
        if (sums.size() == size_t(sumMax - sumMin + 1)) {
            // lucky us, we found them all
            break;
        }
    }

    // after which, if some sums are not present, we search for them specifically
    if (sums.size() != size_t(sumMax - sumMin + 1)) {
        std::cout << "Number of sums still missing: "
                  << size_t(sumMax - sumMin + 1) - sums.size()
                  << std::endl;
        for (int sum = sumMin; sum <= sumMax; sum++) {
            if (sums.find(sum) == sums.end()) {
                std::cout << "looking for sum: " << sum;
                // We couldn't find this sum via random picks, so search for it,
                // using the hash map this time to look up the other term.
                bool found = false;
                for (auto i = sorted_vals.begin(); !found && i != sorted_vals.end(); ++i) {
                    value_type v = *i;
                    value_type other_val = sum - v;
                    if ( // either two unequal terms to be summed, or...
                        (other_val != v || hashed_vals[v] > 1) // ...the value occurred more than once
                        && hashed_vals.find(other_val) != hashed_vals.end() // and the other term exists
                    ) {
                        // found. Record it as such and break
                        sums.insert(sum);
                        found = true;
                    }
                }
                std::cout << (found ? " found" : " not found") << std::endl;
            }
        }
    }
    std::cout << "Total number of distinct sums found: " << sums.size() << std::endl;
}
You can reserve the space for the unordered map up front. It should increase performance a bit.
What about sorting the array first, and then, for each element in the array, using binary search to find the number that brings the sum closest to -10000, then walking right until the sum exceeds +10000?

This way you will avoid going through the array 20000 times.
Stuck on a very interesting problem. You might have done this before in C/C++:
map<string, string> dict;
dsz = dict.size();
vector<string> words;
int sz = words.size();

for (int i = 0; i < sz; ++i)
{
    for (int j = i + 1; j < dsz; ++j)
    {
    }
}
How will I achieve the same thing using iterators? Please suggest.

OK, I figured it out. More precisely, I wanted both i and j in the inner loop. Here is what I did with iterators; sorry, I had to move to multimap instead of map due to a change in requirements.
vector<string>::iterator vit;
multimap<string, string>::iterator top = dict.begin();
multimap<string, string>::iterator mit;

for (vit = words.begin(); vit != words.end(); ++vit)
{
    string s = *vit;
    ++top;
    mit = top;
    for (; mit != dict.end(); ++mit)
    {
        /* compare the value with the other values in the dictionary; if found, print their keys */
        if (dict.find(s)->second == mit->second)
            cout << s << " = " << mit->first << endl;
    }
}
Any more efficient way to do this would be appreciated.
Your final intent is not fully clear, because you start the j loop at i+1 (see the remarks at the end). Until you give clarity on this relationship, I propose two interim solutions.

Approach 1: easy and elegant:

Use the new C++11 range-based for(). It makes use of an iterator starting at begin() and going until end(), without you having to deal with the iterator yourself:
for (auto x : words) {       // loop on words (size sz)
    for (auto y : dict) {    // loop on dict (size dsz)
        // do something with x and y, for example:
        if (x == y.first)
            cout << "word " << x << " matches dictionary entry " << y.second << endl;
    }
}
Approach 2: traditional use of iterators

You can also specify explicitly the iterators to be used. This is a little more wordy than the previous example, but it allows you to choose the most suitable iterator, for example if you want a constant iterator like cbegin() instead of begin(), if you want to skip some elements, or if you want to use an adapter on the iterator, such as reverse_iterator, etc.:
for (auto itw = words.begin(); itw != words.end(); itw++) {
    for (auto itd = dict.begin(); itd != dict.end(); itd++) {
        // do something with *itw and *itd, for example:
        if (*itw == itd->first)
            cout << "word " << *itw << " matches dictionary entry " << itd->second << endl;
    }
}
Remarks:
Starting the inner loop at j = i+1 makes sense only if the elements of the words vector are related to the elements in the dict map (OK, they are certainly words as well), AND if the order of the elements you access in the map is related to the order in the vector. As a map is ordered according to its key, this would make sense only if words were ordered as well, following the same key. Is that the case?

If you still want to skip elements or make calculations based on the distance between elements, you should consider the second approach proposed above. It makes it easier to use distance(words.begin(), itw), which is the equivalent of i.

However, it is best to use containers by taking advantage of their design. So instead of iterating through a dictionary map to find a word entry, it is better to use the map as follows:
for (auto x : words) {   // loop on words (size sz)
    if (dict.count(x))   // if x is in the dictionary
        cout << "word " << x << " matches dictionary entry " << dict[x] << endl;
}
I have a function that randomly deletes elements from a map called bunnies (which contains class objects) and from a list called names (which contains the keys to the map) once the number of elements exceeds 250. However, sometimes the map element is deleted but the list entry is not (I think this is what is going on, though clearly part of the map element survives). The result is that when I use the second section of code to iterate through the list and display the mapped values for those keys, I get large negative values like the example at the bottom.

Clearly the list element isn't being deleted, but why?
void cull(std::map<std::string, Bunny> &bunnies, std::list<std::string> &names, int n)
{
    int number = n, position = 0;
    for (number = n; number > 125; number--)
    {
        position = rand() % names.size();
        std::list<std::string>::iterator it = names.begin();
        std::advance(it, position);
        bunnies.erase(*it);
        names.erase(it);
        it = names.begin();
    }
    std::cout << "\n" << n - 125 << " rabbits culled";
}
I use this code to print out the map values.
for (std::list<std::string>::iterator it = names.begin(); it != names.end(); it++)
{
    n++;
    std::cout << n << "\t" << " " << *it << "\t" << bunnies[*it].a() << "\t" << bunnies[*it].s() << "\t" << bunnies[*it].c() << "\t" << bunnies[*it].st() << "\n";
}
This is the output. The top is what it should display, the bottom is what happens when the program fails.
165 Tom_n 14 1 0 1
166 Lin_c -842150451 -842150451 -842150451 -842150451
The problem seems to be this:
std::advance(it, number);
This should be position, not number.
The other problem is that a map stores unique names. What if there is more than one Bunny with the same name? For example, if the list has 3 bunnies names "John", the map will be able to hold only one "John", since the key in a map must be unique.
Either use a multimap if names can be duplicated, or use a std::set instead of a std::list if Bunnies must have unique names.
Maybe overall, you can just use std::map<std::string, Bunny>, and forget about the std::list. The map by itself has all the information you need. Unless there is something I'm missing, I don't see the need for a std::list to do redundant work.
For example:

1 1 1 1 2 3 3 4 5 5 5

1 repeated 3 times,
3 repeated 1 time,
5 repeated 2 times

Here's the code, but it has some problems:
int i, k, m, number, number_prev, e;

cout << "Insert how many numbers: ";
cin >> m;
cout << "insert number";
cin >> number;
number_prev = number;

int num_rep[m];  // array of repeated numbers
int cant_rep[m]; // array of corresponding repetition counts
e = 0;

for (i = 1; i < m; i++)
{
    cin >> number;
    if (number == number_prev)
    {
        if (number == num_rep[e-1])
            cant_rep[e-1]++;
        else
        {
            num_rep[e] = number;
            cant_rep[e] = e + 1;
            e++;
        }
    }
    else
        e = 0;
    number_prev = number;
}

for (k = 0; k < e; k++)
    cout << "\nnumber " << num_rep[k] << " repeated " << cant_rep[k] << " times.\n";
You should learn algorithms and data structures; they make your code simpler. Just using an associative container that saves pairs of

a number --> how many times it repeats

can simplify your program significantly:
int main()
{
    std::map<int, int> map;
    int v;
    while (std::cin >> v) {
        map[v]++;
    }
    for (auto it = map.cbegin(); it != map.cend(); ++it) {
        if (it->second > 1) {
            std::cout << it->first << " repeats " << it->second - 1 << " times\n";
        }
    }
}
std::map is an associative container.
You can think about it as a key-->value storage with unique keys.
The real-world example is a dictionary: there you have a word and its definition. The word is the key and the definition is the value.
std::map<int, int> map;
         ^^^  ^^^
          |    |
   key type    value type
You can refer to values using the [] operator. This works like a usual array, except that instead of an index you use your key.
You can also examine all key-value pairs, storied in the map using iterators.
it = map.cbegin(); // refers to the first key-value pair in the map
++it; // moves to the next key-value pair
it != map.cend(); // checks, if we at the end of map, so examined all elements already
As I pointed out, a map saves key-value pairs.
In the Standard C++ library, the struct std::pair is used to express a pair.
It has first and second members that represent the first and second values stored in the pair.
In the case of map, first is a key, and second is a value.
Again, we are storing a number as a key and how many times it repeats in a value.
Then, we read user input and increase value for the given key.
After that, we just examine all elements stored in a map and print them.
int num_rep[m]; //array of repeated numbers
int cant_rep[m]; // array of correspondent number of repetitions
Here, m is only known at run time, but array sizes must be known at compile time. Use std::vector instead.
The code reads like a C-style C++ program:

1. You don't need to declare variables at the beginning of a block. Declare them right before use; it's more readable.
2. Using STL types like std::vector can save you a lot of trouble in programs like this.
3. You say "insert m numbers", but for (i=1; i<m; i++) will loop m-1 times, which might not be what you want.
4. As a supplementary piece of advice, you should validate input coming from the external world, like cin >> m, since it can be zero or negative.